I'm currently in the process of learning C++, and because I'm still learning, I keep making mistakes.
With a language as permissive as C++, it often takes a long time to figure out exactly what's wrong — because the compiler lets me get away with a lot. I realize that this flexibility is one of C++'s major strengths, but it makes it difficult to learn the basic language.
Is there some tool I can use to analyze my code and make suggestions based on best practices or just sensible coding? Preferably as an Eclipse plugin or linux application.
C++ code analysis tools
I'm currently in the process of learning C++, and because I'm still learning, I keep making mistakes.
If your goal is to use a profiler, use one of the suggested ones.
However, if you're in a hurry and you can manually interrupt your program under the debugger while it's being subjectively slow, there's a simple way to find performance problems.
Just halt it several times, and each time look at the call stack. If there is some code that is wasting some percentage of the time, 20% or 50% or whatever, that is the probability that you will catch it in the act on each sample. So, that is roughly the percentage of samples on which you will see it. There is no educated guesswork required. If you do have a guess as to what the problem is, this will prove or disprove it.
You may have multiple performance problems of different sizes. If you clean out any one of them, the remaining ones will take a larger percentage, and be easier to spot, on subsequent passes. This magnification effect, when compounded over multiple problems, can lead to truly massive speedup factors.
Caveat: Programmers tend to be skeptical of this technique unless they've used it themselves. They will say that profilers give you this information, but that is only true if they sample the entire call stack, and then let you examine a random set of samples. (The summaries are where the insight is lost.) Call graphs don't give you the same information, because
- They don't summarize at the instruction level, and
- They give confusing summaries in the presence of recursion.
They will also say it only works on toy programs, when actually it works on any program, and it seems to work better on bigger programs, because they tend to have more problems to find. They will say it sometimes finds things that aren't problems, but that is only true if you see something once. If you see a problem on more than one sample, it is real.
P.S. This can also be done on multi-thread programs if there is a way to collect call-stack samples of the thread pool at a point in time, as there is in Java.
P.P.S As a rough generality, the more layers of abstraction you have in your software, the more likely you are to find that that is the cause of performance problems (and the opportunity to get speedup).
Added: It might not be obvious, but the stack sampling technique works equally well in the presence of recursion. The reason is that the time that would be saved by removal of an instruction is approximated by the fraction of samples containing it, regardless of the number of times it may occur within a sample.
Another objection I often hear is: "It will stop someplace random, and it will miss the real problem". This comes from having a prior concept of what the real problem is. A key property of performance problems is that they defy expectations. Sampling tells you something is a problem, and your first reaction is disbelief. That is natural, but you can be sure if it finds a problem it is real, and vice-versa.
Added: Let me make a Bayesian explanation of how it works. Suppose there is some instruction
I (call or otherwise) which is on the call stack some fraction
f of the time (and thus costs that much). For simplicity, suppose we don't know what
f is, but assume it is either 0.1, 0.2, 0.3, ... 0.9, 1.0, and the prior probability of each of these possibilities is 0.1, so all of these costs are equally likely a-priori.
Then suppose we take just 2 stack samples, and we see instruction
I on both samples, designated observation
o=2/2. This gives us new estimates of the frequency
I, according to this:
Prior P(f=x) x P(o=2/2|f=x) P(o=2/2&&f=x) P(o=2/2&&f >= x) P(f >= x | o=2/2) 0.1 1 1 0.1 0.1 0.25974026 0.1 0.9 0.81 0.081 0.181 0.47012987 0.1 0.8 0.64 0.064 0.245 0.636363636 0.1 0.7 0.49 0.049 0.294 0.763636364 0.1 0.6 0.36 0.036 0.33 0.857142857 0.1 0.5 0.25 0.025 0.355 0.922077922 0.1 0.4 0.16 0.016 0.371 0.963636364 0.1 0.3 0.09 0.009 0.38 0.987012987 0.1 0.2 0.04 0.004 0.384 0.997402597 0.1 0.1 0.01 0.001 0.385 1 P(o=2/2) 0.385
The last column says that, for example, the probability that
f >= 0.5 is 92%, up from the prior assumption of 60%.
Suppose the prior assumptions are different. Suppose we assume
P(f=0.1) is .991 (nearly certain), and all the other possibilities are almost impossible (0.001). In other words, our prior certainty is that
I is cheap. Then we get:
Prior P(f=x) x P(o=2/2|f=x) P(o=2/2&& f=x) P(o=2/2&&f >= x) P(f >= x | o=2/2) 0.001 1 1 0.001 0.001 0.072727273 0.001 0.9 0.81 0.00081 0.00181 0.131636364 0.001 0.8 0.64 0.00064 0.00245 0.178181818 0.001 0.7 0.49 0.00049 0.00294 0.213818182 0.001 0.6 0.36 0.00036 0.0033 0.24 0.001 0.5 0.25 0.00025 0.00355 0.258181818 0.001 0.4 0.16 0.00016 0.00371 0.269818182 0.001 0.3 0.09 0.00009 0.0038 0.276363636 0.001 0.2 0.04 0.00004 0.00384 0.279272727 0.991 0.1 0.01 0.00991 0.01375 1 P(o=2/2) 0.01375
Now it says
P(f >= 0.5) is 26%, up from the prior assumption of 0.6%. So Bayes allows us to update our estimate of the probable cost of
I. If the amount of data is small, it doesn't tell us accurately what the cost is, only that it is big enough to be worth fixing.
Yet another way to look at it is called the Rule Of Succession.
If you flip a coin 2 times, and it comes up heads both times, what does that tell you about the probable weighting of the coin?
The respected way to answer is to say that it's a Beta distribution, with average value
(number of hits + 1) / (number of tries + 2) = (2+1)/(2+2) = 75%.
(The key is that we see
I more than once. If we only see it once, that doesn't tell us much except that
f > 0.)
So, even a very small number of samples can tell us a lot about the cost of instructions that it sees. (And it will see them with a frequency, on average, proportional to their cost. If
n samples are taken, and
f is the cost, then
I will appear on
nf+/-sqrt(nf(1-f)) samples. Example,
f=0.3, that is
Added: To give an intuitive feel for the difference between measuring and random stack sampling:
There are profilers now that sample the stack, even on wall-clock time, but what comes out is measurements (or hot path, or hot spot, from which a "bottleneck" can easily hide). What they don't show you (and they easily could) is the actual samples themselves. And if your goal is to find the bottleneck, the number of them you need to see is, on average, 2 divided by the fraction of time it takes. So if it takes 30% of time, 2/.3 = 6.7 samples, on average, will show it, and the chance that 20 samples will show it is 99.2%.
Here is an off-the-cuff illustration of the difference between examining measurements and examining stack samples. The bottleneck could be one big blob like this, or numerous small ones, it makes no difference.
Measurement is horizontal; it tells you what fraction of time specific routines take. Sampling is vertical. If there is any way to avoid what the whole program is doing at that moment, and if you see it on a second sample, you've found the bottleneck. That's what makes the difference - seeing the whole reason for the time being spent, not just how much.
On modern computers, only the lowest level memory structures (the registers) can move data around in single clock cycles. However, registers are very expensive and most computer cores have less than a few dozen registers. At the other end of the memory spectrum (DRAM), the memory is very cheap (i.e. literally millions of times cheaper) but takes hundreds of cycles after a request to receive the data. To bridge this gap between super fast and expensive and super slow and cheap are the cache memories, named L1, L2, L3 in decreasing speed and cost. The idea is that most of the executing code will be hitting a small set of variables often, and the rest (a much larger set of variables) infrequently. If the processor can't find the data in L1 cache, then it looks in L2 cache. If not there, then L3 cache, and if not there, main memory. Each of these "misses" is expensive in time.
(The analogy is cache memory is to system memory, as system memory is to hard disk storage. Hard disk storage is super cheap but very slow).
Caching is one of the main methods to reduce the impact of latency. To paraphrase Herb Sutter (cfr. links below): increasing bandwidth is easy, but we can't buy our way out of latency.
Data is always retrieved through the memory hierarchy (smallest == fastest to slowest). A cache hit/miss usually refers to a hit/miss in the highest level of cache in the CPU -- by highest level I mean the largest == slowest. The cache hit rate is crucial for performance since every cache miss results in fetching data from RAM (or worse ...) which takes a lot of time (hundreds of cycles for RAM, tens of millions of cycles for HDD). In comparison, reading data from the (highest level) cache typically takes only a handful of cycles.
In modern computer architectures, the performance bottleneck is leaving the CPU die (e.g. accessing RAM or higher). This will only get worse over time. The increase in processor frequency is currently no longer relevant to increase performance. The problem is memory access. Hardware design efforts in CPUs therefore currently focus heavily on optimizing caches, prefetching, pipelines and concurrency. For instance, modern CPUs spend around 85% of die on caches and up to 99% for storing/moving data!
There is quite a lot to be said on the subject. Here are a few great references about caches, memory hierarchies and proper programming:
- Agner Fog's page. In his excellent documents, you can find detailed examples covering languages ranging from assembly to C++.
- If you are into videos, I strongly recommend to have a look at Herb Sutter's talk on machine architecture (youtube) (specifically check 12:00 and onwards!).
- Slides about memory optimization by Christer Ericson (director of technology @ Sony)
- LWN.net's article "What every programmer should know about memory"
Main concepts for cache-friendly code
A very important aspect of cache-friendly code is all about the principle of locality, the goal of which is to place related data close in memory to allow efficient caching. In terms of the CPU cache, it's important to be aware of cache lines to understand how this works: How do cache lines work?
The following particular aspects are of high importance to optimize caching:
- Temporal locality: when a given memory location was accessed, it is likely that the same location is accessed again in the near future. Ideally, this information will still be cached at that point.
- Spatial locality: this refers to placing related data close to each other. Caching happens on many levels, not just in the CPU. For example, when you read from RAM, typically a larger chunk of memory is fetched than what was specifically asked for because very often the program will require that data soon. HDD caches follow the same line of thought. Specifically for CPU caches, the notion of cache lines is important.
Use appropriate c++ containers
A simple example of cache-friendly versus cache-unfriendly is c++'s
std::list. Elements of a
std::vector are stored in contiguous memory, and as such accessing them is much more cache-friendly than accessing elements in a
std::list, which stores its content all over the place. This is due to spatial locality.
A very nice illustration of this is given by Bjarne Stroustrup in this youtube clip (thanks to @Mohammad Ali Baydoun for the link!).
Don't neglect the cache in data structure and algorithm design
Whenever possible, try to adapt your data structures and order of computations in a way that allows maximum use of the cache. A common technique in this regard is cache blocking (Archive.org version), which is of extreme importance in high-performance computing (cfr. for example ATLAS).
Know and exploit the implicit structure of data
Another simple example, which many people in the field sometimes forget is column-major (ex. fortran,matlab) vs. row-major ordering (ex. c,c++) for storing two dimensional arrays. For example, consider the following matrix:
1 2 3 4
In row-major ordering, this is stored in memory as
1 2 3 4; in column-major ordering, this would be stored as
1 3 2 4. It is easy to see that implementations which do not exploit this ordering will quickly run into (easily avoidable!) cache issues. Unfortunately, I see stuff like this very often in my domain (machine learning). @MatteoItalia showed this example in more detail in his answer.
When fetching a certain element of a matrix from memory, elements near it will be fetched as well and stored in a cache line. If the ordering is exploited, this will result in fewer memory accesses (because the next few values which are needed for subsequent computations are already in a cache line).
For simplicity, assume the cache comprises a single cache line which can contain 2 matrix elements and that when a given element is fetched from memory, the next one is too. Say we want to take the sum over all elements in the example 2x2 matrix above (lets call it
Exploiting the ordering (e.g. changing column index first in c++):
M (memory) + M (cached) + M (memory) + M (cached) = 1 + 2 + 3 + 4 --> 2 cache hits, 2 memory accesses
Not exploiting the ordering (e.g. changing row index first in c++):
M (memory) + M (memory) + M (memory) + M (memory) = 1 + 3 + 2 + 4 --> 0 cache hits, 4 memory accesses
In this simple example, exploiting the ordering approximately doubles execution speed (since memory access requires much more cycles than computing the sums). In practice, the performance difference can be much larger.
Avoid unpredictable branches
Modern architectures feature pipelines and compilers are becoming very good at reordering code to minimize delays due to memory access. When your critical code contains (unpredictable) branches, it is hard or impossible to prefetch data. This will indirectly lead to more cache misses.
This is explained very well here (thanks to @0x90 for the link): Why is processing a sorted array faster than processing an unsorted array?
Avoid virtual functions
In the context of c++,
virtual methods represent a controversial issue with regard to cache misses (a general consensus exists that they should be avoided when possible in terms of performance). Virtual functions can induce cache misses during look up, but this only happens if the specific function is not called often (otherwise it would likely be cached), so this is regarded as a non-issue by some. For reference about this issue, check out: What is the performance cost of having a virtual method in a C++ class?
A common problem in modern architectures with multiprocessor caches is called false sharing. This occurs when each individual processor is attempting to use data in another memory region and attempts to store it in the same cache line. This causes the cache line -- which contains data another processor can use -- to be overwritten again and again. Effectively, different threads make each other wait by inducing cache misses in this situation. See also (thanks to @Matt for the link): How and when to align to cache line size?
An extreme symptom of poor caching in RAM memory (which is probably not what you mean in this context) is so-called thrashing. This occurs when the process continuously generates page faults (e.g. accesses memory which is not in the current page) which require disk access.
Enable maximum compiler warnings (that's the
-Walloption if you're using the Gnu compiler).
'Lint' is the archetypical static analysis tool.
valgrindis a good run-time analyzer.