Caveat: It is not necessary to put the implementation in the header file, see the alternative solution at the end of this answer.
Anyway, the reason your code is failing is that, when instantiating a template, the compiler creates a new class with the given template argument. For example:
template<typename T>
struct Foo
{
    T bar;
    void doSomething(T param) {/* do stuff using T */}
};

// somewhere in a .cpp
Foo<int> f;
When reading this line, the compiler will create a new class (let's call it FooInt), which is equivalent to the following:
struct FooInt
{
    int bar;
    void doSomething(int param) {/* do stuff using int */}
};
Consequently, the compiler needs to have access to the implementation of the methods, to instantiate them with the template argument (in this case int). If these implementations were not in the header, they wouldn't be accessible, and therefore the compiler wouldn't be able to instantiate the template.
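To see this failure concretely, here is a minimal sketch of the classic two-file split (hypothetical file names). Each translation unit compiles on its own, but the program fails to link:

Foo.h
template <typename T>
struct Foo
{
    void doSomething(T param);
};

Foo.cpp
#include "Foo.h"

// The definition lives here, invisible to every other translation unit.
template <typename T>
void Foo<T>::doSomething(T param) { /* do stuff using T */ }

main.cpp
#include "Foo.h"

int main()
{
    Foo<int> f;
    f.doSomething(42); // linker error: undefined reference to
                       // Foo<int>::doSomething(int), because no translation
                       // unit ever instantiated that definition
}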
A common solution to this is to write the template declaration in a header file, then implement the class in an implementation file (for example .tpp), and include this implementation file at the end of the header.
Foo.h
template <typename T>
struct Foo
{
    void doSomething(T param);
};

#include "Foo.tpp"
Foo.tpp
template <typename T>
void Foo<T>::doSomething(T param)
{
    // implementation
}
This way, the implementation is still separated from the declaration, but it remains accessible to the compiler.
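For instance, a client translation unit (a hypothetical main.cpp) can now use any instantiation it likes, because including Foo.h also pulls in the definitions from Foo.tpp:

main.cpp
#include "Foo.h" // brings in Foo.tpp as well

int main()
{
    Foo<double> f;
    f.doSomething(3.14); // instantiated here; the definition is visible
}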
Alternative solution
Another solution is to keep the implementation separated, and explicitly instantiate all the template instances you'll need:
Foo.h
// no implementation
template <typename T> struct Foo { ... };
Foo.cpp
// implementation of Foo's methods
// explicit instantiations
template class Foo<int>;
template class Foo<float>;
// You will only be able to use Foo with int or float
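To make this concrete, a fuller Foo.cpp might look like the following sketch (the doSomething method and its body are illustrative, carried over from the earlier example):

Foo.cpp
#include "Foo.h"

template <typename T>
void Foo<T>::doSomething(T param)
{
    /* do stuff using T */
}

// explicit instantiations
template class Foo<int>;
template class Foo<float>;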
If my explanation isn't clear enough, you can have a look at the C++ Super-FAQ on this subject.
Culprit: False Data Dependency (and the compiler isn't even aware of it)
On Sandy/Ivy Bridge and Haswell processors, the instruction:
popcnt src, dest
appears to have a false dependency on the destination register dest. Even though the instruction only writes to it, the instruction will wait until dest is ready before executing. This false dependency is (now) documented by Intel as erratum HSD146 (Haswell) and SKL029 (Skylake).
Skylake fixed this for lzcnt and tzcnt. Cannon Lake (and Ice Lake) fixed this for popcnt.
bsf/bsr have a true output dependency: output unmodified for input=0. (But no way to take advantage of that with intrinsics - only AMD documents it, and compilers don't expose it.)
(Yes, these instructions all run on the same execution unit).
This dependency doesn't just hold up the 4 popcnts from a single loop iteration. It can carry across loop iterations, making it impossible for the processor to parallelize different loop iterations.
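For reference, here is a hedged C++ reconstruction of the kind of 4-way unrolled loop in question (my sketch, not the exact benchmark code; compile with popcnt support, e.g. -mpopcnt):

#include <cstdint>
#include <x86intrin.h>

// Hypothetical helper: counts the set bits in n uint64_t words (n % 4 == 0).
uint64_t popcount_buffer(const uint64_t* buffer, uint64_t n)
{
    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (uint64_t i = 0; i < n; i += 4) {
        // Four independent accumulators: in principle, the four popcnts
        // of each iteration could execute in parallel...
        c0 += _mm_popcnt_u64(buffer[i + 0]);
        c1 += _mm_popcnt_u64(buffer[i + 1]);
        c2 += _mm_popcnt_u64(buffer[i + 2]);
        c3 += _mm_popcnt_u64(buffer[i + 3]);
        // ...but if the register allocator reuses one scratch register as
        // the destination of all four popcnts, the false output dependency
        // chains them together, within and across iterations.
    }
    return c0 + c1 + c2 + c3;
}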
The unsigned vs. uint64_t and other tweaks don't directly affect the problem. But they influence the register allocator, which assigns registers to the variables. In your case, the speeds are a direct result of what is stuck onto the (false) dependency chain, depending on what the register allocator decided to do.
- 13 GB/s has a chain: popcnt-add-popcnt-popcnt → next iteration
- 15 GB/s has a chain: popcnt-add-popcnt-add → next iteration
- 20 GB/s has a chain: popcnt-popcnt → next iteration
- 26 GB/s has a chain: popcnt-popcnt → next iteration
The difference between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing. Either way, the processor starts to hit other bottlenecks once you reach this speed.
To test this, I used inline assembly to bypass the compiler and get exactly the assembly I want. I also split up the count variable to break all other dependencies that might mess with the benchmarks.
Here are the results:
Sandy Bridge Xeon @ 3.5 GHz (full test code can be found at the bottom):
- GCC 4.6.3 on Ubuntu 12: g++ popcnt.cpp -std=c++0x -O3 -save-temps -march=native
Different Registers: 18.6195 GB/s
.L4:
    movq    (%rbx,%rax,8), %r8
    movq    8(%rbx,%rax,8), %r9
    movq    16(%rbx,%rax,8), %r10
    movq    24(%rbx,%rax,8), %r11
    addq    $4, %rax
    popcnt  %r8, %r8
    add     %r8, %rdx
    popcnt  %r9, %r9
    add     %r9, %rcx
    popcnt  %r10, %r10
    add     %r10, %rdi
    popcnt  %r11, %r11
    add     %r11, %rsi
    cmpq    $131072, %rax
    jne     .L4
Same Register: 8.49272 GB/s
.L9:
    movq    (%rbx,%rdx,8), %r9
    movq    8(%rbx,%rdx,8), %r10
    movq    16(%rbx,%rdx,8), %r11
    movq    24(%rbx,%rdx,8), %rbp
    addq    $4, %rdx
    # This time reuse "rax" for all the popcnts.
    popcnt  %r9, %rax
    add     %rax, %rcx
    popcnt  %r10, %rax
    add     %rax, %rsi
    popcnt  %r11, %rax
    add     %rax, %r8
    popcnt  %rbp, %rax
    add     %rax, %rdi
    cmpq    $131072, %rdx
    jne     .L9
Same Register with broken chain: 17.8869 GB/s
.L14:
    movq    (%rbx,%rdx,8), %r9
    movq    8(%rbx,%rdx,8), %r10
    movq    16(%rbx,%rdx,8), %r11
    movq    24(%rbx,%rdx,8), %rbp
    addq    $4, %rdx
    # Reuse "rax" for all the popcnts.
    xor     %rax, %rax    # Break the cross-iteration dependency by zeroing "rax".
    popcnt  %r9, %rax
    add     %rax, %rcx
    popcnt  %r10, %rax
    add     %rax, %rsi
    popcnt  %r11, %rax
    add     %rax, %r8
    popcnt  %rbp, %rax
    add     %rax, %rdi
    cmpq    $131072, %rdx
    jne     .L14
So what went wrong with the compiler?
It seems that neither GCC nor Visual Studio is aware that popcnt has such a false dependency. Nevertheless, false dependencies like this aren't uncommon. It's just a matter of whether the compiler is aware of them.

popcnt isn't exactly the most used instruction. So it's not really a surprise that a major compiler could miss something like this. There also appears to be no documentation anywhere that mentions this problem. If Intel doesn't disclose it, then nobody outside will know until someone runs into it by chance.
(Update: As of version 4.9.2, GCC is aware of this false dependency and generates code to compensate for it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC, are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.)
Why does the CPU have such a false dependency?
We can speculate: it runs on the same execution unit as bsf/bsr, which do have an output dependency. (How is POPCNT implemented in hardware?) For those instructions, Intel documents the integer result for input=0 as "undefined" (with ZF=1), but Intel hardware actually gives a stronger guarantee to avoid breaking old software: output unmodified. AMD documents this behaviour.
Presumably it was somehow inconvenient to make some uops for this execution unit dependent on the output but others not.
AMD processors do not appear to have this false dependency.
The full test code is below for reference:
#include <iostream>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <x86intrin.h>

int main(int argc, char* argv[]) {
    using namespace std;
    uint64_t size = 1 << 20;

    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i = 0; i < size; ++i)
        charbuffer[i] = rand() % 256;

    uint64_t count, duration;
    chrono::time_point<chrono::system_clock> startP, endP;
    {
        uint64_t c0 = 0;
        uint64_t c1 = 0;
        uint64_t c2 = 0;
        uint64_t c3 = 0;
        startP = chrono::system_clock::now();
        for (unsigned k = 0; k < 10000; k++) {
            for (uint64_t i = 0; i < size/8; i += 4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                // r0..r3 are clobbered by the popcnts, so they are listed
                // as in/out operands rather than read-only inputs.
                __asm__(
                    "popcnt %4, %4  \n\t"
                    "add %4, %0     \n\t"
                    "popcnt %5, %5  \n\t"
                    "add %5, %1     \n\t"
                    "popcnt %6, %6  \n\t"
                    "add %6, %2     \n\t"
                    "popcnt %7, %7  \n\t"
                    "add %7, %3     \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3),
                      "+r" (r0), "+r" (r1), "+r" (r2), "+r" (r3)
                );
            }
        }
        count = c0 + c1 + c2 + c3;
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP - startP).count();
        cout << "No Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        uint64_t c0 = 0;
        uint64_t c1 = 0;
        uint64_t c2 = 0;
        uint64_t c3 = 0;
        startP = chrono::system_clock::now();
        for (unsigned k = 0; k < 10000; k++) {
            for (uint64_t i = 0; i < size/8; i += 4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                // This time reuse "rax" for all the popcnts.
                __asm__(
                    "popcnt %4, %%rax   \n\t"
                    "add %%rax, %0      \n\t"
                    "popcnt %5, %%rax   \n\t"
                    "add %%rax, %1      \n\t"
                    "popcnt %6, %%rax   \n\t"
                    "add %%rax, %2      \n\t"
                    "popcnt %7, %%rax   \n\t"
                    "add %%rax, %3      \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                    : "r" (r0), "r" (r1), "r" (r2), "r" (r3)
                    : "rax"
                );
            }
        }
        count = c0 + c1 + c2 + c3;
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP - startP).count();
        cout << "Chain 4 \t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        uint64_t c0 = 0;
        uint64_t c1 = 0;
        uint64_t c2 = 0;
        uint64_t c3 = 0;
        startP = chrono::system_clock::now();
        for (unsigned k = 0; k < 10000; k++) {
            for (uint64_t i = 0; i < size/8; i += 4) {
                uint64_t r0 = buffer[i + 0];
                uint64_t r1 = buffer[i + 1];
                uint64_t r2 = buffer[i + 2];
                uint64_t r3 = buffer[i + 3];
                __asm__(
                    "xor %%rax, %%rax   \n\t" // <--- Break the chain.
                    "popcnt %4, %%rax   \n\t"
                    "add %%rax, %0      \n\t"
                    "popcnt %5, %%rax   \n\t"
                    "add %%rax, %1      \n\t"
                    "popcnt %6, %%rax   \n\t"
                    "add %%rax, %2      \n\t"
                    "popcnt %7, %%rax   \n\t"
                    "add %%rax, %3      \n\t"
                    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
                    : "r" (r0), "r" (r1), "r" (r2), "r" (r3)
                    : "rax"
                );
            }
        }
        count = c0 + c1 + c2 + c3;
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP - startP).count();
        cout << "Broken Chain\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer; // allocated with new[], so delete[] rather than free()
}
An equally interesting benchmark can be found here: http://pastebin.com/kbzgL8si
This benchmark varies the number of popcnts that are in the (false) dependency chain.
False Chain 0: 41959360000 0.57748 sec 18.1578 GB/s
False Chain 1: 41959360000 0.585398 sec 17.9122 GB/s
False Chain 2: 41959360000 0.645483 sec 16.2448 GB/s
False Chain 3: 41959360000 0.929718 sec 11.2784 GB/s
False Chain 4: 41959360000 1.23572 sec 8.48557 GB/s
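The drop-off is consistent with longer and longer runs of popcnts sharing a destination register before the chain is broken. As a hedged illustration (my reconstruction of the idea, not the pastebin code), a "False Chain 2" inner block could zero the shared register after every second popcnt:

__asm__(
    "xor %%rax, %%rax   \n\t"   // start of a chain of length 2
    "popcnt %4, %%rax   \n\t"
    "add %%rax, %0      \n\t"
    "popcnt %5, %%rax   \n\t"   // falsely chained to the previous popcnt via rax
    "add %%rax, %1      \n\t"
    "xor %%rax, %%rax   \n\t"   // break the chain after two popcnts
    "popcnt %6, %%rax   \n\t"
    "add %%rax, %2      \n\t"
    "popcnt %7, %%rax   \n\t"
    "add %%rax, %3      \n\t"
    : "+r" (c0), "+r" (c1), "+r" (c2), "+r" (c3)
    : "r" (r0), "r" (r1), "r" (r2), "r" (r3)
    : "rax"
);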
Best Solution
It looks like I've sorted out the problem. As I expected, it was more of a lack-of-experience issue than anything else.
As it turns out, gcov was only finding a subset of the available tracefiles, and was therefore reporting only part of the total results. Finding and adding the rest of the tracefiles resolved the issue.
For anyone who is curious, the problematic header file still shows up in the output. In fact, all of the other header files do too, even though they contain no executable code. However, they all show 90+% coverage.