Environment: Win32, C/C++
All three (SetEvent, PostMessage and SendMessage) can be used by a thread to signal to main() that it has completed an operation, for example.
But which one is the fastest signal of all?
hmm…
Tags: c++, postmessage, sendmessage, winapi
Related Solutions

Here is a real-world example: fixed-point multiplies on old compilers.

These don't only come in handy on devices without floating point; they shine when it comes to precision, as they give you 32 bits of precision with a predictable error (float only has 23 bits, and it's harder to predict precision loss), i.e. uniform absolute precision over the entire range, instead of the close-to-uniform relative precision of float.
(Modern compilers optimize this fixed-point example nicely, so for a more modern example that still needs compiler-specific code, consider getting the high part of a 64-bit integer multiplication: a portable version that builds the full product out of uint64_t 32x32 => 64-bit partial multiplies fails to optimize on a 64-bit CPU, so you need intrinsics or __int128 for efficient code on 64-bit systems.)

C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:
// on a 32-bit machine, int can hold 32-bit fixed-point integers.
int inline FixedPointMul (int a, int b)
{
long long a_long = a; // cast to 64 bit.
long long product = a_long * b; // perform multiplication
return (int) (product >> 16); // shift by the fixed point bias
}
The problem with this code is that we do something that can't be directly expressed in the C language. We want to multiply two 32-bit numbers and get a 64-bit result, of which we return the middle 32 bits. However, in C this multiply does not exist. All you can do is promote the integers to 64 bits and do a 64*64 => 64-bit multiply.
x86 (and ARM, MIPS and others) can, however, do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that calls a runtime library function to do the multiply. The shift by 16 is also often done by a library routine (even though x86 can do such shifts in a single instruction).

So we're left with one or two library calls just for a multiply. This has serious consequences: not only are the calls slower, but registers must be preserved across them, and they defeat inlining and loop unrolling as well.
If you rewrite the same code in (inline) assembler you can gain a significant speed boost.
However, using asm is not the best way to solve the problem. Most compilers let you use certain assembler instructions in intrinsic form when you can't express them in C. The VS.NET 2008 compiler, for example, exposes the 32*32 => 64-bit multiply as __emul and the 64-bit shift as __ll_rshift.
Using intrinsics, you can rewrite the function in a way that gives the C compiler a chance to understand what's going on. This lets the code be inlined and register-allocated, and common subexpression elimination and constant propagation can be done as well. You'll get a huge performance improvement over the hand-written assembler code that way.
For reference: the end result of the fixed-point mul with the VS.NET compiler is:
int inline FixedPointMul (int a, int b)
{
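// __emul: 32x32 => 64-bit multiply; __ll_rshift: 64-bit shift right by the fixed-point bias.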
return (int) __ll_rshift(__emul(a,b),16);
}
The performance difference for fixed-point divides is even bigger: I saw improvements of up to a factor of 10 for division-heavy fixed-point code by writing a couple of lines of asm.
Using Visual C++ 2013 gives the same assembly code for both versions.
gcc4.1 from 2007 also optimizes the pure C version nicely. (The Godbolt compiler explorer doesn't have any earlier versions of gcc installed, but presumably even older GCC versions could do this without intrinsics.)
See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately it doesn't have any compilers old enough to produce bad code from the simple pure C version.)
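To make the 64-bit case from the parenthetical above concrete, here's a minimal sketch, assuming a compiler with GNU C's __int128 extension (GCC or Clang targeting a 64-bit CPU); the function name is mine, not from the original:

#include <stdint.h>

// 64-bit analogue of FixedPointMul, using GNU C's __int128 extension.
// On x86-64, GCC and Clang compile this to a single multiply plus a shift;
// a portable version built from four 32x32 => 64-bit partial products
// typically does not optimize down to one full multiply.
inline int64_t FixedPointMul64 (int64_t a, int64_t b)
{
    __int128 product = (__int128) a * b;   // full 64x64 => 128-bit multiply
    return (int64_t) (product >> 16);      // shift out the fixed-point bias
}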
Modern CPUs can do things C doesn't have operators for at all, like popcnt, or bit-scan to find the first or last set bit. (POSIX has an ffs() function, but its semantics don't match x86 bsf / bsr. See https://en.wikipedia.org/wiki/Find_first_set.)
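To illustrate the semantic mismatch, here is a small sketch of my own using the GNU C bit-scan builtins: ffs() returns a 1-based index and a defined 0 for zero input, while __builtin_ctz / __builtin_clz behave like bsf / bsr, including being undefined for a zero input.

#include <strings.h>   // ffs() (POSIX)
#include <stdio.h>

int main (void)
{
    unsigned x = 0x58;                          // binary 0101'1000
    printf("ffs: %d\n", ffs(x));                // 4: 1-based index of lowest set bit
    printf("ctz: %d\n", __builtin_ctz(x));      // 3: 0-based trailing-zero count (bsf/tzcnt)
    printf("clz: %d\n", __builtin_clz(x));      // 25: leading zeros in a 32-bit word (bsr-related)
    // Note: __builtin_ctz / __builtin_clz are undefined for x == 0, just like
    // the destination register after x86 bsf / bsr with a zero input.
    return 0;
}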
Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcount in GNU C, or on x86 if you're only targeting hardware with SSE4.2: _mm_popcnt_u32 from <immintrin.h>.
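For instance (my sketch, not code from the original), this is the kind of loop some compilers can pattern-match into popcnt, next to the builtin that expresses the intent directly:

#include <stdint.h>

// The naive loop: GCC/Clang can sometimes recognize this pattern and emit
// popcnt when it's enabled (e.g. -mpopcnt, or a -march= that implies it).
inline int popcount_loop (uint32_t x)
{
    int count = 0;
    while (x) {
        x &= x - 1;   // clear the lowest set bit
        ++count;
    }
    return count;
}

// The reliable way in GNU C: always compiles to something correct, and to
// a single popcnt instruction when the target supports it.
inline int popcount_builtin (uint32_t x)
{
    return __builtin_popcount(x);
}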
Or in C++, assign to a std::bitset<32> and use .count(). (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct, and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.
Similarly, ntohl can compile to bswap (x86 32-bit byte swap for endian conversion) on some C implementations that have it.
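In the same spirit (my illustration, not from the original): the portable shift-and-mask idiom below is recognized by current GCC and Clang and compiled to a single bswap, and GNU C also exposes the operation directly as a builtin:

#include <stdint.h>

// Portable byte swap: modern GCC/Clang pattern-match this into one bswap.
inline uint32_t swap32_portable (uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

// The explicit GNU C builtin, guaranteed to use the best available sequence.
inline uint32_t swap32_builtin (uint32_t x)
{
    return __builtin_bswap32(x);
}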
Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0;, but often do badly or don't auto-vectorize at all when things get more complicated. For example, you're unlikely to get anything like the answers to "How to implement atoi using SIMD?" generated automatically by the compiler from scalar code.
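To show what the manual-vectorization style looks like, here is a sketch with SSE2 intrinsics (the compiler handles this simple loop fine on its own, so treat it purely as an illustration of the form; the function name is mine):

#include <emmintrin.h>   // SSE2
#include <stddef.h>

// dst[i] += src[i] * 10.0 for n doubles, two elements per iteration.
// Assumes n is even; a real version would handle the tail and alignment.
void scale_accumulate (double *dst, const double *src, size_t n)
{
    const __m128d ten = _mm_set1_pd(10.0);
    for (size_t i = 0; i < n; i += 2) {
        __m128d s = _mm_loadu_pd(src + i);        // load 2 doubles
        __m128d d = _mm_loadu_pd(dst + i);
        d = _mm_add_pd(d, _mm_mul_pd(s, ten));    // d += s * 10.0
        _mm_storeu_pd(dst + i, d);                // store 2 doubles
    }
}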
Just about every modern operating system will recover all the allocated memory space after a program exits. The only exception I can think of might be something like Palm OS where the program's static storage and runtime memory are pretty much the same thing, so not freeing might cause the program to take up more storage. (I'm only speculating here.)
So generally, there's no harm in it, except the runtime cost of having more storage than you need. Certainly in the example you give, you want to keep the memory for a variable that might be used until it's cleared.
However, it's considered good style to free memory as soon as you don't need it any more, and to free anything you still have around on program exit. It's more of an exercise in knowing what memory you're using, and thinking about whether you still need it. If you don't keep track, you might have memory leaks.
On the other hand, the similar admonition to close your files on exit has a much more concrete result: if you don't, the data you wrote to them might not get flushed, or if they're temp files, they might not get deleted when you're done. Also, database handles should have their transactions committed and then be closed when you're done with them. Similarly, if you're using an object-oriented language like C++ or Objective-C, not freeing an object when you're done with it means the destructor will never get called, and any resources the class is responsible for might not get cleaned up.
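A tiny illustration of the file point (mine, not the answer's): buffered stdio data can be lost if the process dies before the stream is closed or flushed.

#include <stdio.h>

int main (void)
{
    FILE *f = fopen("out.txt", "w");
    if (!f) return 1;
    fprintf(f, "important result\n");   // sits in the stdio buffer for now

    // fclose() flushes the buffer and releases the OS handle. Skipping it
    // and dying via _Exit() or a crash would lose the buffered line, since
    // that bypasses the stdio cleanup that a normal exit() performs.
    fclose(f);
    return 0;
}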
Best Solution
All three options require a thread context switch to actually signal the receiving thread. It's quite likely that the overhead of the context switch will overwhelm any difference in processing cost in any of the APIs.
The choice is probably best driven by the nature of the receiving thread, e.g. whether it is a UI thread and/or runs a message loop. That said, some finer details:

SendMessage is useful when the receiving thread is a UI thread churning inside a message loop. The sending thread blocks until the recipient processes the message, although it can still handle non-queued messages sent to it while it waits. That logic could potentially slow things down, as additional context switches may be involved, making SendMessage the slowest of the three.
PostMessage is also useful when the recipient is inside a message loop. The difference from SendMessage is that it doesn't wait for the recipient to process the message, thus incurring less overhead.
SetEvent is useful when the receiving thread can wait on an event object, e.g. with WaitForSingleObject(). It incurs no marshaling or message processing overhead, and is likely to respond quicker than the others.
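A minimal sketch of the SetEvent / WaitForSingleObject pattern described above (the names and the trivial payload are mine):

#include <windows.h>
#include <stdio.h>

static HANDLE g_done;   // auto-reset event the worker signals

static DWORD WINAPI Worker (LPVOID param)
{
    (void) param;
    // ... do the actual work here ...
    SetEvent(g_done);   // signal main() that the operation completed
    return 0;
}

int main (void)
{
    g_done = CreateEvent(NULL, FALSE, FALSE, NULL);  // auto-reset, non-signaled
    HANDLE thread = CreateThread(NULL, 0, Worker, NULL, 0, NULL);

    WaitForSingleObject(g_done, INFINITE);           // block until signaled
    printf("worker finished\n");

    WaitForSingleObject(thread, INFINITE);           // reap the thread
    CloseHandle(thread);
    CloseHandle(g_done);
    return 0;
}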