The source referenced by the OP has some credibility ...but what about Microsoft - what is the stance on struct usage? I sought some extra learning from Microsoft, and here is what I found:
Consider defining a structure instead of a class if instances of the
type are small and commonly short-lived or are commonly embedded in
other objects.
Do not define a structure unless the type has all of the following characteristics:
- It logically represents a single value, similar to primitive types (integer, double, and so on).
- It has an instance size smaller than 16 bytes.
- It is immutable.
- It will not have to be boxed frequently.
Microsoft consistently violates those rules
Okay, #2 and #3 anyway. Our beloved dictionary has 2 internal structs:
[StructLayout(LayoutKind.Sequential)] // default for structs
private struct Entry //<Tkey, TValue>
{
// View code at *Reference Source
}
[Serializable, StructLayout(LayoutKind.Sequential)]
public struct Enumerator :
IEnumerator<KeyValuePair<TKey, TValue>>, IDisposable,
IDictionaryEnumerator, IEnumerator
{
// View code at *Reference Source
}
*Reference Source
The 'JonnyCantCode.com' source got 3 out of 4 - quite forgivable since #4 probably wouldn't be an issue. If you find yourself boxing a struct, rethink your architecture.
Let's look at why Microsoft would use these structs:
- Each struct,
Entry
and Enumerator
, represent single values.
- Speed
Entry
is never passed as a parameter outside of the Dictionary class. Further investigation shows that in order to satisfy implementation of IEnumerable, Dictionary uses the Enumerator
struct which it copies every time an enumerator is requested ...makes sense.
- Internal to the Dictionary class.
Enumerator
is public because Dictionary is enumerable and must have equal accessibility to the IEnumerator interface implementation - e.g. IEnumerator getter.
Update - In addition, realize that when a struct implements an interface - as Enumerator does - and is cast to that implemented type, the struct becomes a reference type and is moved to the heap. Internal to the Dictionary class, Enumerator is still a value type. However, as soon as a method calls GetEnumerator()
, a reference-type IEnumerator
is returned.
What we don't see here is any attempt or proof of requirement to keep structs immutable or maintaining an instance size of only 16 bytes or less:
- Nothing in the structs above is declared
readonly
- not immutable
- Size of these struct could be well over 16 bytes
Entry
has an undetermined lifetime (from Add()
, to Remove()
, Clear()
, or garbage collection);
And ...
4. Both structs store TKey and TValue, which we all know are quite capable of being reference types (added bonus info)
Hashed keys notwithstanding, dictionaries are fast in part because instancing a struct is quicker than a reference type. Here, I have a Dictionary<int, int>
that stores 300,000 random integers with sequentially incremented keys.
Capacity: 312874
MemSize: 2660827 bytes
Completed Resize: 5ms
Total time to fill: 889ms
Capacity: number of elements available before the internal array must be resized.
MemSize: determined by serializing the dictionary into a MemoryStream and getting a byte length (accurate enough for our purposes).
Completed Resize: the time it takes to resize the internal array from 150862 elements to 312874 elements. When you figure that each element is sequentially copied via Array.CopyTo()
, that ain't too shabby.
Total time to fill: admittedly skewed due to logging and an OnResize
event I added to the source; however, still impressive to fill 300k integers while resizing 15 times during the operation. Just out of curiosity, what would the total time to fill be if I already knew the capacity? 13ms
So, now, what if Entry
were a class? Would these times or metrics really differ that much?
Capacity: 312874
MemSize: 2660827 bytes
Completed Resize: 26ms
Total time to fill: 964ms
Obviously, the big difference is in resizing. Any difference if Dictionary is initialized with the Capacity? Not enough to be concerned with ... 12ms.
What happens is, because Entry
is a struct, it does not require initialization like a reference type. This is both the beauty and the bane of the value type. In order to use Entry
as a reference type, I had to insert the following code:
/*
* Added to satisfy initialization of entry elements --
* this is where the extra time is spent resizing the Entry array
* **/
for (int i = 0 ; i < prime ; i++)
{
destinationArray[i] = new Entry( );
}
/* *********************************************** */
The reason I had to initialize each array element of Entry
as a reference type can be found at MSDN: Structure Design. In short:
Do not provide a default constructor for a structure.
If a structure defines a default constructor, when arrays of the
structure are created, the common language runtime automatically
executes the default constructor on each array element.
Some compilers, such as the C# compiler, do not allow structures to
have default constructors.
It is actually quite simple and we will borrow from Asimov's Three Laws of Robotics:
- The struct must be safe to use
- The struct must perform its function efficiently, unless this would violate rule #1
- The struct must remain intact during its use unless its destruction is required to satisfy rule #1
...what do we take away from this: in short, be responsible with the use of value types. They are quick and efficient, but have the ability to cause many unexpected behaviors if not properly maintained (i.e. unintentional copies).
Here is a real world example: Fixed point multiplies on old compilers.
These don't only come handy on devices without floating point, they shine when it comes to precision as they give you 32 bits of precision with a predictable error (float only has 23 bit and it's harder to predict precision loss). i.e. uniform absolute precision over the entire range, instead of close-to-uniform relative precision (float
).
Modern compilers optimize this fixed-point example nicely, so for more modern examples that still need compiler-specific code, see
C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:
// on a 32-bit machine, int can hold 32-bit fixed-point integers.
int inline FixedPointMul (int a, int b)
{
long long a_long = a; // cast to 64 bit.
long long product = a_long * b; // perform multiplication
return (int) (product >> 16); // shift by the fixed point bias
}
The problem with this code is that we do something that can't be directly expressed in the C-language. We want to multiply two 32 bit numbers and get a 64 bit result of which we return the middle 32 bit. However, in C this multiply does not exist. All you can do is to promote the integers to 64 bit and do a 64*64 = 64 multiply.
x86 (and ARM, MIPS and others) can however do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that calls a runtime library function to do the multiply. The shift by 16 is also often done by a library routine (also the x86 can do such shifts).
So we're left with one or two library calls just for a multiply. This has serious consequences. Not only is the shift slower, registers must be preserved across the function calls and it does not help inlining and code-unrolling either.
If you rewrite the same code in (inline) assembler you can gain a significant speed boost.
In addition to this: using ASM is not the best way to solve the problem. Most compilers allow you to use some assembler instructions in intrinsic form if you can't express them in C. The VS.NET2008 compiler for example exposes the 32*32=64 bit mul as __emul and the 64 bit shift as __ll_rshift.
Using intrinsics you can rewrite the function in a way that the C-compiler has a chance to understand what's going on. This allows the code to be inlined, register allocated, common subexpression elimination and constant propagation can be done as well. You'll get a huge performance improvement over the hand-written assembler code that way.
For reference: The end-result for the fixed-point mul for the VS.NET compiler is:
int inline FixedPointMul (int a, int b)
{
return (int) __ll_rshift(__emul(a,b),16);
}
The performance difference of fixed point divides is even bigger. I had improvements up to factor 10 for division heavy fixed point code by writing a couple of asm-lines.
Using Visual C++ 2013 gives the same assembly code for both ways.
gcc4.1 from 2007 also optimizes the pure C version nicely. (The Godbolt compiler explorer doesn't have any earlier versions of gcc installed, but presumably even older GCC versions could do this without intrinsics.)
See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately it doesn't have any compilers old enough to produce bad code from the simple pure C version.)
Modern CPUs can do things C doesn't have operators for at all, like popcnt
or bit-scan to find the first or last set bit. (POSIX has a ffs()
function, but its semantics don't match x86 bsf
/ bsr
. See https://en.wikipedia.org/wiki/Find_first_set).
Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt
instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcnt
in GNU C, or on x86 if you're only targeting hardware with SSE4.2: _mm_popcnt_u32
from <immintrin.h>
.
Or in C++, assign to a std::bitset<32>
and use .count()
. (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct, and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.
Similarly, ntohl
can compile to bswap
(x86 32-bit byte swap for endian conversion) on some C implementations that have it.
Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0;
, but often do badly or don't auto-vectorize at all when things get more complicated. For example, you're unlikely to get anything like How to implement atoi using SIMD? generated automatically by the compiler from scalar code.
Best Answer
\d
checks all Unicode digits, while[0-9]
is limited to these 10 characters. For example, Persian digits,۱۲۳۴۵۶۷۸۹
, are an example of Unicode digits which are matched with\d
, but not[0-9]
.You can generate a list of all such characters using the following code:
Which generates: