Binary floating point math is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1
, which is 1/10
) whose denominator is not a power of two cannot be exactly represented.
For 0.1
in the standard binary64
format, the representation can be written exactly as
0.1000000000000000055511151231257827021181583404541015625
in decimal, or
0x1.999999999999ap-4
in C99 hexfloat notation.
In contrast, the rational number 0.1
, which is 1/10
, can be written exactly as
0.1
in decimal, or
0x1.99999999999999...p-4
in an analogue of C99 hexfloat notation, where the ...
represents an unending sequence of 9's.
The constants 0.2
and 0.3
in your program will also be approximations to their true values. It happens that the closest double
to 0.2
is larger than the rational number 0.2
but that the closest double
to 0.3
is smaller than the rational number 0.3
. The sum of 0.1
and 0.2
winds up being larger than the rational number 0.3
and hence disagreeing with the constant in your code.
A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.
Side Note: All positional (base-N) number systems share this problem with precision
Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...
You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system, but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001)** - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.
** Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).
So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)
Side Side Note: Working with Floats in Programming
In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.
You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:
Do not do if (x == y) { ... }
Instead do if (abs(x - y) < myToleranceValue) { ... }
.
where abs
is the absolute value. myToleranceValue
needs to be chosen for your particular application - and it will have a lot to do with how much "wiggle room" you are prepared to allow, and what the largest number you are going to be comparing may be (due to loss of precision issues). Beware of "epsilon" style constants in your language of choice. These are not to be used as tolerance values.
Ok, here's what I've come up with for a fixed-point struct, based on the link in my original question but also including some fixes to how it was handling division and multiplication, and added logic for modules, comparisons, shifts, etc:
public struct FInt
{
public long RawValue;
public const int SHIFT_AMOUNT = 12; //12 is 4096
public const long One = 1 << SHIFT_AMOUNT;
public const int OneI = 1 << SHIFT_AMOUNT;
public static FInt OneF = FInt.Create( 1, true );
#region Constructors
public static FInt Create( long StartingRawValue, bool UseMultiple )
{
FInt fInt;
fInt.RawValue = StartingRawValue;
if ( UseMultiple )
fInt.RawValue = fInt.RawValue << SHIFT_AMOUNT;
return fInt;
}
public static FInt Create( double DoubleValue )
{
FInt fInt;
DoubleValue *= (double)One;
fInt.RawValue = (int)Math.Round( DoubleValue );
return fInt;
}
#endregion
public int IntValue
{
get { return (int)( this.RawValue >> SHIFT_AMOUNT ); }
}
public int ToInt()
{
return (int)( this.RawValue >> SHIFT_AMOUNT );
}
public double ToDouble()
{
return (double)this.RawValue / (double)One;
}
public FInt Inverse
{
get { return FInt.Create( -this.RawValue, false ); }
}
#region FromParts
/// <summary>
/// Create a fixed-int number from parts. For example, to create 1.5 pass in 1 and 500.
/// </summary>
/// <param name="PreDecimal">The number above the decimal. For 1.5, this would be 1.</param>
/// <param name="PostDecimal">The number below the decimal, to three digits.
/// For 1.5, this would be 500. For 1.005, this would be 5.</param>
/// <returns>A fixed-int representation of the number parts</returns>
public static FInt FromParts( int PreDecimal, int PostDecimal )
{
FInt f = FInt.Create( PreDecimal, true );
if ( PostDecimal != 0 )
f.RawValue += ( FInt.Create( PostDecimal ) / 1000 ).RawValue;
return f;
}
#endregion
#region *
public static FInt operator *( FInt one, FInt other )
{
FInt fInt;
fInt.RawValue = ( one.RawValue * other.RawValue ) >> SHIFT_AMOUNT;
return fInt;
}
public static FInt operator *( FInt one, int multi )
{
return one * (FInt)multi;
}
public static FInt operator *( int multi, FInt one )
{
return one * (FInt)multi;
}
#endregion
#region /
public static FInt operator /( FInt one, FInt other )
{
FInt fInt;
fInt.RawValue = ( one.RawValue << SHIFT_AMOUNT ) / ( other.RawValue );
return fInt;
}
public static FInt operator /( FInt one, int divisor )
{
return one / (FInt)divisor;
}
public static FInt operator /( int divisor, FInt one )
{
return (FInt)divisor / one;
}
#endregion
#region %
public static FInt operator %( FInt one, FInt other )
{
FInt fInt;
fInt.RawValue = ( one.RawValue ) % ( other.RawValue );
return fInt;
}
public static FInt operator %( FInt one, int divisor )
{
return one % (FInt)divisor;
}
public static FInt operator %( int divisor, FInt one )
{
return (FInt)divisor % one;
}
#endregion
#region +
public static FInt operator +( FInt one, FInt other )
{
FInt fInt;
fInt.RawValue = one.RawValue + other.RawValue;
return fInt;
}
public static FInt operator +( FInt one, int other )
{
return one + (FInt)other;
}
public static FInt operator +( int other, FInt one )
{
return one + (FInt)other;
}
#endregion
#region -
public static FInt operator -( FInt one, FInt other )
{
FInt fInt;
fInt.RawValue = one.RawValue - other.RawValue;
return fInt;
}
public static FInt operator -( FInt one, int other )
{
return one - (FInt)other;
}
public static FInt operator -( int other, FInt one )
{
return (FInt)other - one;
}
#endregion
#region ==
public static bool operator ==( FInt one, FInt other )
{
return one.RawValue == other.RawValue;
}
public static bool operator ==( FInt one, int other )
{
return one == (FInt)other;
}
public static bool operator ==( int other, FInt one )
{
return (FInt)other == one;
}
#endregion
#region !=
public static bool operator !=( FInt one, FInt other )
{
return one.RawValue != other.RawValue;
}
public static bool operator !=( FInt one, int other )
{
return one != (FInt)other;
}
public static bool operator !=( int other, FInt one )
{
return (FInt)other != one;
}
#endregion
#region >=
public static bool operator >=( FInt one, FInt other )
{
return one.RawValue >= other.RawValue;
}
public static bool operator >=( FInt one, int other )
{
return one >= (FInt)other;
}
public static bool operator >=( int other, FInt one )
{
return (FInt)other >= one;
}
#endregion
#region <=
public static bool operator <=( FInt one, FInt other )
{
return one.RawValue <= other.RawValue;
}
public static bool operator <=( FInt one, int other )
{
return one <= (FInt)other;
}
public static bool operator <=( int other, FInt one )
{
return (FInt)other <= one;
}
#endregion
#region >
public static bool operator >( FInt one, FInt other )
{
return one.RawValue > other.RawValue;
}
public static bool operator >( FInt one, int other )
{
return one > (FInt)other;
}
public static bool operator >( int other, FInt one )
{
return (FInt)other > one;
}
#endregion
#region <
public static bool operator <( FInt one, FInt other )
{
return one.RawValue < other.RawValue;
}
public static bool operator <( FInt one, int other )
{
return one < (FInt)other;
}
public static bool operator <( int other, FInt one )
{
return (FInt)other < one;
}
#endregion
public static explicit operator int( FInt src )
{
return (int)( src.RawValue >> SHIFT_AMOUNT );
}
public static explicit operator FInt( int src )
{
return FInt.Create( src, true );
}
public static explicit operator FInt( long src )
{
return FInt.Create( src, true );
}
public static explicit operator FInt( ulong src )
{
return FInt.Create( (long)src, true );
}
public static FInt operator <<( FInt one, int Amount )
{
return FInt.Create( one.RawValue << Amount, false );
}
public static FInt operator >>( FInt one, int Amount )
{
return FInt.Create( one.RawValue >> Amount, false );
}
public override bool Equals( object obj )
{
if ( obj is FInt )
return ( (FInt)obj ).RawValue == this.RawValue;
else
return false;
}
public override int GetHashCode()
{
return RawValue.GetHashCode();
}
public override string ToString()
{
return this.RawValue.ToString();
}
}
public struct FPoint
{
public FInt X;
public FInt Y;
public static FPoint Create( FInt X, FInt Y )
{
FPoint fp;
fp.X = X;
fp.Y = Y;
return fp;
}
public static FPoint FromPoint( Point p )
{
FPoint f;
f.X = (FInt)p.X;
f.Y = (FInt)p.Y;
return f;
}
public static Point ToPoint( FPoint f )
{
return new Point( f.X.IntValue, f.Y.IntValue );
}
#region Vector Operations
public static FPoint VectorAdd( FPoint F1, FPoint F2 )
{
FPoint result;
result.X = F1.X + F2.X;
result.Y = F1.Y + F2.Y;
return result;
}
public static FPoint VectorSubtract( FPoint F1, FPoint F2 )
{
FPoint result;
result.X = F1.X - F2.X;
result.Y = F1.Y - F2.Y;
return result;
}
public static FPoint VectorDivide( FPoint F1, int Divisor )
{
FPoint result;
result.X = F1.X / Divisor;
result.Y = F1.Y / Divisor;
return result;
}
#endregion
}
Based on the comments from ShuggyCoUk, I see that this is in Q12 format. That's reasonably precise for my purposes. Of course, aside from the bugfixes, I already had this basic format before I asked my question. What I was looking for were ways to calculate Sqrt, Atan2, Sin, and Cos in C# using a structure like this. There aren't any other things that I know of in C# that will handle this, but in Java I managed to find the MathFP library by Onno Hommes. It's a liberal source software license, so I've converted some of his functions to my purposes in C# (with a fix to atan2, I think). Enjoy:
#region PI, DoublePI
public static FInt PI = FInt.Create( 12868, false ); //PI x 2^12
public static FInt TwoPIF = PI * 2; //radian equivalent of 260 degrees
public static FInt PIOver180F = PI / (FInt)180; //PI / 180
#endregion
#region Sqrt
public static FInt Sqrt( FInt f, int NumberOfIterations )
{
if ( f.RawValue < 0 ) //NaN in Math.Sqrt
throw new ArithmeticException( "Input Error" );
if ( f.RawValue == 0 )
return (FInt)0;
FInt k = f + FInt.OneF >> 1;
for ( int i = 0; i < NumberOfIterations; i++ )
k = ( k + ( f / k ) ) >> 1;
if ( k.RawValue < 0 )
throw new ArithmeticException( "Overflow" );
else
return k;
}
public static FInt Sqrt( FInt f )
{
byte numberOfIterations = 8;
if ( f.RawValue > 0x64000 )
numberOfIterations = 12;
if ( f.RawValue > 0x3e8000 )
numberOfIterations = 16;
return Sqrt( f, numberOfIterations );
}
#endregion
#region Sin
public static FInt Sin( FInt i )
{
FInt j = (FInt)0;
for ( ; i < 0; i += FInt.Create( 25736, false ) ) ;
if ( i > FInt.Create( 25736, false ) )
i %= FInt.Create( 25736, false );
FInt k = ( i * FInt.Create( 10, false ) ) / FInt.Create( 714, false );
if ( i != 0 && i != FInt.Create( 6434, false ) && i != FInt.Create( 12868, false ) &&
i != FInt.Create( 19302, false ) && i != FInt.Create( 25736, false ) )
j = ( i * FInt.Create( 100, false ) ) / FInt.Create( 714, false ) - k * FInt.Create( 10, false );
if ( k <= FInt.Create( 90, false ) )
return sin_lookup( k, j );
if ( k <= FInt.Create( 180, false ) )
return sin_lookup( FInt.Create( 180, false ) - k, j );
if ( k <= FInt.Create( 270, false ) )
return sin_lookup( k - FInt.Create( 180, false ), j ).Inverse;
else
return sin_lookup( FInt.Create( 360, false ) - k, j ).Inverse;
}
private static FInt sin_lookup( FInt i, FInt j )
{
if ( j > 0 && j < FInt.Create( 10, false ) && i < FInt.Create( 90, false ) )
return FInt.Create( SIN_TABLE[i.RawValue], false ) +
( ( FInt.Create( SIN_TABLE[i.RawValue + 1], false ) - FInt.Create( SIN_TABLE[i.RawValue], false ) ) /
FInt.Create( 10, false ) ) * j;
else
return FInt.Create( SIN_TABLE[i.RawValue], false );
}
private static int[] SIN_TABLE = {
0, 71, 142, 214, 285, 357, 428, 499, 570, 641,
711, 781, 851, 921, 990, 1060, 1128, 1197, 1265, 1333,
1400, 1468, 1534, 1600, 1665, 1730, 1795, 1859, 1922, 1985,
2048, 2109, 2170, 2230, 2290, 2349, 2407, 2464, 2521, 2577,
2632, 2686, 2740, 2793, 2845, 2896, 2946, 2995, 3043, 3091,
3137, 3183, 3227, 3271, 3313, 3355, 3395, 3434, 3473, 3510,
3547, 3582, 3616, 3649, 3681, 3712, 3741, 3770, 3797, 3823,
3849, 3872, 3895, 3917, 3937, 3956, 3974, 3991, 4006, 4020,
4033, 4045, 4056, 4065, 4073, 4080, 4086, 4090, 4093, 4095,
4096
};
#endregion
private static FInt mul( FInt F1, FInt F2 )
{
return F1 * F2;
}
#region Cos, Tan, Asin
public static FInt Cos( FInt i )
{
return Sin( i + FInt.Create( 6435, false ) );
}
public static FInt Tan( FInt i )
{
return Sin( i ) / Cos( i );
}
public static FInt Asin( FInt F )
{
bool isNegative = F < 0;
F = Abs( F );
if ( F > FInt.OneF )
throw new ArithmeticException( "Bad Asin Input:" + F.ToDouble() );
FInt f1 = mul( mul( mul( mul( FInt.Create( 145103 >> FInt.SHIFT_AMOUNT, false ), F ) -
FInt.Create( 599880 >> FInt.SHIFT_AMOUNT, false ), F ) +
FInt.Create( 1420468 >> FInt.SHIFT_AMOUNT, false ), F ) -
FInt.Create( 3592413 >> FInt.SHIFT_AMOUNT, false ), F ) +
FInt.Create( 26353447 >> FInt.SHIFT_AMOUNT, false );
FInt f2 = PI / FInt.Create( 2, true ) - ( Sqrt( FInt.OneF - F ) * f1 );
return isNegative ? f2.Inverse : f2;
}
#endregion
#region ATan, ATan2
public static FInt Atan( FInt F )
{
return Asin( F / Sqrt( FInt.OneF + ( F * F ) ) );
}
public static FInt Atan2( FInt F1, FInt F2 )
{
if ( F2.RawValue == 0 && F1.RawValue == 0 )
return (FInt)0;
FInt result = (FInt)0;
if ( F2 > 0 )
result = Atan( F1 / F2 );
else if ( F2 < 0 )
{
if ( F1 >= 0 )
result = ( PI - Atan( Abs( F1 / F2 ) ) );
else
result = ( PI - Atan( Abs( F1 / F2 ) ) ).Inverse;
}
else
result = ( F1 >= 0 ? PI : PI.Inverse ) / FInt.Create( 2, true );
return result;
}
#endregion
#region Abs
public static FInt Abs( FInt F )
{
if ( F < 0 )
return F.Inverse;
else
return F;
}
#endregion
There are a number of other functions in Dr. Hommes' MathFP library, but they were beyond what I needed, and so I have not taken the time to translate them to C# (that process was made extra difficult by the fact that he was using a long, and I am using the FInt struct, which makes the conversion rules are a bit challenging to see immediately).
The accuracy of these functions as they are coded here is more than enough for my purposes, but if you need more you can increase the SHIFT AMOUNT on FInt. Just be aware that if you do so, the constants on Dr. Hommes' functions will then need to be divided by 4096 and then multiplied by whatever your new SHIFT AMOUNT requires. You're likely to run into some bugs if you do that and aren't careful, so be sure to run checks against the built-in Math functions to make sure that your results aren't being put off by incorrectly adjusting a constant.
So far, this FInt logic seems as fast, if not perhaps a bit faster, than the equivalent built in .net functions. That would obviously vary by machine, since the fp coprocessor would determine that, so I have not run specific benchmarks. But they are integrated into my game now, and I've seen a slight decrease in processor utilization compared to before (this is on a Q6600 quad core -- about a 1% drop in usage on average).
Thanks again to everyone who commented for your help. No one pointed me directly to what I was looking for, but you gave me some clues that helped me find it myself on google. I hope this code turns out to be useful for someone else, since there doesn't seem to be anything comparable in C# posted publicly.
Best Solution
Ok, a quick course in Matrix/Vector calculation:
A matrix is a collection of numbers ordered in a rectangular grid like:
The above matrix has 4 rows and 3 columns and as such is a 4 x 3 matrix. A vector is a matrix with 1 row (a row vector) or 1 column (a column vector). Normal numbers are called scalars to contrast with matrices.
It is also common to use capital letters for matrices and lowercase letters for scalars.
We can do basic calculation with matrices but there are some conditions.
Addition
Matrices can be added if they have the same dimensions. So a 2x2 matrix can be added to a 2x2 matrix but not to a 3x5 matrix.
You see that by addition each number at each cell is added to the number on the same position in the other matrix.
Matrix multiplication
Matrices can be multiplied, but this is a bit more complex. In order to multiply matrix A with matrix B, you need to multiply the numbers in each row if matrix A with each column in matrix B. This means that if you multiply an a x b matrix with a c x d matrix, b and c must be equal and the resulting matrix is a x d:
As you can see, with matrixes, A x B differs from B x A.
Matrix scalar multiplication
You can multiply a matrix with a scalar. In that case, each cell is multiplied with that number:
Inverting a matrix Matrix division is not possible, but you can create an inversion of a matrix such that A x A-inv is a matrix with all zero's except for that main diagonal:
Inverting a matrix can only be done with square matrices and it is a complex job that does not neccesary have a result.
Start with matrix A:
We add 3 extra columns and fill them with the unit matrix:
Now we start with the first column. We need to subtract the first row from each other row such that the first column contains only zeroes except for the first row. In order to do that we subtract the first row once from the second and twice from the third:
Now we repeat this with the second column (twice from the first row and once from the third)
For the third column, we have a slight problem. The pivot number is -6 and not 1. But we can solve this by multiplying the entire row with -1/6:
And now we can subtract the third row from the first and the second:
Ok now we have the inverse of A:
We can write this as:
Hope this helps a bit.