How do you implement the fastest possible Gaussian blur algorithm?

I am going to implement it in Java, so GPU solutions are ruled out. My application, planetGenesis, is cross platform, so I don't want JNI.

Skip to content
# Java – Fastest Gaussian blur implementation

#### Related Solutions

###### Related Question

filteringgaussianimage-processingjava

How do you implement the fastest possible Gaussian blur algorithm?

I am going to implement it in Java, so GPU solutions are ruled out. My application, planetGenesis, is cross platform, so I don't want JNI.

I figured out a method that works ~35% faster than your 6bits+Carmack+sqrt code, at least with my CPU (x86) and programming language (C/C++). Your results may vary, especially because I don't know how the Java factor will play out.

My approach is threefold:

- First, filter out obvious answers. This includes negative numbers and looking at the last 4 bits. (I found looking at the last six didn't help.) I also answer yes for 0. (In reading the code below, note that my input is
`int64 x`

.)`if( x < 0 || (x&2) || ((x & 7) == 5) || ((x & 11) == 8) ) return false; if( x == 0 ) return true;`

- Next, check if it's a square modulo 255 = 3 * 5 * 17. Because that's a product of three distinct primes, only about 1/8 of the residues mod 255 are squares. However, in my experience, calling the modulo operator (%) costs more than the benefit one gets, so I use bit tricks involving 255 = 2^8-1 to compute the residue. (For better or worse, I am not using the trick of reading individual bytes out of a word, only bitwise-and and shifts.)

To actually check if the residue is a square, I look up the answer in a precomputed table.`int64 y = x; y = (y & 4294967295LL) + (y >> 32); y = (y & 65535) + (y >> 16); y = (y & 255) + ((y >> 8) & 255) + (y >> 16); // At this point, y is between 0 and 511. More code can reduce it farther.`

`if( bad255[y] ) return false; // However, I just use a table of size 512`

- Finally, try to compute the square root using a method similar to Hensel's lemma. (I don't think it's applicable directly, but it works with some modifications.) Before doing that, I divide out all powers of 2 with a binary search:

At this point, for our number to be a square, it must be 1 mod 8.`if((x & 4294967295LL) == 0) x >>= 32; if((x & 65535) == 0) x >>= 16; if((x & 255) == 0) x >>= 8; if((x & 15) == 0) x >>= 4; if((x & 3) == 0) x >>= 2;`

The basic structure of Hensel's lemma is the following. (Note: untested code; if it doesn't work, try t=2 or 8.)`if((x & 7) != 1) return false;`

The idea is that at each iteration, you add one bit onto r, the "current" square root of x; each square root is accurate modulo a larger and larger power of 2, namely t/2. At the end, r and t/2-r will be square roots of x modulo t/2. (Note that if r is a square root of x, then so is -r. This is true even modulo numbers, but beware, modulo some numbers, things can have even more than 2 square roots; notably, this includes powers of 2.) Because our actual square root is less than 2^32, at that point we can actually just check if r or t/2-r are real square roots. In my actual code, I use the following modified loop:`int64 t = 4, r = 1; t <<= 1; r += ((x - r * r) & t) >> 1; t <<= 1; r += ((x - r * r) & t) >> 1; t <<= 1; r += ((x - r * r) & t) >> 1; // Repeat until t is 2^33 or so. Use a loop if you want.`

The speedup here is obtained in three ways: precomputed start value (equivalent to ~10 iterations of the loop), earlier exit of the loop, and skipping some t values. For the last part, I look at`int64 r, t, z; r = start[(x >> 3) & 1023]; do { z = x - r * r; if( z == 0 ) return true; if( z < 0 ) return false; t = z & (-z); r += (z & t) >> 1; if( r > (t >> 1) ) r = t - r; } while( t <= (1LL << 33) );`

`z = r - x * x`

, and set t to be the largest power of 2 dividing z with a bit trick. This allows me to skip t values that wouldn't have affected the value of r anyway. The precomputed start value in my case picks out the "smallest positive" square root modulo 8192.

Even if this code doesn't work faster for you, I hope you enjoy some of the ideas it contains. Complete, tested code follows, including the precomputed tables.

```
typedef signed long long int int64;
int start[1024] =
{1,3,1769,5,1937,1741,7,1451,479,157,9,91,945,659,1817,11,
1983,707,1321,1211,1071,13,1479,405,415,1501,1609,741,15,339,1703,203,
129,1411,873,1669,17,1715,1145,1835,351,1251,887,1573,975,19,1127,395,
1855,1981,425,453,1105,653,327,21,287,93,713,1691,1935,301,551,587,
257,1277,23,763,1903,1075,1799,1877,223,1437,1783,859,1201,621,25,779,
1727,573,471,1979,815,1293,825,363,159,1315,183,27,241,941,601,971,
385,131,919,901,273,435,647,1493,95,29,1417,805,719,1261,1177,1163,
1599,835,1367,315,1361,1933,1977,747,31,1373,1079,1637,1679,1581,1753,1355,
513,1539,1815,1531,1647,205,505,1109,33,1379,521,1627,1457,1901,1767,1547,
1471,1853,1833,1349,559,1523,967,1131,97,35,1975,795,497,1875,1191,1739,
641,1149,1385,133,529,845,1657,725,161,1309,375,37,463,1555,615,1931,
1343,445,937,1083,1617,883,185,1515,225,1443,1225,869,1423,1235,39,1973,
769,259,489,1797,1391,1485,1287,341,289,99,1271,1701,1713,915,537,1781,
1215,963,41,581,303,243,1337,1899,353,1245,329,1563,753,595,1113,1589,
897,1667,407,635,785,1971,135,43,417,1507,1929,731,207,275,1689,1397,
1087,1725,855,1851,1873,397,1607,1813,481,163,567,101,1167,45,1831,1205,
1025,1021,1303,1029,1135,1331,1017,427,545,1181,1033,933,1969,365,1255,1013,
959,317,1751,187,47,1037,455,1429,609,1571,1463,1765,1009,685,679,821,
1153,387,1897,1403,1041,691,1927,811,673,227,137,1499,49,1005,103,629,
831,1091,1449,1477,1967,1677,697,1045,737,1117,1737,667,911,1325,473,437,
1281,1795,1001,261,879,51,775,1195,801,1635,759,165,1871,1645,1049,245,
703,1597,553,955,209,1779,1849,661,865,291,841,997,1265,1965,1625,53,
1409,893,105,1925,1297,589,377,1579,929,1053,1655,1829,305,1811,1895,139,
575,189,343,709,1711,1139,1095,277,993,1699,55,1435,655,1491,1319,331,
1537,515,791,507,623,1229,1529,1963,1057,355,1545,603,1615,1171,743,523,
447,1219,1239,1723,465,499,57,107,1121,989,951,229,1521,851,167,715,
1665,1923,1687,1157,1553,1869,1415,1749,1185,1763,649,1061,561,531,409,907,
319,1469,1961,59,1455,141,1209,491,1249,419,1847,1893,399,211,985,1099,
1793,765,1513,1275,367,1587,263,1365,1313,925,247,1371,1359,109,1561,1291,
191,61,1065,1605,721,781,1735,875,1377,1827,1353,539,1777,429,1959,1483,
1921,643,617,389,1809,947,889,981,1441,483,1143,293,817,749,1383,1675,
63,1347,169,827,1199,1421,583,1259,1505,861,457,1125,143,1069,807,1867,
2047,2045,279,2043,111,307,2041,597,1569,1891,2039,1957,1103,1389,231,2037,
65,1341,727,837,977,2035,569,1643,1633,547,439,1307,2033,1709,345,1845,
1919,637,1175,379,2031,333,903,213,1697,797,1161,475,1073,2029,921,1653,
193,67,1623,1595,943,1395,1721,2027,1761,1955,1335,357,113,1747,1497,1461,
1791,771,2025,1285,145,973,249,171,1825,611,265,1189,847,1427,2023,1269,
321,1475,1577,69,1233,755,1223,1685,1889,733,1865,2021,1807,1107,1447,1077,
1663,1917,1129,1147,1775,1613,1401,555,1953,2019,631,1243,1329,787,871,885,
449,1213,681,1733,687,115,71,1301,2017,675,969,411,369,467,295,693,
1535,509,233,517,401,1843,1543,939,2015,669,1527,421,591,147,281,501,
577,195,215,699,1489,525,1081,917,1951,2013,73,1253,1551,173,857,309,
1407,899,663,1915,1519,1203,391,1323,1887,739,1673,2011,1585,493,1433,117,
705,1603,1111,965,431,1165,1863,533,1823,605,823,1179,625,813,2009,75,
1279,1789,1559,251,657,563,761,1707,1759,1949,777,347,335,1133,1511,267,
833,1085,2007,1467,1745,1805,711,149,1695,803,1719,485,1295,1453,935,459,
1151,381,1641,1413,1263,77,1913,2005,1631,541,119,1317,1841,1773,359,651,
961,323,1193,197,175,1651,441,235,1567,1885,1481,1947,881,2003,217,843,
1023,1027,745,1019,913,717,1031,1621,1503,867,1015,1115,79,1683,793,1035,
1089,1731,297,1861,2001,1011,1593,619,1439,477,585,283,1039,1363,1369,1227,
895,1661,151,645,1007,1357,121,1237,1375,1821,1911,549,1999,1043,1945,1419,
1217,957,599,571,81,371,1351,1003,1311,931,311,1381,1137,723,1575,1611,
767,253,1047,1787,1169,1997,1273,853,1247,413,1289,1883,177,403,999,1803,
1345,451,1495,1093,1839,269,199,1387,1183,1757,1207,1051,783,83,423,1995,
639,1155,1943,123,751,1459,1671,469,1119,995,393,219,1743,237,153,1909,
1473,1859,1705,1339,337,909,953,1771,1055,349,1993,613,1393,557,729,1717,
511,1533,1257,1541,1425,819,519,85,991,1693,503,1445,433,877,1305,1525,
1601,829,809,325,1583,1549,1991,1941,927,1059,1097,1819,527,1197,1881,1333,
383,125,361,891,495,179,633,299,863,285,1399,987,1487,1517,1639,1141,
1729,579,87,1989,593,1907,839,1557,799,1629,201,155,1649,1837,1063,949,
255,1283,535,773,1681,461,1785,683,735,1123,1801,677,689,1939,487,757,
1857,1987,983,443,1327,1267,313,1173,671,221,695,1509,271,1619,89,565,
127,1405,1431,1659,239,1101,1159,1067,607,1565,905,1755,1231,1299,665,373,
1985,701,1879,1221,849,627,1465,789,543,1187,1591,923,1905,979,1241,181};
bool bad255[512] =
{0,0,1,1,0,1,1,1,1,0,1,1,1,1,1,0,0,1,1,0,1,0,1,1,1,0,1,1,1,1,0,1,
1,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,1,1,1,0,1,1,1,1,0,1,1,1,
0,1,0,1,1,0,0,1,1,1,1,1,0,1,1,1,1,0,1,1,0,0,1,1,1,1,1,1,1,1,0,1,
1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,0,1,1,1,0,1,1,1,1,0,0,1,1,1,1,1,1,
1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,0,0,1,1,1,1,1,0,1,1,0,1,1,1,1,1,
1,1,1,1,1,1,0,1,1,0,1,0,1,1,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,0,1,1,
1,1,1,0,0,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,
1,0,1,1,1,0,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,1,1,1,
0,0,1,1,0,1,1,1,1,0,1,1,1,1,1,0,0,1,1,0,1,0,1,1,1,0,1,1,1,1,0,1,
1,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,1,1,1,0,1,1,1,1,0,1,1,1,
0,1,0,1,1,0,0,1,1,1,1,1,0,1,1,1,1,0,1,1,0,0,1,1,1,1,1,1,1,1,0,1,
1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,0,1,1,1,0,1,1,1,1,0,0,1,1,1,1,1,1,
1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,0,0,1,1,1,1,1,0,1,1,0,1,1,1,1,1,
1,1,1,1,1,1,0,1,1,0,1,0,1,1,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,0,1,1,
1,1,1,0,0,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,
1,0,1,1,1,0,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,1,1,1,
0,0};
inline bool square( int64 x ) {
// Quickfail
if( x < 0 || (x&2) || ((x & 7) == 5) || ((x & 11) == 8) )
return false;
if( x == 0 )
return true;
// Check mod 255 = 3 * 5 * 17, for fun
int64 y = x;
y = (y & 4294967295LL) + (y >> 32);
y = (y & 65535) + (y >> 16);
y = (y & 255) + ((y >> 8) & 255) + (y >> 16);
if( bad255[y] )
return false;
// Divide out powers of 4 using binary search
if((x & 4294967295LL) == 0)
x >>= 32;
if((x & 65535) == 0)
x >>= 16;
if((x & 255) == 0)
x >>= 8;
if((x & 15) == 0)
x >>= 4;
if((x & 3) == 0)
x >>= 2;
if((x & 7) != 1)
return false;
// Compute sqrt using something like Hensel's lemma
int64 r, t, z;
r = start[(x >> 3) & 1023];
do {
z = x - r * r;
if( z == 0 )
return true;
if( z < 0 )
return false;
t = z & (-z);
r += (z & t) >> 1;
if( r > (t >> 1) )
r = t - r;
} while( t <= (1LL << 33) );
return false;
}
```

Since Java 5 you can use `Arrays.toString(arr)`

or `Arrays.deepToString(arr)`

for arrays within arrays. Note that the `Object[]`

version calls `.toString()`

on each object in the array. The output is even decorated in the exact way you're asking.

Examples:

### Simple Array:

`String[] array = new String[] {"John", "Mary", "Bob"}; System.out.println(Arrays.toString(array));`

Output:

`[John, Mary, Bob]`

### Nested Array:

`String[][] deepArray = new String[][] {{"John", "Mary"}, {"Alice", "Bob"}}; System.out.println(Arrays.toString(deepArray)); //output: [[Ljava.lang.String;@106d69c, [Ljava.lang.String;@52e922] System.out.println(Arrays.deepToString(deepArray));`

Output:

`[[John, Mary], [Alice, Bob]]`

`double`

Array:`double[] doubleArray = { 7.0, 9.0, 5.0, 1.0, 3.0 }; System.out.println(Arrays.toString(doubleArray));`

Output:

`[7.0, 9.0, 5.0, 1.0, 3.0 ]`

`int`

Array:`int[] intArray = { 7, 9, 5, 1, 3 }; System.out.println(Arrays.toString(intArray));`

Output:

`[7, 9, 5, 1, 3 ]`

- Java – How to create a generic array in Java
- Java – How to break out of nested loops in Java
- Java – How to get the current stack trace in Java
- How to gaussian blur an image without using any in-built gaussian functions
- Python – Peak detection in a 2D array
- Java – What exactly is Apache Camel
- Java – How to install Java 8 on Mac

## Best Solution

You should use the fact that a Gaussian kernel is separable, i. e. you can express a 2D convolution as a combination of two 1D convolutions.

If the filter is large, it may also make sense to use the fact that convolution in the spatial domain is equivalent to multiplication in the frequency (Fourier) domain. This means that you can take the Fourier transform of the image and the filter, multiply the (complex) results, and then take the inverse Fourier transform. The complexity of the FFT (Fast Fourier Transform) is O(n log n), while the complexity of a convolution is O(n^2). Also, if you need to blur many images with the same filter, you would only need to take the FFT of the filter once.

If you decide to go with using an FFT, the FFTW library is a good choice.