I remember vividly Jon Bentley's first Algorithms lecture at CMU, where he asked all of us incoming Ph.D. students to write a binary search, and then dissected one of our implementations in front of the class. Of course it was broken, as were most of our implementations. This made a real impression on me, as did the treatment of this material in his wonderful Programming Pearls (Addison-Wesley, 1986; Second Edition, 2000). The key lesson was to carefully consider the invariants in your programs.
Fast forward to 2006. I was shocked to learn that the binary search program that Bentley proved correct and subsequently tested in Chapter 5 of Programming Pearls contains a bug. Once I tell you what it is, you will understand why it escaped detection for two decades. Lest you think I'm picking on Bentley, let me tell you how I discovered the bug: The version of binary search that I wrote for the JDK contained the same bug. It was reported to Sun recently when it broke someone's program, after lying in wait for nine years or so.
So what's the bug? Here's a standard binary search, in Java. (It's one that I wrote for the java.util.Arrays):
1: public static int binarySearch(int[] a, int key) {
2: int low = 0;
3: int high = a.length - 1;
4:
5: while (low <= high) {
6: int mid = (low + high) / 2;
7: int midVal = a[mid];
8:
9: if (midVal < key)
10: low = mid + 1
11: else if (midVal > key)
12: high = mid - 1;
13: else
14: return mid; // key found
15: }
16: return -(low + 1); // key not found.
17: }
The bug is in this line:
6: int mid =(low + high) / 2;
In Programming Pearls Bentley says that the analogous line "sets m to the average of l and u, truncated down to the nearest integer." On the face of it, this assertion might appear correct, but it fails for large values of the
int variables low and high. Specifically, it fails if the sum of low and high is greater than the maximum positive int value (231 - 1). The sum overflows to a negative value, and the value stays negative when divided by two. In C this causes an array index out of bounds with unpredictable results. In Java, it throws ArrayIndexOutOfBoundsException.This bug can manifest itself for arrays whose length (in elements) is 230 or greater (roughly a billion elements). This was inconceivable back in the '80s, when Programming Pearls was written, but it is common these days at Google and other places. In Programming Pearls, Bentley says "While the first binary search was published in 1946, the first binary search that works correctly for all values of n did not appear until 1962." The truth is, very few correct versions have ever been published, at least in mainstream programming languages.
So what's the best way to fix the bug? Here's one way:
6: int mid = low + ((high - low) / 2);
Probably faster, and arguably as clear is:
6: int mid = (low + high) >>> 1;
In C and C++ (where you don't have the
>>> operator), you can do this:6: mid = ((unsigned int)low + (unsigned int)high)) >> 1;
And now we know the binary search is bug-free, right? Well, we strongly suspect so, but we don't know. It is not sufficient merely to prove a program correct; you have to test it too. Moreover, to be really certain that a program is correct, you have to test it for all possible input values, but this is seldom feasible. With concurrent programs, it's even worse: You have to test for all internal states, which is, for all practical purposes, impossible.
The binary-search bug applies equally to mergesort, and to other divide-and-conquer algorithms. If you have any code that implements one of these algorithms, fix it now before it blows up. The general lesson that I take away from this bug is humility: It is hard to write even the smallest piece of code correctly, and our whole world runs on big, complex pieces of code.
We programmers need all the help we can get, and we should never assume otherwise. Careful design is great. Testing is great. Formal methods are great. Code reviews are great. Static analysis is great. But none of these things alone are sufficient to eliminate bugs: They will always be with us. A bug can exist for half a century despite our best efforts to exterminate it. We must program carefully, defensively, and remain ever vigilant.
Update 17 Feb 2008: Thanks to Antoine Trux, Principal Member of Engineering Staff at Nokia Research Center Finland for pointing out that the original proposed fix for C and C++ (Line 6), was not guaranteed to work by the relevant C99 standard (INTERNATIONAL STANDARD - ISO/IEC - 9899 - Second edition - 1999-12-01, 3.4.3.3), which says that if you add two signed quantities and get an overflow, the result is undefined. The older C Standard, C89/90, and the C++ Standard are both identical to C99 in this respect. Now that we've made this change, we know that the program is correct;)
Resources
- Programming Pearls - Highly recommended. Get a copy today!
- The Sun bug report describing this bug in the JDK
- A 2003 paper by Salvatore Ruggieri discussing a related problem - The problem is a bit more general but perhaps less interesting: the average of two numbers of arbitrary sign. The paper does not discuss performance, and its solution is not fast enough for use in the inner loop of a mergesort.
38 comments:
Hi,
The moment, I saw the heading, I thought something wrong with my mind... It cant be wrong for that algorithm, which is "very" popular one... Curiosity was building up... After I saw your point about (low + high) / 2... Then AHA! It is right! Good point for finding that bug! I like your opinion about bug thingy!
I had seen another one in the JDK a while back.. http://purevirtuals.com/blog/2006/06/16/son-of-a-bug/
-Jeu
Everybody knows the Puzzle "Last Laugh" from Java Puzzler.
I have one more puzzle, you can name it as "Last Cry"
Please try below code in JDK.
public class TestMock{
public static String text = new String();
public static void main(String args[]){
System.out.print(text.getClass() + " Class Seperator " + new TestMock().test());
}
public String test(){
System.out.print(getClass());
return "";
}
}
Wow that is really good to know for sure!
JT
www.FireMe.To/udi
I maybe crazy but I immediately picked that line of code in error as well. I picked it for a different reason though. int mid = (low + high) / 2; Low is statically set to 0 if the array has no elements it would also be zero. This would cause a division by zero error. I have never personally used this algorithm before so maybe I am applying it incorrect.
whats puzzling about the TestMock code ? seems to be working just fine as EXPECTED
lol, i love how robertus is just echoing things from the article and doesn't have any clue how a binary search works.
I can see how the first fix ( int mid = low + ((high - low) / 2); ) works.
However, the second and third fix seems to be suffering from the same
problem as the original code. Or does the language spec say that the
expression ( (low + high) >>> 1 ) should be compiled to the equivalent of ( low >>> 1 + high >>> 1 )? If not, you'll have overflow again before the shift operation...
Another question I: Why is your fix not ( low / 2 + high / 2 )?
Could anybody explain how the binary shift solution works in a little more detail? Thanks.
NO NO NO. The first fix looks reasonable. But I contest the other two. Right shift is not a replacement for dividing by two. Does C99 guarantee that right shifting by one is equivalent to dividing by two? This kind of behavior encourages programmers to try to optimize what the compiler should be optimizing. It isn't the case here, but if that int were to somehow need to be a double one day, what then? Or if the int became a long long and the code was cross compiled from a big endian 32-bit machine to a little endian 16-bit machine? How confident would you be that shifting is still the same as dividing?
In short, divide when you mean to divide and shift when you mean to shift. Let the compiler do the optimizing for you.
I'm not very knowledgable about algorithms, but I don't think this means that the algorithm is broken, it's just a manifestation of the limits of the language in which it's written. I think a better title for this would be "n-bit systems (where n is less than infinity) only allow for numbers that are a certain length and nobody noticed the problem until now".
In my opinion, this is not a bug in "binary search" algorithm. It is rather an implementation issue with a trivial "integer overflow" scenario.
-babil.
I am not a programmer, though I'm trying to learn. I am a professional TROUBLESHOOTER, though, and want to give praise to you for publishing that no matter how good we all are (and I'm VERY good), we all need help and humilty.
I have learned to alert on the words NEVER, ALWAYS and SHOULD. My home-made wise saying is :
"NEVER and ALWAYS are so rarely true in the real world, that they SHOULD NEVER be counted on to be accurate; and ALWAYS remember that SHOULD means 'gosh I hope so'."
Umm... wouldn't it be sufficient to make 'mid' a long (or long long depending on platform) instead of an int? This seems like a pretty common mis-typing situation to me.
Any programmer worth his salt should know how computers multiply and divide integers and the boundary values for the inputs. If you wrongly assume these boundary conditions will never be violated, shame on you.
Also, who the hell does a binary search on an array of billions of items?
After thinking about my previous comment a little, I found it has an error:
Due to integer truncation ( low + high ) / 2 is not the same as ( low / 2 + high / 2 ).
For example ( 3 + 3 ) / 2 = 3, while ( 3 / 2 + 3 / 2 ) = 1 + 1 = 2.
This also means that my argument and question about ( low + high ) >>> 1 being compiled into ( low >>> 1 + high >>> 1 ) is based on wrong premises.
However, the overflow problem still remains with your solutions...
How come ( low + high ) >>> 1 does not result in overflow, just like in the original implementation?
This is not a bug in the algorithm but in the implementation. If you are working with data sizes that large using 64 bit data types would solve this problem. (i.e. unsigned long long in C/C++ for instance)
This is a situation where calling an external function (avg() or whatever Java calls it) is better than re-implementing a function that looks trivial but has some edge cases.
@David/Ranger: changing the sum to a larger data type is a hack. What if, in 20 more years, the integer array needs to be a 64-bit integer and is actually mostly filled? Do you then change the sum to a 128-bit integer? No, the problem here is that the algorithm is wrong--it does not stay within the bounds of whatever data type the array index consists of. You can't always rely on your platform to have a larger data type handy, nor is all that casting back and forth necessary.
@Christian: I think you've got the right solution. And to correct for the "rounding error", I think something like this will work:
int mid = (low/2) + (high/2) + (((low%2)+(high%2)) / 2);
(I'm not sure my code is right, since I'm not a Java programmer. But you get the idea.)
@Christian:
It does result in an overflow, but only by one bit max, so you get a negative integer (since they are signed), then you shift it 1 to get it back into positive territory.
e.g. (assume maxint is 4 bits)
7 + 7 = -6 or -1, can't remember
0111 + 0111 = 1110
then shift right 1
0111 = 7
@christian:
">>>" is the unsigned right shift operator. So if I'm not mistaken (low + high) is stored in unsigned int and then shifted one to the right dropping remainder leaving a division by 2 with no remainder returned as a signed int. So it does work.
One correction I have has to deal with the starting value of the binary search. In its current form it doesn't technically start in the middle of an even number it starts one to the left of even. Here is my correction that results in the mid starting at the middle of both even and odd numbers and then moving from there. It only results in 1 more operation if the value is not in the array otherwise it is the same.
1: public static int binarySearch(int[] a, int key)
{
int low = 0;
int high = a.Length - 1;
int mid = a.length >>> 1;
while (low <= high)
{
int midVal = a[mid];
if (midVal < key)
low = mid + 1
else if (midVal > key)
high = mid - 1;
else
return mid; // key found
mid = (low + high) >>> 1;
}
return -(low + 1); // key not found.
}
@Danno - that one's easy...Google. While not exactly a binary search, a good many (I don't want to say most because I don't know that for sure...anyone actually KNOW this?)efficient search algorithms that search vast quantities of data are loosely based on binary searching (and indexing). I do remember learning that google pretty much proved that this method would scale.
There's a fairly ancient solution to this problem that works for both nonnegative ints and unsigneds, so it's useful for C/C++ programmers as well. (The current solution is fine for Java.) The idea is that
a+b = (a&b) + (a|b) = (a&b) + ((a^b) + (a&b)) = 2*(a&b) + (a^b),
so (a+b)/2 is the same as
(a&b) + (a^b)/2
or
(a&b) + ((a^b) >>> 1).
For example, assuming that low and high are both unsigned, in C/C++ one could write
unsigned mid = (low & high) + ((low ^ high) >> 1);
One source for this trick is here, but I'm sure there are others.
Just wow!
_____________________________________
http://classictutorials.blogspot.com
Eric:
That webpage with the tricks is interesting yet not very useful. As for the averaging trick, I count four ops. Meanwhile, (a+(b-a)/2) is just 3 ops. If you re-write the code to use unsigned int instead of int (which should have been done in the first place!) then the compiler will convert the division to a right-shift. Don't worry that b might be smaller than a, it's still 2's compliment subtraction and the compiler/interpreter will get it right.
Again, rarely do you need to sacrifice readability to make your code run better/faster. Let the compiler do the optimizing.
@Stephen: That's only correct assuming that the compiler will build a binary shift right and not a logical shift right. For signed ints, the compiler will likely use a signed shift. The variables should have been unsigned in the first place.
@David: long long is guaranteed to be wider than int, but long isn't. But one day a programmer will change "int" to "long long" at the top of the function without looking at the rest of it and things will break. Richard is right.
@Richard,
Actually, no, it isn't. After all, the values being summed are still ints and subject to the same limits. If you expect your data to go beyond the limits of the data type you've selected, you've obviously selected the wrong data type.
In response to a previous comment: the formula can end up dividing zero by two, but that's dividing by two, not zero.
what about
mid = ((low>>1) + (high>>1)) + low&high&1
since we're trunking, divide by 2 first each term and then find out if both numbers are odd... in that case low&high&1 is 1.
If a and b are both signed ints that have nonnegative values, then either
(a + b) >>> 1
or
a + (b-a)/2
will correctly compute the average. In the first case, a+b overflows into 32 bits and the unsigned right shift brings it back to 31 bits. In the second case, b-a cannot overflow.
I prefer the first of the two formulas because the >>> is a warning that something tricky is going on. The second formula is algebraically equal to (a+b)/2, and it's possible that someone might "optimize" it back to that, not realizing that the order of operations is crucial. (For that matter, a highly-optimizing compiler might do the same thing.)
If a and b are signed ints that may be negative, then neither formula will work in general. The first formula will always produce a positive result (since it shifts a zero into the sign bit); in the second formula, b-a may overflow if a and b have different signs.
If a and b are both unsigned ints, then again neither solution will work: in the first formula, a+b may overflow to 33 bits, and in the second, b-a will "wrap" to a large positive integer if b is less than a. For example, if a=1 and b=0 (as 32-bit unsigneds), then
a + (b-a)/2 = 1 + (0-1)/2 = 1 + 4294967295/2 = 1 + 2147483647 = 2147483648
(Try it yourself.)
The simplest formula that will work correctly to average two signed ints OR two unsigned ints, in all cases, is the "tricky" one I cited above, namely
(a&b) + (a^b)/2
This formula cannot overflow.
Trivially...If you are on a system/language where the heap permitted arrays larger then 2^31-1 elements( maybe google has some custom big 64 bit boxes?) this implementation would be broken since your index would overflow.
Of course this example is in java and the array size is limited to MAX_INT...but it might apply to C++ code that needs portable between 32 and 64. Funny enough that was my last project.
Is it just 7 years of no software development, or are there further problems lurking in some solutions? Let me give another case for testing the proposed solutions:
For any of these solutions, it would be nasty if line 6 calculated a mid that was not between low and high. At least in C, if the array is large enough and if high and low are changed to be unsigned, consider if their values both have the highest bit set. Then when you unsigned add them you get a value less than either. Then dividing makes it even less.
P.S. I also had some courses from Jon Bentely. In the 1980's. He flew out to Columbus Ohio weekly to teach some of us then Bell Labs developers. Definitely some of the best courses I ever had.
Things like that happen, and rightly so, if you ignore a discipline of programming, where you learn to develop a program and its proof hand in hand. For an example, have a look at EWD 1293.
Why not just put in an overflow check before doing that.
06. try {
07. mid = (iLow + iHigh)/2;
08. } catch (Exception e) {
09. // stop binary sorting!
10. }
Better yet. This check should be made while populating the array. Perhaps it could be an idea to use multiple arrays of a standard size, something like a two dimensional array. In which case, the algorithm would need a difference implementation.
Maybe I'm missing something. But the algorithm is not flawed. It is not robust. It has passed testing for extremely large numbers.
In the case of int iArray[];. We should only use int IFF we are certain that
iArray[].length < Maximum_Value_Of_Int.
The algorithm breaks because the hardware/software/compiler definitions of types are either not understood or catered for the while writing the code.
My two bits.
I question how this bug is possible in practice.
On a 32-bit system (where ints are 4 bytes and you have just under 4GB of addressable ram available), the largest int[] you can construct is just under 2^30 in length. And in that case, "low + high" will be under 2^31, which doesn't overflow.
If you have more than 4GB of ram, then you're on a 64-bit system, and your 'int' (or maybe it should be a size_t) is 8 bytes so you don't have the overflow.
Yes, you could have a byte[] of size 2^31 (2GB worth of memory), but this is silly - there are only 256 possible byte values. Your sorted array would look like:
{ 0,0,0,0...1,1,1,1... }
with billions of repeated values.
In fact, the Sun bug report uses a byte[] as the example reproducible case:
class Search {
public static void main (String[] args) {
byte[] b = new byte[1100000000];
b[b.length - 1] = 1;
int index = java.util.Arrays.binarySearch(b, (byte)1);
System.out.println("index = " + index);
}
}
How about quoting "The most beautiful code I never wrote" Jon Bentley - chapter three of the book "Beautiful code" Andrew Oram, Greg Wilson ...
O'Reilly, 2007
ISBN 0596510047, 9780596510046
That relates to the humility lesson.
Or you just declare all vars as uint instead of int.
Nice work dude
sorde
quote: "in the second, b-a will "wrap" to a large positive integer if b is less than a."
That would be rather strange, considering that b is the upper bound and a is the lower bound, in other words, b>a always.
Post a Comment