The First-digit Law (Benford’s Law)


My dad told me about this really cool phenomenon that I almost didn’t believe the other day…

If you go through a newspaper and write down all of the numbers used in the paper, among those numbers approximately 30% of them will start with a 1.

I challenge you to go count for yourself (maybe not the whole paper but part of it!) if you feel so inclined.  I’ve never done it and I’m not sure how much of it you need to count to observe this (since small samples won’t be accurate).  If you do count, let us know what you find!

But what’s so special about 1?  Shouldn’t all nine possible first digits appear with the same frequency, so that 1 appears about 11% of the time (1 in 9)?  That’s what makes this a paradox.

Furthermore, this applies to almost any real-world data set without restrictions on what the numbers can be.  As formulated, this is not a mathematical law because it is not stated precisely–what constitutes a real-world data set?  (The “approximately” isn’t precise either but can be made precise.)  Of course you could construct a data set where the first digits are 1 through 9 equally, but that would be cheating 😛  There is a way to make this mathematically precise, but it’s a bit out of the scope of this blog.

You’re probably still begging to know why this strange phenomenon even has a chance of happening.  Well, here are 2 facts that might convince you.  Firstly, you might be wondering what’s so special about 30%, and the answer is it is very close to \log{2} \approx .30103 (where \log is the log base 10 as explained in the previous post).

1)  If you take a number, say 3 million (this is very loose), and count the numbers less than it by first digit, you’ll find that at least as many start with 1 as with any other digit.  In the case of 3 million, about 37% start with a 1 (1,000,000 numbers between 1 million and 2 million + 111,111 below 1 million), so it is very biased towards 1’s.

2)  In the sequence of powers of 2, and in many other sequences, a fraction \log{2} of the terms, or approximately 30%, begin with 1.

Proof: (Medium-hard)  What does it mean for a number n to start with a 1?  It means that the fractional part of \log{n} (the fractional part of 2.3 is .3; you chop off the integer part) is less than \log 2, because for some integer k,  10^k \le n < 2 \cdot 10^k, which implies that k \le \log{n} < \log(2 \cdot 10^k) = \log{2} + \log{10^k} = \log{2} + k (by basic properties of log).  This is the same as saying that the fractional part is less than \log 2: \log{n} exceeds some integer by less than \log 2.  Furthermore, going back to the original question where n is a power of 2, the fractional part of \log{2^m} = m \log{2} is equally distributed over the interval [0,1) (this works because \log{2} is irrational).  That is hard to define precisely, but just imagine that if you take all multiples of \log{2} and put a dot where each fractional part lands, the space between 0 and 1 is uniformly marked up.  This shows that the probability that the fractional part of \log{n} is less than \log{2}, where n is a power of 2, is \frac{\log{2}}{1} = \log{2}!
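If you'd rather check the two facts above by computer than by hand, here is a minimal Python sketch. The function names are made up for illustration, and the cutoffs (3 million, the first 1000 powers of 2) are just convenient choices, not anything special.

```python
import math

def count_leading_ones_below(limit):
    """Count integers n with 1 <= n < limit whose first digit is 1."""
    count = 0
    power = 1
    while power < limit:
        # At this magnitude, the numbers starting with 1 are power .. 2*power - 1.
        count += min(limit, 2 * power) - power
        power *= 10
    return count

def leading_one_fraction_powers_of_two(m):
    """Fraction of 2**1, 2**2, ..., 2**m whose decimal form starts with 1."""
    hits = sum(1 for k in range(1, m + 1) if str(2 ** k)[0] == "1")
    return hits / m

# Fact 1: how many numbers below 3 million start with a 1?
ones = count_leading_ones_below(3_000_000)
print(ones, ones / 3_000_000)

# Fact 2: compare the powers of 2 against log base 10 of 2.
print(leading_one_fraction_powers_of_two(1000), math.log10(2))
```

The second number printed on each line is the one to compare against the 37% and 30% figures discussed above.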

But WAIT!  Awesomely enough, Benford’s law has a clear real-world application, which is to check if a set of data is authentic or not.  (Unless the data-forger is careful enough to do a logarithmic distribution for the first digits of the numbers!)  According to Wikipedia, this law was used as evidence of fraud in the 2009 Iranian elections, which is pretty cool!   I wonder if they used it in the show NUMB3RS… (which sadly stopped airing 😦 )
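Here is a rough Python sketch of what such an authenticity check might start from: tally the first digits of a data set and put them next to the expected frequencies. Two assumptions to flag loudly: the post only works out the frequency for the digit 1, and the sketch uses the standard generalization that a first digit d should appear with frequency \log(1 + 1/d) (the same interval argument as in the proof, with [\log d, \log(d+1)) in place of [0, \log 2)). Also, the sample data here is just the powers of 2 again, not real-world data, and the function names are hypothetical.

```python
import math
from collections import Counter

def first_digit(n):
    """Leading decimal digit of a nonzero integer (assumption: integer input)."""
    return int(str(abs(int(n)))[0])

def first_digit_frequencies(values):
    """Observed frequency of each first digit 1..9 among nonzero integers."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    total = len(digits)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

def benford_expected():
    """Expected first-digit frequencies: log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Stand-in data: the first 500 powers of 2 (not a real-world data set).
sample = [2 ** k for k in range(1, 501)]
observed = first_digit_frequencies(sample)
expected = benford_expected()

for d in range(1, 10):
    print(d, round(observed[d], 3), round(expected[d], 3))
```

A real fraud check would need a proper statistical test on top of this comparison; the sketch only lines up the two sets of frequencies so you can eyeball them.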

NOTE:  The Wikipedia article on this topic seems to be horribly inaccurate.  (For more advanced readers, the way they define a logarithmic distribution isn’t even possible since there is no such thing as a uniformly distributed set on the number line…)

Thanks to my dad for telling me about this and for presenting it so well that all I had to do was essentially recreate what he said.


6 responses to this post.

  1. Oh, wow! This is so fascinating, I love it. Thanks so much for sharing it so eloquently!


  2. Yes, Benford’s law is really cool.

    And I especially love the story about how Benford discovered it. Benford was a research physicist back in the 1930s before there were computers or electronic calculators. All they had were mechanical adding machines. So they used giant reference books that had pages and pages and pages of log tables, which the scientists used all the time to look up logarithms, so they could turn multiplication and division problems into addition and subtraction problems.

    Benford noticed that there was a mystifying pattern to the dirtiness on the edges of the pages in the log reference table books. The pages corresponding to the logs of numbers beginning with 1 were much dirtier than the pages corresponding to the logs of numbers beginning with later digits. And GE had lots of log books all over the place and they all had that pattern, so Benford started to examine all kinds of different datasets and discovered this remarkable pattern in a remarkably diverse set of data.

    I love Benford’s story. He worked at GE in Schenectady (just ten minutes away from my home–and, coincidentally, where President Obama is coming to visit next Tuesday!) I drive by his old house almost every day on my way to work, and I often wonder if anyone would ever have discovered the law if electronic calculators had been invented a few hundred years earlier. (By the way, Benford was not actually the first one to discover the law. Newcomb had published an article about it in the 19th century, but Newcomb’s article disappeared into obscurity until well after Benford published his article.)

    http://www.dspguide.com/ch34/1.htm

    Here’s a fun way to introduce/demonstrate Benford’s law in front of an audience. Bring in an old telephone book for a large city that you don’t need any more. Start randomly tearing out pages of the book and hand each page to a member of the audience. Ask each person in the audience to tabulate the initial digits of the street address numbers on his page. If it’s a large city with addresses that span several orders of magnitude (e.g., some addresses have two digits, some have three, some have four digits), it’s a very safe bet that Benford’s law will apply. (Credit for this idea goes to Professor Richard Cleary, a mathematician at Bentley College who also teaches statistics at Harvard.)

    On the other hand, if you asked them to do the exact same thing with the phone numbers in the book, Benford’s law won’t apply. (If you think about how phone numbers are generated, you can understand why.)

    Benford’s law has definitely been used to detect financial fraud–e.g., making up numbers on expense reports or tax returns or financial statements. Most fraudsters are apparently not mathematically sophisticated enough to know about this law.


    • Awesome, thanks for sharing all that Mary!
      And yah I can see why telephone numbers won’t work. (SPOILER: because they’re all 7 digits and so the “cut off at 3 million” thing is actually “cut off at 10 million” in which case there’s an equal distribution of first digits)

      Also my friend Lily Chen pointed out that it probably won’t come out to 30% 1’s in the newspaper, or even more 1’s than any other digit, because the majority of numbers in a newspaper are dates, which start with 2. However, the probability of initial 1’s should still be greater than 11%.

      Mary, does the phone book experiment actually work out to 30%? I guess I’ll have to try it for myself 🙂


  3. Posted by Rita on January 8, 2011 at 11:07 pm

    In the inequality, do you mean that k < LOGn < log2 + k? It seems that that's what it should be.

