tl;dr – Don’t pair the modulus function with a random number generator. You might introduce a bias in the output.
I decided to implement a random move computer AI as I was working on tic-tac-toe in Swift this week. Not only could I use the AI within my suite of acceptance tests, but I could also watch a random move AI play against a first available move AI to see who was dominant.
Of course, if you are going to implement a random move AI, it helps to be able to generate random numbers. A quick Google search yielded that Swift could call two methods from the standard library to do just that:
arc4random_uniform(n). The first of these functions generates a random number between 0 and RAND_MAX (a variable set in cstdlib). The second function generates a random number between 0 and n-1 where n is some value provided by the user.
I wondered they would create two functions for random numbers. Maybe the second function was a convenience function. After all, couldn’t you just use the modulus function to generate random numbers between zero and some maximum value? A closer inspection of the documentation yielded this quote:
arc4random_uniform() will return a uniformly distributed random number less than upper_bound.
arc4random_uniform() is recommended over constructions like “arc4random() % upper_bound” as it avoids “modulo bias” when the upper bound is not a power of two.
Apparently, there is a problem using the modulus function on a random number generator. Let’s explore why.
What Exactly Is Modulo Bias?
It’s easiest to explain what modulo bias is with an example. I want to generate a random number between 0 and 8 (i.e. the nine indices of the nine cells on my tic-tac-toe board). Just to keep the numbers simple, let’s pretend that
arc4rand is a perfect random number generator that will return a random number between 0 and RAND_MAX = 19; each number in the range (0 to 19) has an equal probability of being picked:
P(number) = 1/20 = 5%
Since I want to generate a random number between 0 and 8, I will simply take the result of
arc4rand modulo 9.
Let’s go through the results:
arc4rand returns a 0, 9, or 18, I will receive a 0 as my final output.
arc4rand returns a 1, 10, or 19, I will receive a 1 as my final output.
arc4rand returns 2 or 11, I will receive a 2 as my final output.
Wait a minute…
The probability of getting a zero is P(0) = 3/20 = 15%. The probability of getting a two is P(2) = 2/20 = 10%. That is a really big difference between those probabilities! In the graph below, I have calculated the difference between the maximum and minimum probabilities that are returned using the modulus method of random number generation. The source code for the Matlab script can be found here.
The difference between the maximum and minimum probabilities due to modulo bias is inversely proportional to RAND_MAX. Even when RAND_MAX is 100, there is a noticeable difference between the maximum and minimum probabilities. Furthermore, we see that when RAND_MAX is a multiple of eight, the error drops down to zero. The periodicity of the drops depends on the number with which you are taking the modulus (e.g. in our case we are using nine cells, so n = 9 and n – 1 = 9 – 1 = 8).
Is It Really That Bad?
“Wait a minute,” I hear you say. “Using a RAND_MAX of 100 is ridiculous. No modern computer uses a number so small.”
I cannot deny that. The function
arc4tan uses a RAND_MAX value of 2^32 – 1 = 4,294,967,295. That means that the difference between the probabilities is on the order of 10^-10 (a really tiny number). However, things change drastically when you need to compute a lot of random numbers.
Suppose my tic-tac-toe game explodes in popularity, and I decide to calculate random moves as a service. Of course, since people love my tic-tac-toe game so much, the traffic to my servers reaches the same levels as Google searches.
This site reports that Google hosts 100 billion searches per month or 1.2 trillion per year. Let’s use these numbers to make some rough estimates.
In the ideal case, each cell on a tic-tac-toe board would have a probability of P(cell) = 1/9 of being selected. That means that over the course of a year, the expected number of times a cell is selected is:
(1.2 trillion requests)(1/9) ≈ 133,333,333,333 times (i.e. 133 billion and change).
However, if I were to use the modulus method of calculating random numbers from
arc4rand, the probability of selecting one of the first four numbers is given by the ceiling function:
ceil(2^32 / 9) / 2^32 = 0.11111111124046
The probability of selecting one of the final five numbers is given by the floor function:
floor(2^32 / 9) / 2^32 = 0.1111111110076308
The first four cells are selected 133,333,334,886 times each while the last five numbers are selected 133,333,332,092 each. That is a difference of over 1000 from the expected value of 133,333,333,333!
Long story short: for my little game of tic-tac-toe, modulo bias really makes no difference. However, if lives hang in the balance, modulo bias might be unacceptable.
Update 09/27/2016: I ran a Pearson’s chi-squared test on the different outcomes and found that there is no statistically significant difference in the distributions (p > 0.05). This makes sense due to the enormous RAND_MAX value. Thus, even in my Google traffic case, modulo bias is a minor problem, at best. I am sure there exist situations where modulo bias might be more of an issue.
Modulo bias can occur when you use a modulus function to calculate a random number from a range of values. When RAND_MAX is small, the bias can be very pronounced and can heavily skew your results. However, when RAND_MAX is very large, the bias tends towards zero. This might be fine for most situations (i.e. a game of tic-tac-toe), but it might become a security risk in systems that depend on perfectly uniform random distributions.
I can think of a few solutions to the problem of modulo bias:
- Implement an algorithm that eliminates the modulo bias from the calculation.
- Use a RAND_MAX such that RAND_MAX % n == n – 1
- Use a really big RAND_MAX and accept an approximate solution.
Until next time!