Using np.random.seed(number) has long been considered a best practice for creating reproducible work with NumPy. Setting the random seed means that others who run your code can reproduce your results. But now when you look at the docs for np.random.seed, the description reads:
This is a convenient, legacy function.
The best practice is to not reseed a BitGenerator, but rather to recreate a new one. This method is here for legacy reasons only.
So what’s changed? I’ll explain the old method and the issues with it. Then I’ll demonstrate the new best practice and its benefits.
Stop Using NumPy’s Global Random Seed — Here’s Why
Legacy Best Practice
If you look up tutorials using np.random, you see many of them using np.random.seed to set the seed for reproducible work. We can see how this works:
>>> import numpy as np
>>> np.random.rand(4)
array([0.96176779, 0.7088082 , 0.06416725, 0.82679036])
>>> np.random.rand(4)
array([0.15051909, 0.77788803, 0.67073372, 0.32134285])
As you can see, two calls to the function lead to two completely different answers. If you want somebody to be able to reproduce your projects, you can set the seed with the following code snippet:
>>> np.random.seed(2021)
>>> np.random.rand(4)
array([0.60597828, 0.73336936, 0.13894716, 0.31267308])
>>> np.random.seed(2021)
>>> np.random.rand(4)
array([0.60597828, 0.73336936, 0.13894716, 0.31267308])
You see the results are the same. If you need to prove this to yourself, you can enter the above code on your Python setup.
Setting the seed does more than fix the next random call: it fixes the whole sequence of random numbers, so any code that draws random numbers through NumPy's global state will produce the same sequence after the seed is set. For example, look at the following:
>>> np.random.seed(2021)
>>> np.random.rand(4)
array([0.60597828, 0.73336936, 0.13894716, 0.31267308])
>>> np.random.rand(4)
array([0.99724328, 0.12816238, 0.17899311, 0.75292543])
>>> np.random.rand(4)
array([0.66216051, 0.78431013, 0.0968944 , 0.05857129])
>>> np.random.rand(4)
array([0.96239599, 0.61655744, 0.08662996, 0.56127236])
>>> np.random.seed(2021)
>>> np.random.rand(4)
array([0.60597828, 0.73336936, 0.13894716, 0.31267308])
>>> np.random.rand(4)
array([0.99724328, 0.12816238, 0.17899311, 0.75292543])
>>> np.random.rand(4)
array([0.66216051, 0.78431013, 0.0968944 , 0.05857129])
>>> np.random.rand(4)
array([0.96239599, 0.61655744, 0.08662996, 0.56127236])
The Problem With NumPy’s Global Random Seed
You may be looking at the above example and thinking, “so what’s the problem?” You can create reproducible calls, which means that all random numbers generated after setting the seed will be the same on any machine. For the most part, this is true; and for many projects, you may not need to worry about this.
The problem comes in larger projects or projects with imports that could also set the seed. Using np.random.seed(number) sets what NumPy calls the global random seed, which affects every call to the np.random.* functions. Some imported packages or other scripts could reset the global random seed to another value with np.random.seed(another_number), which may lead to undesirable changes to your output and your results becoming unreproducible. For the most part, you will only need to ensure you use the same random numbers for specific parts of your code (like tests or functions).
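To make the problem concrete, here is a minimal sketch. The function third_party_helper is hypothetical: it stands in for any imported package or script that happens to call np.random.seed internally.

```python
import numpy as np


def third_party_helper():
    """Hypothetical stand-in for imported code that reseeds the global state."""
    np.random.seed(0)  # silently overwrites whatever seed the caller set


# What we expect: set the seed, then draw.
np.random.seed(2021)
expected = np.random.rand(4)

# What happens if a helper reseeds between our seed and our draw.
np.random.seed(2021)
third_party_helper()
actual = np.random.rand(4)

# The draws differ even though we set the same seed both times.
print(np.array_equal(expected, actual))  # False
```

Nothing in the calling code changed between the two runs, which is what makes this kind of bug hard to track down.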
The Solution and New Method
This is one of the reasons NumPy now advises users to create a random number generator for each specific task (and to pass it around when parts of your code need to be reproducible).
“The preferred best practice for getting reproducible pseudorandom numbers is to instantiate a generator object with a seed and pass it around.” — Robert Kern, NEP19
Using this new best practice looks like this:
>>> import numpy as np
>>> rng = np.random.default_rng(2021)
>>> rng.random(4)
array([0.75694783, 0.94138187, 0.59246304, 0.31884171])
As you can see, these numbers are different from the earlier example because default_rng uses a different pseudo-random number generator (PCG64) from the one behind the legacy functions (the Mersenne Twister, MT19937). However, you can replicate the old results by using RandomState, which is the generator class behind the legacy methods:
>>> rng = np.random.RandomState(2021)
>>> rng.rand(4)
array([0.60597828, 0.73336936, 0.13894716, 0.31267308])
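If you would rather check the equivalence than compare printed digits by eye, a short sketch like the following does it. The variable names are illustrative only.

```python
import numpy as np

# Legacy approach: seed the global state, then draw from it.
np.random.seed(2021)
from_global = np.random.rand(4)

# Same algorithm and seed, but held in a local object instead of global state.
legacy_rng = np.random.RandomState(2021)
from_instance = legacy_rng.rand(4)

print(np.array_equal(from_global, from_instance))  # True
```

The local RandomState gives you the old numbers without touching the global seed, which can be handy when migrating existing code one piece at a time.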
The Benefits
You can pass random number generators around between functions and classes, meaning each individual or function could have its own random state without resetting the global seed. In addition, each script could pass a random number generator to functions that need to be reproducible. The benefit is you know exactly what random number generator is used in each part of your project.
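import numpy as np
def f(x, rng):
    return rng.random(1)  # draws from the generator it is given; x is a placeholder input

# Initialise a random number generator
rng = np.random.default_rng(2021)
# Pass the rng to the functions that should use it
x = 0  # placeholder value so the example runs
random_number = f(x, rng)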
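A common variation is to let a function accept either a seed or an existing generator. The sketch below relies on np.random.default_rng returning a Generator unchanged when it is handed one; the function noisy_mean and its arguments are hypothetical, made up for illustration.

```python
import numpy as np


def noisy_mean(values, rng=None):
    """Hypothetical example: the mean of `values` plus one draw of Gaussian noise."""
    # default_rng builds a new Generator from None or an int seed,
    # and returns an existing Generator unchanged.
    rng = np.random.default_rng(rng)
    return float(np.mean(values) + rng.normal())


values = [1.0, 2.0, 3.0]

# Two generators built from the same seed give the same result.
a = noisy_mean(values, rng=np.random.default_rng(2021))
b = noisy_mean(values, rng=np.random.default_rng(2021))
print(a == b)  # True
```

A test can then pin the randomness by passing a seeded generator, while ordinary callers can leave rng unset and get fresh entropy.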
Other benefits arise with parallel processing, as Albert Thomas shows us.
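One way to get independent generators for parallel workers is NumPy's SeedSequence, which can spawn child seeds from a single root seed. The sketch below is an assumption about how you might set this up, not necessarily the approach from the reference above; make_worker_rngs is a hypothetical helper name.

```python
import numpy as np


def make_worker_rngs(seed, n_workers):
    """Hypothetical helper: one independent Generator per worker from a single seed."""
    root = np.random.SeedSequence(seed)
    # spawn() derives child seed sequences intended to give non-overlapping streams
    return [np.random.default_rng(child) for child in root.spawn(n_workers)]


# Each worker draws from its own generator.
first_run = [rng.random(2) for rng in make_worker_rngs(2021, 3)]

# Rebuilding from the same root seed reproduces every worker's stream.
second_run = [rng.random(2) for rng in make_worker_rngs(2021, 3)]

print(all(np.array_equal(x, y) for x, y in zip(first_run, second_run)))  # True
```

Because each worker owns its generator, the results do not depend on the order in which the workers happen to run.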
Using independent random number generators can help improve the reproducibility of your results, because you no longer rely on the global random state (which can be reset or consumed by other code without your knowledge). Passing around a random number generator means you can keep track of when and how it was used and ensure your results are the same.