Importance sampling is an approximation method. It uses a mathematical transformation and is able to formulate the problem in a different way.

## Importance Sampling Definition

Importance sampling is an approximation method that uses a mathematical transformation of the Monte Carlo sampling method to take the average of all samples to estimate an expectation.

In this post, we are going to:

1. Learn the basics of importance sampling
2. Get a deeper understanding by implementing the process.
3. Compare results from different sampling distributions.

## What Is Importance Sampling?

Importance sampling is an approximation method. Let’s say you are trying to calculate an expectation of the function f(x), where x ~ p(x), subjected to some distribution. We have the following estimation of E(f(x)):

The Monte Carlo sampling method involves simply sampling x from the distribution p(x) and taking the average of all samples to get an estimation of the expectation. But here comes the problem: What if p(x) is very hard to sample from? Are we able to estimate the expectation based on a known and easily sampled distribution?

The answer is yes. And it comes from a simple transformation of the formula:

Where x is sampled from distribution q(x), and q(x) should not be 0. This way you’re able to estimate the expectation by sampling from another distribution q(x). And p(x)/q(x) is called as the sampling ratio or sampling weight, which acts as a correction weight to offset the probability of sampling from a different distribution.

Another thing we need to talk about the variance of estimation:

Where in this case, X is f(x)p(x)/q(x). If p(x)/q(x) is large, this will result in large variance, which we hope to avoid. On the other hand, it is also possible to select a proper q(x) that results in even smaller variance. Let’s get into an example.

More on Data ScienceLikelihood vs. Probability: What’s the Difference?

## Importance Sampling Example

First, let’s define function f(x) and sample distribution:

def f_x(x):
return 1/(1 + np.exp(-x))

The curve of f(x) looks like:

Now, let’s define the distribution of p(x) and q(x):

def distribution(mu=0, sigma=1):
# return probability given a value
distribution = stats.norm(mu, sigma)
return distribution

# pre-setting
n = 1000

mu_target = 3.5
sigma_target = 1
mu_appro = 3
sigma_appro = 1

p_x = distribution(mu_target, sigma_target)
q_x = distribution(mu_appro, sigma_appro)

For simplicity, both p(x) and q(x) are normal distributions, you can try to define some p(x) that is very hard to sample from. For our first demonstration, let’s set two distributions close to each other with similar mean (3 and 3.5) and same sigma 1:

Now, we are able to compute the true value sampled from distribution p(x):

s = 0
for i in range(n):
# draw a sample
x_i = np.random.normal(mu_target, sigma_target)
s += f_x(x_i)
print("simulate value", s/n)

Where we get an estimation of 0.954. Now, let’s sample from q(x) and see how it performs:

value_list = []
for i in range(n):
# sample from different distribution
x_i = np.random.normal(mu_appro, sigma_appro)
value = f_x(x_i)*(p_x.pdf(x_i) / q_x.pdf(x_i))

value_list.append(value)

Notice that here x_i is sampled from approximate distribution q(x), and we get an estimation of 0.949 and variance 0.304. See that we are able to get the estimate by sampling from a different distribution.

More on Data ScienceFeature Engineering Explained

### Comparison

The distribution q(x) might be too similar to p(x) that you probably doubt the ability of importance sampling. So, let’s try another distribution:

# pre-setting
n = 5000

mu_target = 3.5
sigma_target = 1
mu_appro = 1
sigma_appro = 1

p_x = distribution(mu_target, sigma_target)
q_x = distribution(mu_appro, sigma_appro)

With histogram:

Here, we set n to 5000. When distribution is dissimilar, we need more samples to approximate the value. This time we get an estimation of value 0.995, but variance 83.36.

The reason comes from p(x)/q(x). Two distributions that are too different from each other could result in a huge difference of this value, thus increasing the variance. The rule of thumb is to define q(x) where p(x)|f(x)| is large.

Expert Contributors

Built In’s expert contributor network publishes thoughtful, solutions-oriented stories written by innovative tech professionals. It is the tech industry’s definitive destination for sharing compelling, first-person accounts of problem-solving on the road to innovation.