Bootstrap sampling data in Python (to prevent overfitting in Machine Learning)
- Yazmin T. Montana

- Aug 4, 2022
- 1 min read
Bootstrapping is a method of inferring results for a population from results found on a collection of smaller random samples of that population, drawn with replacement during the sampling process.
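As a quick sketch of what "with replacement" means (using a hypothetical five-element list, not the data set below), note that the same value can be drawn more than once:
import random
data = [1, 2, 3, 4, 5]  # hypothetical toy list
# random.choices draws with replacement, so a value may appear more than once
resample = random.choices(data, k=5)
print(resample)  # e.g. [3, 1, 3, 5, 2]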
This is an easy example of how to do it, after generating a random, normally distributed data set with a mean of 300 and 1,000 entries.
I started by opening a new Jupyter Notebook and importing my numpy and random modules:
import numpy as np
import random
Next, I generate my random data using NumPy, matching the description from the beginning (a normal distribution with a mean of 300 and 1,000 entries):
x = np.random.normal(loc=300.0, size=1000)
Now I calculate the mean of this data set:
print(np.mean(x))
This print command prints the actual mean of the population. For mine it was 299.98600753216385.
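If you want the same data set (and therefore the same mean) on every run, one possible variation, assuming NumPy 1.17 or later, is to draw from a seeded generator instead:
rng = np.random.default_rng(42)        # 42 is an arbitrary seed
x = rng.normal(loc=300.0, size=1000)   # same shape as the data above
print(np.mean(x))                      # reproducible, but will differ from 299.986...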
Next, I do the bootstrap sampling to estimate the mean. This code creates 50 samples of size 4 each, drawn with replacement:
sample_mean = []
for i in range(50):
    # random.choices draws with replacement, as bootstrapping requires
    y = random.choices(x.tolist(), k=4)
    avg = np.mean(y)
    sample_mean.append(avg)
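A NumPy-only alternative to the loop above (just a sketch that produces the same kind of sample_mean collection) is np.random.choice with replace=True, which also draws with replacement:
# 50 resamples of size 4, drawn with replacement, one per row
boot = np.random.choice(x, size=(50, 4), replace=True)
sample_mean = boot.mean(axis=1)  # mean of each resample
print(sample_mean.shape)         # (50,)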
Next, estimate the mean again, but this time using only the 50 sample means:
print(np.mean(sample_mean))
This print command prints the mean of the 50 sample means.
Mine had a value of 299.9196544960246.
Every time you run this code you can expect slightly different results, because it draws new random samples each time. The estimate will usually land close to the actual mean, and averaging over more resamples brings it closer still.
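To see this effect (a rough sketch, reusing the x array from above), you can compare the estimates built from different numbers of resamples:
for n_resamples in (10, 50, 500):
    means = [np.mean(random.choices(x.tolist(), k=4)) for _ in range(n_resamples)]
    print(n_resamples, np.mean(means))
# The numbers change on every run, but the estimates built from more
# resamples tend to sit closer to the population mean.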
In machine learning, this technique helps prevent overfitting: rather than trusting a single pass over the data, you run a simple sanity check that the mean of your entire data set agrees with the mean estimated from random resamples, which is especially useful for large (think big data) data sets.