## Bootstrapping and traditional hypothesis testing are inferential statistical procedures

## Differences between bootstrapping and traditional hypothesis testing

## How bootstrapping resamples your data to create simulated datasets

## Example of bootstrap samples

## How well does bootstrapping work?

## Example of using bootstrapping to create confidence intervals

### Performing the bootstrap procedure

## Benefits of bootstrapping over traditional statistics

## For which sample statistics can I use bootstrapping?

### The Problem to Solve

### The Available Resources

### How to Bootstrap

### How to Use Bootstrapping

Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples. This process allows you to calculate standard errors, construct confidence intervals, and perform hypothesis testing for numerous types of sample statistics. Bootstrap methods are alternative approaches to traditional hypothesis testing and are notable for being easier to understand and valid for more conditions.

In this blog post, I will explain bootstrapping basics, compare bootstrapping to conventional statistical methods, and explain when it can be the better method. Additionally, I’ll work through an example using real data to create bootstrapped confidence intervals.

Both bootstrapping and traditional methods use samples to draw inferences about populations. To accomplish this goal, these procedures treat the single sample that a study obtains as only one of many random samples that the study could have collected.

From a single sample, you can calculate a variety of sample statistics, such as the mean, median, and standard deviation—but we’ll focus on the mean here.

Now, suppose an analyst repeats their study many times. In this situation, the mean will vary from sample to sample and form a distribution of sample means. Statisticians refer to this type of distribution as a sampling distribution. Sampling distributions are crucial because they place the value of your sample statistic into the broader context of many other possible values.
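To make the idea of a sampling distribution concrete, here is a minimal simulation sketch. It assumes we can see the whole population, which a real study never can; the population, its size, and the sample size of 50 are all made up for illustration.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical skewed population of 100,000 values (an assumption for illustration).
population = [rng.expovariate(1 / 30) for _ in range(100_000)]

# Pretend the analyst repeats the study 1,000 times with 50 subjects each time.
sample_means = [
    statistics.fmean(rng.sample(population, 50)) for _ in range(1000)
]

# The 1,000 means form an approximate sampling distribution of the mean.
print("Population mean:       ", round(statistics.fmean(population), 2))
print("Mean of sample means:  ", round(statistics.fmean(sample_means), 2))
print("Spread of sample means:", round(statistics.stdev(sample_means), 2))
```

The sample means cluster around the population mean and vary far less than the individual values do, which is the context a single sample mean gets placed into.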

While performing a study many times is infeasible, both methods can estimate sampling distributions. Using the larger context that sampling distributions provide, these procedures can construct confidence intervals and perform hypothesis testing.

A primary difference between bootstrapping and traditional statistics is how they estimate sampling distributions. Traditional hypothesis testing procedures require equations that estimate sampling distributions using the properties of the sample data, the experimental design, and a test statistic. To obtain valid results, you’ll need to use the proper test statistic and satisfy the assumptions.

The bootstrap method uses a very different approach to estimate sampling distributions. This method takes the sample data that a study obtains, and then resamples it over and over to create many simulated samples. Each of these simulated samples has its own properties, such as the mean. When you graph the distribution of these means on a histogram, you can observe the sampling distribution of the mean. You don’t need to worry about test statistics, formulas, and assumptions.

The bootstrap procedure uses these sampling distributions as the foundation for confidence intervals and hypothesis testing. Let’s take a look at how this resampling process works.

Bootstrapping resamples the original dataset with replacement many thousands of times to create simulated datasets. This process involves drawing random samples from the original dataset. Here’s how it works:

- The bootstrap method has an equal probability of randomly drawing each original data point for inclusion in the resampled datasets.
- The procedure can select a data point more than once for a resampled dataset. This property is the “with replacement” aspect of the process.
- The procedure creates resampled datasets that are the same size as the original dataset.

The process ends with your simulated datasets having many different combinations of the values that exist in the original dataset. Each simulated dataset has its own set of sample statistics, such as the mean, median, and standard deviation. Bootstrapping procedures use the distribution of the sample statistics across the simulated samples as the sampling distribution.
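The resampling steps above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function name `bootstrap_statistics`, the default of 2,000 resamples, and the data values are my own assumptions, not part of any particular package.

```python
import random
import statistics


def bootstrap_statistics(data, statistic, n_resamples=2000, seed=0):
    """Compute `statistic` on each of `n_resamples` bootstrap samples."""
    rng = random.Random(seed)
    n = len(data)
    results = []
    for _ in range(n_resamples):
        # Draw n values with replacement; every original point is equally likely
        # and can be picked more than once.
        resample = rng.choices(data, k=n)
        results.append(statistic(resample))
    return results


# Hypothetical measurements, for illustration only.
data = [12.1, 15.3, 9.8, 20.4, 14.7, 11.2, 18.9, 13.5]
boot_means = bootstrap_statistics(data, statistics.fmean)
print(len(boot_means), "bootstrap means, e.g.", [round(m, 2) for m in boot_means[:3]])
```

The list `boot_means` plays the role of the sampling distribution: its spread estimates how much the mean would vary from sample to sample.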

Let’s work through an easy case. Suppose a study collects five data points and creates four bootstrap samples, as shown below.

This simple example illustrates the properties of bootstrap samples. The resampled datasets are the same size as the original dataset and only contain values that exist in the original set. Furthermore, these values can appear more or less frequently in the resampled datasets than in the original dataset. Finally, the resampling process is random and could have created a different set of simulated datasets.
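If you want to generate a small case like this yourself, the following sketch draws four bootstrap samples from five data points. The five values are hypothetical and chosen only to make the output easy to read.

```python
import random

rng = random.Random(42)

original = [3, 7, 8, 12, 15]  # hypothetical data points

# Four bootstrap samples, each the same size as the original dataset.
bootstrap_samples = [rng.choices(original, k=len(original)) for _ in range(4)]

for i, sample in enumerate(bootstrap_samples, start=1):
    print(f"Bootstrap sample {i}: {sample}")
```

Changing the seed produces a different set of simulated datasets, but every one of them has five values, all taken from the original set.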

Of course, in a real study, you’d hope to have a larger sample size, and you’d create thousands of resampled datasets. Given the enormous number of resampled datasets, you’ll always use a computer to perform these analyses.

Resampling involves reusing your one dataset many times. It almost seems too good to be true! In fact, the term “bootstrapping” comes from the phrase “pulling yourself up by your own bootstraps,” which describes an impossible feat! However, using the power of computers to randomly resample one dataset to create thousands of simulated datasets produces meaningful results.

The bootstrap method has been around since 1979, and its usage has increased. Various studies over the intervening decades have determined that bootstrap sampling distributions approximate the correct sampling distributions.

To understand how it works, keep in mind that bootstrapping does not create new data. Instead, it treats the original sample as a proxy for the real population and then draws random samples from it. Consequently, the central assumption for bootstrapping is that the original sample accurately represents the actual population.

The resampling process creates many possible samples that a study could have drawn. The various combinations of values in the simulated samples collectively provide an estimate of the variability between random samples drawn from the same population. The range of these potential samples allows the procedure to construct confidence intervals and perform hypothesis testing. Importantly, as the sample size increases, bootstrapping converges on the correct sampling distribution under most conditions.

Now, let’s see an example of this procedure in action!

For this example, I’ll use bootstrapping to construct a confidence interval for a dataset that contains the body fat percentages of 92 adolescent girls. I used this dataset in my post about identifying the distribution of your data. These data do not follow the normal distribution. Because they do not meet the normality assumption of traditional statistics, they are a good candidate for bootstrapping, although the large sample size might let us bypass this assumption. The histogram below displays the distribution of the original sample data.

Download the CSV dataset to try it yourself: body_fat.

To create the bootstrapped samples, I’m using Statistics101, which is a giftware program. This is a great simulation program that I’ve also used to tackle the Monty Hall Problem!

Using its programming language, I’ve written a script that takes my original dataset and resamples it with replacement 500,000 times. This process produces 500,000 bootstrapped samples with 92 observations in each. The program calculates each sample’s mean and plots the distribution of these 500,000 means in the histogram below. Statisticians refer to this type of distribution as the sampling distribution of means. Bootstrapping methods create these distributions using resampling, while traditional methods use equations for probability distributions. Download this script to run it yourself: BodyFatBootstrapCI.

To create the bootstrapped confidence interval, we simply use percentiles. For a 95% confidence interval, we need to identify the middle 95% of the distribution. To do that, we use the 97.5th percentile and the 2.5th percentile (97.5 – 2.5 = 95). In other words, if we order all sample means from low to high, and then chop off the lowest 2.5% and the highest 2.5% of the means, the middle 95% of the means remain. That range is our bootstrapped confidence interval!
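Here is a rough Python sketch of that percentile step. It is not the Statistics101 script: the helper name `percentile_ci` is hypothetical, the sample is a made-up set of 92 right-skewed values standing in for the body fat data, and I use 10,000 resamples rather than 500,000 to keep it quick.

```python
import random
import statistics


def percentile_ci(boot_stats, confidence=0.95):
    """Sort the bootstrap statistics and trim an equal share from each tail."""
    ordered = sorted(boot_stats)
    tail = (1 - confidence) / 2                      # 0.025 for a 95% interval
    lower_index = round(tail * len(ordered))         # skip the lowest 2.5%
    upper_index = round((1 - tail) * len(ordered)) - 1  # skip the highest 2.5%
    return ordered[lower_index], ordered[upper_index]


rng = random.Random(0)

# Stand-in sample: 92 right-skewed values. These are NOT the body fat data.
sample = [rng.gammavariate(8.0, 3.5) for _ in range(92)]

# Resample with replacement and record the mean of each bootstrap sample.
boot_means = [
    statistics.fmean(rng.choices(sample, k=len(sample))) for _ in range(10_000)
]

lower, upper = percentile_ci(boot_means)
print(f"95% bootstrap CI for the mean: [{lower:.2f}, {upper:.2f}]")
```

To try it on the real data, you would replace `sample` with the 92 values from the CSV file.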

For the body fat data, the program calculates a 95% bootstrapped confidence interval of the mean of [27.16, 30.01]. We can be 95% confident that the population mean falls within this range.

This interval has about the same width as the traditional confidence interval for these data, and its endpoints differ only slightly. The two methods are very close.

Notice how the sampling distribution in the histogram approximates a normal distribution even though the underlying data distribution is skewed. This approximation occurs thanks to the central limit theorem. As the sample size increases, the sampling distribution converges on a normal distribution regardless of the underlying data distribution (with a few exceptions). For more information about this theorem, read my post about the Central Limit Theorem. Compare this process to how traditional statistical methods create confidence intervals.
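For comparison, the traditional interval for a mean is usually the textbook t-interval. I’m assuming that is the traditional method in question here; the formula below is the standard one rather than something specific to this dataset.

$$
\bar{x} \pm t_{0.975,\,n-1} \cdot \frac{s}{\sqrt{n}}
$$

Here $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, $n$ is the sample size (92 for the body fat data), and $t_{0.975,\,n-1}$ is the 97.5th percentile of the t-distribution with $n-1$ degrees of freedom. The bootstrap replaces this equation with the percentiles of the resampled means.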

Readers of my blog know that I love intuitive explanations of complex statistical methods. And, bootstrapping fits right in with this philosophy. This process is much easier to comprehend than the complex equations required for the probability distributions of the traditional methods. However, bootstrapping provides more benefits than just being easy to understand!

Bootstrapping does not make assumptions about the distribution of your data. You merely resample your data and use whatever sampling distribution emerges. Then, you work with that distribution, whatever it might be, as we did in the example.

Conversely, the traditional methods often assume that the data follow the normal distribution or some other distribution. For the normal distribution, the central limit theorem might let you bypass this assumption for sample sizes that are larger than ~30. Consequently, you can use bootstrapping for a wider variety of distributions, unknown distributions, and smaller sample sizes. Sample sizes as small as 10 can be usable.

In this vein, all traditional methods use equations that estimate the sampling distribution for a specific sample statistic when the data follow a particular distribution. Unfortunately, formulas for all combinations of sample statistics and data distributions do not exist! For example, there is no known sampling distribution for medians, which makes bootstrapping the perfect analysis for it. Other analyses have assumptions such as equality of variances. However, none of these issues are problems for bootstrapping.

While this blog post focuses on the sample mean, the bootstrap method can analyze a broad range of sample statistics and properties. These statistics include the mean, median, mode, standard deviation, analysis of variance, correlations, regression coefficients, proportions, odds ratios, variance in binary data, and multivariate statistics among others.
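As a sketch of how little changes when you swap the statistic, the snippet below builds percentile intervals for a median and a standard deviation with the same resampling loop. The function `bootstrap_ci` and the skewed data are hypothetical, made up for this illustration.

```python
import random
import statistics


def bootstrap_ci(data, statistic, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for any statistic of a single sample."""
    rng = random.Random(seed)
    stats = sorted(
        statistic(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    lower = stats[round(tail * n_resamples)]
    upper = stats[round((1 - tail) * n_resamples) - 1]
    return lower, upper


rng = random.Random(3)
data = [rng.lognormvariate(3.0, 0.5) for _ in range(40)]  # hypothetical skewed data

# Only the `statistic` argument changes between analyses.
median_ci = bootstrap_ci(data, statistics.median)
stdev_ci = bootstrap_ci(data, statistics.stdev)

print("95% CI for the median:", [round(v, 2) for v in median_ci])
print("95% CI for the standard deviation:", [round(v, 2) for v in stdev_ci])
```

Statistics that involve more than one variable, such as correlations or regression coefficients, need the rows resampled together, which this single-sample sketch does not cover.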

There are several, mostly esoteric, conditions when bootstrapping is not appropriate, such as when the population variance is infinite, or when the population values are discontinuous at the median. And, there are various conditions where tweaks to the bootstrapping process are necessary to adjust for bias. However, those cases go beyond the scope of this introductory blog post.

So what is the problem we are trying to solve? And what are the resources that we have to solve it?

Quite simply, the problem is – how do we answer a question about a group of people if we can’t ask every person?

Let’s say that you want to know the proportion of black umbrellas used by the people of Vancouver. (Let’s face it, it rains a LOT here!) You could go to the city center one lunch hour and take tallies of the colors of the umbrellas. If the count you get looks something like the picture to the right, you’d probably say a bit over 10%.

But doesn’t this just mean that you know about the people carrying umbrellas at that particular time? How do you use this information to gain insight into the entire population?

So if all you had was this picture, what resources do you have to be able to answer your question? The answer is actually the picture itself! (Plus this very intriguing concept of using this information to bootstrap to estimates of the population.)

So to bootstrap the above information, you could do something like this:

- Take a bucket and fill it with white balls (non-black umbrellas) and black balls (black umbrellas). If you represented the bucket as a list of 1’s and 0’s, it would look something like this: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
- You would then choose a sample size, say 200, and you would close your eyes, pick one of the balls from your bucket, and record its color. Then you would put the ball back into the bucket, swirl it around, and pick again. You’d do that 200 times.
- For your 200 picks, you’d then find the proportion of black balls/umbrellas for that sample. (You could just count the number of black balls and then divide that by 200.)
- You would then repeat the above process something like 10,000 times and record all of the different proportions you found.
- You could then plot the frequency of all the proportions you found, calculate the average of all of the proportions, and cut off 2.5% from each tail of the frequency distribution to create a 95% confidence interval.
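The steps above can be sketched in Python roughly like this. It is a reconstruction from the description, not the original code: the variable names and the seed are my own, and I’m assuming 10,000 repetitions.

```python
import random
import statistics

rng = random.Random(2018)

# The bucket from the list above: 1 = black umbrella, 0 = any other color.
bucket = [0] * 9 + [1] * 3 + [0] * 11

sample_size = 200    # picks per simulated sample
n_repeats = 10_000   # number of simulated samples (assumed)

proportions = []
for _ in range(n_repeats):
    # Pick a ball, record it, put it back: sampling with replacement.
    picks = rng.choices(bucket, k=sample_size)
    proportions.append(sum(picks) / sample_size)

proportions.sort()
average = statistics.fmean(proportions)
lower = proportions[round(0.025 * n_repeats)]       # cut 2.5% off the low tail
upper = proportions[round(0.975 * n_repeats) - 1]   # cut 2.5% off the high tail

print(f"Average proportion of black umbrellas: {average:.3f}")
print(f"95% confidence interval: [{lower:.3f}, {upper:.3f}]")
```

The exact numbers shift a little with the random seed, and plotting `proportions` as a histogram gives the frequency distribution described in the last step.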

If you did that, the distribution of the proportion of black umbrellas would look something like this:

The proportion of black umbrellas for this process would come out somewhere around 13% and you could say that you were 95% confident that the actual value was between 8.5% and 18%.

Which would mean… **You just did it!**

From the sample that you took, you could say that you estimated that 13% of all people in Vancouver carried black umbrellas and you were 95% confident that the actual value fell somewhere between 8.5% and 18%. Pretty neat, huh?

Obviously, if you actually did what was described above, it would take FOREVER. But with developments in computing and programming, this can be calculated very quickly (it took me about 10 minutes to get the code right, including making the graph look pretty!).

The potential applications for bootstrapping don’t just apply to estimating the mean of a population. One of our class instructors suggests that bootstrapping can be used as a substitute for any statistical test that we would use in traditional statistics to talk about populations. This includes t-tests, F-tests, chi-squared – you name it, apparently, we can do it with bootstrapping. Bootstrapping becomes particularly interesting if we recognize that all of those traditional statistical tests have specific assumptions associated with them. If your data doesn’t align with those assumptions, the conclusions you draw from those tests run the risk of being wrong. Apparently, with bootstrapping and confidence intervals/hypothesis testing we can get around a lot of these issues.
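As one illustration of a bootstrap hypothesis test, here is a sketch of a one-sample test for a mean. It uses a common recipe (shift the data so the null hypothesis is true, then resample), which is my own choice of method rather than anything prescribed in class; the function name and the data are hypothetical.

```python
import random
import statistics


def bootstrap_mean_test(data, null_mean, n_resamples=5000, seed=0):
    """Two-sided bootstrap p-value for H0: population mean == null_mean."""
    rng = random.Random(seed)
    observed = statistics.fmean(data)
    observed_diff = abs(observed - null_mean)

    # Shift the sample so its mean equals the null value, keeping its shape.
    shifted = [x - observed + null_mean for x in data]

    extreme = 0
    for _ in range(n_resamples):
        resample_mean = statistics.fmean(rng.choices(shifted, k=len(shifted)))
        # Count resamples at least as far from the null as the real sample was.
        if abs(resample_mean - null_mean) >= observed_diff:
            extreme += 1
    return extreme / n_resamples


# Hypothetical measurements, for illustration only.
data = [10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 10.3]
p_value = bootstrap_mean_test(data, null_mean=12.0)
print("Bootstrap p-value:", p_value)
```

A small p-value means that samples like ours would rarely show up if the null hypothesis were true, which is the same logic as a t-test without the normality assumption.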

References:

https://rebeccaebarnes.github.io/2018/04/29/bootstrapping

https://towardsdatascience.com/bootstrapping-statistics-what-it-is-and-why-its-used-e2fa29577307
