
Important probability distributions for Data Science with Python code

33 mins read

For an aspiring data scientist, statistics is a must-learn subject. It helps us tackle complex real-world problems, so that data scientists can mine useful trends, changes, and data behavior and fit an appropriate model that yields the best results. Every time we get a new dataset, we must understand the data pattern and the underlying probability distribution for further optimization and treatment during Exploratory Data Analysis (EDA). During EDA, we try to characterize the behavior of the data using different probability distributions. If the data follows or closely resembles one of these distributions, we can treat it accordingly for a better result.

Data Scientists deal with many kinds of data, such as categorical, numerical, text, image, voice, and many more. Each of them has a way of analysis and representation. Here we are going to consider the numerical data for further analysis. Numerical data can be of two types.

  1. Discrete — It can only take specific, countable values. For example, the number of employees in a company, or the result of rolling a die, where the outcome is one of the integers from 1 to 6.
  2. Continuous — It can take any value. For example, the height or weight of a person can be any value like 45.6, 87.9

We can plot this numerical data, visualize it, and draw conclusions based on its pattern, behavior, and the type of probability distribution it follows. Before going deeper, let’s get familiar with some terminology.

What is a Random Variable?

A random variable is a variable whose value depends on the outcome of a chance process. Its value is unknown in advance, and its outcomes are obtained through experiments. It can be discrete (when the outcome takes specific, countable values) or continuous (when the outcome can be any value within a particular range).

What is Probability Distribution?

A Probability Distribution of a random variable is a list of all possible outcomes with corresponding probability values.

Note: The value of a probability always lies between 0 and 1.

Example of probability distribution

What is an example of Probability Distribution?

Let’s understand the probability distribution by an example:

When two six-sided dice are rolled, let the possible outcome of a roll be denoted by (a, b), where

a: number on the top of the first dice

b: number on the top of the second dice

Then, the sum of a + b is: 

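Sum of a + b | (a, b)
2  | (1,1)
3  | (1,2), (2,1)
4  | (1,3), (2,2), (3,1)
5  | (1,4), (2,3), (3,2), (4,1)
6  | (1,5), (2,4), (3,3), (4,2), (5,1)
7  | (1,6), (2,5), (3,4), (4,3), (5,2), (6,1)
8  | (2,6), (3,5), (4,4), (5,3), (6,2)
9  | (3,6), (4,5), (5,4), (6,3)
10 | (4,6), (5,5), (6,4)
11 | (5,6), (6,5)
12 | (6,6)

Each of the 36 outcomes is equally likely, so the probability of a given sum is the number of pairs in its row divided by 36. For example, P(a + b = 7) = 6/36 = 1/6.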
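To make the dice example concrete, here is a minimal sketch that enumerates every (a, b) pair and turns the counts into probabilities. The variable names are illustrative only and do not come from the original article.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely (a, b) outcomes of two six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Count how many outcomes produce each sum, then divide by the total.
counts = Counter(a + b for a, b in outcomes)
sum_distribution = {
    s: Fraction(c, len(outcomes)) for s, c in sorted(counts.items())
}

for s, prob in sum_distribution.items():
    print(f"P(a + b = {s:2d}) = {prob}")
```

The probabilities are kept as exact fractions so it is easy to see that they add up to 1, which is what makes this list a probability distribution.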
Probability distributions
  • If a random variable is a discrete variable, its probability distribution is called a discrete probability distribution.
    • Example: Flipping of two coins
    • Functions that represent a discrete probability distribution are known as Probability Mass Functions.
  • If a random variable is a continuous variable, its probability distribution is called a continuous probability distribution.
    • Example: Measuring temperature over a period of time
    • Functions that represent a continuous probability distribution are known as Probability Density Functions.

What is Probability Mass Function(PMF)?

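The distribution of a discrete random variable is described by its probability mass function (PMF). The PMF of a discrete random variable X is defined as p(x) = P(X = x), where p(x) ≥ 0 for every x and the values of p(x) sum to 1 over all possible x.

What is Probability Density Function (PDF)?

The distribution of a continuous random variable is described by its probability density function (PDF). For a random variable X with PDF f(x), the probability that X falls in an interval from a to b is the area under f between a and b: P(a ≤ X ≤ b) = ∫ f(x) dx, integrated from a to b, where f(x) ≥ 0 and the total area under f is 1.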


  • PMF (Probability Mass Function) — a mathematical formula to measure the probability of drawing a specific value from a discrete data distribution.
  • PDF (Probability Density Function) — a mathematical formula to measure the probability density of different values across a continuous space.
  • CDF (Cumulative Distribution Function) — a mathematical formula to measure the probability of drawing a sample less than or equal to a certain value.
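As a quick illustration of how these three functions differ in practice, the sketch below evaluates one of each with scipy.stats. The specific distributions and values (a binomial with 10 trials, a standard normal) are assumptions chosen only for the example.

```python
from scipy.stats import binom, norm

# PMF: probability of exactly 3 successes in 10 trials with p = 0.5.
pmf_value = binom.pmf(3, 10, 0.5)

# PDF: density of a standard normal at x = 0.
# This is a density, not a probability, so it may exceed 1 for other distributions.
pdf_value = norm.pdf(0.0)

# CDF: probability that a standard normal sample is less than or equal to 0.
cdf_value = norm.cdf(0.0)

print(pmf_value, pdf_value, cdf_value)
```

Only the PMF and CDF values are probabilities. The PDF value becomes a probability only after integrating it over a range.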

Alright, now let’s take a look at some data distributions!

Discrete Distributions

1. Bernoulli Distribution: Single Trial with Two Possible Outcomes

The Bernoulli distribution is one of the easiest distributions to understand, and it can be used as a starting point to derive more complex distributions. Any event with a single trial and only two possible outcomes follows a Bernoulli distribution. Flipping a coin or choosing between True and False in a quiz are examples: each is a single trial with only two outcomes. If you flip a coin once, that is a single trial, and the only two possible outcomes are heads or tails.

Usually, when following a Bernoulli distribution, we have the probability of one of the outcomes (p). From (p), we can deduce the probability of the other outcome by subtracting it from the total probability (1), represented as (1-p).

It is represented by bern(p), where p is the probability of success. The expected value of a Bernoulli trial ‘x’ is represented as, E(x) = p, and similarly Bernoulli variance is, Var(x) = p(1-p).

The Bernoulli Distribution captures the probability of receiving one of two outcomes (often called success or failure) given a single trial. It is actually just a special case of the Binomial distribution where n=1.


  • You are out with your friend, and you pull a coin from your pocket to determine who is buying the next round of drinks 🍻 The outcome of this coin flip can be modeled using the Bernoulli distribution.
Bernoulli distribution

Python Code

from scipy.stats import bernoulli
import matplotlib.pyplot as plt

# Specified probability parameter
p = 0.5

# Trial index for the x-axis
x = list(range(10))
# Draw 10 samples (each 0 or 1) according to the Bernoulli distribution
y = bernoulli.rvs(p, size=10)

# Plot the outcome of each trial as a blue dot
plt.plot(x, y, "ob")
plt.show()

2. Binomial Distribution: A sequence of Bernoulli events

The binomial distribution is just taking Bernoulli one step further. We still have trials that result in one of two outcomes (success or failure), but now we are looking at the probability that a specific number of outcomes (x) occur in n trials instead of a single trial.

The Binomial Distribution can be thought of as the sum of outcomes of an event following a Bernoulli distribution. Therefore, Binomial Distribution is used in binary outcome events, and the probability of success and failure is the same in all successive trials. An example of a binomial event would be flipping a coin multiple times to count the number of heads and tails.

Binomial vs Bernoulli distribution.

The difference between these distributions can be explained through an example. Consider you’re attempting a quiz that contains 10 True/False questions. Trying a single T/F question would be considered a Bernoulli trial, whereas attempting the entire quiz of 10 T/F questions would be categorized as a Binomial trial. The main characteristics of Binomial Distribution are:

  • Given multiple trials, each of them is independent of the other. That is, the outcome of one trial doesn’t affect another one.
  • Each trial can lead to just two possible results (e.g., winning or losing), with probabilities p and (1 – p).

A binomial distribution is represented by B (n, p), where n is the number of trials and p is the probability of success in a single trial. A Bernoulli distribution can be shaped as a binomial trial as B (1, p) since it has only one trial. The expected value of a binomial trial “x” is the number of times a success occurs, represented as E(x) = np. Similarly, variance is represented as Var(x) = np(1-p).

Given the probability of success (p) and the number of trials (n), we can calculate the probability of getting exactly x successes in these n trials using the formula below:

P(X = x) = C(n, x) * p^x * (1 - p)^(n - x), where C(n, x) = n! / (x! * (n - x)!)

For example, suppose that a candy company produces both milk chocolate and dark chocolate candy bars. The total products contain half milk chocolate bars and half dark chocolate bars. Say you choose ten candy bars at random and choosing milk chocolate is defined as a success. The probability distribution of the number of successes during these ten trials with p = 0.5 is shown here in the binomial distribution graph:

Binomial Distribution Graph

What is an example of Binomial Distribution?

Let’s understand the Binomial Distribution by an example,

Consider the experiment of Picking Balls

Problem Statement: 

Suppose a bag contains 8 white balls and 2 black balls. What is the probability of drawing 3 white balls if the probability of selecting a white ball is 0.6?

Example binomial distribution : Probability Distributions


  • You decide to try and trick your friend (with statistics, of course) and you say that you will buy the next round of drinks if you flip a coin 5 times and heads comes up exactly 2 times 🪙
Binomial distribution
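The coin bet above can be worked out in a couple of lines. This is a minimal sketch, assuming a fair coin (p = 0.5). It evaluates the binomial formula C(n, k) * p^k * (1 - p)^(n - k) by hand and then cross-checks it against scipy.stats.binom.

```python
from math import comb
from scipy.stats import binom

n, p, k = 5, 0.5, 2  # 5 flips, fair coin, exactly 2 heads

# Direct use of the binomial formula.
by_formula = comb(n, k) * p**k * (1 - p) ** (n - k)

# The same probability from scipy.
by_scipy = binom.pmf(k, n, p)

print(by_formula, by_scipy)
```

There are 10 ways to place 2 heads among 5 flips and each sequence has probability 1/32, so the chance of buying the round is 10/32, a little under one in three.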


  • The experiment is performed under the same set of conditions for any number of trials. For example, if the prob. of success(p) is 0.5, it will be 0.5 throughout the trials.
  • For each trial, there are only two possible outcomes. success or failure
  • The sum of the probabilities will always be 1.
  • Each trial will be independent of each other.


n = number of independent trials


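  • A binomial distribution is skewed unless p = q = 1/2, where q = 1 - p.
  • The mean, np, is a positive real value; in the Poisson approximation it is held constant and written as λ.
  • The sum of independent binomial variates is binomial only when they share the same p; with different values of p, the sum is generally not a binomial variate.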

Q. Under which conditions can a binomial distribution be approximated by a normal distribution?

  1. The number of independent trials should be indefinitely large, n → ∞.
  2. Neither p nor q should be small.
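To see these two conditions at work, here is a small sketch comparing an exact binomial probability with its normal approximation. The values n = 1000, p = 0.4 and the cut-off k = 410 are assumptions picked for illustration, and the 0.5 added to k is the usual continuity correction.

```python
import numpy as np
from scipy.stats import binom, norm

n, p = 1000, 0.4            # many trials, p not close to 0 or 1
mean = n * p                # binomial mean, np
std = np.sqrt(n * p * (1 - p))  # binomial standard deviation, sqrt(npq)

k = 410
# Exact P(X <= k) from the binomial distribution.
exact = binom.cdf(k, n, p)
# Normal approximation with a continuity correction of 0.5.
approx = norm.cdf(k + 0.5, loc=mean, scale=std)

print(exact, approx)
```

Repeating the comparison with a small n or with p very close to 0 should show the two numbers drifting apart, which is why both conditions are needed.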

Distribution Plot:

for the different probability of success

Python Code

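from scipy.stats import binom
import matplotlib.pyplot as plt

# Specified parameters: success probability and trials per experiment
p = 0.5
n = 10

# Index of each simulated experiment for the x-axis
size = 10
x = list(range(size))

# Sample according to the Binomial distribution:
# each value is the number of successes in n trials
y = binom.rvs(n, p, size=size)

plt.plot(x, y, "ob")
plt.show()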

3. Discrete Uniform Distribution: All outcomes are equally likely

In statistics, uniform distribution refers to a statistical distribution in which all outcomes are equally likely. Consider rolling a six-sided die. You have an equal probability of obtaining all six numbers on your next roll, i.e., obtaining precisely one of 1, 2, 3, 4, 5, or 6, equaling a probability of 1/6, hence an example of a discrete uniform distribution.

As a result, the uniform distribution graph contains bars of equal height representing each outcome. In our example, the height is a probability of 1/6 (0.166667).

A discrete uniform distribution is a simple distribution where we have a set of potential outcomes (n), each of which has an equal likelihood of occurring.


  • You blindly reach into a bag of marbles that contains a green marble, a red marble, a blue marble, and a yellow marble. What are the chances of picking the yellow marble? 🟡
discrete uniform distribution

Uniform distribution is represented by the function U(a, b), where a and b represent the starting and ending values, respectively. Similar to a discrete uniform distribution, there is a continuous uniform distribution for continuous variables.

The drawback of this distribution is that it often provides us with no relevant information. Using our example of rolling a die, we get an expected value of 3.5, which gives us no accurate intuition since a die has no face showing 3.5. Since all values are equally likely, it gives us no real predictive power.

Python Code

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Discrete uniform distribution over the die faces 1..6
# (stats.randint excludes the upper bound, hence 7)
X_discrete = np.arange(1, 7)
discrete_uniform = stats.randint(1, 7)
discrete_uniform_pmf = discrete_uniform.pmf(X_discrete)

# Bar plot of the PMF: every bar has height 1/6
fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(X_discrete, discrete_uniform_pmf)
ax.set_title("Discrete Uniform Distribution")
plt.show()

4. Poisson Distribution: The probability that an event May or May not occur

The Poisson distribution is used to answer the question “how many times is an event likely to occur over a given period of time?”

The Poisson distribution deals with the frequency with which an event occurs within a specific interval. Instead of the probability of an event, it requires knowing how often the event happens in a particular period or distance. For example, suppose a cricket chirps two times in 7 seconds on average. We can use the Poisson distribution to determine the likelihood of it chirping five times in 15 seconds.

A Poisson process is represented with the notation Po(λ), where λ is the expected number of events in the period. Both the expected value and the variance of a Poisson distribution are equal to λ. If X is the discrete random variable counting the events, the Poisson distribution is given by the following formula:

P(X = k) = λ^k * e^(-λ) / k!, for k = 0, 1, 2, …

The main characteristics which describe the Poisson Processes are:

  • The events are independent of each other.
  • An event can occur any number of times (within the defined period).
  • Two events can’t take place simultaneously.


  • Let’s say that a basketball team scores an average of 4.2 three-point shots per quarter 🏀 If that is true, then what is the likelihood that this team will score exactly 7 three-point shots in a quarter?
Poisson distribution
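The basketball question can be answered directly from the Poisson PMF, λ^k * e^(-λ) / k!. The sketch below assumes that three-point shots per quarter really do follow a Poisson distribution with λ = 4.2, which is the modelling assumption the example itself makes.

```python
from math import exp, factorial
from scipy.stats import poisson

lam = 4.2  # average three-point shots per quarter
k = 7      # number of shots we are asking about

# Poisson PMF evaluated by hand.
by_formula = lam**k * exp(-lam) / factorial(k)

# The same value from scipy.
by_scipy = poisson.pmf(k, lam)

print(by_formula, by_scipy)
```

Both lines give a probability of a little under 7%, so a seven-shot quarter is unusual but far from impossible for this team.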

Under which conditions can a binomial distribution be approximated by a Poisson distribution?

  1. The number of trials (n) should be very large, n → ∞.
  2. The constant probability of success for each trial should be minimal p→0
  3. The mean should be equal to the Poisson parameter. np= λ
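The three conditions can be checked numerically. In this sketch the mean is held at an assumed λ = 3 while n grows and p = λ/n shrinks, and we measure the largest gap between the binomial and Poisson PMFs over k = 0..10.

```python
import numpy as np
from scipy.stats import binom, poisson

lam = 3.0              # fixed mean, np = λ
k = np.arange(0, 11)   # counts at which the two PMFs are compared

gaps = {}
for n in (10, 100, 10_000):
    p = lam / n  # success probability shrinks as n grows
    gaps[n] = float(np.max(np.abs(binom.pmf(k, n, p) - poisson.pmf(k, lam))))
    print(f"n = {n:6d}, p = {p:.4f}, largest PMF gap = {gaps[n]:.6f}")
```

The gap shrinks as n increases, which is the sense in which the binomial distribution "forms" a Poisson distribution in the limit.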


For the Poisson distribution, the mean and the variance are the same: both equal the Poisson parameter λ.

Distribution Plot:

What are examples of Poisson Distribution?

Many real-life datasets that we encounter as data scientists follow the Poisson distribution, such as:

  1. The number of transaction frauds that happens in a month for a particular bank.
  2. The number of insincere questions posted on Quora every day
  3. The number of customers who call the company service center for their service problem

Suppose patients arrive at a hospital at an expected rate of 6 per day. What is the probability that exactly five patients will visit the hospital on a given day?

Example Poisson Distribution : Probability Distributions

Difference between Poisson Distribution and Binomial Distribution

Poisson Distribution | Binomial Distribution
The number of trials is infinite | The number of trials is fixed
Unlimited number of possible outcomes | Only two possible outcomes (Success or Failure)
Mean = Variance | Mean > Variance

Python Code

If an event occurs at a fixed average rate in time, then the probability of observing a given number (k) of events in a time interval may be described with the Poisson distribution. For example, customers may arrive at a cafe at an average rate of 3 per minute. We can use the Poisson distribution to calculate the probability that exactly 9 customers arrive in 2 minutes. The rate is 3 per minute, so over a 2-minute window the expected count is λ = 3 × 2 = 6, and k in our example is 9. We can use SciPy to solve our example problem.

from scipy import stats

# Expected number of arrivals in the 2-minute window: 3 per minute * 2 minutes
mu = 3 * 2

print(stats.poisson.pmf(k=9, mu=mu))

For larger values of λ, the shape of a Poisson distribution resembles a normal distribution, with its peak near λ. Let’s plot one in Python to see how this looks visually.

import matplotlib.pyplot as plt
from scipy import stats

# generate random values from poisson distribution with sample size of 500
X = stats.poisson.rvs(mu=3, size=500)

plt.subplots(figsize=(8, 5))
plt.hist(X, density=True, edgecolor="black")
plt.title("Poisson Distribution")
plt.show()

5. Geometric Distribution

Suppose we are surveying voters after an election to estimate support for an independent candidate. Outside a polling booth, we start asking people whom they voted for, and each time we hear the name of another candidate. Finally, we find a person who says they voted for the independent candidate. The geometric distribution describes the number of people we had to poll before finding someone who voted for our candidate.

Basically, it represents the number of failures before the first success in a series of Bernoulli trials (each of which always has two outcomes).

If p is the probability of success on each trial, we can define the PMF as

P(X = k) = (1 - p)^k * p, for k = 0, 1, 2, …

where k is the number of failures before the first success.


  • There are two possible outcomes for each trial (success or failure).
  • The trials are independent of each other.
  • The probability of success is the same for each trial.
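Here is a minimal sketch of the polling example, assuming (hypothetically) that 20% of voters chose the independent candidate. The probability of exactly k failures before the first success is (1 - p)^k * p. Note that scipy.stats.geom counts the trial on which the first success happens, so its support starts at 1 rather than 0.

```python
from scipy.stats import geom

p = 0.2        # hypothetical share of voters for the independent candidate
failures = 3   # people polled who named someone else before the first "yes"

# Probability of exactly 3 failures followed by a success, by hand.
by_formula = (1 - p) ** failures * p

# scipy's geom is defined on the trial number of the first success (1, 2, ...),
# so 3 failures correspond to the first success arriving on trial 4.
by_scipy = geom.pmf(failures + 1, p)

print(by_formula, by_scipy)
```

The off-by-one between "number of failures" and "number of trials" is a common source of confusion, so it is worth checking which convention a library uses before trusting its output.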


Distribution Plot:

6. Multinomial Distribution

As the name suggests, the Multinomial distribution is related to the Binomial distribution: in fact, it is a generalization of the Binomial distribution. In the Binomial distribution, each trial has only 2 possible outcomes. What if our experiments have more than two outcomes? As an analogy, the Binomial distribution is like a 10-time coin-tossing game, while the Multinomial distribution is like choosing a number randomly from 1, 2, 3 with probabilities p1, p2, and p3 = (1 - p1 - p2), ten times over. Its probability mass function looks like this:

P(X_1 = x_1, …, X_k = x_k) = n! / (x_1! * … * x_k!) * p_1^x_1 * … * p_k^x_k, with x_1 + … + x_k = n

Where n is the number of experiments, x_i is the number of times the ith outcome occurs, and p_i is the probability of the ith outcome.

Again, the python implementation looks handy with the help of scipy.stats :

from scipy.stats import multinomial
import matplotlib.pyplot as plt

# Specified probability parameter
p = [0.3,0.2,0.5]
n = 10
x = [i for i in range(0,n)]

# Sample according to multinomial distribution
y = multinomial.rvs(n, p, size=10)

plt.plot(x,y, "ob")

If you compare the image of the Multinomial distribution with that of the Binomial distribution, they look similar, except that for each x value we have multiple random outcomes in the Multinomial case, one count per category. (Why are there 3 values for x = 0 but only 2 for x = 8? The reason is that two of the counts at x = 8 are equal, so matplotlib draws the points on top of each other.)
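Besides sampling, the multinomial PMF can be evaluated directly. The sketch below reuses p = [0.3, 0.2, 0.5] and n = 10 from the code above and asks for the probability of one particular split of counts. The split [3, 2, 5] is an assumption chosen for illustration.

```python
from math import factorial
from scipy.stats import multinomial

p = [0.3, 0.2, 0.5]
n = 10
counts = [3, 2, 5]  # how often each of the three outcomes occurs; sums to n

# n! / (x_1! x_2! x_3!) * p_1^x_1 * p_2^x_2 * p_3^x_3, written out by hand.
coefficient = factorial(n) // (factorial(3) * factorial(2) * factorial(5))
by_formula = coefficient * 0.3**3 * 0.2**2 * 0.5**5

# The same probability from scipy.
by_scipy = multinomial.pmf(counts, n=n, p=p)

print(by_formula, by_scipy)
```

With only two categories, the same call reduces to the binomial PMF, which is what "generalization" means here.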

Note: There are many other discrete probability distributions, such as the negative binomial and the hypergeometric. These distributions are also important in statistics, and it’s good to have an idea of them from a data science perspective. But we will complete the discrete part here with the above 6 distributions.

Continuous Distributions

Now that we are talking about continuous values, we can no longer say “what is the likelihood of this exact value occurring” because technically there are no exact values in a continuous space. Instead, we ask the question “what is the likelihood of a sample falling within a given range of values?”

1. Gaussian (Normal) Distribution: Symmetric Distribution of Values Around the Mean

The normal distribution is the most used distribution in data science. In a normal distribution graph, data is symmetrically distributed with no skew. When plotted, the data follows a bell shape, with most values clustering around a central region and tapering off as they go further away from the center.

The normal distribution frequently appears in nature and life in various forms. For example, the scores of a quiz follow a normal distribution. Many of the students scored between 60 and 80 as illustrated in the graph below. Of course, students with scores that fall outside this range are deviating from the center.

It is defined by a mean value (μ) and a standard deviation (σ).

Here, you can see the “bell-shaped” curve around the central region, indicating that most data points lie there. The normal distribution is written as N(μ, σ), where μ is the mean and σ is the standard deviation. The expected value of a normal distribution is equal to its mean.

The probability density function of a normal distribution is as follows:

f(x) = 1 / (σ * √(2π)) * e^(-(x - μ)² / (2σ²))

The normal distribution is the backbone of statistics and data science. Many machine learning models work well with data that follow a normal distribution, such as:

  1. Gaussian Naive Bayes Classifier
  2. Logistic Regression, Linear Regression, and other least-squares-based regression models
  3. Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA)

The sigmoid function also tends to work well with normally distributed data. Some data may exhibit another kind of distribution that can be transformed toward a normal distribution using logarithms or square roots.


Let’s say I am friends with every single person in the world and everyone volunteers their height information to me. The mean height of the population turns out to be 164.58cm with a standard deviation of 8.83cm. Given the information, what is the probability of someone being taller than 175cm?

Standard normal distribution

  • Normal distribution with mean = 0 and standard deviation = 1.
  • For a standard normal random variable Z, the probability density function is given by:

φ(z) = 1 / √(2π) * e^(-z² / 2)


**NOTE** — for calculations in a continuous space, we sometimes approximate probabilities by calculating scores and finding their associated probabilities in a lookup table. For Gaussian distribution, we use what is called the z-score.

Gaussian distribution
z-score table (positive z-scores)
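As a rough check of the height question above, here is a minimal sketch that computes the z-score and then replaces the lookup table with scipy.stats.norm. It assumes, as the example does, that heights are normally distributed with mean 164.58cm and standard deviation 8.83cm.

```python
from scipy.stats import norm

mean_cm, std_cm = 164.58, 8.83
threshold_cm = 175.0

# z-score: how many standard deviations the threshold lies above the mean.
z = (threshold_cm - mean_cm) / std_cm

# Survival function = 1 - CDF = P(Z > z), the "taller than" probability.
p_taller = norm.sf(z)

print(f"z = {z:.2f}, P(height > 175cm) = {p_taller:.4f}")
```

The z-score comes out at about 1.18, and the probability of being taller than 175cm is roughly 12%. A printed z-table gives the same answer: look up 1.18 and subtract the table value from 1.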

Empirical Rule:

Empirical Rule is often called the 68 – 95 – 99.7 rule or Three Sigma Rule. It states that on a Normal Distribution:

  • 68% of the data will be within one Standard Deviation of the Mean
  • 95% of the data will be within two Standard Deviations of the Mean
  • 99.7% of the data will be within three Standard Deviations of the Mean

Characteristics of Normal Distribution:

  • It is symmetrical about the mean. Each half of the distribution is a mirror image of the other half.
  • Mean = Median = Mode
  • The total area under the curve is 1
  • The curve of the distribution is a bell curve
  • It is asymptotic to the horizontal axis.
  • It is unimodal.
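The 68 – 95 – 99.7 figures of the Empirical Rule are rounded values, and they can be reproduced from the CDF of the standard normal distribution. A minimal sketch:

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean, for k = 1, 2, 3.
coverage = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, prob in coverage.items():
    print(f"within {k} standard deviation(s): {prob:.4f}")
```

Because every normal distribution is a shifted and scaled copy of the standard one, the same percentages hold for any mean and standard deviation.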

Python Code

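from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np

# Evaluate three normal PDFs on a common grid
x1 = np.arange(-20, 20, 0.1)
y1 = norm.pdf(x1, 0, 5)  # mean 0, standard deviation 5
y2 = norm.pdf(x1, 0, 3)  # mean 0, standard deviation 3
y3 = norm.pdf(x1, 5, 3)  # mean 5, standard deviation 3

plt.plot(x1, y2, label="mean 0, standard deviation 3")
plt.plot(x1, y1, label="mean 0, standard deviation 5")
plt.plot(x1, y3, label="mean 5, standard deviation 3")

plt.legend(loc="upper left")
plt.show()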

2. Exponential Distribution: Model elapsed time between two events

The exponential distribution is one of the most widely used continuous distributions. It is used to model the time between events. For example, in physics, it is often used to model radioactive decay; in engineering, to model the time until a defective part arrives on an assembly line; and in finance, to model the time until the next default in a portfolio of financial assets. Another common application of the exponential distribution is survival analysis (e.g., the expected life of a device/machine).

The exponential distribution is commonly represented as Exp(λ), where λ is the distribution parameter, often called the rate parameter. We can find the value of λ with the formula λ = 1/μ, where μ is the mean.

In the exponential distribution, we are interested in the waiting time until the next occurrence, rather than the number of occurrences. For example, at a bus stop, we no longer care about the number of arrivals in the next 20 minutes. We only care when the next bus will arrive.

The exponential distribution, sometimes described as the inverse of the Poisson distribution, is used to model the time elapsed between two events. For example, the amount of time from now until the next earthquake occurs follows an exponential distribution. Suppose one event occurs at time t and the next at time (t+1): the length of the gap between them is the quantity that the exponential distribution describes.

The random variable in an exponential distribution takes many small values and few large values. For example, consider purchase amounts in a grocery supermarket: most people buy items for a small amount, while only a few people make purchases for a large amount. This is a general tendency.

How is it an inverse case of Poisson Distribution?

Let’s take the below two cases.

  1. Number of cars passing a tollgate in one hour
  2. Number of hours between cars’ arrival

In the above cases, condition 1 asks for the number of cars per hour: it deals with a count of cars. In condition 2, we are looking at the time interval between one car’s arrival and the next. If condition 1 follows the Poisson distribution, then condition 2 will follow the exponential distribution.

Example 2: The Number of hipsters arriving at a bar in one minute and the number of minutes between new arrivals at the same bar. One follows the Poisson distribution, whereas another is Exponential.
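The link between the two distributions can be seen in a short simulation. This is only a sketch under assumed numbers: arrivals at a hypothetical rate of 4 per minute, generated by drawing exponential gaps between arrivals and then counting how many arrivals land in each one-minute window.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
rate = 4.0            # hypothetical average arrivals per minute
n_arrivals = 200_000

# Exponential gaps between consecutive arrivals have mean 1 / rate.
gaps = rng.exponential(scale=1.0 / rate, size=n_arrivals)
arrival_times = np.cumsum(gaps)

# Count the arrivals that fall in each complete one-minute window.
n_minutes = int(arrival_times[-1])
in_window = arrival_times[arrival_times < n_minutes]
counts = np.bincount(in_window.astype(int), minlength=n_minutes)

print("mean gap (minutes):", gaps.mean())
print("mean count per minute:", counts.mean())
print("variance of count per minute:", counts.var())
```

The mean gap comes out close to 1/4 of a minute, while the per-minute counts have mean and variance both close to 4, which is the signature of a Poisson distribution.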


  1. Events must occur at a constant rate
  2. Events must be independent of each other


The probability density function of an exponential distribution is as follows:

f(x; λ) = λ * e^(-λx), for x ≥ 0

Here λ is the rate parameter and x is the random variable.

Suppose we measure the life of a mobile phone. Then λ is the failure rate of the mobile phone at time t, given that it has survived up to time t.



The important parameter for this distribution is the rate parameter (λ), which is the number of events per unit of time.


  • A postal worker spends an average of 4 minutes with each customer, and the time they spend with a customer can be modeled with an exponential distribution with rate λ = 1/4 = 0.25 customers/min.
  • Given a random customer, what is the probability the postal worker will spend less than 2 minutes with them?
Exponential distribution
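The postal worker question can be answered with the exponential CDF, P(T ≤ t) = 1 - e^(-λt). The sketch below assumes the rate of 0.25 customers per minute from the example. Note that scipy.stats.expon is parameterized by scale = 1/λ rather than by λ itself.

```python
from math import exp
from scipy.stats import expon

rate = 0.25      # customers per minute, i.e. a mean of 4 minutes per customer
t_minutes = 2.0

# CDF of the exponential distribution written out by hand.
by_formula = 1 - exp(-rate * t_minutes)

# The same probability from scipy, using scale = 1 / rate.
by_scipy = expon.cdf(t_minutes, scale=1 / rate)

print(by_formula, by_scipy)
```

Both give roughly 0.39, so about 39% of customers should be served in under 2 minutes under this model.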

Python Code

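import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

X = np.linspace(0, 5, 5000)

# PDF of the exponential distribution; scale = 1/λ, so scale=1 means λ = 1
exponential_distribution = stats.expon.pdf(X, loc=0, scale=1)

plt.plot(X, exponential_distribution)
plt.title("Exponential Distribution")
plt.show()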

3. Log-normal Distribution

A log-normal distribution is a continuous distribution of random variables whose logarithms are distributed normally. In other words, the lognormal distribution is generated by the function of eˣ, where x (random variable) is supposed to be normally distributed.

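Here’s the PDF of a lognormal distribution, where μ and σ are the mean and standard deviation of ln(x):

f(x) = 1 / (x * σ * √(2π)) * e^(-(ln x - μ)² / (2σ²)), for x > 0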

A random variable that is lognormally distributed takes only positive real values. Consequently, lognormal distributions create curves that are right-skewed.

In real life, many natural phenomena follow a log-normal distribution, such as:

  1. The length of comments posted in Internet discussion forums follows a log-normal distribution
  2. Users’ dwell time on online articles (jokes, news) follows a log-normal distribution.
  3. In economics, there is evidence that the income of 97%–99% of the population is distributed log-normally


  • Let’s say you work as a quality control engineer. You have to perform 10,000 stress tests of your product underneath a hydraulic press, to see how long it takes before it breaks/fails. The results from all the samples may look like the lognormal distribution below.
Log-normal distribution

Just knowing that your data is log-normal distributed is valuable, because we can easily translate log-normal data to a normal distribution using the log(x) function!
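Here is a small sketch of that log(x) trick. It draws log-normal samples with NumPy, using assumed parameters μ = 1 and σ = 0.5 for the underlying normal, and then checks that the logarithm of the samples behaves like a normal distribution with those parameters.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(seed=42)
mu, sigma = 1.0, 0.5  # hypothetical parameters of the underlying normal

# Log-normal samples are strictly positive and right-skewed.
samples = rng.lognormal(mean=mu, sigma=sigma, size=100_000)

# Taking the logarithm should recover a symmetric, normal-looking sample.
log_samples = np.log(samples)

print("skew before log:", skew(samples))
print("skew after log: ", skew(log_samples))
print("mean and std after log:", log_samples.mean(), log_samples.std())
```

After the transform the skew is close to zero and the mean and standard deviation are close to μ and σ, so tools that assume normality can be applied to the transformed data.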


Python Code

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

X = np.linspace(0, 6, 500)

fig, ax = plt.subplots(figsize=(8, 5))
ax.set_xticks(np.arange(min(X), max(X)))

# (μ, σ) of the underlying normal distribution for each curve
for mean, std in [(0, 1), (0, 0.5), (1, 1.5)]:
    # In scipy, the shape s is σ and scale is exp(μ);
    # loc would only shift the curve along the x-axis
    lognorm_distribution = stats.lognorm(s=std, scale=np.exp(mean))
    lognorm_distribution_pdf = lognorm_distribution.pdf(X)
    ax.plot(X, lognorm_distribution_pdf, label=f"μ={mean}, σ={std}")

ax.set_title("Lognormal Distribution")
ax.legend()
plt.show()

4. Chi-squared distribution

The Chi-squared distribution is one of the most important and well-known distributions for data scientists and statisticians. It shows up in numerous statistical settings: the Chi-squared test for independence, the Chi-squared test for goodness of fit between data and a proposed distribution, the likelihood ratio test, etc. Its importance cannot be overstated. It is a continuous probability distribution on [0, infinity), and is also a special case of the Gamma distribution. The parameter it takes is called the degrees of freedom, and as usual, this parameter determines the shape of the distribution.

The Chi-squared test is especially used to measure how well sample data fit a proposed distribution, or to test independence. In scipy.stats , we can easily compute the Chi-squared goodness-of-fit statistic with scipy.stats.chisquare(f_obs, f_exp) , where f_obs holds the observed frequencies and f_exp the expected frequencies. It is super easy to use!
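As an illustration of that call, here is a minimal goodness-of-fit sketch. The observed counts are made-up numbers for 88 rolls of a six-sided die, and the proposed distribution is a fair die, so every face is expected equally often.

```python
from scipy.stats import chisquare

# Hypothetical counts of each face (1..6) over 88 rolls of a die.
observed = [16, 18, 16, 14, 12, 12]

# Under a fair die, every face is expected the same number of times.
expected = [sum(observed) / len(observed)] * len(observed)

result = chisquare(f_obs=observed, f_exp=expected)
statistic, p_value = result.statistic, result.pvalue

print(f"chi-squared statistic = {statistic:.3f}, p-value = {p_value:.3f}")
```

The statistic is compared against a Chi-squared distribution with 6 - 1 = 5 degrees of freedom. A large p-value like the one here means the counts give no reason to doubt that the die is fair.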

With k degrees of freedom, the chi-squared distribution is the distribution of the sum of the squares of k independent standard normal random variables.

Here’s the PDF, where k is the degrees of freedom and Γ is the gamma function:

f(x; k) = x^(k/2 - 1) * e^(-x/2) / (2^(k/2) * Γ(k/2)), for x > 0

It’s a popular probability distribution, commonly used in hypothesis testing and in the construction of confidence intervals.

Python Code

import numpy as np
from scipy.stats import chi2
import matplotlib.pyplot as plt

# Degrees of freedom for each curve
df1 = 10
df2 = 20
df3 = 30
df4 = 40
df5 = 50

# calculate range we want to display (wide enough to cover df = 50)
x = np.linspace(0, 100, 500)

# Freeze one chi-squared distribution per degrees-of-freedom value
rv1 = chi2(df1)
rv2 = chi2(df2)
rv3 = chi2(df3)
rv4 = chi2(df4)
rv5 = chi2(df5)

plt.plot(x, rv1.pdf(x), 'r', label='df = 10')
plt.plot(x, rv2.pdf(x), 'g', label='df = 20')
plt.plot(x, rv3.pdf(x), 'b', label='df = 30')
plt.plot(x, rv4.pdf(x), 'black', label='df = 40')
plt.plot(x, rv5.pdf(x), 'yellow', label='df = 50')
plt.legend(loc="upper right")
plt.show()

5. Student’s T-Distribution

It is another distribution that is heavily used in statistical tests. It is useful in helping to make a comparison between groups when our sample size is small and/or the population standard deviation is unknown. It is later widely applied to a lot of statistical settings such as the construction of confidence intervals and regression analysis. Like Chi-square distribution, it also takes in one parameter, usually referred to as the degree of freedom too. As expected, the DoF also controls the shape of the distribution.

Visually, a Student’s t-distribution looks much like a normal distribution but generally has fatter tails. Fatter tails allow for a higher dispersion of values, as there is more uncertainty. The t-statistic is related to the Student’s t-distribution in the same way that a Z-statistic is related to the standard normal distribution.

The formula for calculating the t-statistic is

t = (x̅ - μ) / (s / √n)

where x̅ and s are the sample mean and sample standard deviation, μ is the population mean, and n is the sample size.

The t-statistic, with (n-1) degrees of freedom, equals the sample mean (x̅) minus the population mean (μ), divided by the standard error of the sample (s/√n). It is compared against a critical value chosen for a significance level α.

As we can see, it is very similar to the standard normal variate or z-statistic. After all, the t-distribution approaches the normal distribution as the sample size grows.

Usually, for a sample of n, we have (n-1) degrees of freedom. So for 20 samples of distribution, we have 19 degrees of freedom. In another way, we can say the number of degrees of freedom describes the number of pieces of information used to describe a population quantity.


Student’s t-distribution with varying degrees of freedom

As we can see here, as the degrees of freedom increase, the t-distribution approaches the normal distribution, and the tails get closer to the x-axis.

The t-distribution is most often used for calculations within the realm of hypothesis testing, as we will see in the example below!


  • Suppose that a new car is introduced into the market and the manufacturer claims it has an average fuel consumption rating of 7.2L/100KM. You decide to go out and test 4 of these cars yourself, and you record their fuel ratings:
    [7.1, 7.5, 6.7, 6.9]
  • With the data you’ve collected, and using a 95% confidence interval, can you confirm the manufacturer’s claim?
[Figures: Student’s t-distribution; t-test table]

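Since we fail to reject the null hypothesis, we can state that, at the 5% significance level, our data does not provide enough evidence to dispute the car manufacturer’s claim. Note that this is weaker than confirming the claim: with only 4 cars, the test has little power to detect a real difference.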
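The calculation behind this conclusion can be sketched as follows. This is a minimal sketch assuming a two-sided one-sample t-test with the null hypothesis that the true mean is 7.2; the variable names are illustrative, and `scipy.stats.ttest_1samp` is used only as a cross-check of the manual computation.

```python
import numpy as np
from scipy import stats

ratings = np.array([7.1, 7.5, 6.7, 6.9])  # measured fuel ratings
claimed_mean = 7.2                         # manufacturer's claim (null hypothesis)
alpha = 0.05                               # significance level for 95% confidence

n = ratings.size
sample_mean = ratings.mean()
sample_sd = ratings.std(ddof=1)            # ddof=1 gives the sample standard deviation

# t-statistic: (sample mean - population mean) / standard error
t_stat = (sample_mean - claimed_mean) / (sample_sd / np.sqrt(n))

# Two-sided critical value with n - 1 degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)

# Cross-check with SciPy's built-in one-sample t-test
t_scipy, p_value = stats.ttest_1samp(ratings, popmean=claimed_mean)

reject_null = abs(t_stat) > t_crit
print(f"t = {t_stat:.3f}, critical value = {t_crit:.3f}, p = {p_value:.3f}")
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```

Here |t| is about 0.88, well below the critical value of about 3.18 for 3 degrees of freedom, so the null hypothesis is not rejected.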

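The PDF is as follows:

$$f(x) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)} \left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}}$$

Here Γ is the gamma function and n is the free parameter called “degrees of freedom”, which you may also see referred to as “d.o.f.” The t-distribution gets closer to a normal distribution for high values of n.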

Python Code

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, t

# Degrees of freedom to compare, with one line color each
dfs = [1, 2, 3, 4]
colors = ['r', 'g', 'b', 'black']

# Range of x values to display
x = np.linspace(-10, 10, 200)

# Plot the PDF of the t-distribution for each degree of freedom
for df, color in zip(dfs, colors):
    plt.plot(x, t(df).pdf(x), color=color, label=f'df = {df}')

# Standard normal PDF for comparison
plt.plot(x, norm.pdf(x), color='yellow', label='Gaussian')
plt.legend(loc="upper left")
plt.show()

I deliberately put the Gaussian distribution in the picture so that you can clearly see the difference. The Gaussian distribution has thinner tails and a higher peak than the t-distribution.

6. Beta Distribution

The beta distribution is a continuous random variable distribution over the interval [0, 1]. It has two parameters, α and β, which control the shape of the distribution, much as the mean and standard deviation do for the Gaussian distribution. They are related to the sample size and mean, but the relationship is more complicated than simple equality: the mean is α/(α+β), and α+β acts like a prior sample size. The beta distribution is often used in Bayesian inference as the prior distribution. The details cannot be explained in a few sentences, but at a high level, a prior describes what you expect before you run the random experiment. For instance, if you go to watch a soccer player, you should not expect him to score more than 5 goals in the match; 0.5–1.5 is probably a better guessing range. The prior can be thought of as a more mathematically rigorous “guessing range”. The probability density function is not particularly important here, so only the implementation will be shown:

Python Code

from scipy.stats import beta
import matplotlib.pyplot as plt

# Shape parameters of the beta distribution
a = 2
b = 3
n = 100  # number of samples to draw
x = range(n)  # sample index, used only as the x-axis

# Draw n random samples from Beta(a, b)
y = beta.rvs(a, b, size=n)

plt.plot(x, y, "ob")
plt.show()

A sample random draw according to the beta distribution appears as a scatter of points between 0 and 1.
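To make the role of the beta distribution as a prior more concrete, here is a minimal sketch of a Bayesian update. It assumes a binomial likelihood (a fixed number of success/failure trials), for which a Beta(α, β) prior updates to Beta(α + successes, β + failures). The prior parameters and the observed counts are hypothetical values chosen for illustration.

```python
from scipy.stats import beta

# Hypothetical prior belief about a success probability: Beta(2, 3)
prior_a, prior_b = 2, 3

# Hypothetical observed data: 7 successes in 10 trials
successes, trials = 7, 10

# Conjugate update for a binomial likelihood
post_a = prior_a + successes
post_b = prior_b + (trials - successes)

prior_mean = beta.mean(prior_a, prior_b)  # a / (a + b)
post_mean = beta.mean(post_a, post_b)

# Central 95% interval of the posterior
ci_low, ci_high = beta.interval(0.95, post_a, post_b)

print(f"prior mean     = {prior_mean:.2f}")
print(f"posterior mean = {post_mean:.2f}")
print(f"95% interval   = ({ci_low:.2f}, {ci_high:.2f})")
```

The posterior mean moves from the prior guess toward the observed success rate, and the interval quantifies the remaining uncertainty.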

7. Gamma Distribution

Similar to the exponential distribution, Gamma distributions are often used for waiting-time problems. The difference is that a Gamma distribution gives the probability associated with waiting for k events, instead of just one event as is the case with the exponential. The important parameters for Gamma are the number of events to wait for (k) and the rate parameter (λ).


  • You get to your favorite restaurant and there are 2 people in line ordering ahead of you. The mean order time at this restaurant is 2 minutes (or a rate of 0.5 customers per minute).
  • How likely is it that you will get to begin placing your order in the next 4 minutes?
[Figure: Gamma distribution]
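The restaurant question can be answered with the gamma CDF. The sketch below assumes that order times are independent and exponentially distributed and that the first customer starts ordering as you arrive, so your wait is the time for k = 2 orders to finish. Note that SciPy parameterizes the gamma distribution by a shape `a` and a `scale`, where the scale is 1/λ.

```python
from math import exp
from scipy.stats import gamma

k = 2        # events to wait for: the 2 customers ahead of you
rate = 0.5   # customers served per minute (mean order time of 2 minutes)
t_max = 4    # minutes

# P(waiting time <= 4 minutes) with shape a = k and scale = 1 / rate
p_scipy = gamma.cdf(t_max, a=k, scale=1 / rate)

# Closed form for k = 2: 1 - exp(-λt) * (1 + λt)
p_closed = 1 - exp(-rate * t_max) * (1 + rate * t_max)

print(f"P(order within {t_max} minutes) = {p_scipy:.3f}")
```

Under these assumptions there is roughly a 59% chance that you get to order within 4 minutes.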

Like the Beta distribution, the Gamma distribution is a two-parameter continuous probability distribution, and it is also a good model for a prior distribution. It is a conjugate prior for parameters of several distributions, such as the rate of a Poisson distribution and the precision of a Gaussian distribution. Special cases of the gamma distribution include the Exponential distribution mentioned above and the Chi-square distribution. Besides that, it also has an interesting link to information theory: among all distributions of a positive random variable with a fixed mean and a fixed mean logarithm, the Gamma distribution has the maximum entropy. If you are interested in the details, feel free to explore more!

Python Code

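from scipy.stats import gamma
import matplotlib.pyplot as plt

# Shape (number of events k) and rate (λ) of the gamma distribution
k = 2
lam = 3
n = 100  # number of samples to draw
x = range(n)  # sample index, used only as the x-axis

# Draw n random samples. SciPy takes the shape and scale = 1/λ;
# a second positional argument would be read as loc (a shift).
y = gamma.rvs(k, scale=1 / lam, size=n)

plt.plot(x, y, "ob")
plt.show()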

8. Power-law and Pareto Distribution:

In statistics, a power law is a relationship in which a relative change in one quantity results in a proportional relative change in another quantity, independent of their initial sizes. For example, when the side of a square is doubled, its area is multiplied by four.

A power-law relationship has the form

$$y = k x^{a}$$

where x and y are the variables of interest, a is the exponent of the law, and k is a constant.

The power law can be used to describe a phenomenon in which a small number of items, clustered at the top (or bottom) of a distribution, account for the vast majority of the resources (95%, for example). In other words, small occurrences are common, while large occurrences are rare.

A specific type of distribution that follows a power law is the Pareto distribution. The Pareto principle states that roughly 80% of the effects come from 20% of the causes. For example, about 80% of the world’s wealth is held by 20% of the people. Similarly, during text preprocessing we often see that about 20% of the unique words account for 80% of the word occurrences in a text corpus.

Pareto Distribution:

The Pareto distribution is highly skewed and has a slowly decaying tail. It has two parameters: a shape parameter α (the tail index) and a scale parameter x_m. When the distribution is used to model wealth distribution, the parameter α is called the Pareto index.
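The link between the Pareto index and the 80/20 rule can be made explicit. For a Pareto distribution with α > 1, the share of the total held by the top fraction p of the population is p^(1 − 1/α), which follows from its Lorenz curve. The sketch below uses a hypothetical helper, `top_share`, to show that the 80/20 split corresponds to α = log 5 / log 4 ≈ 1.16.

```python
import math

def top_share(p, alpha):
    """Share of the total held by the top fraction p under a Pareto
    distribution with shape alpha (requires alpha > 1)."""
    return p ** (1 - 1 / alpha)

# Pareto index for which the top 20% hold 80% of the total
alpha_80_20 = math.log(5) / math.log(4)
share = top_share(0.2, alpha_80_20)

print(f"alpha = {alpha_80_20:.3f}, share of top 20% = {share:.2f}")

# A larger alpha means a less extreme concentration
print(f"alpha = 2.000, share of top 20% = {top_share(0.2, 2.0):.2f}")
```

Smaller values of α give heavier tails and therefore a more unequal split.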

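The probability density function of the Pareto distribution is

$$f(x) = \frac{\alpha\, x_m^{\alpha}}{x^{\alpha + 1}} \quad \text{for } x \ge x_m,$$

and f(x) = 0 for x < x_m.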

When plotted on linear axes, the distribution assumes the familiar J-shaped curve, which approaches each of the orthogonal axes asymptotically. All segments of the curve are self-similar (subject to appropriate scaling factors). When plotted in a log-log plot, the distribution is represented by a straight line.
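The straight line on a log-log plot can be verified numerically: taking logarithms of the density gives log f(x) = log(α x_m^α) − (α + 1) log x, a line with slope −(α + 1). The sketch below uses hypothetical parameter values and `scipy.stats.pareto`, whose shape argument `b` plays the role of α and whose `scale` plays the role of x_m.

```python
import numpy as np
from scipy.stats import pareto

alpha = 3.0  # hypothetical shape parameter (tail index)
x_m = 1.0    # hypothetical scale parameter

# Logarithmically spaced points inside the support x > x_m
x = np.geomspace(1.1 * x_m, 100 * x_m, 50)

# Density from the closed-form expression and from SciPy
pdf_formula = alpha * x_m**alpha / x**(alpha + 1)
pdf_scipy = pareto.pdf(x, b=alpha, scale=x_m)

# Fit a straight line to log(pdf) against log(x)
slope, intercept = np.polyfit(np.log(x), np.log(pdf_scipy), 1)

print(f"fitted slope = {slope:.3f}, expected = {-(alpha + 1):.3f}")
```

Plotting `x` against `pdf_scipy` with `plt.loglog` would show the same straight line visually.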


Some quantities that can be modeled with a Pareto distribution:

  1. The lifetime of a manufactured item with a certain warranty period.
  2. The size of meteorites.
  3. The standardized price returns on individual stocks.



Understanding the distributions of data is important because it can give us insights and open the door to performing further statistical analysis. This article covers some of the most common data distributions, but it is by no means a comprehensive list.

