For an aspiring data scientist, statistics is a must-learn subject. It provides the tools to break complex real-world problems down so that data scientists can mine useful trends, changes, and patterns in the data and fit an appropriate model that yields the best results. Whenever we get a new dataset, we must understand its patterns and underlying probability distribution for further optimization and treatment during Exploratory Data Analysis (EDA). During EDA, we examine the behavior of the data against different probability distributions; if the data matches or resembles one of them, we can treat it accordingly for a better result.
Data scientists deal with many kinds of data, such as categorical, numerical, text, image, voice, and many more. Each has its own methods of analysis and representation. Here we are going to consider numerical data for further analysis. Numerical data can be of two types: discrete (taking specific, countable values) and continuous (taking values within a range).
We can plot this numerical data, visualize it, and draw conclusions based on its pattern, behavior, and the type of probability distribution it follows. Before going deeper, let's become familiar with some terminology.
What is a Random Variable?
A variable whose measured value is associated with chance is called a random variable. The value of a random variable is unknown in advance, and its outcomes are obtained through experiments. It can be discrete (when the event has a specific, countable result) or continuous (when the result falls within a particular range).
What is Probability Distribution?
A Probability Distribution of a random variable is a list of all possible outcomes with corresponding probability values.
Note: The value of a probability always lies between 0 and 1.
Let’s understand the probability distribution by an example:
When two six-sided dice are rolled, let a possible outcome of the roll be denoted by (a, b), where
a: number on the top of the first dice
b: number on the top of the second dice
Then the possible values of the sum a + b are:
|Sum of a + b|(a, b)|
|---|---|
|2|(1,1)|
|3|(1,2), (2,1)|
|4|(1,3), (2,2), (3,1)|
|5|(1,4), (2,3), (3,2), (4,1)|
|6|(1,5), (2,4), (3,3), (4,2), (5,1)|
|7|(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)|
|8|(2,6), (3,5), (4,4), (5,3), (6,2)|
|9|(3,6), (4,5), (5,4), (6,3)|
|10|(4,6), (5,5), (6,4)|
|11|(5,6), (6,5)|
|12|(6,6)|
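The table above, and the probability of each sum, can be reproduced with a short sketch (variable names here are illustrative):

```python
from collections import Counter
from itertools import product

# Enumerate all 36 equally likely outcomes of rolling two dice
outcomes = list(product(range(1, 7), repeat=2))
counts = Counter(a + b for a, b in outcomes)

# Probability of each possible sum
pmf = {s: count / 36 for s, count in sorted(counts.items())}
for s, prob in pmf.items():
    print(s, prob)
```

For example, a sum of 7 has 6 favorable outcomes, so its probability is 6/36 = 1/6.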
What is a Probability Mass Function (PMF)?
The distribution of a discrete random variable is called the probability mass function (PMF). The PMF of a discrete random variable X is defined as f(x) = P(X = x), where the probabilities over all possible values of x sum to 1.
What is a Probability Density Function (PDF)?
The distribution of a continuous random variable is called the probability density function (PDF). For a continuous random variable X whose values range over an interval of numbers [a, b], the PDF f is defined by P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx.
Alright, now let’s take a look at some data distributions!
The Bernoulli distribution is one of the easiest distributions to understand. It can be used as a starting point to derive more complex distributions. Any event with a single trial and only two possible outcomes follows a Bernoulli distribution. Flipping a coin or choosing between True and False in a quiz are examples of a Bernoulli distribution: they have a single trial and only two outcomes. Let's assume you flip a coin once; this is a single trial. The only two possible outcomes are heads or tails. This is an example of a Bernoulli distribution.
Usually, when following a Bernoulli distribution, we have the probability of one of the outcomes (p). From (p), we can deduce the probability of the other outcome by subtracting it from the total probability (1), represented as (1-p).
It is represented by Bern(p), where p is the probability of success. The expected value of a Bernoulli trial x is E(x) = p, and similarly the Bernoulli variance is Var(x) = p(1 − p).
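As a quick sanity check of these formulas (using an illustrative p = 0.3), scipy.stats confirms E(x) = p and Var(x) = p(1 − p):

```python
from scipy.stats import bernoulli

p = 0.3  # illustrative success probability
expected = bernoulli.mean(p)   # E(x) = p
variance = bernoulli.var(p)    # Var(x) = p * (1 - p)
print(expected, variance)
```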
The Bernoulli Distribution captures the probability of receiving one of two outcomes (often called success or failure) given a single trial. It is actually just a special case of the Binomial distribution where n=1.
```python
from scipy.stats import bernoulli
import matplotlib.pyplot as plt

# Specified probability parameter
p = 0.5
x = [i for i in range(0, 10)]
# Sample according to the Bernoulli distribution
y = bernoulli.rvs(p, size=10)
plt.plot(x, y, "ob")
plt.show()
```
The binomial distribution is just taking Bernoulli one step further. We still have trials that result in one of two outcomes (success or failure), but now we are looking at the probability that a specific number of outcomes (x) occur in n trials instead of a single trial.
The Binomial Distribution can be thought of as the sum of outcomes of an event following a Bernoulli distribution. Therefore, Binomial Distribution is used in binary outcome events, and the probability of success and failure is the same in all successive trials. An example of a binomial event would be flipping a coin multiple times to count the number of heads and tails.
Binomial vs Bernoulli distribution.
The difference between these distributions can be explained through an example. Consider you’re attempting a quiz that contains 10 True/False questions. Trying a single T/F question would be considered a Bernoulli trial, whereas attempting the entire quiz of 10 T/F questions would be categorized as a Binomial trial. The main characteristics of Binomial Distribution are:
A binomial distribution is represented by B(n, p), where n is the number of trials and p is the probability of success in a single trial. A Bernoulli distribution can be written as B(1, p), since it has only one trial. The expected value of a binomial variable x, the number of successes, is E(x) = np; similarly, the variance is Var(x) = np(1 − p).
Let's consider the probability of success (p) and the number of trials (n). We can then calculate the likelihood of x successes in these n trials using the formula below: P(X = x) = C(n, x) pˣ (1 − p)ⁿ⁻ˣ
For example, suppose that a candy company produces both milk chocolate and dark chocolate candy bars. The total products contain half milk chocolate bars and half dark chocolate bars. Say you choose ten candy bars at random and choosing milk chocolate is defined as a success. The probability distribution of the number of successes during these ten trials with p = 0.5 is shown here in the binomial distribution graph:
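For the candy example, the probability of picking, say, exactly 5 milk chocolate bars in 10 random picks can be computed directly (a sketch using scipy.stats):

```python
from scipy.stats import binom

n, p = 10, 0.5  # 10 picks, equal mix of milk and dark chocolate
prob_five = binom.pmf(5, n, p)
print(prob_five)  # C(10, 5) * 0.5**10 = 0.24609375
```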
Let’s understand the Binomial Distribution by an example,
Consider the experiment of Picking Balls
Let there be 8 white balls and 2 black balls. What is the probability of drawing 3 white balls, given that the probability of selecting a white ball is 0.6?
Q. Under which conditions can a binomial distribution approximate a normal distribution?
```python
from scipy.stats import binom
import matplotlib.pyplot as plt

# Specified probability parameters
p = 0.5
n = 10
x = [i for i in range(0, n)]
# Sample according to the binomial distribution
y = binom.rvs(n, p, size=10)
plt.plot(x, y, "ob")
plt.show()
```
In statistics, uniform distribution refers to a statistical distribution in which all outcomes are equally likely. Consider rolling a six-sided die. You have an equal probability of obtaining all six numbers on your next roll, i.e., obtaining precisely one of 1, 2, 3, 4, 5, or 6, equaling a probability of 1/6, hence an example of a discrete uniform distribution.
As a result, the uniform distribution graph contains bars of equal height representing each outcome. In our example, the height is a probability of 1/6 (0.166667).
A discrete uniform distribution is a simple distribution where we have a set of potential outcomes (n), each of which has an equal likelihood of occurring.
Uniform distribution is represented by the function U(a, b), where a and b represent the starting and ending values, respectively. Similar to a discrete uniform distribution, there is a continuous uniform distribution for continuous variables.
The drawback of this distribution is that it often provides no useful information. Using our example of a rolled die, we get an expected value of 3.5, which gives us no accurate intuition since there is no such thing as half a number on a die. Since all values are equally likely, the distribution gives us no real predictive power.
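The 3.5 expected value for a fair die can be checked with scipy.stats (a minimal sketch):

```python
from scipy import stats

die = stats.randint(1, 7)   # discrete uniform on {1, ..., 6}
print(die.mean())           # expected value = 3.5
print(die.pmf(3))           # every face has probability 1/6
```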
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Discrete uniform over the faces of a die
X_discrete = np.arange(1, 7)
discrete_uniform = stats.randint(1, 7)
discrete_uniform_pmf = discrete_uniform.pmf(X_discrete)

# Plot the PMF
fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(X_discrete, discrete_uniform_pmf)
ax.set_xlabel("X")
ax.set_ylabel("Probability")
ax.set_title("Discrete Uniform Distribution")
plt.show()
```
The Poisson distribution is used to answer the question “how many times is an event likely to occur over a given period of time?”
The Poisson distribution deals with the frequency with which an event occurs within a specific interval. Instead of the probability of an event, the Poisson distribution requires knowing how often it happens in a particular period or distance. For example, a cricket chirps two times in 7 seconds on average; we can use the Poisson distribution to determine the likelihood of it chirping five times in 15 seconds. A Poisson process is represented with the notation Po(λ), where λ represents the expected number of events that take place in the period. The expected value and variance of a Poisson process are both λ, and X represents the discrete random variable. A Poisson distribution can be modeled using the following formula: P(X = k) = (λᵏ e^(−λ)) / k!
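For the cricket example above, the average rate is 2 chirps per 7 seconds, so over 15 seconds λ = 2 × 15/7 ≈ 4.29; a sketch of the calculation:

```python
from scipy.stats import poisson

lam = 2 * 15 / 7  # expected chirps in a 15-second window
prob_five_chirps = poisson.pmf(5, lam)
print(prob_five_chirps)  # ≈ 0.166
```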
The main characteristics which describe the Poisson Processes are:
Under which conditions can a binomial distribution approach a Poisson distribution?
Many real-life datasets that we encounter as data scientists follow the Poisson distribution, such as:
Suppose patients arrive at a hospital at an expected rate of 6 per day. What is the probability that exactly five patients will visit the hospital on a given day?
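With λ = 6 expected patients per day, the hospital question reduces to a single PMF evaluation:

```python
from scipy.stats import poisson

prob_five_patients = poisson.pmf(5, mu=6)
print(prob_five_patients)  # ≈ 0.1606
```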
|Poisson Distribution|Binomial Distribution|
|---|---|
|The number of trials is infinite|The number of trials is fixed|
|Unlimited number of possible outcomes|Only two possible outcomes (success or failure)|
|Mean = Variance|Mean > Variance|
If an event occurs at a fixed rate in time, then the probability of observing a number (k) of events in a time window may be described with the Poisson distribution. For example, customers may arrive at a cafe at an average rate of 3 per minute. We can use the Poisson distribution to calculate the probability of 9 customers arriving in 2 minutes. Since the rate is 3 per minute, λ for the 2-minute window is 3 × 2 = 6, and k in our example is 9. We can use SciPy to solve our example problem:

```python
from scipy import stats

print(stats.poisson.pmf(k=9, mu=6))
# ≈ 0.0688
```
The curve of a Poisson distribution is similar to a normal distribution and λ marks the peak. Let’s plot one in Python to see how this looks visually.
```python
from scipy import stats
import matplotlib.pyplot as plt

# Generate random values from a Poisson distribution with a sample size of 500
X = stats.poisson.rvs(mu=3, size=500)

plt.subplots(figsize=(8, 5))
plt.hist(X, density=True, edgecolor="black")
plt.title("Poisson Distribution")
plt.show()
```
Suppose we are surveying how many votes an independent candidate received after the polls close. Outside a polling booth, we start asking people whether they voted for the candidate, and each time we hear the name of some other candidate, until finally someone says they voted for our independent candidate. Here the geometric distribution is represented by the number of people we had to poll before finding someone who voted for our candidate.
Basically, it represents the number of failures before the first success in a series of Bernoulli trials (each of which always has two outcomes).
We can define the PMF as P(X = k) = (1 − p)ᵏ p, where p is the probability of success and k is the number of failures before the first success.
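Note that scipy.stats.geom implements the variant that counts the trial on which the first success occurs (so its support starts at k = 1); a sketch with an illustrative p = 0.5:

```python
from scipy.stats import geom

p = 0.5  # illustrative probability of success
# Probability that the first success occurs on trial k
first_trial = geom.pmf(1, p)   # p
third_trial = geom.pmf(3, p)   # (1 - p)**2 * p
print(first_trial, third_trial)
```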
As the name suggests, the Multinomial distribution is related to the binomial distribution: in fact, it is a generalization of the binomial distribution. In the binomial distribution, we only have 2 possible outcomes. What if our experiments have multiple outcomes? An analogy: the Binomial distribution is like a 10-round coin-tossing game, while the Multinomial distribution is like choosing a number randomly from 1, 2, 3 with probabilities p1, p2, p3 = (1 − p1 − p2). Its probability mass function is: P(X₁ = x₁, …, X_k = x_k) = n! / (x₁! ⋯ x_k!) · p₁^x₁ ⋯ p_k^x_k
Where n is the number of trials and x_i is the number of times outcome i occurs.
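The PMF can be evaluated for a particular combination of counts (the probabilities and counts below are illustrative):

```python
from scipy.stats import multinomial

n = 10
p = [0.3, 0.2, 0.5]  # illustrative outcome probabilities
# Probability of observing exactly 2, 3, and 5 of the three outcomes
prob = multinomial.pmf([2, 3, 5], n=n, p=p)
print(prob)  # ≈ 0.0567
```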
Again, the Python implementation is handy with the help of scipy.stats:

```python
from scipy.stats import multinomial
import matplotlib.pyplot as plt

# Specified probability parameters
p = [0.3, 0.2, 0.5]
n = 10
x = [i for i in range(0, n)]
# Sample according to the multinomial distribution
y = multinomial.rvs(n, p, size=10)
plt.plot(x, y, "ob")
plt.show()
```
If you compare the image of the Multinomial distribution with that of the Binomial distribution, they look similar, except that for each x value we have multiple random outcomes in the Multinomial case. (Why are there 3 values for x = 0 but only 2 for x = 8? The reason is that there is a duplicate value at x = 8, and matplotlib overplots it.)
Note: There are many other discrete probability distributions, such as the negative binomial and hypergeometric distributions. These also carry a lot of weight in statistics, and it is good to have an idea of them from a data science perspective. But we will conclude the discrete part here with the distributions above.
Now that we are talking about continuous values, we can no longer say “what is the likelihood of this exact value occurring” because technically there are no exact values in a continuous space. Instead, we ask the question “what is the likelihood of a sample falling within a given range of values?”
The normal distribution is the most used distribution in data science. In a normal distribution graph, data is symmetrically distributed with no skew. When plotted, the data follows a bell shape, with most values clustering around a central region and tapering off as they go further away from the center.
The normal distribution frequently appears in nature and life in various forms. For example, the scores of a quiz follow a normal distribution. Many of the students scored between 60 and 80 as illustrated in the graph below. Of course, students with scores that fall outside this range are deviating from the center.
It is defined by a mean value (μ) and a standard deviation (σ).
Here, you can witness the “bell-shaped” curve around the central region, indicating that most data points exist there. The normal distribution is represented as N(µ, σ), where µ represents the mean and σ represents the standard deviation. The expected value of a normal distribution is equal to its mean.
The probability density function of a normal distribution is as follows: f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))
The normal distribution is the backbone of statistics and data science. Many machine learning models work well with data that follows a normal distribution, such as:
The sigmoid function tends to work well with normally distributed data. Some data may exhibit another kind of distribution, which can often be transformed into a normal distribution using logarithms or square roots.
Let’s say I am friends with every single person in the world and everyone volunteers their height information to me. The mean height of the population turns out to be 164.58cm with a standard deviation of 8.83cm. Given the information, what is the probability of someone being taller than 175cm?
**NOTE** — for calculations in a continuous space, we sometimes approximate probabilities by calculating standardized scores and finding their associated probabilities in a lookup table. For the Gaussian distribution, we use what is called the z-score: z = (x − μ) / σ.
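For the height example (μ = 164.58, σ = 8.83), the z-score of 175 cm is (175 − 164.58)/8.83 ≈ 1.18, and the survival function gives the probability of being taller than that:

```python
from scipy.stats import norm

mu, sigma = 164.58, 8.83
z = (175 - mu) / sigma        # z-score of 175 cm
prob_taller = norm.sf(z)      # P(height > 175 cm)
print(z, prob_taller)         # ≈ 1.18, ≈ 0.119
```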
Empirical Rule is often called the 68 – 95 – 99.7 rule or Three Sigma Rule. It states that on a Normal Distribution:
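We can verify the Three Sigma Rule numerically with the normal CDF:

```python
from scipy.stats import norm

# Probability mass within 1, 2, and 3 standard deviations of the mean
coverages = [norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)]
print(coverages)  # ≈ [0.6827, 0.9545, 0.9973]
```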
```python
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np

x1 = np.arange(-20, 20, 0.1)
y1 = norm.pdf(x1, 0, 5)
y2 = norm.pdf(x1, 0, 3)
y3 = norm.pdf(x1, 5, 3)

plt.plot(x1, y2)
plt.plot(x1, y1)
plt.plot(x1, y3)
plt.legend(["Standard deviation 3", "Standard deviation 5", "Mean value of 5"], loc="upper left")
plt.show()
```
The exponential distribution is one of the most widely used continuous distributions. It is used to model the time between events. For example, in physics, it is often used to model radioactive decay; in engineering, the time until a defective part appears on an assembly line; and in finance, the likelihood of the next default in a portfolio of financial assets. Another common application of the exponential distribution is survival analysis (e.g., the expected life of a device/machine).
The exponential distribution is commonly represented as Exp(λ), where λ is the distribution parameter, often called the rate parameter. We can find the value of λ with the formula λ = 1/μ, where μ is the mean.
In the exponential distribution, we are interested in the waiting time until the next occurrence, rather than the number of occurrences. For the bus stop example, we no longer care about the number of arrivals in the next 20 minutes; we only care when the next bus will arrive.
The exponential distribution, also called the inverse of the Poisson distribution, is used to model the time elapsed between two events. For example, the amount of time from now until an earthquake occurs follows an exponential distribution. Suppose an earthquake starts at time t and ends at (t + 1); if we plot the distribution of the time between t and (t + 1), it will follow the exponential distribution.
The random variables in an exponential distribution take many small values and few large values. For example, consider purchase amounts in a grocery supermarket: most people buy small amounts, while only a few people make large purchases. This is a general tendency.
How is it an inverse case of Poisson Distribution?
Let’s take the below two cases.
In the above cases, condition 1 asks for the number of cars per hour: it deals with a count of cars. Condition 2, however, concerns the time interval between car arrivals. If condition 1 follows the Poisson distribution, then condition 2 follows the exponential distribution.
Example 2: the number of hipsters arriving at a bar in one minute versus the number of minutes between new arrivals at the same bar. The first follows the Poisson distribution, whereas the second is exponential.
The probability density function of an exponential distribution is as follows: f(x) = λe^(−λx) for x ≥ 0
λ is the rate parameter and x is the random variable.
Suppose we measure the life of a mobile phone. Then λ is the rate of failure of the phone at time t, given that it has survived up to time t.
The important parameter for this distribution is the rate parameter (λ), which is the rate of events done/per unit of time.
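As a sketch of the survival-analysis use case (the failure rate below is an assumed, illustrative value): with λ = 0.5 failures per year, the probability a phone survives past 2 years is e^(−λt) = e^(−1):

```python
import math
from scipy.stats import expon

lam = 0.5  # illustrative failure rate (failures per year)
t = 2
# P(lifetime > t); scipy parameterizes the exponential by scale = 1/lambda
survival = expon.sf(t, scale=1 / lam)
print(survival)            # e**(-lam * t) = e**(-1) ≈ 0.3679
print(math.exp(-lam * t))  # same value, computed directly
```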
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

X = np.linspace(0, 5, 5000)
exponential_distribution = stats.expon.pdf(X, loc=0, scale=1)

plt.subplots(figsize=(8, 5))
plt.plot(X, exponential_distribution)
plt.title("Exponential Distribution")
plt.show()
```
A log-normal distribution is a continuous distribution of random variables whose logarithms are distributed normally. In other words, the lognormal distribution is generated by the function of eˣ, where x (random variable) is supposed to be normally distributed.
Here’s the PDF of a lognormal distribution: f(x) = (1 / (xσ√(2π))) e^(−(ln x − μ)² / (2σ²)) for x > 0
A random variable that is lognormally distributed takes only positive real values. Consequently, lognormal distributions create curves that are right-skewed.
In real life, many natural phenomena that occur follow a log-normal distribution. Such as:
Just knowing that your data is log-normal distributed is valuable, because we can easily translate log-normal data to a normal distribution using the log(x) function!
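A quick numerical check of this fact (sample size and seed are illustrative): taking the log of lognormal samples recovers a normal distribution:

```python
import numpy as np
from scipy import stats

# Lognormal samples whose logs are standard normal
samples = stats.lognorm.rvs(s=1, scale=1, size=100_000, random_state=0)
logs = np.log(samples)
print(logs.mean(), logs.std())  # ≈ 0 and ≈ 1, i.e. N(0, 1)
```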
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

X = np.linspace(0, 6, 500)

fig, ax = plt.subplots(figsize=(8, 5))

std = 1
mean = 0
lognorm_distribution = stats.lognorm([std], loc=mean)
plt.plot(X, lognorm_distribution.pdf(X), label="μ=0, σ=1")
ax.set_xticks(np.arange(min(X), max(X)))

std = 0.5
mean = 0
lognorm_distribution = stats.lognorm([std], loc=mean)
plt.plot(X, lognorm_distribution.pdf(X), label="μ=0, σ=0.5")

std = 1.5
mean = 1
lognorm_distribution = stats.lognorm([std], loc=mean)
plt.plot(X, lognorm_distribution.pdf(X), label="μ=1, σ=1.5")

plt.title("Lognormal Distribution")
plt.legend()
plt.show()
```
The Chi-squared distribution belongs to one of the most important and well-known distributions for data scientists and statisticians. It shows up in numerous statistical settings: Chi-squared test for independence, Chi-squared for quality of fit between data and proposed distribution, likelihood ratio test, etc. Its importance cannot be overstated. It is a continuous probability distribution on [0, infinity), and is also a special instance of Gamma distribution. The parameter it takes in is called degrees of freedom, and as usual, this parameter will determine the shape of the distribution.
The Chi-square test is especially used to measure the goodness of fit of sample data against a proposed distribution, or to test independence. In scipy.stats, we can easily compute the Chi-squared test statistic with scipy.stats.chisquare(your_sample, expected_distribution). It is super easy to use!
With k degrees of freedom, the chi-squared distribution is the distribution of the sum of the squares of k independent standard normal random variables.
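This definition can be checked by simulation: the sum of squares of k standard normals has mean k and variance 2k, matching the chi-squared distribution with k degrees of freedom (sample size and seed below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5  # degrees of freedom
# Sum of squares of k independent standard normals, repeated many times
samples = (rng.standard_normal((100_000, k)) ** 2).sum(axis=1)
print(samples.mean(), samples.var())  # ≈ k and ≈ 2k
```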
Here’s the PDF: f(x) = x^(k/2 − 1) e^(−x/2) / (2^(k/2) Γ(k/2)) for x > 0
It’s a popular probability distribution, commonly used in hypothesis testing and in the construction of confidence intervals.
```python
import numpy as np
from scipy.stats import chi2
import matplotlib.pyplot as plt

# Degrees-of-freedom parameters
df1, df2, df3, df4, df5 = 10, 20, 30, 40, 50

# Calculate the range we want to display
x = np.linspace(0, 30, 500)

# Chi-squared distributions for each degree of freedom
rv1 = chi2(df1)
rv2 = chi2(df2)
rv3 = chi2(df3)
rv4 = chi2(df4)
rv5 = chi2(df5)

plt.plot(x, rv1.pdf(x), 'r', label='df = 10')
plt.plot(x, rv2.pdf(x), 'g', label='df = 20')
plt.plot(x, rv3.pdf(x), 'b', label='df = 30')
plt.plot(x, rv4.pdf(x), 'black', label='df = 40')
plt.plot(x, rv5.pdf(x), 'yellow', label='df = 50')
plt.legend(loc="upper left")
plt.show()
```
It is another distribution that is heavily used in statistical tests. It is useful in helping to make a comparison between groups when our sample size is small and/or the population standard deviation is unknown. It is later widely applied to a lot of statistical settings such as the construction of confidence intervals and regression analysis. Like Chi-square distribution, it also takes in one parameter, usually referred to as the degree of freedom too. As expected, the DoF also controls the shape of the distribution.
Visually, a Student’s t-distribution looks much like a normal distribution but generally has fatter tails. Fatter tails allow for a higher dispersion of variables, as there is more uncertainty. The t-statistic is related to the Student’s t-distribution in the same way that a z-statistic is related to the standard normal distribution.
The formula for the t-statistic is: t = (x̅ − μ) / (s / √n)
That is, t with (n − 1) degrees of freedom equals the sample mean (x̅) minus the population mean (μ), divided by the standard error of the sample (s / √n, where s is the sample standard deviation).
As we can see, it is very similar to the standard normal variate or z-statistic. After all, this is an approximation of normal distribution.
Usually, for a sample of size n, we have (n − 1) degrees of freedom; so for a sample of 20 observations, we have 19 degrees of freedom. Put another way, the number of degrees of freedom is the number of independent pieces of information used to estimate a population quantity.
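The formula above can be computed by hand and checked against scipy's one-sample t-test (the sample values here are illustrative):

```python
import numpy as np
from scipy import stats

sample = np.array([9.8, 10.2, 10.1, 9.9, 10.4, 9.7])  # illustrative data
mu = 10  # hypothesized population mean

n = len(sample)
t_manual = (sample.mean() - mu) / (sample.std(ddof=1) / np.sqrt(n))
t_scipy, p_value = stats.ttest_1samp(sample, mu)
print(t_manual, t_scipy)  # the two t-statistics agree
```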
As we can see, increasing the degrees of freedom brings the t-distribution closer to the normal distribution, and its tails move closer to the x-axis.
The t-distribution is most often used for calculations within the realm of hypothesis testing. For example, if we test a car manufacturer's claim against a small sample of measurements and fail to reject the null hypothesis, we can state with 95% confidence that the data is consistent with the claim.
The PDF is as follows: f(t) = Γ((n + 1)/2) / (√(nπ) Γ(n/2)) · (1 + t²/n)^(−(n + 1)/2)
n is the free parameter called “degrees of freedom” (you may also see it abbreviated as “d.o.f.”). The t-distribution gets closer to a normal distribution for large values of n.
```python
import numpy as np
from scipy.stats import t, norm
import matplotlib.pyplot as plt

# Degrees-of-freedom parameters
df1, df2, df3, df4 = 1, 2, 3, 4

# Calculate the range we want to display
x = np.linspace(-10, 10, 200)

# t-distributions for each degree of freedom
rv1 = t(df1)
rv2 = t(df2)
rv3 = t(df3)
rv4 = t(df4)

plt.plot(x, rv1.pdf(x), 'r', label='df = 1')
plt.plot(x, rv2.pdf(x), 'g', label='df = 2')
plt.plot(x, rv3.pdf(x), 'b', label='df = 3')
plt.plot(x, rv4.pdf(x), 'black', label='df = 4')
plt.plot(x, norm.pdf(x), 'yellow', label='Gaussian')
plt.legend(loc="upper left")
plt.show()
```
I deliberately put the Gaussian distribution in the picture so that you can see the difference clearly: the Gaussian distribution has thinner tails than the t-distribution, as well as a larger peak.
The Beta distribution is a continuous random-variable distribution over the interval [0, 1]. It has two parameters, α and β, which, just like the mean and standard deviation in the Gaussian distribution, control the shape of the distribution. They are related to the sample size and mean, but the relationship is more complicated than simple equality. The Beta distribution is often used in Bayesian inference as the prior distribution. The details cannot be explained in a few sentences, but a high-level view of a prior is what you expect before you run the random experiments. For instance, if you go watch a soccer player, you should not expect him to score over 5 goals in a match; probably 0.5–1.5 is a good guessing range. The prior can be thought of as a more mathematically rigorous “guessing range”. The probability density function is not particularly important here, so only the implementation will be shown:
```python
from scipy.stats import beta
import matplotlib.pyplot as plt

# Specified shape parameters
a = 2
b = 3
n = 100
x = [i for i in range(0, n)]
# Sample according to the Beta distribution
y = beta.rvs(a, b, size=n)
plt.plot(x, y, "ob")
plt.show()
```
A sample random draw according to beta distribution will look like this:
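A minimal sketch of the prior idea: with a Beta(2, 3) prior on a success probability, observing s successes and f failures in Bernoulli trials yields a Beta(2 + s, 3 + f) posterior (the standard conjugate update; the observation counts here are illustrative):

```python
from scipy.stats import beta

a, b = 2, 3   # prior parameters
s, f = 7, 3   # illustrative observed successes and failures

prior_mean = a / (a + b)
posterior = beta(a + s, b + f)
print(prior_mean)        # 0.4
print(posterior.mean())  # (a + s) / (a + b + s + f) = 0.6
```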
Similar to the exponential distribution, the Gamma distribution is often used for waiting-time problems. The difference is that the Gamma distribution is used to find the probability associated with waiting for k events, instead of just one event as in the exponential case. The important parameters for Gamma are the number of events to wait for (k) and the rate parameter (λ).
Like the Beta distribution, the Gamma distribution is also a two-parameter continuous probability distribution, and it is also a good model for a prior distribution. It is a conjugate prior for many distributions: the Gaussian distribution, the Poisson distribution, etc. Special cases of the Gamma distribution include the exponential distribution mentioned above and the Chi-square distribution discussed above. Besides that, it has an interesting link to information theory: among all distributions with a fixed mean and mean-of-log, the Gamma distribution has the maximum entropy. If you are interested in the details, feel free to explore more!
```python
from scipy.stats import gamma
import matplotlib.pyplot as plt

# Specified shape and scale parameters
a = 2
b = 3
n = 100
x = [i for i in range(0, n)]
# Sample according to the Gamma distribution (shape a, scale b)
y = gamma.rvs(a, scale=b, size=n)
plt.plot(x, y, "ob")
plt.show()
```
In statistics, a power law describes a relationship in which a relative change in one quantity results in a proportional relative change in another. For example, when the side of a square doubles, its area increases by a factor of four.
A power-law distribution has the form f(x) ∝ x^(−α), where α is a constant exponent.
The power law can be used to describe a phenomenon where a small number of items cluster at the top of a distribution (or at the bottom), taking up 95% of the resources. In other words, it implies that small occurrences are common, while large occurrences are rare.
A specific type of distribution that follows the power law is the Pareto distribution. The Pareto principle states that 80% of the effects come from 20% of the causes. For example, 80% of the world's wealth is held by 20% of the people, and during text preprocessing we may find that 80% of the words in a corpus come from only 20% of the unique words.
The Pareto distribution is highly skewed and has a slowly decaying tail. It has two parameters: a shape parameter α (the tail index) and a scale parameter x_m. When the distribution is used to model wealth distribution, α is called the Pareto index.
So the probability density function of the Pareto distribution is: f(x) = α·x_m^α / x^(α+1) for x ≥ x_m (and 0 otherwise).
When plotted on linear axes, the distribution assumes the familiar J-shaped curve, which approaches each of the orthogonal axes asymptotically. All segments of the curve are self-similar (subject to appropriate scaling factors). When plotted in a log-log plot, the distribution is represented by a straight line.
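A sketch with scipy.stats.pareto (which fixes x_m = 1 by default and takes α as its shape argument): the density is α/x^(α+1), and plotting it on log-log axes shows the straight line described above (the tail index below is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pareto

alpha = 3  # illustrative tail index; the scale (x_m) defaults to 1
x = np.linspace(1, 10, 500)

# Density f(x) = alpha / x**(alpha + 1) for x >= x_m = 1
density_at_2 = pareto.pdf(2, alpha)
print(density_at_2)  # 3 / 2**4 = 0.1875

plt.loglog(x, pareto.pdf(x, alpha))  # a straight line on log-log axes
plt.title("Pareto Distribution (log-log)")
plt.show()
```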
Understanding the distributions of data is important because it can give us insights and open the door to performing further statistical analysis. This article covers some of the most common data distributions, but it is by no means a comprehensive list.