
Making data pipelines in Pandas using .pipe() method

Real-life data is usually messy and requires a lot of preprocessing before it is ready for use. Pandas, one of the most widely used data analysis and manipulation libraries, offers several functions for preprocessing raw data.

In this article, we will focus on one particular function that organizes multiple preprocessing operations into a single one: the pipe function.

Let’s start with creating a data frame with mock data.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [100, 100, 101, 102, 103, 104, 105, 106],
    "A": [1, 2, 3, 4, 5, 2, np.nan, 5],
    "B": [45, 56, 48, 47, 62, 112, 54, 49],
    "C": [1.2, 1.4, 1.1, 1.8, np.nan, 1.4, 1.6, 1.5]
})

Our data frame contains some missing values indicated by a standard missing value representation (i.e. NaN). The id column includes duplicate values. Last but not least, 112 in column B seems like an outlier.

These are some of the typical issues in real-life data. We will be creating a pipe that handles the issues we have just described.

For each task, we need a function. Thus, the first step is to create the functions that will be placed in the pipe.

It is important to note that the functions used in the pipe need to take a data frame as an argument and return a data frame.
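To make this contract concrete, here is a minimal sketch showing that `df.pipe(func, arg)` is just another way of writing `func(df, arg)`. The helper `add_constant` and the frame `toy` are hypothetical names used only for illustration.

```python
import pandas as pd

def add_constant(df, value):
    """Hypothetical helper: take a data frame, return a new data frame."""
    return df + value

toy = pd.DataFrame({"x": [1, 2, 3]})

# These two calls are equivalent: pipe passes the frame as the first argument.
direct = add_constant(toy, 10)
piped = toy.pipe(add_constant, 10)

print(piped.equals(direct))
```

Because each step returns a data frame, the result of one `pipe` call can be fed straight into the next one.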

The first function handles the missing values.

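def fill_missing_values(df):
    # Fill NaNs in every numeric column with that column's mean.
    for col in df.select_dtypes(include="number").columns:
        val = df[col].mean()
        # Assign back instead of using inplace=True on a column selection,
        # which is unreliable in recent Pandas versions.
        df[col] = df[col].fillna(val)
    return df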

I prefer to replace the missing values in the numerical columns with the mean value of the column. Feel free to customize this function. It will work in the pipe as long as it takes a data frame as an argument and returns a data frame.
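As one possible customization, the sketch below fills numeric columns with the median instead of the mean, which is less sensitive to extreme values. The name `fill_missing_values_median` and the frame `toy` are hypothetical, and the function works on a copy so the caller's frame is left alone; that is a design choice of this sketch, not something the function above does.

```python
import numpy as np
import pandas as pd

def fill_missing_values_median(df):
    """Hypothetical variant: fill numeric NaNs with the column median."""
    df = df.copy()  # work on a copy so the caller's frame is not modified
    num_cols = df.select_dtypes(include="number").columns
    # fillna with a Series fills each column using the value under its label
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    return df

toy = pd.DataFrame({"A": [1.0, np.nan, 3.0, 100.0], "name": ["a", "b", "c", "d"]})
filled = toy.pipe(fill_missing_values_median)
print(filled)
```

It still takes a data frame and returns a data frame, so it can replace the mean-based function in a pipe without any other changes.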

The second function will help us remove the duplicate values.

def drop_duplicates(df, column_name):
    df = df.drop_duplicates(subset=column_name)
    return df

This function relies on the built-in drop_duplicates method of Pandas, which drops the rows that contain duplicate values in the given column or columns. In addition to the data frame, our function also takes a column name as an argument. We can pass such additional arguments through the pipe as well.

The last function in the pipe will be used for eliminating the outliers.

def remove_outliers(df, column_list):
    for col in column_list:
        avg = df[col].mean()
        std = df[col].std()
        low = avg - 2 * std
        high = avg + 2 * std
        # inclusive=True was removed in Pandas 2.0; "both" keeps the bounds.
        df = df[df[col].between(low, high, inclusive="both")]
    return df

What this function does is as follows:

  1. It takes a data frame and a list of columns
  2. For each column in the list, it calculates the mean and standard deviation
  3. It calculates a lower and upper bound using the mean and standard deviation
  4. It removes the rows whose values fall outside the range defined by the lower and upper bounds

Just like the previous functions, you can choose your own way of detecting outliers.
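For instance, a common alternative to the mean and standard deviation rule is the interquartile range (IQR) rule. The sketch below is one way to write it; the name `remove_outliers_iqr`, the multiplier `k=1.5`, and the frame `sample` are assumptions made for illustration and are not part of the original functions.

```python
import pandas as pd

def remove_outliers_iqr(df, column_list, k=1.5):
    """Hypothetical variant: keep rows within [Q1 - k*IQR, Q3 + k*IQR]."""
    for col in column_list:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        # between() includes both bounds by default; rows with NaN are dropped
        df = df[df[col].between(q1 - k * iqr, q3 + k * iqr)]
    return df

# Same values as column B of the mock data
sample = pd.DataFrame({"B": [45, 56, 48, 47, 62, 112, 54, 49]})
cleaned = sample.pipe(remove_outliers_iqr, ["B"])
print(cleaned)
```

On these values the rule drops only the row containing 112. The function has the same shape as `remove_outliers`, so either can be used as a step in the pipe.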

We now have three functions, each of which handles one data preprocessing task. The next step is to create a pipe with these functions.

df_processed = (
    df
    .pipe(fill_missing_values)
    .pipe(drop_duplicates, "id")
    .pipe(remove_outliers, ["A", "B"])
)

This pipe executes the functions in the given order. We can pass the arguments to the pipe along with the function names.
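Arguments can also be passed by keyword, which tends to read better in longer pipes. In addition, `pipe` accepts a `(callable, data_keyword)` tuple for functions that do not take the data frame as their first parameter. The sketch below shows both forms; `keep_columns` and `toy` are hypothetical names introduced only for this example.

```python
import pandas as pd

def drop_duplicates(df, column_name):
    return df.drop_duplicates(subset=column_name)

def keep_columns(columns, frame):
    """Hypothetical helper whose data frame is NOT the first parameter."""
    return frame[columns]

toy = pd.DataFrame({"id": [100, 100, 101], "A": [1, 2, 3]})

result = (
    toy
    # extra arguments can be given by keyword
    .pipe(drop_duplicates, column_name="id")
    # the tuple tells pipe to pass the frame as the `frame` keyword
    .pipe((keep_columns, "frame"), ["id"])
)
print(result)
```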

One thing to mention here is that some functions in the pipe modify the original data frame. Thus, using the pipe as indicated above will update df as well.

One option to overcome this issue is to use a copy of the original data frame in the pipe. If you do not care about keeping the original data frame as is, you can just use it in the pipe.

I will update the pipe as below:

my_df = df.copy()

df_processed = (
    my_df
    .pipe(fill_missing_values)
    .pipe(drop_duplicates, "id")
    .pipe(remove_outliers, ["A", "B"])
)
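Another way to protect the original, sketched here under the assumption that an extra step is acceptable, is to make the copy the first step of the pipe itself. The helper `start_pipeline` and the frame `raw` are hypothetical names; the fill function repeats the logic described earlier so the block runs on its own.

```python
import numpy as np
import pandas as pd

def start_pipeline(df):
    """Hypothetical first step: hand a copy to the rest of the pipe."""
    return df.copy()

def fill_missing_values(df):
    # Modifies the frame it receives, so it should be given a copy
    for col in df.select_dtypes(include="number").columns:
        df[col] = df[col].fillna(df[col].mean())
    return df

raw = pd.DataFrame({"A": [1.0, np.nan, 3.0]})
processed = raw.pipe(start_pipeline).pipe(fill_missing_values)

print(raw["A"].isna().sum())        # missing values left in the original
print(processed["A"].isna().sum())  # missing values left after the pipe
```

With the copy as an explicit step, the pipe documents its own safety and there is no intermediate variable to keep track of.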

Let’s take a look at the original and processed data frames:
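The output is not reproduced here, so the following self-contained sketch can be run to regenerate it. It assumes the mock data from the beginning of the article and the three functions as described, with the missing-value step placed first in the pipe; the function bodies are condensed slightly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [100, 100, 101, 102, 103, 104, 105, 106],
    "A": [1, 2, 3, 4, 5, 2, np.nan, 5],
    "B": [45, 56, 48, 47, 62, 112, 54, 49],
    "C": [1.2, 1.4, 1.1, 1.8, np.nan, 1.4, 1.6, 1.5],
})

def fill_missing_values(df):
    for col in df.select_dtypes(include="number").columns:
        df[col] = df[col].fillna(df[col].mean())
    return df

def drop_duplicates(df, column_name):
    return df.drop_duplicates(subset=column_name)

def remove_outliers(df, column_list):
    for col in column_list:
        avg, std = df[col].mean(), df[col].std()
        df = df[df[col].between(avg - 2 * std, avg + 2 * std)]
    return df

df_processed = (
    df.copy()  # keep the original frame untouched
    .pipe(fill_missing_values)
    .pipe(drop_duplicates, "id")
    .pipe(remove_outliers, ["A", "B"])
)

print(df)
print(df_processed)
```

With this data, the original frame keeps its eight rows and two missing values, while the processed frame should contain six rows: the second row with id 100 is dropped as a duplicate, and the row with 112 in column B is dropped as an outlier.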


You can, of course, accomplish the same tasks by applying these functions separately. However, the pipe function offers a structured and organized way for combining several functions into a single operation.

Depending on the raw data and the tasks, the preprocessing may include more steps. You can add as many steps as you need in the pipe function. As the number of steps increases, the syntax becomes cleaner with the pipe function compared to executing functions separately.



Amir Masoud Sefidian
Data Scientist, Machine Learning Engineer, Researcher, Software Developer
