In Data Processing, it is often necessary to perform operations (such as statistical calculations, splitting, or substituting values) on a certain row or column to obtain new data. Writing a for-loop to iterate through Pandas DataFrame and Series will do the job, but that doesn’t seem like a good idea. The for-loop tends to have more lines of code, less code readability, and slower performance.
Fortunately, there are already great methods that are built into Pandas to help you accomplish the goals! In this post, we will see how to perform operations using apply()
and applymap()
, and how to substitute value using map()
.
First of all, you should be aware that DataFrame and Series will have some or all of these three methods, as follows:
And the Pandas official API reference suggests that:
apply()
is used to apply a function along an axis of the DataFrame or on values of Series.applymap()
is used to apply a function to a DataFrame elementwise.map()
is used to substitute each value in a Series with another value.Before we dive into the details, let’s first create a DataFrame for demonstration.
import pandas as pd
df = pd.DataFrame({ 'A': [1,2,3,4],
'B': [10,20,30,40],
'C': [20,40,60,80]
},
index=['Row 1', 'Row 2', 'Row 3', 'Row 4'])
The Pandas apply()
is used to apply a function along an axis of the DataFrame or on values of Series. Let’s begin with a simple example, to sum each row and save the result to a new column “D”
# Let's call this "custom_sum" as "sum" is a built-in function
def custom_sum(row):
return row.sum()
df['D'] = df.apply(custom_sum, axis=1)
And here is the output
Let’s take a look df.apply(custom_sum, axis=1)
custom_sum
is a function.axis
is to specify which axis the function is applied to. 0
for applying the function to each column and 1
for applying the function to each row.Let me explain this process in a more intuitive way. The second parameter axis = 1
tells Pandas to use the row. So, the custom_sum
is applied to each row and returns a new Series with the output of each row as value.
With the understanding of the sum of each row, the sum of each column is just to use axis = 0
instead
df.loc['Row 5'] = df.apply(custom_sum, axis=0)
So far, we have been talking about apply()
on a DataFrame. Similarly, apply()
can be used on the values of Series. For example, multiply the column “C” by 2 and save the result to a new column “D”
def multiply_by_2(val):
return val * 2
df['D'] = df['C'].apply(multiply_by_2)
Notice that df[‘C’]
is used to select the column “C” and then call apply()
with the only parameter multiply_by_2
. We don’t need to specify axis anymore because Series is a one-dimensional array. The return value is a Series and gets assigned to the new column D by df[‘D’]
.
You can also use lambda expression with Pandas apply()
function.
The lambda equivalent for the sum of each row of a DataFrame:
df['D'] = df.apply(lambda x:x.sum(), axis=1)
The lambda equivalent for the sum of each column of a DataFrame:
df.loc['Row 5'] = df.apply(lambda x:x.sum(), axis=0)
And the lambda equivalent for multiplying by 2 on a Series:
df['D'] = df['C'].apply(lambda x:x*2)
result_type
is a parameter in apply()
set to 'expand'
, 'reduce'
, or 'broadcast'
to get the desired type of result.
In the above scenario if result_type
is set to 'broadcast'
then the output will be a DataFrame substituted by the custom_sum
value.
df.apply(custom_sum, axis=1, result_type='broadcast')
The result is broadcasted to the original shape of the frame, and the original index and columns are retained.
To understand result_type
as 'expand'
and 'reduce'
, we will first create a function that returns a list.
def cal_multi_col(row):
return [row['A'] * 2, row['B'] * 3]
Now apply this function across the DataFrame column with result_type
as 'expand'
df.apply(cal_multi_col, axis=1, result_type='expand')
The output is a new DataFrame with column names 0 and 1.
In order to append this to the existing DataFrame, the result has to be kept in a variable so the column names can be accessed by res.columns
.
res = df.apply(cal_multi_col, axis=1, result_type='expand')
df[res.columns] = res
And the output is:
Next, apply the function across the DataFrame column with result_type
as 'reduce'
. result_type='reduce'
is just the opposite of 'expand'
and returns a Series if possible rather than expanding list-like results.
df['New'] = df.apply(cal_multi_col, axis=1, result_type='reduce')
applymap()
is only available in DataFrame and used for element-wise operation across the whole DataFrame. It has been optimized and some cases work much faster than apply()
, but it’s good to compare it with apply()
before going for any heavier operation.
For example, to output a DataFrame with squared numbers we can use the following:
df.applymap(np.square)
map()
is only available in Series and used for substituting each value in a Series with another value. To understand how the map()
works, we first create a Series.
s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
s
0 cat
1 dog
2 NaN
3 rabbit
dtype: object
map()
accepts a dict
or a Series
. Values that are not found in the dict
are converted to NaN
, unless the dict
has a default value (e.g. defaultdict
):
s.map({'cat': 'kitten', 'dog': 'puppy'})
0 kitten
1 puppy
2 NaN
3 NaN
dtype: object
It also accepts a function:
s.map('I am a {}'.format)
0 I am a cat
1 I am a dog
2 I am a nan
3 I am a rabbit
dtype: object
To avoid applying the function to missing values (and keep them as NaN
) na_action='ignore'
can be used:
s.map('I am a {}'.format, na_action='ignore')
0 I am a cat
1 I am a dog
2 NaN
3 I am a rabbit
dtype: object
apply
(), map(), and applymap()For DataFrame:
apply()
: It is used when you want to apply a function along the rows or columns. axis = 0
for columns and axis = 1
for rows.applymap()
: It is used for element-wise operation across the whole DataFrame.For Series:
apply()
: It is used when you want to apply a function on the values of Series.map()
: It is used to substitute each value with another value.In this section, we will cover the following most frequently used Pandas transform()
features:
groupby()
resultsLet’s take a look at pd.transform(func, axis=0)
func
is to specify the function to be used for manipulating data. It can be a function, a string function name, a list of functions, or a dictionary of axis labels -> functionsfunc
is applied to. 0
for applying the func
to each column and 1
for applying the func
to each row.Let’s see how transform()
works with the help of some examples.
We can pass a function to func
. For example
df = pd.DataFrame({'A': [1,2,3], 'B': [10,20,30] })
def plus_10(x):
return x+10
df.transform(plus_10)
You can also use a lambda expression. Below is the lambda equivalent for the plus_10()
:
df.transform(lambda x: x+10)
We can pass any valid Pandas string function to func
, for example 'sqrt'
:
df.transform('sqrt')
func
can be a list of functions. for example sqrt
and exp
from NumPy:
df.transform([np.sqrt, np.exp])
func
can be a dictionary of axis labels -> function. For example
df.transform({
'A': np.sqrt,
'B': np.exp,
})
groupby()
resultsOne of the most compelling usages of Pandas transform()
is combining grouby()
results. Let’s see how this works with the help of an example. Suppose we have a dataset about a restaurant chain:
df = pd.DataFrame({
'restaurant_id': [101,102,103,104,105,106,107],
'address': ['A','B','C','D', 'E', 'F', 'G'],
'city': ['London','London','London','Oxford','Oxford', 'Durham', 'Durham'],
'sales': [10,500,48,12,21,22,14]
})
We can see that each city has multiple restaurants with sales. We would like to know “What is the percentage of sales each restaurant represents in the city”. The expected output is:
The tricky part in this calculation is that we need to get a city_total_sales and combine it back into the data in order to get the percentage.
There are 2 solutions:
groupby()
, apply()
, and merge()
groupby()
and transform()
groupby()
, apply()
, and merge()
The first solution is to split the data with groupby()
and using apply()
to aggregate each group, then merge the results back into the original DataFrame using merge()
Step 1: Use groupby()
and apply()
to calculate the city_total_sales
city_sales = df.groupby('city')['sales']
.apply(sum).rename('city_total_sales').reset_index()
groupby('city')
split the data by grouping on the city column. For each of these groups, the function sum
gets applied to the sales column to calculate the sum for each group. Finally, the new column gets renamed to city_total_sales and the index gets reset (Note: reset_inde()
is required to clear the index generated by groupby('city')
).
In addition, Pandas has a built-in sum()
function and the following is the Pandas sum()
equivalent:
city_sales = df.groupby('city')['sales']
.sum().rename('city_total_sales').reset_index()
Step 2: Use merge()
function to combine the results
df_new = pd.merge(df, city_sales, how='left')
The group results get merged back into the original DataFrame using merge()
with how='left'
for left outer join.
Step 3: Calculate the percentage
Finally, the percentage can be calculated and formatted.
df_new['pct'] = df_new['sales'] / df_new['city_total_sales']
df_new['pct'] = df_new['pct'].apply(lambda x: format(x, '.2%'))
This certainly does our work. But it is a multistep process and requires extra code to get the data in the form we require.
We can solve this effectively using the transform()
function
groupby()
and transform()
This solution is a game-changer. A single line of code can solve the apply and merge.
Step 1: Use groupby()
and transform()
to calculate the city_total_sales
The transform function retains the same number of items as the original dataset after performing the transformation. Therefore, a one-line step using groupby
followed by a transform(sum)
returns the same output.
df['city_total_sales'] = df.groupby('city')['sales']
.transform('sum')
Step 2: Calculate the percentage
Finally, this is the same as the solution one to get the percentage.
df['pct'] = df['sales'] / df['city_total_sales']
df['pct'] = df['pct'].apply(lambda x: format(x, '.2%'))
transform()
can also be used to filter data. Here we are trying to get records where the city’s total sales are greater than 40
df[df.groupby('city')['sales'].transform('sum') > 40]
Another usage of Pandas transform()
is to handle missing values at the group level. Let’s see how this works with an example.
Here is a DataFrame for demonstration
df = pd.DataFrame({
'name': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'value': [1, np.nan, np.nan, 2, 8, 2, np.nan, 3]
})
In the example above, the data can be split into three groups by name, and each group has missing values. A common solution to replace missing values is to replace NaN with mean.
Let’s take a look at the average value in each group.
df.groupby('name')['value'].mean()name
A 1.0
B 5.0
C 2.5
Name: value, dtype: float64
Here we can use transform()
to replace missing values with the group average value.
df['value'] = df.groupby('name')
.transform(lambda x: x.fillna(x.mean()))
When looking for applying a custom function, you might be confused with the following two choices:
apply(func, axis=0)
: call a function func
along an axis of the DataFrame. It returns the result of applying func
along the given axis.transform(func, axis=0)
: call a function func
on self producing a DataFrame with transformed values. It returns a DataFrame that has the same length as self.They take the same arguments func
and axis
. Both call the func
along the axis of the given DataFrame. So what is the difference? How do we choose one over the other?
In this section, we will cover the following usages and go through their difference:
groupby()
resultsBoth apply()
and transform()
can be used to manipulate values. Let’s see how they work with the help of some examples.
df = pd.DataFrame({'A': [1,2,3], 'B': [10,20,30] })
def plus_10(x):
return x+10
Both apply()
and transform()
can be used to manipulate the entire DataFrame.
df.apply(plus_10)
df.transform(plus_10)
Both apply()
and transform()
support lambda expression and below is the lambda equivalent:
df.apply(lambda x: x+10)
df.transform(lambda x: x+10)
Both apply()
and transform()
can be used for manipulating a single column
df['B_ap'] = df['B'].apply(plus_10)
# The lambda equivalent
df['B_ap'] = df['B'].apply(lambda x: x+10)
df['B_tr'] = df['B'].transform(plus_10)
# The lambda equivalent
df['B_tr'] = df['B'].transform(lambda x: x+10)
Here are the three main differences:
transform()
works with function, a string function, a list of functions, or a dictionary. However, apply()
is only allowed with function.transform()
cannot produce aggregated results.apply()
works with multiple Series at a time. But, transform()
is only allowed to work with a single Series at a time.Let’s take a look at them with the help of some examples.
(1) transform()
works with function, a string function, a list of functions, or a dictionary. However, apply()
is only allowed a function.
For transform()
, we can pass any valid Pandas string function to func
df.transform('sqrt')
func
can be a list of functions, for example, sqrt
and exp
from NumPy:
df.transform([np.sqrt, np.exp])
func
can be a dict of axis labels -> function. For example
df.transform({
'A': np.sqrt,
'B': np.exp,
})
(2) transform()
cannot produce aggregated results.
We can use apply()
to produce aggregated results, for example, the sum:
df.apply(lambda x:x.sum())
A 6
B 60
dtype: int64
However, we will get a ValueError when trying to do the same with transform()
. We are getting this problem because the output of transform()
has to be a DataFrame that has the same length as self.
df.transform(lambda x:x.sum())
(3) apply()
works with multiple Series at a time. But, transform()
is only allowed to work with a single Series at a time.
To demonstrate this, let’s create a function to work on 2 Series at a time.
def subtract_two(x):
return x['B'] - x['A']
apply()
works perfect with subtract_two
and axis=1
df.apply(subtract_two, axis=1)
0 9
1 18
2 27
dtype: int64
However, we are getting a ValueError when trying to do the same with transform()
. This is because transform()
is only allowed to work with a single Series at a time.
# Getting error when trying the same with transform
df.transform(subtract_two, axis=1)
We will get the same result when using a lambda expression
# It is working
df.apply(lambda x: x['B'] - x['A'], axis=1)
# Getting same error
df.transform(lambda x: x['B'] - x['A'], axis=1)
groupby()
Both apply()
and transform()
can be used in conjunction with groupby()
. And in fact, it is one of the most compelling usages of transform()
.
Here are the 2 differences when using them in conjunction with groupby()
transform()
returns a DataFrame that has the same length as the input apply()
works with multiple Series at a time. But, transform()
is only allowed to work with a single Series at a time.Let’s create a DataFrame and show the difference with some examples
df = pd.DataFrame({
'key': ['a','b','c'] * 4,
'A': np.arange(12),
'B': [1,2,3] * 4,
})
In the example above, the data can be split into three groups by key.
(1) transform()
returns a Series that has the same length as the input
To demonstrate this, let’s make a function to produce an aggregated result.
# Aggregating the sum of the given Series
def group_sum(x):
return x.sum()
For apply()
, it returns one value for each group and the output shape is (num_of_groups, 1)
.
gr_data_ap = df.groupby('key')['A'].apply(group_sum)
gr_data_ap
key
a 9
b 12
c 15
Name: A, dtype: int64
For transform()
, it returns a Series that has the same length as the given DataFrame and the output shape is (len(df), 1)
gr_data_tr = df.groupby('key')['A'].transform(group_sum)
gr_data_tr
0 9
1 12
2 15
3 9
4 12
5 15
6 9
7 12
8 15
Name: A, dtype: int64
(2) apply()
works with multiple Series at a time. But transform()
is only allowed to work with a single Series at a time.
This is the same difference as we mentioned in “Manipulating values”, and we just don’t need to specify the argument axis
on a groupby()
result.
To demonstrate this, let’s create a function to work on 2 Series at a time.
def subtract_two(x):
return x['B'] - x['A']
apply()
works with multiple Series at a time.
df.groupby('key').apply(subtract_two)
key
a 0 1
3 -2
6 -5
b 1 1
4 -2
7 -5
c 2 1
5 -2
8 -5
dtype: int64
However, we are getting a KeyError when trying the same with transform()
:
df.groupby('key').transform(subtract_two)
For manipulating values, both apply()
and transform()
can be used to manipulate an entire DataFrame or any specific column. But there are 3 differences:
transform()
can take a function, a string function, a list of functions, and a dictionary. However, apply()
is only allowed a function.transform()
cannot produce aggregated results.apply()
works with multiple Series at a time. But, transform()
is only allowed to work with a single Series at a time.For working in conjunction with groupby()
transform()
returns a Series that has the same length as the input.apply()
works with multiple Series at a time. But, transform()
is only allowed to work with a single Series at a time.Resources:
https://towardsdatascience.com/when-to-use-pandas-transform-function-df8861aa0dcf
https://towardsdatascience.com/introduction-to-pandas-apply-applymap-and-map-5d3e044e93ff
https://towardsdatascience.com/difference-between-apply-and-transform-in-pandas-242e5cf32705