A MultiIndex (also known as a hierarchical index) DataFrame allows you to have multiple columns acting as a row identifier and multiple rows acting as a header identifier. With MultiIndex, you can do some sophisticated data analysis, especially for working with higher dimensional data. Accessing data is the first step when working on a MultiIndex DataFrame.
In this post, you’ll learn how to access data in a MultiIndex DataFrame. This post is structured as follows:
slice(None)
xs()
IndexSlice
For demonstration, we create a dummy dataset and will load it with the first 2 columns as a row identifier and the first 2 rows as a header identifier.
df = pd.read_csv('dataset.csv',
index_col=[0,1],
header=[0,1]
).sort_index()
When it comes to select data on a DataFrame, Pandas loc
is one of the top favorites. Accessing data in a MultiIndex DataFrame can be done in a similar way to a single index DataFrame using loc
and iloc
.
We can pass the first-level label to loc
to select data
# Retrieve London's Day weather
df.loc['London', 'Day']
We can also use :
to return all data.
# To get all rows - all Day weather
df.loc[:, 'Day']
And to get all columns
# To get all columns
df.loc['London' , :]
Above works as we would expect. However, you may be surprised to find df.loc['London', '2019-07-02']
also works
# This also works
df.loc['London' , '2019-07-02']
Day Weather Shower
Wind SW 16 mph
Max Temperature 29
Night Weather Heavy rain
Wind SW 16 mph
Max Temperature 17
Name: (London, 2019-07-02), dtype: object
This result can lead to ambiguity in general. In fact, all the above statements are shorthand notations for retrieving data from a MultiIndex DataFrame. For MultiIndex DataFrame, it’s recommended to use a tuple to avoid ambiguity.
Here are equivalent statements using tuple
# Equivalent to df.loc['London', 'Day']
df.loc[('London', ) , ('Day', )]# Equivalent to df.loc[:, 'Day']
df.loc[:, ('Day',)]# Equivalent to df.loc['London' , :]
df.loc[('London', ) , :]# Equivalent to df.loc['London' , '2019-07-02']
df.loc[('London' , '2019-07-02')]
If you want to read London’s Day weather on 2019–07–01, you can simply do:
df.loc['London', 'Day'].loc['2019-07-01']
Weather Shower
Wind SW 16 mph
Max Temperature 28
Name: 2019-07-01, dtype: object
Note that loc['London', 'Day']
return a DataFrame and loc['2019-07-01']
is called on it to extract the data again. This certainly works, but there is a more effective way to use tuples.
df.loc[('London', '2019-07-01'), 'Day']
Weather Shower
Wind SW 16 mph
Max Temperature 28
Name: (London, 2019-07-01), dtype: object
MultiIndex keys take the form of tuple. In this case, ('London', '2019-07-01')
select the row label and 'Day'
select the column.
We can also pass a list of labels to select multiple rows or columns:
# Select multiple rows
df.loc[
('London' , ['2019-07-01','2019-07-02'] ) ,
'Day'
]
# Select multiple columns
df.loc[
'London' ,
('Day', ['Weather', 'Wind'])
]
Slice (written as start:stop:step
) is a powerful technique that allows selecting a range of data. It is very useful when we want to select everything in between two items.
Slicing the first-level index in MultiIndex DataFrame is similar to the way on a single index DataFrame. For example,
df.loc[
'Cambridge':'Oxford',
'Day'
]
However, we will be getting a SyntaxError when slicing the multi-level index like below:
# We are getting a SyntaxError
df.loc[
('London', '2019-07-01': '2019-07-03'),
'Day'
]
The correct way to slice a multi-level index is to use tuples:
df.loc[
('London','2019-07-01'):('London','2019-07-03'),
'Day'
]
The lower level indexes are not necessary to be the same. For example, we can select the data range from ('Cambridge', '2019-07-01')
to ('London', '2019-07-02')
df.loc[
('Cambridge', '2019-07-01'):('London','2019-07-02'),
'Day'
]
slice(None)
We can use :
to select all content when retrieving data via the first-level index. However, we will be getting SyntaxError when using it with a multi-level index.
# Getting SyntaxError
df.loc[
('London', :),
'Day'
]
# Getting SyntaxError
df.loc[
(: , '2019-07-04'),
'Day'
]
The correct way to select all content on a specific index level is to use slice(None)
df.loc[
('London', slice(None)),
'Day'
]
df.loc[
(slice(None) , '2019-07-04'),
'Day'
]
xs()
The xs()
method of DataFrame can take a level
argument to make selecting data at a particular level of a MultiIndex easier.
df.xs('2019-07-04', level='Date')
xs()
also allows selection with multiple keys
df.xs(('London', '2019-07-04'), level=['City','Date'])
You can also select on the columns with xs()
, by providing the axis
argument.
df.xs('Wind', level=1, axis=1)
Note that the selected column label 'Wind'
is not included in the result. To retain the level that was selected, we can pass drop_level=False
df.xs('Wind', level=1, axis=1, drop_level=False)
IndexSlice
You can use Pandas IndexSlice
to facilitate a more natural syntax.
For example, use :
instead of slice(None)
from pandas import IndexSlice as idx
df.loc[
idx[: , '2019-07-04'],
'Day'
]
# Instead of slice(None)
# df.loc[
# (slice(None) , '2019-07-04'),
# 'Day'
# ]
Use IndexSlice
for both rows and columns
rows = idx[: , '2019-07-01']
cols = idx['Day' , ['Max Temperature','Weather']]
df.loc[rows, cols]
Use xs()
with IndexSlice
for selecting a range of data
rows= (
idx['2019-07-02':'2019-07-04'],
'London'
)
df.xs(
rows ,
level = ['Date','City']
)
With MultiIndex, you can do some sophisticated data analysis, especially for working with higher dimensional data. Accessing data is the first step when working with MultiIndex DataFrame. I hope you found this tutorial to access MultiIndex DataFrame useful. I recommend you check out the documentation to know about other things you can do.
Resources:
https://towardsdatascience.com/accessing-data-in-a-multiindex-dataframe-in-pandas-569e8767201d