Data Science and Machine Learning Cheat Sheets
2021-06-21
A complete guide on Pandas Hierarchical Indexing (MultiIndex)
2021-06-24
Show all

Data selection (indexing and slicing) in Pandas MultiIndex DataFrames

6 mins read

MultiIndex (also known as a hierarchical index) DataFrame allows you to have multiple columns acting as a row identifier and multiple rows acting as a header identifier. With MultiIndex, you can do some sophisticated data analysis, especially for working with higher dimensional data. Accessing data is the first step when working on a MultiIndex DataFrame.

In this post, you’ll learn how to access data in a MultiIndex DataFrame. This post is structured as follows:

  1. Selecting data via the first-level index
  2. Selecting data via multi-level index
  3. Select a range of data using slice
  4. Selecting all content using slice(None)
  5. Using cross-section xs()
  6. Using IndexSlice

For demonstration, we create a dummy dataset and will load it with the first 2 columns as a row identifier and the first 2 rows as a header identifier.

df = pd.read_csv('dataset.csv',
    index_col=[0,1],
    header=[0,1]
).sort_index()

1. Selecting data via the first level index

When it comes to select data on a DataFrame, Pandas loc is one of the top favorites. Accessing data in a MultiIndex DataFrame can be done in a similar way to a single index DataFrame using  loc and iloc.

We can pass the first-level label to loc to select data

# Retrieve London's Day weather
df.loc['London', 'Day']

We can also use : to return all data.

# To get all rows - all Day weather
df.loc[:, 'Day']

And to get all columns

# To get all columns
df.loc['London' , :]

Above works as we would expect. However, you may be surprised to find df.loc['London', '2019-07-02'] also works

# This also works
df.loc['London' , '2019-07-02']
Day    Weather                Shower
       Wind                SW 16 mph
       Max Temperature            29
Night  Weather            Heavy rain
       Wind                SW 16 mph
       Max Temperature            17
Name: (London, 2019-07-02), dtype: object

This result can lead to ambiguity in general. In fact, all the above statements are shorthand notations for retrieving data from a MultiIndex DataFrame. For MultiIndex DataFrame, it’s recommended to use a tuple to avoid ambiguity.

Here are equivalent statements using tuple

# Equivalent to df.loc['London', 'Day']
df.loc[('London', ) , ('Day', )]# Equivalent to df.loc[:, 'Day']
df.loc[:, ('Day',)]# Equivalent to df.loc['London' , :]
df.loc[('London', ) , :]# Equivalent to df.loc['London' , '2019-07-02']
df.loc[('London' , '2019-07-02')]

2. Selecting data via multi-level index

If you want to read London’s Day weather on 2019–07–01, you can simply do:

df.loc['London', 'Day'].loc['2019-07-01']
Weather               Shower
Wind               SW 16 mph
Max Temperature           28
Name: 2019-07-01, dtype: object

Note that loc['London', 'Day'] return a DataFrame and loc['2019-07-01'] is called on it to extract the data again. This certainly works, but there is a more effective way to use tuples.

df.loc[('London', '2019-07-01'), 'Day']
Weather               Shower
Wind               SW 16 mph
Max Temperature           28
Name: (London, 2019-07-01), dtype: object

MultiIndex keys take the form of tuple. In this case, ('London', '2019-07-01') select the row label and 'Day' select the column.

We can also pass a list of labels to select multiple rows or columns:

# Select multiple rows
df.loc[ 
    ('London' , ['2019-07-01','2019-07-02'] ) ,
    'Day'
]
# Select multiple columns
df.loc[ 
    'London' ,
    ('Day', ['Weather', 'Wind'])
]

3. Selecting a range of data via slice

Slice (written as start:stop:step) is a powerful technique that allows selecting a range of data. It is very useful when we want to select everything in between two items.

Slicing the first-level index in MultiIndex DataFrame is similar to the way on a single index DataFrame. For example,

df.loc[
    'Cambridge':'Oxford',
    'Day'
]

However, we will be getting a SyntaxError when slicing the multi-level index like below:

# We are getting a SyntaxError
df.loc[
    ('London', '2019-07-01': '2019-07-03'),
    'Day'
]

The correct way to slice a multi-level index is to use tuples:

df.loc[
    ('London','2019-07-01'):('London','2019-07-03'),
    'Day'
]

The lower level indexes are not necessary to be the same. For example, we can select the data range from ('Cambridge', '2019-07-01') to ('London', '2019-07-02')

df.loc[
    ('Cambridge', '2019-07-01'):('London','2019-07-02'),
    'Day'
]

4. Selecting all content using slice(None)

We can use : to select all content when retrieving data via the first-level index. However, we will be getting SyntaxError when using it with a multi-level index.

# Getting SyntaxError
df.loc[ 
    ('London', :),
    'Day'
]
# Getting SyntaxError
df.loc[ 
    (: , '2019-07-04'),
    'Day'
]

The correct way to select all content on a specific index level is to use slice(None)

df.loc[ 
    ('London', slice(None)),
    'Day'
]
df.loc[
    (slice(None) , '2019-07-04'), 
    'Day'
]

5. Using cross-section xs()

The xs() method of DataFrame can take a level argument to make selecting data at a particular level of a MultiIndex easier.

df.xs('2019-07-04', level='Date')

xs() also allows selection with multiple keys

df.xs(('London', '2019-07-04'), level=['City','Date'])

You can also select on the columns with xs(), by providing the axis argument.

df.xs('Wind', level=1, axis=1)

Note that the selected column label 'Wind' is not included in the result. To retain the level that was selected, we can pass drop_level=False

df.xs('Wind', level=1, axis=1, drop_level=False)

6. Using IndexSlice

You can use Pandas IndexSlice to facilitate a more natural syntax.

For example, use : instead of slice(None)

from pandas import IndexSlice as idx
df.loc[ 
    idx[: , '2019-07-04'], 
    'Day'
]
# Instead of slice(None)
# df.loc[ 
#    (slice(None) , '2019-07-04'), 
#    'Day'
# ]

Use IndexSlice for both rows and columns

rows = idx[: , '2019-07-01']
cols = idx['Day' , ['Max Temperature','Weather']]
df.loc[rows, cols]

Use xs() with IndexSlice for selecting a range of data

rows= (
    idx['2019-07-02':'2019-07-04'], 
    'London'
)
df.xs(
    rows , 
    level = ['Date','City']
)

Conclusion

With MultiIndex, you can do some sophisticated data analysis, especially for working with higher dimensional data. Accessing data is the first step when working with MultiIndex DataFrame. I hope you found this tutorial to access MultiIndex DataFrame useful. I recommend you check out the documentation to know about other things you can do.

Resources:

https://towardsdatascience.com/accessing-data-in-a-multiindex-dataframe-in-pandas-569e8767201d

Amir Masoud Sefidian
Amir Masoud Sefidian
Machine Learning Engineer

Comments are closed.