What are the most common pandas ways to select/filter rows of a dataframe whose index is a MultiIndex?
- Slicing based on a single value/label
- Slicing based on multiple labels from one or more levels
- Filtering on boolean conditions and expressions
- Which methods are applicable in what circumstances
Assumptions for simplicity:
- input dataframe does not have duplicate index keys
- input dataframe below only has two levels. (Most solutions shown here generalize to N levels)
Example input:
mux = pd.MultiIndex.from_arrays([ list('aaaabbbbbccddddd'), list('tuvwtuvwtuvwtuvw') ], names=['one', 'two']) df = pd.DataFrame({'col': np.arange(len(mux))}, mux) col one two a t 0 u 1 v 2 w 3 b t 4 u 5 v 6 w 7 t 8 c u 9 v 10 d w 11 t 12 u 13 v 14 w 15
Question 1: Selecting a Single Item
How do I select rows having "a" in level "one"?
col
one two
a t 0
u 1
v 2
w 3
Additionally, how would I be able to drop level "one" in the output?
col
two
t 0
u 1
v 2
w 3
Question 1b
How do I slice all rows with value "t" on level "two"?
col
one two
a t 0
b t 4
t 8
d t 12
Question 2: Selecting Multiple Values in a Level
How can I select rows corresponding to items "b" and "d" in level "one"?
col
one two
b t 4
u 5
v 6
w 7
t 8
d w 11
t 12
u 13
v 14
w 15
Question 2b
How would I get all values corresponding to "t" and "w" in level "two"?
col
one two
a t 0
w 3
b t 4
w 7
t 8
d w 11
t 12
w 15
Question 3: Slicing a Single Cross Section (x, y)
How do I retrieve a cross section, i.e., a single row having a specific values for the index from df
? Specifically, how do I retrieve the cross section of ('c', 'u')
, given by
col
one two
c u 9
Question 4: Slicing Multiple Cross Sections [(a, b), (c, d), ...]
How do I select the two rows corresponding to ('c', 'u')
, and ('a', 'w')
?
col
one two
c u 9
a w 3
Question 5: One Item Sliced per Level
How can I retrieve all rows corresponding to "a" in level "one" or "t" in level "two"?
col
one two
a t 0
u 1
v 2
w 3
b t 4
t 8
d t 12
Question 6: Arbitrary Slicing
How can I slice specific cross sections? For "a" and "b", I would like to select all rows with sub-levels "u" and "v", and for "d", I would like to select rows with sub-level "w".
col
one two
a u 1
v 2
b u 5
v 6
d w 11
w 15
Question 7 will use a unique setup consisting of a numeric level:
np.random.seed(0) mux2 = pd.MultiIndex.from_arrays([ list('aaaabbbbbccddddd'), np.random.choice(10, size=16) ], names=['one', 'two']) df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2) col one two a 5 0 0 1 3 2 3 3 b 7 4 9 5 3 6 5 7 2 8 c 4 9 7 10 d 6 11 8 12 8 13 1 14 6 15
Question 7: Filtering by numeric inequality on individual levels of the multiindex
How do I get all rows where values in level "two" are greater than 5?
col
one two
b 7 4
9 5
c 7 10
d 6 11
8 12
8 13
6 15
Note: This post will not go through how to create MultiIndexes, how to perform assignment operations on them, or any performance related discussions (these are separate topics for another time).
Best Answer
MultiIndex / Advanced Indexing
Here is an introduction to some common idioms (henceforth referred to as the Four Idioms) we will be frequently re-visiting
DataFrame.loc
- A general solution for selection by label (+pd.IndexSlice
for more complex applications involving slices)DataFrame.xs
- Extract a particular cross section from a Series/DataFrame.DataFrame.query
- Specify slicing and/or filtering operations dynamically (i.e., as an expression that is evaluated dynamically. Is more applicable to some scenarios than others. Also see this section of the docs for querying on MultiIndexes.Boolean indexing with a mask generated using
MultiIndex.get_level_values
(often in conjunction withIndex.isin
, especially when filtering with multiple values). This is also quite useful in some circumstances.It will be beneficial to look at the various slicing and filtering problems in terms of the Four Idioms to gain a better understanding what can be applied to a given situation. It is very important to understand that not all of the idioms will work equally well (if at all) in every circumstance. If an idiom has not been listed as a potential solution to a problem below, that means that idiom cannot be applied to that problem effectively.
You can use
loc
, as a general purpose solution applicable to most situations:At this point, if you get
That means you're using an older version of pandas. Consider upgrading! Otherwise, use
df.loc[('a', slice(None)), :]
.Alternatively, you can use
xs
here, since we are extracting a single cross section. Note thelevels
andaxis
arguments (reasonable defaults can be assumed here).Here, the
drop_level=False
argument is needed to preventxs
from dropping level "one" in the result (the level we sliced on).Yet another option here is using
query
:If the index did not have a name, you would need to change your query string to be
"ilevel_0 == 'a'"
.Finally, using
get_level_values
:This can be easily done using either
Or,
Notice that we can omit the
drop_level
argument (it is assumed to beTrue
by default).Intuitively, you would want something involving
slice()
:It Just Works!™ But it is clunky. We can facilitate a more natural slicing syntax using the
pd.IndexSlice
API here.This is much, much cleaner.
With
xs
, it isWith
query
, it isAnd finally, with
get_level_values
, you may doAll to the same effect.
Using loc, this is done in a similar fashion by specifying a list.
To solve the above problem of selecting "b" and "d", you can also use
query
:And, with
get_level_values
+Index.isin
:With
loc
, this is possible only in conjuction withpd.IndexSlice
.The first colon
:
inpd.IndexSlice[:, ['t', 'w']]
means to slice across the first level. As the depth of the level being queried increases, you will need to specify more slices, one per level being sliced across. You will not need to specify more levels beyond the one being sliced, however.With
query
, this isWith
get_level_values
andIndex.isin
(similar to above):Use
loc
by specifying a tuple of keys:Or,
With
xs
, this is again simply passing a single tuple as the first argument, with all other arguments set to their appropriate defaults:With
query
, things become a bit clunky:You can see now that this is going to be relatively difficult to generalize. But is still OK for this particular problem.
With accesses spanning multiple levels,
get_level_values
can still be used, but is not recommended:With
loc
, this is still as simple as:With
query
, you will need to dynamically generate a query string by iterating over your cross sections and levels:100% DO NOT RECOMMEND! But it is possible.
What if I have multiple levels?
One option in this scenario would be to use
droplevel
to drop the levels you're not checking, then useisin
to test membership, and then boolean index on the final result.This is actually very difficult to do with
loc
while ensuring correctness and still maintaining code clarity.df.loc[pd.IndexSlice['a', 't']]
is incorrect, it is interpreted asdf.loc[pd.IndexSlice[('a', 't')]]
(i.e., selecting a cross section). You may think of a solution withpd.concat
to handle each label separately:But you'll notice one of the rows is duplicated. This is because that row satisfied both slicing conditions, and so appeared twice. You will instead need to do
But if your DataFrame inherently contains duplicate indices (that you want), then this will not retain them. Use with extreme caution.
With
query
, this is stupidly simple:With
get_level_values
, this is still simple, but not as elegant:This is a special case that I've added to help understand the applicability of the Four Idioms—this is one case where none of them will work effectively, since the slicing is very specific, and does not follow any real pattern.
Usually, slicing problems like this will require explicitly passing a list of keys to
loc
. One way of doing this is with:If you want to save some typing, you will recognise that there is a pattern to slicing "a", "b" and its sublevels, so we can separate the slicing task into two portions and
concat
the result:Slicing specification for "a" and "b" is slightly cleaner
(('a', 'b'), ('u', 'v'))
because the same sub-levels being indexed are the same for each level.This can be done using
query
,And
get_level_values
.Actually, most solutions here are applicable to columns as well, with minor changes. Consider:
These are the following changes you will need to make to the Four Idioms to have them working with columns.
To slice with
loc
, useor,
To use
xs
as appropriate, just pass an argumentaxis=1
.You can access the column level values directly using
df.columns.get_level_values
. You will then need to do something likeWhere
{condition}
represents some condition built usingcolumns.get_level_values
.To use
query
, your only option is to transpose, query on the index, and transpose again:Not recommended, use one of the other 3 options.