2938

Stats, Money, and NYC

Chaining methods in pandas

Chaining methods in pandas makes your code easy to read.

import pandas as pd

df = (pd.read_csv('test.csv')
      .set_index('myIndex')
      .rename(columns={'column_a': 'ColumnA'})
      .assign(colA=lambda x: x['ColumnA'] * 2)
      .sort_values('colA')
      .tail())

You don't need to know the pandas API to understand what's going on here. It's also way cleaner than the alternative.

df = pd.read_csv('test.csv')
df = df.set_index('myIndex')

And so on...

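The best way to chain methods is to take advantage of pipe. pipe takes a function whose first argument is the dataframe and calls it on the dataframe in the chain. As long as that function returns a dataframe, the chain keeps going. So you can put any operation into a method chain. You don't need to wait for pandas to add a method for it.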

def head20(df):
    return df.head(20)

# Then you can throw head20 into your method chain
df.pipe(head20)
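
pipe also forwards extra positional and keyword arguments to the function, so a helper doesn't have to hard-code its settings. A minimal sketch, assuming a hypothetical helper called top_n and some made-up data (neither comes from the example above):

```python
import pandas as pd

# Hypothetical helper: takes a dataframe as its first argument
# and returns a dataframe, so it can sit in the middle of a chain.
def top_n(df, column, n=2):
    return df.sort_values(column, ascending=False).head(n)

# Made-up data, just for illustration.
df = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                   'score': [10, 40, 30, 20]})

result = (df
          .assign(double=lambda x: x['score'] * 2)
          .pipe(top_n, 'double', n=2)  # called as top_n(df, 'double', n=2)
          .reset_index(drop=True))
```

Because top_n hands back a dataframe, reset_index (or any other dataframe method) can follow it directly.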


pandas.DataFrame.query

The shittiest dataframe method.

There are (at least) two common ways to filter a dataframe down to the rows that have a certain value in a column.

a = df.query('name == "myname"')
b = df[df['name'] == 'myname']

pandas.DataFrame.query is really bizarre (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html). It takes a string as a parameter, and that string is evaluated with pandas.eval(). Inside the string, an @ prefix refers to a variable in the surrounding Python scope.
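
Here is a small sketch of the @ syntax, assuming made-up data and a hypothetical variable called target. It filters the same rows both ways so the results can be compared:

```python
import pandas as pd

# Made-up data, just for illustration.
df = pd.DataFrame({'name': ['myname', 'other', 'myname'],
                   'city': ['NYC', 'NYC', 'Boston']})

target = 'myname'  # ordinary Python variable (hypothetical name)

a = df.query('name == @target')  # @target is looked up in the calling scope
b = df[df['name'] == target]     # boolean indexing uses the variable directly

same = a.equals(b)
```

Without the @, query would look for a column called target instead of the Python variable.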

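It's still better than the second method. query chains nicely, which can't really be said for the second method: the boolean mask has to refer to the dataframe by name, and an intermediate result in the middle of a chain doesn't have one.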

(df
 .query('name == "myname"')
 .groupby('city')
 .sum()
 .reset_index()
 .head())
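
To see that chain run end to end, here is a sketch on made-up data (the amount column and the values are assumptions), next to the boolean-indexing version of the same thing. It sums only amount so the string column name stays out of the aggregation.

```python
import pandas as pd

# Made-up data, just for illustration.
df = pd.DataFrame({
    'name': ['myname', 'myname', 'other', 'myname'],
    'city': ['NYC', 'Boston', 'NYC', 'NYC'],
    'amount': [1, 2, 4, 8],
})

# query slots straight into the chain.
chained = (df
           .query('name == "myname"')
           .groupby('city')['amount']
           .sum()
           .reset_index()
           .head())

# Boolean indexing has to name the dataframe it filters,
# so the filter ends up in a temporary variable.
filtered = df[df['name'] == 'myname']
unchained = filtered.groupby('city')['amount'].sum().reset_index().head()
```

Both versions produce the same dataframe. The difference is only in how the code reads.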