Pandas Cheat Sheet: Functions for Data Analysis


It is no secret that data is the lifeblood of modern business, and the ability to manipulate and analyze data is a key skill for any data scientist or machine learning practitioner.

The Pandas library for Python has become the go-to tool for data manipulation and analysis, providing a wide range of powerful functions for working with tabular data.

In this article, I will provide a cheat sheet of Pandas coding examples, covering a broad range of topics including data filtering, aggregation, merging, and reshaping.


Whether you’re a beginner or an experienced Python user, this cheat sheet is a valuable resource for anyone working with data in Python.

So, let’s dive in and explore the power of Pandas for data science and machine learning!

Importing the Pandas library and creating a DataFrame

This code imports the Pandas library, creates a dictionary of data, and converts it into a DataFrame using the pd.DataFrame() method. It creates a table with columns named “name”, “age”, and “city”, and adds four rows of data.

import pandas as pd
data = {'name': ['John', 'Sarah', 'David', 'Rachel'],
        'age': [24, 29, 32, 27],
        'city': ['New York', 'Los Angeles', 'Chicago', 'Boston']}
df = pd.DataFrame(data)

Selecting specific columns of a DataFrame

This code selects only the “name” and “age” columns from the DataFrame df.

df[['name', 'age']]

Filtering rows based on a condition

This code filters the rows of the DataFrame df based on the condition that the “age” column is greater than 25. It returns only the rows where this condition is true.

df[df['age'] > 25]

Sorting a DataFrame by a specific column

This code sorts the DataFrame df by the “age” column in ascending order. To sort in descending order, use df.sort_values('age', ascending=False) instead.

df.sort_values('age')
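As noted above, descending order is a one-argument change; a minimal sketch with a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Sarah', 'David'], 'age': [24, 29, 32]})

# Largest age first.
oldest_first = df.sort_values('age', ascending=False)
```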

Renaming columns in a DataFrame

This code renames the “name” column to “full_name” and the “city” column to “location” in the DataFrame df.

df.rename(columns={'name': 'full_name', 'city': 'location'})

Grouping a DataFrame by a column and aggregating another column

This code groups the DataFrame df by the “city” column and calculates the mean of the “age” column for each group. It returns a new Series object with the cities as the index and the mean ages as the values.

df.groupby('city')['age'].mean()

Merging two DataFrames based on a common column

This code creates two DataFrames (df1 and df2) and merges them together based on the “name” column. It creates a new DataFrame (merged_df) with columns for “name”, “age”, “city”, and “salary”.

data1 = {'name': ['John', 'Sarah', 'David', 'Rachel'], 
         'age': [24, 29, 32, 27],
         'city': ['New York', 'Los Angeles', 'Chicago', 'Boston']}
df1 = pd.DataFrame(data1)

data2 = {'name': ['John', 'Sarah', 'David', 'Rachel'],
         'salary': [50000, 60000, 70000, 80000]}
df2 = pd.DataFrame(data2)

merged_df = pd.merge(df1, df2, on='name')

Adding a new column to a DataFrame

This code adds a new column called “gender” to the DataFrame df with values “M”, “F”, “M”, and “F” for each row, respectively.

df['gender'] = ['M', 'F', 'M', 'F']

Handling missing values in a DataFrame

These three lines of code demonstrate different ways to handle missing values in a DataFrame. The first line drops any rows that contain missing values, the second line fills missing values with 0, and the third line fills missing values using interpolation.

df.dropna()  # drops rows with any missing values
df.fillna(0) # fills missing values with 0
df.interpolate() # fills missing values using interpolation
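One point worth noting: by default each of these methods returns a new object rather than modifying df in place, so the result must be assigned back to keep it. A minimal sketch with a small hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [24.0, np.nan, 32.0]})

df.fillna(0)       # returns a new DataFrame; df itself still contains the NaN
df = df.fillna(0)  # assign the result back to keep the filled values
```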

Saving a DataFrame to a CSV file

This code saves the DataFrame df to a CSV file called “my_dataframe.csv”. The index=False argument specifies that the row index should not be included in the output.

df.to_csv('my_dataframe.csv', index=False)
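The file can be read back with pd.read_csv, which recovers the column names from the header row; a quick round-trip sketch (using the same file name as above):

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Sarah'], 'age': [24, 29]})
df.to_csv('my_dataframe.csv', index=False)

# Read the file back; because index=False was used, no extra index column appears.
df2 = pd.read_csv('my_dataframe.csv')
```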

Selecting rows and columns of a DataFrame using loc

This code selects the rows with index values 1, 2, and 3 and the “name” and “city” columns from the DataFrame df. It uses the loc accessor to specify the row and column selections.

df.loc[1:3, ['name', 'city']]
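Note that loc slices by label and includes the endpoint, unlike ordinary Python slicing; the position-based iloc accessor is end-exclusive. A small sketch of the difference:

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Sarah', 'David', 'Rachel'],
                   'age': [24, 29, 32, 27]})

by_label = df.loc[1:3, 'name']   # label slice: rows 1, 2 AND 3 (endpoint included)
by_position = df.iloc[1:3, 0]    # position slice: rows 1 and 2 only
```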

Pivot a DataFrame using pivot_table

This code pivots the DataFrame df using the “city” column as the index, the “gender” column as the columns, and the “age” column as the values. It calculates the mean age for each combination of city and gender.

df.pivot_table(index='city', columns='gender', values='age', aggfunc='mean')

Apply a function to a column using apply

This code defines a function add_one that adds 1 to its argument and applies this function to the “age” column of the DataFrame df using the apply method. It creates a new column called “age_plus_one” with the result.

def add_one(x):
    return x + 1

df['age_plus_one'] = df['age'].apply(add_one)
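For a simple arithmetic transformation like this, a vectorized expression gives the same result without a per-element Python call and is typically much faster; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'age': [24, 29, 32, 27]})

# Vectorized addition over the whole column at once.
df['age_plus_one'] = df['age'] + 1
```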

Merge two DataFrames using join

This code creates two DataFrames (df1 and df2) and joins them together by matching the “name” column of df1 against the index of df2. It creates a new DataFrame (joined_df) with columns for “name”, “age”, “city”, and “salary”.

data1 = {'name': ['John', 'Sarah', 'David', 'Rachel'], 
         'age': [24, 29, 32, 27],
         'city': ['New York', 'Los Angeles', 'Chicago', 'Boston']}
df1 = pd.DataFrame(data1)

data2 = {'salary': [50000, 60000, 70000, 80000]}
df2 = pd.DataFrame(data2, index=['John', 'Sarah', 'David', 'Rachel'])

joined_df = df1.join(df2, on='name')

Count the occurrences of unique values in a column using value_counts

This code counts the occurrences of each unique value in the “city” column of the DataFrame df, and returns the result as a Series object.

df['city'].value_counts()

Reshape a DataFrame using melt

This code creates a wide DataFrame wide_df with columns «A», «B», and «C», and uses the melt function to reshape it into a long format. The resulting DataFrame (melted_df) has columns «A», «variable», and «value», where «variable» is the original column name («B» or «C»), and «value» is the corresponding cell value.

wide_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
melted_df = pd.melt(wide_df, id_vars=['A'], value_vars=['B', 'C'])

Merge two DataFrames using merge on multiple columns

This code creates two DataFrames (df1 and df2) and merges them on both the “name” and “city” columns. Passing a list of column names to on requires the values in every listed column to match for two rows to be joined; the result (merged_df) has columns for “name”, “age”, “city”, and “salary”.

data1 = {'name': ['John', 'Sarah', 'David', 'Rachel'], 
         'age': [24, 29, 32, 27],
         'city': ['New York', 'Los Angeles', 'Chicago', 'Boston']}
df1 = pd.DataFrame(data1)

data2 = {'name': ['John', 'Sarah', 'David', 'Rachel'],
         'city': ['New York', 'Los Angeles', 'Chicago', 'Boston'],
         'salary': [50000, 60000, 70000, 80000]}
df2 = pd.DataFrame(data2)

merged_df = pd.merge(df1, df2, on=['name', 'city'])

Filter a DataFrame by multiple conditions using & (and) and | (or)

This code filters the rows of the DataFrame df by combining two conditions with the & operator. It returns only the rows where the “age” column is greater than 25 AND the “city” column is equal to “Chicago”. To keep rows matching either condition instead, combine them with | (or); note that each condition must be wrapped in parentheses.

df[(df['age'] > 25) & (df['city'] == 'Chicago')]

Reshape a DataFrame using stack and unstack

This code creates a wide DataFrame wide_df with columns “A” and “B”, and uses the stack method to pivot the columns into a second index level. The result (stacked_df) is a Series with a two-level index (“a” and “b” in the first level, and “A” and “B” in the second). The unstack method then pivots the inner index level back out, restoring the original wide format.

wide_df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['a', 'b'])
stacked_df = wide_df.stack()
unstacked_df = stacked_df.unstack()
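To make the intermediate shape concrete: stack produces a Series whose two-level index combines the original row and column labels, and unstack restores the wide frame. A small sketch:

```python
import pandas as pd

wide_df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['a', 'b'])

stacked = wide_df.stack()        # Series indexed by (row label, column label)
value = stacked[('a', 'B')]      # look up row 'a', column 'B'
round_trip = stacked.unstack()   # back to the original wide layout
```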

Replace values in a column using replace

This code replaces the values in the “city” column of the DataFrame df using a dictionary of replacements. It replaces “New York” with “NY” and “Los Angeles” with “LA”.

df['city'] = df['city'].replace({'New York': 'NY', 'Los Angeles': 'LA'})

Calculate the cumulative sum of a column using cumsum

This code calculates the cumulative sum of the “age” column of the DataFrame df using the cumsum method. It creates a new column called “age_cumulative_sum” with the result.

df['age_cumulative_sum'] = df['age'].cumsum()

Filter a DataFrame based on the existence of a value in a column using isin

This code filters the rows of the DataFrame df based on whether the value in the “city” column is one of the values in the list ['New York', 'Chicago']. It returns only the rows where this condition is true.

df[df['city'].isin(['New York', 'Chicago'])]

Create a new column based on the values of multiple columns using apply and lambda

This code creates a new column called “age_location” in the DataFrame df by applying a lambda function to each row. The lambda function takes the “age” and “city” values from the row, concatenates them into a string with a comma separator, and returns the result. The apply method is used to apply this function to each row along axis 1 (i.e., row-wise).

df['age_location'] = df.apply(lambda row: str(row['age']) + ', ' + row['city'], axis=1)

Merge two DataFrames using merge with different column names

This code creates two DataFrames (df1 and df2) and merges them together based on differently named key columns: the “name” column in df1 and the “employee_name” column in df2. It creates a new DataFrame (merged_df) containing the columns of both frames; note that both “name” and “employee_name” are kept in the result.

data1 = {'name': ['John', 'Sarah', 'David', 'Rachel'], 
         'age': [24, 29, 32, 27],
         'city': ['New York', 'Los Angeles', 'Chicago', 'Boston']}
df1 = pd.DataFrame(data1)

data2 = {'employee_name': ['John', 'Sarah', 'David', 'Rachel'],
         'salary': [50000, 60000, 70000, 80000]}
df2 = pd.DataFrame(data2)

merged_df = pd.merge(df1, df2, left_on='name', right_on='employee_name')

Apply a function to a DataFrame using applymap

This code defines a function format_age that takes an age value and returns a string with the age followed by “years”. It applies this function element-wise to a one-column slice of the DataFrame df using the applymap method, producing a new DataFrame with the formatted ages. (In recent pandas versions, applymap is deprecated in favor of DataFrame.map, which behaves the same way.)

def format_age(age):
    return str(age) + ' years'

df[['age']].applymap(format_age)

Merge two DataFrames using merge with a non-default join type

This code creates two DataFrames (df1 and df2) and merges them together based on the “name” column using a left join (how='left'). It creates a new DataFrame (merged_df) with columns for “name”, “age”, “city”, and “salary”. Every row of df1 is kept, and rows without a match in df2 (here, Rachel) get NaN in the “salary” column.

data1 = {'name': ['John', 'Sarah', 'David', 'Rachel'], 
         'age': [24, 29, 32, 27],
         'city': ['New York', 'Los Angeles', 'Chicago', 'Boston']}
df1 = pd.DataFrame(data1)

data2 = {'name': ['John', 'Sarah', 'David'],
         'salary': [50000, 60000, 70000]}
df2 = pd.DataFrame(data2)

merged_df = pd.merge(df1, df2, on='name', how='left')

Calculate summary statistics for a column using describe

This code calculates summary statistics for the “age” column of the DataFrame df, including the count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum.

df['age'].describe()

Create a new column based on a condition using np.where

This code creates a new column called “age_category” in the DataFrame df based on a condition: if the value in the “age” column is less than 30, the corresponding value in the “age_category” column is “young”, otherwise it is “old”. It uses the np.where function to apply this condition element-wise to the “age” column.

import numpy as np

df['age_category'] = np.where(df['age'] < 30, 'young', 'old')

Reshape a DataFrame using crosstab

This code creates a new column called “age_category” in the DataFrame df based on a condition (as in the previous example), and uses the crosstab function to reshape the data into a cross-tabulation. The resulting DataFrame shows the count of each combination of city and age category.

df['age_category'] = np.where(df['age'] < 30, 'young', 'old')
pd.crosstab(df['city'], df['age_category'])

Select a random sample of rows from a DataFrame using sample

This code selects a random sample of 3 rows from the DataFrame df using the sample method. The resulting DataFrame contains the selected rows in a random order.

df.sample(n=3)
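For reproducible sampling, pass a random_state seed; the frac parameter samples a fraction of the rows instead of a fixed count. A short sketch:

```python
import pandas as pd

df = pd.DataFrame({'age': [24, 29, 32, 27]})

# The same seed yields the same rows on every run.
s1 = df.sample(n=3, random_state=42)
s2 = df.sample(n=3, random_state=42)

half = df.sample(frac=0.5)  # sample 50% of the rows
```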

Calculate the correlation between columns using corr

This code calculates the correlation matrix between all pairs of numeric columns in the DataFrame df using the corr method. The resulting DataFrame shows the correlation coefficient between each pair of columns. In recent pandas versions, non-numeric columns must be excluded explicitly by passing numeric_only=True.

df.corr(numeric_only=True)

Create a new column based on a rolling window calculation using rolling

This code creates a new column called “rolling_mean_age” in the DataFrame df based on a rolling window calculation: it calculates the mean of the “age” column over a window of size 2, applying the calculation to each consecutive pair of rows using the rolling method.

df['rolling_mean_age'] = df['age'].rolling(window=2).mean()

Pivot a DataFrame using pivot

This code pivots the DataFrame df using the “name” column as the index, the “city” column as the columns, and the “age” column as the values. It returns a new DataFrame where the rows correspond to each unique name value, the columns correspond to each unique city value, and the values correspond to the corresponding age value.

df.pivot(index='name', columns='city', values='age')

Sort a DataFrame by one or more columns using sort_values

This code sorts the rows of the DataFrame df first by the “city” column in ascending order, and then by the “age” column in descending order, using the sort_values method.

df.sort_values(by=['city', 'age'], ascending=[True, False])

Create a new column based on a group-wise calculation using groupby and transform

This code creates a new column called “age_mean_by_city” in the DataFrame df by calculating the mean of the “age” column for each unique value in the “city” column, and broadcasting this value back to each row in the corresponding group using the transform method.

df['age_mean_by_city'] = df.groupby('city')['age'].transform('mean')

Filter a DataFrame based on the existence of null values using isnull

This code filters the rows of the DataFrame df based on whether any of the values in the row are null (i.e., missing), using the isnull method. It returns only the rows where this condition is true.

df[df.isnull().any(axis=1)]

Create a new column based on a group-wise calculation using groupby and apply

This code creates a new column called “age_diff_from_mean_by_city” in the DataFrame df by applying a custom function age_diff to each group of rows, grouped by the unique values in the “city” column. The age_diff function takes a group of rows and returns a Series with the difference between each row’s age and the group’s mean age. Passing group_keys=False keeps the original row index on the result, so it aligns with df when assigned.

def age_diff(group):
    return group['age'] - group['age'].mean()

df['age_diff_from_mean_by_city'] = df.groupby('city', group_keys=False).apply(age_diff)

Create a new column based on the difference between consecutive rows using diff

This code creates a new column called “age_diff” in the DataFrame df by calculating the difference between each value in the “age” column and the previous value using the diff method. The first value in the new column is NaN, since there is no previous value to calculate the difference with.

df['age_diff'] = df['age'].diff()

Create a new column based on a group-wise calculation using groupby and agg

This code groups the rows of the DataFrame df by the unique values in the “city” column and applies a different aggregate function to each of two columns using the agg method (this assumes df also has a “salary” column, as in the merge examples above). It calculates the mean of the “age” column and the sum of the “salary” column for each group, and returns a new DataFrame with columns for “age” and “salary”.

df.groupby('city').agg({'age': 'mean', 'salary': 'sum'})

Filter a DataFrame based on the non-existence of null values using notnull

This code filters the rows of the DataFrame df based on whether all of the values in the row are not null (i.e., not missing), using the notnull method. It returns only the rows where this condition is true.

df[df.notnull().all(axis=1)]

Create a new column based on a group-wise calculation using groupby and filter

This code groups the rows of the DataFrame df by the unique values in the “city” column and keeps only the groups that contain at least one member younger than 30. The custom function has_young_member takes a group of rows and returns a boolean indicating whether any row in the group has an age less than 30; the filter method applies it to each group and drops the groups where it returns False.

def has_young_member(group):
    return (group['age'] < 30).any()

df.groupby('city').filter(has_young_member)

Calculate a group-wise aggregate using pivot_table

This code groups the rows of the DataFrame df by the unique values in the “city” column and calculates the mean of the “age” column for each group using the pivot_table function. It returns a new DataFrame indexed by city with a single “age” column.

pd.pivot_table(df, index='city', values='age', aggfunc='mean')

Filter a DataFrame based on a regular expression using str.contains

This code filters the rows of the DataFrame df based on whether the value in the “city” column contains the substrings “New” or “Los”, using the str.contains method. It returns only the rows where this condition is true.

df[df['city'].str.contains('New|Los')]
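Two options worth knowing here, shown on a small hypothetical frame: case=False makes the match case-insensitive, and na=False treats missing values as non-matches rather than producing NaN entries that break boolean indexing.

```python
import pandas as pd

df = pd.DataFrame({'city': ['New York', 'los angeles', None, 'Chicago']})

# Case-insensitive match; rows with a missing city are simply excluded.
mask = df['city'].str.contains('new|los', case=False, na=False)
filtered = df[mask]
```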

In this article, I have provided a comprehensive cheat sheet of Pandas coding examples, covering a wide range of data manipulation and analysis tasks.

Regardless of your level of experience with Pandas, these examples serve as a valuable resource for working with tabular data in Python.

By deepening your command of Pandas, you can efficiently manipulate and analyze large datasets, enabling you to extract insights and make informed decisions.

I encourage you to try out these examples and continue exploring the capabilities of Pandas for your data analysis needs.

If you found this article helpful, please consider giving it a round of applause and subscribing to my newsletter for more data science tips and resources. Thank you for reading!

