Pandas Coding Examples for Efficient Data Manipulation and Analysis in Python
It is no secret that data is the lifeblood of modern business, and the ability to manipulate and analyze data is a key skill for any data scientist or machine learning practitioner.
The Pandas library for Python has become the go-to tool for data manipulation and analysis, providing a wide range of powerful functions for working with tabular data.
In this article, I will provide a cheat sheet of Pandas coding examples, covering a broad range of topics including data filtering, aggregation, merging, and reshaping.
También te puede interesarCómo clasificar documentos para una consulta usando NLPWhether you’re a beginner or an experienced Python user, this cheat sheet is a valuable resource for anyone working with data in Python.
So, let’s dive in and explore the power of Pandas for data science and machine learning!
Importing the Pandas library and creating a DataFrame
This code imports the Pandas library, creates a dictionary of data, and converts it into a DataFrame using the pd.DataFrame()
method. It creates a table with columns named «name», «age», and «city», and adds four rows of data.
import pandas as pd
data = {'name': ['John', 'Sarah', 'David', 'Rachel'],
'age': [24, 29, 32, 27],
'city': ['New York', 'Los Angeles', 'Chicago', 'Boston']}
df = pd.DataFrame(data)
Selecting specific columns of a DataFrame
This code selects only the “name” and “age” columns from the DataFrame df
.
df[['name', 'age']]
Filtering rows based on a condition
This code filters the rows of the DataFrame df
based on the condition that the «age» column is greater than 25. It returns only the rows where this condition is true.
df[df['age'] > 25]
Sorting a DataFrame by a specific column
This code sorts the DataFrame df
by the «age» column in ascending order. To sort in descending order, use df.sort_values('age', ascending=False)
instead.
df.sort_values('age')
Renaming columns in a DataFrame
This code renames the “name” column to “full_name” and the “city” column to “location” in the DataFrame df
.
df.rename(columns={'name': 'full_name', 'city': 'location'})
Grouping a DataFrame by a column and aggregating another column
This code groups the DataFrame df
by the «city» column and calculates the mean of the «age» column for each group. It returns a new Series object with the cities as the index and the mean ages as the values.
df.groupby('city')['age'].mean()
Merging two DataFrames based on a common column
This code creates two DataFrames (df1
and df2
) and merges them together based on the «name» column. It creates a new DataFrame (merged_df
) with columns for «name», «age», and «salary».
data1 = {'name': ['John', 'Sarah', 'David', 'Rachel'],
'age': [24, 29, 32, 27],
'city': ['New York', 'Los Angeles', 'Chicago', 'Boston']}
df1 = pd.DataFrame(data1)data2 = {'name': ['John', 'Sarah', 'David', 'Rachel'],
'salary': [50000, 60000, 70000, 80000]}
df2 = pd.DataFrame(data2)
merged_df = pd.merge(df1, df2, on='name')
Adding a new column to a DataFrame
This code adds a new column called “gender” to the DataFrame df
with values «M», «F», «M», and «F» for each row, respectively.
df['gender'] = ['M', 'F', 'M', 'F']
Handling missing values in a DataFrame
These three lines of code demonstrate different ways to handle missing values in a DataFrame. The first line drops any rows that contain missing values, the second line fills missing values with 0, and the third line fills missing values using interpolation.
df.dropna() # drops rows with any missing values
df.fillna(0) # fills missing values with 0
df.interpolate() # fills missing values using interpolation
Saving a DataFrame to a CSV file
This code saves the DataFrame df
to a CSV file called «my_dataframe.csv». The index=False
argument specifies that the row index should not be included in the output.
df.to_csv('my_dataframe.csv', index=False)
Selecting rows and columns of a DataFrame using loc
This code selects the rows with index values 1, 2, and 3 and the “name” and “city” columns from the DataFrame df
. It uses the loc
accessor to specify the row and column selections.
df.loc[1:3, ['name', 'city']]
Pivot a DataFrame using pivot_table
This code pivots the DataFrame df
using the «city» column as the index, the «gender» column as the columns, and the «age» column as the values. It calculates the mean age for each combination of city and gender.
df.pivot_table(index='city', columns='gender', values='age', aggfunc='mean')
Apply a function to a column using apply
This code defines a function add_one
that adds 1 to its argument and applies this function to the «age» column of the DataFrame df
using the apply
method. It creates a new column called «age_plus_one» with the result.
def add_one(x):
return x + 1df['age_plus_one'] = df['age'].apply(add_one)
Merge two DataFrames using join
This code creates two DataFrames (df1
and df2
) and joins them together based on the «name» column. It creates a new DataFrame (joined_df
) with columns for «name», «age», and «salary».
data1 = {'name': ['John', 'Sarah', 'David', 'Rachel'],
'age': [24, 29, 32, 27],
'city': ['New York', 'Los Angeles', 'Chicago', 'Boston']}
df1 = pd.DataFrame(data1)data2 = {'salary': [50000, 60000, 70000, 80000]}
df2 = pd.DataFrame(data2, index=['John', 'Sarah', 'David', 'Rachel'])
joined_df = df1.join(df2, on='name')
Count the occurrences of unique values in a column using value_counts
This code counts the occurrences of each unique value in the “city” column of the DataFrame df
, and returns the result as a Series object.
df['city'].value_counts()
Reshape a DataFrame using melt
This code creates a wide DataFrame wide_df
with columns «A», «B», and «C», and uses the melt
function to reshape it into a long format. The resulting DataFrame (melted_df
) has columns «A», «variable», and «value», where «variable» is the original column name («B» or «C»), and «value» is the corresponding cell value.
wide_df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
melted_df = pd.melt(wide_df, id_vars=['A'], value_vars=['B', 'C'])
Merge two DataFrames using merge on multiple columns
This code creates two DataFrames (df1
and `df2
data1 = {'name': ['John', 'Sarah', 'David', 'Rachel'],
'age': [24, 29, 32, 27],
'city': ['New York', 'Los Angeles', 'Chicago', 'Boston']}
df1 = pd.DataFrame(data1)data2 = {'name': ['John', 'Sarah', 'David', 'Rachel'],
'city': ['New York', 'Los Angeles', 'Chicago', 'Boston'],
'salary': [50000, 60000, 70000, 80000]}
df2 = pd.DataFrame(data2)
merged_df = pd.merge(df1, df2, on=['name', 'city'])
Filter a DataFrame by multiple conditions using & (and) and | (or)
This code filters the rows of the DataFrame df
based on two conditions using the & and | operators. It returns only the rows where the «age» column is greater than 25 AND the «city» column is equal to «Chicago».
df[(df['age'] > 25) & (df['city'] == 'Chicago')]
Reshape a DataFrame using stack and unstack
This code creates a wide DataFrame wide_df
with columns «A» and «B», and uses the stack
method to reshape it into a tall format with a hierarchical index. The resulting DataFrame (stacked_df
) has columns «A» and «B», and an index with two levels («a» and «b» in the first level, and «A» and «B» in the second level). The code then uses the unstack
method to reshape the DataFrame back into a wide format.
wide_df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['a', 'b'])
stacked_df = wide_df.stack()
unstacked_df = stacked_df.unstack()
Replace values in a column using replace
This code replaces the values in the “city” column of the DataFrame df
using a dictionary of replacements. It replaces «New York» with «NY» and «Los Angeles» with «LA».
df['city'] = df['city'].replace({'New York': 'NY', 'Los Angeles': 'LA'})
Calculate the cumulative sum of a column using cumsum
This code calculates the cumulative sum of the “age” column of the DataFrame df
using the cumsum
method. It creates a new column called «age_cumulative_sum» with the result.
df['age_cumulative_sum'] = df['age'].cumsum()
Filter a DataFrame based on the existence of a value in a column using isin
This code filters the rows of the DataFrame df
based on whether the value in the «city» column is one of the values in the list [‘New York’, ‘Chicago’]. It returns only the rows where this condition is true.
df[df['city'].isin(['New York', 'Chicago'])]
Create a new column based on the values of multiple columns using apply and lambda
This code creates a new column called “age_location” in the DataFrame df
by applying a lambda function to each row. The lambda function takes the «age» and «city» values from the row, concatenates them into a string with a comma separator, and returns the result. The apply
method is used to apply this function to each row along axis 1 (i.e., row-wise).
df['age_location'] = df.apply(lambda row: str(row['age']) + ', ' + row['city'], axis=1)
Merge two DataFrames using merge with different column names
This code creates two DataFrames (df1
and df2
) and merges them together based on different column names: the «name» column in df1
and the «employee_name» column in df2
. It creates a new DataFrame (merged_df
) with columns for «name», «age», and «salary».
data1 = {'name': ['John', 'Sarah', 'David', 'Rachel'],
'age': [24, 29, 32, 27],
'city': ['New York', 'Los Angeles', 'Chicago', 'Boston']}
df1 = pd.DataFrame(data1)data2 = {'employee_name': ['John', 'Sarah', 'David', 'Rachel'],
'salary': [50000, 60000, 70000, 80000]}
df2 = pd.DataFrame(data2)
merged_df = pd.merge(df1, df2, left_on='name', right_on='employee_name')
Apply a function to a DataFrame using applymap
This code defines a function format_age
that takes an age value and returns a string with the age followed by «years». It applies this function to the «age» column of the DataFrame df
using the applymap
method, and creates a new DataFrame with the formatted ages.
def format_age(age):
return str(age) + ' years'df[['name', 'age']].applymap(format_age)
Merge two DataFrames using merge with a non-default join type
This code creates two DataFrames (df1
and df2
) and merges them together based on the «name» column using a left join (how='left'
). It creates a new DataFrame (merged_df
) with columns for «name», «age», and «salary». The rows in merged_df
correspond to the rows in df1
, and any missing values in the «salary» column are filled with NaN.
data1 = {'name': ['John', 'Sarah', 'David', 'Rachel'],
'age': [24, 29, 32, 27],
'city': ['New York', 'Los Angeles', 'Chicago', 'Boston']}
df1 = pd.DataFrame(data1)data2 = {'name': ['John', 'Sarah', 'David'],
'salary': [50000, 60000, 70000]}
df2 = pd.DataFrame(data2)
merged_df = pd.merge(df1, df2, on='name', how='left')
Calculate summary statistics for a column using describe
This code calculates summary statistics for the “age” column of the DataFrame df
, including the count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum.
df['age'].describe()
Create a new column based on a condition using np.where
This code creates a new column called “age_category” in the DataFrame df
based on a condition: if the value in the «age» column is less than 30, the corresponding value in the «age_category» column is «young», otherwise it is «old». It uses the np.where
function to apply this condition element-wise to the «age» column.
import numpy as npdf['age_category'] = np.where(df['age'] < 30, 'young', 'old')
Reshape a DataFrame using crosstab
This code creates a new column called “age_category” in the DataFrame df
based on a condition (as in example 28), and uses the crosstab
function to reshape the DataFrame into a cross-tabulation format. The resulting DataFrame shows the count of each combination of city and age category.
df['age_category'] = np.where(df['age'] < 30, 'young', 'old')
pd.crosstab(df['city'], df['age_category'])
Select a random sample of rows from a DataFrame using sample
This code selects a random sample of 3 rows from the DataFrame df
using the sample
method. The resulting DataFrame contains the selected rows in a random order.
df.sample(n=3)
Calculate the correlation between columns using corr
This code calculates the correlation matrix between all pairs of columns in the DataFrame df
using the corr
method. The resulting DataFrame shows the correlation coefficient between each pair of columns.
df.corr()
Create a new column based on a rolling window calculation using rolling
This code creates a new column called “rolling_mean_age” in the DataFrame df
based on a rolling window calculation: it calculates the mean of the «age» column over a window of size 2, and applies this calculation to each consecutive pair of rows using the rolling
method.
df['rolling_mean_age'] = df['age'].rolling(window=2).mean()
Pivot a DataFrame using pivot
This code pivots the DataFrame df
using the «name» column as the index, the «city» column as the columns, and the «age» column as the values. It returns a new DataFrame where the rows correspond to each unique name value, the columns correspond to each unique city value, and the values correspond to the corresponding age value.
df.pivot(index='name', columns='city', values='age')
Sort a DataFrame by one or more columns using sort_values
This code sorts the rows of the DataFrame df
first by the «city» column in ascending order, and then by the «age» column in descending order, using the sort_values
method.
df.sort_values(by=['city', 'age'], ascending=[True, False])
Create a new column based on a group-wise calculation using groupby and transform
This code creates a new column called “age_mean_by_city” in the DataFrame df
by calculating the mean of the «age» column for each unique value in the «city» column, and applying this calculation to each row in the corresponding group using the transform
method.
df['age_mean_by_city'] = df.groupby('city')['age'].transform('mean')
Filter a DataFrame based on the existence of null values using isnull
This code filters the rows of the DataFrame df
based on whether any of the values in the row are null (i.e., missing), using the isnull
method. It returns only the rows where this condition is true.
df[df.isnull().any(axis=1)]
Create a new column based on a group-wise calculation using groupby and apply
This code creates a new column called “age_diff_from_mean_by_city” in the DataFrame df
by applying a custom function age_diff
to each group of rows in df
based on the unique values in the «city» column. The age_diff
function takes a group of rows as its argument, and returns a Series object with the age difference between each row and the mean age of the group. The groupby
method is used to group the rows by the «city» column, and the apply
method is used to apply the age_diff
function to each group.
def age_diff(group):
return group['age'] - group['age'].mean()df['age_diff_from_mean_by_city'] = df.groupby('city').apply(age_diff)
Create a new column based on the difference between consecutive rows using diff
This code creates a new column called “age_diff” in the DataFrame df
by calculating the difference between each value in the «age» column and the previous value using the diff
method. The first value in the new column is NaN, since there is no previous value to calculate the difference with.
df['age_diff'] = df['age'].diff()
Create a new column based on a group-wise calculation using groupby and agg
This code groups the rows of the DataFrame df
by the unique values in the «city» column, and applies two different aggregate functions to the «age» and «salary» columns using the agg
method. It calculates the mean of the «age» column and the sum of the «salary» column for each group, and returns a new DataFrame with columns for «age» and «salary».
df.groupby('city').agg({'age': 'mean', 'salary': 'sum'})
Filter a DataFrame based on the non-existence of null values using notnull
This code filters the rows of the DataFrame df
based on whether all of the values in the row are not null (i.e., not missing), using the notnull
method. It returns only the rows where this condition is true.
df[df.notnull().all(axis=1)]
Create a new column based on a group-wise calculation using groupby and filter
This code groups the rows of the DataFrame df
by the unique values in the «city» column, and filters the groups based on whether they contain any members with an age less than 30, using the custom function has_young_member
. The has_young_member
function takes a group of rows as its argument, and returns a boolean value indicating whether any row in the group has an age less than 30. The groupby
method is used to group the rows by the «city» column, and the filter
method is used to apply the has_young_member
function to each group.
def has_young_member(group):
return (group['age'] < 30).any()df.groupby('city').filter(has_young_member)
Create a new column based on a group-wise calculation using groupby and pivot_table
This code groups the rows of the DataFrame df
by the unique values in the «city» column, and calculates the mean of the «age» column for each group using the pivot_table
function. It returns a new DataFrame with columns for «city» and «age».
pd.pivot_table(df, index='city', values='age', aggfunc=np.mean)
Filter a DataFrame based on a regular expression using str.contains
This code filters the rows of the DataFrame df
based on whether the value in the «city» column contains the substrings «New» or «Los», using the str.contains
method. It returns only the rows where this condition is true.
df[df['city'].str.contains('New|Los')]
In this article, I have provided a comprehensive cheat sheet of Pandas coding examples, covering a wide range of data manipulation and analysis tasks.
Regardless of your level of experience with Pandas, these examples serve as a valuable resource for working with tabular data in Python, these examples serve as a valuable resource for working with tabular data in Python.
These examples serve as a valuable resource for working with tabular data in Python.
By increasing your ability of Pandas, you can efficiently manipulate and analyze large datasets, enabling you to extract insights and make informed decisions.
I encourage you to try out these examples and continue exploring the capabilities of Pandas for your data analysis needs.
If you found this article helpful, please consider giving it a round of applause and subscribing to my newsletter for more data science tips and resources. Thank you for reading!
[post_relacionado id=»1817″]