The Top 10 Most Frequently Used Pandas Functions in Data Science

1. read_csv()

This function loads data from a CSV (comma-separated values) file into a pandas DataFrame. It is one of the most commonly used pandas functions. By default it reads the entire file into memory, so it works best for datasets that fit comfortably in RAM.

import pandas as pd
df = pd.read_csv('data.csv')
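As a minimal sketch of a few commonly used options, the example below reads from an in-memory string instead of a real file. The CSV content and column names are made up for illustration; usecols and parse_dates are standard read_csv parameters.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'.
csv_text = """ID,Name,Age,Joined
1,Alice,25,2021-01-15
2,Bob,30,2020-06-01
3,Charlie,35,2019-11-30
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["ID", "Name", "Joined"],  # load only the columns you need
    parse_dates=["Joined"],            # parse this column as datetimes
)
print(df.dtypes)
```

Loading only the needed columns is one simple way to reduce memory use on wider files.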

2. head()

This function returns the first few rows of a DataFrame. It is often used to quickly inspect the data and get a feel for its structure and content. Its counterpart tail() does the same for the last few rows. Both accept an integer specifying how many rows to return; the default is 5.

df.head()
df.tail(10)
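A small self-contained sketch of both calls, using a made-up 20-row frame. It also shows that a negative argument to head() returns everything except the last n rows.

```python
import pandas as pd

# Hypothetical frame with 20 rows, values 0..19.
df = pd.DataFrame({"value": range(20)})

first_five = df.head()            # default: first 5 rows
last_ten = df.tail(10)            # last 10 rows
all_but_last_three = df.head(-3)  # negative n: drop the last 3 rows

print(len(first_five), len(last_ten), len(all_but_last_three))
```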

3. info()

This function prints a concise summary of a DataFrame, including the data type of each column and the number of non-null values:

>>> import pandas as pd
>>> df = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
>>> df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
ID      3 non-null int64
Name    3 non-null object
Age     3 non-null int64
dtypes: int64(2), object(1)
memory usage: 152.0+ bytes

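This output shows that the DataFrame has 3 rows (RangeIndex: 3 entries) and 3 columns (Data columns (total 3 columns)). It also displays the data type of each column (dtypes) and the amount of memory used to store the DataFrame (memory usage). The exact layout varies by pandas version; recent releases print the columns as a numbered table with "Non-Null Count" and "Dtype" headers.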

4. describe()

This function is used to generate descriptive statistics for a DataFrame. It returns a summary of the central tendency, dispersion, and shape of the distribution of the data, excluding null values.

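The describe() function behaves differently for numeric and non-numeric columns. For numeric columns, you get the count, mean, standard deviation, minimum, quartiles, and maximum. For object (string) or categorical columns, you get the count, the number of unique values, the most frequent value (top), and how often it occurs (freq). On a DataFrame with mixed types, describe() summarizes only the numeric columns by default; pass include='all' or select the non-numeric columns to see the others.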

Numeric

>>> import pandas as pd
>>> df = pd.DataFrame({'ID': [1, 2, 3], 'Age': [25, 30, 35]})
>>> df.describe()
        ID   Age
count  3.0   3.0
mean   2.0  30.0
std    1.0   5.0
min    1.0  25.0
25%    1.5  27.5
50%    2.0  30.0
75%    2.5  32.5
max    3.0  35.0

Categorical

>>> import pandas as pd
>>> df = pd.DataFrame({'ID': [1, 2, 3], 'Gender': ['Female', 'Male', 'Female']})
>>> df[['Gender']].describe()
        Gender
count        3
unique       2
top     Female
freq         2
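As a hedged illustration of how describe() treats a frame that mixes numeric and string columns, the sketch below compares the default call with include="all". The frame reuses the ID and Gender columns from the example above.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3], "Gender": ["Female", "Male", "Female"]})

# Default: only numeric columns are summarized when any are present.
numeric_summary = df.describe()

# Selecting just the non-numeric column gives count/unique/top/freq.
gender_summary = df[["Gender"]].describe()

# include="all" summarizes every column; statistics that do not apply
# to a column are filled with NaN.
full_summary = df.describe(include="all")

print(numeric_summary.columns.tolist())
print(gender_summary)
print(full_summary)
```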

5. groupby()

This function is used to group a DataFrame by one or more columns and apply a function to each group. It is often used to perform aggregation or transformation operations on the data.

df.groupby('Category').mean(numeric_only=True)  # numeric_only avoids errors on non-numeric columns in pandas 2.x

To apply a different aggregation to each column in a single call, you can use .agg() like so:

df.groupby('Category').agg({
  'col1':'count',
  'col2': 'mean',
  'col3': 'median',
  'col4': 'max'})
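A runnable variant of the same idea, sketched with made-up data. It uses named aggregation (keyword arguments of the form new_name=(column, function)), which also lets you choose the output column names. The Category, Quantity, and Price columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sales data.
df = pd.DataFrame({
    "Category": ["A", "A", "B", "B", "B"],
    "Quantity": [1, 3, 2, 4, 6],
    "Price": [10.0, 20.0, 5.0, 5.0, 8.0],
})

# One row per category, one output column per named aggregation.
summary = df.groupby("Category").agg(
    n_rows=("Quantity", "count"),
    mean_quantity=("Quantity", "mean"),
    max_price=("Price", "max"),
)
print(summary)
```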

6. pivot_table()

This function creates a pivot table from a DataFrame. It lets you summarize and aggregate data along one or more dimensions and display the results in a tabular format. It is similar to building a pivot table in Excel, where you drag the relevant columns from the field list into the rows, columns, and values areas.

pd.pivot_table(df, index='Category', values='Sales', aggfunc='mean')
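To make the Excel analogy concrete, here is a minimal sketch with a second dimension placed on the columns. The data and the Region column are made up for illustration; margins=True adds "All" totals.

```python
import pandas as pd

# Hypothetical sales data with two dimensions.
df = pd.DataFrame({
    "Category": ["A", "A", "B", "B"],
    "Region": ["East", "West", "East", "West"],
    "Sales": [100, 200, 300, 500],
})

# Rows = Category, columns = Region, cells = total Sales.
table = pd.pivot_table(
    df,
    index="Category",
    columns="Region",
    values="Sales",
    aggfunc="sum",
    margins=True,  # add an "All" row and column with totals
)
print(table)
```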

7. sort_values()

This function is used to sort a DataFrame by one or more columns. It is often used to rearrange the rows of a DataFrame in a particular order.

df.sort_values(by='Sales', ascending=False)
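Sorting by several columns with a different direction for each can be sketched as follows. The data is made up. Note that sort_values() returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "Category": ["B", "A", "B", "A"],
    "Sales": [300, 100, 500, 200],
})

# Category ascending, then Sales descending within each category.
sorted_df = (
    df.sort_values(by=["Category", "Sales"], ascending=[True, False])
      .reset_index(drop=True)  # renumber rows after sorting
)
print(sorted_df)
```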

8. merge()

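This function merges two DataFrames on a common column or set of columns. It is similar to a SQL JOIN and is often used to combine data from multiple sources. The how argument selects the join type ('inner', 'left', 'right', or 'outer'); to combine more than two DataFrames, chain several merge() calls.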

df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
df3 = df1.merge(df2, on='ID', how='left')
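A self-contained sketch of a left join, using two made-up frames in place of the CSV files. The indicator=True option adds a _merge column that records where each row came from, which can help when checking for unmatched keys.

```python
import pandas as pd

# Hypothetical stand-ins for data1.csv and data2.csv.
customers = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
orders = pd.DataFrame({"ID": [1, 1, 3, 4], "Sales": [50, 70, 20, 90]})

# Left join: keep every customer, attach matching orders where they exist.
merged = customers.merge(orders, on="ID", how="left", indicator=True)
print(merged)
```

In this sketch, customer 2 has no orders and so gets a missing Sales value, customer 1 appears twice because it matches two orders, and order ID 4 is dropped because it has no matching customer.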

9. apply()

This function is used to apply a function to each row or column of a DataFrame. It is a flexible and powerful tool for data transformation and manipulation.

df['Sales'] = df.apply(lambda row: row['Quantity'] * row['Price'], axis=1)
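For simple arithmetic like this, the same result can usually be obtained with vectorized column operations, which avoid calling a Python function once per row. A minimal comparison, with made-up data:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"Quantity": [2, 5, 1], "Price": [10.0, 3.0, 99.5]})

# Row-wise apply: the lambda is called once per row.
via_apply = df.apply(lambda row: row["Quantity"] * row["Price"], axis=1)

# Vectorized: the multiplication runs on whole columns at once.
vectorized = df["Quantity"] * df["Price"]

print(via_apply.equals(vectorized))
```

apply() remains useful when the per-row logic cannot be expressed with built-in column operations.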

10. to_csv()

This function saves a DataFrame to a CSV file, typically to persist the results of an analysis or transformation. Passing index=False leaves the row index out of the file.

df.to_csv('output.csv', index=False)
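A quick way to see what to_csv() writes is to call it without a path, in which case it returns the CSV text as a string. The sketch below uses a made-up frame and reads the text back to check the round trip.

```python
import io
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"ID": [1, 2], "Name": ["Alice", "Bob"]})

# With no path, to_csv returns the CSV content as a string.
csv_text = df.to_csv(index=False)
print(csv_text)

# Reading it back recovers the same columns and values.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```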

Bonus: cut() & qcut()

These two functions are very handy for manipulating and making sense of numeric data. cut() and qcut() divide a continuous variable into a set of discrete bins or categories.

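With pd.cut() you define the bin edges yourself (or ask for a number of equal-width bins), while pd.qcut() builds quantile-based bins that each hold roughly the same number of observations.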

df['custom_bins'] = pd.cut(df.Sales, [-1, 0, 1, 100])
df['equal_bins'] = pd.qcut(df.Sales, 3)

pd.qcut() raises an error when the computed quantile edges are not unique, which happens when many values are identical. Passing duplicates='drop' removes the repeated edges, so you may get fewer bins than requested.
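The sketch below illustrates both functions on made-up series, including a case where qcut() needs duplicates='drop'. The bin edges and labels are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical sales values.
sales = pd.Series([0, 5, 20, 45, 80, 100])

# cut: custom edges, intervals are (-1, 10], (10, 50], (50, 100].
custom = pd.cut(sales, bins=[-1, 10, 50, 100], labels=["low", "mid", "high"])

# qcut: three quantile-based bins; labels=False returns integer bin codes.
equal = pd.qcut(sales, 3, labels=False)

# A skewed series where several quantile edges coincide at 0.
skewed = pd.Series([0, 0, 0, 0, 1, 2])
try:
    pd.qcut(skewed, 4)
    raised = False
except ValueError:
    raised = True  # bin edges are not unique

# duplicates='drop' removes the repeated edges, leaving fewer bins.
dropped = pd.qcut(skewed, 4, duplicates="drop")

print(custom.tolist())
print(equal.tolist())
print(raised, len(dropped.cat.categories))
```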

Conclusion

In this blog post, we looked at the top 10 (+1) most frequently used pandas functions in the data science community. These functions are essential tools for working with data in Python and are widely used by data scientists, analysts, and researchers. Whether you’re just getting started with pandas or are a seasoned pro, these functions are sure to come in handy in your data projects.