Pandas

Posted September 28, 2022

Updated June 8, 2024

By Ernie

Pandas is a Python package that assists in data analysis and manipulation. It provides data structures and functions to work with structured data. It is based on the NumPy library and provides fast and efficient operations on arrays, data frames, and series.

This tool lets you easily clean, transform, visualize, and analyze large datasets. It is widely used for data pre-processing and exploration in fields such as finance, economics, and statistics.

Pandas vs. NumPy

The Pandas library is primarily used for data analysis, while NumPy focuses on numerical values and mathematical computations. Both simplify typical matrix operations and are widely used in data science, research, and machine learning model development. The following table shows their main differences:

Pandas	NumPy
Can hold heterogeneous data (a different type per column).	Holds homogeneous data (one type per array).
Operates on tabular data; a DataFrame has two dimensions.	Built for numeric computing on n-dimensional arrays.
Consumes more memory.	Consumes less memory.
Built on top of NumPy.	Learning NumPy before Pandas can be advantageous.
Key tools: the DataFrame and the Series.	Key tool: the array (ndarray).
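
The first row of the table can be checked with a few lines of Python. This is a minimal sketch with made-up values; the column names count, price, and label are illustrative only.

```python
import numpy as np
import pandas as pd

# A NumPy array stores a single dtype; mixing numbers and text
# forces every element to a common type (here, strings).
arr = np.array([1, 2.5, "three"])
print(arr.dtype)

# A DataFrame keeps one dtype per column, so mixed data is fine.
df = pd.DataFrame({"count": [1, 2, 3],
                   "price": [1.5, 2.5, 3.5],
                   "label": ["a", "b", "c"]})
print(df.dtypes)

# The numeric columns can still be handed to NumPy as a 2-D array.
values = df[["count", "price"]].to_numpy()
print(values.dtype, values.shape)
```

Because Pandas sits on top of NumPy, moving between the two with to_numpy() is a common step before numerical work.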

Quick Installation Guide

This short guide shows how to install Pandas; the same steps apply to any Python library. Pandas ships with some Jupyter setups, such as the Anaconda distribution, but not with every one. To list all locally installed packages and check whether it is present, use the following command:

pip list --format=columns

If the library is missing, run “!pip install library_name” in a Jupyter Notebook code cell, run the same command without the leading “!” in a regular shell, or follow the appropriate installation procedure for the IDE you’re using.

!pip install pandas

Once it is installed, use the following statement to import the library into your code:

import pandas as pd

where pd is its standard alias. You can choose another alias, but pd is the most widely adopted and the one you will see in most code.

The import statement for Pandas library.

Pandas Fundamentals

A DataFrame can be created from various data sources through the pd.DataFrame() constructor, whose parameters are data, index, columns, dtype, and copy. The example below builds one from a dictionary:

import pandas as pd

# Create DataFrame from dictionary
data = {'Name': ['John', 'Jane', 'Jim', 'Joan'],
        'Age': [32, 29, 40, 35],
        'Country': ['USA', 'UK', 'Canada', 'Australia']}
df = pd.DataFrame(data)
print(df)
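
The remaining constructor parameters can be sketched in the same way. The example below is an illustration under assumptions: the height values and the names arr and people are made up, and a NumPy array is used as the source so that the effect of copy is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, used only for illustration.
arr = np.array([[32, 1.80], [29, 1.65], [40, 1.75]])

people = pd.DataFrame(
    data=arr,                       # the values themselves
    index=["John", "Jane", "Jim"],  # row labels
    columns=["Age", "Height"],      # column labels
    dtype="float64",                # one dtype applied to every column
    copy=True,                      # copy the input instead of sharing it
)

# Because of copy=True, changing the source array afterwards
# does not change the DataFrame.
arr[0, 0] = 99
print(people)
print(people.loc["John", "Age"])
```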

A DataFrame can also hold columns of different data types. The two objects below, DF1 and DF2, are used in the rest of the examples.

# DataFrame objects
DF1 = pd.DataFrame({
    'ID': [1.1, 2.1, 3.1, 4.1, 5.1, 6.1],
    'Dt1': [10, 20, 30, 40, 50, 60],
    'Dt2': [11, 22, 33, 44, 55, 66],
    'Dt3': ['Flame1', 'Flame2', 'Flame3', 'Flame4', 'Flame5', 'Flame6']
})

DF2 = pd.DataFrame({
    'ID': [1.1, 2.1, 3.1, 4.1, 5],
    'Dt3': [100, 200, 300, 400, 500],
    'Dt5': [101, 202, 303, 404, 505],
    'Dt6': ['Flame7', 'Flame8', 'Flame9', 'Flame10', 'Flame11']
})

Concatenating DataFrame Objects

Use the “pd.concat()” function to combine Series and DataFrame objects. For example, DF1 and DF2 can be joined vertically (along axis 0) as follows. Columns that exist in only one of the objects are filled with NaN.

CombinedDF=pd.concat([DF1,DF2],axis=0)
CombinedDF

-	ID	 Dt1	  Dt2	Dt3      Dt5	 Dt6
0	1.1	10.0	11.0	Flame1	NaN	NaN
1	2.1	20.0	22.0	Flame2	NaN	NaN
2	3.1	30.0	33.0	Flame3	NaN	NaN
3	4.1	40.0	44.0	Flame4	NaN	NaN
4	5.1	50.0	55.0	Flame5	NaN	NaN
5	6.1	60.0	66.0	Flame6	NaN	NaN
0	1.1	NaN	NaN	  100	  101.0	Flame7
1	2.1	NaN	NaN	  200	  202.0	Flame8
2	3.1	NaN	NaN	  300	  303.0	Flame9
3	4.1	NaN	NaN	  400	  404.0	Flame10
4	5.0	NaN	NaN	  500	  505.0	Flame11

To combine them horizontally (along axis 1), set axis=1. Rows are aligned on the index, so the row that exists only in DF1 is filled with NaN.

CombinedDF=pd.concat([DF1,DF2],axis=1)
CombinedDF

-	ID	Dt1	Dt2	Dt3	  ID	  Dt3	  Dt5	  Dt6
0	1.1	10	11	Flame1	1.1	100.0	101.0	Flame7
1	2.1	20	22	Flame2	2.1	200.0	202.0	Flame8
2	3.1	30	33	Flame3	3.1	300.0	303.0	Flame9
3	4.1	40	44	Flame4	4.1	400.0	404.0	Flame10
4	5.1	50	55	Flame5	5.0	500.0	505.0	Flame11
5	6.1	60	66	Flame6	NaN	 NaN	   NaN	   NaN

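If the labels along the concatenation axis are not meaningful, you can discard them with ignore_index=True. With axis=1 that axis holds the column names, so they are replaced by the integers 0 to 7: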

CombinedDF=pd.concat([DF1,DF2],axis=1,ignore_index=True, sort=False)
CombinedDF

-	0	1	2	3	4	5	6	7
0	1.1	10	11	Flame1	1.1	100.0	101.0	Flame7
1	2.1	20	22	Flame2	2.1	200.0	202.0	Flame8
2	3.1	30	33	Flame3	3.1	300.0	303.0	Flame9
3	4.1	40	44	Flame4	4.1	400.0	404.0	Flame10
4	5.1	50	55	Flame5	5.0	500.0	505.0	Flame11
5	6.1	60	66	Flame6	NaN	 NaN	  NaN	   NaN
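
With axis=0 the same flag renumbers the rows instead of the columns. The sketch below uses two small stand-in frames (a and b, hypothetical values) rather than DF1 and DF2, and also shows the keys argument as an alternative that labels each block instead of discarding the labels.

```python
import pandas as pd

# Small stand-ins for DF1 and DF2 (hypothetical values).
a = pd.DataFrame({"ID": [1.1, 2.1], "Dt1": [10, 20]})
b = pd.DataFrame({"ID": [3.1, 4.1], "Dt1": [30, 40]})

# Default: the original row labels are kept, so 0 and 1 appear twice.
stacked = pd.concat([a, b], axis=0)
print(stacked.index.tolist())

# ignore_index=True renumbers the rows along the concatenation axis.
renumbered = pd.concat([a, b], axis=0, ignore_index=True)
print(renumbered.index.tolist())

# keys= labels each block instead, producing a two-level row index.
labelled = pd.concat([a, b], keys=["first", "second"])
print(labelled.loc["second"])
```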

Merging DataFrame Objects

Another way to combine data is the “pd.merge()” function. Once the reference column is indicated with the “on” argument, Pandas creates a new DataFrame whose rows are matched on the values of that column. By default this is an inner join, so only the ID values present in both objects are kept.

CombinedDF=pd.merge(DF1,DF2,on='ID')
CombinedDF

-	ID	Dt1	Dt2	Dt3_x	Dt3_y	Dt5	Dt6
0	1.1	10	11	Flame1	100	101	Flame7
1	2.1	20	22	Flame2	200	202	Flame8
2	3.1	30	33	Flame3	300	303	Flame9
3	4.1	40	44	Flame4	400	404	Flame10

Note that DF1 and DF2 share the column name “Dt3”; the merged DataFrame distinguishes the two columns by appending the suffixes “_x” and “_y”. Additionally, the “how” argument specifies which object's keys are kept in the result: “right” keeps every ID from DF2, “left” keeps every ID from DF1, and keys without a match are filled with NaN.

CombinedDF=pd.merge(DF1,DF2,how='right', on='ID')
CombinedDF

-	ID	 Dt1	 Dt2	 Dt3_x	Dt3_y	Dt5	Dt6
0	1.1	10.0	11.0	Flame1	100	101	Flame7
1	2.1	20.0	22.0	Flame2	200	202	Flame8
2	3.1	30.0	33.0	Flame3	300	303	Flame9
3	4.1	40.0	44.0	Flame4	400	404	Flame10
4	5.0	NaN	NaN	NaN	500	505	Flame11

CombinedDF=pd.merge(DF1,DF2,how='left', on='ID')
CombinedDF

-	ID	Dt1	Dt2	Dt3_x	 Dt3_y	 Dt5	 Dt6
0	1.1	10	11	Flame1	100.0	101.0	Flame7
1	2.1	20	22	Flame2	200.0	202.0	Flame8
2	3.1	30	33	Flame3	300.0	303.0	Flame9
3	4.1	40	44	Flame4	400.0	404.0	Flame10
4	5.1	50	55	Flame5	NaN	  NaN	  NaN
5	6.1	60	66	Flame6	NaN	  NaN	  NaN
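
Beyond “left” and “right”, pd.merge() also accepts how='outer', custom suffixes, and an indicator flag that records where each row came from. The following is a rough sketch on two reduced tables; the names left and right and their values are assumptions chosen to mirror DF1 and DF2.

```python
import pandas as pd

# Reduced, hypothetical versions of DF1 and DF2.
left = pd.DataFrame({"ID": [1.1, 2.1, 6.1],
                     "Dt3": ["Flame1", "Flame2", "Flame6"]})
right = pd.DataFrame({"ID": [1.1, 2.1, 5.0],
                      "Dt3": [100, 200, 500]})

merged = pd.merge(
    left, right,
    on="ID",
    how="outer",                    # keep the keys of both tables
    suffixes=("_left", "_right"),   # replace the default _x / _y
    indicator=True,                 # add a _merge column with the row origin
)
print(merged)
```

The _merge column takes the values both, left_only, or right_only, which makes unmatched keys easy to spot after a join.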

Visualization Options

Displaying a table is often the quickest way to understand a dataset. However, a condensed view may be preferable when dealing with large DataFrames. Pandas provides the options “display.max_rows” and “display.max_columns”, which set the maximum number of rows and columns displayed, respectively.

pd.set_option("display.max_rows", 3)
DF1

-	ID	Dt1	Dt2	Dt3
0	1.1	10	11	Flame1
...	...	...	...	...
5	6.1	60	66	Flame6
6 rows × 4 columns

In addition, the option “display.expand_frame_repr” controls whether the representation of a wide DataFrame is wrapped across multiple lines (True) or printed on a single wide line (False).

import pandas as pd
import numpy as np
DF3 = pd.DataFrame(np.random.randn(5, 10)).round(decimals=2)
pd.set_option("display.max_rows", 3)
pd.set_option("display.expand_frame_repr", False)
DF3

-	0	1	2	3	4	5	6	7	8	9
0	0.01	-0.04	-1.28	-0.35	-1.23	-0.33	0.77	-2.19	1.65	-0.37
...	...	...	...	...	...	...	...	...	...	...
4	-0.84	-1.03	-1.34	-0.39	1.68	-1.08	2.40	-1.76	0.11	-0.47
5 rows × 10 columns

Use “pd.reset_option("display.max_rows")” to restore the default value of an option.
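
When a display setting is only needed for a single output, pd.option_context() applies it temporarily and restores the previous value automatically. This is a minimal sketch; the frame wide and the variables before, inside, and after are illustrative names.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(np.arange(40).reshape(4, 10))

before = pd.get_option("display.max_rows")

# Inside the block the options apply; on exit the previous values return.
with pd.option_context("display.max_rows", 2, "display.max_columns", 4):
    inside = pd.get_option("display.max_rows")
    print(wide)

after = pd.get_option("display.max_rows")
print(before, inside, after)
```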

The “info()” method prints a summary of a DataFrame, including the total number of entries, the index type, the data type and number of non-null values of each column, and the memory usage. It is especially useful with large datasets, as it provides a quick way to understand their structure and content.

CombinedDF=pd.merge(DF1,DF2,how='left', on='ID')
CombinedDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      6 non-null      float64
 1   Dt1     6 non-null      int64  
 2   Dt2     6 non-null      int64  
 3   Dt3_x   6 non-null      object 
 4   Dt3_y   4 non-null      float64
 5   Dt5     4 non-null      float64
 6   Dt6     4 non-null      object 
dtypes: float64(3), int64(2), object(2)
memory usage: 384.0+ bytes

Null values

Null values occur when data is missing from a dataset. These missing values are typically represented as NaN. Pandas offers several useful functions for identifying, removing, and replacing null values in a DataFrame, including:

pd.isnull(): It returns a boolean object of the same shape as its input, with True wherever a value is null.

CombinedDF=pd.merge(DF1,DF2,how='left', on='ID')
pd.isnull(CombinedDF)

-	ID	   Dt1	  Dt2	 Dt3_x	 Dt3_y	 Dt5	  Dt6
0	False	False	False	False	False	False	False
1	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False
4	False	False	False	False	True	True	True
5	False	False	False	False	True	True	True

pd.notnull(): The inverse of pd.isnull(); it returns True wherever a value is not null.

CombinedDF=pd.merge(DF1,DF2,how='left', on='ID')
pd.notnull(CombinedDF)

-	ID	 Dt1	Dt2	Dt3_x	Dt3_y	Dt5	Dt6
0	True	True	True	True	True	True	True
1	True	True	True	True	True	True	True
2	True	True	True	True	True	True	True
3	True	True	True	True	True	True	True
4	True	True	True	True	False	False	False
5	True	True	True	True	False	False	False
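
The boolean masks above are usually summarized rather than read cell by cell. The sketch below, on a small made-up frame, counts the missing values per column and selects the rows that contain at least one gap.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps.
df = pd.DataFrame({"Dt1": [10, 20, 30],
                   "Dt5": [101.0, np.nan, np.nan],
                   "Dt6": ["Flame7", None, "Flame9"]})

mask = df.isna()                        # same result as pd.isnull(df)
per_column = mask.sum()                 # number of missing values per column
rows_with_gaps = df[mask.any(axis=1)]   # rows with at least one missing value

print(per_column)
print(rows_with_gaps)
```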

.dropna(): It drops the rows (by default) or columns that contain null values.

CombinedDF=pd.merge(DF1,DF2,how='left', on='ID')
CombinedDF.dropna()

-	ID	Dt1	Dt2	Dt3_x	 Dt3_y	 Dt5	 Dt6
0	1.1	10	11	Flame1	100.0	101.0	Flame7
1	2.1	20	22	Flame2	200.0	202.0	Flame8
2	3.1	30	33	Flame3	300.0	303.0	Flame9
3	4.1	40	44	Flame4	400.0	404.0	Flame10

.fillna(): It replaces NaN values with some other value defined by the user.

CombinedDF=pd.merge(DF1,DF2,how='left', on='ID')
CombinedDF.fillna(3)

-	ID	Dt1	Dt2	Dt3_x	 Dt3_y	 Dt5	 Dt6
0	1.1	10	11	Flame1	100.0	101.0	Flame7
1	2.1	20	22	Flame2	200.0	202.0	Flame8
2	3.1	30	33	Flame3	300.0	303.0	Flame9
3	4.1	40	44	Flame4	400.0	404.0	Flame10
4	5.1	50	55	Flame5	3.0	  3.0	   3
5	6.1	60	66	Flame6	3.0	  3.0	   3
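
Filling every column with the same value, as above, mixes the number 3 into the text column Dt6. As one possible alternative, fillna() accepts a dictionary that maps each column to its own replacement. The values and the "unknown" placeholder below are illustrative choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Dt5": [101.0, np.nan, 303.0],
                   "Dt6": ["Flame7", None, "Flame9"]})

# A dict maps each column to its own replacement value:
# the column mean for the numeric column, a placeholder for the text one.
filled = df.fillna({"Dt5": df["Dt5"].mean(), "Dt6": "unknown"})
print(filled)
```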

Highlights

Project Background

Project: Pandas
Author: Wes McKinney
Initial Release: 2008
Type: Technical Computing
License: New BSD License
Contains: DataFrame, data filtration and integration of missing data, time-series functionality, and hierarchical Axis Indexing
Language: Python, Cython, C
GitHub: /pandas-dev/pandas
Runs On: Windows, MacOS, and Linux
Twitter: /pandas_dev

Main Features

It provides two main data structures, Series (1-dimensional) and DataFrame (2-dimensional), for efficiently storing and manipulating data.
It helps users handle missing data, duplicates, and inconsistent data formats.
It provides functions for reshaping, merging, and transforming data, for aggregating and grouping it, and for computing descriptive statistics, performing data analysis, and visualizing results.
It makes it easy to read and write data from various file formats, including CSV, Excel, and SQL databases.
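
A few of the features listed above can be sketched in one short script. The data is made up, and an in-memory buffer stands in for a CSV file on disk so the example runs without touching the file system.

```python
import io
import pandas as pd

df = pd.DataFrame({"Country": ["USA", "UK", "USA", "UK"],
                   "Age": [32, 29, 40, 35]})

# Selecting one column returns a Series (1-dimensional).
ages = df["Age"]
print(type(ages).__name__)

# Aggregating and grouping: mean age per country.
mean_age = df.groupby("Country")["Age"].mean()
print(mean_age)

# Descriptive statistics for the numeric column.
print(ages.describe())

# Writing and reading CSV through an in-memory buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.equals(df))
```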

Prior Knowledge Requirements

Understanding of basic programming concepts such as data structures, loops, functions, and methods.
Familiarity with the Python language itself and its syntax.
Familiarity with the NumPy library and the use of arrays in data analysis is also helpful.

Projects and Libraries

GeoPandas: A project that extends the functionality of Pandas by adding support for geographic data manipulation and analysis.
Pyjanitor: It extends the functionality of Pandas with a simple way to clean messy datasets, transforming raw data into an understandable/usable format.
Pandas Profiling (now ydata-profiling): This library generates an interactive HTML report for any Pandas DataFrame, providing insights into the data, such as statistics and correlations.
Dask: It enables users to handle larger datasets; for example, manipulating them even when those datasets don’t fit in memory.
Modin: This library improves Pandas’ performance using parallel computing, which is especially effective on large datasets where Pandas runs into memory limitations.
Pandas-Bokeh: A library that provides easy-to-use functionality for creating interactive visualizations with Pandas and Bokeh.

Community Benchmarks

36,800 Stars
15,700 Forks
2,800+ Code contributors
90+ Releases
Source: GitHub

Releases

Pandas 1.5.3 (1-2023). A patch release in the 1.5.x series that includes some regression and bug fixes.
Pandas 1.5.0 (9-19-2022). It includes some new features, bug fixes, and performance improvements.
Pandas 1.4.0 (1-22-2022). It includes some new features, bug fixes, and performance improvements.
Pandas 1.3.0 (7-2-2021). It includes some new features, bug fixes, and performance improvements.

References

GitHub
