Pandas
Pandas is a Python package that assists in data analysis and manipulation. It provides data structures and functions to work with structured data. It is based on the NumPy library and provides fast and efficient operations on arrays, data frames, and series.
This tool lets you clean, transform, visualize, and analyze large datasets with little code. It is widely used for data pre-processing and exploration in fields such as finance, economics, and statistics.
Pandas vs. NumPy
The Pandas library is primarily used for data analysis, while NumPy focuses on storing numerical values and executing mathematical computations. Both libraries simplify and speed up typical matrix operations, which is why they are widely used in Data Science, Research, and Machine Learning model development. The following table shows their main differences:
Pandas | NumPy |
Can hold heterogeneous data (a different type per column). | Contains homogeneous data. |
Uses a tabular format to perform operations. Supports up to two dimensions. | Built for numeric computing and n-dimensional arrays. |
Typically consumes more memory. | Typically consumes less memory. |
Pandas is built on top of NumPy. | Learning NumPy before Pandas could be advantageous. |
The key tools of Pandas are the DataFrame and Series. | The key tool of NumPy is the array (ndarray). |
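The first rows of the table can be illustrated with a minimal sketch. The variable names and sample values below are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array stores a single dtype for all of its elements.
arr = np.array([[1, 2], [3, 4]])
print(arr.dtype)  # one integer dtype for the whole array
print(arr.ndim)   # 2 here, but arrays can have any number of dimensions

# A DataFrame keeps one dtype per column, so columns can differ.
df = pd.DataFrame({"name": ["a", "b"], "value": [1.5, 2.5]})
print(df.dtypes)

# A numeric column can be handed back to NumPy when needed.
values = df["value"].to_numpy()
print(values.mean())  # 2.0
```

Because Pandas is built on NumPy, moving a column into an array with to_numpy() is a common way to combine the two libraries.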
Quick Installation Guide
In this very short tutorial, you’ll learn how to install a Python library with pip. Pandas is often already available in Jupyter Notebook environments, for example those installed through Anaconda. To verify and list all locally installed packages, use the following command:
pip list --format=columns
If the library is not already present in your Jupyter Notebook or any other shell environment, run “!pip install library_name” in a code cell or follow the appropriate installation procedure for the IDE you’re using.
!pip install pandas
Once you have installed it, use the following statement to import the library into your code:
import pandas as pd
where pd is its standard alias. You can use another alias, but pd is the most widely adopted and the one you will see in most existing code.
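To confirm that the installation and the import both succeeded, one quick check is to print the installed version. This is only a sanity check; the version string you see depends on your environment.

```python
import pandas as pd

# Print the installed version to confirm that the import works.
print(pd.__version__)
```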
Pandas Fundamentals
A DataFrame can be created from various data sources with the pd.DataFrame() constructor, whose most used arguments are data, index, columns, dtype, and copy.
import pandas as pd

# Create DataFrame from dictionary
data = {'Name': ['John', 'Jane', 'Jim', 'Joan'],
        'Age': [32, 29, 40, 35],
        'Country': ['USA', 'UK', 'Canada', 'Australia']}
df = pd.DataFrame(data)
print(df)
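The other constructor arguments mentioned above (index, columns, and dtype) can be sketched as follows. The sample values and the "Height" column are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration.
rows = [[32, 170.5], [29, 165.0], [40, 180.2]]

people = pd.DataFrame(
    data=rows,
    index=["John", "Jane", "Jim"],  # row labels
    columns=["Age", "Height"],      # column labels
    dtype="float64",                # force one dtype for every column
)
print(people)
print(people.loc["Jane", "Age"])  # 29.0
```

With explicit row labels, values can be looked up by name through .loc instead of by position.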
You can also create a DataFrame containing different types of data.
# DataFrame objects
DF1 = pd.DataFrame({
    'ID': [1.1, 2.1, 3.1, 4.1, 5.1, 6.1],
    'Dt1': [10, 20, 30, 40, 50, 60],
    'Dt2': [11, 22, 33, 44, 55, 66],
    'Dt3': ['Flame1', 'Flame2', 'Flame3', 'Flame4', 'Flame5', 'Flame6']
})
DF2 = pd.DataFrame({
    'ID': [1.1, 2.1, 3.1, 4.1, 5],
    'Dt3': [100, 200, 300, 400, 500],
    'Dt5': [101, 202, 303, 404, 505],
    'Dt6': ['Flame7', 'Flame8', 'Flame9', 'Flame10', 'Flame11']
})
Concatenating DataFrame Objects
Use the “pd.concat()” function to combine Series and DataFrame objects. For example, DF1 and DF2 can be joined vertically (along axis 0) as follows.
CombinedDF = pd.concat([DF1, DF2], axis=0)
CombinedDF

    ID   Dt1   Dt2     Dt3    Dt5      Dt6
0  1.1  10.0  11.0  Flame1    NaN      NaN
1  2.1  20.0  22.0  Flame2    NaN      NaN
2  3.1  30.0  33.0  Flame3    NaN      NaN
3  4.1  40.0  44.0  Flame4    NaN      NaN
4  5.1  50.0  55.0  Flame5    NaN      NaN
5  6.1  60.0  66.0  Flame6    NaN      NaN
0  1.1   NaN   NaN     100  101.0   Flame7
1  2.1   NaN   NaN     200  202.0   Flame8
2  3.1   NaN   NaN     300  303.0   Flame9
3  4.1   NaN   NaN     400  404.0  Flame10
4  5.0   NaN   NaN     500  505.0  Flame11
To combine them horizontally (along axis 1), pass axis=1:
CombinedDF = pd.concat([DF1, DF2], axis=1)
CombinedDF

    ID  Dt1  Dt2     Dt3   ID    Dt3    Dt5      Dt6
0  1.1   10   11  Flame1  1.1  100.0  101.0   Flame7
1  2.1   20   22  Flame2  2.1  200.0  202.0   Flame8
2  3.1   30   33  Flame3  3.1  300.0  303.0   Flame9
3  4.1   40   44  Flame4  4.1  400.0  404.0  Flame10
4  5.1   50   55  Flame5  5.0  500.0  505.0  Flame11
5  6.1   60   66  Flame6  NaN    NaN    NaN      NaN
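If the labels along the concatenation axis are not meaningful, you can discard them with ignore_index=True. With axis=1, this replaces the column names with the integers 0 to n-1, while the row index is left unchanged: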
CombinedDF = pd.concat([DF1, DF2], axis=1, ignore_index=True, sort=False)
CombinedDF

     0   1   2       3    4      5      6        7
0  1.1  10  11  Flame1  1.1  100.0  101.0   Flame7
1  2.1  20  22  Flame2  2.1  200.0  202.0   Flame8
2  3.1  30  33  Flame3  3.1  300.0  303.0   Flame9
3  4.1  40  44  Flame4  4.1  400.0  404.0  Flame10
4  5.1  50  55  Flame5  5.0  500.0  505.0  Flame11
5  6.1  60  66  Flame6  NaN    NaN    NaN      NaN
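The same ignore_index flag is also useful for vertical concatenation, where it renumbers the rows instead of the columns. The sketch below uses two small hypothetical frames (top and bottom) rather than DF1 and DF2, to keep the output short.

```python
import pandas as pd

# Two small hypothetical frames with the same columns.
top = pd.DataFrame({"ID": [1, 2], "Value": [10, 20]})
bottom = pd.DataFrame({"ID": [3, 4], "Value": [30, 40]})

# Default: each frame keeps its own row labels, so 0 and 1 repeat.
stacked = pd.concat([top, bottom], axis=0)
print(stacked.index.tolist())  # [0, 1, 0, 1]

# ignore_index=True replaces them with a fresh 0..n-1 range.
renumbered = pd.concat([top, bottom], axis=0, ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Renumbering avoids duplicate row labels, which can otherwise make label-based lookups return more than one row.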
Merging DataFrame Objects
Another way to combine data is the “pd.merge()” function. Once the reference column is given through the “on” argument, Pandas builds a new DataFrame; by default (an inner join), it keeps only the rows whose reference value appears in both DataFrames.
CombinedDF = pd.merge(DF1, DF2, on='ID')
CombinedDF

    ID  Dt1  Dt2   Dt3_x  Dt3_y  Dt5      Dt6
0  1.1   10   11  Flame1    100  101   Flame7
1  2.1   20   22  Flame2    200  202   Flame8
2  3.1   30   33  Flame3    300  303   Flame9
3  4.1   40   44  Flame4    400  404  Flame10
Note that DF1 and DF2 share the column name “Dt3”; the merged DataFrame distinguishes the two columns by appending the suffixes _x and _y. Additionally, the “how” argument specifies which DataFrame’s keys determine the rows of the resulting table: 'right' keeps every key of DF2, and 'left' keeps every key of DF1.
CombinedDF = pd.merge(DF1, DF2, how='right', on='ID')
CombinedDF

    ID   Dt1   Dt2   Dt3_x  Dt3_y  Dt5      Dt6
0  1.1  10.0  11.0  Flame1    100  101   Flame7
1  2.1  20.0  22.0  Flame2    200  202   Flame8
2  3.1  30.0  33.0  Flame3    300  303   Flame9
3  4.1  40.0  44.0  Flame4    400  404  Flame10
4  5.0   NaN   NaN     NaN    500  505  Flame11

CombinedDF = pd.merge(DF1, DF2, how='left', on='ID')
CombinedDF

    ID  Dt1  Dt2   Dt3_x  Dt3_y    Dt5      Dt6
0  1.1   10   11  Flame1  100.0  101.0   Flame7
1  2.1   20   22  Flame2  200.0  202.0   Flame8
2  3.1   30   33  Flame3  300.0  303.0   Flame9
3  4.1   40   44  Flame4  400.0  404.0  Flame10
4  5.1   50   55  Flame5    NaN    NaN      NaN
5  6.1   60   66  Flame6    NaN    NaN      NaN
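Beyond 'left' and 'right', pd.merge() also accepts how='outer', which keeps the keys of both DataFrames. The sketch below combines it with the suffixes and indicator arguments; the frames left and right and their values are hypothetical and chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical frames that share the "ID" key and a "Score" column.
left = pd.DataFrame({"ID": [1, 2, 3], "Score": [10, 20, 30]})
right = pd.DataFrame({"ID": [2, 3, 4], "Score": [200, 300, 400]})

merged = pd.merge(
    left,
    right,
    on="ID",
    how="outer",                   # keep keys found in either frame
    suffixes=("_left", "_right"),  # replace the default _x / _y
    indicator=True,                # add a _merge column with each row's origin
)
print(merged)
```

The _merge column reports 'left_only', 'right_only', or 'both' for each row, which can help when checking why a merge produced missing values.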
Visualization Options
Displaying a table in full helps when inspecting a dataset, but a condensed view is often preferable for large DataFrames. Pandas provides the options “display.max_rows” and “display.max_columns”, which set the maximum number of rows and columns displayed, respectively.
pd.set_option("display.max_rows", 3)
DF1

      ID  Dt1  Dt2     Dt3
0    1.1   10   11  Flame1
...  ...  ...  ...     ...
5    6.1   60   66  Flame6

6 rows × 4 columns
In addition, the option “display.expand_frame_repr” controls whether the representation of a wide DataFrame wraps across multiple “pages”; setting it to False keeps each row on a single line.
import pandas as pd
import numpy as np

DF3 = pd.DataFrame(np.random.randn(5, 10)).round(decimals=2)
pd.set_option("display.max_rows", 3)
pd.set_option("display.expand_frame_repr", False)
DF3

         0      1      2      3      4      5     6      7     8      9
0     0.01  -0.04  -1.28  -0.35  -1.23  -0.33  0.77  -2.19  1.65  -0.37
...    ...    ...    ...    ...    ...    ...   ...    ...   ...    ...
4    -0.84  -1.03  -1.34  -0.39   1.68  -1.08  2.40  -1.76  0.11  -0.47

5 rows × 10 columns
Use “pd.reset_option("display.max_rows")” to restore an option to its default value.
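When a display setting is only needed for one statement, pd.option_context() offers an alternative to setting and resetting options by hand. The following is a minimal sketch; the frame and its column "n" are hypothetical.

```python
import pandas as pd

# Hypothetical frame with more rows than the temporary limit below.
frame = pd.DataFrame({"n": range(20)})

previous = pd.get_option("display.max_rows")

# Inside the block the limit is 4 rows; the old value returns afterwards.
with pd.option_context("display.max_rows", 4):
    inside = pd.get_option("display.max_rows")
    print(frame)

print(pd.get_option("display.max_rows") == previous)  # True
```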
The “info()” method prints a summary of a DataFrame, including the total number of entries, the number of non-null values and the data type of each column, the memory usage, and information about the index. This method is advantageous when working with large datasets, as it provides a quick and easy way to understand their structure and content.
CombinedDF = pd.merge(DF1, DF2, how='left', on='ID')
# info() prints its report directly, so wrapping it in print() is unnecessary
CombinedDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   ID      6 non-null      float64
 1   Dt1     6 non-null      int64
 2   Dt2     6 non-null      int64
 3   Dt3_x   6 non-null      object
 4   Dt3_y   4 non-null      float64
 5   Dt5     4 non-null      float64
 6   Dt6     4 non-null      object
dtypes: float64(3), int64(2), object(2)
memory usage: 384.0+ bytes
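Note that info() writes its report to standard output and returns None. If the summary is needed as a string, for instance for logging, one option is the buf argument, sketched here on a small hypothetical frame.

```python
import io

import pandas as pd

# Hypothetical frame with one missing value.
frame = pd.DataFrame({"ID": [1.1, 2.1], "Label": ["a", None]})

result = frame.info()   # prints the summary
print(result is None)   # True

# Redirect the report into a text buffer instead of printing it.
buffer = io.StringIO()
frame.info(buf=buffer)
summary = buffer.getvalue()
print(summary.splitlines()[0])
```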
Null values
Null values occur when entries are missing from a dataset. These missing values are typically represented as NaN. Pandas offers several useful functions for identifying, removing, and replacing null values in a DataFrame, including:
pd.isnull(): It returns a Boolean object of the same shape as its input, with True wherever a value is null.
CombinedDF = pd.merge(DF1, DF2, how='left', on='ID')
pd.isnull(CombinedDF)

      ID    Dt1    Dt2  Dt3_x  Dt3_y    Dt5    Dt6
0  False  False  False  False  False  False  False
1  False  False  False  False  False  False  False
2  False  False  False  False  False  False  False
3  False  False  False  False  False  False  False
4  False  False  False  False   True   True   True
5  False  False  False  False   True   True   True
pd.notnull(): It is the inverse of pd.isnull(), returning False wherever a value is null and True elsewhere.
CombinedDF = pd.merge(DF1, DF2, how='left', on='ID')
pd.notnull(CombinedDF)

     ID   Dt1   Dt2  Dt3_x  Dt3_y    Dt5    Dt6
0  True  True  True   True   True   True   True
1  True  True  True   True   True   True   True
2  True  True  True   True   True   True   True
3  True  True  True   True   True   True   True
4  True  True  True   True  False  False  False
5  True  True  True   True  False  False  False
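Because these functions return one Boolean per cell, the result is usually reduced to something smaller. A minimal sketch on a hypothetical frame, counting the nulls per column and flagging the rows that contain any:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
frame = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": ["x", "y", None]})

mask = frame.isnull()

# True counts as 1, so summing the mask counts the nulls per column.
missing_per_column = mask.sum()
print(missing_per_column)

# any(axis=1) flags each row that has at least one null value.
rows_with_missing = mask.any(axis=1)
print(rows_with_missing.tolist())  # [False, True, True]
```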
.dropna(): It drops the rows (the default) or the columns that contain null values.
CombinedDF = pd.merge(DF1, DF2, how='left', on='ID')
CombinedDF.dropna()

    ID  Dt1  Dt2   Dt3_x  Dt3_y    Dt5      Dt6
0  1.1   10   11  Flame1  100.0  101.0   Flame7
1  2.1   20   22  Flame2  200.0  202.0   Flame8
2  3.1   30   33  Flame3  300.0  303.0   Flame9
3  4.1   40   44  Flame4  400.0  404.0  Flame10
.fillna(): It replaces NaN values with some other value defined by the user.
CombinedDF = pd.merge(DF1, DF2, how='left', on='ID')
CombinedDF.fillna(3)

    ID  Dt1  Dt2   Dt3_x  Dt3_y    Dt5      Dt6
0  1.1   10   11  Flame1  100.0  101.0   Flame7
1  2.1   20   22  Flame2  200.0  202.0   Flame8
2  3.1   30   33  Flame3  300.0  303.0   Flame9
3  4.1   40   44  Flame4  400.0  404.0  Flame10
4  5.1   50   55  Flame5    3.0    3.0        3
5  6.1   60   66  Flame6    3.0    3.0        3
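Both methods accept arguments that narrow their effect. The sketch below shows a few of them (subset and axis for dropna, a per-column dictionary for fillna) on a hypothetical frame; the column names and fill values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing prices and labels.
frame = pd.DataFrame({
    "ID": [1, 2, 3],
    "Price": [9.5, np.nan, 7.0],
    "Label": ["a", None, None],
})

# Drop a row only when its "Price" is missing.
priced = frame.dropna(subset=["Price"])

# Drop every column that contains at least one missing value.
complete_columns = frame.dropna(axis=1)

# Fill each column with its own replacement value.
filled = frame.fillna({"Price": 0.0, "Label": "unknown"})

print(priced)
print(complete_columns)
print(filled)
```

Filling per column avoids the mixed result seen with a single fill value, where a number ends up in a text column.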
Highlights
Project Background
- Project: Pandas
- Author: Wes McKinney
- Initial Release: 2008
- Type: Technical Computing
- License: New BSD License
- Contains: DataFrame objects, data filtering, handling of missing data, time-series functionality, and hierarchical axis indexing
- Language: Python, Cython, C
- GitHub: /pandas-dev/pandas
- Runs On: Windows, macOS, and Linux
- Twitter: /pandas_dev
Main Features
- It provides two main data structures, Series (1-dimensional) and DataFrame (2-dimensional), for efficiently storing and manipulating data.
- It lets users handle missing data, duplicates, and inconsistent data formats.
- It provides functions for reshaping, merging, and transforming data, for aggregating and grouping data, and for computing descriptive statistics, performing data analysis, and visualizing data.
- It makes it easy to read and write data from various file formats, including CSV, Excel, and SQL databases.
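A few of the features listed above (grouping, descriptive statistics, and reading and writing CSV) can be sketched together. The sales frame is hypothetical, and the CSV is written to an in-memory string rather than a file so that the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical sales data, used only for illustration.
sales = pd.DataFrame({
    "Country": ["USA", "UK", "USA", "UK"],
    "Amount": [100, 80, 50, 20],
})

# Group rows by country and aggregate the amounts.
totals = sales.groupby("Country")["Amount"].sum()
print(totals)

# Descriptive statistics (count, mean, std, min, quartiles, max).
print(sales["Amount"].describe())

# Write to CSV text and read it back.
csv_text = sales.to_csv(index=False)
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```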
Prior Knowledge Requirements
- Understanding of basic programming concepts such as data structures, loops, functions, and methods.
- Familiarity with the Python language itself and its syntax.
- Familiarity with the NumPy library and the use of arrays in data analysis is also helpful.
Projects and Libraries
- GeoPandas: A project that extends the functionality of Pandas by adding support for geographic data manipulation and analysis.
- Pyjanitor: It extends the functionality of Pandas with a simple way to clean messy datasets, transforming raw data into an understandable/usable format.
- Pandas Profiling (later renamed ydata-profiling): This library generates an interactive HTML report for any Pandas DataFrame, providing insights into the data, such as statistics and correlations.
- Dask: It enables users to handle larger datasets; for example, manipulating them even when those datasets don’t fit in memory.
- Modin: This library improves Pandas’ performance using parallel computing, which is especially effective on large datasets where Pandas runs into memory limitations.
- Pandas-Bokeh: A library that provides easy-to-use functionality for creating interactive visualizations with Pandas and Bokeh.
Community Benchmarks
- 36,800 Stars
- 15,700 Forks
- 2,800+ Code contributors
- 90+ Releases
- Source: GitHub
Releases
- Pandas 1.5.3 (1-2023). A patch release in the 1.5.x series that includes some regression and bug fixes.
- Pandas 1.5.0 (9-19-2022). It includes some new features, bug fixes, and performance improvements.
- Pandas 1.4.0 (1-22-2022). It includes some new features, bug fixes, and performance improvements.
- Pandas 1.3.0 (7-2-2021). It includes some new features, bug fixes, and performance improvements.