Getting Started with Pandas
What is Pandas
Pandas is the most popular open-source library for data analysis in Python. It provides simple but powerful data structures that make it easy to manipulate data. Here is a quick summary of what makes Pandas great:
Pandas has over 200 methods and properties to interact with your data. For instance, you could easily remove rows with missing values by calling the method dropna(~).
Pandas is far more performant than standard Python lists in terms of speed and memory efficiency. Pandas is built on top of NumPy, which has highly efficient array operations. This means that computations on large datasets are very fast.
Pandas synergises with other data-related libraries in Python such as NumPy, Matplotlib and Scikit-learn. For instance, you could perform data preprocessing using Pandas, and use the processed data to train a machine learning model in Scikit-learn.
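As a small illustration of the dropna(~) method mentioned above, here is a minimal sketch. The DataFrame and its values are hypothetical, made up only to show a row with a missing value being removed:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row has a missing Age value.
df = pd.DataFrame({"Name": ["Alex", "Cathy", "Bob"], "Age": [16, np.nan, 17]})

# dropna() returns a new DataFrame without the rows that contain missing values.
cleaned = df.dropna()
print(cleaned)
```

Note that dropna(~) does not modify df by default; the cleaned result is a separate DataFrame.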
Data Structure
DataFrame
In Pandas, the primary data structure used is called DataFrame. You can think of a DataFrame as a standard table that is composed of rows and columns. For example, DataFrames can be used to represent the following table that has 3 rows and 3 columns:
|   | Name | Age | Class |
|---|---|---|---|
| 0 | Alex | 16 | A |
| 1 | Cathy | 17 | B |
| 2 | Bob | 17 | A |
To create a DataFrame representing this table in Pandas, use the DataFrame constructor:
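A minimal sketch of such a call, assuming Pandas is imported under the conventional alias pd and the table is passed as a dictionary of lists (one of several ways to build a DataFrame):

```python
import pandas as pd

# Each dictionary key becomes a column label; each list holds that column's values.
df = pd.DataFrame({
    "Name": ["Alex", "Cathy", "Bob"],
    "Age": [16, 17, 17],
    "Class": ["A", "B", "A"],
})
print(df)
```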
    Name  Age Class
0   Alex   16     A
1  Cathy   17     B
2    Bob   17     A
As you can see, DataFrames can store a diverse range of data types such as strings, numbers, categories, arrays and so on.
Each row and column of the DataFrame is represented using a data structure known as Series. You can think of a Series as a combination of an array and a dictionary in which you can access data using integer index and labels.
To access a particular column of a DataFrame, use the [] notation with the column label like so:
df["Name"]
0     Alex
1    Cathy
2      Bob
Name: Name, dtype: object
Here, we are accessing the Name column and the return type is Series. You can access individual values in a Series using integer indices, just as you would for standard arrays:
col_name = df["Name"]  # col_name is a Series
col_name[1]
'Cathy'
Series
We have said that a Series can be regarded as a hybrid of arrays and dictionaries. Leaving our DataFrame behind, let us explore Series as a data structure in a bit more depth.
We can create a Series from a list like so:
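A minimal sketch of this step, assuming Pandas is imported as pd and the list holds the three values that appear in the output shown in this section:

```python
import pandas as pd

# Assumed values, chosen to match the Series printed in this section.
data = pd.Series([5, 9, 3])
```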
data
0    5
1    9
2    3
dtype: int64
Here, the output is somewhat confusing - the first column (0, 1, 2) is actually just the labels attached to the values. By default, the labels are sequential integers (0, 1, 2, ...). As stated earlier, we can access individual values using integer notation:
data[1]
9
We can also specify the labels using the index argument:
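A minimal sketch, assuming the same three values as before and the labels a, b and c that appear in the output shown in this section:

```python
import pandas as pd

# The index argument supplies one label per value.
data = pd.Series([5, 9, 3], index=["a", "b", "c"])
```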
data
a    5
b    9
c    3
dtype: int64
We can refer to the value with label "b" like so:
data["b"]
9
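Note that we can still refer to values by integer position, just as we do for standard arrays. Use the iloc property for this: in recent versions of Pandas, passing an integer to [] on a Series with non-integer labels is deprecated, because integers inside [] are treated as labels rather than positions:
data.iloc[1]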
9
Computing basic statistics
Pandas comes with over 200 methods and properties that allow you to easily manipulate data. In this section, we will introduce some basic manipulation techniques.
Consider the following DataFrame:
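One way to build it, as a minimal sketch that assumes a dictionary of lists and the default integer row labels:

```python
import pandas as pd

# Assumed construction; the values match the table printed in this section.
df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]})
```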
df
   A  B
0  3  1
1  5  2
2  7  3
To compute the descriptive statistics of each column, call df.describe():
         A    B
count  3.0  3.0
mean   5.0  2.0
std    2.0  1.0
min    3.0  1.0
25%    4.0  1.5
50%    5.0  2.0
75%    6.0  2.5
max    7.0  3.0
Here, we are using the DataFrame's describe(~) method. Most of the methods that are available for DataFrames are also available for Series. For instance, to compute the descriptive statistics of a single column (a Series), call describe(~) on that column, as in df["A"].describe():
count    3.0
mean     5.0
std      2.0
min      3.0
25%      4.0
50%      5.0
75%      6.0
max      7.0
Name: A, dtype: float64
There are many other methods for computing basic statistics such as max(~) and mean(~).
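The calls discussed in this section can be sketched together as follows. The DataFrame construction is an assumption chosen to match the values printed above, and the variable names are for illustration only:

```python
import pandas as pd

# Assumed construction; the values match the table printed in this section.
df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]})

summary = df.describe()            # DataFrame of summary statistics, one column per input column
col_summary = df["A"].describe()   # Series of summary statistics for column A only
col_max = df.max()                 # Series holding the maximum of each column
col_mean = df.mean()               # Series holding the mean of each column

print(summary)
print(col_summary)
print(col_max)
print(col_mean)
```

Calling max(~) or mean(~) on a single column, for example df["A"].max(), returns a single value instead of a Series.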
Selecting a subset of the DataFrame
To select a subset of the DataFrame, use one of the following: square bracket notation, the loc property, the iloc property, or the query(~) method.
Consider the following DataFrame:
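A minimal sketch of how it could be built, assuming a dictionary of lists and an index argument that supplies the row labels a, b and c:

```python
import pandas as pd

# Assumed construction; index sets the row labels used throughout this section.
df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]}, index=["a", "b", "c"])
```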
df
   A  B
a  3  1
b  5  2
c  7  3
Square bracket notation
To get column B:
df["B"] # returns a Series
a    1
b    2
c    3
Name: B, dtype: int64
To get columns A and B, pass in a list of column labels like so:
df[["A","B"]] # returns a DataFrame
   A  B
a  3  1
b  5  2
c  7  3
loc property
Pandas' DataFrame.loc is used to access or update values of the DataFrame using row and column labels. Note that loc is a property and not a function - we provide the parameters using [] notation.
We show df here again for your reference:
df
   A  B
a  3  1
b  5  2
c  7  3
To access the value at row b and column B using row and column labels, use df.loc["b", "B"]:
2
To access row b, use df.loc["b"]:
A    5
B    2
Name: b, dtype: int64
To access column B, use df.loc[:, "B"]:
a    1
b    2
c    3
Name: B, dtype: int64
Here, the : before the comma indicates that we want to retrieve all rows. The "B" after the comma then indicates that we just want to fetch column B.
loc is an extremely powerful property that allows you to perform granular filtering of data.
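The label-based accesses described in this section can be sketched in one runnable snippet. The DataFrame construction and the variable names are assumptions for illustration; the update at the end is done on a copy so that df itself stays unchanged:

```python
import pandas as pd

# Assumed construction matching the table shown in this section.
df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]}, index=["a", "b", "c"])

value = df.loc["b", "B"]   # single value at row label "b", column label "B"
row_b = df.loc["b"]        # row "b" as a Series
col_b = df.loc[:, "B"]     # all rows of column "B" as a Series

# loc can also be used to update values; here we update a copy.
updated = df.copy()
updated.loc["b", "B"] = 20

print(value)
print(row_b)
print(col_b)
print(updated)
```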
iloc property
Pandas' DataFrame.iloc is used to access or update specific rows/columns of the DataFrame using integer indices.
We show df here again for your reference:
df
   A  B
a  3  1
b  5  2
c  7  3
To access the second row, use df.iloc[1]:
A    5
B    2
Name: b, dtype: int64
To access the second column, use df.iloc[:, 1]:
a    1
b    2
c    3
Name: B, dtype: int64
Here, the : before the comma means that we want to fetch all rows. The 1 after the comma means that we want to fetch the second column (the column at integer index 1, since integer indices start at 0).
iloc is an extremely powerful property that allows you to perform granular filtering of data.
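The position-based accesses described in this section can be sketched as follows. The DataFrame construction and the variable names are assumptions for illustration:

```python
import pandas as pd

# Assumed construction matching the table shown in this section.
df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]}, index=["a", "b", "c"])

second_row = df.iloc[1]     # row at integer position 1 (label "b") as a Series
second_col = df.iloc[:, 1]  # all rows of the column at integer position 1 (label "B")
value = df.iloc[1, 1]       # single value at row position 1, column position 1

print(second_row)
print(second_col)
print(value)
```

Unlike loc, iloc ignores the row and column labels entirely and works only with positions.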
query method
Pandas' DataFrame.query(~) method filters rows according to the provided boolean expression.
We show df here again for your reference:
df
   A  B
a  3  1
b  5  2
c  7  3
To get all rows where the value for column A is 3, use df.query("A == 3"):
   A  B
a  3  1
To get all rows where the value for column A is greater than 2 and the value for column B is not equal to 2, use df.query("A > 2 and B != 2"):
   A  B
a  3  1
c  7  3
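Both filters can be sketched in one runnable snippet. The DataFrame construction and the variable names are assumptions for illustration; the expression is passed to query(~) as a string in which column labels are used like variable names:

```python
import pandas as pd

# Assumed construction matching the table shown in this section.
df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]}, index=["a", "b", "c"])

# Rows where column A equals 3.
a_is_3 = df.query("A == 3")

# Rows where column A is greater than 2 and column B is not equal to 2.
filtered = df.query("A > 2 and B != 2")

print(a_is_3)
print(filtered)
```

query(~) returns a new DataFrame and leaves df unchanged.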