Pandas is one of the most popular open-source libraries for data analysis in Python. The library provides simple but powerful data structures that make it easy to manipulate data. Here is a quick summary of what makes Pandas great:

Pandas has over 200 methods and properties to interact with your data. For instance, you can remove rows with missing values by calling the method dropna(~).
Pandas is built on top of NumPy, whose array operations are highly efficient. As a result, computations on large datasets are typically much faster and more memory-efficient than equivalent code written with standard Python lists.
Pandas works well with other data-related libraries in Python such as NumPy, Matplotlib and Scikit-learn. For instance, you could perform data preprocessing using Pandas, and then use the processed data to train a machine learning model in Scikit-learn.
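
As a rough illustration of the dropna(~) method mentioned above, here is a minimal sketch. The DataFrame df_missing and its values are made up for this example and are not part of the guide's data:

```python
import pandas as pd

# Hypothetical data for illustration; None marks a missing value.
df_missing = pd.DataFrame({
    "Name": ["Alex", "Cathy", "Bob"],
    "Age": [16, None, 17],
})

# dropna() returns a new DataFrame without the rows that contain missing values.
# The original DataFrame is left unchanged.
df_clean = df_missing.dropna()
print(df_clean)
```

Here, only the row for Cathy is dropped, since it is the only row with a missing value.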

Data Structure

DataFrame

In Pandas, the primary data structure is the DataFrame. You can think of a DataFrame as a standard table that is composed of rows and columns. For example, a DataFrame can be used to represent the following table that has 3 rows and 3 columns:

	Name	Age	Class
0	Alex	16	A
1	Cathy	17	B
2	Bob	17	A

To create a DataFrame representing this table in Pandas, import Pandas and use the DataFrame constructor:

import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex","Cathy","Bob"],
    "Age": [16,17,17],
    "Class": ["A","B","A"]
})

df

    Name  Age Class
0   Alex   16     A
1  Cathy   17     B
2    Bob   17     A

As you can see, DataFrames can store a diverse range of data types such as strings, numbers, categories, arrays and so on.
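
To see which data type Pandas has assigned to each column, one option is the DataFrame's dtypes property. The sketch below rebuilds the same df so that it runs on its own. No expected output is shown because the label printed for text columns may differ between Pandas versions:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex", "Cathy", "Bob"],
    "Age": [16, 17, 17],
    "Class": ["A", "B", "A"]
})

# dtypes is a Series that maps each column label to that column's data type.
column_types = df.dtypes
print(column_types)
```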


Each row and column of the DataFrame is represented using a data structure known as a Series. You can think of a Series as a combination of an array and a dictionary, in which you can access data using either integer positions or labels.

To access a particular column of a DataFrame, use the [] notation with the column label like so:

df["Name"]

0     Alex
1    Cathy
2      Bob
Name: Name, dtype: object

Here, we are accessing the Name column and the return type is Series. Because the default row labels are the integers 0, 1, 2, you can access individual values in this Series using integers, just as you would for standard arrays:

col_name = df["Name"]   # col_name is a Series
col_name[1]

'Cathy'

Series

We have said that a Series can be regarded as a hybrid of arrays and dictionaries. Leaving our DataFrame behind, let us explore Series as a data structure in a bit more depth.

We can create a Series from a list like so:

data = pd.Series([5,9,3])
data

0    5
1    9
2    3
dtype: int64

Here, the output is somewhat confusing - the first column (0,1,2) is actually just the labels attached to the values. By default, the labels are sequential integers (0,1,2, ...). As stated earlier, we can access individual values using integer notation. For example, data[1] returns 9.
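
The integer lookup described above can be sketched as follows. The variable name value is introduced here only for illustration:

```python
import pandas as pd

data = pd.Series([5, 9, 3])   # default labels are 0, 1, 2

# With default labels, data[1] looks up the value whose label is 1.
value = data[1]
print(value)
```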

We can also specify the labels using the index argument:

data = pd.Series([5,9,3], index=["a","b","c"])
data

a    5
b    9
c    3
dtype: int64

We can refer to the value with label "b" using [] notation. For example, data["b"] returns 9.
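
A minimal sketch of this label-based lookup, which works much like reading a value from a dictionary by its key (the variable name value is only illustrative):

```python
import pandas as pd

data = pd.Series([5, 9, 3], index=["a", "b", "c"])

# Look up the value attached to the label "b".
value = data["b"]
print(value)
```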

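Note that we can still refer to values by their integer position, just as we do for standard arrays, by using the Series' iloc property. For example, data.iloc[1] returns 9. Older versions of Pandas also accept data[1] here, but treating an integer as a position when the labels are not integers is deprecated.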
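
Positional access on a labelled Series can be sketched like this. The example uses iloc rather than plain [] notation, on the assumption that you want code that behaves the same way across Pandas versions:

```python
import pandas as pd

data = pd.Series([5, 9, 3], index=["a", "b", "c"])

# iloc always interprets the integer as a position (0-based), not as a label.
second_value = data.iloc[1]
print(second_value)
```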

Computing basic statistics

Pandas comes with over 200 methods and properties that allow you to easily manipulate data. In this section, we will introduce some basic techniques for computing statistics.

Consider the following DataFrame:

df = pd.DataFrame({"A":[3,5,7],"B":[1,2,3]})
df

   A  B
0  3  1
1  5  2
2  7  3

To compute the descriptive statistics of each column:

df.describe()

         A    B
count  3.0  3.0
mean   5.0  2.0
std    2.0  1.0
min    3.0  1.0
25%    4.0  1.5
50%    5.0  2.0
75%    6.0  2.5
max    7.0  3.0

Here, we are using the DataFrame's describe(~) method. Most of the methods that are available for DataFrames are also available for Series. For instance, we can compute the descriptive statistics of a single column (Series) like so:

df["A"].describe()

count    3.0
mean     5.0
std      2.0
min      3.0
25%      4.0
50%      5.0
75%      6.0
max      7.0
Name: A, dtype: float64

There are many other methods for computing basic statistics such as max(~) and mean(~).
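
As a short sketch of these two methods, the example below calls max(~) and mean(~) on the same df, as well as on a single column. The variable names are introduced only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]})

# Called on a DataFrame, these methods return one value per column as a Series.
col_max = df.max()
col_mean = df.mean()

# Called on a Series (a single column), they return a single value.
a_max = df["A"].max()

print(col_max)
print(col_mean)
print(a_max)
```

For this DataFrame, the maximum values are 7 for column A and 3 for column B, and the means are 5.0 and 2.0, which matches the output of describe(~) above.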

Selecting a subset of the DataFrame

To select a subset of the DataFrame, use one of the following:

[] notation
loc property
iloc property
query(~) method

Consider the following DataFrame:

df = pd.DataFrame({"A":[3,5,7],"B":[1,2,3]}, index=["a","b","c"])
df

   A  B
a  3  1
b  5  2
c  7  3

Square bracket notation

To get column B:

df["B"]   # returns a Series

a    1
b    2
c    3
Name: B, dtype: int64

To get columns A and B, pass in a list of column labels like so:

df[["A","B"]]   # returns a DataFrame

   A  B
a  3  1
b  5  2
c  7  3

loc property

Pandas' DataFrame.loc is used to access or update values of the DataFrame using row and column labels. Note that loc is a property and not a function - we provide the parameters using [] notation.

We show df here again for your reference:

   A  B
a  3  1
b  5  2
c  7  3

To access the value at row b and column B using row and column labels:

df.loc["b","B"]

2

To access row b:

df.loc["b"]   # returns a Series

A    5
B    2
Name: b, dtype: int64

To access column B:

df.loc[:,"B"]   # returns a Series

a    1
b    2
c    3
Name: B, dtype: int64

Here, the : before the comma indicates that we want to retrieve all rows. The "B" after the comma then indicates that we just want to fetch column B.
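
Since loc can update values as well as access them, here is a minimal sketch of an update, together with two other common selection patterns. The replacement value 20 and the variable names are arbitrary choices made for this example:

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]}, index=["a", "b", "c"])

# Update the value at row b, column B in place.
df.loc["b", "B"] = 20

# Select column A for rows a and c by passing a list of row labels.
subset = df.loc[["a", "c"], "A"]

# Slice rows by label; unlike Python list slicing, both endpoints are included.
sliced = df.loc["a":"b"]

print(df)
print(subset)
print(sliced)
```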

NOTE

loc is an extremely powerful property that allows you to perform granular filtering of data.

iloc property

Pandas' DataFrame.iloc is used to access or update specific rows/columns of the DataFrame using integer indices.

We show df here again for your reference:

   A  B
a  3  1
b  5  2
c  7  3

To access the second row:

df.iloc[1]

A    5
B    2
Name: b, dtype: int64

To access the second column:

df.iloc[:,1]

a    1
b    2
c    3
Name: B, dtype: int64

Here, the : before the comma means that we want to fetch all rows. The 1 after the comma means that we want to fetch the second column, that is, the column at integer index 1 (integer indices start at 0).
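
Like loc, iloc can also be used to update values and to select ranges. The following is a minimal sketch. The replacement value 10 and the variable names are arbitrary choices made for this example:

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]}, index=["a", "b", "c"])

# Update the value at the first row (position 0) and second column (position 1).
df.iloc[0, 1] = 10

# Slice rows by position; as with Python lists, the end position is excluded.
first_two = df.iloc[0:2]

print(df)
print(first_two)
```

Note that positional slices exclude the end position, so 0:2 selects the rows at positions 0 and 1 only.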

NOTE

iloc is an extremely powerful property that allows you to perform granular filtering of data.

query method

Pandas' DataFrame.query(~) method filters rows according to the provided boolean expression.

We show df here again for your reference:

   A  B
a  3  1
b  5  2
c  7  3

To get all rows where the value for column A is 3:

df.query("A == 3")   # returns a DataFrame

   A  B
a  3  1

To get all rows where the value for column A is greater than 2, and the value for column B is not equal to 2:

df.query("A > 2 and B != 2")

   A  B
a  3  1
c  7  3
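
For comparison, the same filter can also be written with [] notation and a boolean mask, which is a common alternative to query(~). This is only a sketch of one equivalent approach, and the variable names are introduced for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]}, index=["a", "b", "c"])

# Filter rows using a query string.
by_query = df.query("A > 2 and B != 2")

# Same filter using a boolean mask; each condition needs its own parentheses,
# and & is used instead of the keyword "and".
by_mask = df[(df["A"] > 2) & (df["B"] != 2)]

print(by_query)
print(by_mask)
```

Both approaches return rows a and c. The query string tends to be easier to read when several conditions are combined.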

NOTE

The DataFrame's query(~) is an extremely powerful method that allows you to perform granular querying of data.

Published by Isshin Inada
