Getting Started with Pandas

Last updated: Aug 11, 2023

What is Pandas

Pandas is the most popular open-source library for data analysis in Python. The library provides simple but powerful data structures that make it easy to manipulate data. Here is a quick summary of what makes Pandas great:

  • Pandas has over 200 methods and properties to interact with your data. For instance, you could easily remove rows with missing values by calling the method dropna(~).

  • Pandas is typically faster and more memory-efficient than standard Python lists. Pandas is built on top of NumPy, which provides highly efficient array operations, so computations on large datasets are fast.

  • Pandas synergises with other data-related libraries in Python such as NumPy, Matplotlib and Scikit-learn. For instance, you could perform data preprocessing using Pandas, and use the processed data to train a machine learning model in Scikit-learn.
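As a small illustration of the first point, the sketch below removes rows with missing values using dropna(~). The DataFrame and the variable names df_missing and df_clean are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the "Age" of Cathy is missing (NaN)
df_missing = pd.DataFrame({
    "Name": ["Alex", "Cathy", "Bob"],
    "Age": [16, np.nan, 17],
})

# dropna() returns a new DataFrame without the rows that contain missing values;
# the original DataFrame is left unchanged
df_clean = df_missing.dropna()
print(df_clean)
```

Because dropna(~) returns a new DataFrame by default, df_missing still holds all three rows afterwards.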

Data Structure

DataFrame

In Pandas, the primary data structure used is called DataFrame. You can think of a DataFrame as a standard table that is composed of rows and columns. For example, DataFrames can be used to represent the following table that has 3 rows and 3 columns:

   Name   Age  Class
0  Alex   16   A
1  Cathy  17   B
2  Bob    17   A

To create a DataFrame representing this table in Pandas, use the DataFrame constructor:

import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex", "Cathy", "Bob"],
    "Age": [16, 17, 17],
    "Class": ["A", "B", "A"]
})

df
    Name  Age Class
0   Alex   16     A
1  Cathy   17     B
2    Bob   17     A

As you can see, DataFrames can store a diverse range of data types such as strings, numbers, categories, arrays and so on.

NOTE

To learn more about the different ways in which you can create a DataFrame, click here.

Each row and column of the DataFrame is represented using a data structure known as a Series. You can think of a Series as a combination of an array and a dictionary, in which you can access data using integer positions as well as labels.

To access a particular column of a DataFrame, use the [] notation with the column label like so:

df["Name"]
0 Alex
1 Cathy
2 Bob
Name: Name, dtype: object

Here, we are accessing the Name column and the return type is Series. You can access individual values in a Series using integer indices, just as you would for standard arrays:

col_name = df["Name"] # col_name is a Series
col_name[1]
'Cathy'
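To make the relationship between a DataFrame and a Series more concrete, here is a minimal sketch that pulls out one column and one row of the table above and checks their types. The variable names column and row are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex", "Cathy", "Bob"],
    "Age": [16, 17, 17],
    "Class": ["A", "B", "A"],
})

column = df["Age"]   # a single column is a Series labelled by the row index
row = df.iloc[0]     # a single row is a Series labelled by the column names

print(type(column).__name__)   # Series
print(type(row).__name__)      # Series
print(row["Name"])             # Alex
```

Note that the labels of a row Series are the column names of the DataFrame, which is why row["Name"] works here.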

Series

We have said that a Series can be regarded as a hybrid of arrays and dictionaries. Leaving our DataFrame behind, let us explore Series as a data structure in a bit more depth.

We can create a Series from a list like so:

data = pd.Series([5,9,3])
data
0 5
1 9
2 3
dtype: int64

Here, the output is somewhat confusing - the first column (0,1,2) is actually just the labels attached to the values. By default, the labels are sequential integers (0,1,2, ...). As stated earlier, we can access individual values using integer notation:

data[1]
9

We can also specify the labels using the index argument:

data = pd.Series([5,9,3], index=["a","b","c"])
data
a 5
b 9
c 3
dtype: int64

We can refer to the value with label "b" like so:

data["b"]
9

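Note that we can still refer to values by integer position, just as we do for standard arrays. Since recent versions of Pandas deprecate data[1] for a Series with non-integer labels, use the iloc property for positional access:

data.iloc[1]
9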
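Square brackets on a Series can be ambiguous when the labels are themselves integers. The sketch below uses a made-up Series with labels 10, 20 and 30 to show the explicit alternatives: loc always looks up by label, while iloc always looks up by position.

```python
import pandas as pd

# Hypothetical Series whose labels are integers that differ from the positions
s = pd.Series([5, 9, 3], index=[10, 20, 30])

by_label = s.loc[20]      # value whose label is 20
by_position = s.iloc[0]   # value at the first position

print(by_label)      # 9
print(by_position)   # 5

# The dictionary-like side of a Series: labels map to values
print(s.to_dict())   # {10: 5, 20: 9, 30: 3}
```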

Computing basic statistics

Pandas comes with over 200 methods and properties that allow you to easily manipulate data. In this section, we will introduce some basic manipulation techniques.

Consider the following DataFrame:

df = pd.DataFrame({"A":[3,5,7],"B":[1,2,3]})
df
A B
0 3 1
1 5 2
2 7 3

To compute the descriptive statistics of each column:

df.describe()
         A    B
count  3.0  3.0
mean   5.0  2.0
std    2.0  1.0
min    3.0  1.0
25%    4.0  1.5
50%    5.0  2.0
75%    6.0  2.5
max    7.0  3.0

Here, we are using the DataFrame's describe(~) method. Most of the methods that are available for DataFrames are also available for Series. For instance, we can compute the descriptive statistics of a single column (Series) like so:

df["A"].describe()
count 3.0
mean 5.0
std 2.0
min 3.0
25% 4.0
50% 5.0
75% 6.0
max 7.0
Name: A, dtype: float64

There are many other methods for computing basic statistics such as max(~) and mean(~).
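As a quick sketch of these methods on the same DataFrame, the snippet below calls max(~) on a single column and mean(~) on the whole DataFrame. The variable names col_max and col_means are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]})

# Called on a Series, max() returns a single value
col_max = df["A"].max()
print(col_max)   # 7

# Called on a DataFrame, mean() returns a Series with one value per column
col_means = df.mean()
print(col_means["A"], col_means["B"])   # 5.0 2.0
```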

Selecting a subset of the DataFrame

To select a subset of the DataFrame, use one of the following: square bracket notation, the loc property, the iloc property, or the query(~) method.

Consider the following DataFrame:

df = pd.DataFrame({"A":[3,5,7],"B":[1,2,3]}, index=["a","b","c"])
df
A B
a 3 1
b 5 2
c 7 3

Square bracket notation

To get column B:

df["B"] # returns a Series
a 1
b 2
c 3
Name: B, dtype: int64

To get columns A and B, pass in a list of column labels like so:

df[["A","B"]] # returns a DataFrame
A B
a 3 1
b 5 2
c 7 3

loc property

Pandas' DataFrame.loc is used to access or update values of the DataFrame using row and column labels. Note that loc is a property and not a function, so we provide the parameters using [] notation.

We show df here again for your reference:

df
A B
a 3 1
b 5 2
c 7 3

To access the value at row b and column B using row and column labels:

df.loc["b","B"]
2

To access row b:

df.loc["b"] # returns a Series
A 5
B 2
Name: b, dtype: int64

To access column B:

df.loc[:,"B"] # returns a Series
a 1
b 2
c 3
Name: B, dtype: int64

Here, the : before the comma indicates that we want to retrieve all rows. The "B" after the comma then indicates that we just want to fetch column B.

NOTE

loc is an extremely powerful property that allows you to perform granular filtering of data. To learn more, click here.
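The examples above only read values. As a minimal sketch of a few other common uses of loc, the snippet below selects several rows by a list of labels, slices by label, and updates a single value. The new value 20 is arbitrary and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]}, index=["a", "b", "c"])

# Select column A for rows a and c by passing a list of row labels
rows = df.loc[["a", "c"], "A"]
print(list(rows))   # [3, 7]

# Label slices include both endpoints, unlike standard Python slices
sliced = df.loc["a":"b"]
print(list(sliced.index))   # ['a', 'b']

# Update the value at row b and column B in place
df.loc["b", "B"] = 20
print(df.loc["b", "B"])   # 20
```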

iloc property

Pandas' DataFrame.iloc is used to access or update specific rows/columns of the DataFrame using integer indices.

We show df here again for your reference:

df
A B
a 3 1
b 5 2
c 7 3

To access the second row:

df.iloc[1]
A 5
B 2
Name: b, dtype: int64

To access the second column:

df.iloc[:,1]
a 1
b 2
c 3
Name: B, dtype: int64

Here, the : before the comma means that we want to fetch all rows. The 1 after the comma means that we want to fetch the second column (the column at integer index 1, since integer indices start at 0).

NOTE

iloc is an extremely powerful property that allows you to perform granular filtering of data. To learn more, click here.
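iloc also accepts slices and negative integers, following the same conventions as standard Python lists. A short sketch on the same DataFrame (the variable names are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]}, index=["a", "b", "c"])

# Integer slices exclude the end position: rows 0 and 1 of the first column
first_two = df.iloc[0:2, 0]
print(list(first_two))   # [3, 5]

# Negative integers count from the end: -1 is the last row
last_row = df.iloc[-1]
print(last_row["A"], last_row["B"])   # 7 3
```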

query method

Pandas' DataFrame.query(~) method filters rows according to the provided boolean expression.

We show the df here again for your reference:

df
A B
a 3 1
b 5 2
c 7 3

To get all rows where the value for column A is 3:

df.query("A == 3") # returns a DataFrame
A B
a 3 1

To get all rows where value for column A is greater than 2, and value for column B is not equal to 2:

df.query("A > 2 and B != 2")
A B
a 3 1
c 7 3

NOTE

The DataFrame's query(~) is an extremely powerful method that allows you to perform granular querying of data. To learn more, click here.
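The expression passed to query(~) can also refer to Python variables by prefixing them with @. The sketch below uses a hypothetical variable named threshold and compares the result with the equivalent boolean mask.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]}, index=["a", "b", "c"])

threshold = 4   # hypothetical cut-off value for this example

# @threshold refers to the Python variable defined above
result = df.query("A > @threshold")
print(list(result.index))   # ['b', 'c']

# The same filter written with a boolean mask
mask_result = df[df["A"] > threshold]
print(result.equals(mask_result))   # True
```

Both forms return a new DataFrame; which one to use is largely a matter of readability.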

Published by Isshin Inada