# Getting Started with Pandas

Jul 1, 2022

# What is Pandas

Pandas is the most popular open-source library for data analysis in Python. The library provides simple but powerful data structures that make it easy to manipulate data. Here is a quick summary of what makes Pandas great:

Pandas has over 200 methods and properties to interact with your data. For instance, you could easily remove rows with missing values by calling the method `dropna(~)`.

Pandas is far more performant than standard Python lists in terms of speed and memory efficiency. Pandas is built on top of NumPy, which has highly efficient array operations. This means that computations on large datasets are very fast.

Pandas synergises with other data-related libraries in Python such as NumPy, Matplotlib and Scikit-learn. For instance, you could perform data preprocessing using Pandas, and use the processed data to train a machine learning model in Scikit-learn.
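As a small illustration of the `dropna(~)` call mentioned above, here is a minimal sketch. The DataFrame `raw` and its values are made up for this example and are not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row is missing an age, the third a class.
raw = pd.DataFrame({"Age": [16, np.nan, 17], "Class": ["A", "B", None]})

# dropna() returns a new DataFrame without the rows that contain
# missing values; the original DataFrame is left unchanged.
cleaned = raw.dropna()
print(cleaned)
```

Only the first row survives here, since it is the only one without a missing value.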

# Data Structure

## DataFrame

In Pandas, the primary data structure used is called DataFrame. You can think of a DataFrame as a standard table that is composed of rows and columns. For example, DataFrames can be used to represent the following table that has 3 rows and 3 columns:

| | Name | Age | Class |
|---|---|---|---|
| 0 | Alex | 16 | A |
| 1 | Cathy | 17 | B |
| 2 | Bob | 17 | A |

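To create a DataFrame representing this table in Pandas, use the `DataFrame` constructor:

```
import pandas as pd

# Each dictionary key becomes a column label
df = pd.DataFrame({"Name": ["Alex", "Cathy", "Bob"], "Age": [16, 17, 17], "Class": ["A", "B", "A"]})
df
    Name  Age Class
0   Alex   16     A
1  Cathy   17     B
2    Bob   17     A
```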

As you can see, DataFrames can store a diverse range of data types such as strings, numbers, categories, arrays and so on.
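To see which data type each column holds, one option is the `dtypes` property. The sketch below rebuilds the table from a dictionary of columns, which is one of several possible constructions. Converting `Class` to a categorical column is an assumption added for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alex", "Cathy", "Bob"],
    "Age": [16, 17, 17],
    "Class": ["A", "B", "A"],
})

# Optionally store Class as a categorical column instead of plain strings.
df["Class"] = df["Class"].astype("category")

# dtypes returns a Series mapping each column label to its data type.
print(df.dtypes)
```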

Each row and column of the DataFrame is represented using a data structure known as Series. You can think of a Series as a combination of an array and a dictionary, in which you can access data using either integer indices or labels.

To access a particular column of a DataFrame, use the `[]` notation with the column label like so:

```
df["Name"]
0     Alex
1    Cathy
2      Bob
Name: Name, dtype: object
```

Here, we are accessing the `Name` column and the return type is `Series`. You can access individual values in a Series using integer indices, just as you would for standard arrays:

```
col_name = df["Name"]   # col_name is a Series
col_name[1]
'Cathy'
```

## Series

We have said that a Series can be regarded as a hybrid of arrays and dictionaries. Leaving our DataFrame behind, let us explore Series as a data structure in a bit more depth.

We can create a Series from a list like so:

```
import pandas as pd

data = pd.Series([5, 9, 3])
data
0    5
1    9
2    3
dtype: int64
```

Here, the output is somewhat confusing - the first column (0, 1, 2) is actually just the labels attached to the values. By default, the labels are sequential integers (`0`, `1`, `2`, ...). As stated earlier, we can access individual values using integer notation:

```
data[1]
9
```

We can also specify the labels using the `index` argument:

```
data = pd.Series([5, 9, 3], index=["a", "b", "c"])
data
a    5
b    9
c    3
dtype: int64
```

We can refer to the value with label `"b"` like so:

```
data["b"]
9
```

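Note that we can still refer to values by integer position, just as we do for standard arrays. When the labels are not integers, use `iloc` for this, since positional access with plain `data[1]` is deprecated in recent versions of Pandas:

```
data.iloc[1]
9
```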
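The two ways of addressing a value can be made explicit with the `loc` (label-based) and `iloc` (position-based) accessors. The following is a minimal sketch using the same three values. The variable names are chosen for illustration.

```python
import pandas as pd

data = pd.Series([5, 9, 3], index=["a", "b", "c"])

by_label = data.loc["b"]     # dictionary-like lookup by label
by_position = data.iloc[1]   # array-like lookup by integer position
first_two = data.iloc[0:2]   # positional slice; the end position is excluded

print(by_label, by_position)
print(first_two)
```

Being explicit about `loc` versus `iloc` avoids ambiguity when the labels themselves are integers.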

# Computing basic statistics

Pandas comes with over 200 methods and properties that allow you to easily manipulate data. In this section, we will introduce some basic manipulation techniques.

Consider the following DataFrame:

```
import pandas as pd

df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]})
df
   A  B
0  3  1
1  5  2
2  7  3
```

To compute the descriptive statistics of each column:

```
df.describe()
         A    B
count  3.0  3.0
mean   5.0  2.0
std    2.0  1.0
min    3.0  1.0
25%    4.0  1.5
50%    5.0  2.0
75%    6.0  2.5
max    7.0  3.0
```

Here, we are using the DataFrame's `describe(~)` method. Most of the methods that are available for DataFrames are also available for Series. For instance, we can compute the descriptive statistics of a single column (a Series) like so:

```
df["A"].describe()
count    3.0
mean     5.0
std      2.0
min      3.0
25%      4.0
50%      5.0
75%      6.0
max      7.0
Name: A, dtype: float64
```

There are many other methods for computing basic statistics such as `max(~)` and `mean(~)`.
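As a quick sketch of `max(~)` and `mean(~)`, assuming the same two-column DataFrame used in this section:

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]})

col_max = df.max()        # Series holding the maximum of each column
col_mean = df.mean()      # Series holding the mean of each column
a_mean = df["A"].mean()   # a single number when called on one column

print(col_max)
print(col_mean)
print(a_mean)
```

Called on a DataFrame, these methods return one value per column. Called on a Series, they return a single value.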

# Selecting a subset of the DataFrame

To select a subset of the DataFrame, use one of the following: the square bracket notation (`[]`), the `loc` property, the `iloc` property, or the `query(~)` method.

Consider the following DataFrame:

```
import pandas as pd

df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]}, index=["a", "b", "c"])
df
   A  B
a  3  1
b  5  2
c  7  3
```

## Square bracket notation

To get column `B`:

```
df["B"]   # returns a Series
a    1
b    2
c    3
Name: B, dtype: int64
```

To get columns `A` and `B`, pass in a list of column labels like so:

```
df[["A","B"]]   # returns a DataFrame
   A  B
a  3  1
b  5  2
c  7  3
```

## loc property

Pandas' `DataFrame.loc` is used to access or update values of the DataFrame using row and column labels. Note that `loc` is a property and not a function - we provide the parameters using `[]` notation.

We show `df` here again for your reference:

```
df
   A  B
a  3  1
b  5  2
c  7  3
```

To access the value at row `b` and column `B` using row and column labels:

```
df.loc["b", "B"]
2
```

To access row `b`:

```
df.loc["b"]
A    5
B    2
Name: b, dtype: int64
```

To access column `B`:

```
df.loc[:, "B"]
a    1
b    2
c    3
Name: B, dtype: int64
```

Here, the `:` before the comma indicates that we want to retrieve all rows. The `"B"` after the comma then indicates that we just want to fetch column `B`.

`loc` is an extremely powerful property that allows you to perform granular filtering of data.
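Beyond single labels, `loc` also accepts label slices and boolean masks, and it can be used on the left-hand side of an assignment. The sketch below assumes the same `df` with row labels `a`, `b`, `c`. The value `20` and the copy named `updated` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]}, index=["a", "b", "c"])

# Label slices with loc include both endpoints ("a" and "b" here).
first_two = df.loc["a":"b", "A"]

# A boolean mask keeps only the rows where the condition is True.
large_a = df.loc[df["A"] > 3]

# Updating a single cell; we work on a copy so that df stays intact.
updated = df.copy()
updated.loc["b", "B"] = 20

print(first_two)
print(large_a)
print(updated)
```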

## iloc property

Pandas' `DataFrame.iloc` is used to access or update specific rows/columns of the DataFrame using integer indices.

We show `df` here again for your reference:

```
df
   A  B
a  3  1
b  5  2
c  7  3
```

To access the second row:

```
df.iloc[1]
A    5
B    2
Name: b, dtype: int64
```

To access the second column:

```
df.iloc[:, 1]
a    1
b    2
c    3
Name: B, dtype: int64
```

Here, the `:` before the comma means that we want to fetch all rows. The `1` after the comma means that we want to fetch the second column (the column at integer index `1`, since indices start at `0`).

`iloc` is an extremely powerful property that allows you to perform granular filtering of data.
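`iloc` follows the usual Python conventions for integer positions, including slices and negative indices. Here is a minimal sketch on the same `df`. The variable names are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]}, index=["a", "b", "c"])

# Positional slices exclude the end position, like Python lists.
first_two_rows = df.iloc[0:2]

# Negative positions count from the end.
last_row = df.iloc[-1]

# A list of positions picks specific rows; the 0 selects the first column.
picked = df.iloc[[0, 2], 0]

print(first_two_rows)
print(last_row)
print(picked)
```

Note the contrast with `loc`, where a label slice includes both endpoints.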

## query method

Pandas' `DataFrame.query(~)` method filters rows according to the provided boolean expression.

We show `df` here again for your reference:

```
df
   A  B
a  3  1
b  5  2
c  7  3
```

To get all rows where the value for column `A` is `3`:

```
df.query("A == 3")
   A  B
a  3  1
```

To get all rows where the value for column `A` is greater than `2`, and the value for column `B` is not equal to `2`:

```
df.query("A > 2 and B != 2")
   A  B
a  3  1
c  7  3
```
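A query expression can also refer to a Python variable by prefixing its name with `@`, and the same filter can be written as a boolean mask. The sketch below assumes the same `df`. The variable `threshold` is hypothetical and introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 5, 7], "B": [1, 2, 3]}, index=["a", "b", "c"])

threshold = 2  # hypothetical cut-off, referenced inside the query with @

# Filter with a query string...
via_query = df.query("A > @threshold and B != 2")

# ...or with an equivalent boolean mask (note & and the parentheses).
via_mask = df[(df["A"] > threshold) & (df["B"] != 2)]

print(via_query)
print(via_query.equals(via_mask))
```

Both forms select the same rows here. The query string tends to be easier to read as conditions grow longer, while the mask is plain Python.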