search
Search
Login
Unlock 100+ guides
menu
menu
web
search toc
close
Comments
Log in or sign up
Cancel
Post
account_circle
Profile
exit_to_app
Sign out
What does this mean?
Why is this true?
Give me some examples!
search
keyboard_voice
close
Searching Tips
Search for a recipe:
"Creating a table in MySQL"
Search for an API documentation: "@append"
Search for code: "!dataframe"
Apply a tag filter: "#python"
Useful Shortcuts
/ to open search panel
Esc to close search panel
to navigate between search results
d to clear all current filters
Enter to expand content preview
icon_star
Doc Search
icon_star
Code Search Beta
SORRY NOTHING FOUND!
mic
Start speaking...
Voice search is only supported in Safari and Chrome.
Navigate to
chevron_leftCreating DataFrames Cookbook
Combining multiple Series into a DataFrameCombining multiple Series to form a DataFrameConverting a Series to a DataFrameConverting list of lists into DataFrameConverting list to DataFrameConverting percent string into a numeric for read_csvConverting scikit-learn dataset to Pandas DataFrameConverting string data into a DataFrameCreating a DataFrame from a stringCreating a DataFrame using listsCreating a DataFrame with different type for each columnCreating a DataFrame with empty valuesCreating a DataFrame with missing valuesCreating a DataFrame with random numbersCreating a DataFrame with zerosCreating a MultiIndex DataFrameCreating a Pandas DataFrameCreating a single DataFrame from multiple filesCreating empty DataFrame with only column labelsFilling missing values when using read_csvImporting DatasetImporting tables from PostgreSQL as Pandas DataFramesInitialising a DataFrame using a constantInitialising a DataFrame using a dictionaryInitialising a DataFrame using a list of dictionariesInserting lists into a DataFrame cellKeeping leading zeroes when using read_csvParsing dates when using read_csvPreventing strings from getting parsed as NaN for read_csvReading data from GitHubReading file without headerReading large CSV files in chunksReading n random lines using read_csvReading space-delimited filesReading specific columns from fileReading tab-delimited filesReading the first few lines of a file to create DataFrameReading the last n lines of a fileReading URL using read_csvReading zipped csv file as a DataFrameRemoving Unnamed:0 columnResolving ParserError: Error tokenizing dataSaving DataFrame as zipped csvSkipping rows without skipping header for read_csvSpecifying data type for read_csvTreating missing values as empty strings rather than NaN for read_csv
check_circle
Mark as learned
thumb_up
1
thumb_down
0
chat_bubble_outline
1
Comment
auto_stories Bi-column layout
settings

Comprehensive guide on importing tables from PostgreSQL as Pandas DataFrames

schedule Aug 12, 2023
Last updated
local_offer
PythonPandas
Tags
mode_heat
Master the mathematics behind data science with 100+ top-tier guides
Start your free 7-days trial now!

Sample data in PostgreSQL

Suppose we have the following users table in our PostgreSQL database called test_db:

id | name | age
----+-------+------
1 | alex | 20
2 | cathy | 30
3 | bob | NULL
Creating the test_db database and users table
expand_more

First create the test_db database like so:

CREATE DATABASE test_db;

If you are using psql in the command line, use \c to connect with the test_db database:

\c test_db

Once connected, we can create the users table like so:

DROP TABLE IF EXISTS users;
CREATE TABLE users (
id INT GENERATED ALWAYS AS IDENTITY,
name TEXT NOT NULL,
age INT,
PRIMARY KEY(id)
);
INSERT INTO users (name, age) VALUES
('alex', 20),
('cathy', 30),
('bob', NULL);

Connecting to the database using sqlalchemy

To connect to our PostgreSQL server in Python, we must first establish a connection to the server. We can do so by using the sqlalchemy library:

from sqlalchemy import create_engine

The create_engine(~) method creates a connection to the server.

Connecting as root user

Now, to connect to the test_db database on the PostgreSQL server hosted on port 5432 on your local machine (localhost) as the user called postgres:

engine = create_engine('postgresql://postgres@localhost:5432/test_db')

By convention, PostgreSQL servers are hosted on port 5432.

Warning.

Troubleshooting for Mac users

sqlalchemy requires that you have the psycopg2 library installed. Use conda or pip to install this library:

conda install psycopg2
pip install psycopg2
* * *

When running the above code, you may encounter the following issue:

OperationalError: (psycopg2.OperationalError) connection to server at "localhost" (::1),
port 5432 failed: FATAL: role "postgres" does not exist

This can happen if you are using a Mac and used homebrew to install your Postgres server. The fix is to create a user called postgres by invoking the createruser program that should already be installed.

For non-M1 Mac users, run the following in the terminal:

/usr/local/opt/postgres/bin/createuser -s postgres

For M1 Mac users, run the following in the terminal:

/opt/homebrew/bin/createuser -s postgres

Here, the -s stands for superuser; you are creating a new user called postgres with superuser privileges.

Connecting with password

Suppose you have a role with username alex with login password '12345'. You can connect to the database test_db hosted on your local machine like so:

engine = create_engine('postgresql://alex:12345@localhost:5432/test_db')

Connecting to remote database

To connect to a remote database (e.g. PostgreSQL managed by Heroku), you need the following database credentials:

Host: (e.g. test_host)
Port: 5432 (by convention)
User: (e.g. alex)
Password: (e.g. 12345)
Database: (e.g. test_db)

These information should be provided by your PostgreSQL providers like Heroku or Supabase.

Now, to make the connection:

engine = create_engine('postgresql://alex:12345@test_host:5432/test_db')

Using Pandas built-in method to read Postgres tables as DataFrame

Pandas has a series of convenient methods to create DataFrames from Postgres tables:

  • read_sql_table(~): reads an entire table as a DataFrame

  • read_sql_query(~): reads the result of a SQL query (e.g. SELECT name FROM users;) as a DataFrame

  • read_sql(~): you can either pass in a table name (which will call read_sql_table(~)) or a SQL query (which will call read_sql_query(~))

For each of these methods, you must pass the engine object defined above to the argument con.

Using read_sql_table(~)

Let's read the users table by using the read_sql_table(~) method:

import pandas as pd
pd.read_sql_table('users', con=engine)
id name age
0 1 alex 20.0
1 2 cathy 30.0
2 3 bob NaN

To set the id column as the index for the DataFrame, set the index_col argument:

pd.read_sql_table('users', index_col='id', con=engine)
name age
id
1 alex 20.0
2 cathy 30.0
3 bob NaN

Specifying columns to extract

As seen in the above results, all columns are read by default. To read specific columns instead, pass a list of column labels:

pd.read_sql_table('users', columns=['name'], con=engine)
name
0 alex
1 cathy
2 bob

Here, we are only reading the name column.

Reading by chunk

The size of your Postgres tables can be quite big - to the point where the entire table cannot fit into memory. In such cases, you may wish to read the table in chunks using the chunksize argument:

chunks_iterator = pd.read_sql_table('users', chunksize=2, con=engine)
for chunk in chunks_iterator:
print(type(chunk)) # Pandas DataFrame
print(chunk)
<class 'pandas.core.frame.DataFrame'>
id name age
0 1 alex 20
1 2 cathy 30
<class 'pandas.core.frame.DataFrame'>
id name age
0 3 bob None

Note the following:

  • chunksize=2 means that we read 2 records from the table at once

  • the return type of the read_sql_table(~) method when you specify chunksize is an iterator, which means that you can loop through it using for.

Using read_sql_query(~)

The first argument of read_sql_query(~) method is a SQL query, and the method returns the result of the query as a Pandas DataFrame. Hers's an example:

pd.read_sql_query('SELECT * FROM users;', con=engine)
id name age
0 1 alex 20.0
1 2 cathy 30.0
2 3 bob NaN

We can modify our SQL query to retrieve only the data we are interested in. For instance, to obtain the name column where id is larger or equal to 2:

pd.read_sql_query('SELECT name FROM users WHERE id >= 2;', con=engine)
name
0 cathy
1 bob
NOTE

Arguments such as chunksize and index_col that we explored for read_sql_table(~) are also available for read_sql_query(~).

Unlike read_sql_table(~), the read_sql_query(~) method has a dtype argument which is a map indicating the desired data type of the columns of the returned DataFrame:

df = pd.read_sql_query('SELECT name, age FROM users;', con=engine)
name object
age float64
dtype: object

By default, Pandas will respect the type indicated by the table's schema. By specifying dname, we can modify the data types on the fly:

map_dtype = {
'name': 'string',
'age': 'f4' # 4 bytes (32-bits)
}

df = pd.read_sql_query('SELECT name, age FROM users;', con=engine, dtype=map_dtype)
name string
age float32
dtype: object

Here, we indicate that we want the name column to be of type string, while the age column to be of type float32.

Using read_sql(~)

The read_sql(~) method is a hybrid of real_sql_table(~) and real_sql_query(~). You can either pass a table name:

pd.read_sql('users', con=engine)
id name age
0 1 alex 20.0
1 2 cathy 30.0
2 3 bob NaN

Or you can pass a SQL query:

pd.read_sql('SELECT name, age FROM users;', con=engine)
name age
0 alex 20.0
1 cathy 30.0
2 bob NaN
NOTE

read_sql(~) has the same arguments as read_sql_table(~).

robocat
Published by Isshin Inada
Edited by 0 others
Did you find this page useful?
thumb_up
thumb_down
Comment
Citation
Ask a question or leave a feedback...