Comprehensive guide on importing tables from PostgreSQL as Pandas DataFrames
Sample data in PostgreSQL
Suppose we have the following users table in our PostgreSQL database called test_db:

 id | name  | age
----+-------+------
  1 | alex  |  20
  2 | cathy |  30
  3 | bob   | NULL
Creating the test_db database and users table
First, create the test_db database like so:
CREATE DATABASE test_db;
If you are using psql in the command line, use \c to connect to the test_db database:
\c test_db
Once connected, we can create the users table like so:
DROP TABLE IF EXISTS users;
CREATE TABLE users (
   id INT GENERATED ALWAYS AS IDENTITY,
   name TEXT NOT NULL,
   age INT,
   PRIMARY KEY(id)
);
INSERT INTO users (name, age)
VALUES
   ('alex', 20),
   ('cathy', 30),
   ('bob', NULL);
Connecting to the database using sqlalchemy
To connect to our PostgreSQL server in Python, we first establish a connection to the server using the sqlalchemy library:
from sqlalchemy import create_engine
The create_engine(~) function creates an engine object that manages connections to the server.
Connecting as root user
Now, to connect to the test_db database on the PostgreSQL server hosted on port 5432 on your local machine (localhost) as the user called postgres:
engine = create_engine('postgresql://postgres@localhost:5432/test_db')
By convention, PostgreSQL servers are hosted on port 5432.
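Note that create_engine(~) does not connect to the server right away - the first actual connection happens when the engine is used. A quick way to verify that the engine can reach the server is to open a connection and run a trivial query. The sketch below uses an in-memory SQLite engine so it runs standalone; in practice, reuse the PostgreSQL engine defined above:

```python
from sqlalchemy import create_engine, text

# In-memory SQLite engine for illustration; substitute your PostgreSQL
# engine, e.g. create_engine('postgresql://postgres@localhost:5432/test_db')
engine = create_engine('sqlite://')

# Opening a connection forces an actual round trip to the server
with engine.connect() as conn:
    result = conn.execute(text('SELECT 1')).scalar()

print(result)  # 1
```

If the credentials or host are wrong, the engine.connect() call is where the error (such as the OperationalError shown below) will surface.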
Troubleshooting for Mac users
sqlalchemy requires that you have the psycopg2 library installed. Use conda or pip to install this library:
conda install psycopg2
pip install psycopg2
When running the above code, you may encounter the following issue:
OperationalError: (psycopg2.OperationalError) connection to server at "localhost" (::1), port 5432 failed: FATAL: role "postgres" does not exist
This can happen if you are using a Mac and used Homebrew to install your Postgres server. The fix is to create a user called postgres by invoking the createuser program that should already be installed.
For non-M1 Mac users, run the following in the terminal:
/usr/local/opt/postgres/bin/createuser -s postgres
For M1 Mac users, run the following in the terminal:
/opt/homebrew/bin/createuser -s postgres
Here, the -s stands for superuser; you are creating a new user called postgres with superuser privileges.
Connecting with password
Suppose you have a role with username alex and login password '12345'. You can connect to the database test_db hosted on your local machine like so:
engine = create_engine('postgresql://alex:12345@localhost:5432/test_db')
Connecting to remote database
To connect to a remote database (e.g. PostgreSQL managed by Heroku), you need the following database credentials:
Host: (e.g. test_host)
Port: 5432 (by convention)
User: (e.g. alex)
Password: (e.g. 12345)
Database: (e.g. test_db)
This information should be provided by your PostgreSQL provider (e.g. Heroku or Supabase).
Now, to make the connection:
engine = create_engine('postgresql://alex:12345@test_host:5432/test_db')
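As a side note, if your password contains special characters (e.g. @ or /), embedding it directly in the URL string can break parsing. SQLAlchemy (1.4+) provides the URL.create(~) helper, which builds the URL safely from the individual credentials. A sketch using the example credentials above:

```python
from sqlalchemy.engine import URL

# Build the connection URL from individual fields; special characters in
# the password are escaped automatically. The values here are the example
# credentials from above - substitute your own.
url = URL.create(
    drivername='postgresql',
    username='alex',
    password='12345',
    host='test_host',
    port=5432,
    database='test_db',
)

# The resulting URL object can be passed directly to create_engine:
# engine = create_engine(url)
```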
Using Pandas built-in method to read Postgres tables as DataFrame
Pandas has a series of convenient methods to create DataFrames from Postgres tables:
read_sql_table(~): reads an entire table as a DataFrame
read_sql_query(~): reads the result of a SQL query (e.g. SELECT name FROM users;) as a DataFrame
read_sql(~): you can either pass in a table name (which will call read_sql_table(~)) or a SQL query (which will call read_sql_query(~))
For each of these methods, you must pass the engine object defined above to the argument con.
Using read_sql_table(~)
Let's read the users table by using the read_sql_table(~) method:
import pandas as pd
pd.read_sql_table('users', con=engine)
   id   name   age
0   1   alex  20.0
1   2  cathy  30.0
2   3    bob   NaN
To set the id column as the index for the DataFrame, set the index_col argument:
pd.read_sql_table('users', index_col='id', con=engine)
     name   age
id
1    alex  20.0
2   cathy  30.0
3     bob   NaN
Specifying columns to extract
As seen in the above results, all columns are read by default. To read specific columns instead, pass a list of column labels:
pd.read_sql_table('users', columns=['name'], con=engine)
    name
0   alex
1  cathy
2    bob
Here, we are only reading the name column.
Reading by chunk
The size of your Postgres tables can be quite big - to the point where the entire table cannot fit into memory. In such cases, you may wish to read the table in chunks using the chunksize argument:
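A runnable sketch of chunked reading with chunksize=2, shown with an in-memory SQLite copy of the users table so it runs standalone; with PostgreSQL, reuse the engine defined earlier:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite copy of the users table so this snippet runs standalone;
# with PostgreSQL, reuse the engine defined earlier in this guide.
engine = create_engine('sqlite://')
pd.DataFrame({'id': [1, 2, 3],
              'name': ['alex', 'cathy', 'bob'],
              'age': [20, 30, None]}).to_sql('users', con=engine, index=False)

# chunksize=2 returns an iterator that yields DataFrames of up to 2 rows
for chunk in pd.read_sql_table('users', con=engine, chunksize=2):
    print(chunk)
```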
Note the following:

chunksize=2 means that we read 2 records from the table at once
the return type of the read_sql_table(~) method when you specify chunksize is an iterator, which means that you can loop through it using for
Using read_sql_query(~)
The first argument of the read_sql_query(~) method is a SQL query, and the method returns the result of the query as a Pandas DataFrame. Here's an example:
pd.read_sql_query('SELECT * FROM users;', con=engine)
   id   name   age
0   1   alex  20.0
1   2  cathy  30.0
2   3    bob   NaN
We can modify our SQL query to retrieve only the data we are interested in. For instance, to obtain the name column where id is greater than or equal to 2:
pd.read_sql_query('SELECT name FROM users WHERE id >= 2;', con=engine)
    name
0  cathy
1    bob
Arguments such as chunksize and index_col that we explored for read_sql_table(~) are also available for read_sql_query(~).
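For instance, a sketch of passing index_col to read_sql_query(~), again shown with an in-memory SQLite copy of the users table so it runs standalone:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite copy of the users table for illustration; with
# PostgreSQL, reuse the engine defined earlier in this guide.
engine = create_engine('sqlite://')
pd.DataFrame({'id': [1, 2, 3],
              'name': ['alex', 'cathy', 'bob'],
              'age': [20, 30, None]}).to_sql('users', con=engine, index=False)

# index_col works with read_sql_query just as it does with read_sql_table
df = pd.read_sql_query('SELECT * FROM users;', con=engine, index_col='id')
print(df)
```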
Unlike read_sql_table(~), the read_sql_query(~) method has a dtype argument, which is a map indicating the desired data type of the columns of the returned DataFrame:
df = pd.read_sql_query('SELECT name, age FROM users;', con=engine)
df.dtypes
name     object
age     float64
dtype: object
By default, Pandas will respect the type indicated by the table's schema. By specifying dtype, we can modify the data types on the fly:
map_dtype = {
   'name': 'string',
   'age': 'f4'   # 4 bytes (32-bits)
}
df = pd.read_sql_query('SELECT name, age FROM users;', con=engine, dtype=map_dtype)
df.dtypes
name     string
age     float32
dtype: object
Here, we indicate that we want the name column to be of type string, and the age column to be of type float32.
Using read_sql(~)
The read_sql(~) method is a hybrid of read_sql_table(~) and read_sql_query(~). You can either pass a table name:
pd.read_sql('users', con=engine)
   id   name   age
0   1   alex  20.0
1   2  cathy  30.0
2   3    bob   NaN
Or you can pass a SQL query:
pd.read_sql('SELECT name, age FROM users;', con=engine)
    name   age
0   alex  20.0
1  cathy  30.0
2    bob   NaN
read_sql(~) has the same arguments as read_sql_table(~).