Comprehensive guide on importing tables from PostgreSQL as Pandas DataFrames
Sample data in PostgreSQL
Suppose we have the following users table in our PostgreSQL database called test_db:

 id | name  | age
----+-------+------
  1 | alex  |  20
  2 | cathy |  30
  3 | bob   | NULL
Creating the test_db database and users table
First, create the test_db database like so:
CREATE DATABASE test_db;
If you are using psql in the command line, use \c to connect to the test_db database:
\c test_db
Once connected, we can create the users table like so:
DROP TABLE IF EXISTS users;
CREATE TABLE users (
   id INT GENERATED ALWAYS AS IDENTITY,
   name TEXT NOT NULL,
   age INT,
   PRIMARY KEY(id)
);
INSERT INTO users (name, age)
VALUES
   ('alex', 20),
   ('cathy', 30),
   ('bob', NULL);
Connecting to the database using sqlalchemy
To connect to our PostgreSQL server in Python, we first establish a connection to the server using the sqlalchemy library:
from sqlalchemy import create_engine
The create_engine(~) function creates an engine object that manages connections to the server.
Connecting as root user
Now, to connect to the test_db database on the PostgreSQL server hosted on port 5432 on your local machine (localhost) as the user called postgres:
engine = create_engine('postgresql://postgres@localhost:5432/test_db')
By convention, PostgreSQL servers are hosted on port 5432.
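Note that create_engine(~) does not connect to the server right away - the first actual connection happens when the engine is used. A quick way to verify that the engine can reach the server is to open a connection and run a trivial query. The sketch below uses an in-memory SQLite engine so it runs standalone; in practice, reuse the PostgreSQL engine defined above:

```python
from sqlalchemy import create_engine, text

# In-memory SQLite engine for illustration; substitute your PostgreSQL
# engine, e.g. create_engine('postgresql://postgres@localhost:5432/test_db')
engine = create_engine('sqlite://')

# Opening a connection forces an actual round trip to the server
with engine.connect() as conn:
    result = conn.execute(text('SELECT 1')).scalar()

print(result)  # 1
```

If the credentials or host are wrong, the engine.connect() call is where the error (such as the OperationalError shown below) will surface.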
Troubleshooting for Mac users
sqlalchemy requires that you have the psycopg2 library installed. Use conda or pip to install this library:
conda install psycopg2
pip install psycopg2
When running the above code, you may encounter the following issue:
OperationalError: (psycopg2.OperationalError) connection to server at "localhost" (::1), port 5432 failed: FATAL: role "postgres" does not exist
This can happen if you are using a Mac and used Homebrew to install your Postgres server. The fix is to create a user called postgres by invoking the createuser program that should already be installed.
For non-M1 Mac users, run the following in the terminal:
/usr/local/opt/postgres/bin/createuser -s postgres
For M1 Mac users, run the following in the terminal:
/opt/homebrew/bin/createuser -s postgres
Here, the -s stands for superuser; you are creating a new user called postgres with superuser privileges.
Connecting with password
Suppose you have a role with username alex and login password '12345'. You can connect to the database test_db hosted on your local machine like so:
engine = create_engine('postgresql://alex:12345@localhost:5432/test_db')
Connecting to remote database
To connect to a remote database (e.g. PostgreSQL managed by Heroku), you need the following database credentials:
Host: (e.g. test_host)
Port: 5432 (by convention)
User: (e.g. alex)
Password: (e.g. 12345)
Database: (e.g. test_db)
This information should be provided by your PostgreSQL provider (e.g. Heroku or Supabase).
Now, to make the connection:
engine = create_engine('postgresql://alex:12345@test_host:5432/test_db')
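As a side note, if your password contains special characters (e.g. @ or /), embedding it directly in the URL string can break parsing. SQLAlchemy (1.4+) provides the URL.create(~) helper, which builds the URL safely from the individual credentials. A sketch using the example credentials above:

```python
from sqlalchemy.engine import URL

# Build the connection URL from individual fields; special characters in
# the password are escaped automatically. The values here are the example
# credentials from above - substitute your own.
url = URL.create(
    drivername='postgresql',
    username='alex',
    password='12345',
    host='test_host',
    port=5432,
    database='test_db',
)

# The resulting URL object can be passed directly to create_engine:
# engine = create_engine(url)
```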
Using Pandas built-in method to read Postgres tables as DataFrame
Pandas has a series of convenient methods to create DataFrames from Postgres tables:
read_sql_table(~): reads an entire table as a DataFrame
read_sql_query(~): reads the result of a SQL query (e.g. SELECT name FROM users;) as a DataFrame
read_sql(~): you can either pass in a table name (which will call read_sql_table(~)) or a SQL query (which will call read_sql_query(~))
For each of these methods, you must pass the engine object defined above to the argument con.
Using read_sql_table(~)
Let's read the users table by using the read_sql_table(~) method:
import pandas as pd
pd.read_sql_table('users', con=engine)
   id   name   age
0   1   alex  20.0
1   2  cathy  30.0
2   3    bob   NaN
To set the id column as the index for the DataFrame, set the index_col argument:
pd.read_sql_table('users', index_col='id', con=engine)
     name   age
id
1    alex  20.0
2   cathy  30.0
3     bob   NaN
Specifying columns to extract
As seen in the above results, all columns are read by default. To read specific columns instead, pass a list of column labels:
pd.read_sql_table('users', columns=['name'], con=engine)
    name
0   alex
1  cathy
2    bob
Here, we are only reading the name column.
Reading by chunk
The size of your Postgres tables can be quite big - to the point where the entire table cannot fit into memory. In such cases, you may wish to read the table in chunks using the chunksize argument:
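A runnable sketch of chunked reading with chunksize=2, shown with an in-memory SQLite copy of the users table so it runs standalone; with PostgreSQL, reuse the engine defined earlier:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite copy of the users table so this snippet runs standalone;
# with PostgreSQL, reuse the engine defined earlier in this guide.
engine = create_engine('sqlite://')
pd.DataFrame({'id': [1, 2, 3],
              'name': ['alex', 'cathy', 'bob'],
              'age': [20, 30, None]}).to_sql('users', con=engine, index=False)

# chunksize=2 returns an iterator that yields DataFrames of up to 2 rows
for chunk in pd.read_sql_table('users', con=engine, chunksize=2):
    print(chunk)
```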
Note the following:

chunksize=2 means that we read 2 records from the table at once
the return type of the read_sql_table(~) method when you specify chunksize is an iterator, which means that you can loop through it using for
Using read_sql_query(~)
The first argument of the read_sql_query(~) method is a SQL query, and the method returns the result of the query as a Pandas DataFrame. Here's an example:
pd.read_sql_query('SELECT * FROM users;', con=engine)
   id   name   age
0   1   alex  20.0
1   2  cathy  30.0
2   3    bob   NaN
We can modify our SQL query to retrieve only the data we are interested in. For instance, to obtain the name column where id is greater than or equal to 2:
pd.read_sql_query('SELECT name FROM users WHERE id >= 2;', con=engine)
    name
0  cathy
1    bob
Arguments such as chunksize and index_col that we explored for read_sql_table(~) are also available for read_sql_query(~).
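For instance, a sketch of passing index_col to read_sql_query(~), again shown with an in-memory SQLite copy of the users table so it runs standalone:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite copy of the users table for illustration; with
# PostgreSQL, reuse the engine defined earlier in this guide.
engine = create_engine('sqlite://')
pd.DataFrame({'id': [1, 2, 3],
              'name': ['alex', 'cathy', 'bob'],
              'age': [20, 30, None]}).to_sql('users', con=engine, index=False)

# index_col works with read_sql_query just as it does with read_sql_table
df = pd.read_sql_query('SELECT * FROM users;', con=engine, index_col='id')
print(df)
```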
Unlike read_sql_table(~), the read_sql_query(~) method has a dtype argument, which is a map indicating the desired data type of the columns of the returned DataFrame:
df = pd.read_sql_query('SELECT name, age FROM users;', con=engine)
df.dtypes
name     object
age     float64
dtype: object
By default, Pandas will respect the type indicated by the table's schema. By specifying dtype, we can modify the data types on the fly:
map_dtype = {
   'name': 'string',
   'age': 'f4'   # 4 bytes (32-bits)
}
df = pd.read_sql_query('SELECT name, age FROM users;', con=engine, dtype=map_dtype)
df.dtypes
name     string
age     float32
dtype: object
Here, we indicate that we want the name column to be of type string, and the age column to be of type float32.
Using read_sql(~)
The read_sql(~) method is a hybrid of read_sql_table(~) and read_sql_query(~). You can either pass a table name:
pd.read_sql('users', con=engine)
   id   name   age
0   1   alex  20.0
1   2  cathy  30.0
2   3    bob   NaN
Or you can pass a SQL query:
pd.read_sql('SELECT name, age FROM users;', con=engine)
    name   age
0   alex  20.0
1  cathy  30.0
2    bob   NaN
read_sql(~) has the same arguments as read_sql_table(~).