
Reading n random lines using read_csv in Pandas

Last updated: Aug 12, 2023

Tags: Python, Pandas

When file contains a header row

Consider the following my_data.txt file:

A,B,C
1,2,3
4,5,6
7,8,9

To read n random lines using read_csv(~) in Pandas:

import random
import pandas as pd

def get_num_lines(fname):
    # Count the lines in the file without loading it all into memory
    with open(fname) as f:
        return sum(1 for _ in f)

# Number of data rows (the header line is not counted)
num_lines = get_num_lines("my_data.txt") - 1

# How many random rows do you want?
sample_size = 2
# Data rows sit on lines 1..num_lines; line 0 is the header and is never skipped
rows_to_skip = random.sample(range(1, num_lines + 1), num_lines - sample_size)

df = pd.read_csv("my_data.txt", skiprows=rows_to_skip)
df

   A  B  C
0  1  2  3
1  7  8  9

Note the following:

we first fetch the total number of lines in the file. Since the file has a header row, we subtract 1 to get the number of data rows. In this case, num_lines=3.
we then use the random.sample(~) function to randomly pick the line numbers to skip. The output above is one possible result and will vary from run to run.
- the first argument is the set of values to select from. In this case, since num_lines=3, range(1, num_lines + 1) yields the integers from 1 (inclusive) to 3 (inclusive). The range starts at 1 because line 0 of the file holds the column labels, and we don't want to skip that line. In this run, it turned out that rows_to_skip=[2], which means that the second data row is skipped.
- the second argument is the number of random integers you want. We skip num_lines - sample_size lines so that exactly sample_size rows remain.
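
The index arithmetic in the notes above can be isolated into a small helper. This is a minimal sketch, not part of Pandas: the function name pick_rows_to_skip and its has_header and seed parameters are hypothetical names introduced here for illustration.

```python
import random


def pick_rows_to_skip(num_data_rows, sample_size, has_header=True, seed=None):
    """Return the 0-based file line numbers to skip so that exactly
    sample_size data rows remain. Hypothetical helper, not a Pandas API."""
    if not 0 <= sample_size <= num_data_rows:
        raise ValueError("sample_size must be between 0 and num_data_rows")
    rng = random.Random(seed)  # the seed only makes runs repeatable
    # With a header, line 0 holds the column labels and the data rows sit on
    # lines 1..num_data_rows. Without one, they sit on lines 0..num_data_rows-1.
    first = 1 if has_header else 0
    candidates = range(first, first + num_data_rows)
    return sorted(rng.sample(candidates, num_data_rows - sample_size))


skip = pick_rows_to_skip(num_data_rows=3, sample_size=2, has_header=True)
print(skip)  # a single line number drawn from 1, 2 or 3
```

The returned list can be passed as the skiprows argument of read_csv(~). Because the candidate range never contains 0 when has_header=True, the header line is always kept.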

When file does not contain a header row

Consider the following my_data.txt file:

1,2,3
4,5,6
7,8,9

To read n random lines using read_csv(~):

import random
import pandas as pd

def get_num_lines(fname):
    # Count the lines in the file without loading it all into memory
    with open(fname) as f:
        return sum(1 for _ in f)

num_lines = get_num_lines("my_data.txt")

# How many random rows do you want?
sample_size = 2
# No header, so every line (0..num_lines-1) is a candidate to skip
rows_to_skip = random.sample(range(num_lines), num_lines - sample_size)

df = pd.read_csv("my_data.txt", skiprows=rows_to_skip, header=None)
df

   0  1  2
0  4  5  6
1  7  8  9

Note the following:

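we first fetch the total number of lines in the file. There is no header row, so nothing is subtracted. In this case, num_lines=3.
we then use the random.sample(~) function to randomly pick the line numbers to skip. Every line is a candidate here, so the range starts at 0. In this run, it turned out that rows_to_skip=[0], which means that the first line is skipped. We pass header=None so that Pandas does not treat the first remaining line as column labels.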
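
Both cases can be combined into a single function. The following is a sketch under stated assumptions rather than a definitive implementation: read_random_rows, has_header and seed are hypothetical names introduced here, and the demo writes a throwaway copy of the three-line file to a temporary directory.

```python
import os
import random
import tempfile

import pandas as pd


def read_random_rows(path, sample_size, has_header=True, seed=None):
    """Read sample_size randomly chosen data rows from a CSV file.
    Hypothetical wrapper around the recipe above, not a Pandas function."""
    with open(path) as f:
        total_lines = sum(1 for _ in f)
    first = 1 if has_header else 0            # first line that holds data
    num_rows = total_lines - first
    sample_size = min(sample_size, num_rows)  # cannot keep more rows than exist
    rng = random.Random(seed)                 # the seed only makes runs repeatable
    skip = rng.sample(range(first, total_lines), num_rows - sample_size)
    return pd.read_csv(path, skiprows=skip, header=0 if has_header else None)


# Demo on a temporary file without a header row
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_data.txt")
    with open(path, "w") as f:
        f.write("1,2,3\n4,5,6\n7,8,9\n")
    df = read_random_rows(path, sample_size=2, has_header=False)
    print(df)
```

The rows that survive keep their original file order. If a shuffled order is also needed, calling df.sample(frac=1) on the result is one option.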
