Reading n random lines using read_csv in Pandas
When file contains a header row
Consider the following my_data.txt file:
A,B,C
1,2,3
4,5,6
7,8,9
To read n random lines using read_csv(~) in Pandas:
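import random
import pandas as pd

def get_num_lines(fname):
    # Count the lines in the file, including the header row
    with open(fname) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

# Number of data rows (total lines minus the header row)
num_lines = get_num_lines("my_data.txt") - 1

# How many random rows do you want?
sample_size = 2
# Data rows sit on lines 1..num_lines; range() excludes its end, hence the + 1
rows_to_skip = random.sample(range(1, num_lines + 1), num_lines - sample_size)

df = pd.read_csv("my_data.txt", skiprows=rows_to_skip)
df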
   A  B  C
0  1  2  3
1  7  8  9
Note the following:
We first fetch the total number of lines in the file. Since the file has a header row, we subtract 1 from this count. In this case, num_lines=3.
We then use the random.sample(~) function to randomly pick the line numbers to skip.
Its first argument is the set of values to select from. In this case, since num_lines=3, the integers between 1 (inclusive) and 3 (inclusive) are the candidates. We used range(1,_) because the first line of the file (line 0) holds the column labels, and so we don't want to skip it. In this case, it turned out that rows_to_skip=[2], which means that the second data row (4,5,6) is skipped.
Its second argument is the number of random integers you want, which here is num_lines-sample_size=1.
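To make the line numbering concrete, here is a minimal sketch that fixes the skipped line by hand instead of sampling it. It assumes the same file contents as above, supplied through io.StringIO so that the snippet runs without a file on disk; the variable names are illustrative only.

```python
import io

import pandas as pd

# Same contents as my_data.txt above, held in memory for the demo.
file_text = "A,B,C\n1,2,3\n4,5,6\n7,8,9\n"

# skiprows takes 0-based line numbers of the file; line 0 is the header here.
# Skipping line 2 drops the data row 4,5,6.
df_skip_middle = pd.read_csv(io.StringIO(file_text), skiprows=[2])
print(df_skip_middle)

# Skipping line 3 drops the last data row 7,8,9, so line 3 must also be
# among the candidates passed to random.sample.
df_skip_last = pd.read_csv(io.StringIO(file_text), skiprows=[3])
print(df_skip_last)
```

The first call reproduces the result shown above for rows_to_skip=[2]. The second shows why the candidate line numbers have to run all the way up to the last line of the file.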
When file does not contain a header row
Consider the following my_data.txt file:
1,2,3
4,5,6
7,8,9
To read n random lines using read_csv(~):
import random
import pandas as pd

def get_num_lines(fname):
    # Count the lines in the file
    with open(fname) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

# No header row, so every line is a data row
num_lines = get_num_lines("my_data.txt")

# How many random rows do you want?
sample_size = 2
rows_to_skip = random.sample(range(num_lines), num_lines - sample_size)

df = pd.read_csv("my_data.txt", skiprows=rows_to_skip, header=None)
df
   0  1  2
0  4  5  6
1  7  8  9
Note the following:
We first fetch the total number of lines in the file. In this case, num_lines=3.
We then use the random.sample(~) function to randomly pick the line numbers to skip. In this case, it turned out that rows_to_skip=[0], which means that the first line (1,2,3) is skipped.
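The two cases differ only in where the data rows start, so they can be folded into one function. The following is a rough sketch under that assumption, not a Pandas feature: read_random_rows and count_lines are hypothetical helper names, and the demo writes temporary copies of the two example files so that it runs on its own.

```python
import os
import random
import tempfile

import pandas as pd


def count_lines(fname):
    # Count lines by iterating over the file once, without loading it all.
    with open(fname) as f:
        return sum(1 for _ in f)


def read_random_rows(fname, sample_size, has_header=True):
    # Hypothetical helper: read `sample_size` randomly chosen data rows.
    total_lines = count_lines(fname)
    # Data rows start at line 1 when line 0 holds the column labels.
    first_data_line = 1 if has_header else 0
    data_lines = range(first_data_line, total_lines)
    if sample_size > len(data_lines):
        raise ValueError("sample_size exceeds the number of data rows")
    # Skip every data line that was not picked for the sample.
    rows_to_skip = random.sample(data_lines, len(data_lines) - sample_size)
    return pd.read_csv(
        fname,
        skiprows=rows_to_skip,
        header=0 if has_header else None,
    )


# Demo on temporary copies of the two example files.
with tempfile.TemporaryDirectory() as tmp_dir:
    with_header = os.path.join(tmp_dir, "with_header.txt")
    without_header = os.path.join(tmp_dir, "without_header.txt")
    with open(with_header, "w") as f:
        f.write("A,B,C\n1,2,3\n4,5,6\n7,8,9\n")
    with open(without_header, "w") as f:
        f.write("1,2,3\n4,5,6\n7,8,9\n")

    df_header = read_random_rows(with_header, sample_size=2)
    df_no_header = read_random_rows(without_header, sample_size=2, has_header=False)

print(df_header)
print(df_no_header)
```

Because the rows are chosen at random, the printed rows vary between runs. Each result has two rows that appear in the same order as in the file.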