Pandas DataFrame | sample method
Pandas DataFrame.sample(~) method returns a random sample of rows or columns of the specified size. Note that a new copy is returned, that is, modifying the returned DataFrame will not mutate the source DataFrame.
Parameters
1. n | int | optional
The size of the random sample. By default, n=1.
2. frac | float | optional
The relative size of the random sample. For instance, frac=0.6 means that the size of the random sample would be 60% of the total number of rows (or columns, when sampling along axis=1).
Specify either n or frac, not both.
3. replace | boolean | optional
Whether or not to allow the same row to be sampled more than once. By default, replace=False.
4. weights | string or array-like | optional
The weights assigned to the items. An item with a high weight is more likely to be selected. If the weights do not sum up to 1, then they are normalised so that the sum becomes 1. By default, weights=None, which means that equal weights are assigned.
5. random_state | int or numpy.random.RandomState | optional
The seed used to generate the random samples. This is used for reproducibility: if you'd like to get consistent results, then specify this parameter.
6. axis | int or string | optional
Whether to return rows or columns:
Axis | Description |
---|---|
0 or "index" | Rows will be returned. |
1 or "columns" | Columns will be returned. |
By default, axis=0.
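As a rough illustration of the string form of weights: when sampling rows, pandas reads the weights from the column with that label. The sketch below uses a made-up DataFrame df_w with a hypothetical weights column named "w"; neither name comes from this page.

```python
import pandas as pd

# Hypothetical data: the "w" column holds the sampling weights.
# The weights sum to 3, so pandas normalises them to [1, 0, 0, 0].
df_w = pd.DataFrame({"A": ["a", "b", "c", "d"], "w": [3, 0, 0, 0]})

# Passing a string tells sample(~) to use that column as the weights.
picked = df_w.sample(n=1, weights="w")
print(picked)
```

Since only row 0 has a non-zero weight, it is the only row that can be drawn here.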
Return Value
A new DataFrame containing rows or columns selected at random.
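To check that the returned DataFrame is independent of the source, here is a minimal sketch. The variable name sampled and the seed 0 are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "d"],
                   "B": ["e", "f", "g", "h"],
                   "C": ["i", "j", "k", "l"]})

# Take a sample and overwrite one of its columns.
sampled = df.sample(n=2, random_state=0)
sampled["A"] = "changed"

# The source DataFrame still holds its original values.
print(df)
```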
Examples
Consider the following DataFrame:
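import pandas as pd
df = pd.DataFrame({"A":["a","b","c","d"], "B":["e","f","g","h"], "C":["i","j","k","l"]})
df
   A  B  C
0  a  e  i
1  b  f  j
2  c  g  k
3  d  h  l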
Basic usage
To get 2 rows at random:
df.sample(n=2)
   A  B  C
3  d  h  l
0  a  e  i
Specifying frac parameter
To make the sample size half of the total number of rows, set frac=0.5:
df.sample(frac=0.5)
   A  B  C
2  c  g  k
0  a  e  i
Here, 50% of the 4 rows is 2, which is why we ended up with 2 rows.
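Two related behaviours are worth a quick sketch: frac=1 returns every row in random order, which is a common way to shuffle a DataFrame, and passing both n and frac raises a ValueError. The variable names shuffled and raised are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "d"],
                   "B": ["e", "f", "g", "h"],
                   "C": ["i", "j", "k", "l"]})

# frac=1 samples 100% of the rows, so we get all rows back in random order.
shuffled = df.sample(frac=1)
print(shuffled)

# n and frac are mutually exclusive, so this call is rejected.
raised = False
try:
    df.sample(n=2, frac=0.5)
except ValueError as err:
    raised = True
    print("ValueError:", err)
```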
Specifying replace parameter
To allow the same row to be selected more than once, set replace=True. This means that the following outcome is now possible:
df.sample(n=2, replace=True)
   A  B  C
0  a  e  i
0  a  e  i
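One consequence of sampling with replacement is that the sample can be larger than the DataFrame itself. A small sketch, with arbitrary variable names:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "d"],
                   "B": ["e", "f", "g", "h"],
                   "C": ["i", "j", "k", "l"]})

# With replacement, we can draw 6 rows from a 4-row DataFrame.
big = df.sample(n=6, replace=True)
print(len(big))

# Without replacement, asking for more rows than exist raises a ValueError.
too_many = False
try:
    df.sample(n=6)
except ValueError as err:
    too_many = True
    print("ValueError:", err)
```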
Specifying weights parameter
By default, all rows have an equal probability of getting selected. We can make certain rows more likely to be selected by setting the weights parameter, like so:
df.sample(n=1, weights=[0.7, 0.1, 0.1, 0.1])
   A  B  C
0  a  e  i
Here, row 0 will get selected 70% of the time, and each of the other rows will get selected 10% of the time. Note that the sum of the weights need not be 1; the method will automatically normalise the weights so that the sum becomes 1.
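We can sanity-check these proportions empirically by drawing many rows with replacement and counting how often each row label appears. This is only a rough check; the sample size of 10,000 and the seed 0 are arbitrary choices.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "d"],
                   "B": ["e", "f", "g", "h"],
                   "C": ["i", "j", "k", "l"]})

# Draw 10,000 rows with replacement using the same weights as above.
draws = df.sample(n=10_000, replace=True,
                  weights=[0.7, 0.1, 0.1, 0.1], random_state=0)

# Share of draws per row label; row 0 should be close to 0.7.
freq = pd.Series(draws.index).value_counts(normalize=True).sort_index()
print(freq)
```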
Specifying random_state parameter
When you need to reproduce your results, set the random_state parameter, like so:
df.sample(n=2, random_state=42)
   A  B  C
1  b  f  j
3  d  h  l
Now, no matter how many times you run this method, the result will always be the same. You can give the number 42 to your friends, and they would also get the same result on their machines!
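A quick way to confirm this is to take two samples with the same seed and compare them. The names first and second are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "d"],
                   "B": ["e", "f", "g", "h"],
                   "C": ["i", "j", "k", "l"]})

# Two independent calls with the same integer seed.
first = df.sample(n=2, random_state=42)
second = df.sample(n=2, random_state=42)

# equals(~) checks that both the labels and the values match.
print(first.equals(second))  # True
```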
Specifying axis parameter
By default, rows are sampled:
df.sample(n=1) # axis=0
   A  B  C
3  d  h  l
To get columns instead, set axis=1 like so:
df.sample(n=1, axis=1)
   B
0  e
1  f
2  g
3  h
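The other parameters work along columns as well. As a small sketch, here we sample 2 of the 3 columns and use the string form of the axis; the seed 0 is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "d"],
                   "B": ["e", "f", "g", "h"],
                   "C": ["i", "j", "k", "l"]})

# axis="columns" is equivalent to axis=1.
cols = df.sample(n=2, axis="columns", random_state=0)

# All 4 rows are kept; only the set of columns is sampled.
print(cols.columns.tolist())
print(cols.shape)
```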