PySpark DataFrame | sampleBy method
PySpark DataFrame's sampleBy(~) method performs stratified sampling based on a column. Consult the examples below for clarification.
Parameters
1. col | Column or string
The column by which to perform sampling.
2. fractions | dict
A map from column value to the probability with which rows holding that value are included in the sample. Values not present in the map are treated as having a sampling probability of zero. Consult the examples below for clarification.
3. seed | int | optional
Using the same value for seed produces the exact same sample every time (see the sketch after this parameter list). By default, no seed is set, which means that the outcome will be different every time you run the method.
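As a minimal sketch of the effect of seed, assuming an active SparkSession (the DataFrame and the seed value below are illustrative, not part of the example that follows):

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Illustrative single-column DataFrame (the column is named 'value' by default)
df = spark.createDataFrame(['a', 'a', 'a', 'b', 'b'], StringType())

# Fixing the seed makes repeated calls return the exact same sample
s1 = df.sampleBy('value', fractions={'a': 0.5, 'b': 0.5}, seed=24)
s2 = df.sampleBy('value', fractions={'a': 0.5, 'b': 0.5}, seed=24)
assert s1.collect() == s2.collect()

# Omitting seed means each call may return a different sample
df.sampleBy('value', fractions={'a': 0.5, 'b': 0.5}).show()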
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
from pyspark.sql.types import *
vals = ['a','a','a','a','a','a','b','b','b','b']
df = spark.createDataFrame(vals, StringType())
df.show(3)

+-----+
|value|
+-----+
|    a|
|    a|
|    a|
+-----+
only showing top 3 rows
Performing stratified sampling
Let's perform stratified sampling based on the column value:

df.sampleBy('value', fractions={'a': 0.5, 'b': 0.25}).show()

+-----+
|value|
+-----+
|    a|
|    a|
|    a|
|    b|
|    b|
+-----+
Here, rows with value 'a' will be included in our sample with a probability of 0.5, while rows with value 'b' will be included with a probability of 0.25.
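Note also that column values missing from fractions are treated as having a sampling probability of zero. As a quick sketch using the same df as above:

# 'b' is absent from fractions, so rows with value 'b' are never sampled
df.sampleBy('value', fractions={'a': 0.5}).show()

The output will only ever contain rows with value 'a'.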
The number of rows included in the sample will be different each time. For instance, specifying {'a': 0.5} does not mean that half the rows with the value 'a' will be included; instead, it means that each such row will be included with a probability of 0.5. This means that there may be cases when all rows with value 'a' end up in the final sample.
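To see this per-row behaviour concretely, here is a rough sketch, again using the df above, that repeats the unseeded sampling and prints the fluctuating counts (the exact numbers will differ on every run):

# Each of the six 'a' rows is kept independently with probability 0.5,
# so the count fluctuates around 3 rather than being exactly half
for _ in range(5):
    print(df.sampleBy('value', fractions={'a': 0.5}).count())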