PySpark DataFrame | sampleBy method
PySpark DataFrame's sampleBy(~) method performs stratified sampling based on a column. Consult the examples below for clarification.
Parameters
1. col | Column or string
The column by which to perform sampling.
2. fractions | dict
A map from column value to the probability with which rows holding that value are included in the sample. Values not present in the map are treated as having a sampling probability of zero. Consult the examples below for clarification.
3. seed | int | optional
Using the same value for seed produces the exact same sample every time (see the sketch after this parameter list). By default, no seed is set, which means that the outcome will be different every time you run the method.
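As a minimal sketch of the effect of seed, assuming an active SparkSession (the DataFrame and the seed value below are illustrative, not part of the example that follows):

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Illustrative single-column DataFrame (the column is named 'value' by default)
df = spark.createDataFrame(['a', 'a', 'a', 'b', 'b'], StringType())

# Fixing the seed makes repeated calls return the exact same sample
s1 = df.sampleBy('value', fractions={'a': 0.5, 'b': 0.5}, seed=24)
s2 = df.sampleBy('value', fractions={'a': 0.5, 'b': 0.5}, seed=24)
assert s1.collect() == s2.collect()

# Omitting seed means each call may return a different sample
df.sampleBy('value', fractions={'a': 0.5, 'b': 0.5}).show()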
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
from pyspark.sql.types import *
vals = ['a','a','a','a','a','a','b','b','b','b']
df = spark.createDataFrame(vals, StringType())
df.show(3)

+-----+
|value|
+-----+
|    a|
|    a|
|    a|
+-----+
only showing top 3 rows
Performing stratified sampling
Let's perform stratified sampling based on the column value:

df.sampleBy('value', fractions={'a': 0.5, 'b': 0.25}).show()

+-----+
|value|
+-----+
|    a|
|    a|
|    a|
|    b|
|    b|
+-----+
Here, rows with value 'a' will be included in our sample with a probability of 0.5, while rows with value 'b' will be included with a probability of 0.25.
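Note also that column values missing from fractions are treated as having a sampling probability of zero. As a quick sketch using the same df as above:

# 'b' is absent from fractions, so rows with value 'b' are never sampled
df.sampleBy('value', fractions={'a': 0.5}).show()

The output will only ever contain rows with value 'a'.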
The number of rows included in the sample will be different each time. For instance, specifying {'a': 0.5} does not mean that half the rows with the value 'a' will be included; instead, it means that each such row will be included with a probability of 0.5. This means that there may be cases when all rows with value 'a' end up in the final sample.
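To see this per-row behaviour concretely, here is a rough sketch, again using the df above, that repeats the unseeded sampling and prints the fluctuating counts (the exact numbers will differ on every run):

# Each of the six 'a' rows is kept independently with probability 0.5,
# so the count fluctuates around 3 rather than being exactly half
for _ in range(5):
    print(df.sampleBy('value', fractions={'a': 0.5}).count())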