df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()

+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 30|
|Cathy| 40|
+-----+---+

Getting rows that start with a certain substring in PySpark DataFrame

To get rows where the value in a column starts with a certain substring:

from pyspark.sql import functions as F
df.filter(F.col("name").startswith("A")).show()

+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+

Here, F.col("name").startswith("A") returns a Column of booleans, where true marks the values that begin with A:

df.select(F.col("name").startswith("A")).show()

+-------------------+
|startswith(name, A)|
+-------------------+
|               true|
|              false|
|              false|
+-------------------+

We then use the PySpark DataFrame's filter(~) method to fetch rows that correspond to True.

Published by Isshin Inada

Official PySpark Documentation

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.startswith.html
