PySpark RDD | filter method
Last updated: Aug 12, 2023
Tags: PySpark
PySpark RDD's filter(~) method extracts a subset of the data based on the given function.
Parameters
1. f | function
A function that takes an item of the RDD as input and returns a boolean, where:
True indicates that the item is kept
False indicates that the item is ignored
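As a minimal sketch (the values and the is_even name here are assumed for illustration, and an active SparkContext sc is presumed), f can be any callable that maps a single element to a boolean, including a named function:
def is_even(x):
    # Keep only even numbers
    return x % 2 == 0

sc.parallelize([1, 2, 3, 4]).filter(is_even).collect()
[2, 4]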
Return Value
A PySpark RDD (pyspark.rdd.PipelinedRDD).
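For instance, as a quick sketch (again assuming an active SparkContext sc and made-up values), the returned object can be inspected with type(~); filter(~) produces a new RDD and leaves the original untouched:
filtered = sc.parallelize([1, 2, 3]).filter(lambda x: x > 1)
type(filtered)
<class 'pyspark.rdd.PipelinedRDD'>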
Examples
Consider the following RDD (the exact values are assumed here for illustration; a small list parallelized with the SparkContext sc):
rdd = sc.parallelize([4, 2, 5, 7])
rdd
ParallelCollectionRDD[7] at readRDDFromInputStream at PythonRDD.scala:413
Filtering elements of an RDD
To obtain a new RDD whose values are all strictly greater than 3:
new_rdd = rdd.filter(lambda x: x > 3)
new_rdd.collect()
[4, 5, 7]
Here, the collect() method is used to retrieve the content of the RDD as a single list.
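Since filter(~) accepts any boolean-returning function, it can also be used to drop unwanted entries; here is a minimal sketch (with assumed values) that removes None elements from an RDD:
rdd = sc.parallelize([5, None, 8, None, 2])
rdd.filter(lambda x: x is not None).collect()
[5, 8, 2]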
Published by Isshin Inada
Official PySpark Documentation
https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.RDD.filter.html