PySpark DataFrame | summary method
Last updated: Aug 12, 2023
Tags: PySpark
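PySpark DataFrame's summary(~) method returns a PySpark DataFrame containing basic summary statistics of each column. Numeric columns get every statistic; for string columns, only count, min, and max are filled in, and the remaining statistics are null.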
Parameters
1. statistics | string | optional
The statistics to compute, passed as one or more string arguments. The following are available:
count
mean
stddev
min
max
arbitrary percentiles (e.g. "60%")
By default, all the above as well as the 25%, 50%, and 75% percentiles are computed.
Return Value
PySpark DataFrame (pyspark.sql.dataframe.DataFrame).
Examples
Consider the following PySpark DataFrame:
df = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22], ["Doge", 30]], ["name", "age"])
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 24|
|Cathy| 22|
| Doge| 30|
+-----+---+
Getting the summary statistics of numeric columns of PySpark DataFrame
The summary statistics of our DataFrame are as follows:
+-------+----+-----------------+
|summary|name|              age|
+-------+----+-----------------+
|  count|   4|                4|
|   mean|null|             24.0|
| stddev|null|4.320493798938574|
|    min|Alex|               20|
|    25%|null|               20|
|    50%|null|               22|
|    75%|null|               24|
|    max|Doge|               30|
+-------+----+-----------------+
To compute certain summary statistics only:
+-------+----+---+
|summary|name|age|
+-------+----+---+
|    max|Doge| 30|
|    min|Alex| 20|
+-------+----+---+
Getting n-th percentile of numeric columns in PySpark DataFrame
To compute the 60th percentile:
+-------+----+---+
|summary|name|age|
+-------+----+---+
|    60%|null| 24|
+-------+----+---+
Getting summary statistics of certain columns in PySpark DataFrame
To summarize only certain columns, first use the select(~) method to pick those columns, and then call summary(~) on the result.
Published by Isshin Inada
Official PySpark Documentation
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.summary.html