PySpark SparkSession | createDataFrame method
PySpark's createDataFrame(~) method creates a new DataFrame from the given list, Pandas DataFrame, or RDD.
Parameters
1. data | list-like or Pandas DataFrame or RDD
The data used to create the new DataFrame.
2. schema | pyspark.sql.types.DataType, string or list | optional
The column names and the data type of each column.
3. samplingRatio | float | optional
If the data types are not provided via schema, then samplingRatio indicates the proportion of rows to sample when inferring each column's type. By default, only the first row is used for type inference. A short sketch follows this parameter list.
4. verifySchema | boolean | optional
Whether or not to check the data against the given schema. If a data type does not align, an error is thrown. By default, verifySchema=True.
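As a sketch of samplingRatio (the rows and ratio here are hypothetical; samplingRatio only applies when column types are inferred, e.g. from an RDD):
rdd = spark.sparkContext.parallelize([["Alex", 25], ["Bob", 30]])
df = spark.createDataFrame(rdd, samplingRatio=0.5)  # infer column types from roughly half of the rows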
Return Value
A PySpark DataFrame.
Examples
Creating a PySpark DataFrame from a list of lists
To create a PySpark DataFrame from a list of lists:
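A minimal sketch that reproduces the output below (the sample rows are taken from that output):
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]])
df.show()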
+----+---+
|  _1| _2|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To create a PySpark DataFrame from a list of lists with the column names specified:
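A sketch passing the column names as a list (again reconstructed from the output below):
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ["name", "age"])
df.show()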
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Creating a PySpark DataFrame with column names and types
To create a PySpark DataFrame with column names and types:
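One way to do this is a DDL-formatted schema string (a sketch; the exact schema form is an assumption):
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], "name string, age int")
df.show()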
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Creating a PySpark DataFrame from a list of values
To create a PySpark DataFrame from a list of values:
from pyspark.sql.types import *
vals = [3, 4, 5]
spark.createDataFrame(vals, IntegerType()).show()
+-----+
|value|
+-----+
|    3|
|    4|
|    5|
+-----+
Here, the IntegerType() indicates that the column is of type integer. This is needed in this case; otherwise, PySpark will throw an error.
Creating a PySpark DataFrame from a list of tuples
To create a PySpark DataFrame from a list of tuples:
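A sketch matching the output below:
df = spark.createDataFrame([("Alex", 25), ("Bob", 30)], ["name", "age"])
df.show()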
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Creating a PySpark DataFrame from a list of objects
To create a PySpark DataFrame from a list of objects:
data = [{"name": "Alex", "age": 20}, {"name": "Bob", "age": 30}]
df = spark.createDataFrame(data)
df.show()
+---+----+
|age|name|
+---+----+
| 20|Alex|
| 30| Bob|
+---+----+
Creating a PySpark DataFrame from an RDD
To create a PySpark DataFrame from an RDD:
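First, a sketch that builds the RDD (consistent with the parallelize(~) note below):
rdd = spark.sparkContext.parallelize([["Alex", 25], ["Bob", 30]])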
df = spark.createDataFrame(rdd, ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Here, we are using the parallelize(~) method to create an RDD.
Creating a PySpark DataFrame from a Pandas DataFrame
Consider the following Pandas DataFrame:
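A sketch that builds this Pandas DataFrame (values taken from the printout below):
import pandas as pd
df = pd.DataFrame({"A": [3, 4], "B": [5, 6]})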
   A  B
0  3  5
1  4  6
To create a PySpark DataFrame from this Pandas DataFrame:
pyspark_df = spark.createDataFrame(df)
pyspark_df.show()
+---+---+
|  A|  B|
+---+---+
|  3|  5|
|  4|  6|
+---+---+
Creating a PySpark DataFrame with a schema (StructType)
To create a PySpark DataFrame while specifying the column names and types:
from pyspark.sql.types import *
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])
rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows, schema)
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Here, name is of type string and age is of type integer.
Creating a PySpark DataFrame with date columns
To create a PySpark DataFrame with date columns, use the datetime library:
import datetime
df = spark.createDataFrame(
    [["Alex", datetime.date(1995, 12, 16)], ["Bob", datetime.date(1995, 5, 9)]],
    ["name", "birthday"]
)
df.show()
+----+----------+
|name|  birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1995-05-09|
+----+----------+
Specifying verifySchema
By default, verifySchema=True, which means that an error is thrown if there is a mismatch between the type indicated by the schema and the type inferred from data:
from pyspark.sql.types import *
schema = StructType([
    StructField("name", IntegerType()),
    StructField("age", IntegerType())
])
rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows, schema)  # verifySchema=True
org.apache.spark.api.python.PythonException:
'TypeError: field name: IntegerType can not accept object 'Alex' in type <class 'str'>'
Here, an error is thrown because the inferred type of column name is string, but we have specified the column type to be integer in our schema.
By setting verifySchema=False, PySpark will fill the column with nulls instead of throwing an error:
from pyspark.sql.types import *
schema = StructType([
    StructField("name", IntegerType()),
    StructField("age", IntegerType())
])
rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows, schema, verifySchema=False)
df.show()
+----+---+
|name|age|
+----+---+
|null| 25|
|null| 30|
+----+---+