PySpark SparkSession | createDataFrame method
PySpark's createDataFrame(~) method creates a new DataFrame from the given list, Pandas DataFrame, or RDD.
Parameters
1. data | list-like or Pandas DataFrame or RDD
The data used to create the new DataFrame.
2. schema | pyspark.sql.types.DataType, string or list | optional
The column names and the data type of each column.
3. samplingRatio | float | optional
If the data types are not provided via schema, then samplingRatio indicates the proportion of rows to sample when inferring each column's type. By default, only the first row is used for type inference. A short sketch follows this parameter list.
4. verifySchema | boolean | optional
Whether or not to check the data against the given schema. If a data type does not align, an error is thrown. By default, verifySchema=True.
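As a sketch of samplingRatio (the rows and ratio here are hypothetical; samplingRatio only applies when column types are inferred, e.g. from an RDD):
rdd = spark.sparkContext.parallelize([["Alex", 25], ["Bob", 30]])
df = spark.createDataFrame(rdd, samplingRatio=0.5)  # infer column types from roughly half of the rows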
Return Value
A PySpark DataFrame.
Examples
Creating a PySpark DataFrame from a list of lists
To create a PySpark DataFrame from a list of lists:
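A minimal sketch that reproduces the output below (the sample rows are taken from that output):
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]])
df.show()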
+----+---+
|  _1| _2|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
To create a PySpark DataFrame from a list of lists with the column names specified:
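A sketch passing the column names as a list (again reconstructed from the output below):
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ["name", "age"])
df.show()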
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Creating a PySpark DataFrame with column names and types
To create a PySpark DataFrame with column names and types:
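One way to do this is a DDL-formatted schema string (a sketch; the exact schema form is an assumption):
df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], "name string, age int")
df.show()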
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Creating a PySpark DataFrame from a list of values
To create a PySpark DataFrame from a list of values:
from pyspark.sql.types import *
vals = [3, 4, 5]
spark.createDataFrame(vals, IntegerType()).show()
+-----+
|value|
+-----+
|    3|
|    4|
|    5|
+-----+
Here, the IntegerType() indicates that the column is of type integer. This is needed in this case; otherwise, PySpark will throw an error.
Creating a PySpark DataFrame from a list of tuples
To create a PySpark DataFrame from a list of tuples:
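A sketch matching the output below:
df = spark.createDataFrame([("Alex", 25), ("Bob", 30)], ["name", "age"])
df.show()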
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Creating a PySpark DataFrame from a list of objects
To create a PySpark DataFrame from a list of objects:
data = [{"name": "Alex", "age": 20}, {"name": "Bob", "age": 30}]
df = spark.createDataFrame(data)
df.show()
+---+----+
|age|name|
+---+----+
| 20|Alex|
| 30| Bob|
+---+----+
Creating a PySpark DataFrame from an RDD
To create a PySpark DataFrame from an RDD:
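First, a sketch that builds the RDD (consistent with the parallelize(~) note below):
rdd = spark.sparkContext.parallelize([["Alex", 25], ["Bob", 30]])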
df = spark.createDataFrame(rdd, ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Here, we are using the parallelize(~) method to create an RDD.
Creating a PySpark DataFrame from a Pandas DataFrame
Consider the following Pandas DataFrame:
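A sketch that builds this Pandas DataFrame (values taken from the printout below):
import pandas as pd
df = pd.DataFrame({"A": [3, 4], "B": [5, 6]})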
   A  B
0  3  5
1  4  6
To create a PySpark DataFrame from this Pandas DataFrame:
pyspark_df = spark.createDataFrame(df)
pyspark_df.show()
+---+---+
|  A|  B|
+---+---+
|  3|  5|
|  4|  6|
+---+---+
Creating a PySpark DataFrame with a schema (StructType)
To create a PySpark DataFrame while specifying the column names and types:
from pyspark.sql.types import *
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])
rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows, schema)
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
Here, name is of type string and age is of type integer.
Creating a PySpark DataFrame with date columns
To create a PySpark DataFrame with date columns, use the datetime library:
import datetime
df = spark.createDataFrame(
    [["Alex", datetime.date(1995, 12, 16)], ["Bob", datetime.date(1995, 5, 9)]],
    ["name", "birthday"]
)
df.show()
+----+----------+
|name|  birthday|
+----+----------+
|Alex|1995-12-16|
| Bob|1995-05-09|
+----+----------+
Specifying verifySchema
By default, verifySchema=True, which means that an error is thrown if there is a mismatch between the type indicated by the schema and the type inferred from data:
from pyspark.sql.types import *
schema = StructType([
    StructField("name", IntegerType()),
    StructField("age", IntegerType())
])
rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows, schema)  # verifySchema=True
org.apache.spark.api.python.PythonException:
'TypeError: field name: IntegerType can not accept object 'Alex' in type <class 'str'>'
Here, an error is thrown because the inferred type of column name is string, but we have specified the column type to be integer in our schema.
By setting verifySchema=False, PySpark will fill the column with nulls instead of throwing an error:
from pyspark.sql.types import *
schema = StructType([
    StructField("name", IntegerType()),
    StructField("age", IntegerType())
])
rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows, schema, verifySchema=False)
df.show()
+----+---+
|name|age|
+----+---+
|null| 25|
|null| 30|
+----+---+