File system in Databricks

Last updated: Aug 12, 2023
Tags: PySpark

What is the Databricks file system?

Every Azure Databricks workspace has a mounted file system called DBFS (Databricks File System). Under the hood, DBFS is backed by scalable object storage, but what's great about it is that we can interact with it using Unix-like commands (e.g. ls).

Besides DBFS, we also have access to the file system of the driver node. This guide walks through both types of file systems and how they are related.

File system in the driver node

Using magic command

To see the files in the driver node, use the %sh magic command:

%sh ls
azure
conf
eventlogs
hadoop_accessed_config.lst
logs
preload_class.lst

Note that the output is misleading because some directories, such as /dbfs and /Workspace, are not listed here. We will go into more detail about these hidden directories later.
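
If we want to see these directories right away, we can list the root directory of the driver node instead of the current working directory (a quick sketch - the exact listing varies by Databricks runtime):

%sh ls /

The output should include dbfs and Workspace alongside the standard Linux directories.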

Writing to and reading from the driver node's file system

Python's os library also operates on the file system of the driver node:

import os
os.listdir()
['azure',
'preload_class.lst',
'hadoop_accessed_config.lst',
'conf',
'eventlogs',
'logs']

Except when using PySpark, the files that we write end up in the file system of the driver node. For instance, let's write a dictionary as a JSON file like so:

import json
with open("./my_file.json", "w") as file:
json.dump({"A":1}, file)

The written JSON file can be found in the file system of the driver node:

%sh ls
azure
conf
...
my_file.json
...

Again, except when using PySpark functions, files are usually read from the driver node:

with open("./my_file.json", "r") as file:
    my_json = json.load(file)
print(my_json)
{'A': 1}

File system of DBFS

Using magic command

To access the files in DBFS, use the %fs magic command:

%fs ls
path name size modificationTime
1 dbfs:/FileStore/ FileStore/ 0 0
2 dbfs:/databricks-datasets/ databricks-datasets/ 0 0
3 dbfs:/databricks-results/ databricks-results/ 0 0
4 dbfs:/tmp/ tmp/ 0 0
5 dbfs:/user/ user/ 0 0

Note that the size column is misleading because it looks as though the folders are empty. However, size is always reported as zero for folders (but not for files, as we will see later).
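
To make this concrete, here is a small sketch that prints the reported size of each entry in the DBFS root using dbutils.fs.ls (covered in more detail below) - folders show 0 while files show their size in bytes:

# Print the name and size (in bytes) of every entry in the DBFS root
for entry in dbutils.fs.ls("/"):
    print(entry.name, entry.size)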

Default folders in DBFS root

Let's go through the five default folders:

FileStore: Stores files uploaded to Databricks.

databricks-datasets: Open-source sample datasets for exploration.

databricks-results: Stores the full results downloaded via the display() method.

tmp: Folder for temporary files - this folder is managed by you rather than Databricks, that is, Databricks will not delete its contents.

user: Stores uploaded datasets that are registered as Databricks tables.

Let's now take a look at what's inside the databricks-datasets/ folder:

%fs ls databricks-datasets/
path name size modificationTime
1 dbfs:/databricks-datasets/COVID/ COVID/ 0 0
2 dbfs:/databricks-datasets/README.md README.md 976 1532468253000
3 dbfs:/databricks-datasets/Rdatasets/ Rdatasets/ 0 0
4 dbfs:/databricks-datasets/SPARK_README.md SPARK_README.md 3359 1455043490000
5 dbfs:/databricks-datasets/adult/ adult/ 0 0
6 dbfs:/databricks-datasets/airlines/ airlines/ 0 0

We see that the databricks-datasets directory contains some sample datasets that we can use for our own exploration!
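
For instance, we can preview the README.md file we saw in the listing above using dbutils.fs.head, which returns the first portion of a file as a string (a quick sketch):

# Preview the beginning of the README bundled with the sample datasets
print(dbutils.fs.head("dbfs:/databricks-datasets/README.md"))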

Using dbutils command

We can also use the dbutils.fs.ls() method like so:

dbutils.fs.ls("/databricks-datasets")
[FileInfo(path='dbfs:/databricks-datasets/COVID/', name='COVID/', size=0, modificationTime=0),
FileInfo(path='dbfs:/databricks-datasets/README.md', name='README.md', size=976, modificationTime=1532468253000),
FileInfo(path='dbfs:/databricks-datasets/Rdatasets/', name='Rdatasets/', size=0, modificationTime=0),
FileInfo(path='dbfs:/databricks-datasets/SPARK_README.md', name='SPARK_README.md', size=3359, modificationTime=1455043490000),
FileInfo(path='dbfs:/databricks-datasets/adult/', name='adult/', size=0, modificationTime=0), ...

Writing PySpark DataFrame into DBFS

By default, PySpark DataFrames are written to and read from DBFS. For instance, suppose we write a PySpark DataFrame as a CSV file like so:

df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df.write.csv("my_data.csv", header=True)

Our CSV file can be found in the DBFS root folder:

%fs ls
path name size modificationTime
1 dbfs:/FileStore/ FileStore/ 0 1688792239000
2 dbfs:/Volume/ Volume/ 0 0
3 dbfs:/Volumes/ Volumes/ 0 0
4 dbfs:/databricks-datasets/ databricks-datasets/ 0 0
5 dbfs:/databricks-results/ databricks-results/ 0 0
6 dbfs:/my_data.csv/ my_data.csv/ 0 1689247799000

Here, my_data.csv has a size of 0 because it is a directory - the actual CSV part files containing our data live inside this directory.
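
To confirm this, we can list the contents of the my_data.csv directory itself - it should contain one or more part-* files holding the actual rows, together with marker files such as _SUCCESS (the exact file names will vary):

%fs ls /my_data.csv/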

Reading PySpark DataFrame from DBFS

To read our my_data.csv from DBFS as a PySpark DataFrame:

df = spark.read.format("csv") \
.option("header", True) \
.option("inferSchema", "true") \
.load("/my_data.csv")
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+

Moving files between driver node and DBFS

Copying files from driver node to DBFS

Let's first write a text file to the driver node like so:

with open("sample.txt", "w") as file:
file.write("Hello world!")

To copy this file from the driver node to DBFS:

dbutils.fs.cp("file:///databricks/driver/sample.txt", "dbfs:/my_sample.txt")

To confirm that our text file now exists in DBFS:

%fs ls
path name size modificationTime
1 dbfs:/FileStore/ FileStore/ 0 0
2 dbfs:/Volumes/ Volumes/ 0 0
3 dbfs:/databricks-datasets/ databricks-datasets/ 0 0
4 dbfs:/databricks-results/ databricks-results/ 0 0
5 dbfs:/my_sample.txt my_sample.txt 12 1688792423000

Copying files from DBFS to driver node

Let's start by uploading a text file to DBFS. Navigate to a notebook, and click on the Upload data to DBFS button:

By default, all files uploaded this way are stored under /FileStore in DBFS. I will upload a text file called hello_world.txt like so:

Clicking on the Next button gives:

For reference later, the File API format is:

/dbfs/FileStore/hello_world.txt
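
As an aside, if you prefer to skip the UI, a roughly equivalent way to create this file is dbutils.fs.put, which writes a small string directly to a DBFS path (a sketch - the contents here are only illustrative):

# Write a small text file straight to DBFS (overwriting any existing file)
dbutils.fs.put("dbfs:/FileStore/hello_world.txt", "Hello world", overwrite=True)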

Let's now check that the text file exists in DBFS:

%fs ls dbfs:/FileStore/
path name size modificationTime
1 dbfs:/FileStore/hello_world.txt hello_world.txt 12 1688831344000

Let's now use the dbutils.fs.cp() method to copy the text file from DBFS to the driver node:

dbutils.fs.cp("dbfs:/FileStore/hello_world.txt", "file:///databricks/driver/hello_world.txt")
True

Let's check that the file is now available in the file system of the driver node:

%sh ls
azure
...
hello_world.txt
...

Finally, we can read the content of the text file in our notebook:

with open("hello_world.txt", "r") as file:
    content = file.read()
print(content)
Hello world

Accessing files in DBFS directly in the driver node file system

In the previous sections, we copied files between the driver node and DBFS. However, DBFS is also mounted on the driver node under the /dbfs directory:

%sh ls /dbfs
FileStore
Volumes
databricks
databricks-datasets
databricks-results
user

Here, the directories in the output are what we see when we run %fs ls.

Therefore, instead of copying the text file from DBFS to the driver node, we can directly access the text file at /dbfs/FileStore/hello_world.txt - this is exactly the File API format of the text file that we noted earlier:

with open("/dbfs/FileStore/hello_world.txt", "r") as file:
    content = file.read()
print(content)
Hello world
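
Note that the /dbfs/... File API path is meant for local tools such as Python's open(), whereas PySpark expects the dbfs:/ (or plain /) form of the same path. For example, the same text file could be read into a DataFrame like so (a minimal sketch):

# Read the text file with Spark - note the dbfs:/ path rather than /dbfs/...
df_text = spark.read.text("dbfs:/FileStore/hello_world.txt")
df_text.show()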

Workspace files

Workspace files are files that you create and store in your Databricks workspace. They can be notebooks, Python scripts, SQL scripts, text files, or any other type of file you want to work with in Databricks.

These files are stored within the workspace, which provides a hierarchical organization structure to manage your files and folders. You can create, edit, and delete workspace files directly within the Databricks workspace UI or through the Databricks CLI or APIs.

Uploading workspace files using Databricks UI

As an example, let's upload a text file called hello_world.txt via the Databricks UI into our dedicated folder below:

/Workspace/Users/support@skytowner.com

In the Workspace tab, right-click on our dedicated workspace directory and click on Import like so:

Behind the scenes, Databricks will upload this file into the driver node's file system instead of DBFS.

After uploading our hello_world.txt file, we should have the following files in our account workspace:

Here, we will use our Demo notebook to write some Python code that reads the hello_world.txt file.

Accessing workspace files programmatically

In our Demo notebook, we can check that our new file exists like so:

%sh ls /Workspace/Users/support@skytowner.com
Demo
hello_world.txt

Here, we are using %sh instead of %fs because the workspace files are located in the driver node's file system.

We can also programmatically access the text file like so:

with open("/Workspace/Users/support@skytowner.com/hello_world.txt", "r") as file:
    content = file.read()
print(content)
Hello world

Mounting object storage to DBFS

Mounting Azure blob storage

Let's mount our Azure Blob Storage to DBFS so that we can access the storage directly through DBFS. To do so, use the dbutils.fs.mount() method:

storage_account = "demostorageskytowner"
container = "democontainer"
access_key = "*****"

dbutils.fs.mount(
source = f"wasbs://{container}@{storage_account}.blob.core.windows.net",
mount_point = "/mnt/my_demo_storage",
extra_configs = {
f"fs.azure.account.key.{storage_account}.blob.core.windows.net": access_key
}
)
True

Here, the access_key can be obtained in the Azure portal:
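
Note that hard-coding the access key in a notebook is risky. If a Databricks secret scope has been set up, the key can be fetched at runtime instead (a sketch assuming a hypothetical scope named my_scope containing a secret named storage-access-key):

# Hypothetical secret scope and key names - replace with your own
access_key = dbutils.secrets.get(scope="my_scope", key="storage-access-key")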

We can now access all the files in our blob storage using DBFS like so:

%fs ls /mnt/
path name size modificationTime
1 dbfs:/mnt/my_demo_storage/ my_demo_storage/ 0 0

Let's have a look at what's inside our blob storage:

%fs ls /mnt/my_demo_storage/
path name size modificationTime
1 dbfs:/mnt/my_demo_storage/hello_world.txt hello_world.txt 12 1689250809000

Reading files from mounted storage

Recall that DBFS is mounted on the driver node under /dbfs, which means we can directly access the files in our blob storage like so:

with open("/dbfs/mnt/my_demo_storage/hello_world.txt", "r") as file:
    content = file.read()
print(content)
Hello world

Writing files to mounted storage

Let's now write a text file to the mounted storage:

with open("/dbfs/mnt/my_demo_storage/bye_world.txt", "w") as file:
file.write("Bye world!")

Let's check that the new file has been written into the mounted storage:

%fs ls /mnt/my_demo_storage/
path name size modificationTime
1 dbfs:/mnt/my_demo_storage/bye_world.txt bye_world.txt 10 1689253301000
2 dbfs:/mnt/my_demo_storage/hello_world.txt hello_world.txt 12 1689250809000

We can also check the Azure blob UI dashboard to see that this text file is present:
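
Since the mount behaves like any other DBFS path, we can also write PySpark DataFrames straight into the blob storage (a small sketch reusing the df DataFrame from earlier; the output directory name is arbitrary):

# my_data_csv is an arbitrary output directory name inside the mount
df.write.mode("overwrite").csv("/mnt/my_demo_storage/my_data_csv", header=True)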

Unmounting storage

To unmount the storage:

dbutils.fs.unmount("/mnt/my_demo_storage")
/mnt/my_demo_storage has been unmounted.
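
To double-check which storage is currently mounted - for example, to confirm that the unmount went through - we can list all mount points (a quick sketch):

# List every mount point in this workspace
display(dbutils.fs.mounts())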