Guide on DAGSTER_HOME and dagster.yaml
Types of data in Dagster
Broadly speaking, there are two types of data in Dagster:
- run data: describes how the pipeline executes (e.g. execution time, the involved assets, event logs). The Dagster UI is the visual interface to the run data.
- materialized assets: the output data of our pipeline (e.g. a CSV file).
How run-related data is stored in Dagster
When we launch the Dagster UI, Dagster will look for the environment variable DAGSTER_HOME, which is an absolute path pointing to the directory containing our dagster.yaml file. Using the dagster.yaml file, we can configure the behavior of Dagster, such as specifying where to store run data. As for the storage location of materialized assets: if DAGSTER_HOME is defined, then they will be stored under DAGSTER_HOME by default.
We typically use Dagster's IO managers to define the storage location of materialized assets because they're more flexible. For instance, using IO managers, we can store our materialized assets in a local directory or in blob storage (e.g. AWS S3), in any format (e.g. csv, pickle, parquet). The storage location of run data, on the other hand, is defined using dagster.yaml.
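For instance, here's a minimal sketch of pointing the default IO manager at a custom local directory using FilesystemIOManager (the my_assets directory and the asset body are hypothetical, for illustration only):

from dagster import Definitions, FilesystemIOManager, asset
import pandas as pd

@asset
def iris_data():
    # Illustrative body: any DataFrame-returning logic works here.
    return pd.DataFrame({"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]})

defs = Definitions(
    assets=[iris_data],
    # Pickle asset outputs under ./my_assets instead of the default location.
    resources={"io_manager": FilesystemIOManager(base_dir="my_assets")},
)

Swapping FilesystemIOManager for, say, an S3-backed IO manager changes where the assets land without touching the asset code itself.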
Dagster instance
The dagster.yaml file is used to load up what's known as the Dagster instance. The naming here is slightly misleading because the Dagster instance is not a process or a long-running daemon like the dagster daemon or dagster-webserver. Instead, we can think of the Dagster instance as the root configuration file that contains all the parameters that the Dagster system (including the dagster daemon and dagster-webserver) will use.
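To make this concrete, here's a small hedged sketch showing that the instance is something we load, not something we run (assumes DAGSTER_HOME is set; printing root_directory is just one convenient thing to do with it):

from dagster import DagsterInstance

# Load the instance described by $DAGSTER_HOME/dagster.yaml.
instance = DagsterInstance.get()
# The instance is plain configuration - e.g. it knows where it lives.
print(instance.root_directory)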
Here's a diagram showing what the Dagster instance is:
Here, we're setting the environment variable DAGSTER_HOME, which is an absolute path pointing to a directory (my_dagster_home) containing the dagster.yaml file. Within dagster.yaml, we can specify basic configurations such as where we wish to store run data (via the storage property).
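For instance, one way to set this variable for the current shell session before launching Dagster (the path below is illustrative):

export DAGSTER_HOME=/absolute/path/to/my_dagster_home

Later in this guide, we'll set the same variable via a .env file instead.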
Case when DAGSTER_HOME is not defined
If DAGSTER_HOME is not defined, then Dagster will output all our materialized assets and run data (e.g. run ID, timestamps) in a temporary folder that is deleted when the Dagster server terminates. Let's demonstrate this. Suppose we have a main.py file like so:
from dagster import Definitions, asset
import pandas as pd

@asset(name="iris_data")
def get_iris_data():
    # Illustrative body: any DataFrame-returning logic works here.
    return pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")

defs = Definitions(assets=[get_iris_data])
Launch the Dagster UI using the following command:
dagster dev -f main.py
2023-07-29 23:47:46 +0800 - dagster - INFO - Using temporary directory /Users/isshininada/Desktop/dagster_demo/tmpckgig9g2 for storage. This will be removed when dagster dev exits.
2023-07-29 23:47:46 +0800 - dagster - INFO - To persist information across sessions, set the environment variable DAGSTER_HOME to a directory to use.
2023-07-29 23:47:48 +0800 - dagster-webserver - INFO - Serving dagster-webserver on http://127.0.0.1:3000 in process 82177
...
The output tells us that Dagster is using a temporary directory (tmpckgig9g2) to store our run data as well as the materialized assets.
On the Dagster UI, click on the Materialize button:
Once our asset is materialized, our temporary tmpckgig9g2 directory should look something like so:
tmpckgig9g2
├── schedules/
├── history
│   ├── runs
│   │   ├── 00848753-18b4-a9vn-abnfe.db
│   │   └── ...
│   └── runs.db
└── storage
    ├── 00848753-18b4-a9vn-abnfe
    │   └── compute_logs
    │       ├── jecaitjb.complete
    │       ├── jecaitjb.err
    │       └── jecaitjb.out
    ├── ...
    │   └── compute_logs
    │       ├── ....complete
    │       ├── ....err
    │       └── ....out
    └── iris_data
Note the following:
- the schedules/ directory contains information about any scheduling logic of our pipeline. We can ignore this folder for now since we don't use any scheduling.
- the history/ directory contains information about our runs.
- the history/runs directory contains SQLite files, each of which holds the run data for one particular run. The SQLite files are named after their run IDs (e.g. 00848753-18b4-a9vn-abnfe).
- the history/runs.db is an SQLite file storing basic information about all the runs (we'll peek inside it right after this list).
- the storage/ directory contains the compute logs (stdout and stderr) of each run as well as the materialized assets (iris_data in pickle format). This means that if we call print(-) in our data pipeline, the printed value will be stored here. Note that Dagster's own event logs (e.g. PIPELINE_STARTING) are stored under the history/runs directory.
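Since these are ordinary SQLite files, we can peek inside them directly. A minimal hedged sketch (the path is illustrative, and the run_id/status column names are an assumption based on the runs table we inspect in the PostgresDB section later):

import sqlite3

# Peek at the runs recorded by Dagster's SQLite run storage.
conn = sqlite3.connect("tmpckgig9g2/history/runs.db")  # illustrative path
for run_id, status in conn.execute("SELECT run_id, status FROM runs"):
    print(run_id, status)
conn.close()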
Again, this temporary folder will be deleted once we stop the Dagster server. This can be useful for quick testing but is not recommended for most cases since we usually want to persist all our assets and logs.
Case when DAGSTER_HOME is defined
Case when dagster.yaml is missing
If DAGSTER_HOME is defined, then Dagster will proceed to look for the configuration file at $DAGSTER_HOME/dagster.yaml. If this .yaml file does not exist, then Dagster will persist our run data and materialized assets under DAGSTER_HOME (instead of a temporary folder) like so:
$DAGSTER_HOME
├── schedules/
├── history/
└── storage/
Note that unlike the case when DAGSTER_HOME is not defined, the run data and our assets will be persisted even after terminating the Dagster server.
Let's now demonstrate this. Suppose our project structure is like so:
.env
my_dagster_home/
main.py
where my_dagster_home is an empty folder and main.py is the same as before:
from dagster import Definitions, asset
import pandas as pd

@asset(name="iris_data")
def get_iris_data():
    # Illustrative body: any DataFrame-returning logic works here.
    return pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")

defs = Definitions(assets=[get_iris_data])
Whenever we launch a Dagster process, it will look into our .env file to check whether the environment variable DAGSTER_HOME is defined. Let's update our .env file and define DAGSTER_HOME like so:
DAGSTER_HOME=/Users/isshininada/Desktop/dagster_demo/my_dagster_home
Here, the DAGSTER_HOME must be an absolute path rather than a relative path.
Let's now launch our Dagster UI:
dagster dev -f main.py
2023-07-30 12:10:05 +0800 - dagster - INFO - Loaded environment variables from .env file: DAGSTER_HOME
...
2023-07-30 12:10:07 +0800 - dagster-webserver - INFO - Serving dagster-webserver on http://127.0.0.1:3000 in process 84499
The output tells us that Dagster has found DAGSTER_HOME in our .env file!
Now, materialize the iris_data asset in the Dagster UI. We should then see our run data and materialized assets stored in our my_dagster_home folder:
my_dagster_home
├── schedules
│   └── ...
├── history
│   └── ...
└── storage
    ├── ...
    └── iris_data
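Because the default IO manager pickles asset outputs, we can sanity-check the persisted asset ourselves. A minimal sketch (the path is illustrative and assumes the default filesystem IO manager):

import pickle

# The default IO manager pickles each asset output under
# $DAGSTER_HOME/storage/<asset_key>.
with open("my_dagster_home/storage/iris_data", "rb") as f:
    iris_df = pickle.load(f)
print(iris_df.head())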
Case when DAGSTER_HOME and dagster.yaml are present
Finally, let's consider the case when DAGSTER_HOME is defined and the dagster.yaml file is present. Suppose our current file structure is:
.env
main.py
my_dagster_home
└── dagster.yaml
By default, if we have an empty dagster.yaml file, our run data (history/ and schedules/), stdout/stderr logs, and materialized assets (storage/) will be written inside the $DAGSTER_HOME/ folder, just as in the case when dagster.yaml is not present:
my_dagster_home
├── schedules/
├── history/
└── storage/
As discussed at the beginning, we can modify the dagster.yaml file to configure the behavior of Dagster. We will now demonstrate this.
Changing the location of run-data storage
Let's update our dagster.yaml file such that we write history/ and schedules/ inside a directory called my_runs_info instead:
storage:
  sqlite:
    base_dir: my_runs_info
Let's now launch our Dagster server:
dagster dev -f main.py
2023-07-30 12:25:41 +0800 - dagster-webserver - INFO - Loaded environment variables from .env file: DAGSTER_HOME
...
2023-07-30 12:25:42 +0800 - dagster-webserver - INFO - Serving dagster-webserver on http://127.0.0.1:3000 in process 84654
In our local repository, we should now see that Dagster has created a new directory called my_runs_info containing history/ and schedules/ like so:
.env
main.py
my_dagster_home
└── dagster.yaml
my_runs_info
├── history
│   ├── runs
│   │   └── index.db
│   └── runs.db
└── schedules
    └── schedules.db
The storage/ directory, which contains our stdout and stderr logs, is still stored within the my_dagster_home directory.
Dagster's naming convention is quite misleading here. Even though we specified storage in dagster.yaml, the storage folder, which contains the stdout and stderr logs and our materialized assets, will still be written under the $DAGSTER_HOME/ folder. Instead, specifying storage in dagster.yaml changes where the history/ and schedules/ directories are written to.
To change the location of the stdout and stderr logs, replace storage with compute_logs in our dagster.yaml. To change the location of the materialized assets, use IO managers.
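As a rough sketch, a compute_logs block pointing the local compute log manager at a different directory might look like so (my_compute_logs is a hypothetical directory name):

compute_logs:
  module: dagster.core.storage.local_compute_log_manager
  class: LocalComputeLogManager
  config:
    base_dir: my_compute_logs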
Since we have not yet materialized any assets, the .db files do not contain any run information. Let's head over to the Dagster UI and materialize our assets. We should then see the following:
...
my_runs_info
├── history
│   ├── runs
│   │   ├── 4c6ca96c-c34d-4e6c-ad74-e32be41178c1.db   # new run!
│   │   └── index.db
│   └── runs.db
└── schedules/
Again, the materialized assets and the stdout/stderr logs are still stored in the my_dagster_home/storage folder.
Using environment variables
Instead of hard-coding the base path (my_runs_info in this case) inside the dagster.yaml file, we can specify it as an environment variable. To demonstrate, let's update the .env file like so:
DAGSTER_HOME=/Users/isshininada/Desktop/dagster_demo/my_dagster_home
SQLITE_STORAGE_BASE_DIR=my_runs_info
In our dagster.yaml file, we can access the environment variable like so:
storage:
  sqlite:
    base_dir:
      env: SQLITE_STORAGE_BASE_DIR
Notice how we have to create a new field under base_dir called env to be able to access the environment variable.
Storing logs in PostgresDB in dagster.yaml
By default, run data is stored as an SQLite file locally. Dagster currently supports storing the run data in remote PostgreSQL and MySQL databases. Let's demonstrate this - go ahead and set up a PostgreSQL database using your favorite cloud service and fetch its credentials.
Suppose our dagster.yaml file is as follows:
storage:
  postgres:
    postgres_db:
      username: my_username
      password: my_password
      hostname: my_hostname
      db_name: my_db_name
      port: 5432
Since we have specified storage, all our run data and event logs (e.g. PIPELINE_STARTING) will be written to the PostgresDB. By specifying storage and compute_logs separately, we have the flexibility to store run data and stdout/stderr logs in different locations. For instance, we can store run data in PostgresDB but stream stdout/stderr logs to an AWS S3 bucket!
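As a hedged sketch of the S3 half of that setup, the dagster-aws package provides an S3 compute log manager that can be configured along these lines (the bucket and prefix values are hypothetical):

compute_logs:
  module: dagster_aws.s3.compute_log_manager
  class: S3ComputeLogManager
  config:
    bucket: my-dagster-compute-logs
    prefix: compute-logs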
To be able to store logs in PostgresDB, we must first install the dagster-postgres package.
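For example, with pip:

pip install dagster-postgres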
Starting our Dagster server for the first time with this new configuration file may take a while because Dagster has to create many tables in our PostgresDB. Once the Dagster server is finished launching, we should see many new tables in our database such as:
...
jobs
runs
run_tags
Since we haven't materialized our assets yet, almost all of these tables will be empty.
Let's now head over to the Dagster UI and materialize our assets. It should take longer than usual to materialize the assets since all the run data has to be written to the remote PostgresDB.
If we inspect our runs table, we should see a new row like so:
| run_id | status | ... | create_timestamp | update_timestamp |
| --- | --- | --- | --- | --- |
| 7d6329bb-2149-4318-80f0-da46fcde9807 | SUCCESS | ... | 2023-07-02 09:46:05.057909 | 2023-07-02 09:48:21.954296 |
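To fetch this row ourselves, a query along these lines should work against the run storage (a sketch; the column names are taken from the table above):

SELECT run_id, status, create_timestamp, update_timestamp
FROM runs
ORDER BY create_timestamp DESC;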
Note that we can see our deployment config in the Dagster UI as well.