Guide on Logging Asset Metadata in Dagster

Last updated: Aug 10, 2023

Logging asset metadata

To demonstrate, let's set up the same Dagster environment as we did in our Getting Started with Dagster guide. Our project structure is like so:

my_dagster_code_location
├── __init__.py
└── my_assets.py

The __init__.py file contains:

from dagster import Definitions, load_assets_from_modules
from . import my_assets

all_assets = load_assets_from_modules([my_assets])
defs = Definitions(assets=all_assets)

Copy and paste the following code into my_assets.py:

from dagster import asset, MetadataValue, Output
import pandas as pd

@asset(name="iris_data")
def get_iris_data():
    df = pd.read_csv("https://raw.githubusercontent.com/SkyTowner/sample_data/main/iris_data.csv")
    return df

@asset(name="setosa")
def get_setosa(iris_data):
    df_setosa = iris_data.query("species == 0")
    return Output(
        value=df_setosa,
        metadata={
            "n_rows": len(df_setosa),
            "preview": MetadataValue.md(df_setosa.head().to_markdown()),
        }
    )

Here, note the following:

- Instead of returning df_setosa directly, we wrap the return value in Dagster's Output object. This allows us to log metadata (n_rows and preview in our case) alongside the value.
- Dagster supports metadata in markdown format via MetadataValue.md(...). We use the Pandas DataFrame's to_markdown() method to convert the DataFrame into a markdown string.

Let's now launch our Dagster UI like so:

dagster dev -m my_dagster_code_location

2023-07-15 13:22:52 +0800 - dagit - INFO - Serving dagit on http://127.0.0.1:3000 in process 49252

In the Dagster UI, materialize the setosa asset. Click on setosa and we will see the metadata that we logged earlier:

Great, we see our markdown parsed elegantly as a table!

NOTE

The other way of logging asset metadata is to call context.add_output_metadata(...) in a custom IO manager. This approach is explored in a separate section. The advantage of using an IO manager is that we do not need to wrap the output of our function in Dagster's Output(...). This is useful because changing what a function returns makes unit tests harder to write.

Visualizing changes in metadata over time

Using the Dagster UI, we can visualize the changes in the value of metadata over time. Suppose we materialized the setosa asset multiple times, each time adding a random integer to the n_rows metadata value.

We can view the metadata plots by clicking on the asset in the graph like so:

As of now, Dagster does not support changing the scale of the Timestamp axis, which is set to days by default. Technically, there are several points plotted in the above time series graph, but because the interval between each run is too short, we see more or less a vertical line.

We can also access a bigger plot by navigating to the Assets tab in the header and clicking on our setosa asset. Next, we click on the Plots tab to see the same time-series graph:

Again, we cannot change the x-axis scale here for now 😟.

Logging images

Since Dagster allows logging in markdown format, we can also log images! Copy and paste the following code into my_assets.py:

from dagster import asset, MetadataValue, Output
import matplotlib.pyplot as plt
import pandas as pd

# for handling image
import base64
from io import BytesIO

@asset(name="iris_data")
def get_iris_data():
    df = pd.read_csv("https://raw.githubusercontent.com/SkyTowner/sample_data/main/iris_data.csv")
    return df

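def get_img_as_md(df_setosa):
    # Draw the scatter plot on a new figure
    plt.figure(figsize=(10, 6))
    plt.title("Setosa sepal length vs petal length")
    plt.scatter(df_setosa["sepal_length"], df_setosa["petal_length"])
    # Write the figure to an in-memory buffer as PNG
    buffer = BytesIO()
    plt.savefig(buffer, format="png")
    plt.close()  # release the figure so repeated materializations do not pile up open figures
    # Encode the PNG bytes as base64 and embed them in a markdown image tag
    image_data = base64.b64encode(buffer.getvalue())
    return f"![img](data:image/png;base64,{image_data.decode()})"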

@asset(name="setosa")
def get_setosa(iris_data):
    df_setosa = iris_data.query("species == 0")
    return Output(
        value=df_setosa,
        metadata={
            "n_rows": len(df_setosa),
            "preview": MetadataValue.md(df_setosa.head().to_markdown()),
            "plot": MetadataValue.md(get_img_as_md(df_setosa))
        },
    )

Here, the get_img_as_md(...) function returns an image encoded as base64 and embedded in a markdown image tag. More specifically, the image is first generated using Matplotlib and then written to an in-memory buffer created with BytesIO(). We then encode the buffer's contents as base64 and return the result as a markdown string.
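The encoding step can be looked at in isolation. The following sketch uses only the standard library; the helper names bytes_to_md_image and md_image_to_bytes are hypothetical, and the eight-byte PNG signature stands in for real image data so that no plotting library is needed.

```python
import base64


def bytes_to_md_image(image_bytes, mime="image/png", alt="img"):
    # Base64-encode the raw bytes and embed them in a markdown image tag as a data URI
    encoded = base64.b64encode(image_bytes).decode()
    return f"![{alt}](data:{mime};base64,{encoded})"


def md_image_to_bytes(md):
    # Reverse step: take the payload after ";base64," and drop the closing parenthesis
    payload = md.split(";base64,", 1)[1].rstrip(")")
    return base64.b64decode(payload)


# The fixed signature that starts every PNG file, used here as stand-in data
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

md = bytes_to_md_image(PNG_SIGNATURE)
print(md)
print(md_image_to_bytes(md) == PNG_SIGNATURE)
```

Decoding the payload returns the original bytes, which is why the UI can render the picture directly from the markdown string.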

Back in our Dagster UI, materialize the assets once more. Click on the setosa asset to see:

Click on Show Markdown in the plot field to see our image:

Great, we managed to log a scatter plot image as metadata!

Adding description and metadata to assets

Besides the name property, we can supply other properties such as description and metadata like so:

from dagster import Definitions, asset
import pandas as pd

@asset(name="iris_data", description="My description", metadata={"key1": "val1", "key2": "val2"})
def get_iris_data():
    return pd.read_csv("https://raw.githubusercontent.com/SkyTowner/sample_data/main/iris_data.csv")

The description of the assets will be displayed in multiple places. Firstly, it will be displayed in the data lineage:

It will also be displayed in the assets catalog, which can be found by clicking on the Assets header:

The description as well as the asset metadata will be displayed in the data lineage screen when clicking on the asset:

Note that the metadata property here is not intended to describe the content of the asset (e.g. the number of rows of the outputted DataFrame), but rather the nature of the asset (e.g. the name of the person who wrote the code).

NOTE

The description property is parsed as markdown. For instance, consider the following:

@asset(name="iris_data", description="My **description**")
def get_iris_data():
    return pd.read_csv("https://raw.githubusercontent.com/SkyTowner/sample_data/main/iris_data.csv")

This will be rendered in the UI like so:

Notice how the description in the graph is parsed incorrectly although the description in the right panel is parsed correctly. One quick fix is to write in plain text for the first line, then switch to markdown in the subsequent lines:

@asset(name="iris_data", description="My description\n\nI am a **bold text**")
def get_iris_data():
    return pd.read_csv("https://raw.githubusercontent.com/SkyTowner/sample_data/main/iris_data.csv")

This will render the following:

Published by Isshin Inada
