Data stored in Cloud Storage

This page describes what data Cloud Composer stores for your environment in Cloud Storage.

When you create an environment, Cloud Composer creates a Cloud Storage bucket and associates the bucket with your environment. The name of the bucket is based on the environment region, the environment name, and a random ID. An example of a bucket name is us-central1-b1-6efabcde-bucket.

Cloud Composer synchronizes specific folders in your environment's bucket to Airflow components that run in your environment. For example, when you update a file with the code of your Airflow DAG in the environment's bucket, Airflow components also receive the updated version. Cloud Composer uses Cloud Storage FUSE for synchronization.

Folders in the Cloud Storage bucket

| Folder | Storage path | Mapped directory | Description |
| --- | --- | --- | --- |
| DAG | gs://bucket-name/dags | /home/airflow/gcs/dags | Stores the DAGs for your environment. |
| Plugins | gs://bucket-name/plugins | /home/airflow/gcs/plugins | Stores your custom plugins, such as custom in-house Airflow operators, hooks, sensors, or interfaces. |
| Data | gs://bucket-name/data | /home/airflow/gcs/data | Stores the data that tasks produce and use. |
| Logs | gs://bucket-name/logs | /home/airflow/gcs/logs | Stores the Airflow logs for tasks. Logs are also available in the Airflow web interface and on the Logs tab in the Google Cloud console. |
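The mapping in the table above can be sketched as a small helper that translates a bucket object URI into the path an Airflow component would see. This is only an illustration: the function name to_mapped_path is hypothetical, and the mapped root /home/airflow/gcs and the four folder names are taken from the table.

```python
from pathlib import PurePosixPath

# Mapped root and folder names as listed in the table above.
MAPPED_ROOT = PurePosixPath("/home/airflow/gcs")
SYNCED_FOLDERS = ("dags", "plugins", "data", "logs")


def to_mapped_path(gcs_uri: str) -> str:
    """Translate gs://<bucket>/<folder>/<rest> to its mapped directory path.

    Hypothetical helper for illustration; not part of any Cloud Composer API.
    """
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"not a Cloud Storage URI: {gcs_uri!r}")
    # Drop the scheme, then split the bucket name from the object path.
    _bucket, _, object_path = gcs_uri[len("gs://"):].partition("/")
    parts = [p for p in object_path.split("/") if p]
    if not parts or parts[0] not in SYNCED_FOLDERS:
        raise ValueError(f"object is outside the synchronized folders: {gcs_uri!r}")
    return str(MAPPED_ROOT.joinpath(*parts))
```

For example, a task that needs an object stored under the bucket's data/ folder could pass the object URI to this helper and open the returned path.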

Cloud Composer synchronizes the dags/ and plugins/ folders unidirectionally. Unidirectional synchronization means that local changes in these folders on an Airflow component are overwritten. The data/ and logs/ folders are synchronized bidirectionally.

Data synchronization is eventually consistent. To send messages from one operator to another, use XComs.
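As a rough sketch of the XCom approach, the two task callables below exchange a small value through XCom instead of relying on a file appearing in a synchronized folder. The function names, the task ID produce_row_count, and the key row_count are hypothetical. The sketch assumes Airflow 2, where a PythonOperator passes the task instance as ti to callables that declare that parameter.

```python
# `ti` is the Airflow TaskInstance passed in from the task context.
# Names, the task ID, and the XCom key are illustrative assumptions.

def produce_row_count(ti):
    """Upstream task: publish a small value for downstream tasks."""
    row_count = 42  # placeholder for a value the task computes
    ti.xcom_push(key="row_count", value=row_count)


def consume_row_count(ti):
    """Downstream task: read the value published by the upstream task."""
    row_count = ti.xcom_pull(task_ids="produce_row_count", key="row_count")
    print(f"upstream task reported {row_count} rows")
    return row_count
```

In a DAG, each callable would be wrapped in its own PythonOperator, with the task ID of the first operator matching the task_ids value used in xcom_pull. XComs are intended for small messages, so larger data sets still belong in the data/ folder or another storage system.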

Capacity considerations

Data from the dags/, plugins/, and data/ folders is synchronized to Airflow schedulers and workers.

In Airflow 2, the content of the plugins/ folder is also synchronized to the Airflow web server. In Airflow 1, the content of the dags/ and plugins/ folders is synchronized to the Airflow web server only if DAG serialization is turned off. Otherwise, the synchronization is not performed.

The more data you put into these folders, the more local storage the Airflow components use. Saving too much data in dags/ and plugins/ can disrupt your operations and lead to issues such as:

  • A worker or a scheduler runs out of space on its local disk and is evicted.

  • Synchronization of files from dags/ and plugins/ folders to workers and schedulers takes a long time.

  • Synchronizing files from dags/ and plugins/ folders to workers and schedulers becomes impossible. For example, if you store a 2 GB file in the dags/ folder but the local disk of an Airflow worker can hold only 1 GB, the worker runs out of local storage during synchronization and the synchronization can't be completed.
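The capacity problem described above can be checked ahead of time with a sketch like the following, which sums the file sizes in local copies of the folders and compares the total to a disk budget. The function names and the capacity value are assumptions for illustration. The actual local storage available to a component depends on your environment configuration.

```python
import os


def folder_size_bytes(folder: str) -> int:
    """Sum the sizes of all regular files under `folder`."""
    total = 0
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total


def fits_in_local_storage(folders, capacity_bytes: int) -> bool:
    """Return True if the combined folder contents fit within capacity_bytes.

    `capacity_bytes` is a budget that you choose; it is not read from
    Cloud Composer.
    """
    return sum(folder_size_bytes(f) for f in folders) <= capacity_bytes
```

With the numbers from the example above, a dags/ folder containing a 2 GB file fails the check against a 1 GB budget, which flags the problem before the file is uploaded to the bucket.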

DAGs and plugins folders

To avoid DAG run failures, store your DAGs, plugins, and Python modules in the dags/ or plugins/ folders, even if your Python modules don't contain DAGs or plugins.

For example, suppose that you use a DataFlowPythonOperator whose py_file parameter references a Dataflow pipeline file. That file doesn't contain DAGs or plugins, but you must still store it in the dags/ or plugins/ folder.

Data folder

In some scenarios, certain files from the data/ folder are synchronized to a specific Airflow component. For example, this happens when Cloud Composer attempts to read a given file for the first time during:

  • DAG parsing: When a file is read for the first time during DAG parsing, Cloud Composer synchronizes it to the scheduler that parses the DAG.

  • DAG execution: When a file is read for the first time during DAG execution, Cloud Composer synchronizes it to the worker running the execution.

Airflow components have limited local storage, so consider deleting downloaded files to free disk space in your components. Note that local storage usage can also go up temporarily if concurrent tasks download the same file to a single Airflow worker.

Logs folder

The logs/ folder is synchronized from Airflow workers to the environment's bucket using the Cloud Storage API.

Cloud Storage API quota is calculated by the amount of data moved, so the number of Airflow tasks that you run affects your Cloud Storage API usage: the more tasks you run, the larger your log files.

Synchronization with the web server

Airflow 2 uses DAG serialization out of the box. The plugins/ folder is automatically synchronized to the web server so that plugins can be loaded by the Airflow UI. You can't turn off DAG serialization in Airflow 2.

In Airflow 1, DAG serialization is supported and is turned on by default in Cloud Composer.

  • When DAG serialization is turned on, the files from the dags/ and plugins/ folders aren't synchronized to the web server.
  • When DAG serialization is turned off, the files from the dags/ and plugins/ folders are synchronized to the web server.
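The rules in this section can be summarized as a small decision function. This is only a sketch of the behavior described above. The function name and its parameters are hypothetical and don't correspond to any Cloud Composer or Airflow API.

```python
def folders_synced_to_web_server(airflow_major: int, dag_serialization: bool = True) -> set:
    """Return the bucket folders that are synchronized to the Airflow web server.

    Encodes only the rules stated on this page for Airflow 1 and Airflow 2.
    """
    if airflow_major == 2:
        # DAG serialization can't be turned off in Airflow 2, so the flag is
        # ignored; only plugins/ is synchronized.
        return {"plugins"}
    if airflow_major == 1:
        # With serialization on, nothing is synchronized; with it off,
        # both dags/ and plugins/ are synchronized.
        return set() if dag_serialization else {"dags", "plugins"}
    raise ValueError(f"unsupported Airflow major version: {airflow_major}")
```

For example, folders_synced_to_web_server(1, dag_serialization=False) returns both dags and plugins, while the Airflow 2 case returns only plugins regardless of the flag.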
