Troubleshooting DAG Processor issues

Cloud Composer 3 | Cloud Composer 2 | Cloud Composer 1

This page covers only issues related to DAG file processing. For issues with scheduling tasks, see Troubleshooting Airflow scheduler issues.

Troubleshooting workflow

Inspecting DAG Processor logs

If you have complex DAGs, the DAG processor might not parse all of them, which can lead to issues with the following symptoms.

Symptoms:

  • If the DAG processor encounters problems when parsing your DAGs, it might lead to a combination of the issues listed here. If DAGs are generated dynamically, these issues might have a greater impact than with static DAGs.

  • DAGs are not visible in the Airflow UI or DAG UI.

  • DAGs are not scheduled for execution.

  • There are errors in the DAG processor logs, for example:

    dag-processor-manager [2023-04-21 21:10:44,510] {manager.py:1144} ERROR -
    Processor for /home/airflow/gcs/dags/dag-example.py with PID 68311 started
    at 2023-04-21T21:09:53.772793+00:00 has timed out, killing it.
    

    or

    dag-processor-manager [2023-04-26 06:18:34,860] {manager.py:948} ERROR -
    Processor for /home/airflow/gcs/dags/dag-example.py exited with return
    code 1.
    
  • DAG processors experience issues that lead to restarts.

  • Airflow tasks that are scheduled for execution are cancelled, and DAG runs for DAGs that failed to be parsed might be marked as failed. For example:

    airflow-scheduler Failed to get task '<TaskInstance: dag-example.task1--1
    manual__2023-04-17T10:02:03.137439+00:00 [removed]>' for dag
    'dag-example'. Marking it as removed.
    

Solution:

  • Increase parameters related to DAG parsing, such as [core]dagbag_import_timeout and [core]dag_file_processor_timeout (see the example after this list).

  • Correct or remove DAGs that cause problems for the DAG processor.
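
To increase the parsing timeouts mentioned in the first solution, you can override the corresponding Airflow configuration options with the gcloud CLI, as in the following sketch. The timeout values of 120 and 180 seconds are only illustrative assumptions; pick values that match how long your DAGs actually take to parse.

# Illustrative values; adjust the timeouts to your DAGs' actual parse times.
gcloud composer environments update ENVIRONMENT_NAME \
    --location LOCATION \
    --update-airflow-configs=core-dagbag_import_timeout=120,core-dag_file_processor_timeout=180

Replace ENVIRONMENT_NAME with the name of the environment and LOCATION with the region where the environment is located.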

Inspecting DAG parse times

To verify if the issue happens at DAG parse time, follow these steps.

Console

In the Google Cloud console, you can use the Monitoring page and the Logs tab to inspect DAG parse times.

Inspect DAG parse times with the Cloud Composer Monitoring page:

  1. In Google Cloud console, go to the Environments page.

    [Go to Environments][console-list-env]

  2. In the list of environments, click the name of your environment. The Monitoring page opens.

  3. In the Monitoring tab, review the Total parse time for all DAG files chart in the DAG runs section and identify possible issues.

    The DAG runs section in the Composer Monitoring tab shows health metrics for the DAGs in your environment

Inspect DAG parse times with the Cloud Composer Logs tab:

  1. In Google Cloud console, go to the Environments page.

    [Go to Environments][console-list-env]

  2. In the list of environments, click the name of your environment. The Monitoring page opens.

  3. Go to the Logs tab, and from the All logs navigation tree select the DAG processor manager section.

  4. Review dag-processor-manager logs and identify possible issues.

    The DAG processor logs will show DAG parsing times

gcloud

Use the dags report command to see the parse time for all your DAGs.

gcloud composer environments run ENVIRONMENT_NAME \
    --location LOCATION \
    dags report

Replace:

  • ENVIRONMENT_NAME with the name of the environment.
  • LOCATION with the region where the environment is located.

The output of the command looks similar to the following:

Executing within the following Kubernetes cluster namespace: composer-2-0-31-airflow-2-3-3
file                  | duration       | dag_num | task_num | dags
======================+================+=========+==========+===================
/manydagsbig.py       | 0:00:00.038334 | 2       | 10       | serial-0,serial-0
/airflow_monitoring.py| 0:00:00.001620 | 1       | 1        | airflow_monitoring

Look at the duration value for each of the DAGs listed in the table to identify which DAGs have a long parsing time. A large value might indicate that one of your DAGs is not implemented in an optimal way.

Troubleshooting issues at DAG parse time

The following sections describe symptoms and potential fixes for some common issues at DAG parse time.

Limited number of threads

Allowing the DAG processor manager to use only a limited number of threads might impact your DAG parse time.

To solve the issue, override the following Airflow configuration option:

  • Override the parsing_processes parameter:

    Section   | Key               | Value
    ----------|-------------------|------------------------------------
    scheduler | parsing_processes | NUMBER_OF_CPUS_IN_DAG_PROCESSOR - 1

    Replace NUMBER_OF_CPUS_IN_DAG_PROCESSOR with the number of CPUs in the DAG processor.
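
For example, in a hypothetical environment where the DAG processor has 4 vCPUs, you would set parsing_processes to 3. The following sketch applies this override with the gcloud CLI; the CPU count is an assumption, so check your environment's actual DAG processor resources first.

# Assumes a DAG processor with 4 vCPUs: parsing_processes = 4 - 1 = 3.
gcloud composer environments update ENVIRONMENT_NAME \
    --location LOCATION \
    --update-airflow-configs=scheduler-parsing_processes=3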

Make the DAG processor ignore unnecessary files

You can improve the performance of the DAG processor by skipping unnecessary files in the DAGs folder. The DAG processor ignores files and folders specified in the .airflowignore file.

To make the DAG processor ignore unnecessary files:

  1. Create an .airflowignore file.
  2. In this file, list files and folders that should be ignored.
  3. Upload this file to the /dags folder in your environment's bucket.

For more information about the .airflowignore file format, see Airflow documentation.
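
As an example, the following sketch creates an .airflowignore file and uploads it to the /dags folder of the environment's bucket. The listed names and the bucket are hypothetical; in Airflow 2, each entry is treated as a regular expression matched against paths relative to the DAGs folder.

# The entries below are hypothetical examples of non-DAG content to skip.
cat > .airflowignore <<'EOF'
helper_scripts/
data_files/
EOF

# Upload the file to the /dags folder in the environment's bucket.
gcloud storage cp .airflowignore gs://BUCKET_NAME/dags/

Replace BUCKET_NAME with the name of your environment's bucket.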

Airflow processes paused DAGs

Airflow users pause DAGs to avoid their execution. This saves processing cycles for Airflow workers.

Airflow continues to parse paused DAGs. To improve the DAG processor's performance, use .airflowignore or delete paused DAGs from the DAGs folder.
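
For example, if a paused DAG is no longer needed, you can delete its file from the /dags folder so that the DAG processor stops re-parsing it. The file and bucket names here are hypothetical:

# Remove an unneeded, paused DAG file so it is no longer parsed.
gcloud storage rm gs://BUCKET_NAME/dags/old_paused_dag.py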

Common issues

The following sections describe symptoms and potential fixes for some common parsing issues.

DAG load import timeout

Symptom:

  • In the Airflow web interface, at the top of the DAGs list page, a red alert box shows Broken DAG: [/path/to/dagfile] Timeout.
  • In Cloud Monitoring, the airflow-scheduler logs contain entries similar to:

    • ERROR - Process timed out
    • ERROR - Failed to import: /path/to/dagfile
    • AirflowTaskTimeout: Timeout

Fix:

Override the dag_file_processor_timeout Airflow configuration option and allow more time for DAG parsing:

Section | Key                        | Value
--------|----------------------------|------------------
core    | dag_file_processor_timeout | New timeout value

A DAG is not visible in Airflow UI or DAG UI and the scheduler does not schedule it

The DAG processor parses each DAG before it can be scheduled by the scheduler and before a DAG becomes visible in the Airflow UI or DAG UI.

The following Airflow configuration options define timeouts for parsing DAGs: [core]dagbag_import_timeout and [core]dag_file_processor_timeout.

If a DAG is not visible in the Airflow UI or DAG UI:

  • Check the DAG processor logs to see if the DAG processor processes your DAG correctly. In case of problems, you might see the following log entries in the DAG processor or scheduler logs:

    [2020-12-03 03:06:45,672] {dag_processing.py:1334} ERROR - Processor for
    /usr/local/airflow/dags/example_dag.py with PID 21903 started at
    2020-12-03T03:05:55.442709+00:00 has timed out, killing it.
    
  • Check scheduler logs to see if the scheduler works correctly. In case of problems, you might see the following log entries in scheduler logs:

    DagFileProcessorManager (PID=732) last sent a heartbeat 240.09 seconds ago! Restarting it
    Process timed out, PID: 68496
    

Solutions:

  • Fix all DAG parsing errors. The DAG processor parses multiple DAGs, and in rare cases parsing errors of one DAG can negatively impact the parsing of other DAGs.

  • If parsing your DAG takes longer than the number of seconds defined in [core]dagbag_import_timeout, then increase this timeout.

  • If parsing all your DAGs takes longer than the number of seconds defined in [core]dag_file_processor_timeout, then increase this timeout.

  • If your DAG takes a long time to parse, it can also mean that it is not implemented in an optimal way: for example, it might read many environment variables or perform calls to external services or the Airflow database. To the extent possible, avoid performing such operations in the global sections of DAG code.

  • Increase CPU and memory resources for the DAG processor so it can work faster.

  • Lower the frequency of DAG parsing.

  • Lower the load on the Airflow database.

What's next