Troubleshoot deployments and jobs

This page shows you how to resolve issues with BigQuery Engine for Apache Flink deployments and jobs.

Deployment errors

The following section contains common errors that you might encounter when working with deployments.

Deployment creation takes a long time

When you create the first deployment in a project or in a subnet, deployment creation takes 30 minutes or more.

This behavior is expected. The first time you create either a deployment or an on-demand job in a project or in a subnet, creation can take 30 minutes or more to complete. After that, creating a new deployment or job in the same project and subnet takes less time.

On-demand job doesn't have a deployment

When you create an on-demand job, no corresponding deployment appears in the list of deployments.

This behavior is expected, because on-demand jobs aren't associated with deployments.

The network resource... is already being used

After you delete your deployment, the following error occurs when you try to delete your VPC network:

ERROR: (gcloud.compute.networks.delete) Could not fetch resource:
 - The network resource 'projects/PROJECT_ID/global/networks/NETWORK' is already being used by 'projects/PROJECT_ID/global/firewalls/ID'

When you try to delete your subnet, the following error occurs:

ERROR: (gcloud.compute.networks.subnets.delete) Could not fetch resource:
 - The subnetwork resource 'projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET' is already being used by 'projects/PROJECT_ID/regions/REGION/networkAttachments/NETWORK_ATTACHMENT'

This error is a known issue. If possible, delete your project. Deleting a project also deletes all of the resources in the project, including the network and subnet.
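If deleting the project is an option, you can do so with the gcloud CLI. PROJECT_ID is a placeholder for your project ID:

```shell
# Deletes the project and all resources in it, including the VPC
# network and subnet. PROJECT_ID is a placeholder.
gcloud projects delete PROJECT_ID
```

The command prompts for confirmation before scheduling the project for deletion, and deleted projects can typically be restored for a limited time afterward.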

Job errors

The following section contains common errors that you might encounter when running jobs.

Combined slots usage ... exceeds the deployment's maximum limit

The following error occurs when you try to create a job:

INVALID_ARGUMENT: The request was invalid: failed to create a job, the combined slots usage from your autotuning configuration and existing jobs (NUMBER total) exceeds the deployment's maximum limit of NUMBER slots. To fix this, you can either adjust the parallelism (Fixed policy) or max_parallelism (Throughput Based policy) values in your autotuning settings to use fewer slots, consider reducing the slot usage of your existing jobs, or raise the limits.max_slots value in your deployment configuration to accommodate a higher number of slots.

This issue occurs because the job is requesting more task slots than are available in the deployment.

To resolve this issue, make one of the following changes:

  • Use fewer task slots in your job by setting the parallelism value (Fixed autotuning policy) or the max_parallelism value (Throughput Based autotuning policy) to a lower number.

  • Increase the number of task slots available in your deployment. For more information, see Update deployments.
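The check behind this error can be sketched as follows. The numbers are hypothetical; each parallel task instance occupies one task slot, so the slots requested by the new job plus the slots used by existing jobs must not exceed the deployment's limits.max_slots value:

```shell
# Hypothetical values: the deployment's limits.max_slots, the slots
# already used by existing jobs, and the new job's max_parallelism.
MAX_SLOTS=8
EXISTING_JOB_SLOTS=6
NEW_JOB_MAX_PARALLELISM=4

TOTAL=$((EXISTING_JOB_SLOTS + NEW_JOB_MAX_PARALLELISM))
if [ "$TOTAL" -gt "$MAX_SLOTS" ]; then
  echo "Combined slots usage ($TOTAL total) exceeds the limit of $MAX_SLOTS slots."
fi
```

With these numbers, the job requests 10 slots against a limit of 8, so creation fails.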

Job restarts with status Restarting

A job has the status Restarting if it encountered an error and is restarting from a checkpoint. A job might restart continually if the error isn't recoverable. To troubleshoot further, review the job logs and look for the cause of the error.

If the job continues to restart after you fix the issue, you might need to delete the restarting job and create a new job. For more information, see Create and manage jobs.

Job completes with status Initializing

When you run a batch job, the job completes, but the job status in the Google Cloud console is Initializing.

This behavior is a known issue. To see the correct job status, in the console, click the job name to open the Job details page. Opening the Job details page refreshes the job status.

Job graph URI is empty

The following error occurs when you try to create a job:

failed to create a job, job graph uri is empty

This error is usually triggered by other errors, such as an invalid argument name or problems in the job JAR file. To troubleshoot further, review the gcloud CLI output for errors that appear before this one.

Job JAR doesn't exist

The following error occurs when you try to create a job:

job jar doesn't exist

To resolve this issue, in your job creation command, use an absolute path to the job JAR file instead of a relative path.
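For example, on Linux or macOS you can resolve the absolute path before passing it to the create command. The relative path target/my-job.jar is a hypothetical example:

```shell
# Stand-in JAR so this sketch is self-contained; in practice the JAR
# already exists at a relative path such as target/my-job.jar.
mkdir -p target && touch target/my-job.jar

# Resolve the relative path to an absolute one, then pass JAR_PATH
# to the job creation command instead of the relative path.
JAR_PATH="$(realpath target/my-job.jar)"
echo "$JAR_PATH"
```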

Operation... has not finished in 1800 seconds

The following error occurs when you try to create an on-demand job:

ERROR: (gcloud.alpha.managed-flink.jobs.create) Operation CREATE_OPERATION has not finished in 1800 seconds. The operations may still be underway remotely and may still succeed; use gcloud list and describe commands or https://console.developers.google.com/ to check resource state.

This error occurs when the gcloud CLI times out while polling the long-running create operation. The error doesn't mean that the job creation operation failed; the operation might still be running remotely and might still succeed. When the job creation operation completes, you can view the job in the Google Cloud console. For more information, see List jobs.
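As the error message suggests, you can check the resource state yourself. Assuming the jobs surface supports the standard gcloud list verb mentioned in the error (the command group `gcloud alpha managed-flink jobs` is taken from the error message; the exact list invocation and any required location flag are assumptions), a check might look like:

```shell
# Check whether the job was created despite the polling timeout.
# This invocation is an assumption based on standard gcloud
# conventions; your region might also be required as a flag.
gcloud alpha managed-flink jobs list
```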

Permissions errors

The following section contains common permissions errors.

Email addresses and domains must be associated with an active Google Account

The following error occurs when you try to grant roles to the Managed Flink Default Workload Identity in the console:

Email addresses and domains must be associated with an active Google Account, Google Workspace account, or Cloud Identity account.

This error occurs when the Managed Flink Default Workload Identity has not yet been provisioned.

To resolve this issue, use the gcloud CLI to create a job or deployment. The first attempt fails with an error, but it triggers creation of the Managed Flink Default Workload Identity. After the Managed Flink Default Workload Identity is created, you can give it permissions by granting roles.
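After the workload identity exists, you can grant it roles with the standard IAM binding command. PROJECT_ID, PROJECT_NUMBER, and the role below are placeholders; the service-account address format follows the one shown in this page's error messages:

```shell
# Grant a role to the Managed Flink Default Workload Identity.
# PROJECT_ID, PROJECT_NUMBER, and the role are placeholders.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:gmf-PROJECT_NUMBER-default@gcp-sa-managedflink-wi.iam.gserviceaccount.com" \
    --role="roles/storage.objectViewer"
```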

For more information, see BigQuery Engine for Apache Flink security and permissions.

Failed to configure node service account IAM

The following error occurs the first time you try to use BigQuery Engine for Apache Flink:

Failed to configure node service account IAM: failed to set iam policy bindings for resource serviceAccount:gmf-node-sa@PROJECT_ID.iam.gserviceaccount.com in PROJECT_ID with error generic::invalid_argument: com.google.apps.framework.request.StatusException: <eye3 title='INVALID_ARGUMENT'/> generic::INVALID_ARGUMENT: <eye3 title='/Projects.SetIamPolicy, INVALID_ARGUMENT'/> APPLICATION_ERROR;google.cloudresourcemanager.v1/Projects.SetIamPolicy;com.google.apps.framework.request.StatusException: <eye3 title='INVALID_ARGUMENT'/> generic::INVALID_ARGUMENT: Exception calling IAM: Service account gmf-node-sa@PROJECT_ID.iam.gserviceaccount.com does not exist.

This error occurs the first time you use BigQuery Engine for Apache Flink or the first time you try to create a deployment in a project, because the Managed Flink Default Workload Identity needs to be created. After you see this error, the Managed Flink Default Workload Identity is created, and you can grant it permissions.

For more information, see Managed Flink Default Workload Identity.

FileNotFoundException... application_default_credentials.json

The following error occurs when you try to create a job by using the gcloud CLI:

java.lang.RuntimeException: java.io.FileNotFoundException: FILEPATH/application_default_credentials.json (No such file or directory)

This error occurs when Application Default Credentials (ADC) aren't set up in your local environment.

For instructions, see the section Client libraries or third-party tools in "Authenticate to BigQuery Engine for Apache Flink." You might also need to set an environment variable. For more information, see Set up Application Default Credentials.
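A minimal local setup, assuming you authenticate as your user account, looks like the following; the credentials file path is a hypothetical example:

```shell
# Create Application Default Credentials in your local environment.
gcloud auth application-default login

# If your credentials file is in a non-default location, point ADC
# at it. The path below is a hypothetical example.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/my-credentials.json"
```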

Managed Flink Default Workload Identity can't access...

When you try to run a job, the job fails with an error similar to the following:

gmf-PROJECT_NUMBER-default@gcp-sa-managedflink-wi.iam.gserviceaccount.com can't access...

This error occurs if the Managed Flink Default Workload Identity for your project doesn't have access to the sources and sinks used by your job.

To resolve this issue, grant the required roles to your Managed Flink Default Workload Identity so that it can access those sources and sinks.

If you use the gcloud CLI to upload a local file, such as a JAR file or a SQL file, to Cloud Storage, the user account running the job needs the storage.objects.create permission to write to the Cloud Storage bucket.
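For example, the roles/storage.objectCreator role includes the storage.objects.create permission and can be granted on the bucket. BUCKET_NAME and USER_EMAIL are placeholders:

```shell
# Allow the user account to write objects to the staging bucket.
# BUCKET_NAME and USER_EMAIL are placeholders.
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
    --member="user:USER_EMAIL" \
    --role="roles/storage.objectCreator"
```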

For more information, see Access Google Cloud resources in "BigQuery Engine for Apache Flink security and permissions."