This page describes troubleshooting steps that you might find helpful if you run into problems when you use Vertex AI Workbench.
See also Troubleshooting Vertex AI for help using other components of Vertex AI.
Helpful procedures
This section describes procedures that you might find helpful.
Use SSH to connect to your user-managed notebooks instance
Use ssh to connect to your instance by typing the following command in either Cloud Shell or any environment where the Google Cloud CLI is installed.
gcloud compute ssh --project PROJECT_ID \
--zone ZONE \
INSTANCE_NAME -- -L 8080:localhost:8080
Replace the following:
- PROJECT_ID: Your project ID
- ZONE: The Google Cloud zone where your instance is located
- INSTANCE_NAME: The name of your instance
You can also connect to your instance by opening your instance's Compute Engine detail page, and then clicking the SSH button.
Re-register with the Inverting Proxy server
To re-register the user-managed notebooks instance with the internal Inverting Proxy server, you can stop and start the VM from the User-managed notebooks page or you can use ssh to connect to your user-managed notebooks instance and enter:
cd /opt/deeplearning/bin
sudo ./attempt-register-vm-on-proxy.sh
Verify the Docker service status
To verify the Docker service status you can use ssh to connect to your user-managed notebooks instance and enter:
sudo service docker status
Verify that the Inverting Proxy agent is running
To verify if the notebook Inverting Proxy agent is running, use ssh to connect to your user-managed notebooks instance and enter:
# Confirm Inverting Proxy agent Docker container is running (proxy-agent)
sudo docker ps
# Verify State.Status is running and State.Running is true.
sudo docker inspect proxy-agent
# Grab logs
sudo docker logs proxy-agent
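The two state checks above can be combined into one script. The following is a minimal sketch, assuming the container is named proxy-agent as shown above. The function names `proxy_agent_state_ok` and `check_proxy_agent` are illustrative and not part of any Google Cloud tooling. The sketch uses `docker inspect --format` to print only the two state fields instead of the full JSON.

```shell
# Hypothetical helper: returns 0 only when both state fields look healthy.
# $1 = value of State.Status, $2 = value of State.Running
proxy_agent_state_ok() {
  [ "$1" = "running" ] && [ "$2" = "true" ]
}

# Hypothetical wrapper: reads the state fields from Docker and reports them.
# Run it on the instance; it assumes the container name proxy-agent.
check_proxy_agent() {
  local status running
  status=$(sudo docker inspect --format '{{.State.Status}}' proxy-agent) || return 1
  running=$(sudo docker inspect --format '{{.State.Running}}' proxy-agent) || return 1
  if proxy_agent_state_ok "$status" "$running"; then
    echo "proxy-agent is running"
  else
    echo "proxy-agent is not healthy (status=$status, running=$running)"
    # Show recent log lines to help diagnose the problem.
    sudo docker logs --tail 50 proxy-agent
    return 1
  fi
}
```

After you define the functions in a terminal on the instance, run `check_proxy_agent`. A nonzero exit status suggests that restarting the agent might be worth trying.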
Verify the Jupyter service status and collect logs
To verify the Jupyter service status you can use ssh to connect to your user-managed notebooks instance and enter:
sudo service jupyter status
To collect Jupyter service logs:
sudo journalctl -u jupyter.service --no-pager
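The full log can be long. As a rough sketch, you might narrow it to recent error-like lines. The function names and the search pattern below are assumptions chosen for illustration; `--since` is a standard journalctl option.

```shell
# Hypothetical filter: keep only lines that look like errors.
# Reads log text on stdin; the pattern is an assumption, adjust it as needed.
filter_jupyter_errors() {
  # "|| true" keeps the exit status 0 when no line matches.
  grep -i -E 'error|traceback|exception' || true
}

# Hypothetical wrapper: show error-like lines from the last hour of logs.
recent_jupyter_errors() {
  sudo journalctl -u jupyter.service --no-pager --since "1 hour ago" \
    | filter_jupyter_errors
}
```

Run `recent_jupyter_errors` on the instance. An empty result means only that no line matched the pattern, not that the service is healthy.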
Verify that the Jupyter internal API is active
The Jupyter API should always run on port 8080. You can verify this by inspecting the instance's syslogs for an entry similar to:
Jupyter Server ... running at: http://localhost:8080
To verify that the Jupyter internal API is active you can also use ssh to connect to your user-managed notebooks instance and enter:
curl http://127.0.0.1:8080/api/kernelspecs
You can also measure the time it takes for the API to respond in case the requests are taking too long:
time curl -v http://127.0.0.1:8080/api/status
time curl -v http://127.0.0.1:8080/api/kernels
time curl -v http://127.0.0.1:8080/api/connections
To run these commands in your Vertex AI Workbench instance, open JupyterLab, and create a new terminal.
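If you prefer one line per endpoint, curl can report the HTTP status code and total request time itself through `--write-out`. The following is a minimal sketch under that assumption. The function names and the 2-second limit are illustrative choices, not documented thresholds.

```shell
# Hypothetical helper: prints "<http_code> <seconds>" for one URL.
# %{http_code} and %{time_total} are standard curl --write-out variables.
probe_endpoint() {
  curl --silent --output /dev/null --max-time 30 \
    --write-out '%{http_code} %{time_total}\n' "$1"
}

# Hypothetical helper: returns 0 when elapsed seconds ($1) exceed a limit ($2).
is_slow() {
  awk -v t="$1" -v max="$2" 'BEGIN { exit !(t + 0 > max + 0) }'
}

# Probe the three endpoints listed above; the 2-second limit is an arbitrary example.
probe_jupyter_api() {
  local path result seconds
  for path in status kernels connections; do
    result=$(probe_endpoint "http://127.0.0.1:8080/api/$path")
    seconds=${result#* }   # drop the status code, keep the time
    if is_slow "$seconds" 2; then
      echo "/api/$path: $result (slow)"
    else
      echo "/api/$path: $result"
    fi
  done
}
```

Run `probe_jupyter_api` on the instance. curl prints a status code of 000 when it could not complete the request, for example when nothing is listening on port 8080.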
Restart the Docker service
To restart the Docker service, you can stop and start the VM from the User-managed notebooks page or you can use ssh to connect to your user-managed notebooks instance and enter:
sudo service docker restart
Restart the Inverting Proxy agent
To restart the Inverting Proxy agent, you can stop and start the VM from the User-managed notebooks page or you can use ssh to connect to your user-managed notebooks instance and enter:
sudo docker restart proxy-agent
Restart the Jupyter service
To restart the Jupyter service, you can stop and start the VM from the User-managed notebooks page or you can use ssh to connect to your user-managed notebooks instance and enter:
sudo service jupyter restart
Restart the Notebooks Collection Agent
The Notebooks Collection Agent service runs a Python process in the background that verifies the status of the Vertex AI Workbench instance's core services.
To restart the Notebooks Collection Agent service, you can stop and start the VM from the Google Cloud console or you can use ssh to connect to your Vertex AI Workbench instance and enter:
sudo systemctl stop notebooks-collection-agent.service
followed by:
sudo systemctl start notebooks-collection-agent.service
To run these commands in your Vertex AI Workbench instance, open JupyterLab, and create a new terminal.
Modify the Notebooks Collection Agent script
To access and edit the script, open a terminal in your instance or use ssh to connect to your Vertex AI Workbench instance, and enter:
nano /opt/deeplearning/bin/notebooks_collection_agent.py
After editing the file, remember to save it.
Then, you must restart the Notebooks Collection Agent service.
Verify the instance can resolve the required DNS domains
To verify that the instance can resolve the required DNS domains, you can use ssh to connect to your user-managed notebooks instance and enter:
host notebooks.googleapis.com
host *.notebooks.cloud.google.com
host *.notebooks.googleusercontent.com
host *.kernels.googleusercontent.com
or:
curl --silent --output /dev/null "https://notebooks.cloud.google.com"; echo $?
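The number printed by `echo $?` is curl's exit status. As a sketch, a small helper can translate a few common values into hints. The codes below are taken from the curl manual, while the function names and hint wording are illustrative.

```shell
# Hypothetical helper: translate a few common curl exit codes into hints.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    6)  echo "could not resolve host (DNS)" ;;
    7)  echo "could not connect to host" ;;
    28) echo "operation timed out" ;;
    *)  echo "curl failed with exit code $1" ;;
  esac
}

# Hypothetical wrapper: request a URL and print a one-line result.
check_domain() {
  local code=0
  curl --silent --output /dev/null --max-time 15 "$1" || code=$?
  echo "$1: $(explain_curl_exit "$code")"
}
# Usage: check_domain "https://notebooks.cloud.google.com"
```

An exit code of 6 points at DNS resolution. A code of 7 or 28 suggests that the name resolved but the network path might be blocked.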
If the instance has Dataproc enabled, you can verify that the instance resolves *.kernels.googleusercontent.com by running:
curl --verbose -H "Authorization: Bearer $(gcloud auth print-access-token)" https://${PROJECT_NUMBER}-dot-${REGION}.kernels.googleusercontent.com/api/kernelspecs | jq .
To run these commands in your Vertex AI Workbench instance, open JupyterLab, and create a new terminal.
Make a copy of the user data on an instance
To store a copy of an instance's user data in Cloud Storage, complete the following steps.
Create a Cloud Storage bucket (optional)
In the same project where your instance is located, create a Cloud Storage bucket where you can store your user data. If you already have a Cloud Storage bucket, skip this step.
- Create a Cloud Storage bucket:
gcloud storage buckets create gs://BUCKET_NAME
Replace BUCKET_NAME with a bucket name that meets the bucket naming requirements.
Copy your user data
In your instance's JupyterLab interface, select File > New > Terminal to open a terminal window. For user-managed notebooks instances, you can instead connect to your instance's terminal by using SSH.
Use the gcloud CLI to copy your user data to a Cloud Storage bucket. The following example command copies all of the files from your instance's /home/jupyter/ directory to a directory in a Cloud Storage bucket.
gcloud storage cp /home/jupyter/* gs://BUCKET_NAMEPATH --recursive
Replace the following:
- BUCKET_NAME: the name of your Cloud Storage bucket
- PATH: the path to the directory where you want to copy your files, for example: /copy/jupyter/
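Because the bucket name and path are joined directly in the copy command, a missing or doubled slash is an easy mistake. The following sketch, with hypothetical function names, builds the destination URL from the two values and then runs the same copy command.

```shell
# Hypothetical helper: join a bucket name and a path into a gs:// URL.
# Ensures exactly one slash between the bucket name and the path.
backup_destination() {
  local bucket="$1" path="${2#/}"
  echo "gs://${bucket}/${path}"
}

# Hypothetical wrapper around the copy command shown above.
copy_user_data() {
  local dest
  dest=$(backup_destination "$1" "$2")
  gcloud storage cp /home/jupyter/* "$dest" --recursive
}
# Usage: copy_user_data BUCKET_NAME /copy/jupyter/
```

With default shell settings, the `*` pattern does not match names that start with a dot, so hidden files directly under /home/jupyter/ are not included in the copy.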
Investigate an instance stuck in provisioning by using gcpdiag
gcpdiag is an open source tool. It is not an officially supported Google Cloud product.
You can use the gcpdiag tool to help you identify and fix Google Cloud project issues. For more information, see the gcpdiag project on GitHub.
This gcpdiag runbook investigates potential causes for a Vertex AI Workbench instance to get stuck in provisioning status, including the following areas:
- Status: Checks the instance's current status to ensure that it is stuck in provisioning and not stopped or active.
- Instance's Compute Engine VM boot disk image: Checks whether the instance was created with a custom container, an official workbench-instances image, Deep Learning VM Images, or unsupported images that might cause the instance to get stuck in provisioning status.
- Custom scripts: Checks whether the instance is using custom startup or post-startup scripts that change the default Jupyter port or break dependencies that might cause the instance to get stuck in provisioning status.
- Environment version: Checks whether the instance is using the latest environment version by checking its upgradability. Earlier versions might cause the instance to get stuck in provisioning status.
- Instance's Compute Engine VM performance: Checks the VM's current performance to ensure that it isn't impaired by high CPU usage, insufficient memory, or disk space issues that might disrupt normal operations.
- Instance's Compute Engine serial port or system logging: Checks whether the instance has serial port logs, which are analyzed to ensure that Jupyter is running on 127.0.0.1:8080.
- Instance's Compute Engine SSH and terminal access: Checks whether the instance's Compute Engine VM is running so that the user can SSH and open a terminal to verify that space usage in /home/jupyter is lower than 85%. If no space is left, this might cause the instance to get stuck in provisioning status.
- External IP turned off: Checks whether external IP access is turned off. An incorrect networking configuration can cause the instance to get stuck in provisioning status.
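The disk space condition in the list above can also be checked by hand from a terminal on the instance. The following is a minimal sketch that parses `df -P` output. The function names are illustrative, and the 85% default mirrors the threshold mentioned above.

```shell
# Hypothetical helper: extract the usage percentage from `df -P` output on stdin.
# In POSIX df output, the fifth column of the second line is the capacity.
parse_df_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical check: fail when usage of a directory reaches a limit.
# $1 = directory (default /home/jupyter), $2 = limit in percent (default 85)
check_disk_usage() {
  local dir="${1:-/home/jupyter}" limit="${2:-85}" used
  used=$(df -P "$dir" | parse_df_percent)
  if [ "$used" -ge "$limit" ]; then
    echo "$dir is ${used}% full (limit ${limit}%)"
    return 1
  fi
  echo "$dir is ${used}% full"
}
```

Run `check_disk_usage` with no arguments to test /home/jupyter against the 85% threshold.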
Google Cloud console
- Complete and then copy the following command.
- Open the Google Cloud console and activate Cloud Shell.
- Paste the copied command.
- Run the gcpdiag command, which downloads the gcpdiag Docker image, and then performs diagnostic checks. If applicable, follow the output instructions to fix failed checks.
gcpdiag runbook vertex/workbench-instance-stuck-in-provisioning \
--project=PROJECT_ID \
--parameter instance_name=INSTANCE_NAME \
--parameter zone=ZONE
Docker
You can run gcpdiag using a wrapper that starts gcpdiag in a Docker container. Docker or Podman must be installed.
- Copy and run the following command on your local workstation.
curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag && chmod +x gcpdiag
- Execute the gcpdiag command.
./gcpdiag runbook vertex/workbench-instance-stuck-in-provisioning \
--project=PROJECT_ID \
--parameter instance_name=INSTANCE_NAME \
--parameter zone=ZONE
View available parameters for this runbook.
Replace the following:
- PROJECT_ID: The ID of the project containing the resource.
- INSTANCE_NAME: The name of the target Vertex AI Workbench instance within your project.
- ZONE: The zone in which your target Vertex AI Workbench instance is located.
Useful flags:
- --project: The PROJECT_ID
- --universe-domain: If applicable, the Trusted Partner Sovereign Cloud domain hosting the resource
- --parameter or -p: Runbook parameters
For a list and description of all gcpdiag tool flags, see the gcpdiag usage instructions.