Dataproc optional Jupyter component
You can install additional components like Jupyter when you create a Dataproc
cluster using the
Optional components
feature. This page describes the Jupyter component.
The Jupyter component
is a Web-based single-user notebook for interactive data analytics and supports the
JupyterLab
Web UI. The Jupyter Web UI is available on port 8123 on the cluster's first master node.
Configure Jupyter. You can configure Jupyter by providing dataproc:jupyter cluster properties.
To reduce the risk of remote code execution over unsecured notebook server
APIs, the dataproc:jupyter.listen.all.interfaces cluster property defaults to
false, which restricts connections to localhost (127.0.0.1) when
the Component Gateway is enabled. Enabling the Component Gateway is required
when you install the Jupyter component.
The Jupyter notebook provides a Python kernel to run Spark code, and a
PySpark kernel. By default, notebooks are saved in Cloud Storage
in the Dataproc staging bucket, which is specified by the user or
auto-created
when the cluster is created. The location can be changed at cluster creation time using the
dataproc:jupyter.notebook.gcs.dir cluster property.
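As a rough illustration of how these properties might be passed at cluster creation time, the following Python sketch assembles a `gcloud dataproc clusters create` command line that includes a `--properties` flag. The helper names, the bucket `gs://example-bucket/notebooks`, the cluster name, and the region are hypothetical placeholders; only the two property names come from this page. The script prints the command instead of running it.

```python
import shlex


def jupyter_properties_flag(notebook_dir=None, listen_all_interfaces=None):
    """Return a comma-separated value for the gcloud --properties flag.

    Hypothetical helper: only the two property names are taken from this page.
    """
    props = {}
    if notebook_dir is not None:
        # Cloud Storage location where notebooks are saved.
        props["dataproc:jupyter.notebook.gcs.dir"] = notebook_dir
    if listen_all_interfaces is not None:
        # The default is false (connections restricted to localhost).
        props["dataproc:jupyter.listen.all.interfaces"] = str(listen_all_interfaces).lower()
    return ",".join(f"{key}={value}" for key, value in props.items())


def create_cluster_command(cluster_name, region, properties):
    """Return the gcloud argument list for a cluster with the Jupyter component."""
    cmd = [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        "--optional-components=JUPYTER",
        "--enable-component-gateway",
    ]
    if properties:
        cmd.append(f"--properties={properties}")
    return cmd


# Placeholder values; replace them with your own bucket, cluster name, and region.
flag = jupyter_properties_flag(notebook_dir="gs://example-bucket/notebooks")
command = create_cluster_command("example-cluster", "us-central1", flag)
print(shlex.join(command))
```

Building the argument list separately from executing it makes it easy to review the final command before creating a cluster.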
Work with data files. You can use a Jupyter notebook to work with data files that have been
uploaded to Cloud Storage.
Since the Cloud Storage connector
is pre-installed on a Dataproc cluster, you can reference the
files directly in your notebook. Here's an example that accesses CSV files in
Cloud Storage:

```
# Read a CSV file from Cloud Storage into a Spark DataFrame.
df = spark.read.csv("gs://bucket/path/file.csv")
# Print the first rows of the DataFrame.
df.show()
```

See
[Generic Load and Save Functions](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html)
for PySpark examples.

**Launch notebooks for multiple users.** You can create a Dataproc-enabled
[Vertex AI Workbench instance](/vertex-ai/docs/workbench/instances/create-dataproc-enabled)
or [install the Dataproc JupyterLab plugin](/dataproc-serverless/docs/quickstarts/jupyterlab-sessions)
on a VM to serve notebooks to multiple users.

Install Jupyter
---------------

Install the component when you create a Dataproc cluster.
The Jupyter component requires activation of the Dataproc
[Component Gateway](/dataproc/docs/concepts/accessing/dataproc-gateways).

**Note:** Installing the Jupyter component also requires installing the [Anaconda](/dataproc/docs/concepts/components/anaconda) component only when you use [image version 1.5](/dataproc/docs/concepts/versioning/dataproc-version-clusters#unsupported-dataproc-image-versions).

### Console

1. Enable the component.
   - In the Google Cloud console, open the Dataproc [Create a cluster](https://console.cloud.google.com/dataproc/clustersAdd) page. The **Set up cluster** panel is selected.
   - In the **Components** section:
     - Under **Optional components**, select the **Jupyter** component.
     - Under **Component Gateway**, select **Enable component gateway** (see [Viewing and Accessing Component Gateway URLs](/dataproc/docs/concepts/accessing/dataproc-gateways#viewing_and_accessing_component_gateway_urls)).

### gcloud CLI

To create a Dataproc cluster that includes the Jupyter component,
use the
[gcloud dataproc clusters create](/sdk/gcloud/reference/dataproc/clusters/create) *cluster-name* command with the `--optional-components` flag.

**Latest default image version example**

The following example installs the Jupyter
component on a cluster that uses the latest default image version.

```
gcloud dataproc clusters create cluster-name \
    --optional-components=JUPYTER \
    --region=region \
    --enable-component-gateway \
    ... other flags
```

### REST API

The Jupyter component
can be installed through the Dataproc API using
[`SoftwareConfig.Component`](/dataproc/docs/reference/rest/v1/ClusterConfig#Component)
as part of a
[`clusters.create`](/dataproc/docs/reference/rest/v1/projects.regions.clusters/create)
request.

- Set the [EndpointConfig.enableHttpPortAccess](/dataproc/docs/reference/rest/v1/ClusterConfig#endpointconfig) property to `true` as part of the `clusters.create` request to enable connecting to the Jupyter notebook Web UI using the [Component Gateway](/dataproc/docs/concepts/accessing/dataproc-gateways).

Open the Jupyter and JupyterLab UIs
-----------------------------------

In the Google Cloud console, click the
[Component Gateway links](/dataproc/docs/concepts/accessing/dataproc-gateways#viewing_and_accessing_component_gateway_urls)
to open the Jupyter notebook or JupyterLab UI, which runs on the cluster master
node, in your local browser.

**Select "GCS" or "Local Disk" to create a new Jupyter Notebook in
either location.**

Attach GPUs to master and worker nodes
--------------------------------------

You can [add GPUs](https://cloud.google.com/dataproc/docs/concepts/compute/gpus)
to your cluster's master and worker nodes when using a Jupyter notebook to:

1. Preprocess data in Spark, then collect a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) onto the master and run [TensorFlow](https://www.tensorflow.org/)
2. Use Spark to orchestrate TensorFlow runs in parallel
3. Run [TensorFlow-on-YARN](https://github.com/criteo/tf-yarn)
4. Support other machine learning scenarios that use GPUs

Last updated 2025-08-25 UTC.
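
The REST API section above names two settings for a `clusters.create` request: the Jupyter entry in `SoftwareConfig` and `EndpointConfig.enableHttpPortAccess`. As a minimal sketch, the Python below builds a request body containing only those settings and prints it as JSON. The function name, project ID, and cluster name are hypothetical placeholders, and a real request would normally carry additional cluster configuration plus authentication, which this sketch omits.

```python
import json


def build_create_cluster_body(project_id, cluster_name):
    """Return a minimal clusters.create request body (illustrative only)."""
    return {
        "projectId": project_id,
        "clusterName": cluster_name,
        "config": {
            # Install the optional Jupyter component.
            "softwareConfig": {"optionalComponents": ["JUPYTER"]},
            # Enable the Component Gateway, which the Jupyter component requires.
            "endpointConfig": {"enableHttpPortAccess": True},
        },
    }


# Placeholder identifiers; replace them with your own project and cluster name.
body = build_create_cluster_body("example-project", "example-cluster")
print(json.dumps(body, indent=2))
```

Printing the body first lets you compare it against the `ClusterConfig` reference before sending it with the HTTP client of your choice.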