Tetap teratur dengan koleksi
Simpan dan kategorikan konten berdasarkan preferensi Anda.
Untuk membaca data dari Cloud Storage ke Dataflow, gunakan
konektor I/OTextIO atau AvroIO
Apache Beam.
Menyertakan dependensi library Google Cloud
Untuk menggunakan konektor TextIO atau AvroIO dengan Cloud Storage, sertakan
dependensi berikut. Library ini menyediakan handler skema untuk nama file "gs://".
Mengaktifkan konektor I/O gRPC di Apache Beam pada Dataflow
Anda dapat terhubung ke Cloud Storage menggunakan gRPC melalui
konektor I/O Apache Beam di Dataflow. gRPC adalah framework panggilan prosedur jarak jauh (RPC) open source berperforma tinggi yang dikembangkan oleh Google yang dapat Anda gunakan untuk berinteraksi dengan Cloud Storage.
Untuk mempercepat permintaan baca tugas Dataflow Anda ke Cloud Storage, Anda dapat mengaktifkan konektor I/O Apache Beam di Dataflow untuk menggunakan gRPC.
Anda dapat mengonfigurasi konektor I/O Apache Beam di Dataflow untuk menghasilkan metrik terkait gRPC di Cloud Monitoring. Metrik terkait gRPC dapat membantu Anda melakukan hal berikut:
Pantau dan optimalkan performa permintaan gRPC ke Cloud Storage.
Memecahkan masalah dan melakukan proses debug.
Dapatkan insight tentang penggunaan dan perilaku aplikasi Anda.
Untuk mengetahui informasi tentang cara mengonfigurasi konektor I/O Apache Beam di Dataflow
untuk membuat metrik terkait gRPC, lihat Menggunakan metrik sisi klien.
Jika pengumpulan metrik tidak diperlukan untuk kasus penggunaan Anda, Anda dapat memilih untuk tidak ikut pengumpulan metrik.
Untuk mengetahui petunjuknya, lihat Memilih tidak mengaktifkan metrik sisi klien.
Keparalelan
Konektor TextIO dan AvroIO mendukung dua tingkat paralelisme:
Setiap file diberi kunci secara terpisah, sehingga beberapa pekerja dapat membacanya.
Jika file tidak dikompresi, konektor dapat membaca sub-rentang setiap file secara terpisah, sehingga menghasilkan tingkat paralelisme yang sangat tinggi. Pemecahan ini hanya mungkin dilakukan jika setiap baris dalam file adalah kumpulan data yang bermakna. Misalnya, fitur ini tidak tersedia secara default untuk file JSON.
Performa
Tabel berikut menunjukkan metrik performa untuk membaca dari
Cloud Storage. Beban kerja dijalankan di satu pekerja e2-standard2, menggunakan Apache Beam SDK 2.49.0 untuk Java. Mereka tidak menggunakan Runner v2.
Metrik ini didasarkan pada pipeline batch sederhana. Benchmark ini dimaksudkan untuk membandingkan performa antara konektor I/O, dan tidak selalu mewakili pipeline dunia nyata.
Performa pipeline Dataflow bersifat kompleks, dan merupakan fungsi dari jenis VM, data yang diproses, performa sumber dan sink eksternal, serta kode pengguna. Metrik didasarkan pada
menjalankan Java SDK, dan tidak mewakili karakteristik performa SDK bahasa lainnya. Untuk mengetahui informasi selengkapnya, lihat Performa IO Beam.
Praktik terbaik
Hindari penggunaan watchForNewFiles dengan
Cloud Storage. Pendekatan ini tidak dapat diskalakan dengan baik untuk pipeline produksi besar, karena konektor harus menyimpan daftar file yang telah dilihat dalam memori. Daftar tidak dapat dihapus dari memori, yang mengurangi memori kerja pekerja dari waktu ke waktu. Pertimbangkan untuk menggunakan
notifikasi Pub/Sub untuk Cloud Storage
sebagai gantinya. Untuk mengetahui informasi selengkapnya, lihat
Pola pemrosesan file.
Jika nama file dan konten file adalah data yang berguna, gunakan class
FileIO untuk membaca nama file. Misalnya, nama file dapat berisi metadata yang berguna saat memproses data dalam file. Untuk informasi selengkapnya, lihat Mengakses nama file.
Dokumentasi FileIO juga menunjukkan contoh pola ini.
Contoh
Contoh berikut menunjukkan cara membaca dari Cloud Storage.
importorg.apache.beam.sdk.Pipeline;importorg.apache.beam.sdk.PipelineResult;importorg.apache.beam.sdk.io.TextIO;importorg.apache.beam.sdk.options.Description;importorg.apache.beam.sdk.options.PipelineOptions;importorg.apache.beam.sdk.options.PipelineOptionsFactory;importorg.apache.beam.sdk.transforms.MapElements;importorg.apache.beam.sdk.values.TypeDescriptors;publicclassReadFromStorage{publicstaticPipelinecreatePipeline(Optionsoptions){varpipeline=Pipeline.create(options);pipeline// Read from a text file..apply(TextIO.read().from("gs://"+options.getBucket()+"/*.txt")).apply(MapElements.into(TypeDescriptors.strings()).via((x->{System.out.println(x);returnx;})));returnpipeline;}}
[[["Mudah dipahami","easyToUnderstand","thumb-up"],["Memecahkan masalah saya","solvedMyProblem","thumb-up"],["Lainnya","otherUp","thumb-up"]],[["Sulit dipahami","hardToUnderstand","thumb-down"],["Informasi atau kode contoh salah","incorrectInformationOrSampleCode","thumb-down"],["Informasi/contoh yang saya butuhkan tidak ada","missingTheInformationSamplesINeed","thumb-down"],["Masalah terjemahan","translationIssue","thumb-down"],["Lainnya","otherDown","thumb-down"]],["Terakhir diperbarui pada 2025-08-18 UTC."],[[["\u003cp\u003eTo read data from Cloud Storage to Dataflow, use the Apache Beam \u003ccode\u003eTextIO\u003c/code\u003e or \u003ccode\u003eAvroIO\u003c/code\u003e I/O connector and include the Google Cloud library dependency, which provides a schema handler for \u003ccode\u003e"gs://"\u003c/code\u003e filenames.\u003c/p\u003e\n"],["\u003cp\u003eEnabling gRPC through the Apache Beam I/O connector on Dataflow can accelerate Dataflow job read requests to Cloud Storage, using the pipeline option \u003ccode\u003e--additional-experiments=use_grpc_for_gcs\u003c/code\u003e or \u003ccode\u003e--experiments=use_grpc_for_gcs\u003c/code\u003e, and requires Apache Beam SDK version 2.55.0 or later.\u003c/p\u003e\n"],["\u003cp\u003eThe \u003ccode\u003eTextIO\u003c/code\u003e and \u003ccode\u003eAvroIO\u003c/code\u003e connectors offer parallelism by allowing multiple workers to read individual files separately, and for uncompressed files, sub-ranges of each file can be read separately, enhancing parallelism.\u003c/p\u003e\n"],["\u003cp\u003eAvoid using \u003ccode\u003ewatchForNewFiles\u003c/code\u003e with Cloud Storage in large production pipelines, and use Pub/Sub notifications instead to prevent memory issues, and use \u003ccode\u003eFileIO\u003c/code\u003e class when both the filename and file contents are valuable data.\u003c/p\u003e\n"],["\u003cp\u003ePerformance metrics for reading from Cloud Storage using the Apache Beam SDK 2.49.0 for Java on an \u003ccode\u003ee2-standard2\u003c/code\u003e worker showed throughputs of 320 MBps and 320,000 elements per second for 100M records with 1 kB and 1 column.\u003c/p\u003e\n"]]],[],null,["# Read from Cloud Storage to Dataflow\n\nTo read data from Cloud Storage to Dataflow, use the\nApache Beam `TextIO` or `AvroIO`\n[I/O connector](https://beam.apache.org/documentation/io/connectors/).\n| **Note:** Depending on your scenario, consider using one of the [Google-provided Dataflow templates](/dataflow/docs/guides/templates/provided-templates). Several of these templates read from Cloud Storage.\n\nInclude the Google Cloud library dependency\n-------------------------------------------\n\nTo use the `TextIO` or `AvroIO` connector with Cloud Storage, include\nthe following dependency. This library provides a schema handler for `\"gs://\"`\nfilenames. \n\n### Java\n\n \u003cdependency\u003e\n \u003cgroupId\u003eorg.apache.beam\u003c/groupId\u003e\n \u003cartifactId\u003ebeam-sdks-java-io-google-cloud-platform\u003c/artifactId\u003e\n \u003cversion\u003e${beam.version}\u003c/version\u003e\n \u003c/dependency\u003e\n\n### Python\n\n apache-beam[gcp]==\u003cvar translate=\"no\"\u003e\u003cspan class=\"devsite-syntax-n\"\u003eVERSION\u003c/span\u003e\u003c/var\u003e\n\n### Go\n\n import _ \"github.com/apache/beam/sdks/v2/go/pkg/beam/io/filesystem/gcs\"\n\nFor more information, see\n[Install the Apache Beam SDK](/dataflow/docs/guides/installing-beam-sdk).\n\nEnable gRPC on Apache Beam I/O connector on Dataflow\n----------------------------------------------------\n\nYou can [connect to Cloud Storage using gRPC](/storage/docs/enable-grpc-api) through the\nApache Beam I/O connector on Dataflow. [gRPC](https://grpc.io/) is a\nhigh performance open-source remote procedure call (RPC) framework developed by\nGoogle that you can use to interact with\nCloud Storage.\n\nTo speed up your Dataflow job's read requests to Cloud Storage, you\ncan enable the Apache Beam I/O connector on Dataflow to use gRPC.\n\n### Command line\n\n1. Ensure that you use the [Apache Beam SDK](https://beam.apache.org/documentation/programming-guide/#configuring-pipeline-options) version [2.55.0](https://beam.apache.org/get-started/beam-overview/) or later.\n2. To run a Dataflow job, use `--additional-experiments=use_grpc_for_gcs` pipeline option. For information about the different pipeline options, see [Optional flags](/sdk/gcloud/reference/dataflow/jobs/run#--additional-experiments).\n\n### Apache Beam SDK\n\n1. Ensure that you use the [Apache Beam SDK](https://beam.apache.org/documentation/programming-guide/#configuring-pipeline-options) version [2.55.0](https://beam.apache.org/get-started/beam-overview/) or later.\n2. To run a Dataflow job, use `--experiments=use_grpc_for_gcs` pipeline option. For information about the different pipeline options, see [Basic\n options](/dataflow/docs/reference/pipeline-options#basic_options).\n\nYou can configure Apache Beam I/O connector on Dataflow to generate gRPC\nrelated metrics in Cloud Monitoring. The gRPC related metrics can help you to do the following:\n\n- Monitor and optimize the performance of gRPC requests to Cloud Storage.\n- Troubleshoot and debug issues.\n- Gain insights into your application's usage and behavior.\n\n\u003cbr /\u003e\n\nFor information about how to configure Apache Beam I/O connector on Dataflow to generate gRPC related metrics, see [Use client-side metrics](/storage/docs/client-side-metrics). If gathering metrics isn't necessary for your use case, you can choose to opt-out of metrics collection. For instructions, see [Opt-out of client-side\nmetrics](/storage/docs/client-side-metrics#opt_out_of_client-side_metrics).\n\n\u003cbr /\u003e\n\nParallelism\n-----------\n\nThe `TextIO` and `AvroIO` connectors support two levels of parallelism:\n\n- Individual files are keyed separately, so that multiple workers can read them.\n- If the files are uncompressed, the connector can read sub-ranges of each file separately, leading to a very high level of parallelism. This splitting is only possible if each line in the file is a meaningful record. For example, it's not available by default for JSON files.\n\nPerformance\n-----------\n\nThe following table shows performance metrics for reading from\nCloud Storage. The workloads were run on one `e2-standard2` worker,\nusing the Apache Beam SDK 2.49.0 for Java. They did not use Runner v2.\n\n\nThese metrics are based on simple batch pipelines. They are intended to compare performance\nbetween I/O connectors, and are not necessarily representative of real-world pipelines.\nDataflow pipeline performance is complex, and is a function of VM type, the data\nbeing processed, the performance of external sources and sinks, and user code. Metrics are based\non running the Java SDK, and aren't representative of the performance characteristics of other\nlanguage SDKs. For more information, see [Beam IO\nPerformance](https://beam.apache.org/performance/).\n\n\u003cbr /\u003e\n\nBest practices\n--------------\n\n- Avoid using [`watchForNewFiles`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/TextIO.Read.html#watchForNewFiles-org.joda.time.Duration-org.apache.beam.sdk.transforms.Watch.Growth.TerminationCondition-) with\n Cloud Storage. This approach scales poorly for large production\n pipelines, because the connector must keep a list of seen files in memory. The\n list can't be flushed from memory, which reduces the working memory of workers\n over time. Consider using\n [Pub/Sub notifications for Cloud Storage](/storage/docs/pubsub-notifications)\n instead. For more information, see\n [File processing patterns](https://beam.apache.org/documentation/patterns/file-processing/).\n\n- If both the filename and the file contents are useful data, use the\n [`FileIO`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/FileIO.html) class to read filenames. For example, a filename might\n contain metadata that is useful when processing the data in the file. For more\n information, see\n [Accessing filenames](https://beam.apache.org/documentation/patterns/file-processing/).\n The [`FileIO` documentation](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/FileIO.html) also shows an example of this pattern.\n\nExample\n-------\n\nThe following example shows how to read from Cloud Storage. \n\n### Java\n\n\nTo authenticate to Dataflow, set up Application Default Credentials.\nFor more information, see\n\n[Set up authentication for a local development environment](/docs/authentication/set-up-adc-local-dev-environment).\n\n import org.apache.beam.sdk.Pipeline;\n import org.apache.beam.sdk.PipelineResult;\n import org.apache.beam.sdk.io.TextIO;\n import org.apache.beam.sdk.options.Description;\n import org.apache.beam.sdk.options.PipelineOptions;\n import org.apache.beam.sdk.options.PipelineOptionsFactory;\n import org.apache.beam.sdk.transforms.MapElements;\n import org.apache.beam.sdk.values.TypeDescriptors;\n\n public class ReadFromStorage {\n public static Pipeline createPipeline(Options options) {\n var pipeline = Pipeline.create(options);\n pipeline\n // Read from a text file.\n .apply(TextIO.read().from(\n \"gs://\" + options.getBucket() + \"/*.txt\"))\n .apply(\n MapElements.into(TypeDescriptors.strings())\n .via(\n (x -\u003e {\n System.out.println(x);\n return x;\n })));\n return pipeline;\n }\n }\n\n\u003cbr /\u003e\n\nWhat's next\n-----------\n\n- Read the [`TextIO`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/TextIO.html) API documentation.\n- See the list of [Google-provided templates](/dataflow/docs/guides/templates/provided-templates)."]]