Write a MapReduce job with the BigQuery connector

By default, the Hadoop BigQuery connector is installed on all Dataproc 1.0-1.2 cluster nodes under /usr/lib/hadoop/lib/. It is available in both Spark and PySpark environments.

Dataproc image versions 1.5 and later: By default, the BigQuery connector is not installed in Dataproc image versions 1.5 and later. To use it with these versions:

  1. Use this initialization action to install the BigQuery connector.

  2. Specify the BigQuery connector in the jars parameter when you submit a job (a gcloud sketch follows this list):

    --jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop3-latest.jar

  3. Include the BigQuery connector classes in the application's jar-with-dependencies.
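
For example, submitting the WordCount job shown later on this page with gcloud might look like the following minimal sketch (the cluster name, region, and bucket are illustrative placeholders; the trailing arguments match the WordCount usage string below):

    gcloud dataproc jobs submit hadoop \
        --cluster=example-cluster \
        --region=us-central1 \
        --jar=gs://example-bucket/bigquery_wordcount.jar \
        --jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop3-latest.jar \
        -- my-first-cloud-project publicdata:samples.shakespeare word \
           test_output_dataset.wordcount_output gs://example-bucket/tmp/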

Avoid conflicts: If your application uses a connector version different from the connector version deployed on your Dataproc cluster, you must take one of the following actions:

  1. Create a new cluster with an initialization action that installs the connector version your application uses, or

  2. Include and relocate the connector classes and connector dependencies of the version you are using into your application's jar so that your connector version does not conflict with the connector version deployed on the Dataproc cluster (see this example of dependency relocation in Maven; a sketch follows this list).
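
As a reference point, relocation with the maven-shade-plugin might look like the following sketch (the repackaged prefix in shadedPattern is an illustrative placeholder; the linked Maven example follows the same idea):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <relocations>
                    <relocation>
                        <pattern>com.google.cloud.hadoop.io.bigquery</pattern>
                        <shadedPattern>repackaged.com.google.cloud.hadoop.io.bigquery</shadedPattern>
                    </relocation>
                </relocations>
            </configuration>
        </execution>
    </executions>
</plugin>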

The GsonBigQueryInputFormat class

GsonBigQueryInputFormat provides Hadoop with BigQuery objects in JsonObject format through the following primary operations:

  • Using a user-specified query to select BigQuery objects
  • Splitting the results of the query evenly among the Hadoop nodes
  • Parsing the splits into Java objects to pass to the Mapper. The Hadoop Mapper class receives a JsonObject representation of each selected BigQuery object.

The BigQueryInputFormat class provides access to BigQuery records through an extension of the Hadoop InputFormat class. To use the BigQueryInputFormat class:

  1. Lines must be added to the main Hadoop job to set parameters in the Hadoop configuration.

  2. The InputFormat class must be set to GsonBigQueryInputFormat.

See the following sections for how to meet these requirements.

Input parameters

QualifiedInputTableId
The BigQuery table to read, in the form optional-projectId:datasetId.tableId
Example: publicdata:samples.shakespeare
ProjectId
All input operations occur under this BigQuery project ID.
Example: my-first-cloud-project
// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId);

// Configure input parameters.
BigQueryConfiguration.configureBigQueryInput(conf, inputQualifiedTableId);

// Set InputFormat.
job.setInputFormatClass(GsonBigQueryInputFormat.class);

Notes:

  • job refers to org.apache.hadoop.mapreduce.Job, the Hadoop job to run.
  • conf refers to the org.apache.hadoop.Configuration for the Hadoop job.
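
For context, the job and conf used in the snippet above could be created as follows (a minimal sketch; the job name is illustrative, and the complete example below creates them the same way):

// Create the Hadoop job and get its configuration.
Job job = Job.getInstance(new Configuration(), "wordcount");
Configuration conf = job.getConfiguration();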

Mapper

The GsonBigQueryInputFormat class reads from BigQuery and passes BigQuery objects one at a time as input to the Hadoop Mapper function. The input takes the form of a pair consisting of the following:

  • a LongWritable, the record number
  • a JsonObject, the BigQuery record in Json format

The Mapper accepts the LongWritable and JsonObject pair as input.

Here is a snippet from the Mapper for a sample WordCount job.

  // The configuration key used to specify the BigQuery field name
  // ("column name").
  public static final String WORDCOUNT_WORD_FIELDNAME_KEY =
      "mapred.bq.samples.wordcount.word.key";

  // Default value for the configuration entry specified by
  // WORDCOUNT_WORD_FIELDNAME_KEY. Examples: 'word' in
  // publicdata:samples.shakespeare or 'repository_name'
  // in publicdata:samples.github_timeline.
  public static final String WORDCOUNT_WORD_FIELDNAME_VALUE_DEFAULT = "word";

  /**
   * The mapper function for WordCount.
   */
  public static class Map
      extends Mapper<LongWritable, JsonObject, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private Text word = new Text();
    private String wordKey;

    @Override
    public void setup(Context context)
        throws IOException, InterruptedException {
      // Find the runtime-configured key for the field name we're looking for
      // in the map task.
      Configuration conf = context.getConfiguration();
      wordKey = conf.get(WORDCOUNT_WORD_FIELDNAME_KEY,
          WORDCOUNT_WORD_FIELDNAME_VALUE_DEFAULT);
    }

    @Override
    public void map(LongWritable key, JsonObject value, Context context)
        throws IOException, InterruptedException {
      JsonElement countElement = value.get(wordKey);
      if (countElement != null) {
        String wordInRecord = countElement.getAsString();
        word.set(wordInRecord);
        // Write out the key, value pair (write out a value of 1, which will be
        // added to the total count for this word in the Reducer).
        context.write(word, ONE);
      }
    }
  }

The IndirectBigQueryOutputFormat class

IndirectBigQueryOutputFormat provides Hadoop with the ability to write JsonObject values directly into a BigQuery table. This class provides access to BigQuery records through an extension of the Hadoop OutputFormat class. To use it correctly, several parameters must be set in the Hadoop configuration, and the OutputFormat class must be set to IndirectBigQueryOutputFormat. The following shows the parameters to set and the lines of code needed to use IndirectBigQueryOutputFormat correctly.

Output parameters

ProjectId
All output operations occur under this BigQuery project ID.
Example: "my-first-cloud-project"
QualifiedOutputTableId
The BigQuery table to write the final job results to, in the form optional-projectId:datasetId.tableId. The datasetId should already be present in your project. The outputDatasetId_hadoop_temporary dataset will be created in BigQuery for temporary results; make sure it does not conflict with an existing dataset.
Examples:
test_output_dataset.wordcount_output
my-first-cloud-project:test_output_dataset.wordcount_output
outputTableFieldSchema
A schema that defines the output BigQuery table.
GcsOutputPath
The output path for storing temporary Cloud Storage data (gs://bucket/dir/)
    // Define the schema we will be using for the output BigQuery table.
    List<TableFieldSchema> outputTableFieldSchema = new ArrayList<TableFieldSchema>();
    outputTableFieldSchema.add(new TableFieldSchema().setName("Word").setType("STRING"));
    outputTableFieldSchema.add(new TableFieldSchema().setName("Count").setType("INTEGER"));
    TableSchema outputSchema = new TableSchema().setFields(outputTableFieldSchema);

    // Create the job and get its configuration.
    Job job = Job.getInstance(parser.getConfiguration(), "wordcount");
    Configuration conf = job.getConfiguration();

    // Set the job-level projectId.
    conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId);

    // Configure input.
    BigQueryConfiguration.configureBigQueryInput(conf, inputQualifiedTableId);

    // Configure output.
    BigQueryOutputConfiguration.configure(
        conf,
        outputQualifiedTableId,
        outputSchema,
        outputGcsPath,
        BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
        TextOutputFormat.class);

    // (Optional) Configure the KMS key used to encrypt the output table.
    BigQueryOutputConfiguration.setKmsKeyName(
        conf,
        "projects/myproject/locations/us-west1/keyRings/r1/cryptoKeys/k1");
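
Finally, set IndirectBigQueryOutputFormat as the job's OutputFormat class (this line also appears in the complete example below):

    // Use IndirectBigQueryOutputFormat to write the output to BigQuery.
    job.setOutputFormatClass(IndirectBigQueryOutputFormat.class);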

Reducer

The IndirectBigQueryOutputFormat class writes to BigQuery. It takes a key and a JsonObject value as input and writes only the JsonObject value to BigQuery (the key is ignored). The JsonObject should contain a Json-formatted BigQuery record. The Reducer should output a pair consisting of a key of any type (NullWritable is used in the sample WordCount job) and a JsonObject value. The Reducer for the sample WordCount job is shown below.

  /**
   * Reducer function for WordCount.
   */
  public static class Reduce
      extends Reducer<Text, LongWritable, JsonObject, NullWritable> {

    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      // Add up the values to get a total number of occurrences of our word.
      long count = 0;
      for (LongWritable val : values) {
        count = count + val.get();
      }

      JsonObject jsonObject = new JsonObject();
      jsonObject.addProperty("Word", key.toString());
      jsonObject.addProperty("Count", count);
      // Key does not matter.
      context.write(jsonObject, NullWritable.get());
    }
  }

Cleaning up

After the job completes, clean up the Cloud Storage export paths.

job.waitForCompletion(true);
GsonBigQueryInputFormat.cleanupJob(job.getConfiguration(), job.getJobID());

You can view the word counts in the BigQuery output table in the Google Cloud console.
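
The results could also be inspected from the command line; a minimal sketch with the bq tool, assuming the example output table above:

bq query --nouse_legacy_sql \
    'SELECT Word, `Count` FROM test_output_dataset.wordcount_output ORDER BY `Count` DESC LIMIT 10'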

Complete code for the sample WordCount job

The code below is a simple WordCount job that aggregates word counts from objects in BigQuery.

package com.google.cloud.hadoop.io.bigquery.samples;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration;
import com.google.cloud.hadoop.io.bigquery.BigQueryFileFormat;
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat;
import com.google.cloud.hadoop.io.bigquery.output.BigQueryOutputConfiguration;
import com.google.cloud.hadoop.io.bigquery.output.IndirectBigQueryOutputFormat;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Sample program to run the Hadoop Wordcount example over tables in BigQuery.
 */
public class WordCount {

  // The configuration key used to specify the BigQuery field name
  // ("column name").
  public static final String WORDCOUNT_WORD_FIELDNAME_KEY =
      "mapred.bq.samples.wordcount.word.key";

  // Default value for the configuration entry specified by
  // WORDCOUNT_WORD_FIELDNAME_KEY. Examples: 'word' in
  // publicdata:samples.shakespeare or 'repository_name'
  // in publicdata:samples.github_timeline.
  public static final String WORDCOUNT_WORD_FIELDNAME_VALUE_DEFAULT = "word";

  // Guava might not be available, so define a null / empty helper:
  private static boolean isStringNullOrEmpty(String toTest) {
    return toTest == null || "".equals(toTest);
  }

  /**
   * The mapper function for WordCount. For input, it consumes a LongWritable
   * and JsonObject as the key and value. These correspond to a row identifier
   * and Json representation of the row's values/columns.
   * For output, it produces Text and a LongWritable as the key and value.
   * These correspond to the word and a count for the number of times it has
   * occurred.
   */

  public static class Map
      extends Mapper<LongWritable, JsonObject, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private Text word = new Text();
    private String wordKey;

    @Override
    public void setup(Context context)
        throws IOException, InterruptedException {
      // Find the runtime-configured key for the field name we're looking for in
      // the map task.
      Configuration conf = context.getConfiguration();
      wordKey = conf.get(WORDCOUNT_WORD_FIELDNAME_KEY, WORDCOUNT_WORD_FIELDNAME_VALUE_DEFAULT);
    }

    @Override
    public void map(LongWritable key, JsonObject value, Context context)
        throws IOException, InterruptedException {
      JsonElement countElement = value.get(wordKey);
      if (countElement != null) {
        String wordInRecord = countElement.getAsString();
        word.set(wordInRecord);
        // Write out the key, value pair (write out a value of 1, which will be
        // added to the total count for this word in the Reducer).
        context.write(word, ONE);
      }
    }
  }

  /**
   * Reducer function for WordCount. For input, it consumes the Text and
   * LongWritable that the mapper produced. For output, it produces a JsonObject
   * and NullWritable. The JsonObject represents the data that will be
   * loaded into BigQuery.
   */
  public static class Reduce
      extends Reducer<Text, LongWritable, JsonObject, NullWritable> {

    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      // Add up the values to get a total number of occurrences of our word.
      long count = 0;
      for (LongWritable val : values) {
        count = count + val.get();
      }

      JsonObject jsonObject = new JsonObject();
      jsonObject.addProperty("Word", key.toString());
      jsonObject.addProperty("Count", count);
      // Key does not matter.
      context.write(jsonObject, NullWritable.get());
    }
  }

  /**
   * Configures and runs the main Hadoop job. Takes a String[] of 5 parameters:
   * [ProjectId] [QualifiedInputTableId] [InputTableFieldName]
   * [QualifiedOutputTableId] [GcsOutputPath]
   *
   * ProjectId - Project under which to issue the BigQuery
   * operations. Also serves as the default project for table IDs that don't
   * specify a project for the table.
   *
   * QualifiedInputTableId - Input table ID of the form
   * (Optional ProjectId):[DatasetId].[TableId]
   *
   * InputTableFieldName - Name of the field to count in the
   * input table, e.g., 'word' in publicdata:samples.shakespeare or
   * 'repository_name' in publicdata:samples.github_timeline.
   *
   * QualifiedOutputTableId - Input table ID of the form
   * (Optional ProjectId):[DatasetId].[TableId]
   *
   * GcsOutputPath - The output path to store temporary
   * Cloud Storage data, e.g., gs://bucket/dir/
   *
   * @param args a String[] containing ProjectId, QualifiedInputTableId,
   *     InputTableFieldName, QualifiedOutputTableId, and GcsOutputPath.
   * @throws IOException on IO Error.
   * @throws InterruptedException on Interrupt.
   * @throws ClassNotFoundException if not all classes are present.
   */
  public static void main(String[] args)
      throws IOException, InterruptedException, ClassNotFoundException {

    // GenericOptionsParser is a utility to parse command line arguments
    // generic to the Hadoop framework. This example doesn't cover the specifics,
    // but recognizes several standard command line arguments, enabling
    // applications to easily specify a NameNode, a ResourceManager, additional
    // configuration resources, etc.
    GenericOptionsParser parser = new GenericOptionsParser(args);
    args = parser.getRemainingArgs();

    // Make sure we have the right parameters.
    if (args.length != 5) {
      System.out.println(
          "Usage: hadoop jar bigquery_wordcount.jar [ProjectId] [QualifiedInputTableId] "
              + "[InputTableFieldName] [QualifiedOutputTableId] [GcsOutputPath]\n"
              + "    ProjectId - Project under which to issue the BigQuery operations. Also serves "
              + "as the default project for table IDs that don't explicitly specify a project for "
              + "the table.\n"
              + "    QualifiedInputTableId - Input table ID of the form "
              + "(Optional ProjectId):[DatasetId].[TableId]\n"
              + "    InputTableFieldName - Name of the field to count in the input table, e.g., "
              + "'word' in publicdata:samples.shakespeare or 'repository_name' in "
              + "publicdata:samples.github_timeline.\n"
              + "    QualifiedOutputTableId - Input table ID of the form "
              + "(Optional ProjectId):[DatasetId].[TableId]\n"
              + "    GcsOutputPath - The output path to store temporary Cloud Storage data, e.g., "
              + "gs://bucket/dir/");
      System.exit(1);
    }

    // Get the individual parameters from the command line.
    String projectId = args[0];
    String inputQualifiedTableId = args[1];
    String inputTableFieldId = args[2];
    String outputQualifiedTableId = args[3];
    String outputGcsPath = args[4];

    // Define the schema we will be using for the output BigQuery table.
    List<TableFieldSchema> outputTableFieldSchema = new ArrayList<TableFieldSchema>();
    outputTableFieldSchema.add(new TableFieldSchema().setName("Word").setType("STRING"));
    outputTableFieldSchema.add(new TableFieldSchema().setName("Count").setType("INTEGER"));
    TableSchema outputSchema = new TableSchema().setFields(outputTableFieldSchema);

    // Create the job and get its configuration.
    Job job = Job.getInstance(parser.getConfiguration(), "wordcount");
    Configuration conf = job.getConfiguration();

    // Set the job-level projectId.
    conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId);

    // Configure input.
    BigQueryConfiguration.configureBigQueryInput(conf, inputQualifiedTableId);

    // Configure output.
    BigQueryOutputConfiguration.configure(
        conf,
        outputQualifiedTableId,
        outputSchema,
        outputGcsPath,
        BigQueryFileFormat.NEWLINE_DELIMITED_JSON,
        TextOutputFormat.class);

    // (Optional) Configure the KMS key used to encrypt the output table.
    BigQueryOutputConfiguration.setKmsKeyName(
        conf,
        "projects/myproject/locations/us-west1/keyRings/r1/cryptoKeys/k1");

    conf.set(WORDCOUNT_WORD_FIELDNAME_KEY, inputTableFieldId);

    // This helps Hadoop identify the Jar which contains the mapper and reducer
    // by specifying a class in that Jar. This is required if the jar is being
    // passed on the command line to Hadoop.
    job.setJarByClass(WordCount.class);

    // Tell the job what data the mapper will output.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(GsonBigQueryInputFormat.class);

    // Instead of using BigQueryOutputFormat, we use the newer
    // IndirectBigQueryOutputFormat, which works by first buffering all the data
    // into a Cloud Storage temporary file, and then on commitJob, copies all data from
    // Cloud Storage into BigQuery in one operation. Its use is recommended for large jobs
    // since it only requires one BigQuery "load" job per Hadoop/Spark job, as
    // compared to BigQueryOutputFormat, which performs one BigQuery job for each
    // Hadoop/Spark task.
    job.setOutputFormatClass(IndirectBigQueryOutputFormat.class);

    job.waitForCompletion(true);

    // After the job completes, clean up the Cloud Storage export paths.
    GsonBigQueryInputFormat.cleanupJob(job.getConfiguration(), job.getJobID());

    // You can view word counts in the BigQuery output table at
    // https://console.cloud.google.com/.
  }
}

Java version

The BigQuery connector requires Java 8.

Apache Maven dependency information

<dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>bigquery-connector</artifactId>
    <version>insert "hadoopX-X.X.X" connector version number here</version>
</dependency>

For more information, see the BigQuery connector release notes and the Javadoc reference.