Analyze your data

Last reviewed 2023-08-08 UTC

This document in the Google Cloud Architecture Framework explains some of the core principles and best practices for data analytics in Google Cloud. You learn about some of the key data-analytics services, and how they can help at the various stages of the data lifecycle. These best practices help you to meet your data analytics needs and create your system design.

Core principles

Businesses want to analyze data and generate actionable insights from that data. Google Cloud provides you with various services that help you through the entire data lifecycle, from data ingestion through reports and visualization. Most of these services are fully managed, and some are serverless. You can also build and manage a data-analytics environment on Compute Engine VMs, for example to self-host Apache Hadoop or Apache Beam.

Your particular focus, team expertise, and strategic outlook help you to determine which Google Cloud services you adopt to support your data analytics needs. For example, Dataflow lets you write complex transformations in a serverless approach, but you rely on its opinionated configuration for compute and processing resources. Alternatively, Dataproc lets you run the same transformations, but you manage the clusters and fine-tune the jobs yourself.

In your system design, think about which processing strategy your teams use, such as extract, transform, load (ETL) or extract, load, transform (ELT). Your system design should also consider whether you need to process batch analytics or streaming analytics. Google Cloud provides a unified data platform, and it lets you build a data lake or a data warehouse to meet your business needs.

Key services

The following list provides a high-level overview of Google Cloud analytics services:

  • Pub/Sub: Simple, reliable, and scalable foundation for stream analytics and event-driven computing systems.
  • Dataflow: Fully managed service to transform and enrich data in stream (real time) and batch (historical) modes.
  • Dataprep by Trifacta: Intelligent data service to visually explore, clean, and prepare structured and unstructured data for analysis.
  • Dataproc: Fast, easy-to-use, and fully managed cloud service to run Apache Spark and Apache Hadoop clusters.
  • Cloud Data Fusion: Fully managed data integration service that's built for the cloud and lets you build and manage ETL/ELT data pipelines. Cloud Data Fusion provides a graphical interface and a broad open source library of preconfigured connectors and transformations.
  • BigQuery: Fully managed, low-cost, serverless data warehouse that scales with your storage and compute needs. BigQuery is a columnar, ANSI SQL database that can analyze terabytes to petabytes of data.
  • Cloud Composer: Fully managed workflow orchestration service that lets you author, schedule, and monitor pipelines that span clouds and on-premises data centers.
  • Data Catalog: Fully managed and scalable metadata management service that helps you discover, manage, and understand all your data.
  • Looker Studio: Fully managed visual analytics service that can help you unlock insights from data through interactive dashboards.
  • Looker: Enterprise platform that connects, analyzes, and visualizes data across multi-cloud environments.
  • Dataform: Fully managed product to help you collaborate on, create, and deploy data pipelines, and ensure data quality.
  • Dataplex: Managed data lake service that centrally manages, monitors, and governs data across data lakes, data warehouses, and data marts using consistent controls.
  • Analytics Hub: Platform that efficiently and securely exchanges data analytics assets across your organization to address challenges of data reliability and cost.

Data lifecycle

When you create your system design, you can group the Google Cloud data analytics services around the general data movement in any system, or around the data lifecycle.

The data lifecycle includes the following stages and example services:

  • Data ingestion includes services such as Pub/Sub, Storage Transfer Service, and BigQuery Data Transfer Service.
  • Data storage includes services such as Cloud Storage, BigQuery, and Bigtable.
  • Processing and transformation includes services such as Dataflow, Dataproc, and BigQuery.
  • Analysis and visualization includes services such as BigQuery, Looker Studio, and Looker.

The following stages and services run across the entire data lifecycle:

  • Data integration includes services such as Data Fusion.
  • Metadata management and governance includes services such as Data Catalog.
  • Workflow management includes services such as Cloud Composer.

Data ingestion

Apply the following data ingestion best practices to your own environment.

Determine the data source for ingestion

Data typically comes from another cloud provider or service, or from an on-premises location.

Consider how you want to process your data after you ingest it. For example, Storage Transfer Service only writes data to a Cloud Storage bucket, and BigQuery Data Transfer Service only writes data to a BigQuery dataset. Cloud Data Fusion supports multiple destinations.

Identify streaming or batch data sources

Consider how you need to use your data and identify where you have streaming or batch use cases. For example, if you run a global streaming service that has low latency requirements, you can use Pub/Sub. If you need your data for analytics and reporting uses, you can stream data into BigQuery.
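
If your application already produces events in code, the following minimal sketch streams rows into BigQuery with the Python client library; the project, dataset, table, and field names are hypothetical, and the destination table is assumed to exist with a matching schema.

    from google.cloud import bigquery

    # Assumes the google-cloud-bigquery library is installed and credentials
    # are available in the environment.
    client = bigquery.Client()
    table_id = "my-project.analytics.page_views"  # hypothetical table

    rows = [
        {"user_id": "u-123", "page": "/home", "viewed_at": "2023-08-08T12:00:00Z"},
    ]

    # insert_rows_json streams the rows and returns a list of per-row errors.
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        print(f"Rows failed to insert: {errors}")

This legacy streaming API is the simplest path; for high-throughput pipelines, the BigQuery Storage Write API or a Dataflow job is typically a better fit.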

If you need to stream data from a system like Apache Kafka in an on-premises or other cloud environment, use the Kafka to BigQuery Dataflow template. For batch workloads, the first step is usually to ingest data into Cloud Storage. Use the gsutil tool or Storage Transfer Service to ingest data.

Ingest data with automated tools

Manually moving data from other systems into the cloud can be a challenge. If possible, use tools that let you automate the data ingestion process. For example, Cloud Data Fusion provides connectors and plugins to bring in data from external sources through a drag-and-drop GUI. If your teams want to write some code, Dataflow or BigQuery can help to automate data ingestion. Pub/Sub can help with either a low-code or a code-first approach. To ingest data into storage buckets, use gsutil for data sizes of up to 1 TB. To ingest amounts of data larger than 1 TB, use Storage Transfer Service.
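
As a code-first illustration of the Pub/Sub option, the following minimal sketch publishes a message with the Python client library; the project and topic names are hypothetical, and the topic is assumed to already exist.

    from google.cloud import pubsub_v1

    # Assumes the google-cloud-pubsub library is installed.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "ingest-events")  # hypothetical

    # Message payloads are bytes; attributes are optional key-value strings.
    future = publisher.publish(topic_path, b'{"order_id": 42}', source="webshop")
    print(f"Published message ID: {future.result()}")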

Use migration tools to ingest from another data warehouse

If you need to migrate from another data warehouse system, such as Teradata, Netezza, or Redshift, you can use the BigQuery Data Transfer Service migration assistance. The BigQuery Data Transfer Service also provides third-party transfers that help you ingest data on a schedule from external sources. For more information, see the detailed migration approaches for each data warehouse.

Estimate your data ingestion needs

The volume of data that you need to ingest helps you to determine which service to use in your system design. For streaming ingestion of data, Pub/Sub scales to tens of gigabytes per second. Capacity, storage, and regional requirements for your data help you to determine whether Pub/Sub Lite is a better option for your system design. For more information, see Choosing Pub/Sub or Pub/Sub Lite.

For batch ingestion of data, estimate how much data you want to transfer in total, and how quickly you want to do it. Review the available migration options, including an estimate on time and comparison of online versus offline transfers.
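
As a rough, illustrative calculation (the data size and bandwidth below are assumptions, not benchmarks), transferring 10 TB over a sustained 1 Gbps connection takes roughly a day, which can help you decide between online and offline transfer options:

    # Back-of-the-envelope estimate of online transfer time.
    data_tb = 10                  # total data to move (terabytes), assumed
    bandwidth_gbps = 1.0          # sustained usable bandwidth (gigabits/s), assumed

    data_gigabits = data_tb * 1000 * 8                  # TB -> gigabits (decimal units)
    transfer_hours = data_gigabits / bandwidth_gbps / 3600

    print(f"Estimated transfer time: {transfer_hours:.1f} hours")  # ~22.2 hours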

Use appropriate tools to regularly ingest data on a schedule

Storage Transfer Service and BigQuery Data Transfer Service both let you schedule ingestion jobs. For fine-grained control of the timing of ingestion, or of the source and destination systems, use a workflow-management system like Cloud Composer. If you want a more manual approach, you can use Cloud Scheduler and Pub/Sub to trigger a Cloud Function, as sketched after this paragraph.
If you want to manage the compute infrastructure yourself, you can use the gsutil command with cron for data transfers of up to 1 TB. If you use this manual approach instead of Cloud Composer, follow the best practices to script production transfers.
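
For the Cloud Scheduler and Pub/Sub approach, a minimal sketch of a Pub/Sub-triggered Cloud Function (2nd gen, Python) might look like the following; the function name is hypothetical and the transfer logic itself is only hinted at in a comment.

    import base64
    import functions_framework

    # Triggered by a Pub/Sub message that Cloud Scheduler publishes on a schedule.
    @functions_framework.cloud_event
    def start_ingestion(cloud_event):
        # The Pub/Sub payload arrives base64-encoded in the CloudEvent data.
        payload = base64.b64decode(cloud_event.data["message"]["data"]).decode("utf-8")
        print(f"Starting scheduled ingestion run: {payload}")
        # Hypothetical next step: start the actual transfer here, for example by
        # calling the Storage Transfer Service or BigQuery client libraries.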

Review FTP/SFTP server data ingest needs

If you need a code-free environment to ingest data from an FTP/SFTP server, you can use the FTP copy plugins. If you want to modernize and create a long-term workflow solution, Cloud Composer is a fully managed service that lets you read and write from various sources and sinks.
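
For example, in Cloud Composer (Apache Airflow), a minimal DAG sketch that copies files from an SFTP server into a Cloud Storage bucket could use the SFTPToGCSOperator from the Google provider package. The connection IDs, paths, and bucket name below are hypothetical, and the sketch assumes Airflow 2.4 or later with apache-airflow-providers-google installed.

    import pendulum
    from airflow import DAG
    from airflow.providers.google.cloud.transfers.sftp_to_gcs import SFTPToGCSOperator

    with DAG(
        dag_id="sftp_to_gcs_daily",
        schedule="@daily",
        start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
        catchup=False,
    ) as dag:
        copy_exports = SFTPToGCSOperator(
            task_id="copy_exports",
            sftp_conn_id="sftp_default",              # Airflow connection to the SFTP server
            source_path="/exports/*.csv",             # files on the SFTP server
            destination_bucket="my-landing-bucket",   # hypothetical bucket
            destination_path="sftp/exports/",
        )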

Use Apache Kafka connectors to ingest data

If you use Pub/Sub, Dataflow, or BigQuery, you can ingest data using one of the Apache Kafka connectors. For example, the open source Pub/Sub Kafka connector lets you ingest Kafka data into Pub/Sub or Pub/Sub Lite.

Data storage

Apply the following data storage best practices to your own environment.

Choose the appropriate data store for your needs

To help you choose what type of storage solution to use, review and understand the downstream usage of your data. The following common use cases for your data give recommendations for which Google Cloud product to use:

  • File-based: Filestore
  • Object-based: Cloud Storage
  • Low latency: Bigtable
  • Time series: Bigtable
  • Online cache: Memorystore
  • Transaction processing: Cloud SQL
  • Business intelligence (BI) and analytics: BigQuery
  • Batch processing: Cloud Storage. Bigtable if incoming data is time series and you need low-latency access to it. BigQuery if you use SQL.

Review your data structure needs

For most unstructured data, such as documents and text files, audio and video files, or logs, an object-based store is the most suitable choice. You can then load and process the data from object storage when you need it.

For semi-structured data, such as XML or JSON, your use cases and data access patterns help guide your choice. You can load such datasets into BigQuery for automatic schema detection. If you have low latency requirements, you can load your JSON data into Bigtable. If you have legacy requirements or your applications work with relational databases, you can also load datasets into a relational store.

For structured data, such as CSV, Parquet, Avro, or ORC, you can use BigQuery if you have BI and analytics requirements that use SQL. For more information, see how to batch load data. If you want to create a data lake on open standards and technologies, you can use Cloud Storage.
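
For example, a minimal sketch of a batch load job that loads newline-delimited JSON from Cloud Storage into BigQuery with schema auto-detection might look like the following; the bucket, dataset, and table names are hypothetical.

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # let BigQuery infer the schema from the data
    )

    # Start the load job from Cloud Storage and wait for it to finish.
    load_job = client.load_table_from_uri(
        "gs://my-bucket/raw/events-*.json",   # hypothetical source files
        "my-project.analytics.events",        # hypothetical destination table
        job_config=job_config,
    )
    load_job.result()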

Migrate data and reduce costs for HDFS

Look for ways to move Hadoop Distributed File System (HDFS) data from on-premises or from another cloud provider to a cheaper object-storage system. Cloud Storage is the most common choice that enterprises make as an alternative data store. For information about the advantages and disadvantages of this choice, see HDFS vs. Cloud Storage.

You can move data with a push or pull method. Both methods use the hadoop distcp command. For more information, see Migrating HDFS Data from On-Premises to Google Cloud.

You can also use the open source Cloud Storage connector to let Hadoop and Spark jobs access data in Cloud Storage. The connector is installed by default on Dataproc clusters, and can be manually installed on other clusters.
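
For example, on a Dataproc cluster (where the connector is preinstalled), a Spark job can read and write Cloud Storage paths directly. This PySpark sketch assumes a hypothetical bucket and Parquet dataset:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-to-gcs-example").getOrCreate()

    # gs:// paths resolve through the Cloud Storage connector, so existing
    # HDFS-style reads and writes work against object storage unchanged.
    events = spark.read.parquet("gs://my-data-lake/events/")  # hypothetical path
    events.groupBy("event_type").count().write.parquet(
        "gs://my-data-lake/summaries/event_counts/"
    )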

Use object storage to build a cohesive data lake

A data lake is a centralized repository designed to store, process, and secure large amounts of structured, semistructured, and unstructured data. You can use Cloud Composer and Cloud Data Fusion to build a data lake.

To build a modern data platform, you can use BigQuery as your central data source instead of Cloud Storage. BigQuery is a modern data warehouse with separation of storage and compute. A data lake built on BigQuery lets you perform traditional analytics from BigQuery in the Google Cloud console. It also lets you access the stored data from other frameworks such as Apache Spark.
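
For example, with the spark-bigquery-connector jar available on the cluster (for instance, added through the --jars option of spark-submit), a PySpark job can read a BigQuery table directly; the table name below is hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bigquery-read-example").getOrCreate()

    # Reads a BigQuery table into a Spark DataFrame through the connector.
    orders = (
        spark.read.format("bigquery")
        .option("table", "my-project.analytics.orders")  # hypothetical table
        .load()
    )
    orders.groupBy("country").count().show()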

Process and transform data

Apply the following data analytics best practices to your own environment when you process and transform data.

Explore the open source software you can use in Google Cloud

Many Google Cloud services use open source software to help make your transition seamless. Google Cloud offers managed and serverless solutions that have open APIs and are compatible with open source frameworks, which reduces vendor lock-in.

Dataproc is a Hadoop-compatible managed service that lets you host open source software with little operational burden. Dataproc includes support for Spark, Hive, Pig, Presto, and ZooKeeper. It also provides Hive Metastore as a managed service, which removes the metastore as a single point of failure in the Hadoop ecosystem.

You can migrate to