Optimize your cloud resources

Before your peak capacity event occurs, manage and optimize the resources that are used by your Google Cloud workloads. This involves right-sizing resources based on actual usage and demand, using autoscaling for dynamic resource allocation, and reviewing architecture and security recommendations. Both Cloud Monitoring and Recommender (Active Assist) can help you identify opportunities to optimize your cloud resources. By using these tools, you can gain insights into resource usage and make informed decisions before your event.

Review Google Cloud best practices

Many peak capacity event issues can be avoided by following the recommended best practices for the Google Cloud product that you are using. The following are examples of some best practice guides:

General	Google Cloud security best practices center; Patterns for scalable and resilient apps
BigQuery	Optimize query computation
Cloud Storage	Best practices for Cloud Storage
Compute Engine	Best practices for the Compute Engine API; Choose a Compute Engine deployment strategy for your workload; Designing resilient systems
Dataflow	Dataflow pipeline best practices; Best practices for highly parallel workflows
Dataproc	Dataproc best practices for production
Google Kubernetes Engine	Best practices for onboarding to GKE; Best practices for GKE networking; Best practices for running batch workloads on GKE; About GKE scalability

Review scalability

Autoscaling helps ensure that your cloud-based applications have the resources that they need to handle varying workloads, while avoiding overprovisioning and unnecessary costs. Google Cloud offers several product-specific autoscaling options, including the following:

Compute Engine managed instance groups (MIGs) are groups of VMs that are managed and scaled as a single entity. With MIGs, you can define autoscaling policies that specify the minimum and maximum number of VMs to maintain in the group, and the conditions that trigger autoscaling.
Google Kubernetes Engine (GKE) autoscaling dynamically adjusts your cluster resources to match your application's needs. It offers tools that can optimize resource utilization, ensure application performance, and simplify cluster management.
Cloud Run offers built-in autoscaling, which automatically adjusts the number of instances based on the incoming traffic.
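
To make the idea of a minimum, a maximum, and a scaling condition concrete, the following is a minimal sketch of target-utilization scaling. It is a simplified illustration, not the actual algorithm used by any Google Cloud autoscaler. The function name `recommended_size` and all of its parameters are hypothetical.

```python
import math


def recommended_size(current_size, current_utilization, target_utilization,
                     min_size, max_size):
    """Estimate a group size that brings average utilization back to target.

    Hypothetical helper for illustration only. Utilization values are
    fractions, for example 0.75 for 75% CPU.
    """
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")

    # Scale the group in proportion to how far utilization is from target.
    raw = math.ceil(current_size * current_utilization / target_utilization)

    # Clamp to the configured bounds, as an autoscaling policy would.
    return max(min_size, min(max_size, raw))


# 8 VMs at 75% utilization with a 50% target suggests growing to 12 VMs.
print(recommended_size(8, 0.75, 0.5, min_size=2, max_size=20))
```

If the maximum in this sketch is set lower than the size that the event requires, the result is capped at that maximum regardless of load, which is one reason to review your limits before the event.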

Before your event, we recommend that you scale up manually. Even if you have autoscaling configured, event traffic can ramp up faster than autoscaling can add capacity. Pre-warm resources ahead of time, including the following:

Virtual machines
Caches, if you want to preload data
Serverless components to prevent cold starts
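
As a rough way to decide how much capacity to pre-warm, the following sketch estimates an instance count from an expected peak request rate. It assumes that you know the approximate throughput of one instance, for example from load testing. The function name, the parameters, and the 25% default headroom are illustrative assumptions, not recommended values.

```python
import math


def prewarm_instance_count(expected_peak_rps, rps_per_instance, headroom=0.25):
    """Estimate how many instances to start before the event.

    Hypothetical helper for illustration only.
    expected_peak_rps: forecast peak requests per second.
    rps_per_instance: measured sustainable requests per second per instance.
    headroom: extra capacity as a fraction of the forecast (assumed value).
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    if headroom < 0:
        raise ValueError("headroom must not be negative")

    # Round up so that the forecast plus headroom is always covered.
    return math.ceil(expected_peak_rps * (1 + headroom) / rps_per_instance)


# A forecast of 10,000 RPS at 250 RPS per instance with 25% headroom.
print(prewarm_instance_count(10_000, 250))
```

You could use the result as the size to scale to manually, or as the minimum of an autoscaling policy, so that autoscaling handles only the traffic above your forecast.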

Review Active Assist recommendations

Active Assist refers to the portfolio of tools used in Google Cloud to generate recommendations and insights to help you optimize your Google Cloud projects. For more information, see What is Active Assist.

Review your product versions

Ensure that all of your cloud products and services are up to date with their latest stable versions.

Review alerts and dashboards

Proactively identify and address issues by evaluating the alerts and dashboards provided to you through Google Cloud Observability tools and third-party solutions.

Check your Google Cloud Observability metrics, logs, and traces to gain insights into resource utilization, performance characteristics, and the overall health of your resources. Monitor important metrics that align with system health indicators such as CPU utilization, memory usage, network traffic, disk I/O, and application response times. You should also consider business-specific metrics. By tracking these metrics, you can identify potential bottlenecks, performance issues, and resource constraints. Additionally, you can set up alerts to notify relevant teams proactively about potential issues or anomalies.

For alerts, focus on critical metrics, set appropriate thresholds to minimize alert fatigue, and ensure timely responses to significant issues. This targeted approach lets you proactively maintain workload reliability. For more information, see the Alerting overview.
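
One common way to reduce alert fatigue is to require that a metric stays above its threshold for several consecutive samples before an alert fires, so that a brief spike doesn't notify anyone. The following is a minimal sketch of that idea in plain Python. It doesn't call any Cloud Monitoring API, and the function name and parameters are hypothetical.

```python
def breaches_for_duration(samples, threshold, min_consecutive):
    """Return True if the most recent samples all exceed the threshold.

    Hypothetical helper for illustration only.
    samples: metric values ordered from oldest to newest.
    threshold: value that a sample must exceed to count as a breach.
    min_consecutive: number of trailing samples that must all breach.
    """
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        # Not enough data yet to confirm a sustained breach.
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


# A single spike above 90% CPU does not trigger the alert...
print(breaches_for_duration([0.50, 0.95, 0.60, 0.70], 0.90, 3))
# ...but three consecutive samples above 90% do.
print(breaches_for_duration([0.70, 0.92, 0.95, 0.97], 0.90, 3))
```

A longer window produces fewer false alarms but delays notification, so choose it based on how quickly your team needs to respond during the event.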

What's next

Conduct load testing