This document in the Google Cloud Architecture Framework shows you how to assess toil and mitigate its impacts on your systems and your teams.
Toil is manual and repetitive work with no enduring value, and it increases as a service grows. Continually aim to reduce or eliminate toil. Otherwise, operational work can eventually overwhelm operators, and any growth in product use or complexity can require additional staffing.
Automation is a key way to minimize toil. Automation also improves release velocity and helps minimize human-induced errors.
For more information, see Eliminating Toil.
Create an inventory and assess the cost of toil
Start by creating an inventory of toil and assessing its cost to the teams that manage your systems. Make this a continuous process, and then invest in customized automation that extends what Google Cloud services and partners already provide. You can often modify Google Cloud's own automation, for example, Compute Engine's autoscaler.
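As a rough illustration of what such an inventory could look like, the following Python sketch records each recurring manual task and estimates its monthly cost in operator hours. The `ToilItem` type, the function names, the task names, and all of the numbers are hypothetical and exist only for this example. They are not part of any Google Cloud API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToilItem:
    """One recurring manual task (all fields are illustrative assumptions)."""
    name: str
    runs_per_month: int      # how often the task is performed
    minutes_per_run: float   # hands-on operator time for a single run


def monthly_hours(item: ToilItem) -> float:
    """Operator hours that one task consumes per month."""
    return item.runs_per_month * item.minutes_per_run / 60


def toil_fraction(items: list[ToilItem], team_hours_per_month: float) -> float:
    """Share of the team's monthly capacity that is spent on toil."""
    if team_hours_per_month <= 0:
        raise ValueError("team_hours_per_month must be positive")
    return sum(monthly_hours(i) for i in items) / team_hours_per_month


# Hypothetical inventory; the tasks and numbers are made up for illustration.
inventory = [
    ToilItem("rotate service account keys", runs_per_month=4, minutes_per_run=30),
    ToilItem("resize instance group by hand", runs_per_month=20, minutes_per_run=15),
    ToilItem("restart stuck batch job", runs_per_month=12, minutes_per_run=10),
]

for item in inventory:
    print(f"{item.name}: {monthly_hours(item):.1f} h/month")
print(f"toil share: {toil_fraction(inventory, team_hours_per_month=320):.1%}")
```

Re-running a calculation like this on a regular schedule is one way to keep the assessment continuous and to see whether toil grows as the service grows.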
Prioritize eliminating toil
Automation is useful but isn't a solution to all operational problems. As a first step in addressing known toil, we recommend reviewing your inventory of existing toil and prioritizing the elimination of as much toil as you can. Then, you can focus on automation.
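One possible way to turn this ordering into a work plan is sketched below: tasks that are no longer needed go on an "eliminate" list, the rest go on an "automate" list, and each list is sorted so that the costliest task comes first. The `Task` type, the `still_needed` flag, and the sample tasks are assumptions made for this example.

```python
from typing import NamedTuple


class Task(NamedTuple):
    name: str
    hours_per_month: float
    still_needed: bool  # False if the task can simply be dropped


def plan(tasks: list[Task]) -> dict[str, list[str]]:
    """Split tasks into 'eliminate' and 'automate', costliest first."""
    by_cost = sorted(tasks, key=lambda t: t.hours_per_month, reverse=True)
    return {
        "eliminate": [t.name for t in by_cost if not t.still_needed],
        "automate": [t.name for t in by_cost if t.still_needed],
    }


# Hypothetical tasks; names and numbers are illustrative only.
tasks = [
    Task("email weekly report nobody reads", 3.0, still_needed=False),
    Task("resize instance group by hand", 5.0, still_needed=True),
    Task("clean up unused test project", 1.0, still_needed=False),
    Task("rotate service account keys", 2.0, still_needed=True),
]

print(plan(tasks))
```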
Automate necessary toil
Some toil in your systems cannot be eliminated. As a second step in addressing known toil, automate this toil using the solutions that Google Cloud provides through configurable automation.
The following are some areas where configurable automation or customized automation can assist your organization in eliminating toil:
- Identity management—for example, Cloud Identity and Identity and Access Management.
- Google Cloud hosted solutions, as opposed to self-designed solutions—for example, cluster management (Google Kubernetes Engine (GKE)), relational database management (Cloud SQL), data warehouse management (BigQuery), and API management (Apigee).
- Google Cloud services and tenant provisioning—for example, Terraform and Cloud Foundation Toolkit.
- Automated workflow orchestration for multi-step operations—for example, Cloud Composer.
- Additional capacity provisioning—for example, several Google Cloud products, like Compute Engine and GKE, offer configurable autoscaling. Evaluate the Google Cloud services you are using to determine if they include configurable autoscaling.
- CI/CD pipelines with automated deployment—for example, Cloud Build.
- Canary analysis to validate deployments.
- Automated model training (for machine learning)—for example, AutoML.
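To make the canary analysis item in the list above more concrete, the following is a minimal sketch of one possible check: it compares the error rate of a canary deployment with that of the baseline and fails the canary if the rate exceeds an allowed threshold. The function name, the default thresholds, and the request counts are assumptions for illustration. This is a simple threshold comparison, not a statistical test, and it does not represent how any particular Google Cloud product performs canary analysis.

```python
def canary_verdict(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    max_ratio: float = 1.5,    # assumed tolerance: canary may be 1.5x baseline
    margin: float = 0.001,     # assumed absolute slack for very low baselines
    min_requests: int = 500,   # assumed minimum traffic before judging
) -> str:
    """Return 'pass', 'fail', or 'insufficient data' for a canary."""
    if baseline_requests < min_requests or canary_requests < min_requests:
        return "insufficient data"
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    allowed = baseline_rate * max_ratio + margin
    return "pass" if canary_rate <= allowed else "fail"


# Hypothetical request counts gathered during a rollout.
print(canary_verdict(50, 10_000, 6, 1_000))
print(canary_verdict(50, 10_000, 20, 1_000))
print(canary_verdict(50, 10_000, 1, 100))
```

A deployment pipeline could call a check like this after the canary has served enough traffic, and then promote or roll back the release based on the verdict.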
If a Google Cloud product or service only partially satisfies your technical needs when automating or eliminating manual workflows, consider filing a feature request through your Google Cloud account representative. Your issue might be a priority for other customers or already a part of our roadmap. If so, knowing the feature's priority and timeline helps you to better assess the trade-offs of building your own solution versus waiting to use a Google Cloud feature.
Build or buy solutions for high-cost toil
As a third step, which you can complete in parallel with the first and second steps, evaluate whether to build or buy other solutions if your toil cost stays high, for example, if toil takes a significant amount of time for any team that manages your production systems.
When building or buying solutions, consider integration, security, privacy, and compliance costs. Designing and implementing your own automation comes with maintenance costs and risks to reliability beyond its initial development and setup costs, so consider this option as a last resort.
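The trade-off described above can be roughed out numerically. The sketch below compares the total cost of keeping a manual process, buying a solution, and building one over a chosen time horizon. The `Option` type and every cost figure are hypothetical. In practice, the monthly cost would need to include the maintenance, integration, security, privacy, and compliance costs mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    upfront_cost: float   # development, purchase, or setup cost
    monthly_cost: float   # maintenance, licences, or remaining toil


def total_cost(option: Option, months: int) -> float:
    """Total cost of an option over the given number of months."""
    return option.upfront_cost + option.monthly_cost * months


def cheapest(options: list[Option], months: int) -> Option:
    """Option with the lowest total cost over the horizon."""
    return min(options, key=lambda o: total_cost(o, months))


# Hypothetical figures, in an arbitrary currency unit.
options = [
    Option("keep manual process", upfront_cost=0, monthly_cost=4_000),
    Option("buy", upfront_cost=10_000, monthly_cost=2_500),
    Option("build", upfront_cost=60_000, monthly_cost=1_000),
]

for horizon in (6, 24, 60):
    best = cheapest(options, horizon)
    print(f"{horizon} months: {best.name} ({total_cost(best, horizon):,.0f})")
```

With these made-up numbers, building only pays off over the longest horizon, which is consistent with treating it as a last resort. A cost comparison like this does not capture the reliability risk of custom automation, so that risk needs to be weighed separately.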
What's next
Explore other categories in the Architecture Framework, such as system design; security, privacy, and compliance; reliability; cost optimization; and performance optimization.