This document in the Google Cloud Well-Architected Framework: FSI perspective provides an overview of principles and recommendations to optimize the performance of your financial services industry (FSI) workloads in Google Cloud. The recommendations in this document align with the performance optimization pillar of the Well-Architected Framework.
Performance optimization has a long history in financial services. It has helped FSI organizations overcome technical challenges, and it has nearly always been an enabler or accelerator for new business models. For example, ATMs (introduced in 1967) automated cash dispensation and helped banks decrease the cost of their core business. Techniques like bypassing the OS kernel and pinning application threads to specific compute cores helped trading applications achieve deterministic, low latency. The reduction in latency facilitated higher and firmer liquidity with tighter spreads in the financial markets.
The cloud creates new opportunities for performance optimization. It also challenges some of the historically accepted optimization patterns. Specifically, the following trade-offs are more transparent and controllable in the cloud:
- Time to market versus cost.
- End-to-end performance at the system level versus performance at the node level.
- Talent availability versus agility of technology-related decision making.
For example, matching hardware and IT resources to specific skill requirements is straightforward in the cloud. To support GPU programming, you can easily create GPU-based VMs. You can scale capacity in the cloud to accommodate demand spikes without over-provisioning resources. This capability helps to ensure that your workloads can handle peak loads, such as on nonfarm payroll days, when trading volumes are significantly greater than historical levels. Instead of spending effort on highly optimized code at the level of individual servers (like finely tuned C code) or writing code for conventional high performance computing (HPC) environments, you can scale out optimally by using a well-architected Kubernetes-based distributed system.
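For example, the following sketch uses the Compute Engine Python client (google-cloud-compute) to create a GPU-based VM. This is a minimal illustration under stated assumptions, not a production template; the machine type, GPU model, boot image, and resource names are illustrative:

```python
from google.cloud import compute_v1

def create_gpu_vm(project_id: str, zone: str, instance_name: str) -> None:
    """Creates a VM with an attached NVIDIA GPU (illustrative settings)."""
    instance = compute_v1.Instance()
    instance.name = instance_name
    instance.machine_type = f"zones/{zone}/machineTypes/n1-standard-8"

    # Attach one NVIDIA T4 GPU to the instance.
    accelerator = compute_v1.AcceleratorConfig()
    accelerator.accelerator_type = f"zones/{zone}/acceleratorTypes/nvidia-tesla-t4"
    accelerator.accelerator_count = 1
    instance.guest_accelerators = [accelerator]

    # GPU instances must terminate (not live-migrate) on host maintenance.
    instance.scheduling = compute_v1.Scheduling(
        on_host_maintenance="TERMINATE", automatic_restart=True
    )

    # Boot disk based on a public Debian image.
    disk = compute_v1.AttachedDisk()
    disk.boot = True
    disk.auto_delete = True
    disk.initialize_params = compute_v1.AttachedDiskInitializeParams(
        source_image="projects/debian-cloud/global/images/family/debian-12",
        disk_size_gb=100,
    )
    instance.disks = [disk]

    # Default VPC network interface.
    instance.network_interfaces = [
        compute_v1.NetworkInterface(network="global/networks/default")
    ]

    client = compute_v1.InstancesClient()
    operation = client.insert(
        project=project_id, zone=zone, instance_resource=instance
    )
    operation.result()  # Block until the create operation completes.
```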
The performance optimization recommendations in this document are mapped to the following core principles:
- Align technology performance metrics with key business indicators
- Prioritize security without sacrificing performance for unproven risks
- Rethink your architecture to adapt to new opportunities and requirements
- Future-proof your technology to meet present and future business needs
Align technology performance metrics with key business indicators
You can map performance optimization to business-value outcomes in several ways. For example, in a buy-side research desk, a business objective could be to optimize the output per research hour or to prioritize experiments from teams that have a proven track record, such as consistently higher Sharpe ratios. On the sell side, you can use analytics to track client interest and accordingly prioritize throughput to the AI models that support the most interesting research.
Connecting performance goals to business key performance indicators (KPIs) is also important for funding performance improvements. Business innovation and transformation initiatives (sometimes called change-the-bank efforts) have different budgets and potentially different degrees of access to resources when compared to business-as-usual (BAU) or run-the-bank operations. For example, Google Cloud helped the risk management and technology teams of a global systemically important financial institution (G-SIFI) to collaborate with front-office quantitative analysts on a solution that performs risk analytics calculations (such as X-value adjustments, or XVA) in minutes instead of hours or days. This solution helped the organization meet relevant compliance requirements. It also enabled traders to have higher quality conversations with their clients, potentially offering tighter spreads, firmer liquidity, and more cost-effective hedging.
When you align your performance metrics with business indicators, consider the following recommendations:
- Connect each technology initiative to the relevant business objectives and key results (OKRs), such as increasing revenue or profit, reducing costs, and mitigating risk more efficiently or holistically.
- Focus on optimizing performance at the system level. Look beyond the conventional change-the-bank versus run-the-bank separation and the front-office versus back-office silos.
Prioritize security without sacrificing performance for unproven risks
Security and regulatory compliance in FSI organizations must be held to an unequivocally high standard. Maintaining that standard is essential to avoid losing clients and to prevent irreparable damage to an organization's brand. At the same time, the highest value is often derived from technology innovations such as generative AI and from unique managed services like Spanner. Don't automatically discard such technology options due to a blanket misconception about prohibitive operational risk or an inadequate regulatory compliance posture.
Google Cloud has worked closely with G-SIFIs to make sure that an AI-based approach for Anti-Money Laundering (AML) can be used across the jurisdictions where the institutions serve customers. For example, HSBC significantly enhanced the performance of its financial crime (Fincrime) unit with the following results:
- Two to four times more confirmed suspicious activity.
- Lower operational costs, achieved by eliminating over 60% of false positives and focusing investigation time only on high-risk, actionable alerts.
- Auditable and explainable outputs to support regulatory compliance.
Consider the following recommendations:
- Confirm that the products that you intend to use can help meet the security, resilience, and compliance requirements for the jurisdictions where you operate. To achieve this objective, work with Google Cloud account teams, risk teams, and product teams.
- Create more powerful models and provide transparency to customers by using AI explainability techniques such as Shapley value attribution, which can attribute a model's decisions to particular features at the input level (see the sketch after this list).
- Achieve transparency for generative AI workloads by using techniques like citations to sources, grounding, and retrieval-augmented generation (RAG).
- When explainability isn't enough, separate out the decision-making steps in your value streams and use AI to automate only the non-decision-making steps. In some cases, explainable AI might not be sufficient, or a process might require human intervention due to regulatory concerns (for example, GDPR Article 22). In such cases, present all the information that the human agent needs for decision making in a single control pane, but automate the data gathering, ingestion, manipulation, and summarization tasks.
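As a minimal sketch of Shapley value attribution, the following example uses the open source shap library with a scikit-learn model that's trained on synthetic transaction features. The feature names, labels, and model choice are illustrative assumptions, not a real AML model:

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic transaction features; real AML models use far richer inputs.
rng = np.random.default_rng(seed=42)
X = pd.DataFrame({
    "amount": rng.lognormal(mean=6, sigma=2, size=1000),
    "num_transfers_30d": rng.poisson(lam=5, size=1000),
    "cross_border": rng.integers(0, 2, size=1000),
})
# Toy label: flag large, frequent, cross-border activity.
y = (
    (X["amount"] > 2000)
    & (X["num_transfers_30d"] > 6)
    & (X["cross_border"] == 1)
).astype(int)

model = GradientBoostingClassifier().fit(X, y)

# Shapley values attribute each prediction to individual input features,
# which gives investigators a per-alert explanation of the model's output.
explainer = shap.Explainer(model, X)
shap_values = explainer(X.iloc[:5])

# Per-feature contributions for the first scored transaction.
for name, value in zip(X.columns, shap_values.values[0]):
    print(f"{name}: {value:+.3f}")
```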
Rethink your architecture to adapt to new opportunities and requirements
Augmenting your current architectures with cloud-based capabilities can provide significant value. To achieve more transformative outcomes, you need to periodically rethink your architecture by using a cloud-first approach.
To further optimize performance, consider the following recommendations when you periodically rethink the architecture of your workloads.
Use cloud-based alternatives to on-premises HPC systems and schedulers
To take advantage of higher elasticity, an improved security posture, and extensive monitoring and governance capabilities, you can run HPC workloads in the cloud or burst on-premises workloads to the cloud. For certain numerical modeling use cases, like simulating investment strategies or XVA modeling, combining Kubernetes with Kueue might offer an even more powerful solution.
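As a hedged sketch of this pattern, the following example uses the official Kubernetes Python client to submit a batch Job to a Kueue-managed queue. It assumes a cluster where Kueue is installed and a LocalQueue named risk-queue exists; the namespace, container image, and resource requests are illustrative:

```python
from kubernetes import client, config

def submit_simulation_job(namespace: str = "default") -> None:
    """Submits a batch Job for Kueue to queue and admit."""
    config.load_kube_config()  # Or load_incluster_config() inside a pod.

    container = client.V1Container(
        name="xva-sim",
        image="example.com/xva-simulator:latest",  # Illustrative image.
        resources=client.V1ResourceRequirements(
            requests={"cpu": "4", "memory": "8Gi"}
        ),
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(
            generate_name="xva-sim-",
            # Kueue picks up Jobs that are labeled with a target LocalQueue.
            labels={"kueue.x-k8s.io/queue-name": "risk-queue"},
        ),
        spec=client.V1JobSpec(
            # Create the Job suspended; Kueue unsuspends it on admission.
            suspend=True,
            parallelism=10,
            completions=10,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never", containers=[container]
                )
            ),
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)
```

Kueue admits the Job when quota is available, which lets you time-share cluster capacity across teams in a way that resembles a conventional HPC scheduler.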
Switch to graph-based programming for simulations
Monte Carlo simulations might be much more performant in a graph-based execution system such as Dataflow. For example, HSBC uses Dataflow to run risk calculations 16 times faster than with its previous approach.
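As a minimal sketch of the graph-based approach, the following Apache Beam pipeline (which Dataflow can execute) fans a Monte Carlo simulation out across workers as a data-parallel graph. The option-pricing payoff and parameters are illustrative assumptions:

```python
import math
import random

import apache_beam as beam

NUM_PATHS = 100_000
S0, STRIKE, RATE, VOL, T = 100.0, 105.0, 0.03, 0.2, 1.0

def simulate_path(seed: int) -> float:
    """Simulates one terminal price under geometric Brownian motion and
    returns the discounted payoff of a European call option."""
    rng = random.Random(seed)
    z = rng.gauss(0.0, 1.0)
    s_t = S0 * math.exp((RATE - 0.5 * VOL**2) * T + VOL * math.sqrt(T) * z)
    return math.exp(-RATE * T) * max(s_t - STRIKE, 0.0)

with beam.Pipeline() as pipeline:  # Pass DataflowRunner options in production.
    (
        pipeline
        | "Seeds" >> beam.Create(range(NUM_PATHS))
        | "Simulate" >> beam.Map(simulate_path)
        | "Average" >> beam.combiners.Mean.Globally()
        | "Print" >> beam.Map(print)
    )
```

Because each simulated path is independent, the runner can parallelize and autoscale the Simulate step to shorten wall-clock time without code changes.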
Run cloud-based exchanges and trading platforms
Conversations with Google Cloud customers reveal that the 80/20 Pareto principle applies to the performance requirements of markets and trading applications.
- More than 80% of trading applications don't need extremely low latency. However, they benefit significantly from the resilience, security, and elasticity capabilities of the cloud. For example, BidFX, a foreign exchange multi-dealer platform, uses the cloud to launch new products quickly and to significantly increase its availability and footprint without increasing resources.
- The remaining applications (less than 20%) need low latency (less than a millisecond), determinism, and fairness in the delivery of messages. Conventionally, these systems run in rigid and expensive colocation facilities. Increasingly, even this category of applications is being replatformed on the cloud, either at the edge or as cloud-first applications.
Future-proof your technology to meet present and future business needs
Historically, many FSI organizations built proprietary technologies to gain a competitive edge. For example, in the early 2000s, successful investment banks and trading firms had their own implementations of foundational technologies such as pub-sub systems and message brokers. With the evolution of open source technologies and the cloud, such technologies have become commodities and no longer offer incremental business value.
Consider the following recommendations to future-proof your technology.
Adopt a data-as-a-service (DaaS) approach for faster time to market and cost transparency
FSI organizations often evolve through a combination of organic growth and mergers and acquisitions (M&A). As a result, the organizations need to integrate disparate technologies. They also need to manage duplicate resources, such as data vendors, data licenses, and integration points. Google Cloud provides opportunities to create differentiated value in post-merger integrations.
For example, you can use services like BigQuery sharing to build an analysis-ready DaaS platform. The platform can provide both market data and inputs from alternative sources. This approach eliminates the need to build redundant data pipelines, and it lets you focus on more valuable initiatives. Furthermore, the merged or acquired companies can quickly and efficiently rationalize their post-merger data licensing and infrastructure needs. Instead of spending effort on adapting and merging legacy data estates and operations, the combined business can focus on new business opportunities.
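As an illustrative sketch of the DaaS consumption pattern, the following example queries a shared, analysis-ready dataset through the BigQuery Python client. The project, dataset, and table names are hypothetical placeholders for a dataset that's linked through BigQuery sharing:

```python
from google.cloud import bigquery

# The client authenticates with Application Default Credentials.
client = bigquery.Client(project="my-analytics-project")  # Hypothetical project.

# Hypothetical linked dataset that a DaaS platform exposes through
# BigQuery sharing; consumers query it without building pipelines.
query = """
    SELECT ticker, trade_date, close_price
    FROM `shared-market-data.equities.daily_prices`
    WHERE trade_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    ORDER BY trade_date
"""

for row in client.query(query).result():
    print(row.ticker, row.trade_date, row.close_price)
```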
Build an abstraction layer to isolate existing systems and address emerging business models
Increasingly, the competitive advantage for banks isn't the core banking system but the customer experience layer. However, legacy banking systems often use monolithic applications that were developed in languages like COBOL and are integrated across the entire banking value chain. This integration makes it difficult to separate the layers of the value chain, so upgrading and modernizing such systems has been nearly impossible.
One solution to this challenge is to use an isolation layer, such as an API management system, or a staging layer like Spanner that duplicates the book of record and facilitates the modernization of services with advanced analytics and AI. For example, Deutsche Bank used Spanner to isolate its legacy core banking estate and start its innovation journey.
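As a hedged sketch of the staging-layer pattern, the following example reads replicated book-of-record data from Spanner by using the google-cloud-spanner Python client. The instance, database, table, and column names are hypothetical:

```python
from google.cloud import spanner

# Hypothetical Spanner instance and database that stage a copy of the
# core banking book of record for modern services to read.
client = spanner.Client()
instance = client.instance("core-banking-staging")
database = instance.database("book-of-record")

# Serve reads for a modern customer-experience service without
# touching the legacy core banking system.
with database.snapshot() as snapshot:
    results = snapshot.execute_sql(
        "SELECT account_id, balance, currency "
        "FROM Accounts WHERE customer_id = @customer_id",
        params={"customer_id": "C12345"},
        param_types={"customer_id": spanner.param_types.STRING},
    )
    for account_id, balance, currency in results:
        print(account_id, balance, currency)
```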