Build a hybrid render farm

Last reviewed 2024-01-09 UTC

This document provides guidance on extending your existing, on-premises render farm to use compute resources on Google Cloud. The document assumes that you have already implemented a render farm on-premises and are familiar with the basic concepts of visual effects (VFX) and animation pipelines, queue management software, and common software licensing methods.

Overview

Rendering 2D or 3D elements for animation, film, commercials, or video games is both compute- and time-intensive. Rendering these elements requires a substantial investment in hardware and infrastructure along with a dedicated team of IT professionals to deploy and maintain hardware and software.

When an on-premises render farm is at 100-percent utilization, managing jobs can become a challenge. Task priorities and dependencies, restarting dropped frames, and network, disk, and CPU load all become part of the complex equation that you must closely monitor and control, often under tight deadlines.

To manage these jobs, VFX facilities have incorporated queue management software into their pipelines. Queue management software can:

  • Deploy jobs to on-premises and cloud-based resources.
  • Manage inter-job dependencies.
  • Communicate with asset management systems.
  • Provide users with a user interface and APIs for common languages such as Python.

While some queue management software can deploy jobs to cloud-based workers, you are still responsible for connecting to the cloud, synchronizing assets, choosing a storage framework, managing image templates, and providing your own software licensing.

The following options are available for building and managing render pipelines and workflows in a cloud or hybrid cloud environment:

  • If you don't already have on-premises or cloud resources, you can use a software as a service (SaaS) cloud-based render service such as Conductor.
  • If you want to manage your own infrastructure, you can build and deploy the cloud resources described in this document.
  • If you want to build a custom workflow based on your specific requirements, you can work with Google Cloud service integrator partners like Gunpowder or AppsBroker. This option has the benefit of running all the cloud services in your own secure Google Cloud environment.

To help determine the ideal solution for your facility, contact your Google Cloud representative.

Note: Production notes appear periodically throughout this document. These notes offer best practices to follow as you build your render farm.

Connecting to the cloud

Depending on your workload, decide how your facility connects to Google Cloud, whether through a partner ISP, a direct connection, or over the public internet.

Connecting over the internet

Without any special connectivity, you can connect to Google's network and use our end-to-end security model by accessing Google Cloud services over the internet. Utilities such as the gcloud and gsutil command-line tools and resources such as the Compute Engine API all use secure authentication, authorization, and encryption to help safeguard your data.
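As a minimal sketch of this workflow, the following commands authenticate a service account and copy rendered frames to Cloud Storage over the internet; the key file, bucket, and paths are placeholders, and all traffic is carried over TLS:

```shell
# Authenticate as a service account (the key file path is a placeholder).
gcloud auth activate-service-account --key-file=render-farm-sa.json

# Copy a frame sequence to a Cloud Storage bucket.
# Bucket and object paths shown here are illustrative.
gsutil -m cp renders/shot010/*.exr gs://my-studio-renders/shot010/
```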

Cloud VPN

No matter how you're connected, we recommend that you use a virtual private network (VPN) to secure your connection.

Cloud VPN helps you securely connect your on-premises network to your Google Virtual Private Cloud (VPC) network through an IPsec VPN connection. Data that is in transit gets encrypted before it passes through one or more VPN tunnels.

Learn how to create a VPN for your project.
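As an illustration, the following commands sketch an HA VPN setup with gcloud. All names, the region, the ASN, and the peer address are placeholders; a working configuration also requires BGP sessions on the Cloud Router:

```shell
# Create an HA VPN gateway on the VPC network.
gcloud compute vpn-gateways create render-vpn-gw \
    --network=render-vpc --region=us-central1

# Describe the on-premises peer gateway (placeholder address).
gcloud compute external-vpn-gateways create on-prem-gw \
    --interfaces=0=203.0.113.10

# Create a Cloud Router for dynamic routing.
gcloud compute routers create render-router \
    --network=render-vpc --region=us-central1 --asn=65001

# Create the IPsec tunnel to the on-premises gateway.
gcloud compute vpn-tunnels create render-tunnel-0 \
    --region=us-central1 \
    --vpn-gateway=render-vpn-gw \
    --peer-external-gateway=on-prem-gw \
    --peer-external-gateway-interface=0 \
    --interface=0 \
    --router=render-router \
    --ike-version=2 \
    --shared-secret=[SECRET]
```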

Customer-supplied VPN

Although you can set up your own VPN gateway to connect directly with Google, we recommend using Cloud VPN, which offers more flexibility and better integration with Google Cloud.

Cloud Interconnect

Google supports multiple ways to connect your infrastructure to Google Cloud. These enterprise-grade connections, known collectively as Cloud Interconnect, offer higher availability and lower latency than standard internet connections, along with reduced egress pricing.

Cross-Cloud Interconnect lets you establish high-bandwidth, dedicated connectivity to Google Cloud for your data in another cloud. Doing so reduces network complexity, reduces data transfer costs, and enables high-throughput, multicloud render farms.

Dedicated Interconnect

Dedicated Interconnect provides direct physical connections and RFC 1918 communication between your on-premises network and Google's network. It delivers connection capacity over the following types of connections:

  • One or more 10 Gbps Ethernet connections, with a maximum of eight connections or 80 Gbps total per interconnect.
  • One or more 100 Gbps Ethernet connections, with a maximum of two connections or 200 Gbps total per interconnect.

Dedicated Interconnect traffic is not encrypted. If you need to transmit data across Dedicated Interconnect in a secure manner, you must establish your own VPN connection. Cloud VPN is not compatible with Dedicated Interconnect, so you must supply your own VPN in this case.

Partner Interconnect

Partner Interconnect provides connectivity between your on-premises network and your VPC network through a supported service provider. A Partner Interconnect connection is useful if your infrastructure is in a physical location that can't reach a Dedicated Interconnect colocation facility or if your data needs don't warrant an entire 10-Gbps connection.

Other connection types

Other ways to connect to Google might be available in your specific location. For help in determining the best and most cost-effective way to connect to Google Cloud, contact your Google Cloud representative.

Securing your content

To run their content on any public cloud platform, content owners like major Hollywood studios require vendors to comply with security best practices that are defined both internally and by organizations such as the MPAA. Google Cloud offers zero-trust security models that are built into products like Google Workspace, BeyondCorp Enterprise, and BeyondProd.

Each studio has different requirements for securing rendering workloads. You can find security whitepapers and compliance documentation at cloud.google.com/security.

If you have questions about the security compliance audit process, contact your Google Cloud representative.

Organizing your projects

Projects are a core organizational component of Google Cloud. In your facility, you can organize jobs under a single project or break them apart into multiple projects. For example, you might want to create separate projects for the previsualization, research and development, and production phases of a film.

Projects establish an isolation boundary for both network data and project administration. However, you can share networks across projects with Shared VPC, which provides separate projects with access to common resources.

Production notes: Create a Shared VPC host project that contains resources with all your production tools. You can designate all projects that are created under your organization as Shared VPC service projects. This designation means that any project in your organization can access the same libraries, scripts, and software that the host project provides.
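The Shared VPC setup that this note describes can be sketched with two commands; the project IDs are placeholders:

```shell
# Designate the production-tools project as the Shared VPC host project.
gcloud compute shared-vpc enable render-host-project

# Attach a service project so that it can use the host project's network.
gcloud compute shared-vpc associated-projects add show-abc-project \
    --host-project=render-host-project
```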

The Organization resource

You can manage projects under an Organization resource, which you might have established already. Migrating all your projects into an organization provides a number of benefits.

Production notes: Designate production managers as owners of their individual projects and studio management as owners of the Organization resource.

Defining access to resources

Projects require secure access to resources coupled with restrictions on where users or services are permitted to operate. To help you define access, Google Cloud offers Identity and Access Management (IAM), which you can use to manage access control by defining which roles have what levels of access to which resources.

Production notes: To restrict users' access to only the resources that are necessary to perform specific tasks based on their role, implement the principle of least privilege both on premises and in the cloud.

For example, consider a render worker, which is a virtual machine (VM) that you can deploy from a predefined instance template that uses your custom image. The render worker that is running under a service account can read from Cloud Storage and write to attached storage, such as a cloud filer or persistent disk. However, you don't need to add individual artists to Google Cloud projects at all, because they don't need direct access to cloud resources.

You can assign render wranglers or project administrators roles that grant access to all Compute Engine resources, which permits them to perform functions on resources that are inaccessible to other users.

Define a policy to determine which roles can access which types of resources in your organization. The following table shows how typical production tasks map to IAM roles in Google Cloud.

Production task | Role name | Resource type
Studio manager | resourcemanager.organizationAdmin | Organization, Project
Production manager | owner, editor | Project
Render wrangler | compute.admin, iam.serviceAccountActor | Project
Queue management account | compute.admin, iam.serviceAccountActor | Organization, Project
Individual artist | [no access] | Not applicable
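As a sketch, a role such as one of those listed for a render wrangler can be granted with gcloud; the project ID and user are placeholders:

```shell
# Grant a render wrangler the Compute Admin role on a project.
gcloud projects add-iam-policy-binding render-project \
    --member=user:wrangler@example.com \
    --role=roles/compute.admin
```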

Access scopes

Access scopes offer you a way to control the permissions of a running instance no matter who is logged in. You can specify scopes when you create an instance yourself or when your queue management software deploys resources from an instance template.

Scopes take precedence over the IAM permissions of an individual user or service account. This precedence means that an access scope can prevent a project administrator from signing in to an instance to delete a storage bucket or change a firewall setting.

Production notes: By default, instances can read but not write to Cloud Storage. If your render pipeline writes finished renders back to Cloud Storage, add the scope devstorage.read_write to your instance at the time of creation.
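The following command sketches how to create a render worker with that scope; the instance name, zone, machine type, image family, and service account are placeholders:

```shell
# Create a render worker whose scope allows writes to Cloud Storage.
gcloud compute instances create render-worker-001 \
    --zone=us-central1-a \
    --machine-type=n2-standard-32 \
    --image-family=my-render-image-family \
    --service-account=render-sa@render-project.iam.gserviceaccount.com \
    --scopes=https://www.googleapis.com/auth/devstorage.read_write
```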

Choosing how to deploy resources

With cloud rendering, you can use resources only when they're needed, and you can choose from a number of ways to make those resources available to your render farm.

Deploy on demand

For optimal resource usage, you can choose to deploy render workers only when you send a job to the render farm. You can deploy many VMs to be shared across all frames in a job, or even create one VM per frame.

Your queue management system can monitor running instances, requeue tasks if a VM is preempted, and terminate instances when their individual tasks are completed.
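A queue manager might implement this on-demand pattern with commands like the following; the template, instance name, and zone are placeholders:

```shell
# Create a worker from a predefined instance template when a job
# is dispatched.
gcloud compute instances create render-worker-job42-001 \
    --zone=us-central1-a \
    --source-instance-template=render-worker-template

# Delete the worker when its task completes.
gcloud compute instances delete render-worker-job42-001 \
    --zone=us-central1-a --quiet
```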

Deploy a pool of resources

You can also choose to deploy a group of instances, unrelated to any specific job, that your on-premises queue management system can access as additional resources. If you use Google Cloud's Spot VMs, a group of running instances can accept multiple jobs per VM, using all cores and maximizing resource usage. This approach might be the most straightforward strategy to implement because it mimics how an on-premises render farm is populated with jobs.
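A pool of Spot VMs can be sketched as a managed instance group; all names, the machine type, zone, and pool size are placeholders:

```shell
# Create a template that uses Spot VMs.
gcloud compute instance-templates create spot-render-template \
    --machine-type=n2-standard-32 \
    --provisioning-model=SPOT \
    --instance-termination-action=DELETE

# Create a managed instance group that serves as a shared render pool.
gcloud compute instance-groups managed create render-pool \
    --zone=us-central1-a \
    --template=spot-render-template \
    --size=50
```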

Licensing the software

Third-party software licensing can vary widely from package to package. The following table lists licensing schemes and models that you might encounter in a VFX pipeline. For each scheme, the third column shows the recommended licensing approach.

Scheme | Description | Recommendation
Node locked | Licensed to a specific MAC address, IP address, or CPU ID. Can be run only by a single process. | Instance based
Node based | Licensed to a specific node (instance). An arbitrary number of users or processes can run on a licensed node. | Instance based
Floating | Checked out from a license server that keeps track of usage. | License server

Software licensing:
Interactive | Allows a user to run software interactively in a graphics-based environment. | License server or instance based
Batch | Allows a user to run software only in a command-line environment. | License server

Cloud-based licensing:
Usage based | Checked out only when a process runs on a cloud instance. When the process finishes or terminates, the license is released. | Cloud-based license server
Uptime based | Checked out while an instance is active and running. When the instance is stopped or deleted, the license is released. | Cloud-based license server

Using instance-based licensing

Some software programs or plugins are licensed directly to the hardware on which they run. This approach to licensing can present a problem in the cloud, where hardware identifiers such as MAC or IP addresses are assigned dynamically.

MAC addresses

Instances are assigned a MAC address when they are created, and that address is retained as long as the instance is not deleted. You can stop or restart an instance, and the MAC address is retained. You can use this MAC address for license creation and validation until the instance is deleted.

Assigning a static IP address

When you create an instance, it is assigned an internal and, optionally, an external IP address. To retain an instance's external IP address, you can reserve a static IP address and assign it to your instance. This IP address will be reserved only for this instance. Because static IP addresses are a project-based resource, they are subject to regional quotas.

You can also assign an internal IP address when you create an instance, which is helpful if you want the internal IP addresses of a group of instances to fall within the same range.
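The following commands sketch both cases: reserving a static external address and assigning a fixed internal address at creation time. The names, region, zone, and IP range are placeholders:

```shell
# Reserve a static external IP address in the instance's region.
gcloud compute addresses create license-server-ip --region=us-central1

# Create the instance with the reserved external address and a
# specific internal address.
gcloud compute instances create license-server \
    --zone=us-central1-a \
    --address=license-server-ip \
    --private-network-ip=10.0.0.10
```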

Hardware dongles

Older software might still be licensed through a dongle, a hardware key that is programmed with a product license. Most software companies have stopped using hardware dongles, but some users might have legacy software that is keyed to one of these devices. If you encounter this problem, contact the software manufacturer to see if they can provide you with an updated license for your particular software.

If the software manufacturer cannot provide such a license, you could implement a network-attached USB hub or USB over IP solution.

Using a license server

Most modern software offers a floating license option. This option makes the most sense in a cloud environment, but it requires stronger license management and access control to prevent overconsumption of a limited number of licenses.

To help avoid exceeding your license capacity, you can, as part of your job queue process, choose which licenses to use and control the number of jobs that use them.

On-premises license server

You can use your existing, on-premises license server to provide licenses to instances that are running in the cloud. If you choose this method, you must provide a way for your render workers to communicate with your on-premises network, either through a VPN or some other secure connection.

Cloud-based license server

In the cloud, you can run a license server that serves instances in your project or even across projects by using Shared VPC. Floating licenses are sometimes linked to a hardware MAC address, so a small, long-running instance with a static IP address can easily serve licenses to many render instances.

Hybrid license server

Some software can use multiple license servers in a prioritized order. For example, a renderer might query the number of licenses that are available from an on-premises server, and if none are available, use a cloud-based license server. This strategy can help maximize your use of permanent licenses before you check out other license types.

Production notes: Define one or more license servers in an environment variable and define the order of priority; Autodesk Arnold, a popular renderer, helps you do this. If the job cannot acquire a license by using the first server, the job tries to use any other servers that are listed, as in the following example:

export solidangle_LICENSE=5053@x.x.0.1;5053@x.x.0.2

In the preceding example, the Arnold renderer tries to obtain a license from the server at x.x.0.1, port 5053. If that attempt fails, it then tries to obtain a license from the same port at the IP address x.x.0.2.
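The failover order in that variable can be sketched in shell: the list is split on semicolons, and each port@host entry is tried in turn. The license query itself is application-specific and omitted here; this only demonstrates the parsing and ordering (the snippet assumes bash):

```shell
# Split the semicolon-separated license server list and show the
# order in which servers would be tried.
solidangle_LICENSE="5053@x.x.0.1;5053@x.x.0.2"

IFS=';' read -ra servers <<< "$solidangle_LICENSE"
for server in "${servers[@]}"; do
  port="${server%@*}"   # text before '@' is the port
  host="${server#*@}"   # text after '@' is the host
  echo "trying ${host}:${port}"
done
```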

Cloud-based licensing

Some vendors offer cloud-based licensing that provides software licenses on demand for your instances. Cloud-based licensing is generally billed in two ways: usage based and uptime based.

Usage-based licensing

Usage-based licensing is billed based on how much time the software is in use. Typically with this type of licensing, a license is checked out from a cloud-based server when the process starts and is released when the process completes. So long as a license is checked out, you are billed for the use of that license. This type of licensing is typically used for rendering software.

Uptime-based licensing

Uptime-based or metered licenses are billed based on the uptime of your Compute Engine instance. The instance is configured to register with the cloud-based license server during the startup process. So long as the instance is running, the license is checked out. When the instance is stopped or deleted, the license is released. This type of licensing is typically used for render workers that a queue manager deploys.

Choosing how to store your data

The type of storage that you choose on Google Cloud depends on your chosen storage strategy along with factors such as durability requirements and cost.

Persistent disk

You might be able to avoid implementing a file server altogether by incorporating persistent disks (PDs) into your workload. PDs are a type of POSIX-compliant block storage, up to 64 TB in size, that are familiar to most VFX facilities. Persistent disks are available as both standard drives and solid-state drives (SSD). You can attach a PD in read-write mode to a single instance, or in read-only mode to a large number of instances, such as a group of render workers.

Pros:

  • Mounts as a standard NFS or SMB volume.
  • Can dynamically resize.
  • Up to 128 PDs can be attached to a single instance.
  • The same PD can be mounted as read-only on hundreds or thousands of instances.

Cons:

  • Maximum size of 64 TB.
  • Can write to a PD only when it is attached to a single instance.
  • Can be accessed only by resources that are in the same region.

Ideal use cases:

  • Advanced pipelines that can build a new disk on a per-job basis.
  • Pipelines that serve infrequently updated data, such as software or common libraries, to render workers.
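As a sketch of the read-only sharing pattern, the following commands create an asset disk and attach it to multiple workers; the disk name, instance names, size, and zone are placeholders:

```shell
# Create a disk that holds common assets.
gcloud compute disks create shared-assets \
    --zone=us-central1-a --size=2TB --type=pd-ssd

# Attach the disk read-only to multiple render workers.
gcloud compute instances attach-disk render-worker-001 \
    --disk=shared-assets --mode=ro --zone=us-central1-a
gcloud compute instances attach-disk render-worker-002 \
    --disk=shared-assets --mode=ro --zone=us-central1-a
```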

Object storage

Cloud Storage is highly redundant, highly durable storage that, unlike traditional file systems, is unstructured and practically unlimited in capacity. Files on Cloud Storage are stored in buckets, which are similar to folders, and are accessible worldwide.

Unlike traditional storage, object storage cannot be mounted as a logical volume by an operating system (OS). If you decide to incorporate object storage into your render pipeline, you must modify the way that you read and write data, either through command-line utilities such as gsutil or through the Cloud Storage API.
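For example, a render wrapper script might stage data locally before rendering and publish the results afterward; the bucket names and paths are illustrative:

```shell
# Stage scene data from a bucket to local scratch before rendering.
gsutil -m cp -r gs://my-studio-assets/shot010 /scratch/

# ...render from /scratch/shot010...

# Publish finished frames back to a bucket.
gsutil -m cp /scratch/shot010/out/*.exr gs://my-studio-renders/shot010/
```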

Pros:

  • Durable, highly available storage for files of all sizes.
  • Single API across storage classes.
  • Inexpensive.
  • Data is available worldwide.
  • Virtually unlimited capacity.

Cons:

  • Not POSIX-compliant.
  • Must be accessed through an API or command-line utility.
  • In a render pipeline, data must be transferred locally before use.

Ideal use cases:

  • Render pipelines with an asset management system that can publish data to Cloud Storage.
  • Render pipelines with a queue management system that can fetch data from Cloud Storage before rendering.

Other storage products

Other storage products are available as managed services, through third-party channels such as the Cloud Marketplace, or as open source projects through software repositories or GitHub.

Filestore

  Pros: Clustered file system that can support thousands of simultaneous NFS connections. Able to synchronize with an on-premises NAS cluster.
  Cons: No way to selectively sync files. No bidirectional sync.
  Ideal use case: Medium to large VFX facilities with hundreds of TBs of data to present on the cloud.

Pixit Media, PixStor

  Pros: Scale-out file system that can support thousands of simultaneous NFS or POSIX clients. Data can be cached on demand from on-premises NAS, with updates automatically sent back to on-premises storage.
  Cons: Cost; third-party support from Pixit.
  Ideal use case: Medium to large VFX facilities with hundreds of TBs of data to present on the cloud.

Google Cloud NetApp Volumes

  Pros: Fully managed storage solution on Google Cloud. Supports NFS, SMB, and multiprotocol environments. Point-in-time snapshots with instance recovery.
  Cons: Not available in all Google Cloud regions.
  Ideal use case: VFX