Deploy automated malware scanning for files uploaded to Cloud Storage

Last reviewed 2024-12-02 UTC

This document describes how you deploy the architecture in Automate malware scanning for files uploaded to Cloud Storage.

This deployment guide assumes that you're familiar with the basic functionality of the following technologies:

Architecture

The following diagram shows the deployment architecture that you create in this document:

Architecture of malware-scanning pipeline.

The diagram shows the following two pipelines that are managed by this architecture:

File scanning pipeline, which checks if an uploaded file contains malware.
ClamAV malware database mirror update pipeline, which maintains an up-to-date mirror of the database of malware that ClamAV uses.

For more information about the architecture, see Automate malware scanning for files uploaded to Cloud Storage.

Objectives

Build a mirror of the ClamAV malware definitions database in a Cloud Storage bucket.
Build a Cloud Run service with the following functions:
- Scanning files in a Cloud Storage bucket for malware using ClamAV and moving scanned files to clean or quarantined buckets based on the outcome of the scan.
- Maintaining a mirror of the ClamAV malware definitions database in Cloud Storage.
Create an Eventarc trigger to trigger the malware-scanning service when a file is uploaded to Cloud Storage.
Create a Cloud Scheduler job to trigger the malware-scanning service to refresh the mirror of the malware definitions database in Cloud Storage.

Costs

This architecture uses the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Artifact Registry, Cloud Build, Resource Manager, Cloud Scheduler, Eventarc, Logging, Monitoring, Pub/Sub, Cloud Run, and Service Usage APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the APIs

In the Google Cloud console, activate Cloud Shell.

Activate Cloud Shell

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

Deploy the architecture

You can deploy the architecture described in this document by using one of the following methods:

Use Cloud Shell: Use this method if you want to see how each component of the solution is deployed and configured by using the Google Cloud CLI command-line tool.

To use this deployment method, follow the instructions in Deploy using Cloud Shell.
Use the Terraform CLI: Use this method if you want to deploy the solution in as few manual steps as possible. This method relies on Terraform to deploy and configure the individual components.

To use this deployment method, follow the instructions in Deploy using the Terraform CLI.

Deploy using Cloud Shell

To manually deploy the architecture described in this document, complete the steps in the following subsections.

Prepare your environment

In this section, you assign settings for values that are used throughout the deployment, such as region and location. In this deployment, you use us-central1 as the region for the Cloud Run service and us as the location for the Eventarc trigger and Cloud Storage buckets.

In Cloud Shell, set common shell variables including region and location:

```
REGION=us-central1
LOCATION=us
PROJECT_ID=PROJECT_ID
SERVICE_NAME="malware-scanner"
SERVICE_ACCOUNT="${SERVICE_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
```

Replace PROJECT_ID with your project ID.

Initialize the gcloud environment with your project ID:
```
gcloud config set project "${PROJECT_ID}"
```
Create three Cloud Storage buckets with unique names:
```
gcloud storage buckets create "gs://unscanned-${PROJECT_ID}" --location="${LOCATION}"
gcloud storage buckets create "gs://quarantined-${PROJECT_ID}" --location="${LOCATION}"
gcloud storage buckets create "gs://clean-${PROJECT_ID}" --location="${LOCATION}"
```
${PROJECT_ID} is used to make sure that the bucket names are unique.

These three buckets hold the uploaded files at various stages during the file scanning pipeline:
- unscanned-PROJECT_ID: Holds files before they're scanned. Your users upload their files to this bucket.
- quarantined-PROJECT_ID: Holds files that the malware-scanner service has scanned and deemed to contain malware.
- clean-PROJECT_ID: Holds files that the malware-scanner service has scanned and found to be uninfected.
Create a fourth Cloud Storage bucket:
```
gcloud storage buckets create "gs://cvd-mirror-${PROJECT_ID}" --location="${LOCATION}"
```
${PROJECT_ID} is used to make sure that the bucket name is unique.

The cvd-mirror-PROJECT_ID bucket holds a local mirror of the malware definitions database. Serving definitions from this mirror prevents the ClamAV CDN from rate limiting the service.
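
Because every bucket name embeds the project ID, it can be worth checking the resulting names against the basic Cloud Storage naming rules before you create the buckets. The following is a minimal sketch, not part of the deployment: the function name is_valid_bucket_name and the sample project ID are hypothetical, and only the basic rules are checked (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; a letter or digit at each end).

```shell
# Hypothetical helper: checks a name against the basic Cloud Storage bucket
# naming rules. It doesn't check global uniqueness or reserved words.
# The function body runs in a subshell so that LC_ALL=C stays local.
is_valid_bucket_name() (
  LC_ALL=C
  name="$1"
  [ "${#name}" -ge 3 ] && [ "${#name}" -le 63 ] || return 1
  case "${name}" in
    *[!a-z0-9._-]*) return 1 ;;          # contains a disallowed character
    [!a-z0-9]*|*[!a-z0-9]) return 1 ;;   # starts or ends with a non-alphanumeric
  esac
  return 0
)

# Sample project ID (an assumption); an existing PROJECT_ID value is kept.
PROJECT_ID="${PROJECT_ID:-my-sample-project}"
for prefix in unscanned quarantined clean cvd-mirror; do
  if is_valid_bucket_name "${prefix}-${PROJECT_ID}"; then
    echo "ok: ${prefix}-${PROJECT_ID}"
  else
    echo "invalid bucket name: ${prefix}-${PROJECT_ID}" >&2
  fi
done
```

A project ID that contains characters such as a colon would fail this check, in which case you would need a different suffix to make the bucket names unique.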

Set up a service account for the malware-scanner service

In this section, you create a service account to use for the malware scanner service. You then grant the appropriate roles to the service account so that it has permissions to read and write to the Cloud Storage buckets. The roles ensure that the account has minimal permissions and that it only has access to the resources that it needs.

Create the malware-scanner service account:

```
gcloud iam service-accounts create "${SERVICE_NAME}"
```

Grant the service account the Storage Object Admin role (roles/storage.objectAdmin) on each bucket. The role allows the service to read and delete files from the unscanned bucket, to write files to the quarantined and clean buckets, and to maintain the malware definitions mirror.

```
gcloud storage buckets add-iam-policy-binding "gs://unscanned-${PROJECT_ID}" \
    --member="serviceAccount:${SERVICE_ACCOUNT}" --role=roles/storage.objectAdmin
gcloud storage buckets add-iam-policy-binding "gs://clean-${PROJECT_ID}" \
    --member="serviceAccount:${SERVICE_ACCOUNT}" --role=roles/storage.objectAdmin
gcloud storage buckets add-iam-policy-binding "gs://quarantined-${PROJECT_ID}" \
    --member="serviceAccount:${SERVICE_ACCOUNT}" --role=roles/storage.objectAdmin
gcloud storage buckets add-iam-policy-binding "gs://cvd-mirror-${PROJECT_ID}" \
    --member="serviceAccount:${SERVICE_ACCOUNT}" --role=roles/storage.objectAdmin
```

Grant the Metric Writer role, which allows the service to write metrics to Monitoring:

```
gcloud projects add-iam-policy-binding \
      "${PROJECT_ID}" \
      --member="serviceAccount:${SERVICE_ACCOUNT}" \
      --role=roles/monitoring.metricWriter
```

Create the malware-scanner service in Cloud Run

In this section, you deploy the malware-scanner service to Cloud Run. The service runs in a Docker container that contains the following:

A Dockerfile to build a container image with the service, Node.js runtime, Google Cloud SDK, and ClamAV binaries.
The TypeScript files for the malware-scanner Cloud Run service.
A config.json configuration file to specify your Cloud Storage bucket names.
An updateCvdMirror.sh shell script to refresh the ClamAV malware definitions database mirror in Cloud Storage.
A bootstrap.sh shell script to run the necessary services on instance startup.

To deploy the service, do the following:

In Cloud Shell, clone the GitHub repository that contains the code files:

```
git clone https://github.com/GoogleCloudPlatform/docker-clamav-malware-scanner.git
```

Change to the cloudrun-malware-scanner directory:

```
cd docker-clamav-malware-scanner/cloudrun-malware-scanner
```

Create the config.json configuration file based on the config.json.tmpl template file in the GitHub repository:
```
sed "s/-bucket-name/-${PROJECT_ID}/" config.json.tmpl > config.json
```
The preceding command uses a search and replace operation to give the Cloud Storage buckets unique names that are based on the Project ID.
Optional: View the updated configuration file:
```
cat config.json
```
Perform an initial population of the ClamAV malware database mirror in Cloud Storage:
```
python3 -m venv pyenv
. pyenv/bin/activate
pip3 install crcmod cvdupdate
./updateCvdMirror.sh "cvd-mirror-${PROJECT_ID}"
deactivate
```
These commands perform a local install of the CVDUpdate tool, and then run the updateCvdMirror.sh script, which uses CVDUpdate to copy the ClamAV malware database to the cvd-mirror-PROJECT_ID bucket that you created earlier.

You can check the contents of the mirror bucket:
```
gcloud storage ls "gs://cvd-mirror-${PROJECT_ID}/cvds"
```
The bucket should contain several CVD files that contain the full malware database, several .cdiff files that contain the daily differential updates, and two JSON files with configuration and state information.
Create and deploy the Cloud Run service using the service account that you created earlier:
```
gcloud beta run deploy "${SERVICE_NAME}" \
  --source . \
  --region "${REGION}" \
  --no-allow-unauthenticated \
  --memory 4Gi \
  --cpu 1 \
  --concurrency 20 \
  --min-instances 1 \
  --max-instances 5 \
  --no-cpu-throttling \
  --cpu-boost \
  --timeout 300s \
  --service-account="${SERVICE_ACCOUNT}"
```
The command creates a Cloud Run instance that has 1 vCPU and uses 4 GiB of RAM. This size is acceptable for this deployment. However, in a production environment, you might want to choose a larger CPU and memory size for the instance, and a larger --max-instances parameter. The resource sizes that you might need depend on how much traffic the service needs to handle.

The command includes the following specifications:
- The --concurrency parameter specifies the number of simultaneous requests that each instance can process.
- The --no-cpu-throttling parameter lets the instance perform operations in the background, such as updating malware definitions.
- The --cpu-boost parameter doubles the number of vCPUs on instance startup to reduce startup latency.
- The --min-instances 1 parameter maintains at least one instance active, because the startup time for each instance is relatively high.
- The --max-instances 5 parameter prevents the service from being scaled up too high.

When prompted, enter Y to build and deploy the service. The build and deployment takes about 10 minutes. When it's complete, the following message is displayed:

```
Service [malware-scanner] revision [malware-scanner-UNIQUE_ID] has been deployed and is serving 100 percent of traffic.
Service URL: https://malware-scanner-UNIQUE_ID.a.run.app
```

Store the Service URL value from the output of the deployment command in a shell variable. You use the value later when you create a Cloud Scheduler job.
```
SERVICE_URL="SERVICE_URL"
```
Optional: To check the running service and the ClamAV version, run the following command:
```
curl -D - -H "Authorization: Bearer $(gcloud auth print-identity-token)"  \
    ${SERVICE_URL}
```
The output looks like the following sample. It shows the version of the malware-scanner service, the version of ClamAV, and the version of the malware definitions with the date that they were last updated.
```
gcs-malware-scanner version 3.2.0
Using Clam AV version: ClamAV 1.4.1/27479/Fri Dec  6 09:40:14 2024
```
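
If you want to check the definitions revision in a script, for example to notice when instances run stale definitions, the ClamAV line of the response can be split into its parts. The following is a minimal sketch that assumes the line keeps the ENGINE/REVISION/DATE layout shown in the sample output; the function name parse_clamav_version is hypothetical.

```shell
# Hypothetical helper: splits the ClamAV version line from the service
# response into engine version, definitions revision, and definitions date.
parse_clamav_version() {
  local line="$1"
  # Drop everything up to and including the last "ClamAV ".
  local rest="${line##*ClamAV }"
  local engine="${rest%%/*}"      # text before the first "/"
  rest="${rest#*/}"
  local revision="${rest%%/*}"    # text between the first and second "/"
  local updated="${rest#*/}"      # everything after the second "/"
  printf 'engine=%s revision=%s date=%s\n' "${engine}" "${revision}" "${updated}"
}

parse_clamav_version \
  "Using Clam AV version: ClamAV 1.4.1/27479/Fri Dec  6 09:40:14 2024"
```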

The Cloud Run service requires that all invocations are authenticated, and the authenticating identities must have the run.routes.invoke permission on the service. You add the permission in the next section.

Create an Eventarc Cloud Storage trigger

In this section, you add permissions to allow Eventarc to capture Cloud Storage events and create a trigger to send these events to the Cloud Run malware-scanner service.

If you're using an existing project that was created before April 8, 2021, add the iam.serviceAccountTokenCreator role to the Pub/Sub service account:

```
PROJECT_NUMBER=$(gcloud projects describe "${PROJECT_ID}" --format="value(projectNumber)")
PUBSUB_SERVICE_ACCOUNT="service-${PROJECT_NUMBER}@gcp-sa-pubsub.iam.gserviceaccount.com"
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member="serviceAccount:${PUBSUB_SERVICE_ACCOUNT}" \
    --role='roles/iam.serviceAccountTokenCreator'
```

This role addition is only required for older projects and allows Pub/Sub to invoke the Cloud Run service.

In Cloud Shell, grant the Pub/Sub Publisher role to the Cloud Storage service account:

```
STORAGE_SERVICE_ACCOUNT=$(gcloud storage service-agent --project="${PROJECT_ID}")

gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member "serviceAccount:${STORAGE_SERVICE_ACCOUNT}" \
  --role "roles/pubsub.publisher"
```

Allow the malware-scanner service account to invoke the Cloud Run service, and act as an Eventarc event receiver:

```
gcloud run services add-iam-policy-binding "${SERVICE_NAME}" \
  --region="${REGION}" \
  --member "serviceAccount:${SERVICE_ACCOUNT}" \
  --role roles/run.invoker
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member "serviceAccount:${SERVICE_ACCOUNT}" \
  --role "roles/eventarc.eventReceiver"
```

Create an Eventarc trigger to capture the finalized object event in the unscanned Cloud Storage bucket and send it to your Cloud Run service. The trigger uses the malware-scanner service account for authentication:

```
BUCKET_NAME="unscanned-${PROJECT_ID}"
gcloud eventarc triggers create "trigger-${BUCKET_NAME}-${SERVICE_NAME}" \
  --destination-run-service="${SERVICE_NAME}" \
  --destination-run-region="${REGION}" \
  --location="${LOCATION}" \
  --event-filters="type=google.cloud.storage.object.v1.finalized" \
  --event-filters="bucket=${BUCKET_NAME}" \
  --service-account="${SERVICE_ACCOUNT}"
```

If you receive one of the following errors, wait one minute and then run the commands again:

```
ERROR: (gcloud.eventarc.triggers.create) INVALID_ARGUMENT: The request was invalid: Bucket "unscanned-PROJECT_ID" was not found. Please verify that the bucket exists.
```

```
ERROR: (gcloud.eventarc.triggers.create) FAILED_PRECONDITION: Invalid resource state for "": Permission denied while using the Eventarc Service Agent. If you recently started to use Eventarc, it may take a few minutes before all necessary permissions are propagated to the Service Agent. Otherwise, verify that it has Eventarc Service Agent role.
```

Change the message acknowledgement deadline to five minutes in the underlying Pub/Sub subscription that's used by the Eventarc trigger. The default value of 10 seconds is too short for large files or high loads.
```
SUBSCRIPTION_NAME=$(gcloud eventarc triggers describe \
    "trigger-${BUCKET_NAME}-${SERVICE_NAME}" \
    --location="${LOCATION}" \
    --format="get(transport.pubsub.subscription)")
gcloud pubsub subscriptions update "${SUBSCRIPTION_NAME}" --ack-deadline=300
```
Although your trigger is created immediately, it can take up to two minutes for that trigger to be fully functional.
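
The advice in this section to wait a minute and rerun a failed command can be scripted. The following is a minimal sketch of a generic wait-and-retry wrapper; the function name retry and its argument order are hypothetical, and it retries on any failure, not only on the propagation errors listed earlier.

```shell
# Hypothetical helper: runs a command up to MAX_ATTEMPTS times, sleeping
# DELAY_SECONDS after each failed attempt. Returns non-zero if every
# attempt fails.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local max_attempts="$1" delay_seconds="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay_seconds}s" >&2
    sleep "${delay_seconds}"
    attempt=$(( attempt + 1 ))
  done
}
```

For example, prefixing the gcloud eventarc triggers create command from this section with retry 5 60 would attempt it up to five times, one minute apart. Because the wrapper can't tell a propagation delay from a real misconfiguration, keep the attempt count small.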

Create a Cloud Scheduler job to trigger ClamAV database mirror updates

Create a Cloud Scheduler job that executes an HTTP POST request on the Cloud Run service with a command to update the mirror of the malware definitions database. To avoid having too many clients use the same time slot, ClamAV requires that you schedule the job at a random minute between 3 and 57, avoiding multiples of 10.

```
while : ; do
  # set MINUTE to a random number between 3 and 57
  MINUTE="$((RANDOM%55 + 3))"
  # exit loop if MINUTE isn't a multiple of 10
  [[ $((MINUTE % 10)) != 0 ]] && break
done

gcloud scheduler jobs create http \
    "${SERVICE_NAME}-mirror-update" \
    --location="${REGION}" \
    --schedule="${MINUTE} */2 * * *" \
    --oidc-service-account-email="${SERVICE_ACCOUNT}" \
    --uri="${SERVICE_URL}" \
    --http-method=post \
    --message-body='{"kind":"schedule#cvd_update"}' \
    --headers="Content-Type=application/json"
```

The --schedule command-line argument defines when the job runs using the unix-cron string format. The value given indicates that the job should run at the specific randomly-generated minute every two hours.

This job only updates the ClamAV mirror in Cloud Storage. The ClamAV freshclam daemon in each instance of the Cloud Run service checks the mirror every 30 minutes for new definitions and updates the ClamAV daemon.
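
To see what a given schedule means in practice, you can validate a minute value against the constraints described in this section and list the times at which the job would run each day. The following is a minimal sketch; both function names and the example minute are hypothetical, and the times are relative to whatever time zone the job uses.

```shell
# Hypothetical helper: succeeds if MINUTE is a whole number from 3 to 57
# that isn't a multiple of 10.
is_valid_update_minute() {
  local minute="$1"
  case "${minute}" in
    ''|*[!0-9]*) return 1 ;;  # empty or not a whole number
    0?*) return 1 ;;          # reject zero-padded values such as 08
    *0) return 1 ;;           # a last digit of 0 means a multiple of 10
  esac
  [ "${minute}" -ge 3 ] && [ "${minute}" -le 57 ]
}

# Hypothetical helper: prints the daily run times for "MINUTE */2 * * *",
# that is, the chosen minute of every even-numbered hour.
daily_run_times() {
  local minute="$1" hour=0
  while [ "${hour}" -le 22 ]; do
    printf '%02d:%02d\n' "${hour}" "${minute}"
    hour=$(( hour + 2 ))
  done
}

MINUTE="${MINUTE:-17}"  # example value; the deployment step picks it at random
if is_valid_update_minute "${MINUTE}"; then
  daily_run_times "${MINUTE}"
else
  echo "minute ${MINUTE} doesn't meet the scheduling constraints" >&2
fi
```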

Deploy using the Terraform CLI

This section describes deploying the architecture described in this document by using the Terraform CLI.

Clone the GitHub repository

In Cloud Shell, clone the GitHub repository that contains the code and Terraform files:
```
git clone https://github.com/GoogleCloudPlatform/docker-clamav-malware-scanner.git
```

Prepare the environment

In Cloud Shell, set common shell variables including region and location:
```
REGION=us-central1
LOCATION=us
PROJECT_ID=PROJECT_ID
```
Replace PROJECT_ID with your project ID.
Initialize the gcloud CLI environment with your project ID:
```
gcloud config set project "${PROJECT_ID}"
```
Create the config.json configuration file based on the config.json.tmpl template file in the GitHub repository:
```
sed "s/-bucket-name/-${PROJECT_ID}/" \
  docker-clamav-malware-scanner/cloudrun-malware-scanner/config.json.tmpl \
  > docker-clamav-malware-scanner/cloudrun-malware-scanner/config.json
```
The preceding command uses a search and replace operation to give the Cloud Storage buckets unique names that are based on the Project ID.

Optional: View the updated configuration file:

```
cat docker-clamav-malware-scanner/cloudrun-malware-scanner/config.json
```

Configure the Terraform variables. The contents of the config.json configuration file are passed to Terraform by using the TF_VAR_config_json variable, so that Terraform knows which Cloud Storage buckets to create. The value of this variable is also passed to Cloud Run to configure the service.

```
TF_VAR_project_id=$PROJECT_ID
TF_VAR_region=us-central1
TF_VAR_bucket_location=us
TF_VAR_config_json="$(cat docker-clamav-malware-scanner/cloudrun-malware-scanner/config.json)"
TF_VAR_create_buckets=true
export TF_VAR_project_id TF_VAR_region TF_VAR_bucket_location TF_VAR_config_json TF_VAR_create_buckets
```
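
To see what the sed search-and-replace that creates config.json does, you can apply the same expression to a small stand-in template. The following is a sketch under stated assumptions: the template content is modeled on the configuration samples in the Advanced configuration section, the real config.json.tmpl in the repository may contain more settings, and the sample project ID is hypothetical.

```shell
set -eu

# Sample project ID (an assumption); an existing PROJECT_ID value is kept.
PROJECT_ID="${PROJECT_ID:-my-sample-project}"
workdir="$(mktemp -d)"

# Stand-in template (assumed shape, not the repository's actual file).
cat > "${workdir}/config.json.tmpl" <<'EOF'
{
  "buckets": [
    {
      "unscanned": "unscanned-bucket-name",
      "clean": "clean-bucket-name",
      "quarantined": "quarantined-bucket-name"
    }
  ],
  "ClamCvdMirrorBucket": "cvd-mirror-bucket-name"
}
EOF

# Same substitution as in the deployment step: the first "-bucket-name"
# on each line becomes "-PROJECT_ID".
sed "s/-bucket-name/-${PROJECT_ID}/" "${workdir}/config.json.tmpl" \
  > "${workdir}/config.json"

cat "${workdir}/config.json"
```

The resulting names match the buckets that the deployment creates, which is why the same PROJECT_ID value needs to be used consistently.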

Deploy the base infrastructure

In Cloud Shell, run the following commands to deploy the base infrastructure:
```
gcloud services enable \
  cloudresourcemanager.googleapis.com \
  serviceusage.googleapis.com
cd docker-clamav-malware-scanner/terraform/infra
terraform init
terraform apply
```
Respond yes when prompted.

This Terraform script performs the following tasks:
- Creates the service accounts
- Creates the Artifact Registry
- Creates the Cloud Storage buckets
- Sets the appropriate roles and permissions
- Performs an initial population of the Cloud Storage bucket that contains the mirror of ClamAV malware definitions database

Build the container for the service

In Cloud Shell, run the following commands to launch a Cloud Build job to create the container image for the service:

```
cd ../../cloudrun-malware-scanner
gcloud builds submit \
  --region="$TF_VAR_region" \
  --config=cloudbuild.yaml \
  --service-account="projects/$PROJECT_ID/serviceAccounts/malware-scanner-build@$PROJECT_ID.iam.gserviceaccount.com" \
  .
```

Wait a few minutes for the build to complete.

Deploy the service and trigger

In Cloud Shell, run the following commands to deploy the Cloud Run service:
```
cd ../terraform/service/
terraform init
terraform apply
```
Respond yes when prompted.

It can take several minutes for the service to deploy and start.

This Terraform script performs the following tasks:
- Deploys the Cloud Run service by using the container image that you just built.
- Sets up the Eventarc triggers on the unscanned Cloud Storage buckets. Although your trigger is created immediately, it can take up to two minutes for that trigger to be fully functional.
- Creates the Cloud Scheduler job to update the ClamAV malware definitions mirror.
If the deployment fails with one of the following errors, wait one minute and then run the terraform apply command again to retry creating the Eventarc trigger.
```
Error: Error creating Trigger: googleapi: Error 400: Invalid resource state for "": The request was invalid: Bucket "unscanned-PROJECT_ID" was not found. Please verify that the bucket exists.
```
```
Error: Error creating Trigger: googleapi: Error 400: Invalid resource state for "": Permission denied while using the Eventarc Service Agent. If you recently started to use Eventarc, it may take a few minutes before all necessary permissions are propagated to the Service Agent. Otherwise, verify that it has Eventarc Service Agent role..
```
Optional: To check the running service and the ClamAV version in use, run the following commands:
```
MALWARE_SCANNER_URL="$(terraform output -raw cloud_run_uri)"
curl -H "Authorization: Bearer $(gcloud auth print-identity-token)"  \
  "${MALWARE_SCANNER_URL}"
```
The output looks like the following sample. It shows the version of the malware-scanner service, the version of ClamAV, and the version of the malware definitions with the date that they were last updated.
```
gcs-malware-scanner version 3.2.0
Using Clam AV version: ClamAV 1.4.1/27479/Fri Dec  6 09:40:14 2024
```

Test the pipeline by uploading files

To test the pipeline, you upload one clean (malware-free) file and one test file that mimics an infected file:

Create a sample text file or use an existing clean file to test the pipeline processes.
In Cloud Shell, copy the sample data file to the unscanned bucket:
```
gcloud storage cp FILENAME "gs://unscanned-${PROJECT_ID}"
```
Replace FILENAME with the name of the clean text file. The malware-scanner service inspects each file and moves it to an appropriate bucket. This file is moved to the clean bucket.
Give the pipeline a few seconds to process the file and then check your clean bucket to see if the processed file is there:
```
gcloud storage ls "gs://clean-${PROJECT_ID}" --recursive
```
You can check that the file was removed from the unscanned bucket:
```
gcloud storage ls "gs://unscanned-${PROJECT_ID}" --recursive
```
Upload a file called eicar-infected.txt that contains the EICAR standard anti-malware test signature to your unscanned bucket:
```
echo -e 'X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*' \
    | gcloud storage cp - "gs://unscanned-${PROJECT_ID}/eicar-infected.txt"
```
This text string has a signature that triggers malware scanners for testing purposes. This test file is a widely used test—it isn't actual malware and it's harmless to your workstation. If you try to create a file that contains this string on a computer that has a malware scanner installed, you can trigger an alert.
Wait a few seconds and then check your quarantined bucket to see if your file successfully went through the pipeline:
```
gcloud storage ls "gs://quarantined-${PROJECT_ID}" --recursive
```
The service also writes a Cloud Logging log entry when a malware-infected file is detected.

You can check that the file was removed from the unscanned bucket:
```
gcloud storage ls "gs://unscanned-${PROJECT_ID}" --recursive
```

Test the malware definitions database update mechanism

In Cloud Shell, trigger the check for updates by forcing the Cloud Scheduler job to run:
```
gcloud scheduler jobs run "${SERVICE_NAME}-mirror-update" --location="${REGION}"
```
The results of this command are only shown in the detailed logs.

Monitor the service

You can monitor the service by using Cloud Logging and Cloud Monitoring.

View detailed logs

In the Google Cloud console, go to the Cloud Logging Logs Explorer page.

Go to Logs Explorer
If the Log fields filter isn't displayed, click Log Fields.
In the Log Fields filter, click Cloud Run Revision.
In the Service Name section of the Log Fields filter, click malware-scanner.

The logs query results show the logs from the service, including several lines that show the scan requests and status for the two files that you uploaded:

```
Scan request for gs://unscanned-PROJECT_ID/FILENAME, (##### bytes) scanning with clam ClamAV CLAMAV_VERSION_STRING
Scan status for gs://unscanned-PROJECT_ID/FILENAME: CLEAN (##### bytes in #### ms)
...
Scan request for gs://unscanned-PROJECT_ID/eicar-infected.txt, (69 bytes) scanning with clam ClamAV CLAMAV_VERSION_STRING
Scan status for gs://unscanned-PROJECT_ID/eicar-infected.txt: INFECTED stream: Eicar-Signature FOUND (69 bytes in ### ms)
```

The output shows the ClamAV version and malware database signature revision, along with the malware name for the infected test file. You can use these log messages to set up alerts for when malware has been found, or for when failures occurred while scanning.
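
As a starting point for alerting, the Scan status lines can be tallied from exported log text. The following is a minimal sketch that assumes the log messages have been saved one per line in a text file; the function name, the sample file, and the sizes and durations in the sample lines are hypothetical. It also assumes that object paths contain no spaces.

```shell
# Hypothetical helper: counts CLEAN and INFECTED results in a text file of
# log messages, and lists the infected object paths.
summarize_scan_log() {
  local log_file="$1"
  awk '
    /^Scan status for / {
      path = $4
      sub(/:$/, "", path)   # strip the trailing colon from the path
      if ($5 == "CLEAN") clean++
      if ($5 == "INFECTED") { infected++; print "infected: " path }
    }
    END { printf "clean=%d infected=%d\n", clean, infected }
  ' "${log_file}"
}

# Sample log lines in the format shown in this section (values made up).
sample_log="$(mktemp)"
cat > "${sample_log}" <<'EOF'
Scan status for gs://unscanned-example/report.txt: CLEAN (1024 bytes in 35 ms)
Scan status for gs://unscanned-example/eicar-infected.txt: INFECTED stream: Eicar-Signature FOUND (69 bytes in 12 ms)
EOF
summarize_scan_log "${sample_log}"
```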

The output also shows the malware definitions mirror update logs:

```
Starting CVD Mirror update
CVD Mirror update check complete. output: ...
```

If the mirror was updated, the output shows additional lines:

```
CVD Mirror updated: DATE_TIME - INFO: Downloaded daily.cvd. Version: VERSION_INFO
```

Freshclam update logs appear every 30 minutes:

```
DATE_TIME -> Received signal: wake up
DATE_TIME -> ClamAV update process started at DATE_TIME
DATE_TIME -> daily.cvd database is up-to-date (version: VERSION_INFO)
DATE_TIME -> main.cvd database is up-to-date (version: VERSION_INFO)
DATE_TIME -> bytecode.cvd database is up-to-date (version: VERSION_INFO)
```

If the database was updated, the freshclam log lines are instead similar to the following:

```
DATE_TIME -> daily.cld updated (version: VERSION_INFO)
```

View metrics

The service generates the following metrics for monitoring and alerting purposes:

Number of clean files processed:
workload.googleapis.com/googlecloudplatform/gcs-malware-scanning/clean-files
Number of infected files processed:
workload.googleapis.com/googlecloudplatform/gcs-malware-scanning/infected-files
Number of files ignored and not scanned:
workload.googleapis.com/googlecloudplatform/gcs-malware-scanning/ignored-files
Time spent scanning files:
workload.googleapis.com/googlecloudplatform/gcs-malware-scanning/scan-duration
Total number of bytes scanned:
workload.googleapis.com/googlecloudplatform/gcs-malware-scanning/bytes-scanned
Number of failed malware scans:
workload.googleapis.com/googlecloudplatform/gcs-malware-scanning/scans-failed
Number of CVD Mirror update checks:
workload.googleapis.com/googlecloudplatform/gcs-malware-scanning/cvd-mirror-updates

You can view these metrics in the Cloud Monitoring Metrics Explorer:

In the Google Cloud console, go to the Cloud Monitoring Metrics Explorer page.

Go to Metrics Explorer
Click the Select a metric field and enter the filter string malware.
Expand the Generic Task resource.
Expand the Googlecloudplatform category.
Select the googlecloudplatform/gcs-malware-scanning/clean-files metric. The graph shows a data point that indicates when the clean file was scanned.

You can use metrics to monitor the pipeline and to create alerts for when malware is detected, or when files fail processing.

The generated metrics have the following labels, which you can use for filtering and aggregation to view more fine-grained details with Metrics Explorer:

source_bucket
destination_bucket
clam_version
cloud_run_revision

In the ignored-files metric, the following reason labels define why files are ignored:

ZERO_LENGTH_FILE: If the ignoreZeroLengthFiles config value is set, and the file is empty.
FILE_TOO_LARGE: When the file exceeds the maximum scan size of 500 MiB.
REGEXP_MATCH: When the filename matches one of the patterns defined in fileExclusionPatterns.
FILE_SIZE_MISMATCH: If the file size changes while it is being examined.
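
The first three reasons can be read as a simple decision sequence. The following sketch is an illustration only, not the service's code: the function name, the argument order, and the order in which the checks run are assumptions, and grep -E stands in for the regular expression engine that the service uses. FILE_SIZE_MISMATCH isn't modeled because it depends on the file changing during the scan.

```shell
MAX_FILE_SIZE=$(( 500 * 1024 * 1024 ))  # 500 MiB, the documented default

# Hypothetical helper: prints the ignore reason for a file, or SCAN if none
# applies. Arguments: name, size in bytes, ignoreZeroLengthFiles (true or
# false), then zero or more exclusion patterns.
ignore_reason() {
  local name="$1" size="$2" ignore_zero="$3" pattern
  shift 3
  for pattern in "$@"; do
    if printf '%s\n' "${name}" | grep -Eq -- "${pattern}"; then
      echo "REGEXP_MATCH"
      return 0
    fi
  done
  if [ "${ignore_zero}" = "true" ] && [ "${size}" -eq 0 ]; then
    echo "ZERO_LENGTH_FILE"
    return 0
  fi
  if [ "${size}" -gt "${MAX_FILE_SIZE}" ]; then
    echo "FILE_TOO_LARGE"
    return 0
  fi
  echo "SCAN"
}

ignore_reason "upload.tmp" 10 false '\.tmp$'
ignore_reason "empty.txt" 0 true
ignore_reason "big.iso" 600000000 false
ignore_reason "report.pdf" 2048 true '\.tmp$'
```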

Advanced configuration

The following sections describe how you can configure the scanner with more advanced parameters.

Handle multiple buckets

The malware scanner service can scan files from multiple source buckets and send the files to separate clean and quarantined buckets. Although this advanced configuration is out of the scope of this deployment, the following is a summary of the required steps:

Create unscanned, clean, and quarantined Cloud Storage buckets that have unique names.
Grant the appropriate roles to the malware-scanner service account on each bucket.

Edit the config.json configuration file to specify the bucket names for each configuration:

```
{
  "buckets": [
    {
      "unscanned": "unscanned-bucket-1-name",
      "clean": "clean-bucket-1-name",
      "quarantined": "quarantined-bucket-1-name"
    },
    {
      "unscanned": "unscanned-bucket-2-name",
      "clean": "clean-bucket-2-name",
      "quarantined": "quarantined-bucket-2-name"
    }
  ],
  "ClamCvdMirrorBucket": "cvd-mirror-bucket-name"
}
```

For each of the unscanned buckets, create an Eventarc trigger. Make sure to create a unique trigger name for each bucket.

The Cloud Storage bucket must be in the same project and region as the Eventarc trigger.

If you use the Terraform deployment, the steps in this section are automatically applied when you pass your updated config.json configuration file in the Terraform configuration variable TF_VAR_config_json.
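
For a manual deployment, the per-bucket trigger step can be generated in a loop. The following is a minimal sketch that only prints the commands so that you can review them before running anything. It reuses the trigger naming scheme and flags from the single-bucket deployment; the function name, the default values, and the bucket names are placeholders.

```shell
# Defaults for illustration only; existing values from your shell are kept.
SERVICE_NAME="${SERVICE_NAME:-malware-scanner}"
REGION="${REGION:-us-central1}"
LOCATION="${LOCATION:-us}"
SERVICE_ACCOUNT="${SERVICE_ACCOUNT:-malware-scanner@my-sample-project.iam.gserviceaccount.com}"

# Hypothetical helper: prints (doesn't run) one trigger-creation command
# per unscanned bucket, each with a unique trigger name.
print_trigger_commands() {
  local bucket
  for bucket in "$@"; do
    printf '%s\n' \
      "gcloud eventarc triggers create \"trigger-${bucket}-${SERVICE_NAME}\" \\" \
      "  --destination-run-service=\"${SERVICE_NAME}\" \\" \
      "  --destination-run-region=\"${REGION}\" \\" \
      "  --location=\"${LOCATION}\" \\" \
      "  --event-filters=\"type=google.cloud.storage.object.v1.finalized\" \\" \
      "  --event-filters=\"bucket=${bucket}\" \\" \
      "  --service-account=\"${SERVICE_ACCOUNT}\""
  done
}

print_trigger_commands "unscanned-bucket-1-name" "unscanned-bucket-2-name"
```

The Pub/Sub subscription behind each new trigger also needs the acknowledgement deadline change that is described in the trigger creation steps.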

Ignore temporary files

Some uploading services, such as SFTP to Cloud Storage gateways, create one or more temporary files during the upload process. These services then rename these files to the final filename once the upload is complete.

By default, the scanner scans and moves every file as soon as it's written, including these temporary files. This behavior can cause the uploader service to fail when it can't find its temporary files.

The fileExclusionPatterns section of the config.json configuration file lets you use regular expressions to specify a list of filename patterns to ignore. Any files matching these regular expressions are left in the unscanned bucket.

When this rule is triggered, the ignored-files counter is incremented, and a message is logged to indicate that the file matching the pattern was ignored.

The following code sample shows a config.json configuration file whose fileExclusionPatterns list ignores files whose names end in .tmp and files whose names contain the string .partial_upload. (including both periods):

```
{
  "buckets": [
    {
      "unscanned": "unscanned-bucket-name",
      "clean": "clean-bucket-name",
      "quarantined": "quarantined-bucket-name"
    }
  ],
  "ClamCvdMirrorBucket": "cvd-mirror-bucket-name",
  "fileExclusionPatterns": [
    "\\.tmp$",
    "\\.partial_upload\\."
  ]
}
```

Take care when using \ characters in the regular expressions, because each one must be escaped with another \ in the JSON file. For example, to match a literal . the symbol is escaped twice: once for the regular expression, and again for the JSON string. It therefore becomes \\., as in the patterns in the preceding code sample.
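
The two layers of escaping can be tried out locally. The following is a minimal sketch: the sed command is a simplified stand-in for JSON decoding that only handles doubled backslashes, grep -E stands in for the regular expression engine that the service uses, and the filenames are made up.

```shell
# The pattern as written inside config.json (between the quotes).
json_text='\\.tmp$'
# Simplified stand-in for JSON decoding: each \\ becomes a single \.
escaped="$(printf '%s\n' "${json_text}" | sed 's/\\\\/\\/g')"
# The same pattern without the escape: "." now matches any character.
unescaped='.tmp$'

printf 'regular expression after decoding: %s\n' "${escaped}"

for name in "upload.tmp" "data_tmp" "report.txt"; do
  for pattern in "${escaped}" "${unescaped}"; do
    if printf '%s\n' "${name}" | grep -Eq -- "${pattern}"; then
      printf '%s matches %s\n' "${name}" "${pattern}"
    fi
  done
done
```

Only the unescaped pattern matches data_tmp, which is the kind of unintended exclusion that the double escaping avoids.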

Ignore zero-length files

Similarly to temporary files, some upload services create a zero-length file on Cloud Storage, then update this file later with more contents.

These files can also be ignored by setting the config.json parameter ignoreZeroLengthFiles to true, for example:

```
{
  "buckets": [
    {
      "unscanned": "unscanned-bucket-name",
      "clean": "clean-bucket-name",
      "quarantined": "quarantined-bucket-name"
    }
  ],
  "ClamCvdMirrorBucket": "cvd-mirror-bucket-name",
  "ignoreZeroLengthFiles": true
}
```

When this rule is triggered, the ignored-files metric is incremented, and a message is logged to indicate that a zero-length file was ignored.

Maximum scan file size

The default maximum scan file size is 500 MiB. This limit was chosen because it takes approximately 5 minutes to scan a file of this size.

Files that are larger than 500 MiB are ignored, and are left in the unscanned bucket. The ignored-files metric is incremented and a message is logged.

If you need to increase this limit, then update the following limits so they accommodate the new maximum file size and scan duration values:

The Cloud Run service request timeout is 5 minutes.
The Pub/Sub subscription message acknowledgement deadline is 5 minutes.
The Scanner code has a MAX_FILE_SIZE constant of 500 MiB.
The ClamAV service config has StreamMaxLength, MaxScanSize, and MaxFileSize settings of 512 MB. These settings are set by the bootstrap.sh script.
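
When you choose a new limit, a rough estimate of the scan duration helps you size the timeouts. The following is a minimal sketch that assumes scan time scales linearly at the rate stated earlier (about 500 MiB in 5 minutes), which real workloads might not follow. The function name is hypothetical. The 600-second value reflects the maximum acknowledgement deadline that Pub/Sub allows, so larger estimates can't be covered by raising the deadline alone.

```shell
# Assumption: scan time scales linearly at roughly 500 MiB per 300 seconds.
# Real scan times depend on file type and load.
estimate_scan_seconds() {
  local size_mib="$1"
  # Integer ceiling of size_mib * 300 / 500.
  echo $(( (size_mib * 300 + 499) / 500 ))
}

PUBSUB_MAX_ACK_DEADLINE=600  # seconds; the maximum that Pub/Sub allows

for size_mib in 500 800 1024; do
  seconds="$(estimate_scan_seconds "${size_mib}")"
  if [ "${seconds}" -gt "${PUBSUB_MAX_ACK_DEADLINE}" ]; then
    note="exceeds the maximum Pub/Sub acknowledgement deadline"
  else
    note="fits within the maximum Pub/Sub acknowledgement deadline"
  fi
  echo "${size_mib} MiB: about ${seconds}s, ${note}"
done
```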

Clean up

The following section explains how you can avoid future charges for the Google Cloud project that you used in this deployment.

Delete the Google Cloud project

To avoid incurring charges to your Google Cloud account for the resources used in this deployment, you can delete the Google Cloud project.

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

Explore Cloud Storage documentation.
For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.