Deploy automated malware scanning for files uploaded to Cloud Storage

Last reviewed 2023-06-20 UTC

This document describes how you deploy the architecture in Automate malware scanning for files uploaded to Cloud Storage.

This deployment guide assumes that you're familiar with the basic functionality of the following technologies:

Architecture

The following diagram shows the deployment architecture that you create in this document:

Architecture of malware-scanning pipeline.

The diagram shows the following two pipelines that are managed by this architecture:

  • File scanning pipeline, which checks if an uploaded file contains malware.
  • ClamAV malware database mirror update pipeline, which maintains an up-to-date mirror of the database of malware that ClamAV uses.

For more information about the architecture, see Automate malware scanning for files uploaded to Cloud Storage.

Objectives

  • Build a mirror of the ClamAV malware definitions database in a Cloud Storage bucket.

  • Build a Cloud Run service with the following functions:

    • Scanning files in a Cloud Storage bucket for malware using ClamAV and moving scanned files to clean or quarantined buckets based on the outcome of the scan.
    • Maintaining a mirror of the ClamAV malware definitions database in Cloud Storage.
  • Create an Eventarc trigger to trigger the malware-scanning service when a file is uploaded to Cloud Storage.

  • Create a Cloud Scheduler job to trigger the malware-scanning service to refresh the mirror of the malware definitions database in Cloud Storage.

Costs

This architecture uses the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Artifact Registry, Cloud Run, Eventarc, Logging, Cloud Scheduler, Pub/Sub, and Cloud Build APIs.

    Enable the APIs

  5. In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

  6. In this deployment, you run all commands from Cloud Shell.
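
If you prefer the command line to the console button, the APIs listed in the earlier step can also be enabled from Cloud Shell with gcloud services enable. The following is a minimal sketch: the APIS array name is an assumption made for illustration, and the snippet only prints the command so that you can review it before running it yourself.

```shell
#!/usr/bin/env bash
# Service names for the APIs listed in the "Enable the APIs" step.
# APIS is a hypothetical variable name used only in this sketch.
APIS=(
  artifactregistry.googleapis.com
  run.googleapis.com
  eventarc.googleapis.com
  logging.googleapis.com
  cloudscheduler.googleapis.com
  pubsub.googleapis.com
  cloudbuild.googleapis.com
)

# Print the command instead of running it so that you can review it first.
echo gcloud services enable "${APIS[@]}"
```

To enable the APIs, copy the printed command and run it in Cloud Shell.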

Set up your environment

In this section, you assign settings for values that are used throughout the deployment, such as region and location. In this deployment, you use us-central1 as the region for the Cloud Run service and us as the location for the Eventarc trigger and Cloud Storage buckets.

  1. In Cloud Shell, set common shell variables including region and location:

    REGION=us-central1
    LOCATION=us
    PROJECT_ID=PROJECT_ID
    SERVICE_NAME="malware-scanner"
    SERVICE_ACCOUNT="${SERVICE_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
    

    Replace PROJECT_ID with your project ID.

  2. Initialize the gcloud environment with your project ID:

    gcloud config set project "${PROJECT_ID}"
    
  3. Create three Cloud Storage buckets with unique names:

    gsutil mb -l "${LOCATION}" "gs://unscanned-${PROJECT_ID}"
    gsutil mb -l "${LOCATION}" "gs://quarantined-${PROJECT_ID}"
    gsutil mb -l "${LOCATION}" "gs://clean-${PROJECT_ID}"
    

    ${PROJECT_ID} is used to make sure that the bucket names are unique.

    These three buckets hold the uploaded files at various stages during the file scanning pipeline:

    • unscanned-PROJECT_ID: Holds files before they're scanned. Your users upload their files to this bucket.

    • quarantined-PROJECT_ID: Holds files that the malware-scanner service has scanned and deemed to contain malware.

    • clean-PROJECT_ID: Holds files that the malware-scanner service has scanned and found to be uninfected.

  4. Create a fourth Cloud Storage bucket:

    gsutil mb -l "${LOCATION}" "gs://cvd-mirror-${PROJECT_ID}"
    

    ${PROJECT_ID} is used to make sure that the bucket name is unique.

    The cvd-mirror-PROJECT_ID bucket is used to maintain a local mirror of the malware definitions database, which prevents the ClamAV CDN from rate limiting your requests.
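
All four bucket names in this section follow one PREFIX-PROJECT_ID pattern. As a minimal sketch, the following Bash snippet derives the names and applies a partial sanity check (length and allowed characters) before you create anything. The helper names bucket_names and is_valid_bucket_name and the project ID my-project-123 are assumptions made for illustration. They aren't part of gsutil or the repository, and the check doesn't cover every Cloud Storage naming rule.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of gsutil or the repository.

# Print the four bucket names used in this deployment for a given project ID.
bucket_names() {
  local project_id="$1" prefix
  for prefix in unscanned quarantined clean cvd-mirror; do
    echo "${prefix}-${project_id}"
  done
}

# Partial check: 3-63 characters; lowercase letters, digits, dashes,
# underscores, and dots only; starts and ends with a letter or digit.
is_valid_bucket_name() {
  local LC_ALL=C   # keep the [a-z] range limited to ASCII lowercase
  local name="$1"
  (( ${#name} >= 3 && ${#name} <= 63 )) || return 1
  [[ "${name}" =~ ^[a-z0-9][a-z0-9._-]*[a-z0-9]$ ]]
}

# Example with a placeholder project ID.
for name in $(bucket_names "my-project-123"); do
  if is_valid_bucket_name "${name}"; then
    echo "ok: gs://${name}"
  else
    echo "invalid: ${name}"
  fi
done
```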

Set up a service account for the malware-scanner service

In this section, you create a service account to use for the malware-scanner service. You then grant the appropriate roles to the service account so that it has permissions to read and write to the Cloud Storage buckets. The roles ensure that the account has minimal permissions and that it only has access to the resources that it needs.

  1. Create the malware-scanner service account:

    gcloud iam service-accounts create ${SERVICE_NAME}
    
  2. Grant the service account the Object Admin role on the buckets. The role allows the service to read and delete files from the unscanned bucket, to write files to the quarantined and clean buckets, and to read and update the malware definitions in the mirror bucket.

    gsutil iam ch \
        "serviceAccount:${SERVICE_ACCOUNT}:objectAdmin" \
        "gs://unscanned-${PROJECT_ID}"
    gsutil iam ch \
        "serviceAccount:${SERVICE_ACCOUNT}:objectAdmin" \
        "gs://clean-${PROJECT_ID}"
    gsutil iam ch \
        "serviceAccount:${SERVICE_ACCOUNT}:objectAdmin" \
        "gs://quarantined-${PROJECT_ID}"
    gsutil iam ch \
        "serviceAccount:${SERVICE_ACCOUNT}:objectAdmin" \
        "gs://cvd-mirror-${PROJECT_ID}"
    
  3. Grant the Metric Writer role, which allows the service to write metrics to Monitoring:

    gcloud projects add-iam-policy-binding \
          "${PROJECT_ID}" \
          --member="serviceAccount:${SERVICE_ACCOUNT}" \
          --role=roles/monitoring.metricWriter
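
The four gsutil iam ch commands in step 2 differ only in the bucket name, so they can be generated in a loop. The following sketch prints the commands rather than running them, so that you can review them first. The helper name print_bucket_grants and the default values for PROJECT_ID and SERVICE_ACCOUNT are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Placeholder defaults for illustration; in Cloud Shell these variables are
# already set by the earlier steps and the defaults are not used.
PROJECT_ID="${PROJECT_ID:-my-project-123}"
SERVICE_ACCOUNT="${SERVICE_ACCOUNT:-malware-scanner@${PROJECT_ID}.iam.gserviceaccount.com}"

# Hypothetical helper: print one gsutil iam ch command per bucket.
print_bucket_grants() {
  local prefix
  for prefix in unscanned clean quarantined cvd-mirror; do
    printf 'gsutil iam ch "serviceAccount:%s:objectAdmin" "gs://%s-%s"\n' \
      "${SERVICE_ACCOUNT}" "${prefix}" "${PROJECT_ID}"
  done
}

print_bucket_grants
```

After you review the output, one way to apply the grants is to pipe it to the shell: print_bucket_grants | bash.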
    

Create the malware-scanner service in Cloud Run

In this section, you deploy the malware-scanner service to Cloud Run. The service runs in a Docker container that contains the following:

  • A Dockerfile to build a container image with the service, Node.js runtime, Google Cloud SDK, and ClamAV binaries.
  • The Node.js files for the malware-scanner Cloud Run service.
  • A config.json configuration file to specify your Cloud Storage bucket names.
  • An updateCvdMirror.sh shell script to refresh the ClamAV malware definitions database mirror in Cloud Storage.
  • A cloud-run-proxy service to proxy freshclam HTTP requests, which provides authenticated access to Cloud Storage APIs.
  • A bootstrap.sh shell script to run the necessary services on instance startup.

To deploy the service, do the following:

  1. In Cloud Shell, clone the GitHub repository that contains the code files:

    git clone https://github.com/GoogleCloudPlatform/docker-clamav-malware-scanner.git
    
  2. Change to the cloudrun-malware-scanner directory:

    cd docker-clamav-malware-scanner/cloudrun-malware-scanner
    
  3. Edit the config.json configuration file to specify the Cloud Storage buckets that you created. Because the bucket names are based on the project ID, you can use a search and replace operation:

    sed "s/-bucket-name/-${PROJECT_ID}/" config.json.tmpl > config.json
    

    You can view the updated configuration file:

    cat config.json
    
  4. Perform an initial population of the ClamAV malware database mirror in Cloud Storage:

    python3 -m venv pyenv
    . pyenv/bin/activate
    pip3 install crcmod cvdupdate
    ./updateCvdMirror.sh "cvd-mirror-${PROJECT_ID}"
    deactivate
    

    These commands perform a local install of the CVDUpdate tool and use it to download the malware database. They then upload the database to the cvd-mirror-PROJECT_ID bucket that you created earlier.

    You can check the contents of the mirror bucket:

    gsutil ls "gs://cvd-mirror-${PROJECT_ID}/cvds"
    

    The bucket should contain several CVD files that contain the full malware database, several .cdiff files that contain the daily differential updates, and two .json files with configuration and state information.

  5. Create and deploy the Cloud Run service using the service account that you created earlier:

    gcloud beta run deploy "${SERVICE_NAME}" \
      --source . \
      --region "${REGION}" \
      --no-allow-unauthenticated \
      --memory 4Gi \
      --cpu 1 \
      --concurrency 20 \
      --min-instances 1 \
      --max-instances 5 \
      --no-cpu-throttling \
      --cpu-boost \
      --service-account="${SERVICE_ACCOUNT}"
    

    The command creates a Cloud Run instance that has 1 vCPU and uses 4 GiB of RAM. This size is acceptable for this deployment. However, in a production environment, you might want to choose a larger CPU and memory size for the instance, and a larger --max-instances parameter. The resource sizes that you might need depend on how much traffic the service needs to handle.

    The command includes the following specifications:

    • The --concurrency parameter specifies the number of simultaneous requests that each instance can process.
    • The --no-cpu-throttling parameter lets the instance perform operations in the background, such as updating malware definitions.
    • The --cpu-boost parameter doubles the number of vCPUs on instance startup to reduce startup latency.
    • The --min-instances 1 parameter keeps at least one instance active, because the startup time for each instance is relatively high.
    • The --max-instances 5 parameter prevents the service from being scaled up too high.
  6. When prompted, enter Y to build and deploy the service. The build and deployment take about 10 minutes. When they're complete, the following message is displayed:

    Service [malware-scanner] revision [malware-scanner-UNIQUE_ID] has been deployed and is serving 100 percent of traffic.
    Service URL: https://malware-scanner-UNIQUE_ID.a.run.app
    
  7. Store the Service URL value from the output of the deployment command in a shell variable. You use the value later when you create a Cloud Scheduler job.

    SERVICE_URL="SERVICE_URL"
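
To see what the sed substitution in step 3 does, you can run it on a few sample lines. The JSON fragment in the following sketch is a hypothetical stand-in for the template's contents, not a copy of config.json.tmpl. Only the -bucket-name suffix is taken from the command in step 3. The helper name sample_template and the project ID my-project-123 are also assumptions.

```shell
#!/usr/bin/env bash
PROJECT_ID="${PROJECT_ID:-my-project-123}"   # placeholder project ID

# Hypothetical template lines; the real config.json.tmpl may differ.
sample_template() {
  cat <<'EOF'
"unscanned": "unscanned-bucket-name",
"clean": "clean-bucket-name",
"quarantined": "quarantined-bucket-name"
EOF
}

# Same substitution as in step 3. Without the g flag, sed replaces only the
# first match on each line, which is enough when each line holds one name.
sample_template | sed "s/-bucket-name/-${PROJECT_ID}/"
```

Project IDs contain only lowercase letters, digits, and hyphens, so the value can't introduce characters that are special to sed, such as / or &.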
    

To check the running service and the ClamAV version, run the following command:

    curl -D - -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
         "${SERVICE_URL}"

The Cloud Run service requires that all invocations are authenticated, and the authenticating identities must have the run.routes.invoke permission on the service. You add the permission in the next section.
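
As an alternative to copying the Service URL by hand in step 7 of the deployment procedure, the URL can be read back from the deployed service with gcloud run services describe. The following is a minimal sketch that assumes the gcloud CLI is installed and authenticated, as it is in Cloud Shell. The helper name get_service_url is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: look up the URL of a deployed Cloud Run service.
# Assumes the gcloud CLI is installed and authenticated, as in Cloud Shell.
get_service_url() {
  local service="$1" region="$2"
  gcloud run services describe "${service}" \
    --region "${region}" \
    --format 'value(status.url)'
}

# Usage (commented out so that the snippet has no side effects when sourced):
# SERVICE_URL="$(get_service_url "${SERVICE_NAME}" "${REGION}")"
# echo "${SERVICE_URL}"
```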

Create an Eventarc Cloud Storage trigger

In this section, you add permissions to allow Eventarc to capture Cloud Storage events and create a trigger to send these events to the Cloud Run malware-scanner service.

  1. If you're using an existing project that was created before April 8, 2021, add the iam.serviceAccountTokenCreator role to the Pub/Sub service account:

    PROJECT_NUMBER=$(gcloud projects describe "${PROJECT_ID}" --format="value(projectNumber)")
    PUBSUB_SERVICE_ACCOUNT="service-${PROJECT_NUMBER}@gcp-sa-pubsub.iam.gserviceaccount.com"
    gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
        --member="serviceAccount:${PUBSUB_SERVICE_ACCOUNT}" \
        --role='roles/iam.serviceAccountTokenCreator'
    

    This role addition is only required for older projects and allows Pub/Sub to invoke the Cloud Run service.

  2. In Cloud Shell, grant the Pub/Sub Publisher role to the Cloud Storage service account:

    STORAGE_SERVICE_ACCOUNT=$(gsutil kms serviceaccount -p "${PROJECT_ID}")
    
    gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
      --member "serviceAccount:${STORAGE_SERVICE_ACCOUNT}" \
      --role "roles/pubsub.publisher"
    
  3. Allow the malware-scanner service account to invoke the Cloud Run service, and act as an Eventarc event receiver:

    gcloud run services add-iam-policy-binding "${SERVICE_NAME}" \
      --region="${REGION}" \
      --member "serviceAccount:${SERVICE_ACCOUNT}" \
      --role roles/run.invoker
    gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
      --member "serviceAccount:${SERVICE_ACCOUNT}" \
      --role "roles/eventarc.eventReceiver"
    
  4. Create an Eventarc trigger to capture the finalized object event in the unscanned Cloud Storage bucket and send it to your Cloud Run service. The trigger uses the malware-scanner service account for authentication:

    BUCKET_NAME="unscanned-${PROJECT_ID}"
    gcloud eventarc triggers create "trigger-${BUCKET_NAME}-${SERVICE_NAME}" \
      --destination-run-service="${SERVICE_NAME}" \
      --destination-run-region="${REGION}" \
      --location="${LOCATION}" \
      --event-filters="type=google.cloud.storage.object.v1.finalized" \
      --event-filters="bucket=${BUCKET_NAME}" \
      --service-account="${SERVICE_ACCOUNT}"
    

    If you receive one of the following errors, wait one minute and then run the command again:

    ERROR: (gcloud.eventarc.triggers.create) INVALID_ARGUMENT: The request was invalid: Bucket "unscanned-PROJECT_ID" was not found. Please verify that the bucket exists.
    
    ERROR: (gcloud.eventarc.triggers.create) FAILED_PRECONDITION: Invalid resource state for "": Permission denied while using the Eventarc Service Agent. If you recently started to use Eventarc, it may take a few minutes before all necessary permissions are propagated to the Service Agent. Otherwise, verify that it has Eventarc Service Agent role.
    
  5. Change the message acknowledgement deadline to two minutes in the underlying Pub/Sub subscription that's used by the Eventarc trigger. The default value of 10 seconds is too short for large files or high loads.

    SUBSCRIPTION_NAME=$(gcloud eventarc triggers describe \
        "trigger-${BUCKET_NAME}-${SERVICE_NAME}" \
        --location="${LOCATION}" \
        --format="get(transport.pubsub.subscription)")
    gcloud pubsub subscriptions update "${SUBSCRIPTION_NAME}" --ack-deadline=120
    

    Although your trigger is created immediately, it can take up to 10 minutes for a trigger to propagate and filter events.
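
The advice in step 4 to wait one minute and run the command again can be automated with a small retry wrapper. The following is a minimal sketch, not part of the repository or the gcloud CLI. The helper name retry and its argument order (maximum attempts, delay in seconds, then the command) are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to max_attempts times, sleeping
# delay_seconds after each failed attempt. On giving up, it returns the exit
# status of the last attempt.
retry() {
  local max_attempts="$1" delay_seconds="$2"
  shift 2
  local attempt=1 status
  until "$@"; do
    status=$?
    if (( attempt >= max_attempts )); then
      echo "Giving up after ${attempt} attempts: $*" >&2
      return "${status}"
    fi
    echo "Attempt ${attempt} failed; retrying in ${delay_seconds}s" >&2
    sleep "${delay_seconds}"
    attempt=$(( attempt + 1 ))
  done
}

# Usage (commented out so that the snippet has no side effects when sourced):
# retry 5 60 gcloud eventarc triggers create "trigger-${BUCKET_NAME}-${SERVICE_NAME}" \
#   --destination-run-service="${SERVICE_NAME}" \
#   --destination-run-region="${REGION}" \
#   --location="${LOCATION}" \
#   --event-filters="type=google.cloud.storage.object.v1.finalized" \
#   --event-filters="bucket=${BUCKET_NAME}" \
#   --service-account="${SERVICE_ACCOUNT}"
```

The wrapper retries on any failure, including permanent ones such as a mistyped bucket name, so read the error message before you rely on it.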

Create a Cloud Scheduler job to trigger ClamAV database mirror updates

  • Create a Cloud Scheduler job that executes an HTTP POST request on the Cloud Run service with a command to update the mirror of the malware definitions database. To avoid having too many clients use the same time slot, ClamAV requires that you schedule the job at a random minute between 3 and 57, avoiding multiples of 10.

    while : ; do
      # set MINUTE to a random number between 3 and 57
      MINUTE="$((RANDOM%55 + 3))"
      # exit loop if MINUTE isn't a multiple of 10
      [[ $((MINUTE % 10)) != 0 ]] && break
    done
    
    gcloud scheduler jobs create http \
        "${SERVICE_NAME}-mirror-update" \
        --location="${REGION}" \
        --schedule="${MINUTE} */2 * * *" \
        --oidc-service-account-email="${SERVICE_ACCOUNT}" \
        --uri="${SERVICE_URL}" \
        --http-method=post \
        --message-body='{"kind":"schedule#cvd_update"}' \
        --headers="Content-Type=application/json"
    

    The --schedule command-line argument defines when the job runs using the unix-cron string format. The value given indicates that the job should run at the specific randomly-generated minute every two hours.

This job only updates the ClamAV mirror in Cloud Storage. The ClamAV freshclam daemon in each instance of the Cloud Run service checks the mirror every 30 minutes for new definitions and updates the ClamAV daemon.
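
The minute-selection rule for the schedule (a minute from 3 to 57 that isn't a multiple of 10) can be wrapped in functions so that the rule is checked in one place. The following is a minimal sketch. The helper names is_allowed_minute and pick_minute are assumptions made for illustration and aren't part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only for minutes from 3 to 57 that aren't
# multiples of 10.
is_allowed_minute() {
  local minute="$1"
  [[ "${minute}" =~ ^[0-9]+$ ]] || return 1
  # The 10# prefix forces base 10 so that values such as 08 aren't read as octal.
  (( 10#${minute} >= 3 && 10#${minute} <= 57 && 10#${minute} % 10 != 0 ))
}

# Hypothetical helper: print a random allowed minute.
pick_minute() {
  local minute
  while : ; do
    minute=$(( RANDOM % 55 + 3 ))   # 3 to 57 inclusive
    is_allowed_minute "${minute}" && break
  done
  echo "${minute}"
}

MINUTE="$(pick_minute)"
echo "Schedule: ${MINUTE} */2 * * *"
```

For example, if the selected minute is 17, the resulting schedule 17 */2 * * * runs the job at 00:17, 02:17, 04:17, and so on.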

Test the pipeline by uploading files

To test the pipeline, you upload one clean (malware-free) file and one test file that mimics an infected file:

  1. Create a sample text file or use an existing clean file to test the pipeline processes.

  2. In Cloud Shell, copy the sample data file to the unscanned bucket:

    gsutil cp FILENAME "gs://unscanned-${PROJECT_ID}"
    

    Replace FILENAME with the name of the clean text file. The malware-scanner service inspects each file and moves it to an appropriate bucket. This file is moved to the clean bucket.

  3. Give the pipeline a few seconds to process the file and then check your clean bucket to see if the processed file is there:

    gsutil ls -r "gs://clean-${PROJECT_ID}"
    

    You can check that the file was removed from the unscanned bucket: