Create multi-stage queries in YARA-L

Supported in:

Google secops SIEM

This document describes how multi-stage queries in YARA-L let you feed the output of one query stage directly into the input of a subsequent stage. This process gives you greater control over data transformation than a single, monolithic query.

Integrate multi-stage queries with existing features

Multi-stage queries work in conjunction with the following existing features in Google Security Operations:

Composite detection rules: Multi-stage queries complement composite detection rules. Unlike composite rules, multi-stage queries that use Search can return results in real-time.
Time ranges and multi-event rules: You can use multi-stage queries to detect anomalies by comparing different time windows within your data. For example, you can use your initial query stages to establish a baseline over an extended period, and then use a later stage to evaluate recent activity against that baseline. You can also use multi-event rules to create a similar type of comparison.

Multi-stage queries in YARA-L are supported in both Dashboards and Search.

Joins help correlate data from multiple sources to provide more context for an investigation. By linking related events, entities, and other data, you can investigate complex attack scenarios. For more information, see Use joins in Search.

Define multi-stage YARA-L syntax

As you configure a multi-stage query, be aware of the following:

Limit stage: Multi-stage queries must contain between 1 and 4 named stages, in addition to the root stage.
Order syntax: Always define the named stage syntax before defining the root stage syntax.

Create a multi-stage YARA-L query

To create a multi-stage YARA-L query, complete the following steps.

Stage structure and syntax

Go to Investigation > Search. Follow these structural requirements when you define your query stages:

Syntax: Use the following syntax to name each stage and separate it from other stages:

stage <stage name> { }

Braces: Place all stage syntax inside curly braces {}.
Order: Define the syntax for all named stages before defining the root stage.
Referencing: Each stage can reference stages defined earlier in the query.
Root stage: A query must have a root stage, which is processed after all named stages.

The following example stage, daily_stats, collects daily network statistics:

stage daily_stats {
  metadata.event_type = "NETWORK_CONNECTION"
  $source = principal.hostname
  $target = target.ip
  $source != ""
  $target != ""
  $total_bytes = cast.as_int(network.sent_bytes + network.received_bytes)
  match:
    $source, $target by day
  outcome:
    $exchanged_bytes = sum($total_bytes)
}

Access stage Output

The output of a named stage is accessible to subsequent stages using stage fields. Stage fields correspond with the stage's match and outcome variables and can be used similarly to Unified Data Model (UDM) fields.

Use the following syntax to access a stage field:

$<stage name>.<variable name>

Access window timestamps (optional)

If a named stage uses a hop, sliding, or tumbling window, access the window start and window end for each output row using these reserved fields:

$<stage name>.window_start
$<stage name>.window_end

The window_start and window_end are integer fields expressed in seconds since the Unix epoch. Windows in different stages can vary in size.

Limitations

Multi-stage queries have the following functional and structural constraints:

Structural and stage limits

Root stage: Only one root stage is allowed per query.
Named stages: A maximum of four named stages are supported.
Stage referencing: A stage can only reference stages defined logically before it in the same query.
Joins: A maximum of four non-data-table joins are allowed across all stages.
Outcome requirement: Each named stage (excluding the root stage) must include either a match section or an outcome section. The outcome section doesn't require aggregation.

Window and compatibility limits

Feature support: Multi-stage queries are supported in Search and Dashboards, but not supported in Rules.
Window types: Avoid mixing different window types within a single query.
Window dependency: A stage using a hop or sliding window can't depend on another stage that also uses a hop or sliding window.
Tumbling window size: While tumbling windows in different stages can vary in size, the difference in size must be less than 720x.

Example: Stage aggregation difference

The following example window configuration isn't allowed:

stage monthly_stats {
  metadata.event_type = "NETWORK_CONNECTION"
    $source = principal.hostname
    $target = target.ip
    $source != ""
    $target != ""
    $total_bytes = cast.as_int(network.sent_bytes + network.received_bytes)

  match:
    $source, $target by month

  outcome:
    $exchanged_bytes = sum($total_bytes)
}

$source = $monthly_stats.source
$target = $monthly_stats.target

match:
    $source, $target by minute

If the monthly_stats stage aggregates data by month, and the root stage aggregates the output of monthly_stats by minute, then each row from monthly_stats maps to 43,200 rows in the root stage (because there are 43,200 minutes in one month).

Stage and query limitations

Each individual stage within a multi-stage query has the following constraints:

Most limitations that apply to a single-stage query also apply to each individual stage:
- Output requirement: Every stage must output at least one match or outcome variable (stage field).
- Window in join: The maximum window size (hop, tumbling, or sliding) used in a join is 2 days.
- Maximum number of outcome variables:
  - 20 for customers not opted in to allow larger outcome variable limit
  - 50 for customers opted in to allow larger outcome variable limit
- Minimum and maximum size of a hop window.
- Maximum number of elements in an array-valued outcome variable.
Multi-stage queries are subject to the same limitations as statistics queries:
- Statistics queries: 120 QPH (API and UI)
- Search views from Google SecOps: 100 views per minute
- Multi-stage joins are supported in the user interface and the EventService.UDMSearch API, but not in the SearchService.UDMSearch API. Multi-stage queries without joins are also supported in the user interface.

Event and global limitations

Maximum events:

Multi-stage queries are strictly limited in the number of events they can process simultaneously:

UDM events: A maximum of 2 UDM events are allowed.
Entity Context Graph (ECG) events: A maximum of one ECG event is allowed.

Global query limitations:

These limits are platform-wide constraints that control how far back and how much data a multi-stage query can return.

For a query time range, the maximum time range for a standard query is 30 days.
The maximum total result set size is 10,000 results.

Multi-stage query examples

The examples in this section help to illustrate how you might create a complete multi-stage YARA-L query.

Example: Search for unusually active network connections (hours)

This multi-stage YARA-L example identifies IP address pairs with higher-than-normal network activity, targeting pairs that maintain high activity for more than three hours. The query includes two required components: the named stage, hourly_stats, and the root stage.

The hourly_stats stage searches for principal.ip and target.ip pairs with high levels of network activity.

This stage returns a single hourly value for following fields:

Statistics for the source IP (string): $hourly_stats.src_ip
Statistics for the destination IP (string): $hourly_stats.dst_ip
Statistics for the count of events (integer): $hourly_stats.count
Standard deviation received bytes (float): $hourly_stats.std_recd_bytes
Average received bytes (float): $hourly_stats.avg_recd_bytes
Hour bucket start time in seconds from the Unix epoch (integer): $hourly_stats.window_start
Hour bucket end time in seconds from the Unix epoch (integer): $hourly_stats.window_end

The root stage processes the output of the hourly_stats stage. It calculates statistics for principal.ip and target.ip pairs with activity exceeding the threshold specified by $hourly_stats. It then filters for pairs with more than three hours of high activity.


stage hourly_stats {
  metadata.event_type = "NETWORK_CONNECTION"
  $src_ip = principal.ip
  $dst_ip = target.ip
  $src_ip != ""
  $dst_ip != ""

  match:
    $src_ip, $dst_ip by hour

  outcome:
    $count = count(metadata.id)
    $avg_recd_bytes = avg(network.received_bytes)
    $std_recd_bytes = stddev(network.received_bytes)

  condition:
    $avg_recd_bytes > 100 and $std_recd_bytes > 50
}

$src_ip = $hourly_stats.src_ip
$dst_ip = $hourly_stats.dst_ip
$time_bucket_count = strings.concat(timestamp.get_timestamp($hourly_stats.window_start), "|", $hourly_stats.count)

match:
 $src_ip, $dst_ip

outcome:
 $list = array_distinct($time_bucket_count)
 $count = count_distinct($hourly_stats.window_start)

condition:
 $count > 3

If you alter the match condition in the root stage as follows, you can introduce a windowed aggregation by day for the multi-stage query.

match:
 $src_ip, $dst_ip by day

Example: Search for unusually active network connections (using Z-score)

This multi-stage query compares the daily average network activity against today's activity using a Z-score calculation (measuring the number of standard deviations away from the mean). This query effectively searches for unusually high network activity between internal assets and external systems.

Prerequisite: The query time window must be greater than or equal to 2 days and include the current day for the calculated Z-score to be effective.

This multi-stage query includes the daily_stats stage and the root stage, which work together to calculate the Z-score for network activity:

The daily_stats stage performs the initial daily aggregation. It calculates the total bytes exchanged each day for each IP pair (source and target) and returns the following stage fields (corresponding with columns in output rows):
- $daily_stats.source: singular, string
- $daily_stats.target: singular, string
- $daily_stats.exchanged_bytes: singular, integer
- $daily_stats.window_start: singular, integer
- $daily_stats.window_end: singular, integer
The root stage aggregates the daily_stats stage output for each IP pair. It calculates the average and standard deviation of the daily bytes exchanged across the entire search range, along with the bytes exchanged today. It uses those three calculated values to determine the Z-score.
The output lists the Z-scores for all of today's IP pairs, sorted in descending order.

// Calculate the total bytes exchanged per day by source and target

stage daily_stats {
  metadata.event_type = "NETWORK_CONNECTION"
  $source = principal.hostname
  $target = target.ip
  $source != ""
  $target != ""
  $total_bytes = cast.as_int(network.sent_bytes + network.received_bytes)
  match:
    $source, $target by day
  outcome:
    $exchanged_bytes = sum($total_bytes)
}

// Calculate the average per day over the time window and compare with the bytes
   exchanged today

$source = $daily_stats.source
$target = $daily_stats.target
$date = timestamp.get_date($daily_stats.window_start)

match:
  $source, $target

outcome:
  $today_bytes = sum(if($date = timestamp.get_date(timestamp.current_seconds()), $daily_stats.exchanged_bytes, 0))
  $average_bytes = window.avg($daily_stats.exchanged_bytes)
  $stddev_bytes = window.stddev($daily_stats.exchanged_bytes)
  $zscore = ($today_bytes - $average_bytes) / $stddev_bytes

order:
  $zscore desc

Export unaggregated variables from stages

Named stages can include an unaggregated outcome section. This means that variables defined within that outcome section are output directly from the stage, letting subsequent stages access them as stage fields without requiring a grouped aggregation.

Example: Export unaggregated variable

This example demonstrates how to export unaggregated variables. Note the following logic:

top_5_bytes_sent stage searches for the five events with the highest network activity.
top_5_bytes_sent stage outputs the following stage fields corresponding with columns in output rows:
- $top_5_bytes_sent.bytes_sent: singular, integer
- $top_5_bytes_sent.timestamp_seconds: singular, integer
The root stage computes the latest and earliest timestamps for the five events with the highest network activity.

stage top_5_bytes_sent {
  metadata.event_type = "NETWORK_CONNECTION"
  network.sent_bytes > 0

  outcome:
    $bytes_sent = cast.as_int(network.sent_bytes)
    $timestamp_seconds = metadata.event_timestamp.seconds

  order:
    $bytes_sent desc 
  
  limit:
    5
}

outcome:
  $latest_timestamp = timestamp.get_timestamp(max($top_5_bytes_sent.timestamp_seconds))
  $earliest_timestamp = timestamp.get_timestamp(min($top_5_bytes_sent.timestamp_seconds))

Implement windowing in multi-stage queries

Multi-stage queries support all types of windowing (hop, sliding, and tumbling) in named stages. If a named stage includes a window, the window start and window end for each output row is accessible using the following reserved fields:

$<stage name>.window_start
$<stage name>.window_end

Example: Hop window

The following example illustrates how you might use hop windows in a multi-stage query:

hourly_stats stage searches for IP pairs that have high network activity within the same hour.
hourly_stats outputs the following stage fields corresponding with columns in output rows:
- $hourly_stats.src_ip: singular, string
- $hourly_stats.dst_ip: singular, string
- $hourly_stats.count: singular, integer
- $hourly_stats.std_recd_bytes: singular, float
- $hourly_stats.avg_recd_bytes: singular, float
- $hourly_stats.window_start: singular, integer
- $hourly_stats.window_end: singular, integer
Root stage filters out IP pairs with more than 3 hours of high activity. The hours could be overlapping due to usage of a hop window in the hourly_stats stage.

stage hourly_stats {
  metadata.event_type = "NETWORK_CONNECTION"
  $src_ip = principal.ip
  $dst_ip = target.ip
  $src_ip != ""
  $dst_ip != ""

  match:
    $src_ip, $dst_ip over 1h

  outcome:
    $count = count(metadata.id)
    $avg_recd_bytes = avg(network.received_bytes)
    $std_recd_bytes = stddev(network.received_bytes)

  condition:
    $avg_recd_bytes > 100 and $std_recd_bytes > 50
}

$src_ip = $hourly_stats.src_ip
$dst_ip = $hourly_stats.dst_ip
$time_bucket_count = strings.concat(timestamp.get_timestamp($hourly_stats.window_start), "|", $hourly_stats.count)

match:
 $src_ip, $dst_ip

outcome:
 $list = array_distinct($time_bucket_count)
 $count = count_distinct($hourly_stats.window_start)

condition:
 $count > 3

Known issues

We recommend that you review the following limitations and recommended workarounds when you implement multi-stage queries:

All multi-stage queries behave like statistics Search queries (the output consists of aggregated statistics rather than unaggregated events or data table rows).
The performance of joins with UDM and entity events on one side can experience low performance due to the size of that dataset. We strongly recommend filtering the UDM and entity events side of the join as much as possible (for example, filter on event type).

For general guidance on recommended practices, see Yara-L best practices and for information specific to joins, see Best practices.

Need more help? Get answers from Community members and Google SecOps professionals.