Statistics and aggregations in UDM search using YARA-L 2.0

Supported in:

This page describes how to run statistical queries on UDM events and group the results for analysis using YARA-L 2.0.

Overview

When dealing with a large volume of UDM events generated within your environment, it's important to understand the trends in your UDM search data. You can use statistics and aggregate functions to gain actionable insights from your UDM logs. UDM search supports all the aggregate functions in YARA-L 2.0.

You can use statistical queries for the following use cases:

  • Track critical metrics: You can measure the distribution and frequency of UDM events and the assets involved, such as hosts communicating with known malicious IP addresses.

  • Detect anomalous behaviour: You can detect unusual patterns or spikes in activity that could indicate a security incident, such as unusual network traffic volumes or unexpected login activity during unusual hours.

  • Analyze trends over time: You can identify changes in the security posture to evaluate the impact of controls or to find areas that need improvement, such as tracking the changes in vulnerability counts over time.

You can group and order UDM search query results using the syntax that is similar to the YARA-L structure in detection engine rules. For more information, see YARA-L 2.0 language syntax.

The query structure is as follows:

  1. Filtering statement: The filtering statement specifies the conditions to filter events.

  2. Match (optional): The match section specifies the fields to group by. For more information, see Match section syntax.

  3. Outcome: The outcome section specifies the outputs of the query. For more information, see Outcome section syntax.

  4. Order: The order section specifies the order of the query results returned. If the order (asc or desc) is not specified, it will default to asc.

  5. Limit (optional): The limit section specifies the maximum number of rows that the query returns.

Here is an example of order and limit usage:

metadata.log_type = "OKTA"

match:
    principal.ip
Outcome:
    $user_count_by_ip = count(principal.user.userid)

order:
 $user_count_by_ip desc

limit:
    20

Aggregations

UDM search supports the following aggregate functions:

sum

sum(numericExpression)

Description

The sum function returns the sum of values within a numeric column. It ignores NULL values during the calculation. It is often used with match to calculate the sums within different groups in the data.

Param data types

NUMBER

Return type

NUMBER

Code Samples

Example

Find all the events where target.ip is not empty. For all the events that match on principal.ip, store a sum of network.sent_bytes in a variable called sent_bytes.

  target.ip != ""
  match:
    principal.ip
  outcome:
    $sent_bytes = sum(network.sent_bytes)

min

min(numericExpression)

Description

The min function returns the minimum of the values within a numeric column. It is often used with match to get the minimum value within each group in the data.

Param data types

NUMBER

Return type

NUMBER

Code Samples

Example

Find all the events where target.ip is not empty. For all the events that match on principal.ip, store the minimum of metadata.event_timestamp.seconds in a variable called min_seconds.

  target.ip != ""
  match:
    principal.ip
  outcome:
    $min_seconds = min(metadata.event_timestamp.seconds)

max

max(numericExpression)

Description

The max function returns the maximum of the values within a numeric column. It is often used with match to get the maximum value within each group in the data.

Param data types

NUMBER

Return type

NUMBER

Code Samples

Example

Find all the events where target.ip is not empty. For all the events that match on principal.ip, store the maximum of metadata.event_timestamp.seconds in a variable called max_seconds.

  target.ip != ""
  match:
    principal.ip
  outcome:
    $max_seconds = max(metadata.event_timestamp.seconds)

avg

avg(numericExpression)

Description

The avg function returns the average of values within a numeric column. It ignores NULL values during the calculation. It is often used with match to calculate the averages within specific groups in the data.

Param data types

NUMBER

Return type

NUMBER

Code Samples

Example

Find all the events where target.ip is not empty. For all the events that match on principal.ip, store the average of metadata.event_timestamp.seconds in a variable called avg_seconds.

  target.ip != ""
  match:
    principal.ip
  outcome:
    $avg_seconds = avg(metadata.event_timestamp.seconds)

count

count(expression)

Description

The count function returns the number of rows within a group. It is often used with match to get counts for specific groups in the data.

Param data types

STRING

Return type

NUMBER

Code Samples

Example

Return the count of successful user logins over time.

  metadata.event_type = "USER_LOGIN"
  $security_result = security_result.action
  $security_result = "ALLOW"
  $date = timestamp.get_date(metadata.event_timestamp.seconds, "America/Los_Angeles")
  match:
      $security_result, $date
  outcome:
      $event_count = count(metadata.id)

count_distinct

count_distinct(expression)

Description

The count_distinct function returns the number of rows that have distinct values within a group. It is often used with match to get counts for specific groups in the data.

Param data types

STRING

Return type

NUMBER

Code Samples

Example

Return the count of distinct successful user logins over time.

  metadata.event_type = "USER_LOGIN"
  $security_result = security_result.action
  $security_result = "ALLOW"
  $date = timestamp.get_date(metadata.event_timestamp.seconds, "America/Los_Angeles")
  match:
      $security_result, $date
  outcome:
      $event_count = count_distinct(metadata.id)

array

array(expression)

Description

The array function returns all the values in the form of a list. It truncates the list to a maximum of 25 random elements.

Param data types

STRING

Return type

LIST

Code Samples

Example

Return an array containing event types.

  $event_type = metadata.event_type
  outcome:
    $event_type_array = array($event_type)

array_distinct

array_distinct(expression)

Description

The array_distinct function returns all the distinct values in the form of a list. It truncates the list to a maximum of 25 random elements. The deduplication to get a distinct list is applied before truncation.

Param data types

STRING

Return type

LIST

Code Samples

Example

Return an array containing distinct event types.

  $event_type = metadata.event_type
  outcome:
    $event_type_array = array_distinct($event_type)

stddev

stddev(numericExpression)

Description

The stddev function returns the standard deviation over all the possible values.

Param data types

NUMBER

Return type

NUMBER

Code Samples

Example

Find all the events where target.ip is not empty. For all the events that match on principal.ip, store the standard deviation of metadata.event_timestamp.seconds in a variable called stddev_seconds.

  target.ip != ""
  match:
    principal.ip
  outcome:
    $stddev_seconds = stddev(metadata.event_timestamp.seconds)

Differences between using YARA-L 2.0 in search and its use within UDM

  • The over keyword, which enables searching for events within a window, is not applicable in search.

  • The UDM search query structure does not include the condition and option sections.

Group by time granularity

You can group all the event fields and placeholders in the match section by a specified time granularity, similar to a column you might group by in SQL.

The syntax is as follows:

match:
  ... [BY|OVER EVERY] [FIRST] [TIME_GRANULARITY]

To group by time granularity, you can either use the keyword by or over every. The allowed time granularities are:

  • MINUTE or m
  • HOUR or h
  • DAY or d
  • WEEK or w
  • MONTH or mo

Both the by and the over every keywords are functionally equivalent. You can use one over the other.

Examples

Group IP address and hostname by the hour.

$hostname = principal.hostname
match:
  $hostname, target.ip by hour

Group the count of all events by hostname and by the day the event occurred.

$hostname = target.hostname
match:
  $hostname over every day
outcome:
  $events_count = count($hostname)

Some data sources, like the entity context, are valid over a time range (<start_time>, <end_time>) and don't bear singular timestamps.

The first keyword is an optional keyword that treats a data source that is valid over a time range as if it's a data source that is only valid on a singular timestamp. This means that for a data source valid over a time range, the keyword first takes into account only the start time (<start_time>), disregarding the end time of the time range.

For example, consider an entity that has a time range of (1m, 5m) with a time granularity of 1m. Assume that the results are grouped by hosts, which are [h1, h2]. The columns that are then returned are (h1, 1m) and (h2, 1m) as the rest of the time range is disregarded.

The first keyword can be added to both by and over every, which results in the same behavior for both. The use of by first is equivalent to over every first.

The following is an example of a query that uses the by operator with the entity context data source that is valid over a time range. In this query, the entire time range is considered because the first keyword is omitted.

graph.entity.hostname != ""
match:
  graph.entity.ip by hour
outcome:
  $min_seconds = min(graph.metadata.event_metadata.event_timestamp.seconds)