Statistics and aggregations in UDM search using YARA-L 2.0
This page describes how to run statistical queries on UDM events and group the results for analysis using YARA-L 2.0.
Overview
When dealing with a large volume of UDM events generated within your environment, it's important to understand the trends in your UDM search data. You can use statistics and aggregate functions to gain actionable insights from your UDM logs. UDM search supports all the aggregate functions in YARA-L 2.0.
You can use statistical queries for the following use cases:
Track critical metrics: You can measure the distribution and frequency of UDM events and the assets involved, such as hosts communicating with known malicious IP addresses.
Detect anomalous behaviour: You can detect unusual patterns or spikes in activity that could indicate a security incident, such as unusual network traffic volumes or unexpected login activity during unusual hours.
Analyze trends over time: You can identify changes in the security posture to evaluate the impact of controls or to find areas that need improvement, such as tracking the changes in vulnerability counts over time.
YARA-L 2.0 query structure in search
You can group and order UDM search query results using the syntax that is similar to the YARA-L structure in detection engine rules. For more information, see YARA-L 2.0 language syntax.
The query structure is as follows:
Filtering statement: The filtering statement specifies the conditions to filter events.
Match (optional): The match section specifies the fields to group by. For more information, see Match section syntax.
Outcome: The outcome section specifies the outputs of the query. For more information, see Outcome section syntax.
Order: The order section specifies the order of the query results returned. If the order (
asc
ordesc
) is not specified, it will default toasc
.Limit (optional): The limit section specifies the maximum number of rows that the query returns.
Here is an example of order and limit usage:
metadata.log_type = "OKTA"
match:
principal.ip
Outcome:
$user_count_by_ip = count(principal.user.userid)
order:
$user_count_by_ip desc
limit:
20
Aggregations
UDM search supports the following aggregate functions:
sum
sum(numericExpression)
Description
The sum
function returns the sum of values within a numeric column. It ignores
NULL
values during the calculation. It is often used with match
to calculate the sums within different groups in the data.
Param data types
NUMBER
Return type
NUMBER
Code Samples
Example
Find all the events where target.ip
is not empty. For all the events that
match on principal.ip
, store a sum of network.sent_bytes
in a variable
called sent_bytes
.
target.ip != ""
match:
principal.ip
outcome:
$sent_bytes = sum(network.sent_bytes)
min
min(numericExpression)
Description
The min
function returns the minimum of the values within a numeric column. It
is often used with match
to get the minimum value within each
group in the data.
Param data types
NUMBER
Return type
NUMBER
Code Samples
Example
Find all the events where target.ip
is not empty. For all the events that
match on principal.ip
, store the minimum of metadata.event_timestamp.seconds
in a variable called min_seconds
.
target.ip != ""
match:
principal.ip
outcome:
$min_seconds = min(metadata.event_timestamp.seconds)
max
max(numericExpression)
Description
The max
function returns the maximum of the values within a numeric column.
It is often used with match
to get the maximum value within each group in
the data.
Param data types
NUMBER
Return type
NUMBER
Code Samples
Example
Find all the events where target.ip
is not empty. For all the events that
match on principal.ip
, store the maximum of metadata.event_timestamp.seconds
in a variable called max_seconds
.
target.ip != ""
match:
principal.ip
outcome:
$max_seconds = max(metadata.event_timestamp.seconds)
avg
avg(numericExpression)
Description
The avg
function returns the average of values within a numeric column. It
ignores NULL
values during the calculation. It is often used
with match
to calculate the averages within specific groups in the data.
Param data types
NUMBER
Return type
NUMBER
Code Samples
Example
Find all the events where target.ip
is not empty. For all the events that
match on principal.ip
, store the average of metadata.event_timestamp.seconds
in a variable called avg_seconds
.
target.ip != ""
match:
principal.ip
outcome:
$avg_seconds = avg(metadata.event_timestamp.seconds)
count
count(expression)
Description
The count
function returns the number of rows within a group. It is often used
with match
to get counts for specific groups in the data.
Param data types
STRING
Return type
NUMBER
Code Samples
Example
Return the count of successful user logins over time.
metadata.event_type = "USER_LOGIN"
$security_result = security_result.action
$security_result = "ALLOW"
$date = timestamp.get_date(metadata.event_timestamp.seconds, "America/Los_Angeles")
match:
$security_result, $date
outcome:
$event_count = count(metadata.id)
count_distinct
count_distinct(expression)
Description
The count_distinct
function returns the number of rows that have distinct
values within a group. It is often used with match
to get counts for
specific groups in the data.
Param data types
STRING
Return type
NUMBER
Code Samples
Example
Return the count of distinct successful user logins over time.
metadata.event_type = "USER_LOGIN"
$security_result = security_result.action
$security_result = "ALLOW"
$date = timestamp.get_date(metadata.event_timestamp.seconds, "America/Los_Angeles")
match:
$security_result, $date
outcome:
$event_count = count_distinct(metadata.id)
array
array(expression)
Description
The array
function returns all the values in the form of a list. It truncates
the list to a maximum of 25 random elements.
Param data types
STRING
Return type
LIST
Code Samples
Example
Return an array containing event types.
$event_type = metadata.event_type
outcome:
$event_type_array = array($event_type)
array_distinct
array_distinct(expression)
Description
The array_distinct
function returns all the distinct values in the form of a
list. It truncates the list to a maximum of 25 random elements. The
deduplication to get a distinct list is applied before truncation.
Param data types
STRING
Return type
LIST
Code Samples
Example
Return an array containing distinct event types.
$event_type = metadata.event_type
outcome:
$event_type_array = array_distinct($event_type)
stddev
stddev(numericExpression)
Description
The stddev
function returns the standard deviation over all the possible
values.
Param data types
NUMBER
Return type
NUMBER
Code Samples
Example
Find all the events where target.ip
is not empty. For all the events that
match on principal.ip
, store the standard deviation of
metadata.event_timestamp.seconds
in a variable called stddev_seconds
.
target.ip != ""
match:
principal.ip
outcome:
$stddev_seconds = stddev(metadata.event_timestamp.seconds)
Differences between using YARA-L 2.0 in search and its use within UDM
The
over
keyword, which enables searching for events within a window, is not applicable in search.The UDM search query structure does not include the
condition
andoption
sections.
Group by time granularity
You can group all the event fields and placeholders in the match
section by a
specified time granularity, similar to a column you might group by in SQL.
The syntax is as follows:
match:
... [BY|OVER EVERY] [FIRST] [TIME_GRANULARITY]
To group by time granularity, you can either use the keyword by
or over
every
. The allowed time granularities are:
MINUTE
orm
HOUR
orh
DAY
ord
WEEK
orw
MONTH
ormo
Both the by
and the over every
keywords are functionally equivalent. You can
use one over the other.
Examples
Group IP address and hostname by the hour.
$hostname = principal.hostname
match:
$hostname, target.ip by hour
Group the count of all events by hostname and by the day the event occurred.
$hostname = target.hostname
match:
$hostname over every day
outcome:
$events_count = count($hostname)
Some data sources, like the entity context, are valid over a time range
(<start_time>
, <end_time>
) and don't bear singular timestamps.
The first
keyword is an optional keyword that treats a data source that is
valid over a time range as if it's a data source that is only valid on a
singular timestamp. This means that for a data source valid over a time range,
the keyword first
takes into account only the start time (<start_time>
),
disregarding the end time of the time range.
For example, consider an entity that has a time range of (1m, 5m
) with a time
granularity of 1m. Assume that the results are grouped by hosts, which are
[h1, h2
]. The columns that are then returned are (h1, 1m
) and (h2,
1m
) as the rest of the time range is disregarded.
The first
keyword can be added to both by
and over every
, which results in
the same behavior for both. The use of by first
is equivalent to
over every first
.
The following is an example of a query that uses the by
operator with the
entity context data source that is valid over a time range. In this query, the
entire time range is considered because the first
keyword is omitted.
graph.entity.hostname != ""
match:
graph.entity.ip by hour
outcome:
$min_seconds = min(graph.metadata.event_metadata.event_timestamp.seconds)