Ingestion Delay Variance Race Conditions in SIEM Systems

Full disclosure: this post is fairly technical, not as “first principles” as I’d like it to be (I’m rushing it out a bit as I’d like to refer to it in a pull request I’m planning to make against Microsoft’s documentation regarding the issue), and it will require a bit of touching up to be in the state that’s typical of stuff I post (I try to make things as accessible and generally readable as possible). I also plan to add more discussion surrounding how to mitigate the issue in your SIEM. For now, I hope the reader can excuse these limitations (I’ve got quite a few time constraints these days 😅). Read on below for an interesting tale of a subtle flaw in a common design pattern in event processing systems nonetheless (:.

The PR I made against Microsoft’s documentation for the issue: https://github.com/MicrosoftDocs/azure-docs/pull/122521.

Do you work in a security operations center? Or are you what some teams call a “detection engineer”? Well then I have an interesting thing for you: a subtle logical flaw in a common design pattern associated with a very common detection design strategy (SIEM event correlation). In other words, a design flaw that could cause any detection defined in an event processing language (like Microsoft’s KQL or Splunk’s SPL) that relies on the presence of two events correlated by some field value to silently and spuriously fail. And it’s a design I’d be willing to bet you’ve used, and probably are using right now.

Finding the Flaw

For most of my career I’ve struggled to find a clear description for the jobs I’ve held. I started my career working as what would most properly be called a software engineer who focused mainly on building security products. For the most part that’s still what I am: a software engineer building security products, although now I work on a smaller team which exists solely to support an internal security operations team.

While at times I’ve worked on tasks that are more closely related to what people would call “security engineering” (threat emulation, for instance), for the most part I’ve focused almost entirely on software engineering and related tasks (maintaining complex systems with multiple, substantial and disparate components, workshopping design for systems which span multiple teams / technology stacks, root cause analysis for complex, unexpected behaviors, etc.).

The kicker is that I’ve always, just by chance, worked closely with security engineers. This can be difficult at times since we’re all pretty passionate people. They tend to want to do some of the engineering and I tend to want to do some of the analysis, which can lead to poor results when it turns out either of us is missing some fundamentals required to do the other’s job.

But I digress. My software-engineer-but-not-quite-software-engineer job history has given me a unique insight into how other people think about software engineers and how different the role is from other types of technical roles. It’s also taught me how easy it is to overlook complex, nuanced aspects of system design, and how bad things can get when that complexity is ignored.

The overlap between security engineering and software engineering teams (where I’ve spent most of my career) is one of the places where that happens: security engineers struggle to understand the nuance associated with system design, underestimate the challenges, and often don’t effectively scope work, leaving them with unreasonable expectations of their software engineer counterparts.

What’s worse is that most security engineers are familiar with some form of scripting, so I suspect there’s a good deal of confirmation bias where they feel software engineering tasks aren’t really that complicated and can be done quickly without too much deep thought (or, conversely, they see the deep thought and slow pace taken by software engineering types on problems they are familiar with as unnecessary and strange).

Anyways, this is a good example of a case where a seemingly simple task, like correlating the occurrence of two events in a search processing language, can be quite complex (and a good value case for software engineers who understand security, which is me :p).

The Race Condition

Index time delays are a known issue in query design: https://learn.microsoft.com/en-us/azure/sentinel/ingestion-delay. The problem, effectively, is that you are trying to pin down a time window over which you would like to consider events by using “scheduled searches” with pre-configured time ranges that are relative to the time the search is set to execute.

The simplest case is where you schedule a search to run every N units of time and have it look back over N units of time of data (the time between runs and the time over which you look back are equal). You wouldn’t think there would be any problems with this design, but there are. Search processing systems are quite complex and layered (sort of like database management systems), and because of that it can take quite a long time between when an event occurs on a system and when that event actually makes it into the event processing system’s data storage component and becomes available for querying.
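As a concrete sketch of that naive pattern in SPL (using the same <target data set> placeholder as the footnotes), this would be a saved search scheduled every 15 minutes whose time range simply mirrors its schedule:

index=<target data set> earliest=-15m@m latest=@m

Note that earliest/latest here filter on _time, the time the event actually occurred, which is exactly why the delay described below matters.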

There’s a lot of “stuff” between the system on which the event was generated and the data store in which the event processing system keeps its events: the operating system of the system which generates the event, the network, the system(s) on which the event processing system lives, and finally the event processing system’s internals.

Because of all that there can be a delay between the time the event is actually generated and the time the event is available for querying. An event with an actual time of, say, 10:12 might not exist in the event processing system’s data stores until 10:22. If you schedule a search to run every 15 minutes and look back 15 minutes over event time, and that search kicks off at a cadence of 10:00, 10:15, 10:30, etc., then you’d miss this event: the 10:15 run covers event times 10:00–10:15 but the event hasn’t been indexed yet, and by the time the 10:30 run executes, its 10:15–10:30 window no longer includes the event’s actual time of 10:12.

The fix for this, however, is a known design pattern. Microsoft’s documentation contains a good description of how to solve this issue in its simplest form. However, if you want to correlate on more than a single event, things get complicated.

With two events you aren’t just dealing with a single index time delay, you’re dealing with two index time delays, and they may be different.

Consider the following case, where two events, a1 and a2, are intended to be correlated together with no defined upper limit on the time (based on the time the events actually occurred on the system) that has elapsed between them (1), and where the search uses the typical pattern of executing every 15 minutes and only looking at events that have been indexed in the last 15 minutes, based on the time the search executes.

In this case, event a1 has a raw event time (the time the event actually occurred on the system) of 10:09:03 and an ingestion time (the time at which the event was ingested by Splunk) of 10:15:03, and a2 has a raw event time of 10:10:00 and an ingestion time of 10:13:00.

If the ingestion delay were the same for both events, the design pattern Microsoft recommends (2) would suffice to account for the delay. However, because the ingestion delay can vary between the two events, it’s possible for them to be considered by different search executions even though each event is still considered by a search at some point, and therefore the correlation never occurs. In the case above, with runs kicking off at 10:00, 10:15, 10:30, etc., the 10:15 run’s index-time window picks up a2 (indexed at 10:13:00) but not a1 (indexed at 10:15:03), while the 10:30 run picks up a1 but not a2. Each event is seen exactly once, but never by the same execution, so the correlation silently fails.

This is a pretty nasty issue: it could occur for any search which correlates on two events, regardless of whether you’ve applied the design pattern Microsoft currently recommends to fix other index time delay issues. You can use a search like the one in footnote 3 to get a sense of the “variance” of the ingest time delays in a data source.
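To make the shape of the affected searches concrete, here’s a sketch (with hypothetical index, sourcetype, and field names) of a typical two-event correlation in SPL that applies the footnote 2 mitigation and is still vulnerable:

``` hypothetical two-event correlation, scoped by index time per the usual mitigation ```
(index=auth sourcetype=login_failure) OR (index=auth sourcetype=login_success)
  _index_earliest=-15m@m _index_latest=@m earliest=-720m@m latest=@m
| stats values(sourcetype) as sourcetypes by user
| where mvcount(sourcetypes) >= 2

If the two events for a given user are indexed on opposite sides of a run boundary, no single run’s _index_earliest/_index_latest window contains both of them, so no run ever sees both sourcetypes for that user and the alert never fires.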

Resolving the Issue

Unfortunately I don’t, at this time, see an elegant way of resolving this issue directly using the structure of the query or the configuration of the time parameters. I suspect that the only way to effectively resolve this issue will involve designing your queries using only actual event times with a broad enough range to cover all observed ingest delay times and finding a way to tolerate the ensuing duplication.

I’m not aware of a quick and easy way of doing this in Sentinel, although I can imagine a few ways of doing so in Splunk.
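One rough sketch of what I have in mind in SPL (again with hypothetical names, and assuming the footnote 2 search showed a worst-case delay of around 12 hours): drop the index-time filters entirely, widen the event-time window to cover the full worst-case delay, and accept that the same pair of events will match on several consecutive runs:

``` correlate on event time only, over a window wide enough to cover the worst-case ingestion delay ```
(index=auth sourcetype=login_failure) OR (index=auth sourcetype=login_success)
  earliest=-12h@m latest=@m
| stats values(sourcetype) as sourcetypes, min(_time) as firstSeen by user
| where mvcount(sourcetypes) >= 2

The resulting duplication then has to be absorbed elsewhere, for example with the saved search’s alert throttling (suppress results by the correlation field, user here, for roughly the width of the look-back), at the cost of not re-alerting on a genuinely new pair of events for the same user during the suppression window.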

TODO: write some discussion of how you can design around the issue, challenge others for a conceptual fix, do some general exploration / reflection of things.

Footnotes

1) The time period for which correlation should occur in this hypothetical case isn’t important to the problem description so long as a1 comes after a2.

2) In SPL you can mitigate race conditions of this type by querying the expected ingestion delays for the target data source with a search like

index=<target data set>
| eval indexTimeDelay = _indextime - _time
| stats max(indexTimeDelay) as maxIndexTimeDelay

then using the earliest/latest and _index_earliest/_index_latest time modifiers to account for the largest observed delay. For instance, if the search above showed the worst-case index time delay was less than 720 minutes (12 hours), then

_index_earliest=-15m@m _index_latest=@m earliest=-720m@m latest=@m

would suffice.

3) You can use the SPL below to compute the “variance” (the difference between the smallest and largest ingestion delay observed for events in a given sample) for 10-minute buckets of events

index=<index name>
| eval indexTimeDelay = _indextime - _time
| bin _time span=10m
| eventstats min(indexTimeDelay) as minIndexDelay, max(indexTimeDelay) as maxIndexDelay by _time
| eval indexTimeDelayVariance = maxIndexDelay - minIndexDelay
| dedup indexTimeDelayVariance,_time
| table indexTimeDelayVariance _time
| sort - indexTimeDelayVariance

