How to Leverage Basic Statistical Methods to Detect Anomalies in Time Series
In our previous post, we described a way to automatically separate the trend and seasonality components of a time series from the remainder.
The remainder component is crucial for detecting anomalies efficiently. Intuitively, it is the part of the analyzed time series that includes neither the trend effect nor the seasonality effect.
For that reason, running an anomaly detection algorithm over the remainder component is a more sensible approach than searching for anomalies in an unprocessed time series, because the remainder is mostly decoupled from trend and seasonality effects. Otherwise, learning the seasonality and trend contributions would have to be part of the anomaly detection system, making it extremely application specific. In other words, you would need to create a different system for every monitored application.
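To make the idea of a remainder concrete, here is a minimal sketch of a classical additive decomposition using only the Python standard library. It is a simplified stand-in, not the method from the previous post: the trend is estimated with a centered moving average and the seasonality with per-phase averages. The function name `decompose_additive` and the synthetic series are assumptions made for illustration.

```python
from statistics import fmean

def decompose_additive(series, period):
    """Split a series into trend, seasonal and remainder parts (additive model)."""
    n = len(series)
    half = period // 2
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")

    # Trend: centered moving average spanning one full period.
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: the window holds period + 1 points, so the two
            # end points get half weight each.
            total -= 0.5 * (window[0] + window[-1])
        trend[t] = total / period

    # Seasonality: average detrended value for each position in the cycle,
    # centered so the seasonal effects sum to zero over one period.
    by_phase = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            by_phase[t % period].append(series[t] - trend[t])
    phase_means = [fmean(vals) for vals in by_phase]
    offset = fmean(phase_means)
    seasonal = [phase_means[t % period] - offset for t in range(n)]

    # Remainder: whatever is left once trend and seasonality are removed.
    # It is undefined (None) at the edges, where the trend has no estimate.
    remainder = [None if trend[t] is None else series[t] - trend[t] - seasonal[t]
                 for t in range(n)]
    return trend, seasonal, remainder

# Synthetic example: linear trend + period-4 seasonality + one injected spike.
pattern = [3.0, -1.0, -2.0, 0.0]
series = [0.5 * t + pattern[t % 4] for t in range(24)]
series[10] += 6.0
trend, seasonal, remainder = decompose_additive(series, period=4)
```

In this synthetic series the injected spike is the only thing the trend and seasonal parts cannot explain, so it stands out in the remainder. Note that a single spike also leaks a little into the neighboring trend estimates, which is one reason more robust decompositions are preferred in practice.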
Before moving forward, let’s discuss the purpose of anomaly detection: to flag the samples in time series data that “seem unusual.”
The motivations are many. One example is identifying fraudulent transactions among the transactions made with your credit card. In the monitoring space, a paradigmatic use case is pseudo-predictive notification when a monitored system starts to behave unusually slowly.
The different pieces of information used to decide whether a sample is an anomaly are called features. For example, in a credit card fraud application, the first feature could be the amount spent on a given transaction, the second the geographical zone in which the purchase was made, the third the distance between that point and the place of the previous transaction, and the fourth the number of seconds elapsed since the previous transaction. Together, these four features could make a nice starting point for any Machine Learning algorithm.
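As a rough illustration of what such a feature vector might look like, the sketch below derives the four features from two consecutive transactions. The `Transaction` fields, the `features` helper and the use of the haversine great-circle distance are all assumptions chosen for the example; a real fraud system would define its own.

```python
import math
from dataclasses import dataclass

@dataclass
class Transaction:
    amount: float      # amount spent
    zone: str          # geographical zone of the purchase
    lat: float         # latitude in degrees
    lon: float         # longitude in degrees
    timestamp: float   # seconds since some fixed epoch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def features(prev, cur):
    """Return (amount, zone, distance from previous purchase, seconds elapsed)."""
    return (
        cur.amount,
        cur.zone,
        haversine_km(prev.lat, prev.lon, cur.lat, cur.lon),
        cur.timestamp - prev.timestamp,
    )

prev = Transaction(10.0, "ES", 40.0, -3.0, 0.0)
cur = Transaction(250.0, "FR", 41.0, -3.0, 3600.0)
print(features(prev, cur))
```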
There is a very comprehensive article by Goldstein and Uchida in which 19 of the most relevant anomaly detection algorithms are benchmarked across different use cases and datasets. I highly encourage anyone who is curious to read it.
Nevertheless, for anomaly detection on general single-feature (a.k.a. univariate) time series, there are other approaches based on classical statistics that provide a better overall result when dealing with multiple heterogeneous applications. As we indicated earlier in this post, the remainder component is fairly decoupled from any seasonality or trend effect.
Imagine a web application. The remainder component of a time series that monitors it is usually the result of many uncorrelated factors, such as the load on the physical machine that hosts the virtual servers where the web servers run, weather conditions that may change your customers’ habits, a tweet that went viral, and much more.
To sum up, all those different “random” effects are consolidated mainly in the remainder component. Hence, by the Central Limit Theorem, it is fair to treat our remainder component as approximately normally distributed.
So, our problem of detecting anomalies in any metric measuring the response time of a public-facing web platform can be restated as identifying anomalies/outliers among the samples of a normal distribution: the remainder component of our time series.
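The Central Limit Theorem argument can be checked with a small simulation. The sketch below is only an illustration under strong assumptions: each "random factor" is modeled as an independent uniform value in [-1, 1], and the remainder is their sum. The function names and the choice of 12 factors are made up for the example.

```python
import random
from statistics import fmean, stdev

def simulate_remainder(n_samples, n_factors, seed=0):
    """Each sample is the sum of n_factors independent uniform effects."""
    rng = random.Random(seed)
    return [sum(rng.uniform(-1.0, 1.0) for _ in range(n_factors))
            for _ in range(n_samples)]

def fraction_within(samples, num_deviations):
    """Share of samples lying within num_deviations standard deviations of the mean."""
    mu, sigma = fmean(samples), stdev(samples)
    inside = sum(abs(x - mu) <= num_deviations * sigma for x in samples)
    return inside / len(samples)

samples = simulate_remainder(20000, 12)
print(fraction_within(samples, 1))  # a normal distribution gives about 0.68
print(fraction_within(samples, 2))  # a normal distribution gives about 0.95
```

Even though no single factor is normally distributed, the fractions come out close to the values expected for a normal distribution. Real remainders are only approximately normal, since real factors are rarely fully independent.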
Outlier detection on normal distributions has been studied from a rigorous mathematical point of view for the last two centuries, and its results are well established today. Here are a few of the techniques: Chauvenet’s criterion, Grubbs’ Test, and the generalized Extreme Studentized Deviate Test.
Now, let’s dig into these techniques. My goal is to describe them briefly; this is by no means a detailed treatment.
Chauvenet’s criterion is probably the most straightforward of the three. Once we have a normal distribution, we can flag as an outlier any sample that lies more than a given number of standard deviations from the mean. This means we need to provide an extra input parameter: the “number of deviations” beyond which we consider a sample an outlier, or equivalently a probability threshold.
Considering the same data set (Monthly Airline Passenger Numbers 1949-1960) we used in our previous post, we can use the remainder component to illustrate Chauvenet’s criterion:
The red dashed lines mark the mean value of the remainder component plus/minus two times its standard deviation. Chauvenet’s criterion flags as outliers any data points above or below the dashed lines, so those 10 outer points are our anomalies.
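The rule described above can be sketched in a few lines of standard-library Python. The function names are hypothetical. `flag_by_deviation` implements the "mean plus/minus a number of deviations" rule as used in this post, while `chauvenet_threshold` shows how the classical form of the criterion derives that number from the sample size n: a sample is rejected when the expected number of samples at least that extreme, n times the two-sided tail probability, falls below 0.5.

```python
from statistics import NormalDist, fmean, stdev

def flag_by_deviation(values, num_deviations=2.0):
    """Indices of samples farther than num_deviations std devs from the mean."""
    mu, sigma = fmean(values), stdev(values)
    return [i for i, x in enumerate(values)
            if abs(x - mu) > num_deviations * sigma]

def chauvenet_threshold(n):
    """Number of deviations implied by the classical criterion for n samples.

    Solves n * P(|Z| > z) = 0.5 for z, i.e. Phi(z) = 1 - 1 / (4 n).
    """
    return NormalDist().inv_cdf(1.0 - 1.0 / (4.0 * n))

remainder = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0, 5.0]
print(flag_by_deviation(remainder, num_deviations=2.0))
print(chauvenet_threshold(len(remainder)))
```

One design note: the mean and standard deviation are computed from all samples, including the outliers themselves, so a large anomaly inflates the very threshold used to detect it.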
Grubbs’ Test is a more resilient test. The main difference from Chauvenet’s criterion is that it employs a significance test to determine whether there is an outlier in the dataset. Because it tests for a single outlier, and the data we are handling may have, and usually has, several anomalies, it is not usable from a practical perspective. However, it establishes the framework for a more sophisticated application, the generalized Extreme Studentized Deviate Test, sometimes referred to as Grubbs’ Test as well.
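For reference, the two-sided form of the test can be written down compactly. The notation is an assumption introduced here, not taken from the original post: N is the number of samples, x̄ and s are the sample mean and standard deviation, α is the significance level, and t is the upper critical value of the t distribution with N − 2 degrees of freedom.

```latex
G = \frac{\max_{i} \lvert x_i - \bar{x} \rvert}{s},
\qquad
\text{reject ``no outlier'' if}\quad
G > \frac{N-1}{\sqrt{N}}
\sqrt{\frac{t^{2}_{\alpha/(2N),\,N-2}}{N-2+t^{2}_{\alpha/(2N),\,N-2}}}
```

The statistic only looks at the single most extreme sample, which is why the test, applied once, answers the question "is there one outlier?" and nothing more.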
Generalized Extreme Studentized Deviate Test
The generalized Extreme Studentized Deviate (ESD) Test starts from the hypothesis that there are at most k outliers. It then applies a Grubbs-type test up to k times: at each step it checks whether the most extreme remaining sample is an outlier, removes it from the dataset, and repeats the process until there are no more outliers. In our working example, these are the results of employing the ESD Test with a confidence level of 0.95 (a significance level of 0.05):
It is easy to see that the ESD Test flags fewer data points as outliers/anomalies: a total of 6 anomalies versus the 10 anomalies flagged by Chauvenet’s criterion.
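The procedure can be sketched as follows. This is a minimal standard-library illustration, not the code behind the results above. It follows the usual formulation of the generalized ESD test: remove the most extreme sample up to k times, compare each step's statistic with its critical value, and report as outliers the first i removed samples, where i is the largest step whose statistic exceeded its critical value. The helper names are hypothetical, and the t quantile is computed numerically only to keep the example dependency-free; in practice a library routine such as `scipy.stats.t.ppf` would be the natural choice.

```python
import math
from statistics import NormalDist, fmean, stdev

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=1000):
    """CDF for x >= 0, via Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * t_pdf(j * h, df)
    return 0.5 + total * h / 3

def t_ppf(p, df):
    """Quantile for 0.5 < p < 1, found by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def generalized_esd(values, max_outliers, alpha=0.05):
    """Return the sorted indices of the samples flagged as outliers."""
    n = len(values)
    if n - max_outliers - 1 < 1:
        raise ValueError("max_outliers is too large for this sample size")
    remaining = list(enumerate(values))
    removed, n_outliers = [], 0
    for i in range(1, max_outliers + 1):
        xs = [v for _, v in remaining]
        mu, sigma = fmean(xs), stdev(xs)
        if sigma == 0:
            break  # all remaining samples are identical
        # Test statistic: the largest studentized deviation still present.
        pos = max(range(len(xs)), key=lambda j: abs(xs[j] - mu))
        r_i = abs(xs[pos] - mu) / sigma
        # Critical value for step i.
        df = n - i - 1
        t = t_ppf(1 - alpha / (2 * (n - i + 1)), df)
        lam = (n - i) * t / math.sqrt((df + t * t) * (n - i + 1))
        removed.append(remaining.pop(pos)[0])
        if r_i > lam:
            n_outliers = i
    return sorted(removed[:n_outliers])

# Synthetic remainder: 30 well-behaved values plus two injected anomalies.
data = [NormalDist().inv_cdf((i + 0.5) / 30) for i in range(30)] + [15.0, -12.0]
print(generalized_esd(data, max_outliers=5))
```

Because the mean and standard deviation are recomputed after every removal, a large anomaly no longer masks a smaller one, and well-behaved samples are not flagged just because they happen to be the most extreme ones left.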
Chauvenet’s criterion, because of the way it is defined, tends to flag the maximum values as outliers, no matter how normal our data sample is. The ESD Test is much more resilient and does not necessarily identify outliers in a given distribution. For this reason, it seems more suitable for general applications where a multitude of different systems coexist, providing a good balance between complexity and results.
In this post, we have shared some brief ideas about anomaly detection techniques. We have highlighted why, when considering multiple heterogeneous monitored applications, statistical analysis based on the remainder component of frequency decomposition is more suitable than multivariate machine learning techniques. Finally, we provided a basic introduction to some of the statistical techniques available for outlier detection and the advantages of the ESD Test over Chauvenet’s criterion.
Stay tuned for my next post, where I will cover how an expert system can be plugged into the anomaly detection flow to provide advanced diagnostic capabilities as well as smart tampering capabilities such as smart blackouts.