When sifting through tons of data, finding the parts that don’t fit a pattern is valuable. The emotion-centered brain wants to see fields of green lights, but businesses survive and thrive by finding the yellows and reds. Value-based data analysis is a game of Where’s Waldo? The faster you spot your striped anomaly, the quicker you can react.
What is an anomaly? An anomaly is an outlier. In regression-line analysis, an anomaly is any data point that lies too far from the line. “Too far” in data science is typically measured in standard deviations. Here, the standard deviation measures the typical distance of a point from the regression line. If the points are widely scattered, the standard deviation is higher, and a point must sit further from the line before it counts as an anomaly. In a tight pattern (such as below), the standard deviation is much smaller, so a point is considered an anomaly at a shorter distance from the line.
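As a minimal sketch of the idea above, the following Python fits a least-squares line with NumPy and flags any point whose distance from the line exceeds k standard deviations. The function name find_anomalies, the cutoff k = 3, and the synthetic data are assumptions made for illustration; they are not taken from any particular product.

```python
import numpy as np

def find_anomalies(x, y, k=3.0):
    """Flag points whose distance from a fitted line exceeds k standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)    # least-squares regression line
    residuals = y - (slope * x + intercept)   # signed distance from the line
    sigma = residuals.std()                   # spread of points around the line
    return np.abs(residuals) > k * sigma

# Synthetic example: a tight linear pattern with one injected outlier.
x = np.arange(50)
y = 2.0 * x + 1.0 + 0.5 * np.sin(x)
y[25] += 50.0                                 # the "striped anomaly"

mask = find_anomalies(x, y)
print(np.flatnonzero(mask))                   # indices flagged as anomalous
```

Because the outlier is part of the data being fitted, it inflates the standard deviation it is measured against. A tighter surrounding pattern or a lower k makes the same point stand out sooner.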
Machine learning applies different models to learn about the data it experiences. Among the most popular of those models are regression-line-based ones. The machine receives the data and adjusts the line, which changes the active standard deviation and, with it, which points are included or excluded as anomalous.
In the example above, if these data are ordered as a time series, the machine would identify and adjust for a wider deviation between x30 and x70. It would then adjust for a narrower deviation between x70 and x100. Why is this important? Between x30 and x70, the points sit further from the line, so the deviation is greater. If you are trying to detect and respond to events impacting your business, and your analysis is based on a regression gathered from x0 through x30, you would discover too many outliers. Too many outliers mean “false positives.”
False positives mean that you are wasting resources disqualifying improperly identified events. Even so, they are preferable to what would happen from x70 through x100. If your analysis were still based on the wider deviation of x30 through x70, the threshold would sit too far from the line, and you would have “false negatives.” False negatives are a cardinal sin of the data world. They are valid anomalies, opportunities for action, that were missed because your data analysis was wrong.
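The two failure modes can be counted directly. The sketch below uses made-up distances from the line with the same three regimes (tight for x0 through x30, wide for x30 through x70, tight again for x70 through x100) and two injected anomalies. All values, names, and the k = 3 cutoff are assumptions chosen to keep the arithmetic easy to follow.

```python
import numpy as np

# Synthetic distances from the regression line (illustrative only):
# tight for x0-x30, wide for x30-x70, tight again for x70-x100.
x = np.arange(100)
spread = np.where((x >= 30) & (x < 70), 5.0, 1.0)
baseline = spread * np.where(x % 2 == 0, 1.0, -1.0)

sigma_tight = baseline[:30].std()     # deviation learned from x0-x30
sigma_wide = baseline[30:70].std()    # deviation learned from x30-x70

# Inject two genuine anomalies, one in each later regime.
observed = baseline.copy()
observed[50] = 20.0
observed[85] = 8.0
truth = np.zeros(100, dtype=bool)
truth[[50, 85]] = True

def count_errors(sigma, k=3.0):
    """Return (false positives, false negatives) for a fixed k-sigma threshold."""
    flags = np.abs(observed) > k * sigma
    false_pos = int(np.sum(flags & ~truth))
    false_neg = int(np.sum(~flags & truth))
    return false_pos, false_neg

print(count_errors(sigma_tight))  # (39, 0): ordinary wide-regime points get flagged
print(count_errors(sigma_wide))   # (0, 1): the anomaly at x85 is missed
```

A threshold frozen on the tight early data floods you with false positives, while one frozen on the wide middle data silently misses the anomaly at x85.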
The benefit of machine learning over human learning? The machine adjusts its expectations of the surrounding data on the fly, as experienced, 24 hours a day, seven days a week, 365 days most years. In other words, the possibility for error is confined to the margin, the brief interval in which the machine experiences and responds to the incoming data.
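One simple way to picture this on-the-fly adjustment is a rolling window: each new point is judged against the standard deviation of the points just before it. This is only a sketch under assumed values (a window of 10 points, a cutoff of k = 3, and synthetic distances from the line). It is not a description of how any specific machine-learning service works.

```python
import numpy as np

def rolling_flags(residuals, window=10, k=3.0):
    """Flag each point against the standard deviation of the preceding window."""
    flags = np.zeros(len(residuals), dtype=bool)
    for i in range(window, len(residuals)):
        sigma = residuals[i - window:i].std()   # deviation "as experienced" so far
        flags[i] = abs(residuals[i]) > k * sigma
    return flags

# Synthetic distances from the line: tight, then wide (x30-x70), then tight again.
x = np.arange(100)
spread = np.where((x >= 30) & (x < 70), 5.0, 1.0)
residuals = spread * np.where(x % 2 == 0, 1.0, -1.0)
residuals[85] = 8.0                             # one injected anomaly in the tight regime

flags = rolling_flags(residuals)
print(np.flatnonzero(flags).tolist())           # [30, 85]
```

The flag at x30 marks the moment the pattern widens, after which the window absorbs the new deviation and stops flagging ordinary points. Once the pattern tightens again, the threshold narrows, and the anomaly at x85 is caught.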
Once you have a plan for identifying anomalies, your next step is to identify your action threshold: the distance from the line at which you want your resources to align and act. Deciding where to act is a value-based analysis that a company must perform to achieve maximum results.
In the outlier image, all the solid dots represent outliers to a tight model. Building a plan around the anomalies you identify is just as important as identifying them. From lightest blue to darkest blue, there are five rings with data. The lightest blue is ring 1, and the rings progress in darkness and number out from the center.
In many businesses, ring 1 might be treated the same as inliers. To clarify, this means that inliers plus the lightest ring of blue receive no special treatment. This could be because the experience is not far enough from the norm to require action, or because the cost of mitigating the least anomalous points would come at the expense of treating more essential outliers. But treating the most anomalous points might also incur too high a price. It might be in your best interest to note anomalies in rings 4 and 5 but take no action, conserving your resources to attend to rings 2 and 3.
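A policy like the one described above can be written as a small lookup. In this sketch the ring boundaries (in standard deviations from the line) and the action labels are hypothetical placeholders. Real boundaries would come from your own value-based analysis.

```python
import numpy as np

# Hypothetical ring boundaries, in standard deviations from the line.
# Below the first edge is an inlier (ring 0); at or beyond the last edge is ring 5.
RING_EDGES = [2.0, 2.5, 3.0, 3.5, 4.0]

# Policy from the text: ignore inliers and ring 1, act on rings 2 and 3,
# note rings 4 and 5 without acting.
POLICY = {0: "ignore", 1: "ignore", 2: "act", 3: "act", 4: "note", 5: "note"}

def triage(distances_in_sigma):
    """Map each distance (in standard deviations) to a (ring, action) pair."""
    rings = np.digitize(np.abs(distances_in_sigma), RING_EDGES)
    return [(int(r), POLICY[int(r)]) for r in rings]

for ring, action in triage([0.4, 2.2, -2.7, 3.1, 3.8, 5.0]):
    print(ring, action)
```

Keeping the edges and the policy in separate tables means either can change without touching the other as costs and priorities shift.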
The luxury of deciding how to act presupposes that you have discovered when you should. The question of when is the power of the anomaly. Want more background? Read my other post and watch our short video.
See edgeCore and AWS SageMaker in action.