Anomaly Detection

“whether a new observation belongs to the same distribution as existing observations (it is an inlier), or should be considered as different (it is an outlier).
=> Often, this ability is used to clean real data sets
Two important distinctions must be made:
novelty detection:
The training data is not polluted by outliers, and we are interested in detecting anomalies in new observations.
outlier detection:
The training data contains outliers, and we need to fit the central mode of the training data, ignoring the deviant observations
  1. 2.
    Medium - good
  2. 3.
  3. 5.
    A survey
  4. 6.
    A great tutorial about AD using 20 algos in a single python package.
  5. 8.
    A comparison of One-class SVM versus Elliptic Envelope versus Isolation Forest versus LOF in sklearn. (The examples below illustrate how the performance of the covariance.EllipticEnvelope degrades as the data is less and less unimodal. The svm.OneClassSVM works better on data with multiple modes and ensemble.IsolationForest andneighbors.LocalOutlierFactor perform well in every cases.)
  6. 9.
    Using Autoencoders - the information is there, but its all over the place.
  7. 10.
    Twitter anomaly -
  8. 11.
    Microsoft anomaly - a well documented black box, i cant find a description of the algorithm, just hints to what they sort of did


  1. 1.
    Alibi Detect is an open source Python library focused on outlier, adversarial and drift detection. The package aims to cover both online and offline detectors for tabular data, text, images and time series. The outlier detection methods should allow the user to identify global, contextual and collective outliers.
  2. 2.
  1. 3.
    SUOD (Scalable Unsupervised Outlier Detection) is an acceleration framework for large-scale unsupervised outlier detector training and prediction. Notably, anomaly detection is often formulated as an unsupervised problem since the ground truth is expensive to acquire. To compensate for the unstable nature of unsupervised algorithms, practitioners often build a large number of models for further combination and analysis, e.g., taking the average or majority vote. However, this poses scalability challenges in high-dimensional, large datasets, especially for proximity-base models operating in Euclidean space.
SUOD is therefore proposed to address the challenge at three complementary levels: random projection (data level), pseudo-supervised approximation (model level), and balanced parallel scheduling (system level). As mentioned, the key focus is to accelerate the training and prediction when a large number of anomaly detectors are presented, while preserving the prediction capacity. Since its inception in Jan 2019, SUOD has been successfully used in various academic researches and industry applications, include PyOD [2] and IQVIA medical claim analysis. It could be especially useful for outlier ensembles that rely on a large number of base estimators.
  1. 1.
  2. 2.
    1. 1.
    2. 2.
  1. 1.


The best resource to explain isolation forest - the basic idea is that for an anomaly (in the example) only 4 partitions are needed, for a regular point in the middle of a distribution, you need many many more.
Isolation Forest -Isolating observations:
  • randomly selecting a feature
  • randomly selecting a split value between the maximum and minimum values of the selected feature.
Recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.
This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.
Random partitioning produces noticeable shorter paths for anomalies.
=> when a forest of random trees collectively produce shorter path lengths for particular samples, they are highly likely to be anomalies.
the paper is pretty good too - In the training stage, iTrees are constructed by recursively partitioning the given training set until instances are isolated or a specific tree height is reached of which results a partial model.
Note that the tree height limit l is automatically set by the sub-sampling size ψ: l = ceiling(log2 ψ), which is approximately the average tree height [7].
The rationale of growing trees up to the average tree height is that we are only interested in data points that have shorter-than average path lengths, as those points are more likely to be anomalies


  • LOF computes a score (called local outlier factor) reflecting the degree of abnormality of the observations.
  • It measures the local density deviation of a given data point with respect to its neighbors. The idea is to detect the samples that have a substantially lower density than their neighbors.
  • In practice the local density is obtained from the k-nearest neighbors.
  • The LOF score of an observation is equal to the ratio of the average local density of his k-nearest neighbors, and its own local density:
    • a normal instance is expected to have a local density similar to that of its neighbors,
    • while abnormal data are expected to have much smaller local density.


  1. 1.
    We assume that the regular data come from a known distribution (e.g. data are Gaussian distributed).
  2. 2.
    From this assumption, we generally try to define the “shape” of the data,
  3. 3.
    And can define outlying observations as observations which stand far enough from the fit shape.


  1. 3.
    It looks like there are two such methods, - The 2nd one: The algorithm obtains a spherical boundary, in feature space, around the data. The volume of this hypersphere is minimized, to minimize the effect of incorporating outliers in the solution.
The resulting hypersphere is characterized by a center and a radius R>0 as distance from the center to (any support vector on) the boundary, of which the volume R2 will be minimized.


For community detection, text clusters, etc.
  1. 1.
    TFIDF, PCA, SILHOUETTE for deciding how many clusters to use, the knee/elbow method.
  2. 3.
    A notebook, using the SuperHeat package, clustering w2v cosine similarity matrix, measuring using silhouette score.