DILOF: Effective and Memory Efficient
Local Outlier Detection in Data Streams

Gyoung S. Na, Donghyun Kim, Hwanjo Yu*

Pohang University of Science and Technology (POSTECH), Pohang, South Korea
{ngs00, kdh5377, hwanjoyu}@postech.ac.kr

Abstract


With rapidly growing demand for detecting outliers in data streams, many studies have aimed to extend the well-known outlier detection algorithm Local Outlier Factor (LOF) to data streams. However, existing LOF-based algorithms for data streams still suffer from two inherent limitations: 1) they require a large amount of memory, and 2) they fail to detect a long sequence of outliers. In this paper, we propose a new outlier detection algorithm for data streams, called DILOF, that effectively overcomes these limitations. To this end, we first develop a novel density-based sampling algorithm to summarize past data and then propose a new strategy for detecting a sequence of outliers. It is worth noting that the proposed algorithms require no prior knowledge of or assumptions about the data distribution. Moreover, we accelerate DILOF by about 15 times by developing a powerful distance approximation technique. Our comprehensive experiments on real-world datasets demonstrate that DILOF significantly outperforms the state-of-the-art competitors in terms of accuracy and execution time. The source code for the proposed algorithm is available at our website: http://di.postech.ac.kr/DILOF.

Code & Dataset

The source code and the datasets used in our experiments are available at http://di.postech.ac.kr/DILOF.


Overview



Figure 1: The behaviour of NDS from time t_0 to t_current for a 2-dimensional data stream.
The proposed algorithm, DILOF, consists of two phases: (1) a detection phase and (2) a summarization phase. We refer to the detection phase as last outlier-aware detection (LOD) and to the summarization phase as nonparametric density summarization (NDS). DILOF detects outliers in data streams with a given maximum window size W.
  1. In LOD, for the current data point p_t, DILOF computes LOF_k(p_t) on the current window of data and determines whether p_t is an outlier. Then p_t is added to the window regardless of the result.
  2. In NDS, if the number of data points in the window reaches W, NDS selects W/4 data points from the oldest W/2 data points such that the density difference between the selected points and the original W/2 points is minimized (Figure 1). The oldest W/2 data points are then removed from memory, and the oldest quarter of the window comes to contain the W/4 points selected by NDS.
  3. The above two phases are repeated. Note that LOD runs on every data insertion, whereas NDS runs only when the number of data points in memory reaches W. A minimal code sketch of this loop is given below.
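The following is a minimal sketch of this two-phase loop, not the authors' released implementation: scikit-learn's LocalOutlierFactor stands in for LOF_k, the outlier threshold of 1.5 is an arbitrary assumption, and random subsampling is only a placeholder for the density-based selection that NDS actually performs.

```python
# Minimal sketch of the LOD/NDS loop (illustrative, not the released code).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def dilof_stream(stream, W=200, k=10, threshold=1.5):
    window = []                           # past data points kept in memory
    for p_t in stream:
        # LOD phase: score p_t against the current window.
        if len(window) > k:
            X = np.vstack(window + [p_t])
            lof = LocalOutlierFactor(n_neighbors=k).fit(X)
            score = -lof.negative_outlier_factor_[-1]   # LOF_k(p_t)
            if score > threshold:                       # assumed threshold
                yield p_t                               # report an outlier
        window.append(p_t)                # insert p_t regardless of the result

        # NDS phase: summarize once the window holds W points.
        if len(window) >= W:
            oldest, recent = window[:W // 2], window[W // 2:]
            # Placeholder: real NDS keeps the W/4 points that minimize the
            # density difference; here we merely subsample at random.
            keep = np.random.choice(len(oldest), W // 4, replace=False)
            window = [oldest[i] for i in sorted(keep)] + recent
```

With this skeleton, the window never holds more than W points, and each summarization shrinks it to 3W/4, matching the description above.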

Experiments


Here, we present the experimental settings that were omitted from the submitted paper owing to space limitations.

We compare the sampling performance of random sampling, DENDIS, and NDS, executing each algorithm 10 times per dataset. Table 4 shows the properties of the datasets used in these experiments.
In the experiments, the number of samples is set to half the number of data points in each dataset. For all datasets, the step size and lambda of NDS are set to 0.3 and 0.001, respectively, and the maximum number of iterations is set to 30. Since the granularity hyper-parameter of DENDIS must be selected according to the properties of each dataset, its value for each dataset is given in Table 5; a rough sketch of how the NDS hyper-parameters drive a gradient-descent loop appears after that table.
Table 4. Properties of the datasets for the sampling experiments

Dataset            # of data points   Dimension   # of classes
UCI Iris           150                4           3
UCI Vowel          990                13          11
UCI Abalone        4,177              8           29
Synthetic-normal   400                2           3
Synthetic-noisy    1,000              2           5
The hyper-parameter settings of DENDIS and NDS for each dataset are given in Table 5.
Table 5. Parameter settings for the sampling experiments

Dataset            Granularity of DENDIS   # of nearest neighbors of NDS
UCI Iris           0.02                    8
UCI Vowel          0.003                   6
UCI Abalone        0.001                   15
Synthetic-normal   0.00501                 6
Synthetic-noisy    0.002                   18
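As a rough illustration of where these hyper-parameters enter, the skeleton below runs a projected gradient descent over relaxed selection variables. The callable grad_fn, the variable names, and the final thresholding are assumptions made for illustration, not the exact NDS formulation from the paper.

```python
# Illustrative skeleton only: shows how the step size, lambda, and the
# iteration cap might drive a gradient-descent sampler like NDS.
import numpy as np

def gradient_sampler(grad_fn, n, n_samples,
                     step_size=0.3, lam=0.001, max_iter=30):
    # Relaxed selection variables in [0, 1], one per data point.
    y = np.full(n, n_samples / n)
    for _ in range(max_iter):
        # grad_fn is a hypothetical callable returning the gradient of
        # the density-difference objective; lam * y is a regularizer.
        g = grad_fn(y) + lam * y
        y = np.clip(y - step_size * g, 0.0, 1.0)
    # Keep the n_samples points with the largest selection scores.
    return np.argsort(y)[-n_samples:]
```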
To compare the sampling performance quantitatively, we use the Earth Mover's Distance (EMD). Table 6 shows the mean EMD, with the standard deviation over the 10 runs in parentheses, for the three sampling algorithms. NDS achieves the lowest EMD on all datasets, and its standard deviation is always zero because NDS involves no random processes. A sketch of an EMD computation follows the table.
Table 6. EMD for the five datasets (standard deviation in parentheses)

Dataset            Random sampling   DENDIS        NDS
UCI Iris           367 (109)         270 (91)      188 (0)
UCI Vowel          5,704 (891)       5,460 (587)   4,434 (0)
UCI Abalone        5,326 (145)       5,028 (367)   4,575 (0)
Synthetic-normal   3,077 (276)       2,426 (250)   1,672 (0)
Synthetic-noisy    5,457 (579)       5,080 (305)   4,450 (0)
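For reference, the snippet below shows one way to compute EMD between a dataset and its sample. The choice of the POT library, uniform point weights, and the Euclidean ground metric are our assumptions; the paper does not specify an implementation.

```python
# Hedged example: EMD between a dataset X and a sample S via POT.
import numpy as np
import ot  # Python Optimal Transport (POT)

def emd(X, S):
    a = np.full(len(X), 1.0 / len(X))      # uniform weights on X
    b = np.full(len(S), 1.0 / len(S))      # uniform weights on S
    M = ot.dist(X, S, metric='euclidean')  # pairwise ground distances
    return ot.emd2(a, b, M)                # optimal-transport cost

# Example: compare a toy dataset against a random half-sample of it.
X = np.random.randn(200, 2)
print(emd(X, X[np.random.choice(200, 100, replace=False)]))
```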
In addition, Table 7 shows the execution time of DENDIS and NDS in milliseconds. In this experiment, NDS is 10 to 35 times faster than DENDIS. A simple harness for such measurements is sketched after the table.
Table 7. Execution time (ms) for the five datasets

Dataset            DENDIS    NDS
UCI Iris           56        5
UCI Vowel          7,032     144
UCI Abalone        268,477   3,250
Synthetic-normal   467       23
Synthetic-noisy    6,250     192
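Wall-clock times of this kind can be collected with a harness such as the one below; sampler is a hypothetical callable standing in for any sampling implementation and is not part of the released code.

```python
# Minimal wall-clock timing harness; `sampler` is a hypothetical
# callable, e.g. a DENDIS or NDS implementation.
import time

def time_ms(sampler, X, n_samples):
    start = time.perf_counter()
    sampler(X, n_samples)
    return (time.perf_counter() - start) * 1000.0  # milliseconds
```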