Local Outlier Detection in Data Streams

With precipitously growing demand to detect outliers in data streams, many studies have been conducted aiming to develop extensions of well-known outlier detection algorithm called Local Outlier Factor (LOF), for data streams. However, existing LOF-based algorithms for data streams still suffer from two inherent limitations: 1) Large amount of memory space is required. 2) A long sequence of outliers is not detected. In this paper, we propose a new outlier detection algorithm for data streams, called DILOF that effectively overcomes the limitations. To this end, we first develop a novel density-based sampling algorithm to summarize past data and then propose a new strategy for detecting a sequence of outliers. It is worth noting that our proposing algorithms do not require any prior knowledge or assumptions on data distribution. Moreover, we accelerate the execution time of DILOF about 15 times by developing a powerful distance approximation technique. Our comprehensive experiments on real-world datasets demonstrate that DILOF significantly outperforms the state-of-the-art competitors in terms of accuracy and execution time. The source code for the proposed algorithm is available at our website: http://di.postech.ac.kr/DILOF.

- The source code and all datasets used in this paper are available here [Github].
- UCI dataset are available at https://archive.ics.uci.edu/ml/index.php.
- The preprocessed KDD cup 99 datasets are available at http://odds.cs.stonybrook.edu.

- In LOD, for current data point p_t , it computes LOF_k(p_t) on the current window of data and determines if p_t is an outlier. Then p_t is added to the window no matter whether it is an outlier or not.
- In NDS, if the number of past data points in window reaches W, NDS selects W/4 data points from the oldest W/2 data points such that their density difference is minimized (Figure 1). Then the oldest W/2 data points are removed in memory, and the oldest W/4 of window becomes to contain the selected data points by NDS.
- The above two phases are repeated. Note that LOD executes on every data insertion while NDS is executed the number of data points in memory reachesW.

We compare sampling performance of random sampling, DENDIS, and NDS with 10 times execution for each dataset.
Table 3 shows the properties of datasets for the experiments.

In the experiments, we set the number of samples to half of the number of data points in datasets. For all datasets, we set step size and lambda of NDS to 0.3 and 0.001, respectively. Maximum number of iteration is set to 30. Since the hyper-parameter of DENDIS called granularity must be selected according to the properties of datasets, the value of granularity for each dataset is shown in Table 4.

The hyper-parameter settings of DENDIS and NDS for each dataset are given in Table 5.

To compare the sampling performance quantitatively, we use Earth Mover's Distance (EMD).
Table 6 shows EMD and its standard deviation of the three sampling algorithms.
In the experiment, NDS shows the lowest EMD for all datasets.
Moreover, the standard deviation is always zero because there are no random processes in NDS.
Since NDS does not contain any random processes, standard deviation of EMD computed by the sampling result of NDS is always zero.

In addition, Table 7 shows the execution time of DENDIS and NDS in milliseconds.
In this experiment, NDS 10~35x faster than DENDIS.

In the experiments, we set the number of samples to half of the number of data points in datasets. For all datasets, we set step size and lambda of NDS to 0.3 and 0.001, respectively. Maximum number of iteration is set to 30. Since the hyper-parameter of DENDIS called granularity must be selected according to the properties of datasets, the value of granularity for each dataset is shown in Table 4.

Dataset | # of data points | dimension | # of classes |
---|---|---|---|

UCI Iris | 150 | 4 | 3 |

UCI Vowel | 990 | 13 | 11 |

UCI Abalone | 4,177 | 8 | 29 |

Synthetic-normal | 400 | 2 | 3 |

Synthetic-noisy | 1,000 | 2 | 5 |

Dataset | granularity of DENDIS | # of nearest neighbors of NDS |
---|---|---|

UCI Iris | 0.02 | 8 |

UCI Vowel | 0.003 | 6 |

UCI Abalone | 0.001 | 15 |

Synthetic-normal | 0.00501 | 6 |

Synthetic-noisy | 0.002 | 18 |

Dataset | Random sampling | DENDIS | NDS |
---|---|---|---|

UCI Iris | 367 (109) | 270 (91) | 188 (0) |

UCI Vowel | 5,704 (891) | 5,460 (587) | 4,434 (0) |

UCI Abalone | 5,326 (145) | 5,028 (367) | 4,575 (0) |

Synthetic-normal | 3,077 (276) | 2,426 (250) | 1,672 (0) |

Synthetic-noisy | 5,457 (579) | 5,080 (305) | 4,450 (0) |

Dataset | DENDIS | NDS |
---|---|---|

UCI Iris | 56 | 5 |

UCI Vowel | 7,032 | 144 |

UCI Abalone | 268,477 | 3,250 |

Synthetic-normal | 467 | 23 |

Synthetic-noisy | 6,250 | 192 |