MemStream
Implementation of
- MemStream: Memory-Based Anomaly Detection in Multi-Aspect Streams with Concept Drift. Siddharth Bhatia, Arjit Jain, Shivin Srivastava, Kenji Kawaguchi, Bryan Hooi
MemStream detects anomalies in a multi-aspect data stream and outputs an anomaly score for each record. It is a memory-augmented feature extractor that allows for quick retraining, gives a theoretical bound on the memory size for effective drift handling, is robust to memory poisoning, and outperforms 11 state-of-the-art streaming anomaly detection baselines.
After an initial training of the feature extractor on a small subset of normal data, MemStream processes each record in two steps: (i) it outputs an anomaly score for the record by querying the memory for the K nearest neighbours of the record's encoding and calculating a discounted distance, and (ii) it updates the memory, in a FIFO manner, if the anomaly score is within the update threshold β.
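The two steps above can be sketched in plain Python. This is a minimal illustration, not the code in memstream.py: the class name `MemorySketch`, the L1 distance, the discount factor `gamma`, the normalisation by the sum of weights, and the `<=` comparison against β are all assumptions made for the example, and the encoder is left out so records are treated as already encoded.

```python
from collections import deque


def l1_distance(a, b):
    """L1 (Manhattan) distance between two equal-length vectors (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


class MemorySketch:
    """Hypothetical FIFO memory with discounted K-nearest-neighbour scoring."""

    def __init__(self, initial_encodings, memlen, beta, k=3, gamma=0.5):
        if not initial_encodings:
            raise ValueError("memory must be seeded with at least one encoding")
        # deque(maxlen=...) drops the oldest entry on append: FIFO replacement.
        self.memory = deque(initial_encodings, maxlen=memlen)
        self.beta = beta
        self.k = k
        self.gamma = gamma

    def score(self, z):
        # Step (i): distances to the K nearest memory entries, closest first.
        dists = sorted(l1_distance(z, m) for m in self.memory)[: self.k]
        # Closer neighbours get larger weights: 1, gamma, gamma^2, ...
        weights = [self.gamma ** i for i in range(len(dists))]
        return sum(w * d for w, d in zip(weights, dists)) / sum(weights)

    def process(self, z):
        s = self.score(z)
        # Step (ii): only records scored as sufficiently normal enter the memory.
        if s <= self.beta:
            self.memory.append(z)
        return s


mem = MemorySketch([[0.0, 0.0], [1.0, 1.0]], memlen=2, beta=1.0, k=2, gamma=0.5)
normal_score = mem.process([0.0, 0.5])     # low score, stored in memory
outlier_score = mem.process([10.0, 10.0])  # high score, memory left unchanged
```

Gating the update on β is what keeps high-scoring records out of the memory, while the FIFO replacement lets the memory follow gradual drift in normal data.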
Demo
- KDDCUP99: Run
python3 memstream.py --dataset KDD --beta 1 --memlen 256
- NSL-KDD: Run
python3 memstream.py --dataset NSL --beta 0.1 --memlen 2048
- UNSW-NB 15: Run
python3 memstream.py --dataset UNSW --beta 0.1 --memlen 2048
- CICIDS-DoS: Run
python3 memstream.py --dataset DOS --beta 0.1 --memlen 2048
- SYN: Run
python3 memstream-syn.py --dataset SYN --beta 1 --memlen 16
- Ionosphere: Run
python3 memstream.py --dataset ionosphere --beta 0.001 --memlen 4
- Cardiotocography: Run
python3 memstream.py --dataset cardio --beta 1 --memlen 64
- Statlog Landsat Satellite: Run
python3 memstream.py --dataset statlog --beta 0.01 --memlen 32
- Satimage-2: Run
python3 memstream.py --dataset satimage-2 --beta 10 --memlen 256
- Mammography: Run
python3 memstream.py --dataset mammography --beta 0.1 --memlen 128
- Pima Indians Diabetes: Run
python3 memstream.py --dataset pima --beta 0.001 --memlen 64
- Covertype: Run
python3 memstream.py --dataset cover --beta 0.0001 --memlen 2048
Command line options
- --dataset: The dataset to be used for training. Choices 'NSL', 'KDD', 'UNSW', 'DOS'. (default: 'NSL')
- --beta: The threshold beta to be used. (default: 0.1)
- --memlen: The size of the Memory Module. (default: 2048)
- --dev: PyTorch device to be used for training, like "cpu", "cuda:0" etc. (default: 'cuda:0')
- --lr: Learning rate. (default: 0.01)
- --epochs: Number of epochs. (default: 5000)
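For readers who want to reuse these options in their own script, the following is a sketch of how they could be declared with argparse. It mirrors the names and defaults listed above but is not copied from memstream.py; `build_parser` is a hypothetical helper, and the dataset choices are deliberately not enforced because the demo commands use names beyond the four listed.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(description="MemStream options (illustrative sketch)")
    parser.add_argument("--dataset", default="NSL", help="dataset to use")
    parser.add_argument("--beta", type=float, default=0.1, help="memory update threshold")
    parser.add_argument("--memlen", type=int, default=2048, help="size of the memory module")
    parser.add_argument("--dev", default="cuda:0", help='PyTorch device, e.g. "cpu" or "cuda:0"')
    parser.add_argument("--lr", type=float, default=0.01, help="learning rate")
    parser.add_argument("--epochs", type=int, default=5000, help="number of training epochs")
    return parser


# Equivalent to the KDDCUP99 demo flags; unspecified options keep their defaults.
args = build_parser().parse_args(["--dataset", "KDD", "--beta", "1", "--memlen", "256"])
defaults = build_parser().parse_args([])
```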
Input file format
MemStream expects the input multi-aspect record stream to be stored in a comma-separated file.
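As an illustration of this format, the sketch below parses comma-separated records into lists of floats using the standard library. It assumes one record per line, purely numeric values, and no header row; none of these details are stated above, and `read_records` is a hypothetical helper rather than part of the repository.

```python
import csv
import io


def read_records(handle):
    """Parse comma-separated rows into lists of floats, skipping blank lines."""
    return [[float(value) for value in row] for row in csv.reader(handle) if row]


# An in-memory stand-in for a data file with two records of three features each.
sample = io.StringIO("0.1,2.0,3.5\n0.4,1.0,0.0\n")
records = read_records(sample)
```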
Datasets
Processed Datasets can be downloaded from here. Please unzip and place the files in the data folder of the repository.
- KDDCUP99
- NSL-KDD
- UNSW-NB 15
- CICIDS-DoS
- Synthetic Dataset (Introduced in paper)
- Ionosphere
- Cardiotocography
- Statlog Landsat Satellite
- Satimage-2
- Mammography
- Pima Indians Diabetes
- Covertype
Environment
This code has been tested on Debian GNU/Linux 9 with a 12GB Nvidia GeForce RTX 2080 Ti GPU, CUDA Version 10.2 and PyTorch 1.5.