Anomalies in data often convey critical information that can be leveraged in a variety of applications. For a military engaged in combat, detecting them can mean identifying threats early and preserving a lethal edge over an adversary. In other, more benign cases, anomalies can corrupt data integrity and render other data analysis techniques ineffective. The statistics and machine learning literature offers several common methods for anomaly detection, including variational autoencoders (VAEs). Using a VAE, we develop a novel objective function that improves its performance in detecting anomalies. Additionally, we introduce a modeling pipeline that works in the fully unsupervised context, where the true proportion of anomalies in the data is unknown. To construct this pipeline, we fit reconstruction errors using a Gaussian mixture model and select the model whose characteristics best match our performance metrics. Using our approach, we observe an increase in anomalies detected relative to a standard objective function, and we measure an average improvement of 0.4021 in F1 score. We show our findings using four labeled benchmark data sets and apply our conclusions to an open-source, unlabeled data set taken from USASpending.gov.
Pages: 81
Format: Kindle Edition
Release: August 19, 2019
Anomaly Detection Using a Variational Autoencoder Neural Network With a Novel Objective Function and Gaussian Mixture Model Selection Technique
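The abstract describes fitting reconstruction errors with a Gaussian mixture model. The sketch below illustrates that general idea only and is not the author's implementation. It assumes a fixed two-component, one-dimensional mixture fitted by expectation-maximization, synthetic reconstruction errors in place of real VAE output, and a simple posterior threshold of 0.5. The function names `fit_two_component_gmm` and `flag_anomalies` are hypothetical, and the model selection step mentioned in the abstract is not reproduced because the listing does not specify it.

```python
import math
import random


def fit_two_component_gmm(errors, n_iter=200, tol=1e-8):
    """Fit a 1-D, two-component Gaussian mixture to the errors with EM.

    Returns (weights, means, variances); component 0 has the smaller mean.
    """
    n = len(errors)
    xs = sorted(errors)
    half = n // 2
    # Initialise the means from the lower and upper halves of the sorted data.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    overall_mean = sum(xs) / n
    overall_var = sum((x - overall_mean) ** 2 for x in xs) / n
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    prev_ll = -math.inf

    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        ll = 0.0
        for x in errors:
            dens = [
                weights[k]
                * math.exp(-0.5 * (x - means[k]) ** 2 / variances[k])
                / math.sqrt(2 * math.pi * variances[k])
                for k in range(2)
            ]
            total = dens[0] + dens[1] + 1e-300  # guard against underflow
            resp.append([d / total for d in dens])
            ll += math.log(total)
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, errors)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, errors)) / nk
            variances[k] = max(var, 1e-9)  # keep the variance strictly positive
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll

    if means[0] > means[1]:  # order components by mean
        weights.reverse()
        means.reverse()
        variances.reverse()
    return weights, means, variances


def flag_anomalies(errors, weights, means, variances, threshold=0.5):
    """Flag points that lie above the low-error mean and whose posterior
    probability of belonging to the high-error component exceeds threshold."""
    flags = []
    for x in errors:
        dens = [
            weights[k]
            * math.exp(-0.5 * (x - means[k]) ** 2 / variances[k])
            / math.sqrt(2 * math.pi * variances[k])
            for k in range(2)
        ]
        posterior = dens[1] / (dens[0] + dens[1] + 1e-300)
        flags.append(x > means[0] and posterior > threshold)
    return flags


# Synthetic stand-in for VAE reconstruction errors (an assumption, not real data):
# 950 "normal" points with small error and 50 "anomalous" points with large error.
random.seed(0)
errors = [random.gauss(1.0, 0.2) for _ in range(950)]
errors += [random.gauss(5.0, 1.0) for _ in range(50)]

weights, means, variances = fit_two_component_gmm(errors)
flags = flag_anomalies(errors, weights, means, variances)
print("estimated anomaly proportion:", round(weights[1], 3))
print("points flagged:", sum(flags))
```

In this sketch the weight of the high-error component serves as an estimate of the anomaly proportion, which is one way a mixture model can be useful when that proportion is not known in advance. A real pipeline would replace the synthetic list with per-record reconstruction errors from a trained VAE.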