
Model Implementation
Classification
For the following classification models, the task is to predict whether a water sample will exceed the maximum contaminant level (MCL). The MCL is the established threshold that determines whether water can be delivered to users of the public water system; in short, exceeding this level would likely result in adverse health impacts for users. Each model is given basic information on when and where the sample was collected and is then asked to predict whether the sample exceeds the MCL.
For the following analysis, adjustments were made to the original data frame to ensure model compatibility, including one-hot encoding of categorical variables.
The data was then split into training and test sets, with the training set containing 80% of the data. The training set was then under-sampled to provide adequate examples of observations that both exceed and do not exceed the MCL.
Snippet of pre-processed data
Snippet of processed training data
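As a rough illustration, the preprocessing described above might look like the following sketch. It assumes a pandas/scikit-learn workflow; the file name, the target column name (exceeds_mcl), and the use of imbalanced-learn's RandomUnderSampler are illustrative assumptions rather than the exact pipeline used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

# Hypothetical input file and target column name, for illustration only
df = pd.read_csv("water_samples.csv")

# One-hot encode categorical features so every model receives numeric inputs
X = pd.get_dummies(df.drop(columns=["exceeds_mcl"]))
y = df["exceeds_mcl"]

# 80/20 train-test split, stratified so the rare exceed-MCL label appears in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, stratify=y, random_state=42
)

# Under-sample the majority (non-exceedance) class in the training set only
X_train_bal, y_train_bal = RandomUnderSampler(random_state=42).fit_resample(
    X_train, y_train
)
```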
To determine the best model for this binary classification problem, three models were created and their performance evaluated against one another: a decision tree, a naive Bayes classifier, and a support vector machine.
Decision Tree
The decision tree model was included due to its high interpretability and strong predictive capability on binary classification problems. For this model the Gini criterion was implemented, tree depth was limited to 5, and the minimum number of samples required to split a node was set to 1% of the training data to maximize model performance while preventing overfitting.
As evidenced by the produced confusion matrix (left) and the performance statistics above, the tree performed quite well at predicting whether a sample will exceed the MCL, exhibiting a recall of 77% on the exceed-MCL label. However, the tree tends to over-predict that an observation will exceed the threshold. Given the context of the problem, this false positive is preferable to the alternative false negative: further tests can be run on water flagged as contaminated, whereas clearing unsafe water as safe could have catastrophic impacts on the public.
Multinomial Naive Bayes
The naive Bayes model was included due to its simplicity and high interpretability, which lend themselves well to classification tasks. A multinomial naive Bayes model was implemented to best handle the count and frequency data.
As evidenced by the produced confusion matrix (right) and the performance statistics above, the multinomial naive Bayes model also performed adequately in the context of MCL prediction. However, the model over-predicts that an observation will exceed the threshold, and it mislabels roughly one quarter of the observations that do exceed the MCL. These inaccuracies are likely due to the model's assumption of independence between features.
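A minimal sketch of how the two classifiers above might be configured with the stated settings (Gini criterion, depth of 5, 1% minimum split for the tree; a multinomial naive Bayes for the count data), assuming scikit-learn and the X_train_bal / y_train_bal variables from the preprocessing sketch:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

# Decision tree: Gini criterion, depth capped at 5, minimum split of 1% of samples
tree = DecisionTreeClassifier(
    criterion="gini", max_depth=5, min_samples_split=0.01, random_state=42
)
tree.fit(X_train_bal, y_train_bal)

# Multinomial naive Bayes for the count/frequency features
nb = MultinomialNB()
nb.fit(X_train_bal, y_train_bal)

# Evaluate both models on the held-out test set
for name, model in [("decision tree", tree), ("multinomial NB", nb)]:
    pred = model.predict(X_test)
    print(name)
    print(confusion_matrix(y_test, pred))
    print(classification_report(y_test, pred))
```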
Support Vector Machine
The support vector machine was included due to its powerful and flexible approach to separating data for classification. For this model an RBF kernel was implemented, as the data should not be assumed to be linearly separable; it is, however, assumed to be separable in a higher-dimensional space. Furthermore, a gamma parameter of 1 was implemented to maximize model performance while reducing the risk of overfitting.
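A minimal sketch of the RBF SVM configuration described above (RBF kernel, gamma of 1), again assuming scikit-learn and the variables from the earlier sketches:

```python
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# RBF kernel since the data is not assumed to be linearly separable; gamma fixed at 1
svm = SVC(kernel="rbf", gamma=1, random_state=42)
svm.fit(X_train_bal, y_train_bal)

# Evaluate on the held-out test set
print(classification_report(y_test, svm.predict(X_test)))
```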
As evidenced by the above performance statistics, the RBF SVM performed very well on these metrics, boasting an accuracy of 90%. However, the model's recall on the exceed-MCL label is only 72%. As with the other models, the RBF SVM also exhibits a high false-positive rate.
Classification Model Comparison
In addition to assessing the confusion matrices and performance statistics for each of the three models, the following ROC-AUC and precision-recall (PR) comparison curves were also created.
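Comparison curves of this kind could be produced along the following lines; this is a sketch that assumes scikit-learn's display utilities (version 1.0 or later) and the fitted tree, nb, and svm models from the earlier sketches, not the exact plotting code used.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay

fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(12, 5))

# Overlay ROC and PR curves for all three classifiers on shared axes
for name, model in [("Decision tree", tree), ("Multinomial NB", nb), ("RBF SVM", svm)]:
    RocCurveDisplay.from_estimator(model, X_test, y_test, name=name, ax=ax_roc)
    PrecisionRecallDisplay.from_estimator(model, X_test, y_test, name=name, ax=ax_pr)

ax_roc.set_title("ROC curves")
ax_pr.set_title("Precision-recall curves")
plt.show()
```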
It is clear from nearly all of the metrics that the RBF support vector machine does the best job of predicting whether an observation will exceed the MCL. This model boasts the highest AUC (0.89), average precision (0.19), accuracy (0.90), and F1 scores (0.95, 0.20). Given this performance, it will likely be the model implemented. However, given the context of the problem and the high consequence of false negatives, the decision tree model should also be considered, as it has the highest recall (0.77) for the exceed-MCL category.