Data Exploration
Overview
This project investigates the presence and distribution of per- and polyfluoroalkyl substances (PFAS) in drinking water sources. PFAS are synthetic chemicals widely used in industrial and consumer products that have been linked to environmental and health risks. For the second milestone, we collected, cleaned, and explored two datasets: water quality data from the Environmental Protection Agency (EPA) and poverty data from the US Census website. Each dataset was collected programmatically and then assessed and cleaned for missing values, outliers, and other irregularities. The datasets were then integrated, visualized, and examined for significant patterns and relationships. The cleaned and merged data help us better answer our research questions about PFAS contamination across regions and income levels.
Table 3: Part of the raw Census poverty data (1 of 2)
Data Collection
Water Quality
The water quality dataset examined here was obtained from the US Environmental Protection Agency (EPA). The Unregulated Contaminant Monitoring Rule (UCMR) is how the EPA collects data on contaminants that might be in drinking water but do not have regulatory standards under the Safe Drinking Water Act or the National Primary Drinking Water Regulations. The UCMR program was developed to track these contaminants every five years. For some PFAS, however, the EPA has since set maximum contaminant levels (MCLs) in drinking water: 4.0 ppt for PFOA and PFOS, and 10 ppt for PFHxS and PFNA. Multiple files contain water quality data points for PFAS contamination across the United States over the years. We used only UCMR 5, which contains PFAS data for 2023-2025, and UCMR 3, which contains PFAS data for 2013-2015. UCMR 4 contains data on other contaminants (various carcinogens, heavy metals, etc.) but not PFAS. The raw data include many fields: PWS (Public Water System) names, IDs, and sizes; facility names, IDs, and water types; and sample types, sample collection dates, contaminants, units, methods, result values, state, region, and more (Tables 1 and 2).
The data was collected through a script that downloads the zip file from a given URL. This dataset is relevant to our research questions since it contains the PFAS levels sampled in water sources across the country. We are interested in determining how PFAS contamination varies with other parameters such as geography, time, poverty, state boundaries, and water sources. This will allow us to further analyze contamination patterns and identify at-risk areas and populations.
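The download step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual script; the function names and the output directory are hypothetical, and the URL of the EPA zip file must be supplied by the caller.

```python
import io
import urllib.request
import zipfile
from pathlib import Path


def extract_zip_bytes(payload: bytes, out_dir: str) -> list[str]:
    """Unpack an in-memory zip archive into out_dir and return the member names."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


def download_and_extract(url: str, out_dir: str) -> list[str]:
    """Fetch a zip file over HTTP and unpack it (performs a network request)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        return extract_zip_bytes(response.read(), out_dir)
```

Keeping the extraction separate from the network call makes the unpacking logic testable without contacting the EPA server.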
Table 4: Part of the raw Census poverty data (2 of 2)
Poverty
The poverty dataset was obtained from the US Census website, which provides Small Area Income and Poverty Estimates (SAIPE) for states. The estimates are intended to support the administration of federal programs. The raw data include fields such as state names, median income, child poverty counts, child poverty rates, overall poverty counts and rates, and the corresponding years (Tables 3 and 4).
The poverty data was collected through an API call to census.gov. The code for the API implementation is given below. This dataset is relevant to our research questions since we are interested in determining whether PFAS contamination correlates with poverty levels and with region. It also provides insight into population and density features and their relationship with poverty levels, which may shed further light on PFAS risk.
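A request of this kind might look like the sketch below. It is an illustration under stated assumptions, not the project's implementation: the endpoint path, the `time` query parameter, and the optional `key` parameter reflect our understanding of the Census Data API and should be checked against the official documentation. Only the variable name SAEMHI_PT appears in this report; any other variable names passed in are the caller's responsibility.

```python
import json
import urllib.parse
import urllib.request

# Assumed SAIPE time-series endpoint of the Census Data API.
SAIPE_URL = "https://api.census.gov/data/timeseries/poverty/saipe"


def build_saipe_url(variables, year, api_key=None):
    """Build a state-level SAIPE query URL for one year."""
    params = {"get": ",".join(variables), "for": "state:*", "time": str(year)}
    if api_key:
        params["key"] = api_key
    # Keep commas, colons, and the wildcard readable in the query string.
    return SAIPE_URL + "?" + urllib.parse.urlencode(params, safe=",:*")


def rows_to_records(payload):
    """Convert a list-of-lists response (first row = column names) to dicts."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]


def fetch_saipe(variables, year, api_key=None):
    """Run the query (performs a network request) and return one dict per state."""
    with urllib.request.urlopen(build_saipe_url(variables, year, api_key)) as resp:
        return rows_to_records(json.load(resp))
```

For example, `fetch_saipe(["NAME", "SAEMHI_PT"], 2023)` would request state names and median household income for 2023, with all values returned as strings that still need numeric conversion.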
Table 1: Part of the EPA raw data (1 of 2)
Table 2: Part of the EPA raw data (2 of 2)
Once both sets of data files were acquired, they were uploaded to GitHub so that subsequent preprocessing and visualization scripts could read directly from GitHub rather than from each collaborator’s local machine.
PWS County Data:
The poverty and EPA data only had geographic information at the state level, in addition to the PWS names. We wanted finer resolution, so we identified the county of each PWS. We used Google’s Places API to look up county names from the PWS names and exported the results as a CSV file.
Data Cleaning / Preprocessing
Preprocessing consisted of getting familiar with the data, removing redundant and unhelpful information, and merging the datasets.
Water Quality
Once the raw water data files were loaded, they were easily combined since they all share the same columns. We checked for NAs and found quite a few in certain columns (Table 5). FacilityID and FacilityName both had some NAs; since they refer to the same facility, any missing FacilityName was filled with its corresponding FacilityID where one existed. FacilityID was then removed to reduce the dataframe size. AssociatedFacilityID and AssociatedSamplePointID were entirely null, as stated in the technical documentation provided by the EPA, so these columns were deleted because they offer no insight for our project. MRL also had NAs because certain contaminants do not have an MRL (minimum reporting level). The MRL has no health implications; it is simply the lowest value that labs can report. To make it clear that no MRL has been set for these contaminants, the NAs were replaced with -1, an obviously invalid value: any visualization or analysis that returns a negative value signals a contaminant without a set MRL. The AnalyticalResultValue column holds the actual measured concentration. Per the EPA technical documentation, an NA in AnalyticalResultValue means the measurement was below the MRL. These NAs were replaced with 0, since a concentration below the minimum value labs need to report is functionally zero. Finally, UCMR1SampleType had many NAs, but the column was deleted because it is not needed for our analysis.
Columns that are redundant (such as PWSID) or not useful for our analysis (e.g., MethodID or UCMR1SampleType) were deleted. In case we need the PWSID again, we created a dictionary linking the PWS names to their IDs. We now have a clear overview of our data (Table 6). There are 16,507 different PWSs, the most frequent being Suffolk County Water Authority in New York. Most PWSs are large, and the most common water type is GW (groundwater). The most frequently occurring contaminant is PFHpA, which is a PFAS. The overview reports 65 unique states because tribal territories are included; instead of a state, they carry their EPA region designation (01, 02, etc.). Note that no NAs remain.
We then inspected the unique values of each column and found that some of the contaminants are not PFAS. They are also unregulated, which is why they are in the dataset, but they include heavy metals, other carcinogens, etc. Since this project focuses on PFAS, we filtered the dataframe to keep only PFAS contaminants; the others are important but beyond the scope of this study. Because we want to know how PFAS contamination behaves over time, we converted CollectionDate to a datetime format and added Year and Month columns for easier retrieval and analysis later.
We also determined which samples had a measured value above, or at or below, the MRL (minimum reporting level), the minimum level labs are required to report to the EPA. While the MRL has no health implications, it is still useful to know which contaminants exceed it. We counted the number of values above the MRL and the number at or below it. For values above the MRL, the relative contamination level was calculated as the recorded value divided by the MRL. Originally, any contaminant with a measured level at or below the MRL had NA as its relative contamination level. We decided to fill those NAs with zeros, since the levels were so low that they are functionally zero.
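The NA handling described above can be sketched with pandas as follows. The column names are those mentioned in the text; the function name is hypothetical, and the sketch assumes the combined EPA data is already loaded into a single dataframe.

```python
import pandas as pd


def clean_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the NA rules described in the text to the combined EPA dataframe."""
    out = df.copy()
    # Fall back to the facility ID when the facility name is missing.
    out["FacilityName"] = out["FacilityName"].fillna(out["FacilityID"])
    # Drop the now-redundant ID, the all-null columns, and the unused sample type.
    out = out.drop(columns=["FacilityID", "AssociatedFacilityID",
                            "AssociatedSamplePointID", "UCMR1SampleType"])
    # -1 flags contaminants that have no minimum reporting level.
    out["MRL"] = out["MRL"].fillna(-1)
    # A missing result means the measurement was below the MRL.
    out["AnalyticalResultValue"] = out["AnalyticalResultValue"].fillna(0)
    return out
```

Calling `df.isna().sum()` before and after this step is a quick way to confirm which columns contained NAs and that none remain.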
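A minimal sketch of the PFAS filter and date handling is shown below. The set of contaminant names is an illustrative subset taken from compounds named in this report, not the full list used in the project, and the column name Contaminant is an assumption about the EPA file layout.

```python
import pandas as pd

# Illustrative subset only; the full PFAS list must come from the EPA documentation.
PFAS_NAMES = {"PFOA", "PFOS", "PFHxS", "PFNA", "PFHpA"}


def keep_pfas_with_dates(df: pd.DataFrame, pfas_names=PFAS_NAMES) -> pd.DataFrame:
    """Keep only PFAS rows and add Year and Month columns from CollectionDate."""
    out = df[df["Contaminant"].isin(pfas_names)].copy()
    out["CollectionDate"] = pd.to_datetime(out["CollectionDate"])
    out["Year"] = out["CollectionDate"].dt.year
    out["Month"] = out["CollectionDate"].dt.month
    return out
```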
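The relative contamination calculation can be sketched as below. The output column names AboveMRL and RelativeContamination are hypothetical, and treating an MRL of -1 (no MRL set) as "not above the MRL" is our assumption; the text does not say how such rows were handled.

```python
import pandas as pd


def add_relative_contamination(df: pd.DataFrame) -> pd.DataFrame:
    """Flag samples above the MRL and express them as multiples of the MRL."""
    out = df.copy()
    has_mrl = out["MRL"] > 0  # -1 marks contaminants without an MRL
    out["AboveMRL"] = has_mrl & (out["AnalyticalResultValue"] > out["MRL"])
    # Divide only where an MRL exists; other rows become NaN and are zeroed below.
    ratio = out["AnalyticalResultValue"] / out["MRL"].where(has_mrl)
    out["RelativeContamination"] = ratio.where(out["AboveMRL"], 0.0)
    return out
```

The counts of samples above versus at or below the MRL then follow from `out["AboveMRL"].value_counts()`.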
Table 7: Cleaned Poverty Data
Counties
Any rows for which the PWS name did not return a county name were removed. This means that some PWSs are not included in the county-level analysis of PFAS contamination, but the dataset is large enough that removing a few rows should not affect the final analysis. After cleaning, the data was exported to a CSV file and uploaded to GitHub so that it can be pulled for future analysis (Table 8).
Table 5: NAs in the original dataset
Table 6: Data Overview (1 of 2)
Table 6: Data Overview (2 of 2)
Poverty
Since the poverty data is a single dataset, we started cleaning immediately. We assessed missing values, duplicates, and outliers, renamed columns to be more interpretable, and kept only the years relevant to the EPA data. We also assessed the data for completeness, consistency, and usability. The original dataset uses its own naming convention for the columns (Tables 3 and 4). We renamed the columns to be more readable and consistent; for example, SAEMHI_PT was renamed to Median_Income. Since we plan to merge the two datasets by state abbreviation, the redundant columns NAME (full state name) and STATE (numeric state code) were removed.
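The renaming and column removal can be sketched as follows. Only the SAEMHI_PT to Median_Income mapping is taken from the text; the second mapping (SAEPOVRTALL_PT to Poverty_Rate) and the Year column name are assumptions added for illustration.

```python
import pandas as pd

RENAMES = {
    "SAEMHI_PT": "Median_Income",      # from the text
    "SAEPOVRTALL_PT": "Poverty_Rate",  # assumed mapping, for illustration
}


def clean_poverty(df: pd.DataFrame, years) -> pd.DataFrame:
    """Rename columns, drop redundant state identifiers, keep relevant years."""
    out = df.rename(columns=RENAMES).drop(columns=["NAME", "STATE"])
    out = out[out["Year"].isin(years)].drop_duplicates()
    return out.reset_index(drop=True)
```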
Table 8: Cleaned County Data
Combined Dataset
As mentioned above, the poverty and water quality datasets need to match by state, but they also need to match by year, since both vary over time. After merging, some samples show NAs for their poverty data. These samples come from tribal lands, which do not belong to a state and are designated by their EPA region instead. Any state-level analysis will therefore exclude tribal data, but analysis based on EPA region will include it. The PWS county data was pulled from GitHub and merged with the EPA and poverty data (Table 9). The merged dataframe was then split into 12 smaller dataframes for upload to GitHub due to file size restrictions.
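A merge on state and year, followed by a split into smaller pieces, might look like the sketch below. The key column names State and Year and the function name are assumptions. A left join keeps every water sample, so rows without a matching state (such as tribal areas identified by EPA region) appear with NAs in the poverty columns.

```python
import pandas as pd


def merge_and_split(water: pd.DataFrame, poverty: pd.DataFrame, n_parts: int = 12):
    """Left-join poverty data onto water samples and split the result into chunks."""
    merged = water.merge(poverty, on=["State", "Year"], how="left", indicator=True)
    # Rows with no poverty match, e.g. samples labelled by EPA region.
    unmatched = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
    merged = merged.drop(columns="_merge")
    # Ceiling division gives the chunk size; this yields at most n_parts chunks.
    size = max(1, -(-len(merged) // n_parts))
    parts = [merged.iloc[i:i + size] for i in range(0, len(merged), size)]
    return merged, unmatched, parts
```

Each element of `parts` can then be written out with `to_csv` and uploaded separately.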
Table 9: Combined dataset from EPA, Census, and PWS County data
Visualizations
Poverty
Based on the correlation table and the heat map, there are relatively strong negative correlations (stronger than -0.70) between Child_Poverty_Rate and Median_Income, and between Poverty_Rate and Median_Income. This makes sense, since it suggests that higher-income areas have lower poverty rates. There are relatively strong positive correlations (above 0.70) between Child_Poverty_Count and All_Poverty_Count, Child_Poverty_Count and All_Child_Poverty_Count, Poverty_Rate and Child_Poverty_Rate, Poverty_Count and Child_Poverty_Count, and All_Child_Poverty_Count and All_Poverty_Count. These correlations also match economic expectations, since they suggest that child poverty and general poverty rise and fall together.
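Listing the variable pairs whose correlation passes the 0.70 threshold can be sketched as follows. The function name is hypothetical, and the sketch uses the Pearson correlation that pandas computes by default; the report does not state which correlation method was used.

```python
import pandas as pd


def strong_pairs(df: pd.DataFrame, threshold: float = 0.70):
    """Return (column_a, column_b, r) for every pair with |r| >= threshold."""
    corr = df.corr(numeric_only=True)
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, round(float(r), 2)))
    return pairs
```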
Figure 2: Combined dataset from EPA, Census, and PWS County data
Figure 1: Heat Map of Poverty Data
The mean poverty rate varies across states, with Nevada having the highest and North Carolina the lowest. Most other states have a poverty rate around 13.0, but there is no clear pattern between state and poverty rate.
Figure 3: Diagnostic Plots for Poverty Rates
From the Q-Q plots, we can see that Child_Poverty_Rate and Poverty_Rate are approximately normally distributed, but both deviate in the left tail, indicating a slight right skew in both features.
EPA Data
For our main analysis we focused on data from 2023, since the most poverty and EPA data were available for this year. Figure 4 gives a rough overview of the samples in this timeframe by facility water type. Most samples were taken from groundwater (left chart). While the split is similar for PFAS-contaminated samples (right chart), the higher share of surface water among them indicates that surface water samples have a higher PFAS contamination ratio than groundwater samples (share of samples with any detected amount of PFAS, bottom chart).
Figure 4: PFAS Contaminated by Facility Water Type
In the following we distinguish between two major types of contamination levels:
MRL, or the minimum reporting level, is the lowest concentration of contaminant that can be reported to the EPA for UCMR substances. The MRL is essentially the reporting threshold, and does not have health implications.
MCL is the maximum contaminant level of a contaminant allowed in drinking water and does have health implications. These are legally enforceable levels, and anything above is considered illegal and unsafe. Samples can have concentrations above the MRL but still be considered to be within the legal limit per its MCL.
For a more meaningful analysis, we grouped our data by US state (excluding US territories). Figure 5 displays the percentage of samples exceeding the MRL and the percentage exceeding the MCL for each state. The share of samples exceeding the MCL is always less than or equal to the share exceeding the MRL. The states with the most samples exceeding the MRL (and MCL) are Delaware with 9.6% (7.7%), New Jersey with 6.4% (5.2%), and Florida with 6.0% (4.9%).
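The per-state exceedance percentages can be sketched as below. The MCL values are those quoted earlier in this report (4.0 ppt for PFOA and PFOS, 10 ppt for PFHxS and PFNA). The column names Value_ppt and MRL_ppt are hypothetical and assume that results and reporting levels have already been converted to ppt; contaminants without an MCL never count as exceeding it.

```python
import pandas as pd

# MCLs in ppt as quoted in the text; other PFAS have no entry here.
MCL_PPT = {"PFOA": 4.0, "PFOS": 4.0, "PFHxS": 10.0, "PFNA": 10.0}


def exceedance_by_state(df: pd.DataFrame) -> pd.DataFrame:
    """Percentage of samples per state above the MRL and above the MCL."""
    out = df.copy()
    out["Above_MRL"] = out["Value_ppt"] > out["MRL_ppt"]
    # Comparing against a missing MCL (NaN) evaluates to False.
    out["Above_MCL"] = out["Value_ppt"] > out["Contaminant"].map(MCL_PPT)
    return out.groupby("State")[["Above_MRL", "Above_MCL"]].mean().mul(100)
```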
Figure 6 compares the MRL (left) and MCL (right) sample distributions in more detail. For a comprehensive analysis we display strip plots, violin plots and bar plots. The data shows only a few outliers and appears to be normally or exponentially distributed. The distributions of MRL and MCL exceedance per state are quite similar, with a mean of approximately 2% of all samples being contaminated beyond the thresholds. Since MCL provides a more direct measure of potential health impacts, our further analysis and correlations will focus primarily on this parameter.
Figure 5: PFAS Contaminated by State
Figure 6: Distributions of Water Samples exceeding MRL (and MCL) by state
While the EPA data aggregated by state is easy to interpret, the data points were not granular enough for a further correlation analysis with the poverty data. Our data set was lacking any other location data besides the state information. We used a script to first fetch geo-coordinates of the water facilities the samples were taken at and then converted these into county names. The final data set resulted in aggregated data for approximately 1500 counties scattered across the US (including poverty and EPA data).
Figure 7 (below) shows histograms and corresponding Q-Q plots for the aggregated county data. To better analyze the distributions, we transformed them to log scale. The top row shows the distribution of the mean relative MCL contamination (i.e., by how many multiples of the MCL threshold a typical sample is contaminated), and the bottom row shows the ratio of contaminated samples (defined as samples exceeding the MCL threshold). Comparing the data against both normal and exponential distributions, we conclude that an exponential fit describes our data much better than a normal one, albeit with spread-out tails, as would be expected.
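One way to compare the two candidate distributions numerically is to compute the Q-Q correlation coefficient for each, as sketched below with `scipy.stats.probplot`. This is an illustration rather than the procedure used for Figure 7: the function name is hypothetical, and no log transform is applied here.

```python
import numpy as np
from scipy import stats


def qq_fit_quality(values) -> dict:
    """Return the Q-Q plot correlation coefficient r for normal and exponential fits."""
    x = np.asarray(values, dtype=float)
    x = x[np.isfinite(x)]
    result = {}
    for dist in ("norm", "expon"):
        # probplot returns ((theoretical, ordered), (slope, intercept, r)).
        (_, _), (_, _, r) = stats.probplot(x, dist=dist)
        result[dist] = float(r)
    return result
```

A value of r closer to 1 means the ordered data lie closer to a straight line against that distribution's theoretical quantiles.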
Figure 7: Log-Histograms and corresponding q-q-plots for normal and exponential distributions for MCL County data
Figure 8: PFAS Contaminated Samples Exceeding MCL per County (2023)
Figure 8 (above) shows the county data plotted across a map of the US (excluding Hawaii and Alaska). Each circle corresponds to sample data aggregated per county. The color scale and circle size indicate the percentage of samples exceeding MCL thresholds for each county. Counties with water samples but without any exceedance of MRL thresholds are plotted as empty circles. A KDE heatmap is overlaid for better clarity.
We can distinguish three larger hotspots with particularly high PFAS water contamination, centered around New Jersey, Fort Worth / Dallas TX, and the region stretching from North Carolina to Georgia. The regions in the mid-west and west, including California, show a remarkably less pronounced water contamination. A more detailed analysis and causal relationships will be explored in our final report.
Integrated Data
Figure 9 shows a correlation heatmap between the EPA and poverty data per county. To remove excessive noise that would dilute meaningful patterns, we only included counties with an MCL exceedance of more than 10%. This higher threshold ensures that the analysis reflects communities actually impacted by contamination. For Median_Income and Poverty_Rate versus Ratio_MCL (highlighted by the black box), we detect modest correlations of 0.3 and -0.25, respectively. Ratio_MCL (the percentage of samples exceeding the MCL threshold) appears to be the best-correlated feature from the EPA dataset.
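The filtering step before the correlation can be sketched as follows. The function name is hypothetical, and the sketch assumes Ratio_MCL is stored as a percentage in a per-county dataframe alongside Median_Income and Poverty_Rate.

```python
import pandas as pd


def socio_vs_contamination(df: pd.DataFrame, min_ratio: float = 10.0) -> pd.Series:
    """Correlate socioeconomic features with Ratio_MCL for impacted counties only."""
    impacted = df[df["Ratio_MCL"] > min_ratio]
    corr = impacted[["Median_Income", "Poverty_Rate", "Ratio_MCL"]].corr()
    # Keep only the correlations against Ratio_MCL itself.
    return corr["Ratio_MCL"].drop("Ratio_MCL")
```

Because the correlation is computed on the filtered subset only, its sign and size can differ from what the full set of counties would show.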
Figure 10: Correlation between Contamination and Socioeconomic Factors
Figure 9: Correlation heatmap of EPA and Poverty data for counties exceeding MCL by > 10%
Figure 10 (left) presents mean relative contamination (in units of MCL) at the top and the percentage of samples exceeding the MCL threshold at the bottom. While the data exhibits high variance and a relatively low R^2, two key trends emerge: counties with higher median incomes tend to have fewer samples exceeding the MCL threshold and lower mean contamination levels. Conversely, counties with higher poverty rates show a greater incidence of contaminated samples and higher mean water contamination.