Copyright (c) 2020 Geosyntec Consultants, Inc. Mozilla Public License Version 2.0
This software is provided “as is”, without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.
This notebook is the second step in the watershed regression task. It uses the relationships from Part 1 (linear models) to run generalized linear mixed models with censored dependent variables. This approach uses Markov chain Monte Carlo techniques and relies heavily on the MCMCglmm package (Hadfield 2009).
Here are the steps in the code below:
clear workspace and load libraries. (code not shown in html file )
The working directory was changed to C:/Users/cnilsen/Documents/repos/stormwaterheatmap/R-scripts/WatershedRegression inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
The stormwater outfall data is available from the Department of Ecology at https://data.wa.gov/Natural-Resources-Environment/Municipal-Stormwater-Permit-Outfall-Data/d958-q2ci.
A .csv file is saved in WatershedRegression/data/S8_data.csv
all.S8.data <- read_csv("data/S8_data.csv",
col_types = cols(field_collection_end_date = col_date(format = "%m/%d/%Y"),
field_collection_start_date = col_date(format = "%m/%d/%Y")) )Missing column names filled in: 'X1' [1]10838 parsing failures.
row col expected actual file
2384 result_basis 1/0/T/F/TRUE/FALSE Dry 'data/S8_data.csv'
2385 result_basis 1/0/T/F/TRUE/FALSE Dry 'data/S8_data.csv'
2386 result_basis 1/0/T/F/TRUE/FALSE Dry 'data/S8_data.csv'
2387 result_basis 1/0/T/F/TRUE/FALSE Dry 'data/S8_data.csv'
2388 result_basis 1/0/T/F/TRUE/FALSE Dry 'data/S8_data.csv'
.... ............ .................. ...... ..................
See problems(...) for more details.
#filter out rejected data
all.S8.data <- (filter(all.S8.data,!result_data_qualifier %in% 'REJ'))
#filter out replicates
all.S8.data <- (filter(all.S8.data,!sample_replicate_flag %in% 'Y'))
#change nondetect warnings to detects
warnings <- all.S8.data$nondetect_flag == "WARNING"
all.S8.data$nondetect_flag[warnings] <- FALSE
#Change NA to detect
all.S8.data$nondetect_flag[is.na(all.S8.data$nondetect_flag)] <- FALSE
#Change season to factor
all.S8.data$season <- as.factor(all.S8.data$season)The chunk below makes a list of parameters we are interested in.
#Select Parameters
params <- c('Zinc - Water - Total',
'Copper - Water - Total',
'Nitrite-Nitrate - Water - Dissolved',
'Lead - Water - Total',
'Total Phosphorus - Water - Total',
'Total Suspended Solids - Water - Total',
'Total Phthalate - Water - Total',
'Total PAH - Water - Total',
#'Chrysene - Water - Total',
'CPAH - Water - Total',
'HPAH - Water - Total'
#'Total Kjeldahl Nitrogen - Water - Total',
#'Total PCB - Water - Total'
)
#save a list of all the parameters in case we want to use mor.
params.all <- data.frame(unique(all.S8.data$parameter))
s8data <- all.S8.data
#Clean up extracted data and rename columns:
s8data <- all.S8.data %>%
dplyr::select(
study_name,
location_id,parameter,
type,
season,
new_result_value,
nondetect_flag,
study_id,
access_id,
field_collection_end_date,
field_collection_start_date,
type)
#rename some columns
colnames(s8data)[colnames(s8data) == "location_id"] <- "Location"
colnames(s8data)[colnames(s8data) == "new_result_value"] <-
"concentration"
s8data$nondetect_flag <- as.logical(s8data$nondetect_flag)
s8data$concentration <- as.numeric(s8data$concentration)Set up a jitter plot of all the data to look for outliers:
scatter_cocs <- function(df.coc,title) {
p <- ggplot(df.coc, aes(1, concentration)) + geom_jitter() + labs(
title = title,
subtitle = "Data collected 2009-2013",
caption =
" Data source: Ecology, 2015",
x = "Observations"
#y = y.title
)
p + facet_wrap( ~ parameter, ncol=3, scales = 'free')
}
scatter_cocs(s8data[which(s8data$parameter %in% params),],'All Observations')outliers are apprent for TSS, TP, and no2-no3. Remove these.
#remove and replot
outlierParams <- c("Total Suspended Solids - Water - Total", "Total Phosphorus - Water - Total", "Nitrite-Nitrate - Water - Dissolved")
#This removes the highest values
outlierVals <-
top_n(group_by(s8data[which(s8data$parameter %in% outlierParams), ], parameter), 1, concentration)$concentration
s8data <- s8data %>%
group_by(parameter) %>%
slice(which(!(
parameter %in% outlierParams & concentration %in% outlierVals
)))
scatter_cocs(s8data[which(s8data$parameter %in% params),],'All Observations - Outliers Removed')Looks better. Move on to the next Chunk.
#Spatial predcitors have been extracted and saved as a csv file.
spatial_data <- read_csv("C:/Users/cnilsen/Google Drive/repos/spatialPredictors.csv")Parsed with column specification:
cols(
`system:index` = col_character(),
COM = col_double(),
Location = col_character(),
RES = col_double(),
depSplusN = col_double(),
impervious = col_double(),
logPopulation = col_double(),
nighttime_lights = col_double(),
pm25 = col_double(),
rev_logTraffic = col_double(),
roadDensity = col_double(),
.geo = col_logical()
)
#RES and COM are compositional data. Change to a ratio
spatial_data$LU_ratio = spatial_data$COM/spatial_data$RES
spatial_data <- dplyr::select(spatial_data, -c(RES,COM,.geo))
#merge spatial predictors with monitoring data
s8data.wPredictors <<- merge(s8data, spatial_data)%>%
dplyr::select(-c(depSplusN))
# spatial_data<-read_csv("data/spatialPredictors42.csv", col_types = cols(X1 = col_skip()))
#
# #merge spatial predictors with monitoring data
# s8data.wPredictors <- merge(s8data, spatial_data)Some helper functions to help
scatter_predict <- function(model_df,predictions) {
# model_to_predict <- CuModel
# coc = params[2]
# df <- (subset(s8data.wPredictors, parameter == coc)) %>%
# add_surv()
# predictions <- predict(model_to_predict, newdata=df,
# type="response", interval="none", level=0.95, it=NULL,
# posterior="all", verbose=FALSE, approx="numerical")
#
obs <- log(model_df$concentration)
ggstatsplot::ggscatterstats(
data =tibble(p = predictions, obs = obs,L = model_df$Location),
x = p,
y = obs,
type = "bf",
point.width.jitter = 0.02,
#point.height.jitter = 0.1,
marginal = FALSE,
xlab = "Predicted log(µg/L)",
ylab = "Observed log(µg/L)",
title = coc,
results.subtitle = FALSE,
subtitle = "Predictions vs. Observations",
smooth.line.args = list(size = 1, color = "blue"),
messages = FALSE
)
}# add custom functions to extract estimates (tidy) and goodness-of-fit (glance) information
tidy.MCMCglmm <- function(object, ...) {
s <- summary(object, ...)
ret <- tibble::tibble(term = row.names(s$solutions),
estimate = s$solutions[, 1],
conf.low = s$solutions[, 2],
conf.high = s$solutions[, 3])
ret
}
glance.MCMCglmm <- function(object, ...) {
ret <- tibble::tibble(dic = object$DIC,
n = nrow(object$X))
ret
}
# estimate a simple model
#model <- MCMCglmm(PO ~ 1 + plate, random = ~ FSfamily, data = PlodiaPO, verbose=FALSE, pr=TRUE)This is the main funciton to run the mcmc model. It does the following:
1. subsets a dataframe to include only the parameter we want to predict
2. adds a survival object to handle censored data
3. sets up a simiple prior structure
4. performs mcmc modeling on either the log-transformed or non-log transformed responses.
5. returns the results
mcmc_calc <- function(coc.local, fixed_list, lhs) {
df <- (subset(s8data.wPredictors, parameter == coc.local))
data <-
df %>%
add_surv()
#make the prior_structures
prior.1<-list(R=list(V=1, fix=1), G=list(G1=list(V=1, nu=0.002)))
prior.2<-list(R=list(V=2, fix=1), G=list(G1=list(V=1, nu=0)))
if (lhs == 'log') {
mcmc_formula <-
as.formula(paste(
"cbind(cenMin_log, cenMax_log) ~ ",
paste0(fixed_list , collapse = "+")
))
}
else {
mcmc_formula <-
as.formula(paste(
"cbind(cenMin, cenMax) ~ ",
paste0(fixed_list , collapse = "+")
))
}
mcmc_results <-
MCMCglmm(
mcmc_formula,
random = ~ Location,
data = data,
family = "cengaussian",
verbose = FALSE, prior = prior.1, singular.ok = TRUE,
nitt = 60000, thin = 13, burnin = 10000
)
return((mcmc_results))
}# Do some predictive checks
library(bayesplot)
color_scheme_set("blue")
bayes_plots <- function(fit,coc,df) {
#fit <- TSSModel
#coc = 'Total Suspended Solids - Water - Total'
#df <- (subset(s8data.wPredictors, parameter == coc)) %>%
# add_surv()
yrep_c <- predict(fit, newdata=df,
type="response", interval="confidence", level=0.9, it=NULL,
posterior="all", verbose=FALSE, approx="numerical")
yrep_p <- predict(fit, newdata=df,
type="response", interval="prediction", level=0.9, it=NULL,
posterior="all", verbose=FALSE, approx="numerical")
#show uncertainty intervals under esimated posterior density curves
plot.1 <- mcmc_areas(fit$Sol,prob = 0.80, pprob_outer = 0.95,point_est="mean")+ggplot2::labs(title = coc, subtitle = "Posterior distributions with medians and 80% intervals")
#generate scatter plot of predictions
colnames(yrep_p) <- c("fit.p", "lwr.p", "upr.p")
scatterdata <- cbind(df, yrep_c, yrep_p)
#generate scatter plot of predictions
plot.2 <- ggplot(scatterdata) +
geom_ribbon(aes(ymin = lwr.p, ymax = upr.p, x = fit),fill="grey", alpha = 0.5) +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = fit), fill = "grey", alpha = 0.8) +
geom_line(aes(x=fit,y=fit),color="blue",linetype=5)+
geom_point(aes(x = fit, y = log(concentration)), alpha = 0.5)+
ggplot2::labs(x="yrep",y="fit",title = coc, subtitle = "Scatter plot of observed data vs simulated",caption="dark shade: confidence intervals \n light shade: prediction intervals")
#simulate with 100 draws
ysim <- (simulate(fit,nsim = 100))
#overlay of predictions
plot.3<- ppc_dens_overlay(log(df$concentration),t(ysim))+ggplot2::labs(x="log concentration, μg/L ",title = coc, subtitle = "Observed (y) vs. simulated draws (yrep)")
#predictions vs observed, densities
scatterdata <- cbind(scatterdata,sim=ysim[,50])
plot.4<-ggplot(scatterdata) + geom_density(aes(x=log(concentration)),fill="lightblue",alpha=0.5)+
geom_density(aes(x=sim),fill="darkblue",alpha=0.5,linetype=2)+facet_wrap(vars(type),scales="free")+ ggplot2::labs(y="log concentration, μg/L ",title = coc, subtitle = "Simulated vs. Observed Concetrations for reported land use types",caption="dark shade: simulated, light shade: observed \n COM: Commercial, HDR: High Density Residential, LDR: Low Density Residential, IND: Industrial")
return(list(plot.1,plot.2,plot.3,plot.4))
}#function that returns dot-whisker plots
dotwhiskerplot <- function(model,coc) {
return(ggstatsplot::ggcoefstats(
x = model,
title = coc,
#subtitle = "multivariate generalized linear mixed model",
#conf.method = "HPDinterval",
meta.analytic.effect = TRUE,
exclude.intercept = FALSE,
robust = TRUE,
meta.type = "bayes",
bf.message = TRUE
)+ ggstatsplot::theme_ggstatsplot())}
#funciton for other diagnostic plots
diagnostic_plots <- function(chains,coc) {
plotTrace(chains,axes=TRUE,same.limits=TRUE)
plotDens(chains,main=paste('Posterior Distributions \n',coc),probs=c(0.050,0.950),same.limits=FALSE)}For each chunk below, we run the model and output diagnostic and prediction plots.
Iterations = 10001:59999
Thinning interval = 13
Sample size = 3847
DIC: 928
G-structure: ~Location
R-structure: ~units
Location effects: cbind(cenMin_log, cenMax_log) ~ impervious
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 3.644 3.239 4.067 3847 <0.0003 ***
impervious 0.714 0.414 1.015 3847 <0.0003 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Iterations = 10001:59999
Thinning interval = 13
Sample size = 3847
DIC: 928
G-structure: ~Location
R-structure: ~units
Location effects: cbind(cenMin_log, cenMax_log) ~ impervious
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 3.644 3.239 4.067 3847 <0.0003 ***
impervious 0.714 0.414 1.015 3847 <0.0003 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The following arguments were unrecognized and ignored: pprob_outer`expand_scale()` is deprecated; use `expansion()` instead.
[[1]]
[[2]]
[[3]]
[[4]]
coc <- params[4]
PbModel <- mcmc_calc(params[4],c('impervious'),'log')
mod <- PbModel
summary(mod$Sol)
Iterations = 10001:59999
Thinning interval = 13
Number of chains = 1
Sample size per chain = 3847
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
(Intercept) 1.193 0.327 0.00527 0.00527
impervious 0.654 0.233 0.00375 0.00388
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
(Intercept) 0.545 0.991 1.196 1.39 1.84
impervious 0.204 0.502 0.652 0.80 1.11
Iterations = 10001:59999
Thinning interval = 13
Sample size = 3847
DIC: 1040
G-structure: ~Location
R-structure: ~units
Location effects: cbind(cenMin_log, cenMax_log) ~ impervious
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 1.193 0.573 1.866 3847 0.0026 **
impervious 0.654 0.236 1.135 3588 0.0078 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The following arguments were unrecognized and ignored: pprob_outer`expand_scale()` is deprecated; use `expansion()` instead.
[[1]]
[[2]]
[[3]]
[[4]]
coc = 'Total Phosphorus - Water - Total'
TPModel <- mcmc_calc('Total Phosphorus - Water - Total',c('rev_logTraffic'),'log')
mod <- TPModel
summary(mod$Sol)
Iterations = 10001:59999
Thinning interval = 13
Number of chains = 1
Sample size per chain = 3847
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
(Intercept) 7.23 0.778 0.01255 0.01255
rev_logTraffic 1.71 0.485 0.00782 0.00782
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
(Intercept) 5.70 6.74 7.21 7.72 8.74
rev_logTraffic 0.77 1.40 1.70 2.02 2.67
Iterations = 10001:59999
Thinning interval = 13
Sample size = 3847
DIC: 952
G-structure: ~Location
R-structure: ~units
Location effects: cbind(cenMin_log, cenMax_log) ~ rev_logTraffic
post.mean l-95% CI u-95% CI eff.samp pMCMC
(Intercept) 7.226 5.704 8.735 3847 <0.0003 ***
rev_logTraffic 1.713 0.833 2.705 3847 0.0031 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The following arguments were unrecognized and ignored: pprob_outer`expand_scale()` is deprecated; use `expansion()` instead.
[[1]]
[[2]]
[[3]]
[[4]]
coc = 'Total Kjeldahl Nitrogen - Water - Total'
TKNModel<- mcmc_calc('Total Kjeldahl Nitrogen - Water - Total','rev_logTraffic','log')
mod <- TKNModel
df <- (subset(s8data.wPredictors, parameter == coc)) %>%
add_surv()
bayes_plots(mod,coc,df)The following arguments were unrecognized and ignored: pprob_outer`expand_scale()` is deprecated; use `expansion()` instead.
[[1]]
[[2]]
[[3]]
[[4]]
coc = 'Total Suspended Solids - Water - Total'
TSSModel <- mcmc_calc('Total Suspended Solids - Water - Total','rev_logTraffic','log')
df <- (subset(s8data.wPredictors, parameter == coc)) %>%
add_surv()
bayes_plots(TSSModel,coc,df)The following arguments were unrecognized and ignored: pprob_outer`expand_scale()` is deprecated; use `expansion()` instead.
[[1]]
[[2]]
[[3]]
[[4]]
‘Total PAH - Water - Total’,
coc = 'Total PAH - Water - Total'
PAHModel<- mcmc_calc('Total PAH - Water - Total','LU_ratio','log')
mod <- PAHModel
df <- (subset(s8data.wPredictors, parameter == coc)) %>%
add_surv()
bayes_plots(mod,coc,df)The following arguments were unrecognized and ignored: pprob_outer`expand_scale()` is deprecated; use `expansion()` instead.
[[1]]
[[2]]
[[3]]
[[4]]
‘CPAH - Water - Total’
coc = 'CPAH - Water - Total'
CPAHModel<- mcmc_calc('CPAH - Water - Total',c('logPopulation','rev_logTraffic'),'log')
mod <- CPAHModel
df <- (subset(s8data.wPredictors, parameter == coc)) %>%
add_surv()
bayes_plots(mod,coc,df)The following arguments were unrecognized and ignored: pprob_outer`expand_scale()` is deprecated; use `expansion()` instead.
[[1]]
[[2]]
[[3]]
[[4]]
‘HPAH - Water - Total’
coc = 'HPAH - Water - Total'
HPAHModel<- mcmc_calc('HPAH - Water - Total',c('LU_ratio'),'log')
mod <- HPAHModel
df <- (subset(s8data.wPredictors, parameter == coc)) %>%
add_surv()
bayes_plots(mod,coc,df)The following arguments were unrecognized and ignored: pprob_outer`expand_scale()` is deprecated; use `expansion()` instead.
[[1]]
[[2]]
[[3]]
[[4]]
Summarize posterior results for use in heatmap.
#summarize
metals <- list()
metals[['Total Copper']] <- (CuModel)
metals[['Total Zinc']] <- (ZnModel)
metals[['Total Cadmium']]<- (CdModel)
metals[['Total Lead']]<- (PbModel)
others <- list()
others[['Total Phosphorus']] <- TPModel
others[['Total Kjeldahl Nitrogen']] <- TKNModel
others[['Total Suspended Sediment']] <- TSSModel
#others['FC'] <- FCModel
msummary(metals,title='Total Metals',statistic_vertical = TRUE,statistic = 'conf.int', conf_level = 0.95)
msummary(others,title='Nutrients and TSS',statistic_vertical = TRUE,statistic = 'conf.int', conf_level = 0.95)Print out a messy list of results for the heatmap
| post.mean | l-95% CI | u-95% CI | eff.samp | pMCMC | |
|---|---|---|---|---|---|
| (Intercept) | 5.390 | 3.572 | 7.180 | 3847 | 0.000 |
| rev_logTraffic | 2.197 | 1.024 | 3.295 | 3847 | 0.001 |
| impervious | 0.305 | 0.059 | 0.554 | 4034 | 0.028 |
| post.mean | l-95% CI | u-95% CI | eff.samp | pMCMC | |
|---|---|---|---|---|---|
| (Intercept) | 3.644 | 3.239 | 4.07 | 3847 | 0 |
| impervious | 0.714 | 0.414 | 1.01 | 3847 | 0 |
| post.mean | l-95% CI | u-95% CI | eff.samp | pMCMC | |
|---|---|---|---|---|---|
| (Intercept) | -0.142 | -0.297 | 0.026 | 3847 | 0.084 |
| impervious | 0.295 | 0.180 | 0.405 | 3847 | 0.001 |
| logPopulation | -0.254 | -0.360 | -0.144 | 3847 | 0.000 |
| post.mean | l-95% CI | u-95% CI | eff.samp | pMCMC | |
|---|---|---|---|---|---|
| (Intercept) | 1.193 | 0.573 | 1.87 | 3847 | 0.003 |
| impervious | 0.654 | 0.236 | 1.14 | 3588 | 0.008 |
| post.mean | l-95% CI | u-95% CI | eff.samp | pMCMC | |
|---|---|---|---|---|---|
| (Intercept) | 7.23 | 5.704 | 8.73 | 3847 | 0.000 |
| rev_logTraffic | 1.71 | 0.833 | 2.71 | 3847 | 0.003 |
| post.mean | l-95% CI | u-95% CI | eff.samp | pMCMC | |
|---|---|---|---|---|---|
| (Intercept) | -1.826 | -2.369 | -1.204 | 3847 | 0.000 |
| LU_ratio | -0.166 | -0.382 | 0.075 | 3847 | 0.146 |
| post.mean | l-95% CI | u-95% CI | eff.samp | pMCMC | |
|---|---|---|---|---|---|
| (Intercept) | 13.23 | 11.537 | 14.92 | 3847 | 0.000 |
| rev_logTraffic | 1.97 | 0.893 | 3.03 | 3847 | 0.002 |