Introduction:

While Randomized Controlled Trials (RCTs) are the gold standard for causal inference, they are often not feasible in addiction research for ethical and logistic reasons; for example, when studying the impact of smoking on cancer.

Instead, observational data from real-world settings are increasingly being used to inform clinical decisions and public health policies. This paper presents the framework for potential outcomes for causal inference and summarizes best practices in causal analysis for observational data. Among them: Matching, Inverse Probability Weighting (IPW), and Interrupted Time-Series Analysis (ITSA).

These methods will be explained using examples from addiction research, and the resulting results will be compared.

Required packages and libraries:

knitr::opts_chunk$set(echo = TRUE)

install.packages("tidyverse", dependencies = TRUE, repos = "https://CRAN.R-project.org/package=tidyverse")
install.packages("lmtest", dependencies = TRUE, repos = "https://CRAN.R-project.org/package=lmtest")
install.packages("sandwich", dependencies = TRUE, repos = "https://CRAN.R-project.org/package=sandwich")
install.packages("MatchIt", dependencies = TRUE, repos = "https://CRAN.R-project.org/package=MatchIt")
install.packages("twang", dependencies = TRUE, repos = "https://CRAN.R-project.org/package=twang")
install.packages("nlme", dependencies = TRUE, repos = "https://CRAN.R-project.org/package=nlme")
install.packages("ggplot2", dependencies = TRUE, repos = "https://CRAN.R-project.org/package=ggplot2")
install.packages("tsModel", dependencies = TRUE, repos = "https://CRAN.R-project.org/package=tsModel")
library(tidyverse)
library(lmtest)
library(sandwich)
library(MatchIt) 
library(twang)
library(survey)
library(nlme)
library(ggplot2)
library(tsModel)

The methods Matching and IPTW will be explained on the basis of the following (observational) data set:

smk_data <- read_csv("https://raw.githubusercontent.com/gckc123/Causal_Analysis_Addiction_Examples/main/smoking_psyc_distress.csv")
## Rows: 8000 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (10): sex, indigeneity, high_school, partnered, remoteness, language, sm...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

It contains the following variables:

Sex: 0: Female; 1: Male
Indigeneity: 0: non-indigenous; 1: indigenous
High_school: 0: not completed high school; 1: completed high school
Partnered: 0: not partnered; 1: partnered
Remoteness: Remoteness of an individual’s residence, (“factor” variable). 0: major cities; 1: inner regional; 2: outer regional or more remote area
Language: Main language of the participant. 0: non-English; 1: English
Smoker: 0: No; 1: Yes
Risky_alcohol: Consuming alcohol at a risky level. 0: No; 1: Yes
Psyc_distress: Numeric variable ranged from 10 to 50. Higher value represents higher level of psychological distress
Age: Age of the participant

Method 1: Matching

The goal of matching is to establish the balance between treatment and control group, as it generally is in RTCs. Specificly, it targets similar distributions of all observed covariates in both treatment and control group. A variant of matching is one-to-one matching which matches each individual in the treatment group with an individual in the control group based on a propensity score. This score represents the probability of receiving the treatment, measured by all variables that can influence it. It is often estimated using logistic regression with pretreatment covariates. Unmatched individuals will be excluded. After matching, the Average Treatment effect among the Treated (ATT) can now be calculated using simple regression.

The following example illustrates estimating the causal effect of smoking on psychological distress:

Primarily, all smokers need to be matched to a non-smoker from the control group.

smk_matching <- matchit(smoker ~ sex + indigeneity + high_school + partnered + remoteness + language + risky_alcohol + age, data = smk_data, method = "optimal", distance = "glm")
summary(smk_matching)
## 
## Call:
## matchit(formula = smoker ~ sex + indigeneity + high_school + 
##     partnered + remoteness + language + risky_alcohol + age, 
##     data = smk_data, method = "optimal", distance = "glm")
## 
## Summary of Balance for All Data:
##               Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance             0.1852        0.1130          0.6304     2.0778    0.2033
## sex                  0.4938        0.4421          0.1035          .    0.0518
## indigeneity          0.0524        0.0175          0.1565          .    0.0349
## high_school          0.4220        0.6378         -0.4370          .    0.2158
## partnered            0.4630        0.6913         -0.4578          .    0.2283
## remoteness0          0.5852        0.6773         -0.1870          .    0.0921
## remoteness1          0.2177        0.1900          0.0670          .    0.0277
## remoteness2          0.1971        0.1327          0.1621          .    0.0645
## language             0.9579        0.9130          0.2234          .    0.0449
## risky_alcohol        0.6427        0.5411          0.2120          .    0.1016
## age                 51.6057       53.7824         -0.1676     0.8214    0.0441
##               eCDF Max
## distance        0.3185
## sex             0.0518
## indigeneity     0.0349
## high_school     0.2158
## partnered       0.2283
## remoteness0     0.0921
## remoteness1     0.0277
## remoteness2     0.0645
## language        0.0449
## risky_alcohol   0.1016
## age             0.1020
## 
## Summary of Balance for Matched Data:
##               Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance             0.1852        0.1849          0.0025     1.0151    0.0002
## sex                  0.4938        0.5000         -0.0123          .    0.0062
## indigeneity          0.0524        0.0400          0.0553          .    0.0123
## high_school          0.4220        0.4127          0.0187          .    0.0092
## partnered            0.4630        0.4743         -0.0226          .    0.0113
## remoteness0          0.5852        0.6027         -0.0354          .    0.0175
## remoteness1          0.2177        0.2136          0.0100          .    0.0041
## remoteness2          0.1971        0.1838          0.0335          .    0.0133
## language             0.9579        0.9620         -0.0205          .    0.0041
## risky_alcohol        0.6427        0.6417          0.0021          .    0.0010
## age                 51.6057       51.2064          0.0307     0.9137    0.0153
##               eCDF Max Std. Pair Dist.
## distance        0.0072          0.0044
## sex             0.0062          0.2752
## indigeneity     0.0123          0.3042
## high_school     0.0092          0.1725
## partnered       0.0113          0.1833
## remoteness0     0.0175          0.2772
## remoteness1     0.0041          0.2389
## remoteness2     0.0133          0.3226
## language        0.0041          0.1329
## risky_alcohol   0.0010          0.2678
## age             0.0390          0.3493
## 
## Sample Sizes:
##           Control Treated
## All          7026     974
## Matched       974     974
## Unmatched    6052       0
## Discarded       0       0

To highlight the difference in the matched and unmatched data set concerning the adjustment of all social demographic variables:

plot(summary(smk_matching), abs=FALSE)

To estimate the the treatment effect of smoking on social distress a linear regression is used:

matched_data <- match.data(smk_matching)
smk_model1 <- lm(psyc_distress ~ smoker, data = matched_data, weights = weights)
summary(smk_model1)
## 
## Call:
## lm(formula = psyc_distress ~ smoker, data = matched_data, weights = weights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.313 -5.261 -2.261  2.700 34.739 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.2608     0.2155  70.826  < 2e-16 ***
## smoker        2.0524     0.3047   6.735 2.15e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.725 on 1946 degrees of freedom
## Multiple R-squared:  0.02278,    Adjusted R-squared:  0.02228 
## F-statistic: 45.36 on 1 and 1946 DF,  p-value: 2.147e-11

For inference the results we need to calculate the cluster-robust standard error and the corresponding confidence interval:

coeftest(smk_model1, vcov. = vcovCL, cluster = ~subclass)
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 15.26078    0.18357 83.1352 < 2.2e-16 ***
## smoker       2.05236    0.29287  7.0077 3.326e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefci(smk_model1, vcov. = vcovCL, cluster = ~subclass, level = 0.95)
##                 2.5 %    97.5 %
## (Intercept) 14.900774 15.620787
## smoker       1.477986  2.626737
smk_model2 <- lm(psyc_distress ~ smoker + sex + indigeneity + high_school + partnered + remoteness + language + risky_alcohol + age, data = matched_data, weights = weights)
summary(smk_model2)
## 
## Call:
## lm(formula = psyc_distress ~ smoker + sex + indigeneity + high_school + 
##     partnered + remoteness + language + risky_alcohol + age, 
##     data = matched_data, weights = weights)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.248  -4.406  -1.748   2.677  34.198 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   20.98300    1.02957  20.380  < 2e-16 ***
## smoker         2.05594    0.29643   6.936 5.49e-12 ***
## sex           -0.84844    0.30543  -2.778  0.00553 ** 
## indigeneity    0.71561    0.72464   0.988  0.32350    
## high_school   -0.13068    0.31555  -0.414  0.67882    
## partnered     -2.27879    0.30034  -7.587 5.04e-14 ***
## remoteness1   -0.10904    0.37838  -0.288  0.77324    
## remoteness2   -0.48268    0.40000  -1.207  0.22769    
## language       0.31060    0.78642   0.395  0.69292    
## risky_alcohol -0.04213    0.32257  -0.131  0.89609    
## age           -0.08498    0.01169  -7.268 5.27e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.535 on 1937 degrees of freedom
## Multiple R-squared:  0.08131,    Adjusted R-squared:  0.07657 
## F-statistic: 17.14 on 10 and 1937 DF,  p-value: < 2.2e-16
coeftest(smk_model2, vcov. = vcovCL, cluster = ~subclass)
## 
## t test of coefficients:
## 
##                Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   20.983001   1.106654 18.9608 < 2.2e-16 ***
## smoker         2.055939   0.293548  7.0038 3.424e-12 ***
## sex           -0.848438   0.309643 -2.7400  0.006199 ** 
## indigeneity    0.715614   0.743268  0.9628  0.335771    
## high_school   -0.130683   0.324392 -0.4029  0.687099    
## partnered     -2.278793   0.301208 -7.5655 5.934e-14 ***
## remoteness1   -0.109040   0.382795 -0.2849  0.775787    
## remoteness2   -0.482682   0.409110 -1.1798  0.238210    
## language       0.310598   0.908526  0.3419  0.732485    
## risky_alcohol -0.042134   0.327350 -0.1287  0.897599    
## age           -0.084981   0.011135 -7.6317 3.611e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefci(smk_model2, vcov. = vcovCL, cluster = ~subclass, level = 0.95)
##                    2.5 %      97.5 %
## (Intercept)   18.8126422 23.15335935
## smoker         1.4802364  2.63164188
## sex           -1.4557070 -0.24116837
## indigeneity   -0.7420744  2.17330331
## high_school   -0.7668761  0.50551060
## partnered     -2.8695187 -1.68806716
## remoteness1   -0.8597748  0.64169401
## remoteness2   -1.2850237  0.31965914
## language      -1.4711929  2.09238922
## risky_alcohol -0.6841293  0.59986187
## age           -0.1068197 -0.06314286

Method 2: Inverse Probability Treatment Weighting

Inverse Probability Weighting (IPW) is a statistical technique used in observational studies to estimate the causal effect of a treatment or intervention. Its primary objective is to address potential confounding variables and (like the matching method) achieve balance between the treatment and control groups. In IPW, propensity scores are calculated for each individual in the study population. These scores represent the probability of receiving the treatment, given a set of observed covariates, like in this case religion or education. Once the propensity scores are obtained, weights are assigned to each individual based on their propensity score. Individuals in the treatment group are assigned weights that are the inverse of their propensity score, while individuals in the control group are assigned weights derived from the inverse of one minus the propensity score. These weights help balance the groups by giving more weight to individuals who are less likely to receive the treatment, and vice versa. The weighted data is then used to estimate the causal effect of the treatment using appropriate statistical methods such as regression models or stratification techniques.

Balance the groups:

#Here starts the part for IPTW method. The same data is being used, with remoteness variable recoded as factor variable.

smk_iptw <- ps(smoker ~ sex + indigeneity + high_school + partnered + remoteness + language + risky_alcohol + age, interaction.depth = 3, data = as.data.frame(smk_data), n.tree = 5000, estimand = "ATE", verbose = FALSE)

bal.table(smk_iptw)
## $unw
##                tx.mn  tx.sd  ct.mn  ct.sd std.eff.sz    stat  pval    ks
## sex            0.494  0.500  0.442  0.497      0.104   3.031 0.002 0.052
## indigeneity    0.052  0.223  0.018  0.131      0.239   4.770 0.000 0.035
## high_school    0.422  0.494  0.638  0.481     -0.443 -12.820 0.000 0.216
## partnered      0.463  0.499  0.691  0.462     -0.483 -13.504 0.000 0.228
## remoteness:0   0.585  0.493  0.677  0.467     -0.195  19.793 0.000 0.092
## remoteness:1   0.218  0.413  0.190  0.392      0.070      NA    NA 0.028
## remoteness:2   0.197  0.398  0.133  0.339      0.186      NA    NA 0.064
## language       0.958  0.201  0.913  0.282      0.164   6.180 0.000 0.045
## risky_alcohol  0.643  0.479  0.541  0.498      0.204   6.169 0.000 0.102
## age           51.606 12.990 53.782 14.333     -0.153  -4.839 0.000 0.102
##               ks.pval
## sex             0.020
## indigeneity     0.250
## high_school     0.000
## partnered       0.000
## remoteness:0    0.000
## remoteness:1       NA
## remoteness:2       NA
## language        0.064
## risky_alcohol   0.000
## age             0.000
## 
## $ks.mean.ATE
##                tx.mn  tx.sd  ct.mn  ct.sd std.eff.sz   stat  pval    ks ks.pval
## sex            0.449  0.498  0.448  0.497      0.002  0.051 0.959 0.001   1.000
## indigeneity    0.019  0.138  0.021  0.144     -0.012 -0.485 0.627 0.002   1.000
## high_school    0.608  0.488  0.612  0.487     -0.009 -0.217 0.828 0.004   1.000
## partnered      0.651  0.477  0.664  0.472     -0.028 -0.734 0.463 0.013   1.000
## remoteness:0   0.668  0.471  0.667  0.471      0.003  0.003 0.997 0.001   0.997
## remoteness:1   0.193  0.395  0.193  0.395     -0.001     NA    NA 0.000      NA
## remoteness:2   0.139  0.346  0.140  0.347     -0.003     NA    NA 0.001      NA
## language       0.929  0.257  0.918  0.274      0.039  0.847 0.397 0.011   1.000
## risky_alcohol  0.572  0.495  0.554  0.497      0.037  0.862 0.389 0.018   0.992
## age           53.293 13.857 53.525 14.194     -0.016 -0.373 0.709 0.016   0.999
## 
## $es.mean.ATE
##                tx.mn  tx.sd  ct.mn  ct.sd std.eff.sz   stat  pval    ks ks.pval
## sex            0.449  0.498  0.448  0.497      0.002  0.056 0.955 0.001   1.000
## indigeneity    0.020  0.138  0.021  0.144     -0.012 -0.467 0.640 0.002   1.000
## high_school    0.608  0.488  0.612  0.487     -0.008 -0.215 0.830 0.004   1.000
## partnered      0.650  0.477  0.664  0.472     -0.029 -0.757 0.449 0.014   1.000
## remoteness:0   0.667  0.471  0.667  0.471      0.002  0.003 0.997 0.001   0.997
## remoteness:1   0.193  0.395  0.193  0.395      0.000     NA    NA 0.000      NA
## remoteness:2   0.139  0.346  0.140  0.347     -0.003     NA    NA 0.001      NA
## language       0.929  0.257  0.918  0.274      0.039  0.847 0.397 0.011   1.000
## risky_alcohol  0.572  0.495  0.554  0.497      0.037  0.870 0.384 0.018   0.991
## age           53.296 13.859 53.525 14.194     -0.016 -0.368 0.713 0.016   0.999

The ps()-function is being used to estimate the propensity scores. The variable smoker is the outcome variable, and the other variables listed are the predictors used to predict the likelihood of being a smoker (Receiving the Treatment). The result is stored in the smk_iptw object, which can then be further used for estimating the causal effect of the treatment on smoking behavior.

The command “bal.table(smk_iptw)” calculates a balance table for the variables in relation to the created propensity score model “smk_iptw”. In this case, the focus is on the first and third table and on the column “std. eff. sz”. In the first table, there are substantial differences in all variables in the original data between the two groups (> 0.1). Compared to that, in the third table, which represents the weighted data, the differences are close to 0. So a much better balance between the groups regarding the covariates is achieved.

plot(smk_iptw)

Extract the weights:

#extract the weights
smk_data$weight <- get.weights(smk_iptw, stop.method = "es.mean")

With the previous step, the weights can now be extracted. The get.weights()-function is used to calculate the weights for the observations based on the propensity score analysis. The relevant object was defined above as smk_iptw. This object contains the necessary information to compute the weights. By executing this code, the propensity score weights are calculated for each observation in the “smk_data” dataset and stored in the newly created “weight” variable.

Estimate the Treatment Effect:

#Estimate the Treatment Effect

design_iptw <- svydesign(ids = ~1, weights = ~weight, data = smk_data)
smk_model3 <- svyglm(psyc_distress ~ smoker, design = design_iptw)
summary(smk_model3)
## 
## Call:
## svyglm(formula = psyc_distress ~ smoker, design = design_iptw)
## 
## Survey design:
## svydesign(ids = ~1, weights = ~weight, data = smk_data)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.83600    0.06618 224.168  < 2e-16 ***
## smoker       1.72865    0.25771   6.708 2.11e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 38.62049)
## 
## Number of Fisher Scoring iterations: 2

After using IPTW, the treatment effect can now be estimated. The function svyglm() estimates a generalized linear model (glm) for psychological distress (psyc_distress) based on the variable smoker. The model is fitted using the previously created survey design design_iptw with inverse probability weighting. The results of the model are assigned to the variable smk_model3. The following summary ()-function provides the results from the smk_model3. It includes statistical information such as estimates, standard errors, p-values, and confidence intervals for the estimated model coefficients. By running these code snippets, estimates, summaries, and confidence intervals are computed for the relationship between smoking and psychological distress. The results can be used to analyze the impact of smoking on mental health and draw conclusions. By using Inverse Probability Weighting before, we got a much better estimate for the causal relationship between smoking and psychological distress by migate the impact of the confounding variables and create balance between treatment and control group.

In this case, the analysis estimates that smoking leads to an increase of 1.73 points in psychological distress compared to non-smokers.

confint(smk_model3)
##                 2.5 %    97.5 %
## (Intercept) 14.706263 14.965732
## smoker       1.223468  2.233829

Method 3: Interrupted Time Series Analysis

With the goal of altering a population-level outcome, numerous public health interventions are put into action, such as the rate of hospital emergency presentations due to excessive alcohol consumption. Interrupted Time Series Analysis, a statistical technique using observational data, examines the impact of interventions by analyzing changes in a time series before and after the intervention. It separates intervention effects from other factors, employing a control group for a counterfactual scenario. This method has broad applications in public health, economics, social sciences, and policy evaluation. It enables decision-makers to make informed choices and optimize policies. Interrupted Time Series Analysis is a powerful tool for understanding causal effects, evaluating interventions, and improving well-being.

In this part, the effect of the minimum alcohol pricing on alcohol consumption at the population level is estimated in Northern Territory, Australia, on the basis of the following (observational) data set:

alc_mup_data <- read_csv("https://statsnotebook.io/blog/data_management/example_data/alcohol_data_NTWA.csv")

The dataset contains 5 variables, as follows:

Alcohol: measure of population level alcohol consumption in a month.
Time: time measures in months
State: “NT”: Northern Territory; “WA” Western Australia
Intervention: Whether the time point is pre or post intervention. 0: Pre-intervention; 1: Post-intervention.
Season: Season of the year. 1: Spring; 2: Summer; 3: Autumn; 4: Winter

alc_mup_data$state <- factor(alc_mup_data$state, exclude = c("", NA))
alc_mup_data$intervention <- factor(alc_mup_data$intervention, exclude = c("", NA))

alc_mup_data$state <- relevel(alc_mup_data$state, ref="WA")

Generating the descriptive statistics by states and by pre/post intervention:

alc_mup_data %>%
group_by(state, intervention) %>%
summarize(count = n(),
M_alcohol = mean(alcohol, na.rm = TRUE),
Mdn_alcohol = median(alcohol, na.rm = TRUE),
SD_alcohol = sd(alcohol, na.rm = TRUE),
IQR_alcohol = IQR(alcohol, na.rm = TRUE)
) %>%
print()
## # A tibble: 4 × 7
## # Groups:   state [2]
##   state intervention count M_alcohol Mdn_alcohol SD_alcohol IQR_alcohol
##   <fct> <fct>        <int>     <dbl>       <dbl>      <dbl>       <dbl>
## 1 WA    0               24     1531.       1587.       358.        506.
## 2 WA    1               24     1582.       1583.       367.        517.
## 3 NT    0               24     3444.       3548.       349.        388.
## 4 NT    1               24     1777.       1748.       323.        324.

To visualize alcohol consumption level by state and by pre/post intervention the ggplot() function is being used:

ggplot(alc_mup_data) +
  geom_boxplot(aes(y=alcohol, x=state, fill = intervention))

Estimating the intervention effect of alcohol minimum pricing on population level alcohol consumption, using gls() function:

res <- gls(alcohol ~ time*intervention*state,
           data = alc_mup_data,
           correlation = corARMA(p = 1, form =~ time | state), method = "ML")
summary(res)
## Generalized least squares fit by maximum likelihood
##   Model: alcohol ~ time * intervention * state 
##   Data: alc_mup_data 
##        AIC      BIC    logLik
##   1330.261 1355.905 -655.1307
## 
## Correlation Structure: AR(1)
##  Formula: ~time | state 
##  Parameter estimate(s):
##       Phi 
## 0.7066424 
## 
## Coefficients:
##                                Value Std.Error   t-value p-value
## (Intercept)                 1558.506  263.2032  5.921303  0.0000
## time                           1.636   16.9157  0.096720  0.9232
## intervention1                -20.840  715.1188 -0.029143  0.9768
## stateNT                     2133.915  372.2256  5.732854  0.0000
## time:intervention1            -1.221   27.5750 -0.044281  0.9648
## time:stateNT                 -23.652   23.9224 -0.988718  0.3255
## intervention1:stateNT      -3258.232 1011.3307 -3.221728  0.0018
## time:intervention1:stateNT    62.170   38.9969  1.594227  0.1145
## 
##  Correlation: 
##                            (Intr) time   intrv1 statNT tm:nt1 tm:sNT in1:NT
## time                       -0.840                                          
## intervention1              -0.507  0.639                                   
## stateNT                    -0.707  0.594  0.358                            
## time:intervention1          0.600 -0.815 -0.945 -0.424                     
## time:stateNT                0.594 -0.707 -0.452 -0.840  0.576              
## intervention1:stateNT       0.358 -0.452 -0.707 -0.507  0.668  0.639       
## time:intervention1:stateNT -0.424  0.576  0.668  0.600 -0.707 -0.815 -0.945
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -2.3167349 -0.8427548  0.0837754  0.6497867  2.4752763 
## 
## Residual standard error: 312.3269 
## Degrees of freedom: 96 total; 88 residual
#generating the model-based prediction
alc_mup_data$predicted <- res$fitted

#generating the interaction for ggplots
groups = interaction(alc_mup_data$intervention,alc_mup_data$state)

#ploting the time series
plot <- ggplot() +
  geom_point(data = alc_mup_data, aes(y = alcohol, x = time, color = state)) +
  geom_line(data = alc_mup_data, aes(y = predicted, x = time, color = state, group = groups)) +
  geom_vline(xintercept = max((alc_mup_data %>% filter(intervention == "0"))$time), linetype = 
               "dashed") +
  theme_bw(base_family = "sans") +
  theme(legend.position = "bottom")
plot

While alcohol consumption in the Northern Territory was higher than in Western Australia before alcohol minimum pricing, there was no significant difference in the pre-intervention trend (as indicated by the time by state interaction). Immediately after implementing minimum alcohol price, there was a significant drop in alcohol consumption in the Northern Territory but not in Western Australia (as indicated by the intervention by state interaction).

alc_mup_data$residuals <- residuals(res)

To adjust for seasonality, harmonic function is being used based on sine and cosine function:

alc_mup_data <- cbind(alc_mup_data, data.frame(harmonic(alc_mup_data$time, 1, 12)))
alc_mup_data <- alc_mup_data %>% 
  rename(harmonic1 = X1,
         harmonic2 = X2)
res <- gls(alcohol ~ time*intervention*state + harmonic1 + harmonic2,
           data = alc_mup_data,
           correlation = corARMA(p = 1, form =~ time | state), method = "ML")
summary(res)
## Generalized least squares fit by maximum likelihood
##   Model: alcohol ~ time * intervention * state + harmonic1 + harmonic2 
##   Data: alc_mup_data 
##       AIC      BIC   logLik
##   1313.75 1344.522 -644.875
## 
## Correlation Structure: AR(1)
##  Formula: ~time | state 
##  Parameter estimate(s):
##       Phi 
## 0.6162917 
## 
## Coefficients:
##                                Value Std.Error   t-value p-value
## (Intercept)                 1665.579  197.7979  8.420607  0.0000
## time                          -7.244   13.0538 -0.554900  0.5804
## intervention1                142.751  541.7217  0.263513  0.7928
## stateNT                     2146.874  277.3498  7.740673  0.0000
## harmonic1                   -267.374   57.3382 -4.663114  0.0000
## harmonic2                    -27.005   53.3816 -0.505882  0.6142
## time:intervention1             0.474   20.4244  0.023186  0.9816
## time:stateNT                 -22.924   18.2655 -1.255060  0.2129
## intervention1:stateNT      -3284.316  762.7561 -4.305853  0.0000
## time:intervention1:stateNT    61.114   28.8323  2.119625  0.0369
## 
##  Correlation: 
##                            (Intr) time   intrv1 statNT hrmnc1 hrmnc2 tm:nt1
## time                       -0.853                                          
## intervention1              -0.468  0.540                                   
## stateNT                    -0.701  0.595  0.338                            
## harmonic1                  -0.128  0.145 -0.060  0.000                     
## harmonic2                  -0.017  0.007  0.074  0.000 -0.041              
## time:intervention1          0.609 -0.782 -0.924 -0.432 -0.014 -0.058       
## time:stateNT                0.596 -0.700 -0.391 -0.850  0.000  0.000  0.557
## intervention1:stateNT       0.336 -0.389 -0.704 -0.480  0.000  0.000  0.654
## time:intervention1:stateNT -0.429  0.552  0.652  0.612  0.000  0.000 -0.706
##                            tm:sNT in1:NT
## time                                    
## intervention1                           
## stateNT                                 
## harmonic1                               
## harmonic2                               
## time:intervention1                      
## time:stateNT                            
## intervention1:stateNT       0.556       
## time:intervention1:stateNT -0.789 -0.926
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -2.8783266 -0.7063188 -0.1231168  0.6723216  1.9645475 
## 
## Residual standard error: 252.7503 
## Degrees of freedom: 96 total; 86 residual
alc_mup_data$predicted <- res$fitted
groups = interaction(alc_mup_data$intervention,alc_mup_data$state)

alc_mup_data.linear <- alc_mup_data
alc_mup_data.linear$harmonic1 <- 0
alc_mup_data.linear$harmonic2 <- 0
alc_mup_data.linear$predicted <- predict(res, alc_mup_data.linear)

plot <- ggplot() +
  geom_point(data = alc_mup_data, aes(y = alcohol, x = time, color = state)) +
  geom_line(data = alc_mup_data, aes(y = predicted, x = time, color = state, group = groups), linetype = "dashed") +
  geom_line(data = alc_mup_data.linear, aes(y = predicted, x = time, color = state, group = groups)) +
  geom_vline(xintercept = max((alc_mup_data %>% filter(intervention == "0"))$time), linetype = "dashed") +
  theme_bw(base_family = "sans") +
  theme(legend.position = "bottom")
plot

Results & Interpretation

In conclusion, we have explored three powerful methods in causal inference which are part of the case study “Causal inference with observational data in addiction research”: Matching, Inverse Probability Treatment Weighting (IPTW), and Time Series Analysis. Each of these methods offers unique approaches to address causal questions in observational data. To explain Matching and IPTW the used study investigates the causal effect of smoking on psychological distress.

We saw, that Matching is a valuable technique that aims to create comparable treatment and control groups by pairing observations based on their similarity in propensity scores. IPTW assigning weights to observations based on their propensity scores, effectively balancing treatment and control groups. So these two methods have the same goal by creating balance between treatment and control group und thereafter on that base estimating the Average Treatment Effect.

Time Series Analysis, on the other hand, allows us to explore the dynamics and relationships among variables over time. It provides insights into the causal effects of interventions or treatments by examining how changes in one variable influence another variable in a time-dependent manner.

In summary, by employing Inverse Probability Weighting, Matching, and Time Series Analysis, researchers can enhance their ability to understand and estimate causal effects in diverse settings, contributing to evidence-based decision-making and advancing our understanding of complex phenomena.