Case Studies and Data Products
The participants of the DCL collaborate on various case studies in the context of causality. They work jointly on data products that illustrate theoretical concepts and link them to data examples. The abstracts and links to the data products and code are listed below. An overview of all topics for causal case studies is available from the DCL Causal Case Study GitHub Repository. If you have an idea, feel free to add it to the list and describe it in a new issue.
An Introduction to Difference in Differences
Participants: Aranka Bálint, Kristina Bychkivska, Maya Sadia
Abstract
In this project, we describe how a Difference-in-Differences (DiD) design works theoretically and illustrate it with example data. DiD is a statistical technique used to estimate the causal effect of a treatment by comparing the differences in outcomes over time between a group that is exposed to the treatment and a group that is not. It helps to control for time-invariant unobserved heterogeneity. The treatment group is exposed to the intervention; the control group, the subjects or entities not exposed to the intervention, serves as a baseline against which the treatment group’s outcomes are compared. In our example, some districts receive a minimum wage, while others do not. A DiD design rests on the assumption that, in the absence of treatment, the treatment and control groups would have followed the same trend over time. This is known as the parallel trends assumption. Any deviation from that assumed parallel trend is considered to be the treatment effect (D). This is the effect that we calculate with our dataset testdata. In this project, we want to highlight the effect of the common trend: even without intervention, the outcome in the treatment group is rising. Without accounting for the common trend, we would overestimate the effect of our treatment and risk drawing wrong policy conclusions. This project is useful for anyone who wants to understand how a DiD design works. We also draw attention to the assumptions and limitations of a DiD design. This project is for you if you want to get a broad intuition about causality and DiD.
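To make the mechanics concrete, here is a minimal R sketch of a two-group, two-period DiD; the variable names (wage, treated, post) are illustrative and not taken from the testdata set used in the notebook:

```r
# Minimal DiD sketch with simulated data; all names are illustrative.
set.seed(1)
n  <- 1000
df <- data.frame(
  treated = rep(c(0, 1), each  = n / 2),  # treatment group indicator
  post    = rep(c(0, 1), times = n / 2)   # post-intervention period
)
# Common trend: outcomes rise by 2 for everyone after the first period;
# the true treatment effect on top of that trend is 1.
df$wage <- 5 + 2 * df$post + 1 * df$treated * df$post +
  0.5 * df$treated + rnorm(n)

# The interaction coefficient recovers the treatment effect (~1),
# while `post` absorbs the common trend.
summary(lm(wage ~ treated * post, data = df))
```

Naively comparing treated outcomes before and after would attribute the common trend to the treatment as well; the interaction term is what separates the two.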
Links
Data product: DiD Notebook (html)
GitHub repository: https://github.com/DigitalCausalityLab/DCL_DiD
Causality and Large Language Models
Participants: Lucas Mandelik, Jannik Svenson, Alexander Lorenz
Abstract
This project deals with the causal inference capabilities of Large Language Models (LLMs), using the article “A Critical Review of Causal Reasoning Benchmarks for Large Language Models” by Yang et al. (December 2023) as its foundation; the article examines various benchmarks for assessing the ability of LLMs to perform causal inference. Initially, we provide an overview of the article’s main points, which include the introduction of two ability hierarchies. Firstly, the article introduces causal hierarchies, which represent different levels of causal statements that an LLM can correctly make: drawing on situational and environmental information, discovering new knowledge from data, and predicting the quantitative impact of actions. Secondly, it describes the “ladder of causation,” which also has three levels: describing basic statistical associations, understanding the effect of interventions, and comprehending underlying causal mechanisms to make correct statements even for hypothetical scenarios. The complexity of tasks increases with each level of both hierarchies. According to the paper, most current models can handle the first level of both hierarchies but struggle with the second and third. Next, we conducted our own investigations by testing some of the questions proposed by the paper as well as other causal questions known to challenge LLMs. We used GPT-3.5 Turbo and GPT-4 Omni, both provided by OpenAI, and Gemini, provided by Google. Our investigation yielded mixed results. Previous studies indicated that most models only succeed at the first level of the causal hierarchies. However, our experiments with the latest versions of OpenAI’s models, particularly GPT-4 Omni, show better performance on the benchmarks. While GPT-3.5 Turbo and Gemini show general weaknesses in causal reasoning, GPT-4 Omni only displayed some robustness issues. These findings highlight that despite significant progress, LLMs are still far from achieving, let alone surpassing, human-level causal reasoning capabilities.
Links
Data product: Presentation (pdf)
GitHub repository: https://github.com/DigitalCausalityLab/DCL_MLS
Average Treatment Effects and Heterogeneous Treatment Effects
Participants: Marlon, Robin and Joel
Abstract
This data product is dedicated to the examination of the Average Treatment Effect (ATE) and its adjustment for Heterogeneous Treatment Effects (HTE) as well as selection bias. First, the basis for calculating the ATE is presented. This is followed by an example that demonstrates the implementation of these concepts in code. Overall, this data product provides an introduction to the statistical analysis of treatment effects and presents Propensity Score Matching as a valuable tool for analyzing biased data.
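As a rough illustration of the approach, the following R sketch estimates a treatment effect under selection on a single confounder using propensity score matching via the MatchIt package; the data and variable names are made up for this example and do not come from the data product:

```r
# Hedged sketch: propensity score matching with MatchIt on simulated data.
library(MatchIt)

set.seed(42)
n     <- 2000
x     <- rnorm(n)                        # confounder
treat <- rbinom(n, 1, plogis(x))         # selection into treatment depends on x
y     <- 2 * treat + x + rnorm(n)        # true treatment effect = 2
df    <- data.frame(y, treat, x)

# Naive difference in means is biased by selection on x
mean(df$y[df$treat == 1]) - mean(df$y[df$treat == 0])

# Nearest-neighbour matching on the estimated propensity score
m  <- matchit(treat ~ x, data = df, method = "nearest")
md <- match.data(m)
mean(md$y[md$treat == 1]) - mean(md$y[md$treat == 0])  # close to 2
```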
Links
Data product: Code example (html) and Slide show (pdf)
GitHub repository: https://github.com/DigitalCausalityLab/DCL-Project-ATE-and-HTE
Collider Bias: The Hairdresser Example
Participants: Alper Duman, Amir Kabiri
Abstract
In our data product, we contribute to the Digital Causality Lab project by analyzing collider bias, which is sometimes also referred to as selection bias. Collider bias is one possible source of bias under the null hypothesis and can skew the results of a causal case study by permitting the flow of association when there is no underlying causal relationship between a set of variables.
To this end, we simulate a data set in the statistical software R, containing 500 barbershops as our observed units. Every unit has a friendliness score and a quality score, referring to the employees and the received haircuts, respectively, which are independently and identically distributed. In addition, every barbershop gets a rating ranging from one to five stars, which is determined by the combination of the other two variables. We illustrate collider bias and its effects by showing the results when conditioning on the collider, e.g., focusing only on four-star barbershops, and comparing these to the results in the case where we do not condition on any star rating.
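A minimal R sketch of this kind of simulation might look as follows; the exact scores and rating rule in the data product may differ:

```r
# Collider bias sketch: two independent causes of a common effect (rating).
set.seed(123)
n <- 500
friendliness <- rnorm(n)   # i.i.d., independent of quality
quality      <- rnorm(n)

# Rating (the collider) is driven by both scores, binned into five stars
score  <- friendliness + quality
rating <- cut(score, breaks = quantile(score, probs = seq(0, 1, 0.2)),
              labels = 1:5, include.lowest = TRUE)

cor(friendliness, quality)               # ~0: no marginal association
idx <- rating == "4"                     # condition on the collider
cor(friendliness[idx], quality[idx])     # clearly negative: collider bias
```

Among four-star shops, high friendliness mechanically implies lower quality (and vice versa), which is exactly the spurious association the data product illustrates.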
Links
Data product: Hairdresser Example (pdf)
GitHub repository: https://github.com/DigitalCausalityLab/hairdresser_example
Shiny App based on project: https://digitalcausalitylab.shinyapps.io/hairdresser_example/
Data Example: Addiction Research
Participants: Mattes Grundmann, Oya Bazer, Jakob Zschocke
Abstract
While Randomized Controlled Trials (RCTs) are the gold standard for causal inference, they are often not feasible in addiction research for ethical and logistical reasons, for example when studying the impact of smoking on cancer. Instead, observational data from real-world settings are increasingly used to inform clinical decisions and public health policies. This paper presents the potential outcomes framework for causal inference and summarizes best practices in the causal analysis of observational data, among them matching, Inverse Probability Weighting (IPW), and Interrupted Time-Series Analysis (ITSA). These methods are explained using examples from addiction research, and their results are compared.
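As a flavour of one of these methods, here is a minimal segmented-regression sketch of an ITSA on simulated data; the outcome and intervention point are illustrative, not taken from the paper:

```r
# Interrupted time-series sketch: level shift plus slope change.
set.seed(7)
t            <- 1:100
intervention <- as.numeric(t > 50)   # policy starts at t = 51
time_after   <- pmax(0, t - 50)      # time elapsed since intervention
y <- 10 + 0.1 * t - 2 * intervention - 0.2 * time_after + rnorm(100)

# `intervention` captures the immediate level shift,
# `time_after` the change in trend after the intervention.
summary(lm(y ~ t + intervention + time_after))
```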
Links
Data product: Addiction research notebook (html)
GitHub repository: https://github.com/DigitalCausalityLab/Addiction-Research
Causal Baseball: The Case of Reach on Error
Participants: Endrit Kameraj, Almir Memedi, Vincent Riemenschneider
Abstract
Data science in baseball is a widespread field in which many different metrics and approaches have been developed, particularly in the last 20 years, to analyze and evaluate the performance of players and teams on the field. This field of data analysis in baseball is called sabermetrics. The boundaries between its metrics are not always clear-cut, and for some relationships the question arises whether they are actually meaningful. In the following example, we aim to highlight the effect of a player’s attributes, such as their speed or hitting technique, on their Reached on Error (ROE) numbers. Reached on Error is a somewhat overlooked value in sabermetrics when evaluating player performance. In general, an offensive player (called a hitter) receives an ROE when they reach a base due to a defensive error, a base they would not have reached without the error. An error can be, for example, a bad throw, bad fielding (poor ball retrieval), or dropping the ball. By definition, errors and bases reached due to errors are a product of a defensive player’s mistakes. With this data product, we want to provide empirical evidence on the following quote from mlb.com:
“By definition, errors are primarily the result of a fielder making a mistake. But even with that caveat, certain players – namely speedy ground-ball hitters – are likely to record more times reached on error than the average player.” mlb.com
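One hypothetical way to probe this claim is a simple count regression of ROE on player attributes; the variables speed and gb_rate below are assumed stand-ins, not the actual metrics used in the data product:

```r
# Hypothetical sketch: does speed / ground-ball tendency predict ROE counts?
set.seed(99)
n       <- 300
speed   <- rnorm(n)   # sprint-speed-style metric (standardized)
gb_rate <- rnorm(n)   # share of balls hit on the ground (standardized)
roe     <- rpois(n, lambda = exp(1 + 0.3 * speed + 0.2 * gb_rate))

# ROE is a count, so Poisson regression is a natural first model
summary(glm(roe ~ speed + gb_rate, family = poisson))
```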
Links
Data product: Causal Baseball - The Case of Reach on Error (html)
GitHub repository: https://github.com/DigitalCausalityLab/causal_baseball
Illustration of d-Separation: Graphs and Examples
Participants: Liliana Albrecht, Fenja Sonnefeld
Abstract
This case study explains the concept of d-separation. Two variables are said to be d-separated if all paths between them are blocked (otherwise they are d-connected). Two sets of variables are said to be d-separated if each variable in the first set is d-separated from every variable in the second set.
The case study provides an overview of the rules that determine whether a path is blocked or not, including simple examples illustrating those rules. All rules and examples are visualised in directed acyclic graphs (DAGs). Finally, a more complex DAG is discussed in order to show how all backdoor paths between two variables can be closed, so that the variables are d-separated.
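For readers who want to check d-separation statements programmatically, the dagitty R package offers a direct way to do so; the DAG below is a made-up example rather than the case study’s graph:

```r
# Checking d-separation with the dagitty package.
library(dagitty)

g <- dagitty("dag {
  X -> M -> Y
  X <- U -> Y
}")

dseparated(g, "X", "Y", list())          # FALSE: open paths exist
dseparated(g, "X", "Y", c("M"))          # FALSE: backdoor via U still open
dseparated(g, "X", "Y", c("M", "U"))     # TRUE: all paths blocked
```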
Links
Data product: d-separation (html)
GitHub repository: https://github.com/DigitalCausalityLab/d-separation