Students' Gallery

Case Studies and Data Products

The participants of the Digital Causality Lab (DCL) collaborate on case studies in causality. They work jointly on data products that illustrate theoretical concepts and link them to data examples. The abstracts and the links to each data product and its code are listed below. An overview of all topics for causal case studies is available from the DCL Causal Case Study GitHub Repository. If you have an idea, feel free to add it to the list and describe it in a new issue.

Collider Bias: The Hairdresser Example

Participants: Alper Duman, Amir Kabiri
Abstract

In our data product, we contribute to the Digital Causality Lab project by analyzing collider bias, which in some settings is also referred to as selection bias. Collider bias is one possible source of bias under the null hypothesis: conditioning on a common effect of two variables permits association to flow between them even when there is no underlying causal relationship, which can skew the results of a causal case study.

To this end, we simulate a data set in the statistical software R containing 500 barbershops as our observed units. Every unit has a friendliness score and a quality score, referring to the employees and the received haircuts, respectively; both scores are independently and identically distributed. In addition, every barbershop receives a rating from one to five stars, which is affected by the combination of the other two variables. We illustrate collider bias and its effects by comparing the results obtained when conditioning on the collider (for example, looking only at four-star barbershops) with the results obtained without conditioning on any star rating.
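The setup described above can be sketched in a few lines. This is a minimal Python illustration, not the authors' R code: the standard-normal scores, the rating rule based on the sum of both scores, the cutoffs, and all function names are assumptions made for this example. Only the 500 units and the one-to-five star scale come from the abstract.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def star_rating(friendliness, quality):
    """Assumed rating rule: stars depend only on the sum of both scores."""
    total = friendliness + quality
    cutoffs = (-1.5, -0.5, 0.5, 1.5)  # hypothetical thresholds
    return 1 + sum(total > c for c in cutoffs)  # value in 1..5


def simulate_barbershops(n=500, seed=1):
    """Draw independent friendliness and quality scores, then rate each shop."""
    rng = random.Random(seed)
    shops = []
    for _ in range(n):
        friendliness = rng.gauss(0.0, 1.0)
        quality = rng.gauss(0.0, 1.0)
        shops.append((friendliness, quality, star_rating(friendliness, quality)))
    return shops


def correlation(shops, stars=None):
    """Correlation of the two scores, optionally only for one star rating."""
    rows = [s for s in shops if stars is None or s[2] == stars]
    return pearson([r[0] for r in rows], [r[1] for r in rows])


shops = simulate_barbershops()
print("all shops:      ", round(correlation(shops), 3))
print("four-star shops:", round(correlation(shops, stars=4), 3))
```

Across all shops the two scores are generated independently, so their correlation should be close to zero. Within a single star rating the sum of the scores is restricted to a narrow band, so a friendlier shop must have lower quality to land in the same band, and a clearly negative correlation appears.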

Data product: Hairdresser Example (pdf)

GitHub repository: https://github.com/DigitalCausalityLab/hairdresser_example

Data Example: Addiction Research

Participants: Mattes Grundmann, Oya Bazer, Jakob Zschocke
Abstract

While Randomized Controlled Trials (RCTs) are the gold standard for causal inference, they are often not feasible in addiction research for ethical and logistical reasons, for example when studying the impact of smoking on cancer. Instead, observational data from real-world settings are increasingly being used to inform clinical decisions and public health policies. This paper presents the potential outcomes framework for causal inference and summarizes best practices in causal analysis of observational data, among them Matching, Inverse Probability Weighting (IPW), and Interrupted Time-Series Analysis (ITSA). These methods are explained using examples from addiction research, and their results are compared.
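To make the idea behind Inverse Probability Weighting concrete, here is a minimal Python sketch on synthetic data. It is not taken from the notebook: the single binary confounder, the treatment probabilities, the true effect of 2.0, and all names are assumptions chosen for illustration, and the data do not represent any real addiction study.

```python
import random


def simulate_observational(n=5000, seed=0):
    """Synthetic rows (x, t, y): confounder x drives both treatment t and outcome y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        p_treat = 0.8 if x == 1 else 0.2   # assumed treatment mechanism
        t = 1 if rng.random() < p_treat else 0
        y = 2.0 * t + 3.0 * x + rng.gauss(0.0, 1.0)  # assumed true effect: 2.0
        rows.append((x, t, y))
    return rows


def naive_difference(rows):
    """Difference in mean outcomes, ignoring the confounder."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def ipw_estimate(rows):
    """Normalized (Hajek) IPW estimate of the average treatment effect."""
    # Propensity score P(T=1 | X=x), estimated as the treated share per stratum.
    propensity = {}
    for level in (0, 1):
        treatments = [t for x, t, _ in rows if x == level]
        propensity[level] = sum(treatments) / len(treatments)

    treated_sum = treated_weight = control_sum = control_weight = 0.0
    for x, t, y in rows:
        if t == 1:
            w = 1.0 / propensity[x]
            treated_sum += w * y
            treated_weight += w
        else:
            w = 1.0 / (1.0 - propensity[x])
            control_sum += w * y
            control_weight += w
    return treated_sum / treated_weight - control_sum / control_weight


rows = simulate_observational()
print("naive difference:", round(naive_difference(rows), 2))
print("IPW estimate:    ", round(ipw_estimate(rows), 2))
```

Because the confounder raises both the chance of treatment and the outcome, the naive comparison overstates the effect. Reweighting each unit by the inverse of its estimated probability of receiving the treatment it actually got balances the confounder across groups, so the IPW estimate should land near the assumed true effect.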

Data product: Addiction research notebook (html)

GitHub repository: https://github.com/DigitalCausalityLab/Addiction-Research

Causal Baseball: The Case of Reach on Error

Participants: Endrit Kameraj, Almir Memedi, Vincent Riemenschneider
Abstract

Data science in baseball is a broad field in which many different metrics and approaches have been developed, particularly in the last 20 years, to analyze and evaluate the performance of players and teams on the field. This field of data analysis in baseball is called Sabermetrics. The boundaries are not always clear-cut, and for some relationships the question of actual meaningfulness can arise. In the following example, we aim to highlight the effect of a player’s attributes, such as their speed or hitting technique, on their Reached on Error (ROE) numbers. ROE is a somewhat overlooked value in Sabermetrics when evaluating player performance. In general, an offensive player (called a hitter) receives an ROE when they reach a base that they would not have reached without a defensive error. An error can be, for example, a bad throw, bad fielding (poor ball retrieval), or dropping the ball. By definition, errors and bases reached due to errors are a product of a defensive player’s mistakes. With this data product, we want to provide empirical evidence on the following quote from mlb.com:

By definition, errors are primarily the result of a fielder making a mistake. But even with that caveat, certain players – namely speedy ground-ball hitters – are likely to record more times reached on error than the average player. (mlb.com)

Data product: Causal Baseball - The Case of Reach on Error (html)

GitHub repository: https://github.com/DigitalCausalityLab/causal_baseball

Illustration of d-Separation: Graphs and Examples

Participants: Liliana Albrecht, Fenja Sonnefeld
Abstract

This case study explains the concept of d-separation. Two variables are said to be d-separated if all paths between them are blocked (otherwise they are d-connected). Two sets of variables are said to be d-separated if each variable in the first set is d-separated from every variable in the second set.

The case study provides an overview of the rules that determine whether a path is blocked, including simple examples illustrating those rules. All rules and examples are visualised in directed acyclic graphs (DAGs). Finally, a more complex DAG is discussed to show how all backdoor paths between two variables can be closed so that they are d-separated.
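The blocking rules can also be checked mechanically. The following Python sketch is an illustration under stated assumptions, not code from the case study: it uses the standard ancestral moral graph criterion (keep only ancestors of the involved nodes, connect parents that share a child, drop edge directions, delete the conditioning set, then test connectivity). The function names and the edge-list representation are hypothetical choices for this example.

```python
from itertools import combinations


def ancestors_of(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def d_separated(edges, x, y, given=()):
    """True if x and y are d-separated by `given` in the DAG with directed `edges`.

    Assumes x and y are distinct and neither is in `given`.
    """
    given = set(given)
    parents = {}
    for source, target in edges:
        parents.setdefault(target, set()).add(source)

    # Step 1: restrict the graph to x, y, the conditioning set, and their ancestors.
    keep = ancestors_of(parents, {x, y} | given)

    # Step 2: moralize, i.e. link each child to its parents and parents to each other.
    neighbours = {node: set() for node in keep}
    for child in keep:
        child_parents = parents.get(child, set())
        for parent in child_parents:
            neighbours[parent].add(child)
            neighbours[child].add(parent)
        for p, q in combinations(child_parents, 2):
            neighbours[p].add(q)
            neighbours[q].add(p)

    # Step 3: search from x without passing through conditioned nodes.
    seen = {x}
    stack = [x]
    while stack:
        node = stack.pop()
        for nb in neighbours[node]:
            if nb == y:
                return False  # an unblocked connection exists
            if nb not in seen and nb not in given:
                seen.add(nb)
                stack.append(nb)
    return True


chain = [("A", "B"), ("B", "C")]      # A -> B -> C
collider = [("A", "B"), ("C", "B")]   # A -> B <- C
print(d_separated(chain, "A", "C"))              # chain is open unless B is given
print(d_separated(chain, "A", "C", given=["B"]))
print(d_separated(collider, "A", "C"))           # collider is blocked unless B is given
print(d_separated(collider, "A", "C", given=["B"]))
```

The four calls cover the two basic cases: a chain is open until its middle node is conditioned on, whereas a collider is blocked until it (or one of its descendants) is conditioned on.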

Data product: d-separation (html)

GitHub repository: https://github.com/DigitalCausalityLab/d-separation