Digital Causality Lab

Kick-off Data Product Phase

Philipp Bach

University of Hamburg

Outline

Digital Causality Lab


About the Digital Causality Lab

  • Digital Causality Lab (1 SWS, Oliver Schacht)
    • New - replaces the former tutorial
    • Focus on practice, implementation and tools
    • Data Literacy skills
    • Independent learning and collaboration
    • Teaching Phase and Data Product Phase
    • Wednesday, 2 pm and 3 pm, WiWi 2079
    • Questions to oliver.schacht@uni-hamburg.de

1. Teaching Phase

  • Thorough introduction of tools for causal analysis and data literacy
      1. Recap: Statistics
      2. Introduction to R
      3. Introduction to GitHub and Git
      4. Causal Inference in Practice
      5. Data Products with Quarto
  • Week 1 to 4

2. Data Product Phase

  • Independent development of a data product on causality
  • Solve a case study in a group of students
  • Creative solution: It’s all up to you, from the concept to the implementation
  • Three milestones: 1. Concept, 2. Prototype, 3. Final
  • Week 4/5 to week 13

2. Data Product Phase

  • Case Study Topics
    1. Illustration of causal phenomena
    2. Graphical approaches
    3. Real-data examples
    4. Illustration of estimation approaches
    5. Causal estimation in practice
  • List of case studies available on GitHub

Outlook

Plan for the course

Plan for the DCL data product phase

Workflow: Example Data Product

Provide an intuitive example of collider bias and show how it can be illustrated with a simple data set

  • This fits into category 1, Illustration of causal phenomena

Data product development: 3 phases

  • 1. Conception - Topic, ideas, result (🎏 Concept)

  • 2. Implementation - From basic R scripts to a data product (🎏 Prototype)

  • 3. Launch - Presentation, release, publish code (🎏 Final)

1. Conception

  • Content/topic:
    • What is the data product/case study about?
  • Choice of data product type:
    • What kind of data product is suitable to illustrate the collider bias?
    • Medium: Presentation, notebook (R Markdown, Quarto, Jupyter), blog post / Quarto notebook, R function/package, ...
    • Components: Raw data set (table), scatter plot, DAG, regression output, ...
    • Later turned into a Shiny app

1. Conception

  • Which example could be used in our data product?
    • Real data or simulated data?
    • Data generating process
  • Hairdresser collider bias example:
    • Data product type: R Markdown/Quarto Notebook with simulated data example
    • Content/topic: Do you trust a friendly hairdresser?
    • Data/DGP: Own data set simulating sample selection mechanisms
  • Ends with first-round feedback
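
A sample selection mechanism of the kind mentioned for the hairdresser example could be sketched as follows. This is only an illustration under assumptions: the function name `simulate_selection`, the uniform scores, and the threshold of 120 are hypothetical choices, not part of the course material. Friendliness and quality are independent by construction, yet they become negatively correlated once we only look at salons selected on their combined score.

```r
# Minimal sketch of a selection-based DGP (all names and numbers are illustrative assumptions).
simulate_selection <- function(n = 5000, threshold = 120, seed = 1) {
  set.seed(seed)
  friendliness <- runif(n, 0, 100)  # independent of quality by construction
  quality <- runif(n, 0, 100)
  # Selection mechanism: only salons with a high combined score enter the sample.
  selected <- (friendliness + quality) > threshold
  data.frame(friendliness, quality, selected)
}

salons <- simulate_selection()

# Correlation in the full data versus the selected subsample.
cor_all <- cor(salons$friendliness, salons$quality)
cor_selected <- with(subset(salons, selected), cor(friendliness, quality))
print(c(all = cor_all, selected = cor_selected))
```

The full-sample correlation should be close to zero, while the selected subsample should show a clearly negative correlation, which is the pattern a collider bias data product would want to make visible.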

2. Implementation

  • Start with basic R scripts
      1. Implement a DGP
      2. Basic visualization: DAGs, scatter plots, regression output
  • Iteration on example
    • Does the result reflect the goal of the case study in a concise and convincing way?
    • What can be improved?

2. Implementation

  • Improvements
    • Can we improve the code by cleaning up, using specific packages and/or speeding up calculations?
    • Can we present the content of the case study in a better way?
  • Ends with second-round feedback and prioritization

2. Implementation

# Simulated data: staff friendliness and haircut quality are drawn independently.
set.seed(123)
FreundlichkeitMitarbeiter = sample(0:100, size = 500, replace = TRUE)  # staff friendliness
QualitätHaarschnitt = sample(0:100, size = 500, replace = TRUE)        # haircut quality
# The star rating (1 to 5) depends on the sum of both scores, so it is a collider.
Summe = FreundlichkeitMitarbeiter + QualitätHaarschnitt
EinStern = ifelse(Summe <= 40, 1, 0)
ZweiSterne = ifelse(Summe > 40 & Summe <= 80, 2, 0)
DreiSterne = ifelse(Summe > 80 & Summe <= 120, 3, 0)
VierSterne = ifelse(Summe > 120 & Summe <= 160, 4, 0)
FünfSterne = ifelse(Summe > 160 & Summe <= 200, 5, 0)
# Exactly one of the five indicators is non-zero, so their sum is the star rating.
Sternebewertung = EinStern + ZweiSterne + DreiSterne + VierSterne + FünfSterne
Datensatz = data.frame(FreundlichkeitMitarbeiter, QualitätHaarschnitt, Sternebewertung)
# One subset per star rating.
Teilmenge1 = subset(Datensatz, Sternebewertung == 1)
Teilmenge2 = subset(Datensatz, Sternebewertung == 2)
Teilmenge3 = subset(Datensatz, Sternebewertung == 3)
Teilmenge4 = subset(Datensatz, Sternebewertung == 4)
Teilmenge5 = subset(Datensatz, Sternebewertung == 5)
# Regress friendliness on quality: full data set first, then within each star rating.
RegressionDatensatz = lm(FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, Datensatz)
RegressionTeilmenge1 = lm(FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, Teilmenge1)
RegressionTeilmenge2 = lm(FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, Teilmenge2)
RegressionTeilmenge3 = lm(FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, Teilmenge3)
RegressionTeilmenge4 = lm(FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, Teilmenge4)
RegressionTeilmenge5 = lm(FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, Teilmenge5)

summary(RegressionDatensatz)

Call:
lm(formula = FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, 
    data = Datensatz)

Residuals:
   Min     1Q Median     3Q    Max 
-49.89 -24.88  -1.86  25.14  50.11 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         4.986e+01  2.587e+00  19.270   <2e-16 ***
QualitätHaarschnitt 6.954e-04  4.338e-02   0.016    0.987    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 28.7 on 498 degrees of freedom
Multiple R-squared:  5.16e-07,  Adjusted R-squared:  -0.002008 
F-statistic: 0.000257 on 1 and 498 DF,  p-value: 0.9872

summary(RegressionTeilmenge1)

Call:
lm(formula = FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, 
    data = Teilmenge1)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.5070  -5.1783  -0.7614   4.6450  17.8458 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          21.7621     2.0575  10.577 7.02e-13 ***
QualitätHaarschnitt  -0.6079     0.1399  -4.344 0.000101 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.88 on 38 degrees of freedom
Multiple R-squared:  0.3318,    Adjusted R-squared:  0.3142 
F-statistic: 18.87 on 1 and 38 DF,  p-value: 0.0001006

summary(RegressionTeilmenge2)

Call:
lm(formula = FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, 
    data = Teilmenge2)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.394  -8.655   1.606   7.329  23.498 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         54.18624    1.97291   27.46   <2e-16 ***
QualitätHaarschnitt -0.73693    0.05015  -14.69   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.58 on 116 degrees of freedom
Multiple R-squared:  0.6505,    Adjusted R-squared:  0.6475 
F-statistic: 215.9 on 1 and 116 DF,  p-value: < 2.2e-16

summary(RegressionTeilmenge3)

Call:
lm(formula = FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, 
    data = Teilmenge3)

Residuals:
     Min       1Q   Median       3Q      Max 
-21.5809 -10.7194  -0.3982   9.4211  21.8870 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         95.96842    1.79823   53.37   <2e-16 ***
QualitätHaarschnitt -0.91064    0.03269  -27.86   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.55 on 165 degrees of freedom
Multiple R-squared:  0.8247,    Adjusted R-squared:  0.8236 
F-statistic:   776 on 1 and 165 DF,  p-value: < 2.2e-16

summary(RegressionTeilmenge4)

Call:
lm(formula = FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, 
    data = Teilmenge4)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.6841  -9.4502  -0.4412   8.4562  21.5013 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         135.27245    3.93111   34.41   <2e-16 ***
QualitätHaarschnitt  -0.96292    0.05203  -18.51   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.91 on 132 degrees of freedom
Multiple R-squared:  0.7218,    Adjusted R-squared:  0.7197 
F-statistic: 342.5 on 1 and 132 DF,  p-value: < 2.2e-16

summary(RegressionTeilmenge5)

Call:
lm(formula = FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, 
    data = Teilmenge5)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.2278  -4.8777   0.7657   5.1223  14.3200 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         141.3752    11.8455   11.94 1.36e-14 ***
QualitätHaarschnitt  -0.6370     0.1347   -4.73 2.93e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.635 on 39 degrees of freedom
Multiple R-squared:  0.3645,    Adjusted R-squared:  0.3482 
F-statistic: 22.37 on 1 and 39 DF,  p-value: 2.927e-05
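
As one possible answer to the clean-up question raised for the implementation phase, the script above can be condensed considerably. The following is a sketch, not the reference solution: the English variable names (`friendliness`, `quality`, `stars`, `salons`) are a renaming introduced here, and `cut()` with right-closed bins replaces the five `ifelse()` calls. Because the two `sample()` calls run in the same order with the same seed, the data should match the original script when the same random number generator is used.

```r
# Condensed sketch of the hairdresser simulation (English names are a renaming for illustration).
set.seed(123)
friendliness <- sample(0:100, size = 500, replace = TRUE)
quality <- sample(0:100, size = 500, replace = TRUE)

# cut() maps the combined score to 1-5 stars with right-closed bins: <= 40, (40, 80], ..., (160, 200].
stars <- cut(friendliness + quality, breaks = c(-Inf, 40, 80, 120, 160, 200), labels = FALSE)
salons <- data.frame(friendliness, quality, stars)

# One regression per star rating; keep only the slope on quality.
slopes <- sapply(split(salons, salons$stars), function(d) {
  unname(coef(lm(friendliness ~ quality, data = d))["quality"])
})
print(round(slopes, 3))
```

Within every star rating the slope should be negative, even though friendliness and quality are unrelated in the full data set. Looping over `split()` also removes the need for five separately named subsets and models.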

2. Implementation

  • Collider bias example:
    • Visualization with base R plot
# Scatter plots with fitted regression lines. The axes follow the regression formula
# (friendliness on the y-axis, quality on the x-axis) so that abline() draws the fitted line correctly.
par(pch = 16)
plot(FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, data = Datensatz,
     xlab = "Haircut quality", ylab = "Staff friendliness", main = "All hair salons")
abline(RegressionDatensatz, col = "red")

plot(FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, data = Teilmenge1, xlim = c(0, 100), ylim = c(0, 100),
     xlab = "Haircut quality", ylab = "Staff friendliness", main = "Star rating 1")
abline(RegressionTeilmenge1, col = "red")

plot(FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, data = Teilmenge2, xlim = c(0, 100), ylim = c(0, 100),
     xlab = "Haircut quality", ylab = "Staff friendliness", main = "Star rating 2")
abline(RegressionTeilmenge2, col = "red")

plot(FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, data = Teilmenge3, xlim = c(0, 100), ylim = c(0, 100),
     xlab = "Haircut quality", ylab = "Staff friendliness", main = "Star rating 3")
abline(RegressionTeilmenge3, col = "red")

plot(FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, data = Teilmenge4, xlim = c(0, 100), ylim = c(0, 100),
     xlab = "Haircut quality", ylab = "Staff friendliness", main = "Star rating 4")
abline(RegressionTeilmenge4, col = "red")

plot(FreundlichkeitMitarbeiter ~ QualitätHaarschnitt, data = Teilmenge5, xlim = c(0, 100), ylim = c(0, 100),
     xlab = "Haircut quality", ylab = "Staff friendliness", main = "Star rating 5")
abline(RegressionTeilmenge5, col = "red")

2. Implementation

  • Hairdresser example:
    • Improvement: Customize visualization, integrate in Shiny app, ...
    • Add DAGs
    • Prioritization
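
Before drawing a DAG with a dedicated package, it can help to write the graph down as a plain edge list. The sketch below assumes the hairdresser DAG has the form Friendliness -> StarRating <- Quality; the node labels and the helper `find_colliders` are hypothetical names introduced for illustration. It uses base R only and flags every node with at least two incoming arrows.

```r
# Hypothetical edge list for the hairdresser DAG: Friendliness -> StarRating <- Quality.
edges <- data.frame(
  from = c("Friendliness", "Quality"),
  to = c("StarRating", "StarRating"),
  stringsAsFactors = FALSE
)

# A node is a collider on the path between two of its parents,
# so every node with at least two incoming arrows is flagged here.
find_colliders <- function(edges) {
  parent_counts <- table(edges$to)
  names(parent_counts)[as.vector(parent_counts) >= 2]
}

colliders <- find_colliders(edges)
print(colliders)
```

The same edge list could later be handed to a DAG plotting package of your choice; the point here is only that the star rating is the single collider in this graph, which is why conditioning on it induces the negative slopes.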

3. Launch

  • Incorporate feedback (from first and second round)

  • Develop final version of the data product

  • Present data product in lecture

  • Deploy/launch product, e.g. in DCL Gallery

  • Critical reflection and feedback

3. Launch

  • Hairdresser example:
    • Adjust code to work in Shiny app
    • Optimize presentation (figures, text description)
    • Improve speed
    • Deploy Shiny app with Docker and Heroku (technical)
    • Integrate in DCL website
    • Publish source code on GitHub
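
Adjusting the code to work in a Shiny app mostly means moving the simulation into a function and calling it from a reactive plot. The sketch below is a minimal illustration under assumptions, not the deployed DCL app: the helper `simulate_salons`, the input IDs `n` and `stars`, and the layout are hypothetical choices. The app object is only built when the `shiny` package is installed.

```r
# Plain R helper so the simulation can be tested without Shiny (name is a hypothetical choice).
simulate_salons <- function(n = 500, seed = 123) {
  set.seed(seed)
  friendliness <- sample(0:100, size = n, replace = TRUE)
  quality <- sample(0:100, size = n, replace = TRUE)
  stars <- cut(friendliness + quality, breaks = c(-Inf, 40, 80, 120, 160, 200), labels = FALSE)
  data.frame(friendliness, quality, stars)
}

# Build the app only if the shiny package is available.
if (requireNamespace("shiny", quietly = TRUE)) {
  ui <- shiny::fluidPage(
    shiny::titlePanel("Do you trust a friendly hairdresser?"),
    shiny::sliderInput("n", "Number of salons", min = 100, max = 2000, value = 500, step = 100),
    shiny::selectInput("stars", "Star rating", choices = 1:5, selected = 3),
    shiny::plotOutput("scatter")
  )
  server <- function(input, output) {
    output$scatter <- shiny::renderPlot({
      # Keep only salons with the chosen star rating (conditioning on the collider).
      d <- subset(simulate_salons(input$n), stars == as.integer(input$stars))
      plot(friendliness ~ quality, data = d, xlim = c(0, 100), ylim = c(0, 100), pch = 16)
      if (nrow(d) > 1) abline(lm(friendliness ~ quality, data = d), col = "red")
    })
  }
  app <- shiny::shinyApp(ui, server)
  # Start it from an interactive session with: shiny::runApp(app)
}
```

Keeping the data generating process in a plain function separates the statistical logic from the user interface, which makes it easier to test and to speed up before deployment.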

Data product development: 3 phases

  • The workflow will depend on your case study

  • The most important step is to understand what the data product is all about

  • Be creative and try to find a way that illustrates the core idea of your case study

  • Read our notebooks, books, blogs, package documentation and proceed step-by-step

  • Collaborate with your colleagues, ask for help (we are here to help)

  • Manage expectations: We know that the time is limited and you have to learn a lot about the data product tools

Grading

  • You will receive a bonus of 0.3 for your data product if
    • You present your data product in the final session
    • You publish your data product on GitHub
    • Your data product contains an Abstract in the README.md file
  • Groups of up to 3 students are possible

Possible Topics

  • List of case studies available on GitHub
  1. Fisher’s Tea Cupping Example 🍵 (e.g., Chapter 3 in Ding (2023))

  2. Illustration of Omitted Variable Bias Example, e.g. effect of schooling on earnings

  3. Illustration of Selection-into-Treatment Mechanisms, based on this notebook

  4. Illustration of Fundamental Problem of Causal Inference & the Causal Estimation Problem, based on a simulated data example

  5. DAG examples, based on Cinelli et al. (2022), e.g. extending in terms of data examples
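
To give an idea of what a simulated case study from this list might look like, here is a sketch for the omitted variable bias topic (schooling and earnings). Everything in it is an assumption made for illustration: the variable `ability` as the omitted confounder, the coefficients, and the sample size. It compares a short regression that omits ability with a long regression that includes it, and checks the omitted variable bias formula numerically.

```r
# Illustrative OVB simulation; all coefficients are assumptions chosen for this sketch.
set.seed(42)
n <- 10000
ability <- rnorm(n)                                   # confounder, omitted in the short regression
schooling <- 12 + 2 * ability + rnorm(n)              # ability raises schooling
log_earnings <- 1 + 0.10 * schooling + 0.30 * ability + rnorm(n, sd = 0.5)

short <- lm(log_earnings ~ schooling)                 # omits ability
long <- lm(log_earnings ~ schooling + ability)        # includes ability

# OVB formula: short coefficient = long coefficient + (coefficient on ability) * delta,
# where delta is the slope from regressing ability on schooling.
delta <- coef(lm(ability ~ schooling))[["schooling"]]
implied_short <- coef(long)[["schooling"]] + coef(long)[["ability"]] * delta

print(c(short = coef(short)[["schooling"]],
        long = coef(long)[["schooling"]],
        implied = implied_short))
```

The short regression should overstate the return to schooling, and the value implied by the formula should agree with the short coefficient up to floating point error, since the decomposition holds exactly in sample for OLS.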

Possible Topics

  1. Illustration of Estimation Approaches, with simulated or real data

    • Inverse Probability Weighting
    • Propensity Score Matching
    • Doubly Robust Estimation
    • Linear Regression
    • Instrumental Variables
  2. Lalonde Data Example, based on Chapter 5 in Mixtape

  3. Data Examples, Great source for examples with already available code:
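
Returning to the estimation approaches in the first item of this list: as a starting point for such a case study, inverse probability weighting can be sketched on simulated data with base R. This is an illustration under assumptions, not a recommended implementation: the data generating process, the true effect of 1, and the use of normalized weights are choices made here.

```r
# Illustrative IPW simulation; the DGP and the true effect of 1 are assumptions for this sketch.
set.seed(1)
n <- 20000
x <- rnorm(n)                              # observed confounder
d <- rbinom(n, 1, plogis(0.8 * x))         # treatment is more likely for large x
y <- 1 * d + 2 * x + rnorm(n)              # true average treatment effect is 1

# Naive comparison of means ignores that treated units have larger x.
naive <- mean(y[d == 1]) - mean(y[d == 0])

# Estimate the propensity score with a logistic regression.
ps <- fitted(glm(d ~ x, family = binomial()))

# Weight each unit by the inverse probability of the treatment it actually received
# and compare weighted means (normalized weights).
w <- ifelse(d == 1, 1 / ps, 1 / (1 - ps))
ipw <- weighted.mean(y[d == 1], w[d == 1]) - weighted.mean(y[d == 0], w[d == 0])

print(c(naive = naive, ipw = ipw))
```

The naive difference in means should be well above the true effect, while the weighted estimate should land close to 1. A data product could extend this by comparing it with matching, regression, or doubly robust estimators on the same simulated data.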

Possible Topics

  1. Causality and LLMs (like ChatGPT)
  2. Your own topics / ideas, they can also be experimental
    • Sports statistics
    • Causal inference in social media
    • Causal inference in marketing