Tutorial 1 - Recap: Statistics

Author

Digital Causality Lab

Published

April 5, 2023

Outline

Hernán and Robins (2020) summarize what is probably the most important question of causal inference as follows:

“The question is then under which conditions real world data can be used for causal inference”

In a particular situation (as in our case study example from the lecture), we want to know whether a treatment \(A\) has a causal effect on an outcome \(Y\). In practice, we cannot observe counterfactuals and have to rely on data from randomized experiments. In many cases, even that is not possible and we have to work with observational data.

In any case, we need statistical tools to estimate the average causal effect, and we need a good understanding of the assumptions that must be satisfied to establish causality. This is why we recap the basics of statistics in our first tutorial.

To prepare for the tutorial, you may review your notes from introductory statistics (“Statistik für Betriebswirte I & II”). Please also read the additional reading suggested at the end of this problem set.

As this problem set is quite extensive, please do the following:

Recap the statistical concepts in Exercise 1 and Exercise 2
Try to solve the following exercises on your own
- Exercise 1, c), f), g)
- Exercise 2, b), c), d)
- Exercise 3

1 Exercise 1 - Probabilities

Define the requirements of a probability measure.
Provide the definition of the law of total probability. Draw a Venn diagram to illustrate the intuition behind it.
Write down Bayes’ rule to compute conditional probabilities.
Provide a definition of stochastic independence for events.
Define a random variable. What is the definition of stochastic independence for random variables? Can you provide examples of stochastically independent and stochastically dependent random variables?
Use Bayes’ rule to calculate the probability of being sick given a positive diagnosis and the probability of being healthy given a negative diagnosis in the following example:

On average, 2% of the population of a developing country have tuberculosis (tbc). If the disease is present, a tbc-diagnosis (d) is positive in 95% of the cases. However, if the disease is not present, 4% of the diagnoses are incorrectly declared as positive.

Consider Table 1, which shows the relationship between gender and education level in the U.S. Calculate the following probabilities using the law of total probability and Bayes’ rule:
- \(P(\text{high school})\),
- \(P(\text{high school or female})\),
- \(P(\text{high school} \mid \text{female})\),
- \(P(\text{female} \mid \text{high school})\).

Note: You can assume that the highest degree achieved is meant here. You don’t need to include college graduates in your high school numbers.

Table 1: Frequency table, gender and educational achievement in the U.S.
Gender	Highest education achieved	Occurrence (× 100,000)
Male	Never finished high school	112
Male	High school	231
Male	College	595
Male	Graduate School	242
Female	Never finished high school	136
Female	High school	189
Female	College	763
Female	Graduate School	172
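Once you have worked through the diagnosis example and Table 1 by hand, a short script can serve as a cross-check. The sketch below is one possible way to do this, not part of the original problem set: the variable names and the helper `prob` are assumptions made for illustration, and exact fractions are used to avoid rounding issues. It applies the law of total probability and Bayes’ rule exactly as stated in the exercise.

```python
from fractions import Fraction

# Diagnosis example: all three inputs are taken from the exercise text.
p_tbc = Fraction(2, 100)               # P(tbc)
p_pos_given_tbc = Fraction(95, 100)    # P(d positive | tbc)
p_pos_given_no_tbc = Fraction(4, 100)  # P(d positive | no tbc)

# Law of total probability for a positive diagnosis.
p_pos = p_pos_given_tbc * p_tbc + p_pos_given_no_tbc * (1 - p_tbc)

# Bayes' rule for the two requested conditional probabilities.
p_tbc_given_pos = p_pos_given_tbc * p_tbc / p_pos
p_no_tbc_given_neg = (1 - p_pos_given_no_tbc) * (1 - p_tbc) / (1 - p_pos)

# Table 1: counts in units of 100,000 people.
counts = {
    ("male", "never finished high school"): 112,
    ("male", "high school"): 231,
    ("male", "college"): 595,
    ("male", "graduate school"): 242,
    ("female", "never finished high school"): 136,
    ("female", "high school"): 189,
    ("female", "college"): 763,
    ("female", "graduate school"): 172,
}
total = sum(counts.values())


def prob(event):
    """Relative frequency of all (gender, education) cells where event is True."""
    matching = sum(n for (g, e), n in counts.items() if event(g, e))
    return Fraction(matching, total)


p_hs = prob(lambda g, e: e == "high school")
p_female = prob(lambda g, e: g == "female")
p_hs_and_female = prob(lambda g, e: g == "female" and e == "high school")

# Addition rule, definition of conditional probability, and Bayes' rule.
p_hs_or_female = p_hs + p_female - p_hs_and_female
p_hs_given_female = p_hs_and_female / p_female
p_female_given_hs = p_hs_given_female * p_female / p_hs

for name, value in [
    ("P(tbc | positive)", p_tbc_given_pos),
    ("P(no tbc | negative)", p_no_tbc_given_neg),
    ("P(high school)", p_hs),
    ("P(high school or female)", p_hs_or_female),
    ("P(high school | female)", p_hs_given_female),
    ("P(female | high school)", p_female_given_hs),
]:
    print(f"{name}: {float(value):.4f}")
```

Comparing the printed values with your own results is most useful after you have attempted the exercise yourself.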

2 Exercise 2 - Expected Values, Variance and Covariance

Provide the definition of
- the expected value of a discrete random variable \(X\), \(E(X)\),
- the expected value of a continuous random variable \(X\), \(E(X)\),
- the variance of a random variable \(X\), \(Var(X)\),
- the covariance of the random variables \(X\) and \(Y\), \(Cov(X,Y)\),
- the correlation coefficient of two random variables \(X\) and \(Y\), \(\rho_{X,Y}\).
Calculate \(E(X)\), \(E(Y)\) and \(Var(X)\) for random variables \(X\) and \(Y\) with probability mass functions \(f_X(x)\) and \(f_Y(y)\): \[ f_X(x) = \begin{cases} 1/3, & \text{if } x = 0, \\ 2/3, & \text{if } x = 1, \\ 0, & \text{otherwise,} \end{cases} \] and \[ f_Y(y) = \begin{cases} 1/6, & \text{if } y = 0, \\ 2/6, & \text{if } y = 1, \\ 3/6, & \text{if } y = 2, \\ 0, & \text{otherwise.} \end{cases} \] Are these discrete or continuous random variables?
Consider the probabilities in Table 2 for additional random variables \(X\) and \(Y\) (values for \(X\) depicted in rows, values for \(Y\) in columns) and
- decide whether \(X\) and \(Y\) are stochastically independent,
- calculate \(Cov(X,Y)\) and \(\rho_{X,Y}\).

Table 2: Probability table for random variables \(X\) (rows) and \(Y\) (columns).
	1	3	10
2	0.05	0.03	0.02
4	0.20	0.10	0.05
6	0.20	0.25	0.10
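The calculations for Table 2 follow directly from the definitions of marginal probabilities, expected value, variance and covariance. The following sketch, which is an illustration and not part of the original problem set, shows one way to check your hand calculation. The variable names are assumptions, and floating-point comparisons use a small tolerance.

```python
import math

# Values and joint probabilities from Table 2 (rows: X, columns: Y).
x_values = [2, 4, 6]
y_values = [1, 3, 10]
joint = [
    [0.05, 0.03, 0.02],
    [0.20, 0.10, 0.05],
    [0.20, 0.25, 0.10],
]

# Marginal distributions: sum over rows for X, over columns for Y.
p_x = [sum(row) for row in joint]
p_y = [sum(joint[i][j] for i in range(len(x_values))) for j in range(len(y_values))]

# Independence requires P(X=x, Y=y) = P(X=x) * P(Y=y) in every cell.
independent = all(
    math.isclose(joint[i][j], p_x[i] * p_y[j], abs_tol=1e-12)
    for i in range(len(x_values))
    for j in range(len(y_values))
)

# Moments computed from the definitions.
e_x = sum(x * p for x, p in zip(x_values, p_x))
e_y = sum(y * p for y, p in zip(y_values, p_y))
e_xy = sum(
    x_values[i] * y_values[j] * joint[i][j]
    for i in range(len(x_values))
    for j in range(len(y_values))
)
var_x = sum(x**2 * p for x, p in zip(x_values, p_x)) - e_x**2
var_y = sum(y**2 * p for y, p in zip(y_values, p_y)) - e_y**2

cov_xy = e_xy - e_x * e_y
rho_xy = cov_xy / math.sqrt(var_x * var_y)

print(f"independent: {independent}")
print(f"Cov(X, Y) = {cov_xy:.4f}, rho = {rho_xy:.4f}")
```

Checking a single cell against the product of its marginals is already enough to reject independence; the script checks all cells only for completeness.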

Show that whenever \(X\) and \(Y\) are independent, then \(Cov(X,Y) = \rho_{X,Y} = 0\).
- Hint: Use \(P_{X,Y}(X=x,Y=y) = P_X(X=x)\cdot P_Y(Y=y)\) to show that \(E(X\cdot Y)=E(X)\cdot E(Y)\) under stochastic independence.
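The hint can be illustrated numerically before attempting the general proof. The sketch below is an illustration under stated assumptions: it reuses the probability mass functions \(f_X\) and \(f_Y\) from earlier in this exercise and assumes that \(X\) and \(Y\) are independent, so that the joint distribution is the product of the marginals. It does not replace the proof, which has to hold for arbitrary independent random variables.

```python
from fractions import Fraction
from itertools import product

# Marginal probability mass functions from Exercise 2.
f_x = {0: Fraction(1, 3), 1: Fraction(2, 3)}
f_y = {0: Fraction(1, 6), 1: Fraction(2, 6), 2: Fraction(3, 6)}

# Assumption for this illustration: X and Y are independent,
# so every joint probability is the product of the marginals.
joint = {(x, y): f_x[x] * f_y[y] for x, y in product(f_x, f_y)}

e_x = sum(x * p for x, p in f_x.items())
e_y = sum(y * p for y, p in f_y.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())

cov_xy = e_xy - e_x * e_y

print(f"E(X) = {e_x}, E(Y) = {e_y}, E(XY) = {e_xy}, Cov(X, Y) = {cov_xy}")
```

Because exact fractions are used, the covariance comes out as exactly zero and not merely as a small floating-point number.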

3 Exercise 3 - A Prior to Causality

It is important to keep in mind the difference between association and causality. Recalling the definition of causality, what seems wrong to you in the following examples?

“Data show that income and marriage have a high positive correlation. Therefore, your earnings will increase if you get married.”
“A study reports that there is a zero correlation between two variables \(A\) and \(Y\). Hence, there is no causal effect of \(A\) on \(Y\).”
“Data show that as the number of fires increases, so does the number of fire fighters. Therefore, to cut down on fires, you should reduce the number of fire fighters.”
“A study reports that there is a zero correlation between two variables \(A\) and \(Y\). Hence, \(A\) and \(Y\) are independent of each other.”
“Data show that people who hurry tend to be late to their meetings. Don’t hurry, or you’ll be late.”
“A study reports that there is a positive correlation between variables \(A\) and \(Y\). Hence, \(A\) has a causal effect on \(Y\).”
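When thinking about the statements that involve zero correlation, it can help to have a concrete pair of random variables in mind. The following sketch uses a constructed example that is not taken from the tutorial: \(X\) is uniform on \(\{-1, 0, 1\}\) and \(Y = X^2\). These choices are assumptions made purely for illustration.

```python
from fractions import Fraction

# Constructed example: X uniform on {-1, 0, 1}, and Y = X**2.
third = Fraction(1, 3)
f_x = {-1: third, 0: third, 1: third}

# Y is a deterministic function of X, so the joint pmf has one cell per x.
joint = {(x, x**2): p for x, p in f_x.items()}

# Marginal pmf of Y.
f_y = {}
for (x, y), p in joint.items():
    f_y[y] = f_y.get(y, 0) + p

e_x = sum(x * p for x, p in f_x.items())
e_y = sum(y * p for y, p in f_y.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
cov_xy = e_xy - e_x * e_y

# Dependence: at least one cell differs from the product of the marginals.
dependent = any(
    joint.get((x, y), 0) != f_x[x] * f_y[y] for x in f_x for y in f_y
)

print(f"Cov(X, Y) = {cov_xy}, dependent: {dependent}")
```

In this example, \(Y\) is fully determined by \(X\), yet the covariance is zero because covariance only measures linear association.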

4 Additional Reading

Parts of this tutorial are based on Chapter 1.3 of Glymour, Pearl, and Jewell (2016). You might read Chapter 1 completely to develop some intuition about the topic. A very accessible recap of probability and regression is also available in Chapter 2 of Cunningham (2021).

References

Cunningham, Scott. 2021. Causal Inference: The Mixtape. Yale University Press. https://mixtape.scunning.com/.

Glymour, Madelyn, Judea Pearl, and Nicholas P Jewell. 2016. Causal Inference in Statistics: A Primer. John Wiley & Sons. http://bayes.cs.ucla.edu/PRIMER/.

Hernán, Miguel A, and James M Robins. 2020. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/.