Gender | Highest education achieved | Occurrence (x 100,000) |
---|---|---|
Male | Never finished high school | 112 |
Male | High school | 231 |
Male | College | 595 |
Male | Graduate School | 242 |
Female | Never finished high school | 136 |
Female | High school | 189 |
Female | College | 763 |
Female | Graduate School | 172 |
Tutorial 1 - Recap: Statistics
Outline
Hernán and Robins (2020) summarize what is probably the most important question of causal inference as follows:
“The question is then under which conditions real world data can be used for causal inference”
In a particular situation (as in our case study example from the lecture), we want to know whether a treatment has a causal effect on an outcome.
In any case, we need statistical tools to estimate the average causal effect, and we therefore need a good understanding of the assumptions that must be satisfied to establish causality. This is why we will recap the basics of statistics in our first tutorial.
You may review your notes from introductory statistics (“Statistik für Betriebswirte I & II”) to prepare for the tutorial. Please read the suggested additional reading listed in Section 4 (Additional Reading) of this problem set.
As this problem set is quite extensive, please do the following:
- Recap the statistical concepts in Exercise 1 and Exercise 2
- Try to solve the following exercises on your own
- Exercise 1, c), f), g)
- Exercise 2, b), c), d)
- Exercise 3
1 Exercise 1 - Probabilities
a) Define the requirements of a probability measure.
b) Provide the definition of the law of total probability. Draw a Venn diagram to illustrate the intuition behind it.
c) Write down Bayes’ rule to compute conditional probabilities.
d) Provide a definition of stochastic independence for events.
e) Define a random variable. What is the definition of stochastic independence for random variables? Can you provide examples of stochastically independent and stochastically dependent random variables?
f) Use Bayes’ rule to calculate the probability of being sick given a positive diagnosis and the probability of being healthy given a negative diagnosis in the following example:
On average, 2% of the population of a developing country have tuberculosis (tbc). If the disease is present, a tbc diagnosis (d) is positive in 95% of cases. However, if the disease is not present, 4% of the diagnoses are incorrectly declared positive.
g) Consider Table 1, which shows the relationship between gender and education level in the U.S. Calculate the following probabilities using the law of total probability and Bayes’ rule:
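$P(\text{high school})$, $P(\text{female})$, $P(\text{high school} \mid \text{female})$, $P(\text{female} \mid \text{high school})$.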
Note: You can assume that the highest degree achieved is meant here. You do not need to include college graduates in your high school numbers.
2 Exercise 2 - Expected Values, Variance and Covariance
a) Provide the definition of
- the expected value $E[X]$ of a discrete random variable $X$,
- the expected value $E[X]$ of a continuous random variable $X$,
- the variance $\mathrm{Var}(X)$ of a random variable $X$,
- the covariance $\mathrm{Cov}(X, Y)$ of the random variables $X$ and $Y$,
- the correlation coefficient $\rho_{X,Y}$ of two random variables $X$ and $Y$.
b) Calculate the expected value and the variance of each of the two random variables from their probability mass functions. Are these discrete or continuous random variables?
c) Consider the probabilities in Table 2 for additional random variables $X$ and $Y$ (values for $X$ depicted in rows, values for $Y$ in columns) and
- decide whether $X$ and $Y$ are stochastically independent,
- calculate $\mathrm{Cov}(X, Y)$ and $\rho_{X,Y}$.
d) Show that whenever $X$ and $Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$.
- Hint: Use $\mathrm{Cov}(X, Y) = E[XY] - E[X]\,E[Y]$ and show that $E[XY] = E[X]\,E[Y]$ under stochastic independence.
3 Exercise 3 - A Prior to Causality
It is important to keep in mind the difference between association and causality. Recalling the definition of causality, what seems wrong to you in the following examples?
- “Data show that income and marriage have a high positive correlation. Therefore, your earnings will increase if you get married.”
- “A study reports that there is a zero correlation between two variables $X$ and $Y$. Hence, there is no causal effect of $X$ on $Y$.”
- “Data show that as the number of fires increases, so does the number of firefighters. Therefore, to cut down on fires, you should reduce the number of firefighters.”
- “A study reports that there is a zero correlation between two variables $X$ and $Y$. Hence, $X$ and $Y$ are independent of each other.”
- “Data show that people who hurry tend to be late to their meetings. Don’t hurry, or you’ll be late.”
- “A study reports that there is a positive correlation between variables $X$ and $Y$. Hence, $X$ has a causal effect on $Y$.”
4 Additional Reading
Parts of this tutorial are based on Chapter 1.3 of Glymour, Pearl, and Jewell (2016). You might read Chapter 1 completely to develop some intuition about the topic. A very accessible recap of probability and regression is also available in Chapter 2 of Cunningham (2021).
5 Solution
5.1 Solution - Exercise 1
a. Probability measure
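A probability measure $P$ assigns a number $P(A)$ to every event $A \subseteq \Omega$, where $\Omega$ denotes the sample space, such that $P(\Omega) = 1$ and
- for any event $A$ we have $P(A) \geq 0$,
- for countably many events $A_1, A_2, \ldots$ with $A_i \cap A_j = \emptyset$ for $i \neq j$ we have
$$P\Big(\bigcup_{i} A_i\Big) = \sum_{i} P(A_i).$$
(This implies $P(\emptyset) = 0$ and $P(A) \leq 1$ for every event $A$.)
Example 1: Toss a fair coin. Then $\Omega = \{H, T\}$ and $P(\{H\}) = P(\{T\}) = 1/2$.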
b. Definition of the law of total probability
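For mutually exclusive events $A$ and $B$ ($A \cap B = \emptyset$), we always have $P(A \cup B) = P(A) + P(B)$.
For any two events $A$ and $B$ we have
$$P(A) = P(A \cap B) + P(A \cap B^c),$$
where $B^c$ denotes the complement of $B$ (“not $B$”).
More generally, for any set of events $B_1, \ldots, B_n$ such that exactly one of the events must be true (an exhaustive, mutually exclusive set, called a partition), we have
$$P(A) = \sum_{i=1}^{n} P(A \cap B_i).$$
Finally, the law of total probability states
$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i),$$
which follows from Bayes’ rule in the form $P(A \cap B_i) = P(A \mid B_i)\,P(B_i)$.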
c. Bayes’ rule
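- For two events $A$ and $B$ with $P(B) > 0$, Bayes’ rule states that
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)}.$$
- More events: Let $A_1, \ldots, A_n$ be a partition of $\Omega$. Bayes’ rule implies
$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)\,P(A_j)}.$$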
d. Stochastic independence
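- Two events $A$ and $B$ are said to be independent if
$$P(A \mid B) = P(A).$$
Or, equivalently (simply plug the previous definition into Bayes’ rule, $P(A \mid B) = P(A \cap B) / P(B)$),
$$P(A \cap B) = P(A)\,P(B).$$
In Example 1, are the events H and T stochastically independent? No: a single toss cannot show both sides, so $P(H \cap T) = 0 \neq 1/4 = P(H)\,P(T)$.
Further, two events $A$ and $B$ are conditionally independent given a third event $C$ if
$$P(A \cap B \mid C) = P(A \mid C)\,P(B \mid C).$$
Note: $A$ and $B$ are called marginally independent if the statement holds without conditioning on $C$.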
e. Random variables
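Let $\Omega$ be the sample space. A random variable $X$ is a function $X: \Omega \to \mathbb{R}$ that assigns a real number $X(\omega)$ to every outcome $\omega \in \Omega$.
Two random variables $X$ and $Y$ are said to be independent of each other if for every value $x$ and $y$ that $X$ and $Y$ can take we have
$$P(X = x \mid Y = y) = P(X = x).$$
Whenever $X$ and $Y$ are independent, the joint probability function/density is equal to the product of the marginal probability functions/densities for all $x, y$:
$$P(X = x, Y = y) = P(X = x)\,P(Y = y)$$
for discrete random variables and
$$f_{X,Y}(x, y) = f_X(x)\,f_Y(y)$$
for continuous random variables.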
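Example 2: Let us consider two independent coin flips (similar to Example 1). Let $X_1$ and $X_2$ equal 1 if the first and the second flip, respectively, shows heads, and 0 otherwise. Then $X_1$ and $X_2$ are stochastically independent.
Example 3: Assume that we are rolling a fair die. Let $X$ be the number shown and let $Y$ equal 1 if the number is even and 0 otherwise. Then $X$ and $Y$ are stochastically dependent, since, for instance, $P(X = 2 \mid Y = 1) = 1/3 \neq 1/6 = P(X = 2)$.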
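Independence of two discrete random variables can be checked mechanically by comparing the joint probability function with the product of the marginals. The following is a minimal sketch, not part of the original solution: the helper names `joint_pmf` and `is_independent` are hypothetical, and the concrete random variables (heads indicators for two coin flips, and the number shown versus "number is even" for a die roll) are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product


def joint_pmf(outcomes, X, Y):
    """Joint pmf of (X, Y) when all outcomes are equally likely."""
    outcomes = list(outcomes)
    p = Fraction(1, len(outcomes))
    pmf = {}
    for w in outcomes:
        key = (X(w), Y(w))
        pmf[key] = pmf.get(key, 0) + p
    return pmf


def is_independent(pmf):
    """True if P(X=x, Y=y) == P(X=x) * P(Y=y) for every pair of values."""
    px, py = {}, {}
    for (x, y), p in pmf.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return all(pmf.get((x, y), 0) == px[x] * py[y] for x in px for y in py)


# Two coin flips: X = "first flip shows heads", Y = "second flip shows heads".
flips = product("HT", repeat=2)
coins = joint_pmf(flips, lambda w: w[0] == "H", lambda w: w[1] == "H")

# One die roll: X = number shown, Y = "number is even" (assumed example).
die = joint_pmf(range(1, 7), lambda w: w, lambda w: w % 2 == 0)

print(is_independent(coins))  # True
print(is_independent(die))    # False
```

Exact fractions are used so that the equality test is not affected by floating-point rounding.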
f. Application of Bayes’ rule
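- The given probabilities are
$$P(\text{tbc}) = 0.02, \quad P(d^{+} \mid \text{tbc}) = 0.95, \quad P(d^{+} \mid \overline{\text{tbc}}) = 0.04.$$
Therefore we have, by the law of total probability,
$$P(d^{+}) = 0.95 \cdot 0.02 + 0.04 \cdot 0.98 = 0.0582,$$
and by Bayes’ rule
$$P(\text{tbc} \mid d^{+}) = \frac{0.95 \cdot 0.02}{0.0582} \approx 0.3265,$$
$$P(\overline{\text{tbc}} \mid d^{-}) = \frac{0.96 \cdot 0.98}{1 - 0.0582} = \frac{0.9408}{0.9418} \approx 0.9989.$$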
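The two posterior probabilities of the tuberculosis example can be reproduced numerically. This is a small sketch under the numbers stated in the exercise (2% prevalence, 95% positive diagnoses among the sick, 4% positive diagnoses among the healthy); the function name `posterior` and the variable names are hypothetical.

```python
def posterior(prior, likelihood, likelihood_complement):
    """P(A | B) from P(A), P(B | A) and P(B | not A) via Bayes' rule."""
    # Denominator P(B) by the law of total probability.
    evidence = likelihood * prior + likelihood_complement * (1 - prior)
    return likelihood * prior / evidence


p_tbc = 0.02                 # P(tbc)
p_pos_given_tbc = 0.95       # P(d+ | tbc)
p_pos_given_healthy = 0.04   # P(d+ | no tbc)

# P(tbc | d+): A = "sick", B = "positive diagnosis".
p_sick_given_pos = posterior(p_tbc, p_pos_given_tbc, p_pos_given_healthy)

# P(no tbc | d-): A = "healthy", B = "negative diagnosis".
p_healthy_given_neg = posterior(
    1 - p_tbc, 1 - p_pos_given_healthy, 1 - p_pos_given_tbc
)

print(round(p_sick_given_pos, 4))     # 0.3265
print(round(p_healthy_given_neg, 4))  # 0.9989
```

Even with a fairly accurate test, a positive diagnosis implies only about a one-in-three chance of being sick, because the disease is rare.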
g. Gender and education example
- We start by adding the sum of occurrences to the table to calculate the number of individuals.
Gender | Highest education achieved | Occurrence (x 100,000) |
---|---|---|
Male | Never finished high school | 112 |
Male | High school | 231 |
Male | College | 595 |
Male | Graduate School | 242 |
Female | Never finished high school | 136 |
Female | High school | 189 |
Female | College | 763 |
Female | Graduate School | 172 |
Sum | | 2440 |
- It may be helpful to represent the frequencies in a contingency table. Our variables of interest are female and high school.
Gender | High school | No high school | Sum |
---|---|---|---|
Male | 231 | 949 | 1180 |
Female | 189 | 1071 | 1260 |
Sum | 420 | 2020 | 2440 |

Dividing every entry by the total of 2440 gives the relative frequencies:

Gender | High school | No high school | Sum |
---|---|---|---|
Male | 0.0947 | 0.3889 | 0.4836 |
Female | 0.0775 | 0.4389 | 0.5164 |
Sum | 0.1721 | 0.8279 | 1.0000 |
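Let $H$ denote the event “a person’s highest education achieved is high school” and $F$ denote the event “a person is female”. Then we can calculate, using the law of total probability,
$$P(H) = P(H \cap F) + P(H \cap F^c) = \frac{189}{2440} + \frac{231}{2440} = \frac{420}{2440} \approx 0.1721.$$
We calculate
$$P(F) = P(F \cap H) + P(F \cap H^c) = \frac{189}{2440} + \frac{1071}{2440} = \frac{1260}{2440} \approx 0.5164.$$
We use Bayes’ rule to calculate
$$P(H \mid F) = \frac{P(H \cap F)}{P(F)} = \frac{189}{1260} = 0.15.$$
We use Bayes’ rule to calculate
$$P(F \mid H) = \frac{P(H \mid F)\,P(F)}{P(H)} = \frac{189}{420} = 0.45.$$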
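The probabilities for the gender and education example can be cross-checked directly from the counts in Table 1. The sketch below is only illustrative: the helper `prob` and the names `p_H`, `p_F`, `p_HF` are hypothetical, and `H` ("highest education is high school") and `F` ("female") are the events assumed in this solution.

```python
from fractions import Fraction

# Counts (x 100,000) from Table 1, keyed by (gender, highest education).
counts = {
    ("Male", "Never finished high school"): 112,
    ("Male", "High school"): 231,
    ("Male", "College"): 595,
    ("Male", "Graduate School"): 242,
    ("Female", "Never finished high school"): 136,
    ("Female", "High school"): 189,
    ("Female", "College"): 763,
    ("Female", "Graduate School"): 172,
}
total = sum(counts.values())  # 2440


def prob(event):
    """Probability of all cells for which event(gender, education) is true."""
    hits = sum(n for (g, e), n in counts.items() if event(g, e))
    return Fraction(hits, total)


p_F = prob(lambda g, e: g == "Female")
p_H = prob(lambda g, e: e == "High school")
p_HF = prob(lambda g, e: g == "Female" and e == "High school")

print(round(float(p_H), 4), round(float(p_F), 4))  # 0.1721 0.5164
print(p_HF / p_F)  # P(H | F) = 3/20
print(p_HF / p_H)  # P(F | H) = 9/20
```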
5.2 Solution - Exercise 2
a. Definitions
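The definitions asked for in Exercise 2 a) are the standard textbook ones. They are sketched here in an assumed notation ($E[\cdot]$, $\mathrm{Var}$, $\mathrm{Cov}$, $\rho_{X,Y}$ and the density $f_X$ are notational choices for this recap, not taken from the original solution):

$$
\begin{aligned}
E[X] &= \sum_{x} x \, P(X = x) && \text{(discrete } X\text{)} \\
E[X] &= \int_{-\infty}^{\infty} x \, f_X(x) \, dx && \text{(continuous } X \text{ with density } f_X\text{)} \\
\mathrm{Var}(X) &= E\big[(X - E[X])^2\big] = E[X^2] - (E[X])^2 \\
\mathrm{Cov}(X, Y) &= E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]\,E[Y] \\
\rho_{X,Y} &= \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}
\end{aligned}
$$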
b. Expectations and variance
Calculate
c. Stochastic independence
Consider the following probability table showing the joint probability function for random variables $X$ (rows) and $Y$ (columns).
X (rows), Y (columns) | 1 | 3 | 10 |
---|---|---|---|
2 | 0.05 | 0.03 | 0.02 |
4 | 0.20 | 0.10 | 0.05 |
6 | 0.20 | 0.25 | 0.10 |
- First we calculate the marginal probabilities of $X$ and $Y$ as row and column sums.

X (rows), Y (columns) | 1 | 3 | 10 | P(X=x) |
---|---|---|---|---|
2 | 0.05 | 0.03 | 0.02 | 0.10 |
4 | 0.20 | 0.10 | 0.05 | 0.35 |
6 | 0.20 | 0.25 | 0.10 | 0.55 |
P(Y=y) | 0.45 | 0.38 | 0.17 | 1.00 |
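$X$ and $Y$ are not stochastically independent since, for instance, $P(X = 2, Y = 1) = 0.05 \neq 0.045 = P(X = 2)\,P(Y = 1)$.
With $E[X] = 4.9$, $E[Y] = 3.29$ and $E[XY] = 16.38$ we obtain $\mathrm{Cov}(X, Y) = 16.38 - 4.9 \cdot 3.29 = 0.259$. The correlation between $X$ and $Y$ is
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{0.259}{\sqrt{1.79 \cdot 10.0459}} \approx 0.061.$$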
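The marginal probabilities, the independence check, and the covariance and correlation for the joint probability table can be recomputed in a few lines. This is a minimal sketch using only the table entries; the variable names are hypothetical, and $X$ is taken to be the row variable and $Y$ the column variable.

```python
from math import isclose, sqrt

x_vals = [2, 4, 6]    # values of X (rows)
y_vals = [1, 3, 10]   # values of Y (columns)
joint = [
    [0.05, 0.03, 0.02],
    [0.20, 0.10, 0.05],
    [0.20, 0.25, 0.10],
]

# Marginals are row sums (X) and column sums (Y).
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

# Independence requires joint == product of marginals in every cell.
independent = all(
    isclose(joint[i][j], p_x[i] * p_y[j])
    for i in range(len(x_vals))
    for j in range(len(y_vals))
)

e_x = sum(x * p for x, p in zip(x_vals, p_x))
e_y = sum(y * p for y, p in zip(y_vals, p_y))
e_xy = sum(
    x * y * joint[i][j]
    for i, x in enumerate(x_vals)
    for j, y in enumerate(y_vals)
)
var_x = sum(x**2 * p for x, p in zip(x_vals, p_x)) - e_x**2
var_y = sum(y**2 * p for y, p in zip(y_vals, p_y)) - e_y**2

cov = e_xy - e_x * e_y
corr = cov / sqrt(var_x * var_y)

print(independent)     # False
print(round(cov, 3))   # 0.259
print(round(corr, 3))  # 0.061
```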
d. Independence and correlation
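Show that whenever $X$ and $Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$.
Whenever $X$ and $Y$ are independent, the joint probability function factorizes, $P(X = x, Y = y) = P(X = x)\,P(Y = y)$. For discrete random variables (the continuous case is analogous, with integrals and densities) this gives
$$E[XY] = \sum_{x} \sum_{y} x\,y\,P(X = x)\,P(Y = y) = \Big(\sum_{x} x\,P(X = x)\Big)\Big(\sum_{y} y\,P(Y = y)\Big) = E[X]\,E[Y].$$
It follows immediately that
$$\mathrm{Cov}(X, Y) = E[XY] - E[X]\,E[Y] = 0,$$
and therefore also $\rho_{X,Y} = 0$.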
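The result that independence implies zero covariance can also be illustrated numerically. In the sketch below, a joint probability function is constructed as the product of two marginals, which makes the variables independent by construction. The marginals are the ones from the table in part c, reused purely as an assumed example, and exact fractions avoid rounding effects.

```python
from fractions import Fraction as F

x_vals, y_vals = [2, 4, 6], [1, 3, 10]
p_x = [F(10, 100), F(35, 100), F(55, 100)]  # marginal of X
p_y = [F(45, 100), F(38, 100), F(17, 100)]  # marginal of Y

# Under independence the joint pmf is the product of the marginals.
joint = [[px * py for py in p_y] for px in p_x]

e_x = sum(x * p for x, p in zip(x_vals, p_x))
e_y = sum(y * p for y, p in zip(y_vals, p_y))
e_xy = sum(
    x * y * joint[i][j]
    for i, x in enumerate(x_vals)
    for j, y in enumerate(y_vals)
)

print(e_xy == e_x * e_y)  # True
print(e_xy - e_x * e_y)   # 0
```

The converse does not hold in general: a covariance of zero does not by itself imply independence.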