Well, technically I didn't apply the technique to solve a problem myself, as I served as a mentor in this case. Long story short, almost a year ago I was assigned to a project which aimed to identify the top attributes associated with a target. Eventually, we found a handful of variables, and our client wanted to drill down on one of them specifically. Based on the sample data, I checked the association between that variable and the target, and the results were inconclusive. The project was then put on hold, but it lingered in my mind. I then learned that there is an active research field called Causal Inference, which seems to provide systematic ways to answer my client's core question: if we intervene on variable X, will it cause a change in the outcome Y (whether positive or negative)? Causal inference sounded like a perfect tool for this kind of question, so I framed a research question and asked an intern to work on it. It turned out to be a bit more challenging than I originally thought, though. I'll try to summarize the basics we've covered for this project, which I believe are the stepping stones to advanced topics in this field. Hopefully, I'll be able to finish grasping the main ideas from ECI then!

So, What Is Causal Inference?

Per Wikipedia, causal inference...

is the process of drawing a conclusion about a causal connection based on the conditions of the occurrence of an effect.

It could be described in many different ways under various contexts, but to me, it means: can we say that treatment causes a change in the outcome?

  • Treatment: the intervention that we take.
    • For instance, whether or not to arrange a heart transplant for a patient; whether or not to change the wording of a homepage. These treatments are binary.
    • If there are multiple treatments within a timeframe (e.g. yearly follow-ups with AIDS patients over 5 years, each time providing a different dose of medicine), they are considered time-varying treatments, which I have had no exposure to so far (the Causal Inference book dedicates a whole section to them, and I found it quite difficult to follow).
    • I wonder if it's possible to have continuous treatments... it would be quite hard to do, especially in an experimental setting? Multi-level discrete treatments are more likely, but fortunately, as this is my first project involving causal inference, I just need to deal with a binary treatment.
  • Outcome: the end result of a study/an experiment/something that we are measuring.
    • For the 2 examples that I've shown above, their outcomes could both be "success" (whether a patient survived or not; increase in revenue/CTR).
    • Outcome could be discrete (Yes/No) or continuous (e.g. change in revenue).

Randomized Experiments vs. Observational Study

If a causal analysis is performed in a controlled experiment setting, that would be fantastic. I'm not an expert in experimental design, but randomized experiments really help (they also make it easier to check the 3 identifiability conditions described below). In reality, though, it could be that we are simply collecting a snapshot of data retrospectively. These data would be considered observational data, as they were passively captured. My impression is that if one would like to draw a conclusion from observational data, he/she should gather a large enough dataset (here comes another question: what do I mean by "large enough"? Well, it depends) with a handful of measured variables. Even then, causal inference is built upon a series of assumptions, so clearly stating the assumptions and limitations of the result is crucial.
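
To make the contrast concrete, here is a minimal simulation sketch. Everything in it is made up for illustration: a single binary covariate L, a true treatment effect fixed at 1.0, and the assignment probabilities. It compares the naive treated-minus-untreated difference when treatment is assigned by coin flip with the same difference when L influences who gets treated.

```python
import random

def simulate(n, randomized, seed=0):
    """Simulate a binary covariate L, a binary treatment A and an outcome Y.

    The true causal effect of A on Y is 1.0 by construction (an assumption
    of this toy example, not a number from the project).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        l = 1 if rng.random() < 0.5 else 0
        if randomized:
            p_treat = 0.5                       # coin flip, independent of L
        else:
            p_treat = 0.8 if l == 1 else 0.2    # L drives who gets treated
        a = 1 if rng.random() < p_treat else 0
        y = 1.0 * a + 2.0 * l + rng.gauss(0, 1)  # L also drives the outcome
        rows.append((l, a, y))
    return rows

def naive_difference(rows):
    """Difference in mean outcome between treated and untreated rows."""
    treated = [y for _, a, y in rows if a == 1]
    untreated = [y for _, a, y in rows if a == 0]
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)

rct_estimate = naive_difference(simulate(20000, randomized=True))
obs_estimate = naive_difference(simulate(20000, randomized=False))
print(f"randomized:    {rct_estimate:.2f}")
print(f"observational: {obs_estimate:.2f}")
```

With these made-up parameters, the randomized estimate should land near the true effect of 1.0, while the observational one is pulled well above it, because L raises both the chance of treatment and the outcome.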

Identifiability Conditions

The following 3 conditions should be checked prior to conducting a causal analysis:

  • Exchangeability: the treated and untreated groups within each stratum should be exchangeable. For instance, had I prescribed this drug to patients who do not receive it at the moment, their responses should be the same as those of the patients who currently get the drug.
  • Positivity: each subpopulation should contain both treated and untreated instances. Say that I have no treated samples (or no untreated ones) among males who live in Boston, aged > 50, with > 10K annual income and 2 dogs; my study would then fail to meet the positivity assumption.
  • Consistency: for every sample, the observed outcome should be the same as the counterfactual outcome under the treatment that the sample actually received. Let's assume you notice that taking Aspirin reduces the risk of heart attack by 5%. This should be consistent across the board - regardless of the brand.
    • I personally find the idea of consistency a bit abstract to understand/explain. The "Causal Inference: What If" book mainly refers to different versions of treatment (e.g. physicians have their own preferences on conducting the same operation), but I still feel that something is missing, at least in my understanding. If you could find a better example to illustrate, let me know!
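
Of the three conditions, positivity is the one that can be checked directly from the data. The sketch below is one possible way to do it, reading positivity as "every covariate stratum that occurs in the data contains both treated and untreated subjects". The function name, column names and toy rows are all hypothetical.

```python
from collections import defaultdict

def positivity_violations(rows, covariates, treatment="A"):
    """Return observed covariate strata that lack a treated or an untreated subject.

    The result maps each offending stratum to the treatment levels missing in it.
    """
    seen = defaultdict(set)
    for row in rows:
        stratum = tuple(row[c] for c in covariates)
        seen[stratum].add(row[treatment])
    return {
        stratum: sorted({0, 1} - levels)
        for stratum, levels in seen.items()
        if levels != {0, 1}
    }

# Toy data (made up): the last stratum has no treated subject.
toy_rows = [
    {"sex": "F", "over_50": 0, "A": 1},
    {"sex": "F", "over_50": 0, "A": 0},
    {"sex": "M", "over_50": 0, "A": 1},
    {"sex": "M", "over_50": 0, "A": 0},
    {"sex": "M", "over_50": 1, "A": 0},
]
violations = positivity_violations(toy_rows, ["sex", "over_50"])
print(violations)
```

Strata that never appear in the data are not flagged here, since the check only looks at covariate combinations that were actually observed. Exchangeability and consistency, on the other hand, cannot be verified this way and remain assumptions.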

If dealing with observational data, I doubt that the 3 conditions could all be satisfied flawlessly. My approach, for now, is to consider them first when dealing with a causal problem and attempt to remedy any missing pieces; if that is not possible, I will highlight the failing ones and clearly state them as part of the limitations of the work.

Causal Diagram

It typically looks like this:

Image source: Causal Inference: What If, Miguel A. Hernán, James M. Robins, February 21, 2020, pp. 69

Per Wikipedia, "A causal diagram is a directed graph that displays causal relationships between variables in a causal model". I find it useful for sorting out variable relationships, especially when you have a handful of them to analyze. The one displayed above shows a simple scenario: we have a set of confounders L that affect both the treatment A and the outcome Y, and Y is also affected by A. Causal diagrams are capable of conveying more information, such as blocked paths, colliders, etc.

I didn't get to play with the dowhy package from Microsoft for this project, which requires a causal diagram as part of the input. I prefer to translate the problem into several causal graphs after verifying the identifiability conditions and finishing exploratory data analysis on the core variables.
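
As a small illustration, the simple diagram described above (L affects A and Y, A affects Y) can be written down as a plain edge list. The helper below is a hypothetical sketch, not part of dowhy or any other library: it lists the paths from the treatment to the outcome that start with an arrow pointing into the treatment, which is the usual informal description of a backdoor path. It does not check whether a path is blocked by a collider, so treat it as a way to enumerate candidates only.

```python
# Directed edges of the simple diagram: L -> A, L -> Y, A -> Y.
edges = [("L", "A"), ("L", "Y"), ("A", "Y")]

def backdoor_paths(edges, treatment, outcome):
    """List simple paths from treatment to outcome whose first edge points INTO treatment.

    Paths are followed while ignoring edge direction after the first step.
    """
    neighbours = {}
    for src, dst in edges:
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set()).add(src)
    edge_set = set(edges)
    found = []

    def walk(path):
        node = path[-1]
        if node == outcome:
            found.append(path)
            return
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in path:          # keep paths simple (no repeated nodes)
                walk(path + [nxt])

    for first in sorted(neighbours.get(treatment, ())):
        if (first, treatment) in edge_set:   # arrow into the treatment
            walk([treatment, first])
    return found

paths = backdoor_paths(edges, "A", "Y")
print(paths)
```

For this diagram the only candidate is the path A, L, Y, which is the one that adjusting for L is meant to block.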

Confounders & Effect Modifiers

The two concepts seem similar, but they are not. My take on both:

  • Confounders are variables that affect both the treatment and the outcome. Let's say that I'm interested in this newly approved medicine that claims to reduce pain by 25%. One measured factor could be pregnancy (0/1). If a patient is pregnant, I might play it safe and adjust the treatment dose compared with non-pregnant subjects; pregnancy might also affect the pain level that a patient can tolerate, thus affecting the outcome too.
    • Confounders need to be adjusted for so that we can achieve conditional exchangeability: that is, within each level of the confounders, swapping the treated and untreated subjects should give the same result. Without adjustment the conclusion will be biased, as we can no longer say that the treatment is the only variable driving the change in the outcome (in the graphical representation, the confounder opens a "backdoor path").
  • The formal definition of effect modifiers is: V is a modifier of the effect of the treatment A on the outcome Y when the average causal effect of A on Y varies across levels of V. If we find that a heart transplant improves survival more for women than for men, then gender would be considered an effect modifier.
    • I had an in-depth discussion under the comment section of this video. Back then my mindset was wrong - I thought that effect modifiers == variables that only impact the outcome. That is not correct, as I didn't take the treatment into consideration. Variables that only impact the outcome but not the treatment could be excluded from causal diagrams too, as they won't create backdoor paths between the treatment and the outcome.
  • Effect modifiers could be confounders; back to the previous example - if we took gender into consideration when assigning hearts to patients (e.g. females are 30% more likely than males to receive a heart), then it has an impact on the treatment too, and therefore it will be a confounder. They could be non-confounders as well if the treatment procedure does not involve them.
  • Confounders do not have to be effect modifiers. Say that in the heart transplant experiment, patients who are over age 50 are 30% more likely to receive hearts than those under age 50 (i.e. age > 50 affecting the treatment); elderly people may have a shorter life expectancy (i.e. age > 50 affecting the outcome). However, in this study, the average causal effect does not differ between the over-50 and under-50 age groups. If so, age > 50 will NOT be an effect modifier, but it is still a confounder.
  • Rule of thumb: adjust for confounders whenever you see them (regardless of whether there is effect modification). Effect modifiers, if they exist, are worth reporting too. But you may need to take some time to understand their meaning before sharing them with key stakeholders.
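
The distinction can be seen in a small simulation. The sketch below uses made-up numbers loosely following the heart transplant examples above: being over 50 affects both treatment assignment and the outcome (a confounder) but leaves the treatment effect unchanged, while sex changes the size of the treatment effect (an effect modifier) without affecting assignment. All variable names and parameters are assumptions for illustration only.

```python
import random

rng = random.Random(1)
rows = []
for _ in range(40000):
    over_50 = rng.random() < 0.5
    female = rng.random() < 0.5
    # over_50 affects treatment assignment; sex does not.
    a = rng.random() < (0.65 if over_50 else 0.35)
    # Treatment effect is 2.0 for women and 1.0 for men (effect modification by sex);
    # over_50 lowers the outcome but does not change the treatment effect.
    effect = 2.0 if female else 1.0
    y = effect * a - 1.5 * over_50 + rng.gauss(0, 1)
    rows.append({"over_50": over_50, "female": female, "A": a, "Y": y})

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def effect_within(rows, **stratum):
    """Treated-minus-untreated mean difference among rows matching the given covariate values."""
    subset = [r for r in rows if all(r[k] == v for k, v in stratum.items())]
    treated = mean(r["Y"] for r in subset if r["A"])
    untreated = mean(r["Y"] for r in subset if not r["A"])
    return treated - untreated

crude = effect_within(rows)                                    # ignores the confounder
effect_by_age = {v: effect_within(rows, over_50=v) for v in (False, True)}
effect_by_sex = {v: effect_within(rows, over_50=True, female=v) for v in (False, True)}

print(f"crude difference:        {crude:.2f}")
print(f"under 50 / over 50:      {effect_by_age[False]:.2f} / {effect_by_age[True]:.2f}")
print(f"men / women (over 50):   {effect_by_sex[False]:.2f} / {effect_by_sex[True]:.2f}")
```

In this toy setup the crude difference is biased because age is ignored, the two age strata show roughly the same effect (confounder, not an effect modifier), and the effect clearly differs between men and women once age is held fixed (effect modifier).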


G-Methods

G-methods are a collection of techniques for estimating generalized treatment contrasts involving treatments that vary over time. They include IP weighting, standardization, and g-estimation. For this work, we applied IP weighting & standardization to a treatment variable that does not evolve over time (i.e. the treatment was given once and done). I will summarize my understanding and the key steps of both algorithms below, but they may be incomplete for the time-varying treatment case.

IP Weighting

IP weighting is a mechanism that removes the dependency of the treatment A on the covariates L. It does so by creating a "pseudo-population". For each combination of covariate group and treatment, there are some observed outcomes. IP weighting essentially takes that binary tree and simulates two copies of the full population, one in which everyone is untreated and one in which everyone is treated. The distribution of L is thus the same in both groups. I find this concept hard to grasp. If you are dealing with a binary, one-time A and a set of L, one parametric way to carry out IP weighting is:

  1. Estimate the propensity score model, i.e. P(A=1|L). My suggestion is to start with well-understood linear models first, e.g. logistic regression.
  2. Estimate weights using the output probabilities p = P(A=1|L) from the propensity score model. Samples with A = 1 get weight 1/p; samples with A = 0 get weight 1/(1-p).
  3. Use the weights above to estimate the outcome model, i.e. E(Y|A). If using generalized linear models (GLM), you may perform weighted least squares and interpret the coefficient associated with the treatment variable.
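
The three steps can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the project code: the data are simulated with a made-up true effect of 1.0, there is a single binary covariate L, and the propensity score is estimated by stratum frequencies instead of logistic regression (with one binary covariate the two coincide). Step 3 is done as a comparison of weighted means rather than a weighted regression.

```python
import random
from collections import Counter

# Simulated observational data (all parameters are assumptions for illustration).
rng = random.Random(0)
data = []
for _ in range(20000):
    l = int(rng.random() < 0.5)
    a = int(rng.random() < (0.8 if l else 0.2))   # L influences treatment
    y = 1.0 * a + 2.0 * l + rng.gauss(0, 1)       # true effect of A is 1.0
    data.append((l, a, y))

# Step 1: propensity score P(A=1 | L), estimated here by stratum frequencies.
n_by_l = Counter(l for l, _, _ in data)
treated_by_l = Counter(l for l, a, _ in data if a == 1)
propensity = {l: treated_by_l[l] / n_by_l[l] for l in n_by_l}

# Step 2: weights are 1/p for treated rows and 1/(1-p) for untreated rows.
def ip_weight(l, a):
    p = propensity[l]
    return 1 / p if a == 1 else 1 / (1 - p)

# Step 3: weighted mean outcome in each treatment arm of the pseudo-population.
def weighted_mean(arm):
    num = sum(ip_weight(l, a) * y for l, a, y in data if a == arm)
    den = sum(ip_weight(l, a) for l, a, y in data if a == arm)
    return num / den

ipw_effect = weighted_mean(1) - weighted_mean(0)
print(f"IP-weighted effect estimate: {ipw_effect:.2f}")
```

With several or continuous covariates, the frequency table in step 1 would be replaced by a fitted model such as logistic regression; the weighting logic in steps 2 and 3 stays the same.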

This paper suggests that one may try truncating weights. The "truncation" does not mean throwing away data; rather, it caps extreme weights, which arise in unusual cases such as samples that had a high likelihood of receiving treatment (P(A=1|L) is high) but ended up not getting it (A=0), and vice versa. As an alternative to step 3, you may compare the weighted average responses of the treatment and the control groups (this should be equivalent to the coefficient obtained from step 3 above). I personally prefer IP weighting to Propensity Score Matching.
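
A minimal sketch of weight truncation, assuming the common variant in which weights are clipped at chosen percentiles (the 1st and 99th percentiles below are an arbitrary choice for illustration, and the function name is hypothetical). Every sample is kept; only its weight is capped.

```python
import math

def truncate_weights(weights, lower_pct=1.0, upper_pct=99.0):
    """Clip weights to the given lower and upper percentiles (nearest-rank method)."""
    ordered = sorted(weights)

    def percentile(pct):
        # Nearest-rank percentile: smallest value with at least pct% of data at or below it.
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]

    low, high = percentile(lower_pct), percentile(upper_pct)
    return [min(max(w, low), high) for w in weights]

# Made-up weights: most are modest, two are extreme.
example_weights = [1.1] * 97 + [1.3, 25.0, 60.0]
truncated = truncate_weights(example_weights)
print(max(example_weights), max(truncated), len(truncated))
```

Truncation trades some bias for lower variance, so it is worth reporting the chosen cut-offs next to the estimate.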


Standardization

Unlike IP weighting, which approaches from the treatment side, standardization approaches from the outcome Y side. The formula looks like this:

Image source: Causal Inference: What If, Miguel A. Hernán, James M. Robins, February 21, 2020, pp. 162

I like to think of it as a form of weighted average: we obtain the conditional expectation for each combination of treatment (A = 0 or 1) & covariates (L), then marginalize over L (i.e. "standardizing" the expectations). We will be comparing two standardized means here (for binary treatments), one with A = 1 and the other with A = 0. The causal effect estimate is their difference: the sum over l of E(Y|A=1, L=l) P(L=l), minus the sum over l of E(Y|A=0, L=l) P(L=l).
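
Written out, the standardized mean and the resulting contrast for a binary treatment take roughly the following form. This is my own rendering in the notation used in this post (A, L, Y), not a copy of the book's formula, which may carry extra conditions such as censoring.

```latex
\[
  E[Y^{a}] \;=\; \sum_{l} E[\,Y \mid A = a,\, L = l\,] \,\Pr[L = l]
\]
\[
  E[Y^{a=1}] - E[Y^{a=0}]
  \;=\; \sum_{l} \Big( E[\,Y \mid A = 1,\, L = l\,] - E[\,Y \mid A = 0,\, L = l\,] \Big) \Pr[L = l]
\]
```

The first equality holds under the identifiability conditions discussed earlier; without them the right-hand side is still computable but has no causal interpretation.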

As for implementation, the code accompanying Hernán & Robins's book summarizes it nicely:

  1. outcome modeling, on the original data
  2. prediction on the expanded dataset
  3. standardization by averaging
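
Here is a minimal sketch of those three steps in standard-library Python. It is not the book's code: the data are simulated with a made-up true effect of 1.0, there is a single binary covariate L, and the "outcome model" is simply the table of mean outcomes per (A, L) cell, standing in for a fitted regression.

```python
import random
from collections import defaultdict

# Simulated observational data (all parameters are assumptions for illustration).
rng = random.Random(0)
data = []
for _ in range(20000):
    l = int(rng.random() < 0.5)
    a = int(rng.random() < (0.8 if l else 0.2))   # L influences treatment
    y = 1.0 * a + 2.0 * l + rng.gauss(0, 1)       # true effect of A is 1.0
    data.append((l, a, y))

# Step 1: outcome modeling on the original data, here E[Y | A, L] as cell means.
sums, counts = defaultdict(float), defaultdict(int)
for l, a, y in data:
    sums[(a, l)] += y
    counts[(a, l)] += 1
outcome_model = {cell: sums[cell] / counts[cell] for cell in sums}

# Step 2: prediction on the expanded dataset, i.e. every row is copied twice,
# once with A forced to 0 and once with A forced to 1, keeping its own L.
pred_untreated = [outcome_model[(0, l)] for l, _, _ in data]
pred_treated = [outcome_model[(1, l)] for l, _, _ in data]

# Step 3: standardization by averaging the predictions over all rows.
mean_treated = sum(pred_treated) / len(data)
mean_untreated = sum(pred_untreated) / len(data)
std_effect = mean_treated - mean_untreated
print(f"standardized effect estimate: {std_effect:.2f}")
```

Averaging over all rows in step 3 is what weights each L stratum by its share of the population, regardless of how many treated or untreated subjects it contains.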

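Standardization and IP weighting are mathematically equivalent, so they should produce results that agree with each other. In practice, when both rely on parametric models, the estimates will rarely match exactly, but if they differ substantially, something is off (so-called model misspecification).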
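
The equivalence is easy to verify on a tiny hand-made dataset when no parametric model is involved. The twelve rows below are invented for this check, with a binary L, A and Y; both estimators use plain frequencies and cell means.

```python
from collections import Counter, defaultdict

# Toy rows as (L, A, Y); the numbers are made up for illustration.
toy = (
    [(0, 0, y) for y in (0, 0, 0, 1)] + [(0, 1, y) for y in (0, 1)] +
    [(1, 0, y) for y in (1, 0)] + [(1, 1, y) for y in (1, 1, 1, 0)]
)
n = len(toy)
n_by_l = Counter(l for l, _, _ in toy)

# IP weighting: propensity by stratum frequency, then weighted arm means.
p_treated = {l: sum(1 for ll, a, _ in toy if ll == l and a == 1) / n_by_l[l] for l in n_by_l}

def ip_weight(l, a):
    return 1 / p_treated[l] if a == 1 else 1 / (1 - p_treated[l])

def weighted_mean(arm):
    num = sum(ip_weight(l, a) * y for l, a, y in toy if a == arm)
    den = sum(ip_weight(l, a) for l, a, y in toy if a == arm)
    return num / den

ipw_effect = weighted_mean(1) - weighted_mean(0)

# Standardization: cell means E[Y | A, L], averaged over the distribution of L.
cells = defaultdict(list)
for l, a, y in toy:
    cells[(a, l)].append(y)
cell_mean = {cell: sum(ys) / len(ys) for cell, ys in cells.items()}
std_effect = sum((cell_mean[(1, l)] - cell_mean[(0, l)]) * n_by_l[l] / n for l in n_by_l)

print(f"IP weighting:    {ipw_effect:.3f}")
print(f"standardization: {std_effect:.3f}")
```

Both estimates come out at 0.25 on this toy data (up to floating-point error), while the unadjusted difference in means is about 0.33. Once the frequencies are replaced by fitted parametric models, the two routes rely on different modeling assumptions and can drift apart.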

Things I Wish to Know More

I rushed through the first two parts of Causal Inference: What If in less than 3 weeks, so as to finish the project scoping on time. There are terms that I ran into but didn't dig into. Some of those include doubly robust methods (estimators that are said to remain valid when either the treatment model or the outcome model is incorrectly specified, as long as the other one is correct), sufficient causes (how does their graphical representation relate to causal diagrams?), and instrumental variable estimation (the authors do not sound like fans of this concept, but I wonder how I may find an ideal set of instrumental variables to estimate from another angle). Moreover, I'm interested in incorporating machine learning (e.g. tree-based ensemble models) into causal inference (such as using a random forest to estimate the conditional mean, E(Y|A, L)). I believe this is technically doable; I'm just unsure of the implications and interpretations (time to revisit ECI?). Another general item is to build a knowledge map - at least in my mind. For now my knowledge feels scattered; though I'm aware of certain terms and concepts, how are they connected? Therefore, I'm taking a step back to read Pearl's classic primer. Brady Neal has put together a fascinating flowchart on what book to read, so if you're new to the field too, check it out!


Ah, it took me more than a week to finish this blog post! I didn't proofread it though, so please bear with me on grammatical mistakes. For conceptual mistakes, leave me a message & I'll be more than willing to investigate.

I normally write technical notes after going through major challenges, and this blog post is no exception. My original goal was to lower the entry barrier to this field - it was painful to go through the literature as a complete novice. If anyone finds this interesting/helpful, that's a great comfort to me. I might write another blog post to share my journey in causal inference (if only... I have the time and am not procrastinating), so stay tuned :)