Causality - Doing Data Science

The causal effect is sometimes defined as the ratio of these

two numbers instead of the difference.

But we don't have God's knowledge, so instead we choose another

population to compare this one to, and we see whether they get cancer

or not, while not taking the drug. Say they have a natural cancer rate

of 0.10. Then we would conclude, using them as a proxy, that the

increased cancer rate is the difference between 0.30 and 0.10, so 20 percentage points.

This is of course wrong: the problem is that the two populations

have some underlying differences that we don't account for.

If these were the “same people,” down to the chemical makeup of each

other's molecules, this proxy calculation would work perfectly. But of

course they're not.
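
The proxy calculation above can be written out as a short sketch. It assumes that 0.30 is the observed cancer rate in the population that took the drug and 0.10 is the rate in the comparison population; the variable names are illustrative only. It shows both ways of expressing the effect, as a difference and as a ratio, and it inherits the same flaw as the proxy itself: it is only as good as the comparison group.

```python
# Rates taken from the text. Assumption: 0.30 is the cancer rate among
# people who took the drug, 0.10 is the rate in the proxy population.
treated_rate = 0.30
proxy_rate = 0.10

# Effect expressed as a difference of the two rates.
risk_difference = treated_rate - proxy_rate

# Effect expressed as a ratio of the two rates.
risk_ratio = treated_rate / proxy_rate

print(f"difference: {risk_difference:.2f}")
print(f"ratio: {risk_ratio:.1f}")
```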

So how do we actually select these people? One technique is to use

what is called propensity score matching or modeling. Essentially, we

create a pseudo-random experiment: we build a synthetic control

group by selecting people who were just as likely

to have been in the treatment group but weren't. How do we do this?

See the word in that sentence, “likely”? Time to break out the logistic

regression. There are two stages to propensity score modeling.

The first stage is to use logistic regression to model each person's

probability of having received the treatment; in the second stage, we might

pair people up so that one person received the treatment and the other

didn't, but they had been equally likely (or close to equally likely) to

have received it. Then we can proceed as we would if we had a random

experiment on our hands.
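
The two stages can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the book's implementation: the data are synthetic, with a single made-up confounder `x`; the logistic regression is fit with a hand-written gradient descent so that nothing beyond the standard library is needed; and the pairing uses greedy nearest-neighbor matching with a caliper of 0.05, which is only one of several possible matching rules. All function and variable names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, t, lr=0.5, epochs=300):
    """Stage 1: fit P(treated | covariates) by batch gradient descent.
    Returns the weights with the intercept first."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for x, ti in zip(X, t):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))
            err = p - ti
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

def propensity(w, x):
    """Predicted probability of treatment for one person."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def match_pairs(scores, t, caliper=0.05):
    """Stage 2: pair each treated person with the closest-scoring
    unused control, skipping pairs further apart than the caliper."""
    treated = [i for i, ti in enumerate(t) if ti == 1]
    controls = [i for i, ti in enumerate(t) if ti == 0]
    pairs = []
    for i in treated:
        if not controls:
            break
        j = min(controls, key=lambda c: abs(scores[c] - scores[i]))
        if abs(scores[j] - scores[i]) <= caliper:
            pairs.append((i, j))
            controls.remove(j)
    return pairs

# Synthetic data (an assumption for illustration): x raises both the
# chance of treatment and the chance of the outcome, so it confounds.
random.seed(0)
n = 600
X, t, y = [], [], []
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    treated = 1 if random.random() < sigmoid(1.5 * x) else 0
    p_outcome = 0.05 + 0.30 * sigmoid(2.0 * x) + 0.20 * treated
    X.append([x])
    t.append(treated)
    y.append(1 if random.random() < p_outcome else 0)

w = fit_logistic(X, t)
scores = [propensity(w, x) for x in X]
pairs = match_pairs(scores, t)

# Naive comparison: everyone treated versus everyone untreated.
y_treated = [yi for yi, ti in zip(y, t) if ti == 1]
y_control = [yi for yi, ti in zip(y, t) if ti == 0]
naive = sum(y_treated) / len(y_treated) - sum(y_control) / len(y_control)

# Matched comparison: average outcome difference within matched pairs.
matched = sum(y[i] - y[j] for i, j in pairs) / len(pairs)

print(f"matched pairs: {len(pairs)}")
print(f"naive difference: {naive:.3f}")
print(f"matched difference: {matched:.3f}")
```

Because the simulated treatment adds 0.20 to the outcome probability by construction, the matched difference can be compared against that known value. Treated people with no control inside the caliper are dropped, so the matched estimate describes only the matched subset.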

For example, if we wanted to measure the effect of smoking on the

probability of lung cancer, we'd have to find people who shared the

same probability of smoking. We'd collect as many covariates on people

as we could (age, whether or not their parents smoked, whether or not

their spouses smoked, weight, diet, exercise, hours a week they work,

blood test results), and we'd use as an outcome whether or not they

smoked. We'd build a logistic regression that predicted the probability

of smoking. We'd then use that model to assign each person that

probability, called their propensity score, and then

we'd match on it. Of course we're banking on having

figured out and observed all the covariates associated with the

likelihood of smoking, which we probably haven't. And that's the
