“Causal Inference in Python. Cause-and-effect relationships in IT development”

Data used

In 1988, California passed the famous Tobacco Tax and Health Care Act, which became known as “Proposition 99”. “The primary effect of this law was to impose a 25-cent excise tax on each pack of cigarettes sold within the State of California, with approximately equivalent excise taxes on retail sales of other tobacco products, particularly cigars and chewing tobacco. Additional restrictions imposed on the sale of tobacco include a ban on the installation of vending machines with tobacco products in public places open to children, as well as a ban on the individual sale of cigarettes.”

To evaluate the effect, cigarette sales data could be collected across multiple states and over many years. In our case, we operated on data for the period from 1970 to 2000, taken from 39 states. Other states had similar tobacco control programs and were not included in our analysis. This is what our data looks like:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn import linear_model

%matplotlib inline

cig = (pd.read_csv("/content/drive/MyDrive/smoking.csv")
         .drop(columns=["lnincome","beer", "age15to24"]))

style.use('fivethirtyeight') # You can change this to your preference

cig.query("california").head()
image

Dataset fragment

Here state — state index/number, California has number 3. As covariates, we take the retail cost of cigarettes and per capita cigarette sales in terms of packs. The variable of interest to us at the output is cigsale. Finally, we have auxiliary Boolean variables that can be used to infer the state of California and the post-intervention period. Here's what you get if you plot cigarette sales in California and average cigarette sales in other states over time.

image

During the period for which we have data, Californians clearly bought fewer cigarettes than the national average. In addition, there is a trend towards a decrease in tobacco consumption after the 80s. It appears that this downward trend has increased in California (relative to other states) since the passage of Proposition 99, but this cannot be said with certainty. This is just a version that we need to check against the schedule.

To answer the question of whether Proposition 99 affected cigarette consumption, we examine the pre-intervention period and construct a synthetic control. We will combine data from other states to to assemble a fictitious state based on them, the trends in which are very similar to Californian ones. We then look at how the picture in this synthetic control state changes after the intervention.

Mathematical notation

Let's say we have J+1 units. Further, without prejudice to our generalizations, assume that unit 1 is affected by the intervention in question, that is, Proposition 99. Units j=2,…, J+1 is a collection of unaffected units (states) that we collectively call the “donor pool.” Let's also assume that the data we have covers T periods with T0 periods before intervention. For each unit j and each interval t the result is observed Yjt. For each unit j and each period t we define YNjt as a potential outcome that occurs in the absence of intervention, and YIjt – as a potential result that occurs after the intervention.

image

Then the effect of the affected unit j=1 at time t for t>T0 defined as

image

Since j=1 affected by the intervention, YIjt observed in practice, and YNjt – No. Now the question arises: how do we evaluate YNjt. Note that the effect of the intervention is determined on a period-by-period basis, meaning it may change over time. It should not appear instantly, but may accumulate or dissipate. To show this more clearly, we point out that the task of assessing the effect of an intervention comes down to the following: estimate what would happen if the output were from unity j=1 was received without intervention.

image

For evaluation YNjt Let us not forget that the combination of units in the donor pool can approximate the characteristics of the affected unit much better than any unaffected unit as such. Therefore, the synthetic control is defined as the weighted average of all units in the control pool. Having weights W=(w2,…,wJ+1)we obtain the estimate through synthetic control YNjt equal to

image

Visual explanation

As you know, linear regression allows you to obtain a forecast as a weighted average of all variables. In this case, regression can be represented as the following matrix multiplication.

image

When using synthetic control, we do not have many units, but many time periods. In this case, we simply flip the input matrix. The units then become “variables” and we represent the total as a weighted average of the units – see the following matrix multiplication.

image

If we have more than one characteristic for a period of time, then we can group the characteristics as follows. The most important thing here is to ensure that the regression “attempts to predict” the affected unit 1 using the other units. Therefore, we can select the weights in an optimal way to achieve the degree of approximation we need. We can even scale features differently, assigning them different degrees of importance.

image

Implementation

To evaluate the effect of the intervention using the synthetic control method, we will try to collect a “dummy unit” that resembles the properties of the one we are processing, but before the intervention period. In this way we will see how the “dummy unit” manifests itself after the intervention. The difference between a synthetic control unit and a real unit is that we get a mimicry of the effect of the intervention.

To achieve this using linear regression, we find the weight using the Ordinary Least Squares (OLS) method. We minimize the quadratic distance between the weighted average of the units in the donor pool and the affected unit in the state it was in before the intervention.

To do this, we first need to convert the units (in our case, states) into columns, and the time into rows. Since we have 2 signs, cigsale And retpriceplace them on top of each other, as was done in the previous picture. Let's put together a synthetic control state that looked something like California before the intervention, and see what happens to it in the post-intervention period. Therefore, it is important to select only the period before the intervention. Here the features are plotted on a similar scale, so we do not normalize them. If the scales of the features were different, for example, one was in the thousands, and the other in the tens, then when minimizing the difference, the larger feature would be most important to us. To avoid this, it is important to scale the features first.

features = ["cigsale", "retprice"]

inverted = (cig.query("~after_treatment") # отфильтруем период до вмешательства
            .pivot(index='state', columns="year")[features] # предусмотрим по столбцу на год и по строке на штат 
            .T) # перевернём таблицу так, чтобы у нас было по столбцу на штат 

inverted.head()
image

You can now define variable Y to be the state of California and variable X to be the other states.

y = inverted[3].values # штат Калифорния
X = inverted.drop(columns=3).values  # другие штаты

Next, let's perform lasso regression. Lasso regression, also called L1, is used to avoid overfitting our data across states. You can also use ridge regression for this. The regression will return a set of weights that minimize the squared difference between the unit in question and the units in the donor pool.

from sklearn.linear_model import Lasso
weights_lr = Lasso(fit_intercept=False).fit(X, y).coef_
weights_lr.round(3)

image

These scales show how to build a synthetic control. Let's multiply the total of State 1 by 0.566, State 3 by 0.317, State 4 by 0.158, etc. This can be achieved by dot product between the state matrix in the pool and the weights.

calif_synth_lr = (cig.query("~california")
                  .pivot(index='year', columns="state")["cigsale"]
                  .values.dot(weights_lr))

Now that the synthetic control is ready, we can plot it along with the outcome variable for the state of California.

plt.figure(figsize=(10,6))
plt.plot(cig.query("california")["year"], cig.query("california")["cigsale"], label="California")
plt.plot(cig.query("california")["year"], calif_synth_lr, label="Synthetic Control")
plt.vlines(x=1988, ymin=40, ymax=140, linestyle=":", lw=2, label="Proposition 99")
plt.ylabel("Gap in per-capita cigarette sales (in packs)")
plt.legend();

image

Given a synthetic control, it is possible to estimate the treatment effect as the gap in performance between the affected and control subjects.

image
plt.figure(figsize=(10,6))
plt.plot(cig.query("california")["year"], cig.query("california")["cigsale"] - calif_synth_lr,
         label="California Effect")
plt.vlines(x=1988, ymin=-30, ymax=7, linestyle=":", lw=2, label="Proposition 99")
plt.hlines(y=0, xmin=1970, xmax=2000, lw=2)
plt.title("State - Synthetic Across Time")
plt.ylabel("Gap in per-capita cigarette sales (in packs)")
plt.legend();

image

It appears that by 2000, Proposition 99 had reduced per capita cigarette sales by 25 packs. Now we need to find out whether this indicator is statistically significant.

Logical conclusion

Here we will arm ourselves with Fisher's exact test. Intuitively it is completely clear. We thoroughly rearrange the units in the affected and control parts. Since we only have one unit in the affected part, this will mean that we will consider each unit to be processed, and all the others at the same time as control.

image

Ultimately, we will have one synthetic control estimate and one real estimate per state. That is, in this case, it is assumed that the measures were actually taken not in California, but in another state, and then it is estimated what the effect would have been if these measures had not been taken. Next, we test whether the effect achieved in California is significantly higher compared to the effect achieved in another, fictitious state. The idea is that if no measures were actually taken by the states (and we assumed the opposite), then we would not detect any significant effect from the intervention.

This function returns a data frame with one column representing the state, one representing the year, and one representing the total. cigsale and one to the synthetic total for a given state.

This is what you get when you apply the function to the first state:

def synthetic_control(state, pool, data) -> np.array:
    
    features = ["cigsale", "retprice"]
    
    inverted = (data.query("~after_treatment")
                .pivot(index='state', columns="year")[features]
                .T)
    
    y = inverted[state].values # treated
    X = inverted.drop(columns=state).values # donor pool

    weights = Lasso(fit_intercept=False).fit(X, y).coef_.round(3)
    synthetic = (data.query(f"~(state=={state})")
                 .pivot(index='year', columns="state")["cigsale"]
                 .values.dot(weights))

    return (data
            .query(f"state=={state}")[["state", "year", "cigsale", "after_treatment"]]
            .assign(synthetic=synthetic))

image

Having a synthetic control for all states, it is possible to estimate the gap between the synthetic and the true state for all states. In California, this is the effect achieved by the intervention. For other states, the effect is similar to a placebo—we estimate how a state would have been affected by an exposure that did not actually occur. If you plot all the placebo effects along with the effect of the actual measures taken in California, you get the following picture:

sinthetic_states = [synthetic_control(state, control_pool, cig) for state in control_pool]
plt.figure(figsize=(12,7))
for state in sinthetic_states:
    plt.plot(state["year"], state["cigsale"] - state["synthetic"], color="C5",alpha=0.4)

plt.plot(cigar.query("california")["year"], cigar.query("california")["cigsale"] - calif_synth_lr,
        label="California");

plt.vlines(x=1988, ymin=-50, ymax=120, linestyle=":", lw=2, label="Proposition 99")
plt.hlines(y=0, xmin=1970, xmax=2000, lw=3)
plt.ylabel("Gap in per-capita cigarette sales (in packs)")
plt.title("State - Synthetic Across Time")
plt.legend();

------

control_pool = cig["state"].unique()

sinthetic_states = [synthetic_control(state, control_pool, cig) for state in control_pool]

plt.figure(figsize=(12,7))
for state in sinthetic_states:
    plt.plot(state["year"], state["cigsale"] - state["synthetic"], color="C5",alpha=0.4)

plt.plot(cig.query("california")["year"], cig.query("california")["cigsale"] - calif_synth_lr,
        label="California");

plt.vlines(x=1988, ymin=-50, ymax=120, linestyle=":", lw=2, label="Proposition 99")
plt.hlines(y=0, xmin=1970, xmax=2000, lw=3)
plt.ylabel("Gap in per-capita cigarette sales (in packs)")
plt.title("State - Synthetic Across Time")
plt.legend();

image

In this picture, two details immediately catch the eye. Firstly, we see that the spread after the intervention is higher than before it. This is expected, since the synthetic control was designed specifically to minimize the difference with the period before the intervention. Another interesting aspect is that there are some units that don't fit very well even in the pre-intervention period. This is also expected.

Since these units fit so poorly into the graph, we can simply exclude them from the analysis. One way to do this objectively is to set a threshold for the error that is acceptable before intervention.

image

and remove all units with too large errors. If we further reproduce the same graph as in the previous figure, then this is what we get.

def pre_treatment_error(state):
    pre_treat_error = (state.query("~after_treatment")["cigsale"] 
                       - state.query("~after_treatment")["synthetic"]) ** 2
    return pre_treat_error.mean()

plt.figure(figsize=(12,7))
for state in sinthetic_states:
    
    # удалить единицы со средней ошибкой выше 5.
    if pre_treatment_error(state) < 5:
        plt.plot(state["year"], state["cigsale"] - state["synthetic"], color="C5",alpha=0.4)

plt.plot(cig.query("california")["year"], cig.query("california")["cigsale"] - calif_synth_lr,
        label="California");

plt.vlines(x=1988, ymin=-50, ymax=120, linestyle=":", lw=2, label="Proposition 99")
plt.hlines(y=0, xmin=1970, xmax=2000, lw=3)
plt.ylabel("Gap in per-capita cigarette sales (in packs)")
plt.title("Distribution of Effects")
plt.title("State - Synthetic Across Time (Large Pre-Treatment Errors Removed)")
plt.legend();

image

Removing the noise, we see how extreme the effect was achieved in the state of California. As you can see from the figure, if such measures had been adopted in any other state except California, we would have made almost no changes as dramatic as in California.

image

The effect of the intervention in California in 2000 is -31,419, meaning that the intervention reduced per capita cigarette consumption by almost 31 packs.

Conclusion

Using pre-intervention data from other states, we built a lasso regression model that assigned fixed weights to each control state to produce a weighted average that closely resembled California's smoking pattern before Proposition 99 was passed. “

We then applied the resulting lasso regression model and synthesized a dummy state that would have looked like California in the post-period if the intervention had not occurred. The difference between actual cigarette sales and the total that we synthesized is the effect of the intervention.

We also looked at how this synthetic control allows us to reach logical conclusions based on Fisher's exact test. We guessed what unaffected units would look like after the intervention and thus calculated the effect. This effect is similar to a placebo: we would have seen it without the intervention. Based on these effects, we visualized and tested the effect of the intervention as it affected cigarette sales in California.

Book

Causal Inference for The Brave and True by Matheus Facure

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *