CausalImpact & Healthcare Costs in the US

A quick look at the CausalImpact package in R, studying US healthcare costs over time

Robert Myles MCDonnell

4 minute read

This article by Mike Holly on interested me. I wasn’t aware of the history of healthcare costs in the US, and it’s quite surprising, especially the take-off in costs after the 1960s. Holly’s argument is that the “U.S. ‘health care cost crisis’ didn’t start until 1965. The government increased demand with the passage of Medicare and Medicaid while restricting the supply of doctors and hospitals.” In fact, I was quite impressed with this graph:

x <- c("rvest", "dplyr", "tidyr", "ggplot2", "magrittr", "lubridate")
lapply(x, require, character.only = TRUE) 

I’ve wanted an excuse to use Google’s CausalImpact package for a while now, so let’s give it a shot using these data. The Consumer Price Index we can grab quickly, since it’s widely available:

url <- ""

cpi <- read_html(url) %>% 
  html_node("table") %>% 
  html_table() %>% 
  dplyr::select(-X14:-X16) %>% 
  filter(X1 != "Year") %>% 
  set_colnames(c("Year", "m01", "m02", "m03", "m04", "m05", "m06", 
                 "m07", "m08", "m09", "m10", "m11", "m12")) %>% 
  mutate_all(as.numeric) %>% 
  slice(-1) %>% 
  gather(Month, Value, m01:m12) %>% 
  mutate(Day = "01",
         Month = gsub("m", "", Month),
         Date = paste(Year, Month, Day, sep = "-"),
         Date = parse_date_time(Date, "Ymd")) %>% 
  dplyr::select(Date, Value) %>% arrange(Date) %>% filter(! #%>% 
  #mutate(Value = (Value - mean(Value))/sd(Value))

ggplot(cpi, aes(x = Date, y = Value)) +
  geom_line(colour = "dodgerblue") +
  theme_minimal() + ylab(NULL) + xlab(NULL)

Downloading the data for Medical costs from the US Bureau of Labor Statistics let us plot the two side-by-side:

ggplot() +
  geom_line(data = dplyr::filter(cpi, Date >= "1947-01-01"), aes(x = Date, y = Value),
            colour = "dodgerblue") +
  geom_line(data = med, aes(x = DATE, y = CPIMEDSL), colour = "firebrick") +
  theme_minimal() + ylab(NULL) + xlab(NULL) +
  annotate("text", label = "Medical care price index", y = 300, 
           x = as.POSIXct("1993-01-01"), colour = "firebrick") +
  annotate("text", label = "Consumer price index", y = 110, 
           x = as.POSIXct("1999-01-01"), colour = "dodgerblue")

Certainly looks similar, although not quite as dramatic (slightly different take-off points too, in the split). And Holly seems to have divided the numbers on the y-axis by 10 or something like that. Nonetheless, it looks like we’re dealing with the same data. So now let’s impact that thang.


cpi <- cpi %>% 
  filter(Date >= "1947-01-01") %>% 
  rename(cpi = Value)

med <- med %>%
  rename(Date = DATE, med = CPIMEDSL)

series <- full_join(med, cpi) %>% filter(!

pre_period <- c(1, 449)
post_period <- c(450, 843)

impact <- CausalImpact(series[, 2:3], pre_period, post_period)

I won’t dare attempt to draw causal conclusions (my DAG/Judea Pearl-loving friends would never forgive me). I’ll just say “interesting”.

CausalImpact also has a summary function which prints out a report! Let’s have a look:

summary(impact, "report")
## Analysis report {CausalImpact}
## During the post-intervention period, the response variable had an average value of approx. 278.22. By contrast, in the absence of an intervention, we would have expected an average response of 174.96. The 95% interval of this counterfactual prediction is [162.17, 184.15]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is 103.26 with a 95% interval of [94.07, 116.05]. For a discussion of the significance of this effect, see below.
## Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 109.62K. By contrast, had the intervention not taken place, we would have expected a sum of 68.94K. The 95% interval of this prediction is [63.90K, 72.56K].
## The above results are given in terms of absolute numbers. In relative terms, the response variable showed an increase of +59%. The 95% interval of this percentage is [+54%, +66%].
## This means that the positive effect observed during the intervention period is statistically significant and unlikely to be due to random fluctuations. It should be noted, however, that the question of whether this increase also bears substantive significance can only be answered by comparing the absolute effect (103.26) to the original goal of the underlying intervention.
## The probability of obtaining this effect by chance is very small (Bayesian one-sided tail-area probability p = 0.001). This means the causal effect can be considered statistically significant.

Ooh, very fancy.

comments powered by Disqus