Purpose

The purpose of this is to analyze federal budget data for Tidy Tuesday February 12, 2019. The webpage for the challenge is at the R2DS website. The original data source is the AAAS.

Setup

Get data

From the website, the code to get the data is as follows:

fed_rd <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-12/fed_r_d_spending.csv")

## Parsed with column specification:
## cols(
##   department = col_character(),
##   year = col_double(),
##   rd_budget = col_double(),
##   total_outlays = col_double(),
##   discretionary_outlays = col_double(),
##   gdp = col_double()
## )

# energy_spend <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-12/energy_spending.csv")
# climate_spend <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-12/climate_spending.csv", col_types = "ccd")

There is an issue with the climate spend data, so I downloaded the Excel file from the source site and reimported. I had to rework the cleaning code a bit, mostly to deal with some unfortunate renaming issue in tibble’s name repair mechanism (the variable name for the agency gets renamed to ..1, which confounds the NSE mechanisms of rename and other dplyr functions). Basically, I fell back to base R’s name repair and modified the cleaning code slightly to compensate. However, I don’t use the energy and climate dataset below, so I comment those out for now.

One thing that will be interesting: adding in the party of the President. Though Nixon/Ford started in 1969, I’ll just start from 1976, when the data begin. This will help with the graphs.

party <- tribble(
  ~party,       ~from,  ~to,
  "Republican", 1976,   1977,
  "Democrat",   1977,   1981,
  "Republican", 1981,   1993,
  "Democrat",   1993,   2001,
  "Republican", 2001,   2009,
  "Democrat",   2009,   2017,
  "Republican", 2017,   2019
)

Let’s add this to the fed_rd dataset.

fed_rd %>% 
  crossing(party) %>% 
  filter(year >= from & year < to) %>% 
  select(-from, -to) ->
  fed_rd

Exploration

Federal R&D

The first dataset shows Federal R&D, and looks as follows:

head(fed_rd)

## # A tibble: 6 x 7
##   department  year  rd_budget total_outlays discretionary_ou~     gdp party
##   <chr>      <dbl>      <dbl>         <dbl>             <dbl>   <dbl> <chr>
## 1 DOD         1976    3.57e10  371800000000      175600000000 1.79e12 Repu~
## 2 NASA        1976    1.25e10  371800000000      175600000000 1.79e12 Repu~
## 3 DOE         1976    1.09e10  371800000000      175600000000 1.79e12 Repu~
## 4 HHS         1976    9.23e 9  371800000000      175600000000 1.79e12 Repu~
## 5 NIH         1976    8.02e 9  371800000000      175600000000 1.79e12 Repu~
## 6 NSF         1976    2.37e 9  371800000000      175600000000 1.79e12 Repu~

This already suggests a visualization, so let’s try one. The default y-axis label is in dollars, so I use the scales::dollar_format function to scale it to billions of dollars, and adjust the label accordingly.

ggplot(fed_rd, aes(year, rd_budget, group = department, color = department)) +
  geom_line() +
  geom_rect(aes(xmin = from, xmax = to, ymin = -Inf, ymax = Inf, fill = party), 
            inherit.aes = FALSE, data = party, alpha = 0.2) +
  xlab("Year") +
  ylab("Federal R&D Budget (billions)") +
  scale_color_viridis_d() +
  scale_fill_brewer() +
  scale_y_continuous(labels = scales::dollar_format(scale = 1e-9))

It would be interesting to get these budgets as a percentage of discretionary or total outlays. We’ll start with discretionary. I use the scales::percent_format to label the y-xaxis. The automatic guess from scales::percent gives one decimal place, which I thought was silly for this graph, so I put in the extra effort to cut it out.

fed_rd %>% 
  mutate(rd_perc_disc = rd_budget / discretionary_outlays) %>% 
  ggplot(aes(year, rd_perc_disc, group = department, color = department)) +
  geom_line() +
  geom_rect(aes(xmin = from, xmax = to, ymin = -Inf, ymax = Inf, fill = party), 
            inherit.aes = FALSE, data = party, alpha = 0.2) +
  xlab("Year") +
  ylab("Federal R&D proportion of discretionary spending") +
  scale_colour_viridis_d() +
  scale_fill_brewer() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1))

The same for percentage of total, leaving the extra decimal place in this time because it is informative:

fed_rd %>% 
  mutate(rd_perc_tot = rd_budget / total_outlays) %>% 
  ggplot(aes(year, rd_perc_tot, group = department, color = department)) +
  geom_line() +
  geom_rect(aes(xmin = from, xmax = to, ymin = -Inf, ymax = Inf, fill = party), 
            inherit.aes = FALSE, data = party, alpha = 0.2) +
  xlab("Year") +
  ylab("Federal R&D proportion of total spending") +
  scale_colour_viridis_d() +
  scale_fill_brewer() +
  scale_y_continuous(labels = scales::percent)

Ok, so this brief exploration showed that defense spending has dominated other spending, at least as a portion of the federal budget. One thing I did check on at the AAAS site is that this spending is adjusted for inflation. (Probably should have checked that before embarking on this!)

I think it is worth trying to remake the above graph but switching the legend order based on the average spending over time. Here we go:

fed_rd %>% 
  mutate(rd_perc_tot = rd_budget / total_outlays,
         department = fct_reorder(department, rd_perc_tot, .fun = mean)) %>% 
  ggplot(aes(year, rd_perc_tot, group = department, color = department)) +
  geom_line() +
  geom_rect(aes(xmin = from, xmax = to, ymin = -Inf, ymax = Inf, fill = party), 
            inherit.aes = FALSE, data = party, alpha = 0.2) +
  xlab("Year") +
  ylab("Federal R&D proportion of total spending") +
  scale_colour_viridis_d(option = "B", direction = -1) +
  scale_fill_brewer() +
  scale_y_continuous(labels = scales::percent)

I think this graph is more informative than the above graph ordered alphabetically by department. I also switched up the color scheme a bit. To me this subtle change made the graph far more informative, with the color legend giving the order of the priority (more or less) of R&D vs the discretionary spending from least to most. This simple change using the fct_reorder function from the forcats package (loaded as part of tidyverse) is very powerful. (I could have also done this in the ggplot aesthetic.)

One more quick addition to see if it adds anything:

fed_rd %>% 
  mutate(rd_perc_tot = rd_budget / total_outlays,
         department = fct_reorder(department, rd_perc_tot, .fun = mean)) %>% 
  ggplot(aes(year, rd_perc_tot, group = department, color = department)) +
  geom_line() +
  geom_rect(aes(xmin = from, xmax = to, ymin = -Inf, ymax = Inf, fill = party), 
            inherit.aes = FALSE, data = party, alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE, size = 0.5, linetype = "dotted") +
  xlab("Year") +
  ylab("Federal R&D proportion of total spending") +
  scale_colour_viridis_d(option = "B", direction = -1) +
  scale_fill_brewer() +
  scale_y_continuous(labels = scales::percent)

The fitted lines give a little more information that may help group the R&D budgets even further. We have DOD, in a class by itself (obvious from the raw line plot), HHS and NIH that have a slight trend down over time, DOE and NASA that started higher but have a stronger trend down, then everything else that started low relative to everything else and as a result remain relatively flat. One final picture, putting the y-axis on log scale:

fed_rd %>% 
  mutate(rd_perc_tot = rd_budget / total_outlays,
         department = fct_reorder(department, rd_perc_tot, .fun = mean)) %>% 
  ggplot(aes(year, rd_perc_tot, group = department, color = department)) +
  geom_line() +
  geom_rect(aes(xmin = from, xmax = to, ymin = 0.0001, ymax = Inf, fill = party), 
            inherit.aes = FALSE, data = party, alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE, size = 0.5, linetype = "dotted") +
  xlab("Year") +
  ylab("Federal R&D proportion of total spending") +
  scale_colour_viridis_d(option = "B", direction = -1) +
  scale_fill_brewer() +
  scale_y_log10(labels = scales::percent)

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 26 rows containing non-finite values (stat_smooth).

This graph was produced by switching scale_y_log10 in for scale_y_continuous. You get a whole bunch of junk complaining about how you can’t take the log of a negative values (from geom_smooth, because the linear trend extends below 0), and geom_rect if you keep the ymin = -Inf. I went ahead and changed ymin = 0.01 so we could keep the party shading. The main effect of the log scale was to explore the trends of the “everything else” in the above graphs. What was a relatively flat trend still is, but it is shown that everything in R&D is trending down relative to discretionary spending.

These three graphs all give an interesting picture. Under Republican presidents in general, defense R&D spending has increased, sometimes drastically so, for a few years and started to fall. But as a general trend, defense spending as a percentage of discretionary and total spending has declined since the 1970s, where we were still in the heat of the Cold War. (It’s important to remember we’re talking about R&D spending.)

Just for kicks

Ow, my eyes!

library(gganimate)
fed_rd %>% 
  mutate(rd_perc_tot = rd_budget / total_outlays,
         department = fct_reorder(department, rd_perc_tot, .fun = mean)) %>% 
  ggplot(aes(x = "", rd_perc_tot, fill = department)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y") +
  scale_fill_viridis_d(option = "B") +
  ggtitle("Federal R&D proportion of total spending") +
  scale_y_continuous(labels = scales::percent) +
  theme_void() +
  transition_time(year) +
  ease_aes("linear")

I think I just invented the incredible shrinking pie chart.

Discussion

Honestly, this blog post went in a slightly different direction than I thought. The main takeaway from above is that subtle changes to the coloring, ordering, and scaling of a graph can reveal different insights in the data. I don’t think any one of the above is particularly better than any other, just different.

There are a lot of different directions this can go. I thought about using gganimate with pie charts just to hurt some eyes. If you were a goverment contractor, you might be interested in forecasts, and it would be easy to apply the techniques from the previous blog posts to do that. I didn’t even touch the climate dataset.

RJS Data Science LLC Professional statistician and data science consultant with 25+ years experience
john@rjsdatascience.com

Federal R&D Budget (Tidy Tuesday Feb 12, 2019)