As part of our investigation into mortality rates reported on social media and mortality data created by combining government data sources, we've now completed three steps. First, we downloaded mortality data from the National Center for Health Statistics (NCHS). Then we downloaded population data from the Census. Using the mortality and population data, we created mortality rates from 1999 to 2020. Now that we have mortality rates, we can compare our mortality data with the mortality data reported in the social media post that started this blog series.
library(here)
library(reshape2)
library(lubridate)
library(tidyverse)
Our mortality data is in the .csv mortality_time_series_national.csv. I manually coded the table reported in the social media blog post into the .csv social_media_mortality_data.csv.
mortality_time_series_national = read_csv(here("output/mortality_time_series_national.csv"))
social_media_mortality_data = read_csv(here("data/social_media_mortality_data.csv"))
Let's add a column, source, to each dataset so that we can tell the data sources apart when we create a comparison dataset.
mortality_time_series_national$source = "CDC & Census"
social_media_mortality_data$source = "Social Media"
Here's a quick look at the datasets. We can see that they have identical column names and the same number of observations.
## Rows: 44
## Columns: 5
## $ state_name <chr> "United States", "United States", "United States", "Unit...
## $ variable <chr> "pop_estimate", "pop_estimate", "pop_estimate", "pop_est...
## $ value <dbl> 280466621, 281424600, 284968955, 287625193, 290107933, 2...
## $ year <dbl> 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 20...
## $ source <chr> "CDC & Census", "CDC & Census", "CDC & Census", "CDC & C...
## Rows: 44
## Columns: 5
## $ state_name <chr> "United States", "United States", "United States", "Unit...
## $ year <dbl> 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 20...
## $ variable <chr> "pop_estimate", "pop_estimate", "pop_estimate", "pop_est...
## $ value <dbl> 279040168, 281421906, 284968955, 287625193, 290107933, 2...
## $ source <chr> "Social Media", "Social Media", "Social Media", "Social ...
Combine our curated mortality data with social media mortality data into one long mortality dataset.
mortality_comparison_data_long = rbind(mortality_time_series_national, social_media_mortality_data)
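The two previews list the same columns in a different order, so it can be worth confirming that stacking them is safe. Here is a minimal sketch, using small made-up data frames rather than the real files, of one way to check. It relies on the fact that base R's rbind() for data frames matches columns by name rather than by position. The toy_ours and toy_social names and all values are illustrative assumptions, not part of the analysis.

```r
# Toy stand-ins for the two datasets; all values are made up for illustration.
toy_ours <- data.frame(
  state_name = "United States",
  variable = "pop_estimate",
  value = 100,
  year = 1999,
  source = "CDC & Census"
)
toy_social <- data.frame(
  state_name = "United States",
  year = 1999,
  variable = "pop_estimate",
  value = 200,
  source = "Social Media"
)

# Same set of column names (order may differ) and same number of rows?
same_columns <- setequal(names(toy_ours), names(toy_social))
same_rows <- nrow(toy_ours) == nrow(toy_social)

# rbind() on data frames lines columns up by name,
# so the differing column order is harmless here.
toy_long <- rbind(toy_ours, toy_social)
print(toy_long)
```

The combined data frame takes its column order from the first argument.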
Create a wide combined mortality dataset and add three comparison variables: (1) the mortality rate (deaths per 100,000 residents), (2) the mortality rate lag (the previous year's rate), and (3) the rate of change for the mortality rate.
mortality_comparison_data_wide = reshape2::dcast(mortality_comparison_data_long, state_name + year + source ~ variable) %>%
  group_by(source) %>%
  arrange(source) %>%
  mutate(mortality_rate = round((all_deaths/pop_estimate)*100000),
         mortality_rate_lag = lag(mortality_rate, order_by = year),
         mortality_rate_roc = (mortality_rate - mortality_rate_lag)/mortality_rate_lag)
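To make the three derived variables concrete, here is a small worked sketch in base R on made-up numbers. The toy data frame and its values are assumptions for illustration only. The column names mirror the ones used in the pipeline above.

```r
# Made-up deaths and population for three consecutive years (illustration only).
toy <- data.frame(
  year = c(2001, 2002, 2003),
  all_deaths = c(800, 840, 924),
  pop_estimate = c(100000, 100000, 100000)
)
toy <- toy[order(toy$year), ]

# (1) Mortality rate: deaths per 100,000 residents, rounded to a whole number.
toy$mortality_rate <- round(toy$all_deaths / toy$pop_estimate * 100000)

# (2) Lag: the previous year's rate; the first year has no predecessor, so NA.
toy$mortality_rate_lag <- c(NA, head(toy$mortality_rate, -1))

# (3) Rate of change: the year-over-year change as a fraction of last year's rate.
toy$mortality_rate_roc <-
  (toy$mortality_rate - toy$mortality_rate_lag) / toy$mortality_rate_lag

print(toy)
```

In this toy example the rate goes from 800 to 840 to 924, so the rate of change is NA, 0.05, and 0.10. The first year of each source always has a missing lag, which is why the later averages use na.rm = TRUE.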
Now we can plot the mortality rate by data source. The mortality rates are nearly identical, with the exception of 2020 (more on this later).
Let's create facet plots of the mortality rate by source to better see the trend, because the two lines overlap so closely that they look like one line for most of the time period.
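The plotting code is not shown in this post, so the following is only a sketch of how a facet plot like this could be built with ggplot2, not the code that produced the original figures. The toy_rates data frame and its values are made up for illustration. With the real data, one would pass mortality_comparison_data_wide instead.

```r
library(ggplot2)

# Made-up rates for illustration only; these are not real mortality figures.
toy_rates <- data.frame(
  year = rep(2017:2020, times = 2),
  source = rep(c("CDC & Census", "Social Media"), each = 4),
  mortality_rate = c(10, 11, 12, 20, 10, 11, 12, 13)
)

# One panel per data source, so overlapping lines are drawn separately.
mortality_facets <- ggplot(toy_rates, aes(x = year, y = mortality_rate)) +
  geom_line() +
  facet_wrap(~ source) +
  labs(x = "Year", y = "Deaths per 100,000 residents")

print(mortality_facets)
```

Dropping facet_wrap() and mapping color = source inside aes() would give the single-panel version, where the two lines sit on top of each other.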
Here's a plot of the mortality rate's rate of change by data source. As with the plot of the mortality rate, 2020 is the data point where our data sources differ the most.
We can create another facet plot to better separate the trends.
We will compare the social media mortality data and the mortality data we created by cohort and period. Period and cohort analysis is a method common to epidemiology; while we are not using the terms in their exact public health sense, they can be helpful for separating time-varying elements. First, the period comparison will compare the mortality rate of change from 1999 to 2019 with the mortality rate of change in 2020, recalling the central point of contention from the social media post: that 2020 mortality is in line with the time period 1999 to 2019. Second, we will split the time period 1999 to 2020 into cohorts of roughly five years each for a more granular analysis of the trends within proximate years.
cohort_period_comparisons = mortality_comparison_data_wide %>%
  mutate(period = ifelse(year<2020, "1999-2019", "2020"),
         cohort = ifelse(year>=1999 & year<2005, "1999-2004",
                  ifelse(year>=2005 & year<2010, "2005-2009",
                  ifelse(year>=2010 & year<2015, "2010-2014",
                  ifelse(year>=2015 & year<2020, "2015-2019", "2020"))))) %>%
  group_by(source, cohort) %>%
  mutate(cohort_mortality_rate = round(mean(mortality_rate, na.rm = TRUE)),
         cohort_mortality_rate_roc = round(mean(mortality_rate_roc, na.rm = TRUE), 3)) %>%
  ungroup() %>%
  group_by(source, period) %>%
  mutate(period_mortality_rate = round(mean(mortality_rate, na.rm = TRUE)),
         period_mortality_rate_roc = round(mean(mortality_rate_roc, na.rm = TRUE), 3)) %>%
  select(source, cohort, period, cohort_mortality_rate, period_mortality_rate, cohort_mortality_rate_roc, period_mortality_rate_roc) %>%
  distinct()
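As an aside, the same cohort labels can be produced without nesting ifelse() calls. This is an alternative sketch using base R's cut(), not the code used in this post, and the years vector is just a set of boundary years chosen to show where each cohort starts and ends.

```r
# Boundary years chosen to show where each cohort starts and ends.
years <- c(1999, 2004, 2005, 2009, 2010, 2014, 2015, 2019, 2020)

# right = FALSE makes each interval closed on the left: [1999, 2005), [2005, 2010), ...
cohort <- as.character(cut(
  years,
  breaks = c(1999, 2005, 2010, 2015, 2020, Inf),
  labels = c("1999-2004", "2005-2009", "2010-2014", "2015-2019", "2020"),
  right = FALSE
))

# The period split is a single cut point, so one ifelse() is enough.
period <- ifelse(years < 2020, "1999-2019", "2020")

print(data.frame(year = years, cohort = cohort, period = period))
```

One behavioral difference: a year before 1999 would become NA with cut(), whereas the nested ifelse() version would label it "2020". That does not matter for this dataset, which starts in 1999.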
|Data Source|Cohort|Period|Cohort Mortality Rate|Period Mortality Rate|Cohort Mortality Rate ROC|Period Mortality Rate ROC|
|---|---|---|---|---|---|---|
|CDC & Census|1999-2004|1999-2019|845|833|-0.008|0.001|
|CDC & Census|2005-2009|1999-2019|811|833|-0.006|0.001|
|CDC & Census|2010-2014|1999-2019|813|833|0.008|0.001|
|CDC & Census|2015-2019|1999-2019|860|833|0.010|0.001|
|CDC & Census|2020|2020|989|989|0.138|0.138|
In the first blog post of this series, I claimed that (1) the mortality rate in 2020 was almost 20% higher than in the period 1999 to 2019 and (2) the rate of change for 2020's mortality rate was 138 times greater than the average rate of change from 1999 to 2019. Now we can see how these numbers were derived. The mortality rate in 2020 was 989 deaths per 100,000 residents, while the period mortality rate for 1999 to 2019 was 833 deaths per 100,000 residents. The 2020 mortality rate represents an 18.7% increase over the 1999 to 2019 mortality rate.
The rate of change comparison is where 2020 sets itself apart. The rate of change for the mortality rate in 2020 was 0.138, while the period 1999 to 2019 had an average rate of change of 0.001. The 2020 rate of change is 138 times greater than the period rate of change! This is an astounding acceleration in the mortality rate. Moreover, even in the social media data, the rate of change in 2020 is vastly different from the time period 1999 to 2019. In the social media data, the rate of change for the mortality rate in 2020 was 0.011, while the period 1999 to 2019 had a rate of change of 0.001. With this dataset, the 2020 rate of change is 11 times greater than the period rate of change, which is also cause for alarm.
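The arithmetic behind these comparisons is short enough to reproduce directly. The sketch below plugs in the figures quoted in the text; the variable names are mine. Because the period rate of change (0.001) is itself rounded to three decimals, the resulting ratios should be read as approximate.

```r
# Mortality rates (deaths per 100,000 residents) quoted in the text.
rate_2020 <- 989
rate_1999_2019 <- 833

# Percent increase of the 2020 rate over the 1999-2019 period rate.
pct_increase <- (rate_2020 - rate_1999_2019) / rate_1999_2019 * 100
round(pct_increase, 1)  # 18.7

# Rates of change quoted in the text.
roc_2020_ours <- 0.138    # CDC & Census data, 2020
roc_2020_social <- 0.011  # social media data, 2020
roc_period <- 0.001       # 1999-2019 average in both datasets

# How many times larger the 2020 rate of change is than the period average.
roc_ratio_ours <- roc_2020_ours / roc_period
roc_ratio_social <- roc_2020_social / roc_period
round(c(roc_ratio_ours, roc_ratio_social))  # 138 11
```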
Ultimately, the mortality rate conclusions from the social media post are incorrect because its 2020 mortality data is not accurate. Total deaths in 2020 obtained from government data sources were 3,258,883, while the social media post showed 2,913,144 total deaths. The difference between these data sources is 345,739 deaths (about an 11% difference). According to Johns Hopkins University data, more than 330,000 Americans died from COVID-19 in 2020, roughly the difference in deaths between the data sources. It appears the author of the social media post used mortality data that did not account for COVID-19. Even with that flawed dataset, 2020 would still stand well above the 1999 to 2019 average, both in its mortality rate (5% greater) and in its rate of change (11 times greater).
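The gap between the two death counts can be checked in a few lines. This sketch uses the two totals quoted above; the variable names are mine. The percentage depends on which total is used as the base, so both are computed.

```r
# 2020 total deaths quoted in the text.
deaths_government <- 3258883
deaths_social_media <- 2913144

# Absolute gap between the two sources.
death_gap <- deaths_government - deaths_social_media
death_gap  # 345739

# The gap as a percentage of each total; the base chosen changes the answer slightly.
share_of_government <- death_gap / deaths_government * 100
share_of_social_media <- death_gap / deaths_social_media * 100
round(c(share_of_government, share_of_social_media), 1)
```

Either way, the gap lands between 10% and 12% of total deaths, consistent with the roughly 11% figure above.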
Aside from conclusions about COVID-19, another important trend emerges from the data. From 1999 to 2009, there were six years in which the mortality rate fell, and the overall mortality rate for those two cohorts declined. After 2009, however, the mortality rate steadily increased, with the largest increase occurring between 2019 and 2020. The changes in mortality rates are reflected in changes in life expectancy and have sparked a literature investigating declines in life expectancy, such as this paper by Anne Case and Angus Deaton, which documented declines in life expectancy driven by an increase in "deaths of despair". "Deaths of despair" are deaths from drug and alcohol poisonings, suicide, and chronic liver diseases. Our mortality data captures the uptick in mortality and supports the evidence for decreases in life expectancy.