Kivan Polimis, Mon 11 January 2021, Tutorials
In the previous post we used RSocrata to download mortality data from the National Center for Health Statistics (NCHS) from 1999 to 2020.
The goal of this post is to pair the mortality data with population data from the same time period to create mortality rates. Mortality rates (also known as death rates) are a scaled metric that capture the amount of deaths in a population. Typically that scaled population expresses the deaths per 1,000 or 100,000 people in the population (e.g., the U.S. mortality rate in 2000 was 854 deaths per 100,000 residents). We will combine 3 Census datasets to get U.S. population estimates from 1999 to 2020.
library(here)
library(tidyverse)
Use the readr
package (part of the Tidyverse) to go from census urls of .csv data to create data frames of census data
national_pop_1999_url = "https://www2.census.gov/programs-surveys/popest/datasets/1990-2000/intercensal/national/us-est90int-07-1999.csv"
national_pop_2000_2010_url = "https://www2.census.gov/programs-surveys/popest/datasets/2000-2010/intercensal/national/us-est00int-alldata-5yr.csv"
national_2010_2020_pop_url = "https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/national/totals/nst-est2020.csv"
national_pop_1999 = read_csv(national_pop_1999_url, skip = 3,
col_names = c("date", "age_group", "total_population",
"male_population", "female_population"),
col_types = cols(date = col_date("%B %d, %Y")))
national_pop_2000_2010 = read_csv(national_pop_2000_2010_url)
national_pop_2010_2020 = read_csv(national_2010_2020_pop_url)
We know the column names of the national_pop_1999
dataframe because we manually added them. What are the column names for other population data?
names(national_pop_2000_2010)
names(national_pop_2010_2020)
## [1] "SUMLEV" "REGION" "DIVISION"
## [4] "STATE" "NAME" "CENSUS2010POP"
## [7] "ESTIMATESBASE2010" "POPESTIMATE2010" "POPESTIMATE2011"
## [10] "POPESTIMATE2012" "POPESTIMATE2013" "POPESTIMATE2014"
## [13] "POPESTIMATE2015" "POPESTIMATE2016" "POPESTIMATE2017"
## [16] "POPESTIMATE2018" "POPESTIMATE2019" "POPESTIMATE2020"
Let's do some pre-processing and reshaping of the data to create a wide data set. Typically, record entry is done using a long data format where multiple variables are stored in one column. However, when analyses need to be performed on long data in a sequential way, for instance, in a time-series, the data needs to be wide. Wide data has each variable occupy it's own column instead of sharing a column as in long data.
national_pop_1999_wide = national_pop_1999 %>%
filter(date=="1999-12-01" & age_group=="All Age") %>%
mutate(state_name = "United States") %>%
rename("pop_estimate_1999"="total_population") %>%
select(state_name, pop_estimate_1999)
national_pop_2000_2010_filtered = national_pop_2000_2010 %>%
group_by(year) %>%
filter(AGEGRP==0) %>%
rename("value"="TOT_POP") %>%
mutate(state_name="United States", variable = paste("pop_estimate", year, sep="_")) %>%
slice(1) %>%
ungroup() %>%
select(state_name, variable, value)
reshape
library to transform 2000 to 2010 census data from long to wide formatnational_pop_2000_2010_wide = reshape2::dcast(national_pop_2000_2010_filtered,
state_name ~ variable,
value.var = "value", fun.aggregate = sum)
national_pop_2010_2020_wide = national_pop_2010_2020 %>%
filter(NAME %in% "United States") %>%
rename("state_fips" = "STATE", "state_name" = "NAME", "census_estimate_2010" = "CENSUS2010POP",
"base_estimate_2010" = "ESTIMATESBASE2010", "pop_estimate_2010" = "POPESTIMATE2010",
"pop_estimate_2011" = "POPESTIMATE2011", "pop_estimate_2012" = "POPESTIMATE2012",
"pop_estimate_2013" = "POPESTIMATE2013", "pop_estimate_2014" = "POPESTIMATE2014",
"pop_estimate_2015" = "POPESTIMATE2015", "pop_estimate_2016" = "POPESTIMATE2016",
"pop_estimate_2017" = "POPESTIMATE2017", "pop_estimate_2018" = "POPESTIMATE2018",
"pop_estimate_2019" = "POPESTIMATE2019", "pop_estimate_2020" = "POPESTIMATE2020") %>%
select(state_fips:pop_estimate_2020)
national_1999_2020_pop_wide = cbind(national_pop_1999_wide,
national_pop_2000_2010_wide %>% select(-state_name),
national_pop_2010_2020_wide %>% select(-state_fips:-pop_estimate_2010))
## state_name pop_estimate_1999 pop_estimate_2000 pop_estimate_2001
## 1 United States 280466621 281424600 284968955
## pop_estimate_2002 pop_estimate_2003 pop_estimate_2004 pop_estimate_2005
## 1 287625193 290107933 292805298 295516599
## pop_estimate_2006 pop_estimate_2007 pop_estimate_2008 pop_estimate_2009
## 1 298379912 301231207 304093966 306771529
## pop_estimate_2010 pop_estimate_2011 pop_estimate_2012 pop_estimate_2013
## 1 308745538 311583481 313877662 316059947
## pop_estimate_2014 pop_estimate_2015 pop_estimate_2016 pop_estimate_2017
## 1 318386329 320738994 323071755 325122128
## pop_estimate_2018 pop_estimate_2019 pop_estimate_2020
## 1 326838199 328329953 329484123
In the next post we will combine the mortality and population data to create mortality rates.