By Zachary Bolotte, Katherine O'Connor, and Elana Pocress
How did the differing COVID-19 policy prescriptions affect public health and economic conditions? We will specifically be measuring the difference in testing, positive, and death rates of Democratic- and Republican-affiliated states, the 4 regions of the United States, and CA vs. FL. By analyzing various public health and economic indicators, we seek to generalize the impacts of diametrically opposed COVID-19 prevention policies.
By analyzing the difference in health and economic indicators of California and Florida, we would like to generalize the impact of COVID prevention policies similar to those of California and Florida. We specifically chose California and Florida because of their similar testing rates and because they have radically different COVID prevention policies. Among all the states, California is considered one of the most restrictive, while Florida is considered one of the least.
Specifically, to measure the health impacts of the COVID-19 policies, we will be analyzing tests, positives, and deaths per capita for FL and CA over the 6 months after a policy was enacted, and over the entire time frame after the policy was enacted. To measure the economic health of the states, we will analyze the unemployment rate.
There are stark contrasts between the COVID guideline implementations in Democratic states, such as New York and California, and Republican states, like Florida. The partisanship of COVID guidelines has manifested itself in opposing stances on mask wearing, economic shutdowns, social distancing mandates, and venue capacity limitations.
At the beginning of the pandemic, states were quick to shut down businesses, regardless of their leadership's political affiliation. However, Republican-run states like Florida under Ron DeSantis were much quicker to reopen their economies, and even implemented less restrictive COVID guidelines while their businesses were in fact closed. In contrast, the Democratic-led states of New York and California, under Andrew Cuomo and Gavin Newsom respectively, had stricter regulations, some of which may be attributed to the fact that New York was the first United States COVID epicenter. The extent of this partisanship is exemplified by a study that found that US counties that voted for Donald Trump (R) were 14% less likely to exhibit social distancing measures than majority-Democratic counties. Another study has furthered this notion, finding that high consumption of Republican-leaning media outlets (like Fox News) is linked to areas with reduced social distancing measures (Gollwitzer et al. 2020).
Ultimately, starting at least a month after the pandemic began, COVID regulations differed greatly between Democratic- and Republican-led states. For example, while Governors DeSantis, Cuomo, and Newsom first issued stay-at-home orders by the end of March (March 28th, March 20th, and March 19th, respectively), Florida was the first state to both partially reopen its economy and begin the phase-in process to fully reopen. On May 17th, Governor DeSantis issued an executive order to phase in indoor dining, open retail stores, open non-essential businesses like museums and libraries (at 50% capacity), and open gyms/fitness centers. At this same point in time, New York had only begun to allow small gatherings of 10 people, and California had only announced that it would soon institute a tier system for reopening businesses. California began to institute these Florida-like policies a month later, reopening bars at limited capacity by June 16. Perhaps the most drastic and noteworthy shift in COVID policy was Florida’s move on September 25 to roll back effectively all statewide restrictions, allowing any and all businesses to operate at 100% capacity (this order also banned local governments from obstructing reopening).
Another great difference between Democratic- and Republican-led states lies in their policies regarding mask-wearing. In late October 2020, when Florida was averaging 3,000 positive cases a day, DeSantis announced that “Officials can require face masks, but [we] will not enforce it.” Subsequent executive orders in Florida prohibit local governments from enforcing mask rules or requiring vaccine passports. Meanwhile, both Newsom and Cuomo have been adamant about mask-wearing, as evidenced by the issuance and upholding of several executive orders. As recently as April 7, 2021, Newsom stated that “We don’t have any short term goals as it relates to lifting the mask mandate.” Cuomo has echoed similar sentiments.
While the implementation of COVID restrictions, or the lack thereof, has proven to be a somewhat partisan issue, the climate of specific regions is another complex factor in the spread of the coronavirus. According to one study, the envelope structures of COVID are “sensitive to physical and chemical conditions” such that they can be destabilized by heat (Kassem, 2020). Accordingly, regions with colder climates are more susceptible to COVID transmission than regions with hotter climates. A similar study has suggested that “high temperature and humidity, together, have a combined effect on inactivation of coronavirus while the opposite weather condition can support prolonged survival time of the virus on surfaces and facilitate the transmission and susceptibility of the viral agent” (Mecenas et al. 2020). This is highlighted to emphasize that the spread of COVID is multidimensional and not limited to any single factor. Etiological and political factors can be related, though. When comparing warmer states, like Florida (with more relaxed COVID restrictions), and colder states, like New York (with stricter COVID regulations), it is important to note the climates of these regions as an additional factor in the spread of the virus.
Sources:
https://www.nature.com/articles/s41562-020-00977-7
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from sklearn.linear_model import LinearRegression
## load data stored locally
us_covid = pd.read_csv("national-history.csv")
state_covid = pd.read_csv("all-states-history.csv")
california_unemployment = pd.read_csv("california_unemployment.csv")
florida_unemployment = pd.read_csv("florida_unemployment.csv")
state_unemployment = pd.read_csv('unemployment_data.csv')
NY_Covid_Policy = pd.read_csv("NY_Covid_Policy.csv")
FL_Covid_Policy = pd.read_csv("FL_Covid_Policy.csv")
state_info = pd.read_csv("state_info.csv")
""""
# Load data hosted on github
us_covid = pd.read_csv("https://raw.github.coecis.cornell.edu/erp49/INFO2950/master/clean_us_covid.csv?token=AAACQXVVA3UD74247Z5VQ4DAMU5D6")
state_covid = pd.read_csv("https://raw.github.coecis.cornell.edu/erp49/INFO2950/master/all-states-history.csv?token=AAACQXS4PR75L44XM6TPAEDAMU5YI")
NY_Unemployment = pd.read_csv("https://raw.github.coecis.cornell.edu/erp49/INFO2950/master/NY_UnemploymentClaims.csv?token=AAACQXWPNMK2OKGIZ2567LDAMU52M")
FL_Unemployment = pd.read_csv("https://raw.github.coecis.cornell.edu/erp49/INFO2950/master/FL_UnemploymentClaims.csv?token=AAACQXT7H5WUS3EFUR6VQI3AMU54G")
US_Unemployment = pd.read_csv("https://raw.github.coecis.cornell.edu/erp49/INFO2950/master/US_UnemploymentClaims.csv?token=AAACQXS7FBAXWKLL6LQ6SP3AMU554")
NY_Covid_Policy = pd.read_csv("https://raw.github.coecis.cornell.edu/erp49/INFO2950/master/NY_Covid_Policy.csv?token=AAACQXRYMI2FADUAAJ3UJ33AMU6A2")
FL_Covid_Policy = pd.read_csv("https://raw.github.coecis.cornell.edu/erp49/INFO2950/master/FL_Covid_Policy.csv?token=AAACQXW64BGDDXWHMVNWSLLAMU6C6")
"""""
#Change date to datetime object
state_covid['date']= pd.to_datetime(state_covid['date'])
us_covid['date']= pd.to_datetime(us_covid['date'])
state_unemployment['Date']= pd.to_datetime(state_unemployment['Date'])
def cleanUnemployment(df):
    """
    Replaces ' ' with '_' in column names and converts them to lowercase. Also converts the 'filed_week_ended' column to datetime objects.
    Parameter df: data frame to manipulate (modified in place)
    Precondition: df is a pandas dataframe whose cleaned column names include 'filed_week_ended'
    """
    new_names = list(df.columns.values)
    new_names = [k.lower() for k in new_names]
    new_names = [k.replace(' ', '_') for k in new_names]
    df.columns = new_names
    df['filed_week_ended'] = pd.to_datetime(df['filed_week_ended'])
cleanUnemployment(florida_unemployment)
cleanUnemployment(california_unemployment)
#Add population, political affiliation, and region information to state_covid dataframe
state_covid = pd.merge(state_covid, state_info, on='state')
#CALCULATIONS
#Calculate % of Tests that are Positive per Increase in Total Test Results
state_covid['percentPositiveIncrease'] = state_covid['positiveIncrease'] / state_covid['totalTestResultsIncrease']
us_covid['percentPositiveIncrease'] = us_covid['positiveIncrease'] / us_covid['totalTestResultsIncrease']
#Calculate Increase Rates per 100,000 for Testing, Positives and Deaths per day
perNum = 100000
US_population = 328.2 * 10**6
#Create each new column by dividing cols_of_interest by respective populations
cols_of_interest = ['totalTestResultsIncrease','positiveIncrease','deathIncrease','totalTestResults','positive', 'death']
new_cols = ['testingIncreaseRate', 'positiveIncreaseRate', 'deathIncreaseRate', 'testingRate', 'positiveRate', 'deathRate']
for i in range(len(new_cols)):
    new_c = new_cols[i]
    c = cols_of_interest[i]
    us_covid[new_c] = us_covid[c]/US_population * perNum
    state_covid[new_c] = state_covid[c]/state_covid['population'] * perNum
us_covid['percent_positiveIncreaseRate'] = us_covid["positiveIncrease"]/us_covid["totalTestResultsIncrease"] * perNum
state_covid['percent_positiveIncreaseRate'] = state_covid["positiveIncrease"]/state_covid["totalTestResultsIncrease"] * perNum
us_covid['percent_positiveRate'] = us_covid["positive"]/us_covid['totalTestResults']
state_covid['percent_positiveRate'] = state_covid["positive"]/state_covid['totalTestResults']
#Isolate CA and FL from state_covid
CA_covid = state_covid.loc[state_covid['state']=='CA']
FL_covid = state_covid.loc[state_covid['state']=='FL']
CA_unemployment = state_unemployment[['Date', 'CA']]
FL_unemployment = state_unemployment[['Date', 'FL']]
#-------------------------------------------------------------------------------------------------------------------------
#GROUP
#Get totals from most recent date on 3/07/2021
us_totals = us_covid.loc[us_covid['date']==pd.to_datetime('3-07-2021')]
state_totals = state_covid.loc[state_covid['date']==pd.to_datetime('3-07-2021')]
#Isolate columns of interest
us_totals = us_totals[['testingRate', 'positiveRate', 'percent_positiveRate', 'deathRate']]
state_totals = state_totals[['state', 'testingRate', 'positiveRate', 'percent_positiveRate','deathRate', 'Political_Affiliation', 'Region']]
#Group states by political Affiliation
state_politic_totals = state_totals.groupby(by='Political_Affiliation')
Dem_totals = state_politic_totals.get_group('Democratic')
Rep_totals = state_politic_totals.get_group('Republican')
#Group states by Region
state_region = state_totals.groupby(by='Region')
Midwest = state_region.get_group('Midwest')
Northeast = state_region.get_group('Northeast')
South = state_region.get_group('South')
West = state_region.get_group('West')
#Make Same Calculations Specific to People in Democratic and Republican States
state_covid_politic = state_covid[['date','deathIncrease','positiveIncrease', 'totalTestResultsIncrease', 'population', 'Political_Affiliation']].groupby(['Political_Affiliation'])
#Sum the daily data across all Republican and all Democratic states for each date
Dem_covid = state_covid_politic.get_group('Democratic')
Dem_covid = Dem_covid.groupby('date').sum()
Rep_covid = state_covid_politic.get_group('Republican')
Rep_covid = Rep_covid.groupby('date').sum()
#More calculations
cols_of_interest = ['totalTestResultsIncrease','positiveIncrease','deathIncrease']
new_cols = ['testingIncreaseRate', 'positiveIncreaseRate', 'deathIncreaseRate',]
for i in range(len(new_cols)):
    new_c = new_cols[i]
    c = cols_of_interest[i]
    Dem_covid[new_c] = Dem_covid[c]/Dem_covid['population'] * perNum
    Rep_covid[new_c] = Rep_covid[c]/Rep_covid['population'] * perNum
dem_states = list(Dem_totals['state'])
rep_states = list(Rep_totals['state'])
Dem_unemployment = state_unemployment[dem_states].mean(axis=1)
Rep_unemployment = state_unemployment[rep_states].mean(axis=1)
unemployment_politic = pd.DataFrame(np.array([state_unemployment['Date'], Dem_unemployment, Rep_unemployment]).transpose(), columns=['Date', 'Democrat', 'Republican'])
print(us_covid['testingRate'].mean())
32704.498431560332
#Download CSVs of cleaned data to local drive
#(the block is kept disabled in a raw string so the '\U' in the Windows paths is not read as a unicode escape)
r"""
us_covid.to_csv (r'C:\Users\kathe\OneDrive\Documents\S2-DESKTOP-S9R0JMM\INFO_2950\Final Project\clean_us_covid.csv', index = False, header=True)
state_covid.to_csv(r'C:\Users\kathe\OneDrive\Documents\S2-DESKTOP-S9R0JMM\INFO_2950\Final Project\clean_state_covid.csv', index = False, header=True)
CA_covid.to_csv (r'C:\Users\kathe\OneDrive\Documents\S2-DESKTOP-S9R0JMM\INFO_2950\Final Project\clean_CA_covid.csv', index = False, header=True)
FL_covid.to_csv (r'C:\Users\kathe\OneDrive\Documents\S2-DESKTOP-S9R0JMM\INFO_2950\Final Project\clean_FL_covid.csv', index = False, header=True)
state_totals.to_csv (r'C:\Users\kathe\OneDrive\Documents\S2-DESKTOP-S9R0JMM\INFO_2950\Final Project\clean_state_totals.csv', index = False, header=True)
"""
https://covidtracking.com/data/download
This large file contains day-by-day COVID-19 data for all 50 US states, with each row representing one state on one day (i.e. one entry per state, per day, from 03/07/2020 to 03/07/2021). Column attributes include positive and negative test results, the number of tests administered, hospitalizations, ventilator use, ICU capacity, and the daily change in each metric. This dataset was a volunteer effort published by The Atlantic with funding from The Rockefeller Foundation, among other large institutions, and is widely regarded as the largest compilation of such data. It is unlikely that these publishers had control over what data was observed, but they did rely on the data reported by the states themselves, which has faced allegations of manipulation by the states (the governors of both NY and FL have been accused in the national media of falsifying reports). The underlying data, as reported by the states
https://oui.doleta.gov/unemploy/claims.asp
The columns represent the state (California or Florida), the week filed for unemployment, the number of initial claims, the reflecting week end date, the number of continued claims, those covered under unemployment, and the insured unemployment rate. This dataset was created to provide information about weekly unemployment claims in the particular states. This report was funded by the United States Department of Labor. The data that was recorded was influenced completely by individuals who filed for unemployment. There may be individuals, such as undocumented residents, who lost their jobs but did not report it. The people involved in this data collection are the individuals who filed for unemployment. Since unemployment is filed with the state, this information has been anonymized by removing personal information. It is not likely that individuals are explicitly told when filing for unemployment that their information will be part of a larger dataset.
https://www.bls.gov/charts/state-employment-and-unemployment/state-unemployment-rates-animated.htm
The rows of this table are the 50 states and the columns are every month from March 2011 to March 2021. This dataset was created to indicate the seasonally adjusted state unemployment rates of the past 10 years. The Bureau of Labor Statistics is funded by the US Department of Labor, which is a US governmental agency. Since the unemployment rate depends on the number of unemployment claims, individuals must file for unemployment on their own terms. As a result, wealthier unemployed individuals may not file because they do not need unemployment assistance, so the numbers may not perfectly reflect the exact rate. The data was displayed in a table, but was also shown in an interactive map format. The raw table data was used, which did not require preprocessing.
Basic information of the population, region, and political affiliation based on the 2020 election for each of the 50 states was compiled into this spreadsheet. The information was gathered from the following sources.
https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html
The columns represent the estimated population of the entire United States, the 4 regions of the United States, each individual state, and Puerto Rico. It was created by the United States Census Bureau and was funded by the federal government in order to estimate population. All of the estimates were made based on data from the 2010 Census where all US residents were mailed surveys. Census workers were sent to the addresses that did not respond. Because of the nature of the data collection, people without permanent residences may be underrepresented.
https://www.archives.gov/electoral-college/2020
The columns represent the number of electoral votes per state, and the number of electoral votes that went to Joe Biden, Donald Trump, Kamala Harris, and Michael Pence in the 2020 Presidential election. The data was collected by the U.S. National Archives and Records Administration.
https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf
The United States Census Bureau divided the 50 U.S. states into the Northeast, South, West and Midwest regions. The states that each region contains are listed in the above document.
https://www.gov.ca.gov/california-takes-action-to-combat-covid-19/
The columns represent the start dates (and end dates, if applicable) of each policy enacted by the state, the governor, or local authorities. This set provides a comprehensive timeline of the most important policies, restrictions, and guidelines instituted by California authorities. Though not exhaustive, this dataset provides information that is consistent with other states' policies, such as mask mandates, stay-at-home orders, and executive orders, for comparative purposes. This dataset was created by the Office of Governor Gavin Newsom to report a collection of previous policies by reviewing releases issued by their office and/or other local officials. This dataset is comprehensive enough that there are no obvious biases that could have arisen through selective reporting of data.
https://ballotpedia.org/Documenting_Florida%27s_path_to_recovery_from_the_coronavirus_(COVID-19)_pandemic,_2020-2021
The columns represent the date of the policy enactment and a description of the policy itself. This dataset was created to provide a timeline of the US policies regarding COVID and the restrictions/guidelines that were specific to Florida. This dataset was compiled through the work of journalists reporting on state and federal government policies. This sample of data does not appear to be biased because it is not skewed either in favor of or against the policies enacted.
Objective:
Using the Difference in Differences (DD) model, we seek to measure the treatment on the treated (TOT) effect on the day-over-day positivity rate changes - using 09/25 as treatment date, FL as treated, and CA as untreated. Ultimately, a statistically significant sigma ($\sigma$) could point towards a true difference (for better or worse) sparked by the 'treatment' of removing restrictions and completely reopening businesses.
In practice, this analysis will serve us in understanding the efficacy of state-mandated COVID-regulations; namely whether, and to what degree, economically damaging policies actually fulfill their purpose in slowing the spread of COVID-19 (i.e. lowering the rate at which positive tests accumulate). The DD model is great in these circumstances because it does not require homogeneity between FL and CA; instead, it relies on pre- and post-treatment observations while assuming constant time-series effects to make judgement.
Should this finding be interesting and/or significant, we may also choose to apply this model to assess similar effects in death rate changes.
Formula:
$y_i = \alpha + \beta treatment_i + \gamma post_i + \sigma (treatment_i \cdot post_i) + \epsilon_i$
Hypothesis:
$H_0: \sigma = 0 $ (TOT = 0; Florida's reopening 'treatment' has no effect on positive test increase rate)
Where:
- $y_i$: increase in positive COVID test rates
- $\alpha$: constant effects term
- $\beta$: predisposed effect of being in treatment group - regardless of treatment (FL)
- $treatment_i$: binary variable - 1 if observation made in Florida, 0 otherwise
- $\gamma$: effect of common timeframe (i.e. changes that came about due to time in FL and CA)
- $post_i$: binary variable - 1 if observation is post-treatment (09/25), 0 otherwise
- $\sigma$: coefficient measuring the true effect on the treated (i.e. effect of FL's reopening strategy)
- $\epsilon_i$: error term
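The DD specification above can be sketched as a least-squares fit with an interaction column. The example below is only an illustration on synthetic data, not the project's analysis: the column names (`y`, `treatment`, `post`), the 200-day toy panel, and the "true" parameter values are all assumptions made for the example. It also checks that the interaction coefficient matches the difference of the four group means, which is the usual way to read $\sigma$.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy panel: 200 daily observations for each of two states.
n_days = 200
frame = pd.DataFrame({
    "treatment": np.repeat([0, 1], n_days),                       # 0 = CA, 1 = FL
    "post": np.tile((np.arange(n_days) >= 100).astype(int), 2),   # 1 after treatment date
})

# Assumed parameters, used only to generate the synthetic outcome.
alpha, beta, gamma, sigma = 10.0, 2.0, 3.0, 1.5
frame["y"] = (
    alpha
    + beta * frame["treatment"]
    + gamma * frame["post"]
    + sigma * frame["treatment"] * frame["post"]
    + rng.normal(0, 0.5, size=len(frame))
)

# Design matrix columns: intercept, treatment, post, treatment * post.
X = np.column_stack([
    np.ones(len(frame)),
    frame["treatment"],
    frame["post"],
    frame["treatment"] * frame["post"],
])
coefs, *_ = np.linalg.lstsq(X, frame["y"].to_numpy(), rcond=None)
alpha_hat, beta_hat, gamma_hat, sigma_hat = coefs

# The same quantity computed directly as a difference in differences of group means.
means = frame.groupby(["treatment", "post"])["y"].mean()
dd_from_means = (means.loc[(1, 1)] - means.loc[(1, 0)]) - (means.loc[(0, 1)] - means.loc[(0, 0)])

print(f"sigma from regression:  {sigma_hat:.4f}")
print(f"sigma from group means: {dd_from_means:.4f}")
```

This sketch only recovers the point estimate. Testing $H_0$ would additionally need a standard error for the interaction coefficient, which a regression library can report.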
Objective:
Using a series of ordinary least squares (OLS) linear regression models, we seek to measure the effect of being in a Republican state (FL) on the rate of increase in positive COVID-19 tests. Rather than a single regression model, we plan to produce a publication-style regression table containing the results of a number of regressions with varied timeframes, control variables, and, potentially, relevant instrumental variables as well.
In practice, this will serve us in highlighting the correlative difference between states led by Republicans and Democrats, ultimately analyzing both the efficacy and impacts of related policies. This is important, in no small part, as a means of assessing (1) whether COVID policies actually yielded positive results and (2) if so, what specific aspects played roles in producing such results.
Formula:
$y_i = \alpha + \beta_1 Florida_i + \beta_2 Controls_i + \epsilon_i$
Hypothesis:
$H_0: \beta_1 = 0 $ (being in FL (as opposed to CA) has no effect on positive test increase rate)
Where:
- $y_i$: increase in positive COVID test rates
- $\alpha$: constant effects term
- $\beta_1$: effect of being in Florida
- $Florida_i$: binary variable - 1 if observation made in Florida, 0 otherwise
- $\beta_2$: effect of control variables
- $Controls_i$: collection of variables, other than being in FL, which may affect $y_i$
- $\epsilon_i$: error term
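One way the planned regression table could be assembled is sketched below. This is an illustration under stated assumptions rather than the project's implementation: the data are synthetic, the control variable `testing_rate` and the column names `florida` and `y` are hypothetical, and the use of `statsmodels.formula.api.ols` is one possible choice of tool.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300

# Hypothetical observations: a Florida indicator and one illustrative control.
toy = pd.DataFrame({
    "florida": rng.integers(0, 2, size=n),          # 1 = FL, 0 = CA
    "testing_rate": rng.normal(300, 50, size=n),    # assumed control variable
})
# Assumed data-generating process, used only to create the toy outcome.
toy["y"] = 5.0 + 2.0 * toy["florida"] + 0.03 * toy["testing_rate"] + rng.normal(0, 1.0, size=n)

# Each entry is one column of the regression table.
specs = {
    "(1) no controls": "y ~ florida",
    "(2) with testing control": "y ~ florida + testing_rate",
}

rows = {}
for name, formula in specs.items():
    fit = smf.ols(formula, data=toy).fit()
    rows[name] = {
        "beta_1 (florida)": fit.params["florida"],
        "std. error": fit.bse["florida"],
        "p-value": fit.pvalues["florida"],
        "R-squared": fit.rsquared,
    }

table = pd.DataFrame(rows)
print(table.round(4))
```

Keeping the specifications in a dictionary makes it straightforward to add columns for other timeframes or control sets later without changing the fitting loop.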
In the future, we plan to conduct ANOVA tests to determine whether the differences in mean COVID testing rates, positive rates, and death rate totals as of 3/07/2021 among the 4 regions of the United States are statistically significant. If the p-value of an ANOVA test is less than 0.05, we will conduct a Tukey test to determine which regions' differences in means are statistically significant. For both tests, we will use a significance level of 0.05.
We also plan to conduct ANOVA tests to determine whether the differences in mean COVID testing rates, positive rates, and death increase rates among California, Florida, and the US average are statistically significant over specific time periods, using the same procedures as stated above. We also want to conduct these same tests on subsets of the data that contain only certain time periods after Florida began reopening. We plan to test the entire timeframe, only the period after Florida reopened, and the 6 months after Florida reopened.
$H_0:$ the true mean testing, positive, and death rates of all 4 regions of the United States are equal
$H_0:$ the true mean testing, positive, and death rates of CA, FL, and the United States as a whole are equal
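The three time windows described above could be built along the following lines. This is a sketch on synthetic data: the toy values, the choice of `positiveIncreaseRate` as the tested column, and the assumption that the 09/25 reopening date falls in 2020 are all made for illustration, and only two groups (CA and FL) are compared here for brevity.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical daily data for two states over the notebook's date range.
dates = pd.date_range("2020-03-07", "2021-03-07", freq="D")
toy = pd.DataFrame({
    "date": np.tile(dates.to_numpy(), 2),
    "state": np.repeat(["CA", "FL"], len(dates)),
    "positiveIncreaseRate": rng.gamma(shape=2.0, scale=10.0, size=2 * len(dates)),
})

reopen_date = pd.Timestamp("2020-09-25")                  # assumed year for the 09/25 date
six_months_later = reopen_date + pd.DateOffset(months=6)

# One subset per analysis window.
after = toy["date"] >= reopen_date
windows = {
    "entire timeframe": toy,
    "after reopening": toy[after],
    "first 6 months after reopening": toy[after & (toy["date"] < six_months_later)],
}

for label, subset in windows.items():
    # One array of daily rates per state, passed to the one-way ANOVA.
    groups = [g["positiveIncreaseRate"].to_numpy() for _, g in subset.groupby("state")]
    result = stats.f_oneway(*groups)
    print(f"{label}: rows={len(subset)}, F={result.statistic:.3f}, p={result.pvalue:.3f}")
```

Under these assumptions the six-month window ends after the last observation on 3/07/2021, so the second and third windows contain the same rows in this toy setup.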
There are many differences across the country that could affect these factors, and we are trying our best to account for them. We suspect that the more rural midwestern and southern regions of the country will have less testing and therefore fewer recorded positives and deaths. We suspect they will have less testing because they are more rural and will have lower accessibility to testing sites. However, these isolated areas are also less likely to contract COVID-19 than urbanized areas that have more frequent testing. The midwestern and southern regions of the country are generally more Republican-dominated, and Republican-dominated areas have a history of being less adherent to COVID-19 guidelines. We also suspect that Florida will have lower testing rates, and higher positive and death rates, because of its less strict COVID policies, which reflect the public's general attitude toward adherence to COVID guidelines.
We are limited first and foremost by the datasets that were available to us. There were many other factors that we considered taking into account but for which there is no publicly available data. Some datasets are not updated to recent dates, so we must either extrapolate based on past trends or stop our analyses at the dates at which the datasets end. It will be difficult to control for all lifestyle differences between FL and NY which could otherwise affect the spread of COVID-19 (i.e. feasibility of being outdoors where spread is less likely, access to private transportation, etc.). We cannot account for individuals who travelled from New York to Florida, or vice versa. This is particularly notable given the number of ‘snowbirds’ who occupy NY in the summer and FL in the winter, as those migration patterns may spread COVID-19 between the two states and represent an infection in one state that was actually caused by the other state's actions.
First, we looked at some basic summary statistics. We calculated the total deathRate, positiveRate, and testingRate for CA, FL, the US, and all Democratic and Republican states. Each rate was calculated by taking the total number of deaths, positives, and tests as of 3/07/2021 and dividing it by the respective total population. For example, if the US has a death rate of 0.00157, it means that about 0.157% of the population has died from COVID. If it has a positive rate of 0.087618, then 8.76% of the population has tested positive for COVID. If it has a testing rate of 1.108546, then there have been about 1.1 tests for every American as of 3/07/2021. Note that the code multiplies each of these three rates by 100,000, so the printed values are per 100,000 people (for example, a death rate of 0.00157 is printed as roughly 157).
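As a quick worked check of the arithmetic in this paragraph, the sketch below converts the US totals from their per-100,000 form back into per-person fractions and percentages. The input values are copied from the notebook's printed US summary statistics; the helper name `per_100k_to_fraction` is hypothetical and introduced only for this example.

```python
PER_NUM = 100_000  # same scaling constant as perNum in the notebook


def per_100k_to_fraction(rate_per_100k):
    """Convert a per-100,000 rate back to a plain per-person fraction."""
    return rate_per_100k / PER_NUM


# US totals as printed in the summary statistics (per 100,000 people).
us_rates = {
    "deathRate": 156.961609,
    "positiveRate": 8761.786715,
    "testingRate": 110854.606338,
}

fractions = {name: per_100k_to_fraction(value) for name, value in us_rates.items()}
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.6f} per person ({fraction:.3%})")
```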
s1 = state_totals.loc[(state_totals['state'] == 'CA')]
s2 = state_totals.loc[(state_totals['state'] == 'FL')]
states_of_interest = pd.concat([s1,s2]).groupby(by='state')
print('US Summary Statistics')
print(us_totals, '\n')
#CA and FL each have only one row, so taking their means is not strictly needed; it is done here for formatting.
print('CA and FL Summary Statistics')
print(states_of_interest.mean(numeric_only=True), '\n')
print('US States grouped by Political Affiliation Summary Statistics')
print(state_politic_totals.mean(numeric_only=True))
US Summary Statistics
     testingRate  positiveRate  percent_positiveRate   deathRate
0  110854.606338   8761.786715              0.079039  156.961609

CA and FL Summary Statistics
         testingRate  positiveRate  percent_positiveRate   deathRate
state
CA     125647.230732   8861.546464              0.070527  136.980397
FL     104010.874144   8889.246572              0.085465  150.229980

US States grouped by Political Affiliation Summary Statistics
                         testingRate  positiveRate  percent_positiveRate  \
Political_Affiliation
Democratic             132762.194179   7626.600673              0.063344
Republican              96077.249994   9799.530075              0.123030

                        deathRate
Political_Affiliation
Democratic             144.150750
Republican             150.203195
Based on the above information, it seems that California and Florida have similar positive rates and death rates, which are both close to the national average. It appears that California had more testing per capita than both Florida and the national average.
The testing and positive rates seemed different for Democratic and Republican states. The death rates looked similar. I wanted to determine if the differences in rates were significant for Democratic and Republican states, so I conducted a t-test for each rate. I was also curious about the distribution of rates among Democratic and Republican states, so I drew boxplots.
cols_of_interest = list(Dem_totals.columns.values)[1:-2]
titles = ['Testing Rate Totals', 'Positives per 100,000 Totals', 'Percent Positive of Total Tests','Death Rate Totals']
print(cols_of_interest)
for i in range(len(titles)):
    col = cols_of_interest[i]
    title = 'Democrat vs. Republican State ' + titles[i]
    plt.figure(figsize=(4,2))
    plt.boxplot([Rep_totals[col], Dem_totals[col]], labels=['Republican', 'Democratic'], vert=False)
    plt.title(title, fontsize=15)
    plt.show()
    print('Democratic:', Dem_totals[col].mean())
    print('Republican:', Rep_totals[col].mean())
    ttest = stats.ttest_ind(Rep_totals[col], Dem_totals[col])
    is_significant = 'SIGNIFICANT' if ttest.pvalue < 0.05 else 'NOT SIGNIFICANT'
    print(titles[i] + ":", ttest)
    print(is_significant, '\n')
['testingRate', 'positiveRate', 'percent_positiveRate', 'deathRate']
Democratic: 132762.19417880994
Republican: 96077.24999386033
Testing Rate Totals: Ttest_indResult(statistic=-2.647825761328732, pvalue=0.010867406400506143)
SIGNIFICANT

Democratic: 7626.600673209203
Republican: 9799.53007478047
Positives per 100,000 Totals: Ttest_indResult(statistic=3.5463011201158405, pvalue=0.0008712856003042247)
SIGNIFICANT

Democratic: 0.06334440208823938
Republican: 0.1230298841034768
Percent Positive of Total Tests: Ttest_indResult(statistic=4.691813695053396, pvalue=2.2063470586887677e-05)
SIGNIFICANT

Democratic: 144.15075007752637
Republican: 150.20319512708153
Death Rate Totals: Ttest_indResult(statistic=0.3756884881628496, pvalue=0.7087700622759839)
NOT SIGNIFICANT
Democratic states had significantly more testing than Republican states. Republican states had significantly more positives per 100,000 people than Democratic states, despite conducting fewer COVID tests. Republican states also had a significantly higher proportion of positive tests than Democratic states. However, the difference in death rates was not significant.
print(cols_of_interest)
titles = ['Testing Rates', 'Positives per 100,000', 'Percent Positive of Total Tests', 'Death Rates']
for i in range(len(titles)):
    col = cols_of_interest[i]
    title = 'Region ' + titles[i]
    data = [ Midwest[col], Northeast[col], South[col], West[col] ]
    regions = ['Midwest', 'Northeast', 'South', 'West']
    plt.boxplot(data, labels=regions, vert=False)
    plt.title(title, fontsize=25)
    plt.show()
    anova = stats.f_oneway(Midwest[col], Northeast[col], South[col], West[col])
    is_significant = 'SIGNIFICANT' if anova.pvalue < 0.05 else 'NOT SIGNIFICANT'
    print(titles[i] + ":", anova)
    print(is_significant, '\n')
    if is_significant == 'SIGNIFICANT':
        tukey = pairwise_tukeyhsd(endog=state_totals[col], groups=state_totals['Region'], alpha=0.05)
        print(tukey)
['testingRate', 'positiveRate', 'percent_positiveRate', 'deathRate']
Testing Rates: F_onewayResult(statistic=4.3419400642997505, pvalue=0.008820376813806223)
SIGNIFICANT

          Multiple Comparison of Means - Tukey HSD, FWER=0.05
======================================================================
  group1    group2    meandiff  p-adj     lower        upper    reject
----------------------------------------------------------------------
  Midwest Northeast  74707.6149 0.0047   18676.5087 130738.7211   True
  Midwest     South  23905.7597 0.5444  -24003.0457  71814.5652  False
  Midwest      West  34160.5844 0.2917  -16706.7679  85027.9366  False
Northeast     South -50801.8552 0.0604 -103182.6864   1578.9761  False
Northeast      West -40547.0305 0.2178   -95646.807  14552.7459  False
    South      West  10254.8246    0.9  -36561.3493  57070.9985  False
----------------------------------------------------------------------

Positives per 100,000: F_onewayResult(statistic=0.6478333291613055, pvalue=0.5882440066485599)
NOT SIGNIFICANT

Percent Positive of Total Tests: F_onewayResult(statistic=4.574690442292593, pvalue=0.006837864326236779)
SIGNIFICANT

    Multiple Comparison of Means - Tukey HSD, FWER=0.05
==========================================================
  group1    group2  meandiff p-adj   lower   upper  reject
----------------------------------------------------------
  Midwest Northeast  -0.0702 0.0111 -0.1278 -0.0126   True
  Midwest     South  -0.0351 0.2433 -0.0843  0.0142  False
  Midwest      West    -0.06 0.0188 -0.1122 -0.0077   True
Northeast     South   0.0352 0.3151 -0.0187   0.089  False
Northeast      West   0.0103    0.9 -0.0463  0.0669  False
    South      West  -0.0249 0.5177  -0.073  0.0232  False
----------------------------------------------------------

Death Rates: F_onewayResult(statistic=4.4129931741535975, pvalue=0.008158962572390084)
SIGNIFICANT

      Multiple Comparison of Means - Tukey HSD, FWER=0.05
=============================================================
  group1    group2  meandiff p-adj    lower     upper  reject
-------------------------------------------------------------
  Midwest Northeast  30.0732 0.5523  -30.8454  90.9918  False
  Midwest     South -10.6743    0.9  -62.7621  41.4135  False
  Midwest      West -49.2102 0.0971 -104.5146   6.0942  False
Northeast     South -40.7475 0.2396  -97.6975  16.2024  False
Northeast      West -79.2834 0.0051 -139.1895 -19.3774   True
    South      West -38.5359 0.1966  -89.4357   12.364  False
-------------------------------------------------------------
At least one pairwise difference among the four US regions was statistically significant for average testing rate, percent of tests that were positive, and death rate. The difference in the number of positives per capita was not significant.
The Midwest and the Northeast had statistically significant differences in testing rate and in the percent of tests that returned positive. This makes sense, since the Midwest is largely rural while the Northeast is more urban. Because testing is harder to access in rural areas, a greater proportion of the people who do get tested in the Midwest may already have COVID symptoms. In the Northeast, testing is much more easily accessible, and some schools and institutions require people to get tested regularly, which would be expected to increase the number of negative tests in that region.
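The Midwest's percent of positive tests is also significantly higher than that of the West (p-adj = 0.0188). The Northeastern testing rate is higher than the testing rate of the West, but that difference is not statistically significant (p-adj = 0.2178).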
The Northeastern death rate is significantly higher than the death rate of the West. This is likely because the spread of COVID started in the Northeast, and New York City was hit especially hard in terms of COVID deaths at the beginning of the pandemic.
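To make the regional comparison above more concrete, here is a minimal sketch of how a one-way ANOVA F statistic is built from between-group and within-group variation. The function name `one_way_anova_f` and the toy values are hypothetical and are not the project's data; `stats.f_oneway` computes the same statistic and additionally returns a p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)            # number of groups (e.g. regions)
    n_total = len(all_values)  # total number of observations (e.g. states)

    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical per-state values for three made-up regions (illustration only).
toy_regions = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(toy_regions)
print(round(f_stat, 4))
```

A large F only says that at least one group mean differs from the others; it does not say which one, which is why a significant result is followed by the pairwise Tukey HSD comparison.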
cols_of_interest = ['testingIncreaseRate', 'positiveIncreaseRate', 'deathIncreaseRate']
ylab = ['New Tests per 100,000 People','New Positives Increase Per 100,000 People','New Deaths per 100,000 People']
titles = ['Testing Increase Rate', 'Positive Increase Rate', 'Death Increase Rate']
for i in range(len(cols_of_interest)):
    plt.figure(figsize=(24,8))
    c = cols_of_interest[i]
    title = titles[i]
    plt.plot(Dem_covid[c].rolling(7, center=True).mean(), alpha=0.8, label='Democratic')
    plt.plot(Rep_covid[c].rolling(7, center=True).mean(), color='red', alpha=0.8, label='Republican')
    plt.xlim(pd.to_datetime('2020-03-1'),pd.to_datetime('2021-03-07'))
    plt.xlabel('Date')
    plt.ylabel(ylab[i])
    plt.title('Democratic vs. Republican {}'.format(title), fontsize=25)
    plt.legend(loc=2, fontsize=16)
    plt.show()
    ttest = stats.ttest_ind(Dem_covid[c], Rep_covid[c])
    print('Democratic Average of {}: {}'.format(title, Dem_covid[c].mean()))
    print('Republican Average of {}: {}'.format(title, Rep_covid[c].mean()))
    print(ttest)
    is_significant = 'SIGNIFICANT' if ttest.pvalue < 0.05 else 'NOT SIGNIFICANT'
    print(is_significant, '\n')
Democratic Average of Testing Increase Rate: 303.6026000566923
Republican Average of Testing Increase Rate: 216.70698215825058
Ttest_indResult(statistic=6.2262532601951275, pvalue=7.612785511483494e-10)
SIGNIFICANT

Democratic Average of Positive Increase Rate: 19.6353677792971
Republican Average of Positive Increase Rate: 23.23090079258996
Ttest_indResult(statistic=-2.4228218786244744, pvalue=0.015615598133711831)
SIGNIFICANT

Democratic Average of Death Increase Rate: 0.37894826192270986
Republican Average of Death Increase Rate: 0.3770474425602762
Ttest_indResult(statistic=0.07734031594345926, pvalue=0.9383716402875895)
NOT SIGNIFICANT
The above data was scaled by population: the Democratic and Republican daily rates were calculated by summing all new tests, positives, and deaths per day across the Democratic or Republican states and dividing by the summed population of those states. As a result, the COVID policies of the most populous states affect the data the most.
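The population scaling described above can be sketched as follows. This is an illustration under stated assumptions: the function names, field names, and the two toy states are hypothetical, and the numbers are not the project's data. It contrasts the pooled rate (total events over total population) with an unweighted average of state-level rates.

```python
def pooled_rate_per_100k(states):
    """Pooled rate: total events divided by total population, per 100,000 people."""
    total_events = sum(s["new_tests"] for s in states)
    total_population = sum(s["population"] for s in states)
    return 100_000 * total_events / total_population


def unweighted_mean_rate_per_100k(states):
    """Simple average of each state's own rate, ignoring population size."""
    rates = [100_000 * s["new_tests"] / s["population"] for s in states]
    return sum(rates) / len(rates)


# Two hypothetical states: a large one testing heavily and a small one testing little.
toy_states = [
    {"name": "Large", "population": 10_000_000, "new_tests": 40_000},
    {"name": "Small", "population": 1_000_000, "new_tests": 1_000},
]

pooled = pooled_rate_per_100k(toy_states)
unweighted = unweighted_mean_rate_per_100k(toy_states)
print(pooled, unweighted)
```

The pooled figure sits much closer to the large state's own rate, which is the sense in which the most populous states dominate the Democratic and Republican series.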
The differences in average daily testing rate and positive rate between Republican and Democratic states were statistically significant over the entire time frame.
Interestingly, the testing rates of Democratic and Republican states were very similar until July 2020, when the Republican testing rate fell below the Democratic testing rate. Around this time, the positive and death rates of Republican states also rose above those of Democratic states.
We got a general sense that death rates increased as testing rates increased, which is the opposite of what we expected. To analyze this relationship further, we graphed the daily death rate against the daily testing rate for Democratic and Republican states.
plt.figure(figsize=(16,8))
plt.scatter(Dem_covid['testingIncreaseRate'], Dem_covid['deathIncreaseRate'], alpha=0.4, label='Democratic')
plt.scatter(Rep_covid['testingIncreaseRate'], Rep_covid['deathIncreaseRate'], alpha=0.4, color='red', label='Republican')
lr = LinearRegression()
lr.fit(np.array(Dem_covid['testingIncreaseRate']).reshape(-1,1), Dem_covid['deathIncreaseRate'])
print('Democratic Estimate: y = {}x + {}'.format(lr.coef_[0], lr.intercept_))
plt.plot(Dem_covid['testingIncreaseRate'], lr.predict(np.array(Dem_covid['testingIncreaseRate']).reshape(-1,1)), color='blue')
lr.fit(np.array(Rep_covid['testingIncreaseRate']).reshape(-1,1), Rep_covid['deathIncreaseRate'])
print('Republican Estimate: y = {}x + {}'.format(lr.coef_[0], lr.intercept_))
plt.plot(Rep_covid['testingIncreaseRate'], lr.predict(np.array(Rep_covid['testingIncreaseRate']).reshape(-1,1)), color='red')
plt.xlabel('Testing Increase Rate')
plt.ylabel('Death Increase Rate')
plt.title('Death Increase Rate vs. Testing Increase Rate')
plt.legend()
plt.show()
Democratic Estimate: y = 0.0007570574533882543x + 0.1491036506817379
Republican Estimate: y = 0.0017374746060789788x + 0.0005245641003060464
There appears to be a positive correlation between testing and death rates for both Republican and Democratic states. This suggests a major flaw in the data collection. It is likely that only the deaths of people who tested positive for COVID were reported. Since the testing rate was lower at the beginning of the pandemic, fewer deaths were probably recorded then. For this reason, the data may not accurately represent the true population.
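One way to see how such a positive slope can arise without testing causing deaths is a toy example in which both series simply grow over time. This is a sketch under that assumption only: the helper `ols_slope_intercept` and the series below are hypothetical and do not come from the project's data.

```python
from statistics import mean


def ols_slope_intercept(x, y):
    """Least-squares slope and intercept for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical daily series: both depend only on the day index, not on each other.
days = list(range(10))
tests = [50 + 30 * d for d in days]       # testing capacity ramps up over time
deaths = [0.1 + 0.05 * d for d in days]   # recorded deaths also rise over time

slope, intercept = ols_slope_intercept(tests, deaths)
print(slope, intercept)
```

The fitted slope is positive even though, by construction, neither series influences the other. A regression of deaths on testing therefore cannot distinguish a shared time trend (or a reporting effect like the one described above) from a real relationship.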
On 5/04/2020, Florida began reopening restaurants, stores, libraries, and museums at 25% capacity. On 6/11/2020, it also opened bars and movie theaters and extended capacity to 50%. On 9/25/2020, Florida fully reopened all businesses at 100% capacity and announced that there would be no penalties for not wearing masks. Meanwhile, California does not plan to have a full reopening until July 2021, although it opened bars, restaurants, and salons at limited capacity on 6/12/2020. Given these drastically different COVID policies, we expected Florida's positive and death rates to be higher than those of California and the US average.
We conducted ANOVA tests to compare the average daily testing, positive, and death rates of California, Florida, and the U.S. national average. We conducted these tests on the entire time frame, on the period after 5/04/2020, and on the 6 months after 5/04/2020 to determine whether Florida's testing, positive, or death rates were significantly higher or lower than California's or the national average.
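The three comparison windows can be sketched as inclusive date ranges. The window boundaries follow the notebook's own time frames; the helper `window_mean` and the toy records are hypothetical and only illustrate how the same series yields a different average in each window.

```python
from datetime import date


def window_mean(records, start, end):
    """Mean of 'value' over records whose date falls in [start, end], inclusive."""
    values = [r["value"] for r in records if start <= r["date"] <= end]
    return sum(values) / len(values)


# The three windows compared in the analysis.
windows = {
    "Full Time Frame": (date(2020, 3, 1), date(2021, 3, 7)),
    "After Reopening on 5/04/2020": (date(2020, 5, 4), date(2021, 3, 7)),
    "6 Months After Reopening on 5/04/2020": (date(2020, 5, 4), date(2020, 11, 4)),
}

# Hypothetical daily values for one state (illustration only).
toy_records = [
    {"date": date(2020, 4, 1), "value": 10.0},
    {"date": date(2020, 6, 1), "value": 20.0},
    {"date": date(2020, 12, 1), "value": 60.0},
]

for label, (start, end) in windows.items():
    print(label, window_mean(toy_records, start, end))
```

Because the windows are nested, a late surge (the December value here) raises the first two averages but not the 6-month one, so a difference can be significant in one window and not in another.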
us_covid['state'] = 'US'
print(us_covid['testingRate'].mean())
y_coord = [900, 110, 1.4]
timeframe = [['2020-03-1','2021-03-07'],['2020-05-4','2021-03-07'],['2020-05-4','2020-11-4']]
time_frame_titles = ['Full Time Frame', 'After Reopening on 5/04/2020','6 Months After Reopening on 5/04/2020']
for i in range(len(cols_of_interest)):
    plt.figure(figsize=(24,8))
    c = cols_of_interest[i]
    title = titles[i]
    plt.plot(CA_covid['date'], CA_covid[c].rolling(7, center=True).mean(), alpha=0.8, label='CA')
    plt.plot(FL_covid['date'], FL_covid[c].rolling(7, center=True).mean(), color='red', alpha=0.8, label='FL')
    plt.plot(us_covid['date'], us_covid[c].rolling(7, center=True).mean(), color='black', alpha=0.8, label='US')
    plt.xlim(pd.to_datetime('2020-03-1'),pd.to_datetime('2021-03-07'))
    plt.axvline(x=pd.to_datetime('2020-05-4'), color='tomato')
    plt.text(pd.to_datetime('2020-05-5'), y_coord[i], 'FL Stage 1 Reopening', fontsize=16, color='tomato')
    plt.axvline(x=pd.to_datetime('2020-06-3'), color='tomato')
    plt.text(pd.to_datetime('2020-06-4'), y_coord[i]*0.95, 'FL Stage 2 Reopening', fontsize=16, color='tomato')
    plt.axvline(x=pd.to_datetime('2020-09-25'), color='tomato')
    plt.text(pd.to_datetime('2020-09-26'), y_coord[i], 'FL Full Reopening', fontsize=16, color='tomato')
    plt.axvline(x=pd.to_datetime('2020-11-4'), linestyle='--', color='grey')
    plt.text(pd.to_datetime('2020-11-5'), y_coord[i], '6 months after Reopening', fontsize=16, color='grey')
    plt.xlabel('Date')
    plt.ylabel(title)
    plt.title('California vs. Florida {}'.format(title), fontsize=25)
    plt.legend(loc=2, fontsize=16)
    plt.show()
    # Repeat the ANOVA (and Tukey HSD when significant) for each time frame
    for j in range(len(timeframe)):
        CA_subset = CA_covid.loc[(CA_covid['date'] >= pd.Timestamp(timeframe[j][0])) & (CA_covid['date'] <= pd.Timestamp(timeframe[j][1]))]
        FL_subset = FL_covid.loc[(FL_covid['date'] >= pd.Timestamp(timeframe[j][0])) & (FL_covid['date'] <= pd.Timestamp(timeframe[j][1]))]
        us_subset = us_covid.loc[(us_covid['date'] >= pd.Timestamp(timeframe[j][0])) & (us_covid['date'] <= pd.Timestamp(timeframe[j][1]))]
        tukey_data = pd.concat([CA_subset, FL_subset, us_subset])
        print(time_frame_titles[j])
        print('CA Average: {}'.format(CA_subset[c].mean()))
        print('FL Average: {}'.format(FL_subset[c].mean()))
        print('US Average: {}'.format(us_subset[c].mean()))
        anova = stats.f_oneway(CA_subset[c], FL_subset[c], us_subset[c])
        is_significant = 'SIGNIFICANT' if anova.pvalue < 0.05 else 'NOT SIGNIFICANT'
        print(anova)
        print(is_significant, '\n')
        if is_significant == 'SIGNIFICANT':
            tukey = pairwise_tukeyhsd(endog=tukey_data[c], groups=tukey_data['state'], alpha=0.05)
            print(tukey)
        print('\n')
32704.498431560332
Full Time Frame
CA Average: 340.50386812323114
FL Average: 279.5989363011025
US Average: 297.9908846231975
F_onewayResult(statistic=7.738448402126859, pvalue=0.00045967284764869684)
SIGNIFICANT
Multiple Comparison of Means - Tukey HSD, FWER=0.05
=======================================================
group1 group2 meandiff  p-adj     lower    upper reject
-------------------------------------------------------
    CA     FL -60.9049  0.001  -98.1595 -23.6503   True
    CA     US  -42.513 0.0205  -79.7676  -5.2584   True
    FL     US  18.3919 0.4786  -18.7872  55.5711  False
-------------------------------------------------------
After Reopening on 5/04/2020
CA Average: 402.0641669787608
FL Average: 331.2693166761591
US Average: 352.5139614346654
F_onewayResult(statistic=10.901738232167286, pvalue=2.092225537560887e-05)
SIGNIFICANT
Multiple Comparison of Means - Tukey HSD, FWER=0.05
=======================================================
group1 group2 meandiff  p-adj     lower    upper reject
-------------------------------------------------------
    CA     FL -70.7949  0.001  -107.323 -34.2667   True
    CA     US -49.5502 0.0043  -86.0783 -13.0221   True
    FL     US  21.2446 0.3607  -15.2835  57.7728  False
-------------------------------------------------------
6 Months After Reopening on 5/04/2020
CA Average: 252.6109999470836
FL Average: 246.9361520596759
US Average: 243.50516494556726
F_onewayResult(statistic=0.39304990260091477, pvalue=0.6751838208151788)
NOT SIGNIFICANT

Full Time Frame
CA Average: 24.01466755563204
FL Average: 23.89582411808338
US Average: 23.553175352034263
F_onewayResult(statistic=0.036184020963477934, pvalue=0.9644639333475139)
NOT SIGNIFICANT
After Reopening on 5/04/2020
CA Average: 28.330687482666875
FL Average: 28.334504652936147
US Average: 27.29888233338877
F_onewayResult(statistic=0.18140674188779393, pvalue=0.8341258238304653)
NOT SIGNIFICANT
6 Months After Reopening on 5/04/2020
CA Average: 12.126168955158308
FL Average: 19.51524237601211
US Average: 13.761458899484495
F_onewayResult(statistic=25.420779821536126, pvalue=2.7485678444325342e-11)
SIGNIFICANT
Multiple Comparison of Means - Tukey HSD, FWER=0.05
=======================================================
group1 group2 meandiff  p-adj     lower    upper reject
-------------------------------------------------------
    CA     FL   7.3891  0.001    4.8307   9.9475   True
    CA     US   1.6353 0.2912   -0.9231   4.1937  False
    FL     US  -5.7538  0.001   -8.3122  -3.1954   True
-------------------------------------------------------

Full Time Frame
CA Average: 0.37122058856336126
FL Average: 0.4038440322636638
US Average: 0.4219357132092284
F_onewayResult(statistic=1.9621018137149224, pvalue=0.14104988751355035)
NOT SIGNIFICANT
After Reopening on 5/04/2020
CA Average: 0.4265407043428415
FL Average: 0.46655076116513294
US Average: 0.446123879167755
F_onewayResult(statistic=0.9582523708168573, pvalue=0.3839447026559414)
NOT SIGNIFICANT
6 Months After Reopening on 5/04/2020
CA Average: 0.21255140158472935
FL Average: 0.39583414312325443
US Average: 0.2667737207042509
F_onewayResult(statistic=42.31457825594809, pvalue=7.989224827652206e-18)
SIGNIFICANT
Multiple Comparison of Means - Tukey HSD, FWER=0.05
=======================================================
group1 group2 meandiff  p-adj     lower    upper reject
-------------------------------------------------------
    CA     FL   0.1833  0.001    0.1352   0.2314   True
    CA     US   0.0542 0.0226    0.0061   0.1023   True
    FL     US  -0.1291  0.001   -0.1772   -0.081   True
-------------------------------------------------------
Florida's testing rate was not significantly different from the national average in any of the time frames. California's testing rate, however, was significantly higher than both Florida's and the national average in every time frame except the one between 5/04/2020 and 11/04/2020. Based on the graph, California's testing rate diverged from Florida's and the national average in December 2020.
Florida's positive rate and death rate were significantly higher than both California's and the national average only in the 6-month period after it began reopening on 5/04/2020. This peak in positives and deaths in the months following the start of Florida's reopening process suggests a relationship between COVID prevention policy and positive and death rates.
plt.figure(figsize=(16,8))
predictor = 'testingIncreaseRate'
response = 'positiveIncreaseRate'
plt.scatter(CA_covid[predictor], CA_covid[response], alpha=0.4, label='California')
plt.scatter(FL_covid[predictor], FL_covid[response], alpha=0.4, color='red', label='Florida')
lr = LinearRegression()
lr.fit(np.array(CA_covid[predictor]).reshape(-1,1), CA_covid[response])
print('California Estimate: y = {}x + {}'.format(lr.coef_[0], lr.intercept_))
plt.plot(CA_covid[predictor], lr.predict(np.array(CA_covid[predictor]).reshape(-1,1)), color='blue')
lr.fit(np.array(FL_covid[predictor]).reshape(-1,1), FL_covid[response])
print('Florida Estimate: y = {}x + {}'.format(lr.coef_[0], lr.intercept_))
plt.plot(FL_covid[predictor], lr.predict(np.array(FL_covid[predictor]).reshape(-1,1)), color='red')
plt.xlabel('Testing Increase Rate')
plt.ylabel('Positive Increase Rate')
plt.title('Positive Increase Rate vs. Testing Increase Rate')
plt.legend()
plt.show()
plt.figure(figsize=(16,8))
predictor = 'testingRate'
response = 'positiveRate'
plt.scatter(CA_covid[predictor], CA_covid[response], alpha=0.4, label='California')
plt.scatter(FL_covid[predictor], FL_covid[response], alpha=0.4, color='red', label='Florida')
lr = LinearRegression()
lr.fit(np.array(CA_covid[predictor]).reshape(-1,1), CA_covid[response])
print('California Estimate: y = {}x + {}'.format(lr.coef_[0], lr.intercept_))
plt.plot(CA_covid[predictor], lr.predict(np.array(CA_covid[predictor]).reshape(-1,1)), color='blue')
lr.fit(np.array(FL_covid[predictor]).reshape(-1,1), FL_covid[response])
print('Florida Estimate: y = {}x + {}'.format(lr.coef_[0], lr.intercept_))
plt.plot(FL_covid[predictor], lr.predict(np.array(FL_covid[predictor]).reshape(-1,1)), color='red')
plt.xlabel('Testing Rate')
plt.ylabel('Positive Rate')
plt.title('Positive Rate vs. Testing Rate')
plt.legend()
plt.show()
California Estimate: y = 0.09494737183836505x + -8.31527982346601
Florida Estimate: y = 0.0914051263033822x + -1.5294061646749952

California Estimate: y = 0.07231418345581x + -293.1136465012023
Florida Estimate: y = 0.08551044849506438x + -3.371142379965022
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col  # Closest thing to the 'estout' pretty tables
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way
FloCal = pd.concat([FL_covid, CA_covid])
# 1 for Florida rows, 0 for California rows (cast to int so OLS receives numeric data)
FloCal['is_Florida'] = (FloCal['state'] == 'FL').astype(int)
FloCal['constant'] = 1
FloCal_recent = FloCal.set_index('date').loc['2021']
# Train regression to view testing rate effect ONLY on positives
reg0 = sm.OLS(FloCal['positiveRate'], FloCal[['testingRate','constant']]).fit()
# Train regression to view state and testing rate effect on positives
reg1 = sm.OLS(FloCal['positiveRate'], FloCal[['is_Florida','testingRate','constant']]).fit()
# Train regression to do the same as reg1, limiting data to only 2021
reg2 = sm.OLS(FloCal_recent['positiveRate'], FloCal_recent[['is_Florida','testingRate','constant']]).fit()
reg_names = ["OLS I","OLS II","OLS III"]
reg_Order = ['testingRate','is_Florida','constant']
print(summary_col([reg0,reg1,reg2], stars=True, float_format='%0.2f', model_names=reg_names, regressor_order=reg_Order,
                  info_dict={'N': lambda x: "{0:d}".format(int(x.nobs))}))
=============================================
            OLS I      OLS II     OLS III
---------------------------------------------
testingRate 0.08***    0.08***    0.08***
            (0.00)     (0.00)     (0.00)
is_Florida             767.86***  1154.30***
                       (30.91)    (46.87)
constant    -86.51***  -525.27*** -110.31
            (29.57)    (28.25)    (181.61)
R-squared   0.96       0.98       0.94
            0.96       0.98       0.94
N           773        773        132
=============================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
OLS I seeks to estimate the effect of testing rate alone on the increase in positive test rates in both California and Florida over the entirety of the COVID-19 pandemic. Significant at the 99% level, the model (and every subsequent variant) predicts that the rate of tests administered has zero effect on the rate of increase in positive tests. In other words, the data make a compelling case that changes in the positive testing rate are independent of how many tests are administered.
OLS II seeks to estimate the effect of being in Florida on the increase in positive test rates (controlling for testing rate) over the entirety of the COVID-19 pandemic. Insignificant at any reasonable level, the model is unable to make a compelling case that one state's overall policies affected the rate at which positive tests accumulate. This is still notable, however, as it suggests that California's economically costly shutdowns did not yield the significant changes they were intended to.
OLS III performs the same function as OLS II, with the data restricted to observations from 2021. Here, the coefficient on being in Florida also incorporates vaccination efforts, as well as a more limited set of policy differences. This model predicts, at a 99% significance level, that being in Florida actually corresponds to a 48 percentage point DECREASE in positive tests overall.
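To clarify how a coefficient such as `is_Florida` is read when `testingRate` is held fixed, here is a minimal, dependency-free sketch of least squares with a dummy variable. The helpers `solve_linear_system` and `ols` and the toy data are hypothetical; this is not the notebook's statsmodels fit, only an illustration of what the dummy coefficient measures.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y_k for r, y_k in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical data: the same slope in both states, with Florida shifted up by 3.
testing = [0.0, 1.0, 2.0, 3.0]
rows, y = [], []
for is_florida in (0, 1):
    for t in testing:
        rows.append([1.0, t, float(is_florida)])  # constant, testingRate, is_Florida
        y.append(2.0 + 0.5 * t + 3.0 * is_florida)

constant, slope, florida_shift = ols(rows, y)
print(round(constant, 6), round(slope, 6), round(florida_shift, 6))
```

In this toy setup the dummy coefficient recovers the vertical gap between the two states' regression lines at any given testing rate, which is how the `is_Florida` estimates in OLS II and OLS III should be interpreted.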
#Difference in Difference Model Before and After Sept 25, 2020 (Florida's Complete Reopening)
import datetime as dt
treatment_date = dt.datetime(2020,9,25)
FloCal['Post'] = np.where(FloCal['date'] >= treatment_date, 1, 0)
FloCal['Treatment'] = FloCal['is_Florida']
FloCal['Post x Treatment'] = FloCal['Post']*FloCal['Treatment']
#Train regression to assess effect of being in Florida after its complete Reopening
reg = sm.OLS(FloCal['positiveRate'],FloCal[['Post','Treatment','Post x Treatment','testingRate','constant']]).fit()
print (summary_col([reg],stars=True,float_format='%0.2f',
info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
'R2':lambda x: "{:.2f}".format(x.rsquared)}))
=============================
                 positiveRate
-----------------------------
Post             -1391.85***
                 (43.96)
Treatment        346.07***
                 (25.75)
Post x Treatment 1077.69***
                 (39.61)
testingRate      0.09***
                 (0.00)
constant         -317.75***
                 (19.82)
R-squared        0.99
                 0.99
N                773
R2               0.99
=============================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
The Difference in Differences (DD) model is one of the most telling analyses we completed. This model seeks to estimate the effect of Florida's 100% reopening in September while controlling for varying testing rates AND adjusting for the time effects that would have been shared by FL and CA (e.g., changes in testing accuracy, greater awareness, and improved sanitation techniques). In sum, the model predicts, at a 99% level of significance, that the effect of Florida's reopening policy was actually a DECREASE in the positive test rate of 8.10 percentage points relative to California.
This finding defies the general perception of COVID-19 mitigation policies. While the justification for such costly policies has been public safety, the model points to a much different reality, in which reopening may be not only economically beneficial but potentially beneficial to public health as well.
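The logic behind a difference-in-differences estimate can be sketched with four group means. This is a simplified illustration with hypothetical numbers and a hypothetical helper, `diff_in_diff`; it omits the `testingRate` control used in the regression above. Without controls, this quantity equals the coefficient on the `Post x Treatment` interaction term.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical outcome values (illustration only, not the project's data).
did = diff_in_diff(
    treated_pre=[10, 12], treated_post=[20, 22],   # e.g. the state that changed policy
    control_pre=[8, 10], control_post=[13, 15],    # e.g. the comparison state
)
print(did)
```

The control group's change stands in for what would have happened to the treated group without the policy, so the estimate is only as credible as the assumption that both states would otherwise have followed parallel trends.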
unemployment_politic['Difference'] = unemployment_politic['Democrat'] - unemployment_politic['Republican']
diff_before = unemployment_politic.loc[(unemployment_politic['Date'] >= pd.Timestamp('1-1-2016')) & (unemployment_politic['Date'] < pd.Timestamp('3-13-2020'))]
diff_after = unemployment_politic.loc[(unemployment_politic['Date'] >= pd.Timestamp('3-13-2020'))]
plt.figure(figsize=(24,8))
plt.plot(unemployment_politic['Date'], unemployment_politic['Democrat'], alpha=0.8, label='Democratic')
plt.plot(unemployment_politic['Date'], unemployment_politic['Republican'], color='red', alpha=0.8, label='Republican')
plt.xlim(pd.to_datetime('2019-01-1'),pd.to_datetime('2021-03-07'))
plt.title('Democratic vs. Republican Unemployment Rate', fontsize=25)
plt.axvline(x=pd.to_datetime('2020-03-13'), color='grey')
plt.text(pd.to_datetime('2020-03-14'), 5, 'National Emergency Declared', fontsize=16, color='grey')
plt.legend(loc=2, fontsize=16)
plt.show()
plt.figure(figsize=(24,8))
plt.plot(unemployment_politic['Date'], unemployment_politic['Democrat'] - unemployment_politic['Republican'], alpha=0.8)
plt.axhline(y = diff_before['Difference'].mean(), linestyle='dotted', color='grey')
plt.title('Democratic vs. Republican Difference in Unemployment Rate', fontsize=25)
plt.xlim(pd.to_datetime('2019-01-1'),pd.to_datetime('2021-03-07'))
plt.axvline(x=pd.to_datetime('2020-03-13'), color='grey')
plt.text(pd.to_datetime('2020-03-14'), 2, 'National Emergency Declared', fontsize=16, color='grey')
plt.show()
print('Before COVID National Emergency:', diff_before['Difference'].mean())
print('After COVID National Emergency:', diff_after['Difference'].mean())
ttest = stats.ttest_ind(diff_before['Difference'], diff_after['Difference'])
is_significant = 'SIGNIFICANT' if ttest.pvalue < 0.05 else 'NOT SIGNIFICANT'
print( "Before vs. After COVID National Emergency:", ttest)
print(is_significant, '\n')
Before COVID National Emergency: -0.0070170660856935215
After COVID National Emergency: 1.9776515151515155
Before vs. After COVID National Emergency: Ttest_indResult(statistic=-31.368989958593207, pvalue=6.409661144533248e-39)
SIGNIFICANT
The first graph above depicts the monthly average unemployment rates of Democratic and Republican states. The second graph depicts the difference in unemployment rates; the grey dotted line represents the average difference from January 1, 2016 to March 13, 2020, when the United States declared a national emergency because of COVID-19. Unemployment spiked in both Democratic and Republican states because of the COVID-19 pandemic. It is important to note that states are classified as Democratic or Republican based on the 2020 presidential election results, even if a state had a different political affiliation before the elections in late 2020. Another important factor is that all states are weighted equally regardless of population.
Before the COVID-19 pandemic, Democratic and Republican states had similar unemployment rates. After the pandemic began, however, Democratic unemployment rates rose higher than Republican unemployment rates. We conducted a t-test on the difference between the average Democratic and Republican unemployment rates before and after the national emergency. The average difference from January 1, 2016 to March 13, 2020 was -0.007017; the average difference after the national emergency was declared was 1.9777. The t-test confirmed that this change is statistically significant, meaning that the gap by which Democratic unemployment exceeded Republican unemployment widened significantly after the emergency was declared.
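For reference, the statistic behind this comparison can be sketched by hand. The function below computes the pooled-variance two-sample t statistic, which is what `stats.ttest_ind` reports with its default `equal_var=True`. The name `pooled_t_statistic` and the toy gaps are hypothetical and are not the project's data.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance (equal-variance Student's t-test)."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical Democratic-minus-Republican unemployment gaps (percentage points).
gap_before = [-0.1, 0.0, 0.1]
gap_after = [1.9, 2.0, 2.1]

t_stat = pooled_t_statistic(gap_before, gap_after)
print(t_stat)
```

The sign is negative because the "before" mean is subtracted from a larger "after" mean, matching the sign of the statistic printed above. Note that this test treats observations as independent, which is a strong assumption for consecutive monthly values.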
# .copy() makes an independent DataFrame, which avoids pandas' SettingWithCopyWarning
CA_FL_diff = california_unemployment[['filed_week_ended']].copy()
CA_FL_diff['Difference'] = california_unemployment['insured_unemployment_rate']-florida_unemployment['insured_unemployment_rate']
diff_before = CA_FL_diff.loc[(CA_FL_diff['filed_week_ended'] < pd.Timestamp('3-13-2020'))]
diff_after = CA_FL_diff.loc[(CA_FL_diff['filed_week_ended'] >= pd.Timestamp('3-13-2020'))]
#Plotting
plt.figure(figsize=(24,8))
plt.plot(california_unemployment['filed_week_ended'], california_unemployment['insured_unemployment_rate'], alpha=0.8)
plt.plot(florida_unemployment['filed_week_ended'], florida_unemployment['insured_unemployment_rate'], alpha=0.8, color='red')
plt.xlim(pd.to_datetime('2019-01-1'),pd.to_datetime('2021-04-07'))
plt.axvline(x=pd.to_datetime('2020-05-4'), color='tomato')
plt.text(pd.to_datetime('2020-05-5'), 3, 'FL Stage 1 Reopening', fontsize=16, color='tomato')
plt.title('California vs. Florida Unemployment Rate', fontsize=25)
plt.axvline(x=pd.to_datetime('2020-03-13'), color='grey')
plt.text(pd.to_datetime('2020-03-14'), 5, 'National Emergency Declared', fontsize=16, color='grey')
plt.show()
plt.figure(figsize=(24,8))
plt.plot(california_unemployment['filed_week_ended'], california_unemployment['insured_unemployment_rate']-florida_unemployment['insured_unemployment_rate'], alpha=0.8)
plt.xlim(pd.to_datetime('2019-01-1'),pd.to_datetime('2021-04-07'))
plt.axvline(x=pd.to_datetime('2020-05-4'), color='tomato')
plt.text(pd.to_datetime('2020-05-5'), 3, 'FL Stage 1 Reopening', fontsize=16, color='tomato')
plt.axvline(x=pd.to_datetime('2020-03-13'), color='grey')
plt.text(pd.to_datetime('2020-03-14'), 5, 'National Emergency Declared', fontsize=16, color='grey')
plt.axhline(y = diff_before['Difference'].mean(), linestyle='dotted', color='grey')
plt.title('California vs. Florida Unemployment Rate Difference', fontsize=25)
plt.show()
#TTest
print('Before COVID National Emergency:', diff_before['Difference'].mean())
print('After COVID National Emergency:', diff_after['Difference'].mean())
ttest = stats.ttest_ind(diff_before['Difference'], diff_after['Difference'])
is_significant = 'SIGNIFICANT' if ttest.pvalue < 0.05 else 'NOT SIGNIFICANT'
print( "Before vs. After COVID National Emergency:", ttest)
print(is_significant, '\n')
Before COVID National Emergency: 1.5761666666666672
After COVID National Emergency: 5.903157894736845
Before vs. After COVID National Emergency: Ttest_indResult(statistic=-23.069016770125863, pvalue=2.944077310866634e-82)
SIGNIFICANT
The above graphs depict the weekly insured unemployment rates of California and Florida and the difference between them. Before the pandemic, California consistently had higher unemployment rates than Florida. After the pandemic began, this difference grew, and the t-test confirmed that the change in the average difference before and after the national emergency was declared on 3/13/2020 is statistically significant. However, the difference appears to be approaching its pre-pandemic level in the most recent months of data, in early 2021.
The goal of the project was to determine how different COVID-19 policies affect health and economic factors. The COVID policies implemented by Democratic-led and Republican-led states were considerably different and often fell along party lines. The former can be characterized by more restrictive policies like mask-wearing and social distancing, while the latter can be characterized by less restrictive policies like hasty reopening initiatives and less widespread mask-wearing and social distancing measures, as evidenced by governor-issued directives and orders. Specifically, we analyzed the testing, positive, and death rates of Republican and Democratic states, the 4 regions of the United States, and California vs. Florida.
We first compared the averages of total testing, positives, percent positive, and deaths in Democratic and Republican states as of 3/07/2021 using t-tests. We found that Democratic states had significantly higher testing rates, while Republican states had significantly higher positive rates and a higher percentage of tests that returned positive. However, the difference in average death rates was not significant. We found similar results when we compared the testing, positive, and death rates of Republican and Democratic states on a day-by-day basis.
We also compared California, Florida, and the national average in specific time periods before and after Florida began stage 1 reopening. This included the entire time frame, the time frame only after Florida began stage 1 reopening, and the 6 months directly after Florida began stage 1 reopening. California had a significantly higher average testing rate than Florida and the national average in the entire time frame and after stage 1 reopening, but not in the 6 months after Florida began stage 1 reopening. Florida had a significantly higher positive rate than California and the national average only in the 6 months after stage 1 reopening. Florida’s average death rate was likewise higher than California's and the national average only in the 6-month period after stage 1 reopening.
To better understand what role governmental policy played throughout the COVID-19 pandemic, we ran a number of linear (OLS) regressions, including one difference in differences (DD) model, measuring changes in positive test rates between CA and FL while controlling for a handful of other explanatory variables (e.g., tests administered).
Across the board, the coefficient for testing rates was a statistically significant 0.00, meaning the model predicts no correlation between increased testing and swings in the positive test rate. The coefficient on being in Florida rather than California, however, was not statistically significant when considering the data in its entirety. This defied expectations, as the model suggests that, despite vast policy differences, being under one state's policies or the other's does not necessarily lead to changes in the positive test rate. We then reapplied this model to a 2021-only dataset (to better assess more recent developments) and found a similarly unexpected result: a statistically significant prediction that Florida’s 2021 approach led to a greater decrease in positive tests than California's. These findings set a foundation for the contrary notion that the (economically costly) COVID-19 restrictions imposed by states may not have yielded the positive results intended.
To account for the effect of shared changes over time, and to address the heterogeneity between California and Florida beyond the use of control variables, we chose to incorporate a Diff-in-Diff model into our last linear regression. Essentially, we trained this model to measure the underlying differences between CA and FL in the early stages of the pandemic; this afforded us a baseline trend that could predict what would have happened in a ‘treated’ state based on its known relationship to an ‘untreated’ state. We then treated Florida’s 9/25 complete reopening as a major change and a valid ‘treatment’ in our natural experiment. The results from this model were particularly interesting in light of our original question, as isolating a single major policy shift led us to find that Florida actually fared better at a statistically significant level in this instance.
We also compared the averages of total testing, positives, percent positive, and deaths for the four regions of the United States using ANOVA and Tukey tests. We found that the Midwestern and Northeastern states had statistically significant differences in testing rates and in the percent of tests that returned positive. One plausible explanation is that the Midwest is very rural and COVID tests are less accessible there. The Northeastern testing rate and death rate were significantly higher than those of the West. The difference in death rates may be because New York was the first U.S. epicenter of the pandemic, which hit the Northeastern states the worst before eventually reaching other parts of the country. We are unsure why the West’s testing rate is lower than the Northeast’s.
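For readers unfamiliar with ANOVA, the sketch below shows what the F statistic measures: variation between regional means relative to variation within regions. It is a hand-rolled illustration on synthetic values, not our analysis code; `one_way_anova_f` and the `regions` dictionary are hypothetical. Obtaining p-values and the pairwise Tukey comparisons requires the F and studentized range distributions, which are usually taken from a statistics library (for example `scipy.stats.f_oneway` and statsmodels’ `pairwise_tukeyhsd`).

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a one-way ANOVA.

    groups: dict mapping a group label to a list of observations.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)

    ss_between = 0.0  # spread of group means around the grand mean
    ss_within = 0.0   # spread of observations around their own group mean
    for values in groups.values():
        group_mean = mean(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in values)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Synthetic per-state testing rates by region, for illustration only.
regions = {
    "Northeast": [12, 14, 13],
    "Midwest": [8, 9, 10],
    "South": [9, 11, 10],
    "West": [7, 9, 8],
}
f_stat, df_between, df_within = one_way_anova_f(regions)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A large F statistic indicates that at least one regional mean differs from the others; the Tukey test is what then identifies which pairs of regions differ.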
When we compared the difference in unemployment rates of Democratic and Republican states, we found that the difference increased after the onset of the COVID pandemic: Republican states had significantly lower average unemployment rates than Democratic states after the national emergency for the COVID-19 pandemic was declared on March 13, 2020. Similarly, Florida had a significantly lower average unemployment rate than California in this time period.
Overall, there is an association between less restrictive Republican states and lower testing rates, higher positives, and lower unemployment rates. The difference in death rates is generally insignificant. However, there are many lurking variables that we can’t account for and that may have influenced the data.
First and foremost, we are limited by the data sets that were available to us. It is likely that the true COVID-19 case and death counts are higher than what was recorded, and there may be reporting discrepancies between states as well.
We sought to incorporate a handful of other metrics in our research, but a lack of publicly accessible data (and data in general, given the recency of our topic) limited the overall breadth of resources at our disposal.
Incorporating economic measures by state also proved troublesome - as there are few ways to reliably assess state-level economies in real time. Unemployment proved to be a decent metric, but a day-to-day measure of household income and/or poverty rates would’ve been ideal (annual reports on such metrics were available, but the data was too infrequent to draw conclusions from).
It is difficult to control for all lifestyle differences between FL and CA, or between Republican- and Democrat-affiliated states, that could otherwise affect the spread of COVID-19 (e.g. feasibility of being outdoors where spread is less likely, access to private transportation, etc.). Republican states tend to be more rural and may have less access to testing, which builds a bias into the data. There is no way to measure the population’s willingness to follow COVID guidelines, as many people chose to ignore them. We also cannot account for interstate or international travel. In the United States, the first major outbreak was concentrated in the mostly Democratic Northeast before spreading to the rest of the country. This is probably why the Northeast region has a higher total death rate than other regions of the country, and it may have skewed the average death rates of all Democratic states. This represents another inherent bias that we did not account for.
We determined each state’s political affiliation based on the results of the 2020 presidential election. A state’s COVID policies are more directly influenced by the political affiliation of its governor, but the presidential election results generally show whether the state’s population leans more Democratic or Republican. However, if a state recently changed political majorities, its data may be counted under the party opposite the one that shaped its policies. It is also possible for a state’s majority political affiliation to have changed within the time frames we measured. Overall, this method of determining political affiliation may not represent swing states or mixed populations well.
The Covid Tracking Project, the source of one of our state and national COVID data sets, ended its collection on March 7, 2021, so we only have COVID data up to that point. The website for the unemployment data from the United States Department of Labor only allowed us to download the information for one state at a time. Because of this, we only analysed the weekly unemployment rates of California and Florida. From the U.S. Bureau of Labor Statistics, we were able to get the monthly unemployment rates of all 50 states from March 2011-2021, which is what we used to estimate the average unemployment rates of Democratic and Republican states.
We also noticed an aberration in the data where higher testing rates were associated with higher death rates. When we checked the documentation of the Covid Tracking Project, we found they only reported the deaths of people who had tested positive for COVID or who showed symptoms and were likely infected with COVID. A state’s testing rate would therefore influence the number of confirmed deaths reported. States may also have different criteria for determining probable COVID deaths, which could lead to undercounting.
We also wanted to graph the percentage of tests that were positive on a day-to-day basis but were unable to do so: because COVID tests take 1-7 days to process, positiveIncrease / TotalTestsIncrease would not accurately reflect the true percentage of tests that were positive on a given day. In early 2020, new positives were often reported on days when no new tests were reported, which would cause a division-by-zero error that would affect later calculations.
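One possible workaround, which we did not use in this analysis, is to compute percent positive over a trailing window of days and to skip any window with no reported tests. The sketch below is an assumption-laden illustration: the function name `rolling_percent_positive`, the 7-day default, and the sample lists are all hypothetical. A trailing window smooths over reporting lag but does not truly match each positive result to the day its test was taken.

```python
def rolling_percent_positive(positives, tests, window=7):
    """Percent positive over a trailing window of days.

    positives, tests: equal-length lists of daily increases.
    Returns a list holding None wherever the window is incomplete or
    the window's test total is zero, so division by zero cannot occur.
    """
    if len(positives) != len(tests):
        raise ValueError("positives and tests must have the same length")

    result = []
    for i in range(len(tests)):
        if i + 1 < window:
            result.append(None)  # not enough history yet
            continue
        start = i + 1 - window
        window_tests = sum(tests[start:i + 1])
        window_positives = sum(positives[start:i + 1])
        if window_tests > 0:
            result.append(100 * window_positives / window_tests)
        else:
            result.append(None)  # positives reported without any tests
    return result


# Synthetic daily increases, for illustration only: the first two days
# report positives but no tests, as happened in early 2020.
sample = rolling_percent_positive([5, 0, 3, 10], [0, 0, 100, 100], window=2)
print(sample)  # [None, None, 3.0, 6.5]
```

Returning None instead of raising keeps the gaps visible, so they can be left blank in a plot rather than silently distorting later calculations.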