Open Access Online Scientific Journal

Research Article

J Med Discov (2020); 5(3):jmd20049; DOI:10.24262/jmd.5.3.20049;
Received July 15th, 2020, Revised August 23rd, 2020, Accepted September 07th, 2020 , Published September 15th, 2020.

The Analysis and Forecasting COVID-19 Cases in the United States Using Bayesian   

Structural Time Series Models

Liming Xie1,*

 

 1North Dakota State University, Department of Statistics, Fargo, ND 58108, USA. Email: limingxie2018@gmail.com.

 

*Correspondence: Liming Xie,North Dakota State University, Department of Statistics, Fargo, ND 58108, USA. Email: limingxie2018@gmail.com

Abstract

 In this paper the Bayesian structural time series model (BSTS) is used to analyze and predict total confirmed cases who infected COVID-19 in the United States from February 28, 2020 through April 6, 2020 using the collect data from CDC (Center of Disease Control) in the United States. It includes variables of days, total confirmed cases, confirmed cases daily, death cases daily, and fatality rates. The author exploits the flexibility of Local Linear Trend, Seasonality, Contemporaneous covariates of dynamic coefficients in the Bayesian structural time series models. In addition, Causal Impact function in R programming is applied to analyze the model and read report of model. The results of the model show that the total confirmed cases who infected COVID-19 will be still most likely to increase straightly, the total numbers infected COVID-19 would be broken through 600,000 in the United States in near future (in the subsequent months). And then arrive at the peak around mid-May 2020. The cumulative prediction values are 1e+05 of cumulative trend. Also, the model suggests that the probability of variable Recovered cases daily is 0.07.  

Keywords: BSTS, COVID-19, Causal Impact, Trend, Seasonality, Forecast

Introduction

At present, COVID-19 is a brutal enemy who has killed tens of thousands of people. It posed a serious threat to human being. Moreover, people do not have effective ways to annihilate it. In the pandemic period, most of countries can currently take these actions just using passive protect measures like social distance, taking masks, and home isolation to reduce the infected risk. Many researchers analyzed china’s epidemic period that China banned travel to and from Wuhan city for more than 50 days the cases who infected COVID-19 have been dropped down [1]. Some researcher even urged that the home isolation should keep until 2022 [2]. However, all economic activities have been shown down now. Thus, people expect that bending the curve of which keeps rising confirmed cases infected COVID-19 and death cases can come up early. Thus, the author tries to apply BSTS model to analysis U.S. COVID-19 case data from CDC in the United States by using Local linear trend, Seasonality, Contemporaneous covariates of dynamic coefficients. On the other hand, the author takes techniques such as prior distribution and prior elicitation, and posterior inference to give some interpretations of the model. The research gains more effective results.

In addition, many institutions and scholars conducted a lot research for COVID-19. Some researchers used an objective approach to predict the continuation of the COVID-19. They produced ten-day ahead point forecasts and prediction intervals [3]. Some people used 1334 cases from China that were line-list data to receive age-stratified estimation. The results displayed that the case fatality ratio was likely to be strongly influenced by the availability of health-care facilities [4]. Also, some scholars chose exponential smoothing family to test forecasting accuracy. The family suggested that there were good effects [5]. In the initial phase, the author in this paper applied general time series forecasting methods followed by his previous techniques such as SARIMA model (1,1,2) (Seasonal autoregressive integrated moving average) and GARCH (generalized autoregressive conditional heteroskedasticity) to tested and analyzed the trend and seasonality of COVID 19 data [6,7]. The results showed good accuracy over a couple of forecasting competition. Also, the author tested to least absolute shrinkage and selection operator to estimate parameters of COVID-19 data by introducing into previous his method [8]. And then, used clustering method to allow “the eigenvalues of the correlation or covariance to test the parameters” [9]. But it was not positive effects.

On the other hand, some pursued new angle of research on COVID-19. For example, German and British scholars’ research found that it is probable to have three types of viruses: A, B, and C for COVID-19. U.S. cases are likely to come from them [10]. But epidemiologists often apply time series models such as ARIMA model (Autoregressive Integrated Moving Average), spectral analysis and the periodogram. For instance, an epidemiologist used the Lomb periodogram and the classic periodogram to Philadelphia Whooping cough timer series to achieve project research goal [11].

In despite of these research on COVID-19, no institutions and scholars use BSTS to analyze and predict COVID-19. Therefore, the author would like to test to use BSTS to analyze and forecast the total Confirmed cases in the United States that collects data from CDC from February 29, 2020 through April 6, 2020. The author thinks that the Causal Impact function in R programming is the feature selection in BSTS model that may perform causal inference by counterfactual predictions. It has good characteristics such as trend, seasonal, regression, holiday. It is one of good default model [12]. It is supposed that the time series of the treated unit is explained with a set of covariates while do not allow themselves to be impacted by the intervention. This is currently popular analysis method of time series model. Typically, it contains pre-period, post-period arguments and / or alpha, etc. for this paper, a time series model is automatically built and estimated. The argument models provide over the model computations.

For most models of BSTS we know that “niter” is number of Markov Chain Monte Carlo (MCMC) sample to draw. The model can generate time series model for short- and long-term forecasting [13]. If there is higher number, it is more accurate inferences. “standardized. data” allows all columns of the data with moments estimated for the pre-intervention period prior to fitting the model to be standardized. It means that empirical Bayes accessing setting the priors so that the results will be linear transformations of the data. “priot.level.sd.” denotes in terms of data standard deviations that does have the Gaussian random walk of the local level. It can choose and let dataset with low residual volatility when regressing out known predictors such as total confirmed or fatality in this data. “nseasons” is period of the seasonal components. In general, it contains a seasonal component, set this to entire number larger than 1. For example, when there are daily observations, e.g. 1 for a day from a component of data, it interfaces currently only supports up to one seasonal component. In addition, it can let observations specify a number of seasonal components. Therefore, BSTS. model defaults to 1 that means no used seasonal component. “season. duration” is a kind of duration of each season. For example, if adding a day-of-week component to data with daily granularity, and offers the arguments model.args=list(nseasons=7. Season, duration=1), etc. “dynamic. Regression” includes the coefficients of time varying regression. It may combine local trend or local level that effects one of over specification.

Bayesian structural time series models

Typically, models of structural time series data are state space models. Most of them can be regarded as a set of equations. Structural models are much easier to generalize, such as covariates and we are not hard to process missing values with structural models [14]. For example, the local level model is

The above equations assume under independent of all other unknowns. For 1st equation it is the observation function that connects with the observed data to a latent d-dimensional state . Second one is the state function that involves in the process of vector   through time. Thus, second one called “state” equation sometimes. Also, is as to a scalar observation and is usually d-dimension output vector by transition matrix . is often a matrix by . is regarded as a scalar observation error that could be noise variance  and is a q-dimensional system error with a state diffusion matrix  . Here, note that exists.

Since structural time series model has some advantages of flexible and modular and is statistical method for feature selection, it is more and more valued by time series researchers. For their flexible, it is probably because the model includes all ARIMA models (Autoregressive Integrated Moving Average). Modular does combine with and , etc. from a library of component sub-models to capture significant structures in the data.  Some of component models have been applied to capturing, the trend, seasonality, or else. For example,

(1) Local linear trend. The 1st components model. It is a form of the following equations:

This is a selection of modeling trends that can fit for local variation immediately. Especially for short term prediction requirement, it seems to appear in the flexible levels, but it is not appropriate to long term predictions. Here,  is trend, is the seasonal component. Moreover, the model is faster to enters the state space framework [15].

(2)Seasonality.

Like local linear trend, seasonality is also state component model. The functions are as follows:  

Here, S expresses the number of seasons andthat indicates joint contribution to the observed response . Please note that S-2 is most recent seasonal effects. Also, there is a scalar for the error term. Moreover, the equation is one of state model that does have less than full rank. For there is the mean that has zero of the total seasonal effect with summed over S seasons. For instance, when S=3 so that we can capture 3 seasons per time unit, the mean of the spring coefficient might be -1 (summer +fall+ winter), then, for the seasonal model the transition matrix will be (S-1) x (S-1) matrix with -1’s with the top row, which the sub-diagonal and 0’s else.

(3)Contemporaneous covariates of dynamic/static coefficients

The latter is to control time series when it gets no treatment. It is used to solve accurate counterfactual predictions and effects other unobserved causes. In this paper, it is mainly used of former, contemporaneous covariates with dynamic coefficients. It is a regression component to explain time varying ties for example, suppose covariates a=1,2, …, n, was imported in the dynamic regression component, then we have the following expression:

Here, the above equation belongs to the coefficient for the m th control series and is the standard deviation of its connected with random walk. If setting,on the other hand, we set the corresponding the transition matrix into , and, then we can complete the dynamic regression components.

(4)Prior distributions and prior elicitation

Suppose  expresses all of model parameters, is a set of the full state sequences and They are allowed to a prior distribution x () with the model parameters and exist in x() is regarded as the initial state values. So, we use MCMC method to gain the sample results through p (). If there are a matrix  with zero of elements and supposeand otherwise. Also, if expresses the nonzero scalars for the vector. If sequencesSupposethe rows. The columns are be nonzero entries for q. Then, we can use the spike and slab in the following function:

In addition, we can also write a formula for slab of the prior if using Gamma distribution:

(5)Posterior inference

For the full likelihood function, if the law of total probability is, then we have the function:

Here, expresses the set of values to multiple time series that is components of time series with trend, seasonality and others). Also, is a correlated error. It is involved in a normal prior to have the observations standardize, and then generate a posterior distribution. Thus, we could use MCMC to compute the posterior probability distribution [16]

(6)Forecasting

Since the forecasts are associated with the mode with the posterior predictive distribution, Bayesian analysis and forecasts have a formula that is like posterior inference. If we define as the set of values, then we can get the formula of posterior predictive distribution is as follows:

Results

The data of this paper is used from CDC (Center of Disease Control) in the United States. The author collects variables is that (1) The days are time that gets to start to report the first death case, from February 29, 2020 through April 6, 2020. (2) Total Confirmed cases. This is counting same as the days by which patients have been tested the positive Corona virus Nucleic Acid testing no matter whether or not they have symptoms with cough, fever, and other upper respiratory tract infection. (3) Recovered cases daily who have been treated or in hospital or out outpatients, or self-heal without any treatment. (4) Death cases who were computed without vital signs. (5) Fatality rate. It is the numbers that computed by the formula=100 % (Death cases daily/ Confirmed cases daily). To find the relationship with the confirmed case daily and death case daily in the United States, the author made plot of time series for the both. The results showed that there is significant linear regression between the both. Please see the following figure 1.

Figure 1. The plot of time series with conformed cases daily and death cases daily.

For predict struct, plots are three panels: The 1st one shows that the data and a counterfactual prediction for the post-treatment period. The 2nd one displays the difference between observed data and counterfactual prediction trends, which is the pointwise case effect via the model estimations. The last one is plus of the pointwise contributions by the 2nd panel that generates a plot of the cumulative effect of the intervention. They are based mainly on the assumption that most of covariates impacted by the intervention. Also, the model is assumed by which the relationship between covariates and treated time series as established while the pre-period, remains stable to the post-period.

Figure 2. The prediction plot for original data, pointwise, and cumulative.

In the Figure 2 the model contains that the relationship between Confirmed cases daily in the and Deaths cases daily in the United States from February 29, 2020 throughout April 6, 2020. Clearly, it shows that the cases of confirmed data will be dropped as the end of April 2020, and then slightly rebounded in mid-May 2020 but decreased after that. The numbers of cases daily is fluctuated around 10,000 cases.

Causal Impact function is one of feature selections in BSTS model that may perform causal inference by counterfactual predictions. It is supposed that the time series of the treated unit is explained with a set of covariates while do not allow themselves to be impacted by the intervention. It is currently popular analysis method of time series model. Typically, it contains pre-period, post-period, mode.arg and/ or alpha, etc. for this paper, a time series model is automatically built and estimated. The argument mode. arguments provide over the model.

In Table 1 R output suggests that the predictive values are the range of (73594, 439814) for total confirmed cases that infected COVID-19 virus in the United States. For cumulative prediction the value is  1e+05 of cumulative trend.

Table 1. The R outputs of posterior inference using causal impact function
Posterior Inference (Causal Impact)                                                   
Average Cumulative
Actual 6480 395258
Prediction(s.d.) 7210 (1206) 439814 (73594)
95% CI [4883, 9624] [297834, 587088]
Absolute effect (s.d.) -730 (1206) -44556 (73594)
95% CI [-3145, 1597] [-191830, 97424]
Relative effect (s.d.) -10% (17%) -10% (17%)
95% CI [-44%, 22%] [-44%, 22%]
Posterior tail-area probability p:   0.25301
Posterior prob. of a causal effect:  75%

> summary (impact, “report”)

Analysis report {Causal Impact}

During the post-intervention period, the response variable had an average value of approx. 6.48K. In the absence of an intervention, we would have expected an average response of 7.21K. The 95% interval of this counterfactual prediction is [4.88K, 9.62K]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is -0.73K with a 95% interval of [-3.14K, 1.60K]. For a discussion of the significance of this effect, see below. Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 395.26K. Had the intervention not taken place, we would have expected a sum of 439.81K. The 95% interval of this prediction is [297.83K, 587.09K].

The above results are given in terms of absolute numbers. In relative terms, the response variable showed a decrease of -10%. The 95% interval of this percentage is [-44%, +22%].

This means that, although it may look as though the intervention has exerted a negative effect on the response variable when considering the intervention period as a whole, this effect is not statistically significant, and so cannot be meaningfully interpreted. The apparent effect could be the result of random fluctuations that are unrelated to the intervention. This is often the case when the intervention period is very long and includes much of the time when the effect has already worn off. It can also be the case when the intervention period is too short to distinguish the signal from the noise. Finally, failing to find a significant effect can happen when there are not enough control variables or when these variables do not correlate well with the response variable during the learning period. The probability of obtaining this effect by chance is p = 0.274. This means the effect may be spurious.

> impact$summary

Table 2. The summary on predictive average values and the ranges.
Actual Pred Pred.lower Pred. upper
Average 6479.639 7210.071 4808.34 9627.339
Cumulative 395258.000 439814.318 293308.71 587267.675
Pred.sd AbsEffect AbsEffect.lower
Average 1216.347 -730.4314 -3144.747
Cumulative 74197.197 -44556.3184 -191829.583
  AbsEffect.upper AbsEffect.sd RelEffect
Average 1671.3 1216.347 -0.1013071
Cumulative 101949.3 74197.197 -0.1013071
  RelEffect.lower RelEffect.upper RelEffect.sd
Average -0.4365699 0.2318008 0.1673295
Cumulative -0.4365699 0.2318008 0.1673295
  Alpha P
Average 0.05 0.2740964
Cumulative 0.05 0.2740964

In Table 2 we can see that pre.lower and Pred. upper are 293308.71 and 587267.67. For time series data of response variables and other covariates, data is often given a zoo object, a vector, a matrix or others. The response variable could be the 1st column and other covariates distributes to other columns; a zoo object will be regarded as its time that indicates the x-airs in plot technique. Option “response” is used to describe the observed response as supplied to Causal Impact function. Option “cum.reponse” provides cumulative response with the modeling time. Option “point.pred” is posterior mean of counterfactual predictions. “Point.pred.lower” and “point.pred.upper” are a set of  lower limit or upper limit (1-alpha) posterior intervals. “Pomt.effect” is point-wise posterior causal effect. It may include lower and upper of the poster intervals “Cum.effect.lower” or “Cum.effect.Upper” are similar as the above. The results show that the range of total cumulative confirmed cases infected COVID-19 virus in the United States should be around 4808 to 9627 daily and most likely to be (293309, 587268). That is, total numbers infected COVID-19 virus would be nearly 600,000 in the United States in subsequent months.

Figure 3 shows the data and estimated smoothed values. the actual data are shown as solid lines, the points denote smoothed values with 2 standard error bounds are shown as a blue color; tick marks indicate days of observation.

Figure 3 shows the data and estimated smoothed values. the actual data are shown as solid lines, the points denote smoothed values with 2 standard error bounds are shown as a blue color; tick marks indicate days of observation.

Figure 4.   Structural models: the trend and seasonal parameters of the total confirmed variable.

Figure 4 shows that estimated trend component and seasonal component of the total confirmed variable with the variable days. Gray areas represent some root MSE bounds.

Figure 5. Visualization of BSTS predicting original series.

The Figure 5 is a visualization of predictive values for original series. It suggests that, after 38th day, which the last day collect dataset, the total confirmed cases infected COVID-19 would be fluctuated with the range of  1e+05. That is, it conforms that total confirmed cases could be increased in subsequent months. Therefore, it meets with the forecast of Figure 2 (The prediction plot for original data, pointwise, and cumulative), which is  1e+05 of cumulative trend.

Figure 6. Plots for scale value and cumulative absolute errors of level, trend and season with days that started in February 29.

In Figure 6, we can see clearly that, using “AddlocalLinearTrend “and seasonal functions could allow us analyze level, trend and season change with variable “days” after making logarithm of data. The results show that predictive values regressed with original data. It explains enough that the advantages of flexibility of local linear trend, level and season of BSTS state component models for the short-term predictions.

Figure 7. The plot of average coefficients from six parameters.

Figure 7 generates a histogram that displays average coefficients from Fatality, Day, Deaths, Recovered, and Confirmed daily. It shows that average coefficient of Fatality seems to be more significant than other variables in the model. Hence, the average coefficient of Fatality rate should be worthy in our analysis on these parameters. The author would like to point that average coefficient from Conformed Daily looks like to not have any status in the model.

Figure 8. Histogram of probabilities for data variables

Figure 8 provides us inclusion probabilies comparing each variable in the analysis system. The results shows that the probability from variable Recovered is nearly 0.07. It looks like to play more important role than other variables in the data. Second one is more than 0.04 of  variable Death; around 0.038 of Confirmed Daily, lastly, other two variables, Fatality and Day, are equal to or lower than 0.02. Thus, it suggests that in the empirical estimation and analysis on the model we should eye on them.

Discussion

For most models of BSTS we know that “niter” is number of MCMC sample to draw. If there is higher number, it is more accurate inferences. “standardized data” allows all columns of the data with moments estimated for the pre-intervention period prior to fitting the model to be standardized. It means that empirical Bayes accessing setting the priors so that the results will be linear transformations of the data. “priot.level.sd.” denotes in terms of data standard deviations that does have the Gaussian random walk of the local level. It can choose and let dataset with low residual volatility when regressing out known predictors such as total confirmed or fatality in this data. “nseasons” is period of the seasonal components. In general, it contains a seasonal component, set this to entire number larger than 1. Thus, BSTS model defaults to 1 that means no used seasonal component. “season. duration” is a kind of duration of each season. “dynamic. Regression” includes the coefficients of time varying regression. It may combine local trend or local level that effects one of overspecification. Also, for posterior distribution, it can summarize the proportion if observing the data [17].

In this paper Causal Impact function of BSTS model is applied to analyze and forecast the model of  the total confirmed cases in the United States from February29 to April 6.The author thinks that the total conformed cases that infected COVID-19 virus in the United States will be most likely to increase straightly, total numbers infected COVID-19 virus would be nearly 600,000 in the United States in near future (in the subsequent months). And then will appear in the peak around the md-May 2020. The empirical results suggest that the flexibilities of local linear trend, seasonality, and contemporaneous covariates of dynamic/static coefficients have good effects on short term predictions.

Conflict of interest

None

Acknowledgments

The author thanks editors and those anonymous reviewers who took a lot time to review this manuscript and gave active comments.

References

  1. Tian H.Y., Liu Y.H., Li Y.D., Wu C.H. et al. An investigation of transmission control measures during the first 50 days of the COVID-19 epidemic in China. Science. 2020; (368) 6491: 638-642.
  2. Stephen M. K., Christine T., Edward G., Yonatan H. G., et al. Projecting the transmission dynamics of SARS-C0V-2 through the post-pandemic period. Science. 2020; (368)6493: 860-868.
  3. Fotios P., and Spyros M. Forecasting the novel coronavirus COVID-19. PLOS One. 2020; 15(3): e0231236.
  4. Verity R., Okell L.C., Dorigatti IWinskill P. et al. Estimates of the severity of coronavirus diseases 2019: a model-based analysis. Lancet Infectious Disease. 2020; 20(6): 669-677.
  5. Taylor JW. Exponential smoothing with a dampedmultiplicative trend. International Journal of Forecasting. 2003; 19(4): 715-725.
  6. Liming Xie. Time series analysis and prediction on cancer incidence rates. Journal of Medical Discovery: 2(3); jmd17030. DOI:10.24262/jmd.2.3.17030.
  7. Liming Xie. Analyzing and forecasting HIV data using hybrid time series models. Asian Journal of Probability and Statistics. 2018; 2(3): 1-12.
  8. Liming Xie. Estimation of parameters on Texas reservoirs using least absolute shrinkage and selection operate. Springer. 2019; 3 (2):93-104.
  9. Liming Xie. Clustering analysis of the survey for mobility reasons in the US 1999-2017. Asian Journal of Probability and Statistics. 2019; 3(1): 1-12.
  10. Forster P., Forster L., Renfrew C., and Michael Forster (2020). Phylogenetic network analysis of SARS-CoV-2 genomes. Proceedings of the National Academy of Sciences of the United States of America. 2020; 117(17): 9241-9243.
  11. Ottar N. Bjornstad (2018). Epidemics models and data using R: Springer Nature. Switzerland AG, Gewerbestrasse 11, 6330 Cham, Switzerland. (Accessed June 2018, athttps://download.e-bookshelf.de/download/0011/8568/98/L-G-0011856898-0035204877.pdf )
  12. Steven L. Scott and Varian H. R. Predicting the present with Bayesian structural time series. SSRN. 2013: 5(1/2):4-23.
  13. Sunghae Jun (2019). Bayesian structural time series and regression modeling for sustainable technology management. Sustainability. 2019: 11(18): 4945.
  14. Rob J. Hyndman. Forecasting: Principles & Practice. University of Western Australia. 2014 September 23-25. (AccessedSeptember 23, 2014, at  https://robjhyndman.com/uwafiles/fpp-notes.pdf )
  15. Shumway R. H., and Stoffer D. S. Time series analysis and its applications with R examples (4th Edition). Springer: Switzerland AG, Gewerbestrasse 11, 6330 Cham, Switzerland. 2017.(Accessed September 1, 2016, at https://www.stat.pitt.edu/stoffer/tsa4/tsa4.pdf )
  16. Scott S., and Varian H. R. Predicting the Present with Bayesian Structural Time Series. International Journal of Mathematical Modeling and Numerical Optimization. 2013; 5(1/2): 4-23.
  17. Avril Coghlan. A little book of R for Bayesian statistics (Release 0.1). Nov.07 2017.

Copyright

© This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/