Scrutinising the COVID-19 data on 590.000 cases. A retrospective, population-based descriptive study for data quality surveillance and a review at 4.540.000 cases

Oriol Gallemí Rovira

doi:10.1101/2020.05.26.20113316

Summary

Background Reports of detected COVID-19-positive patients are, as of today, the best available estimate of how far the pandemic has spread in a country. To evaluate early indicators of true lethality and recovery time, the data on which a model is built must be quality checked. Each country sets different procedures and criteria for counting COVID-19 fatalities, and health systems are stressed by insufficient testing, untracked patients and premature discharge. This paper discusses the dynamics behind such data quality issues throughout the disease course, to support better modeling and decision-making in a stressed healthcare system.

Methods Based on data compiled and relayed by Johns Hopkins University, tracking over 590,000 COVID-19 patients (March 27th, 2020), the data are clustered and compared using discrete regression. Regression parameters are restricted to a time step of 1 day and must be meaningful for the diagnosis (i.e. a fatality cannot occur before the patient displays symptoms). Cumulative infection curves are taken and built: the infection baseline is the official count declared by each country, and synthetic infection curves are built from the fatality count and the recovered-patient count. The adjusted parameters are τ = time to fatality (days), δ = time to discharge of recovered patients (days) and φ = case fatality rate (CFR, per unit, P.U.). The discharge rate (recovery rate) is therefore forced to be (1-φ).
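One way to picture the discrete regression described above is a grid search over whole-day lags. The sketch below is not the author's implementation: it assumes that cumulative fatalities follow F(t) ≈ φ · I(t − τ), and the names `fit_lag_and_rate` and `synthetic_infections` are hypothetical. For each candidate lag τ ≥ 0 the rate φ is obtained by least squares, and the lag with the smallest mean squared error is kept. The same function could be applied to the recovered count to estimate δ, although the constraint that the recovery rate equals (1-φ) is not enforced here.

```python
import math


def fit_lag_and_rate(infected, outcome, max_lag=30):
    """Fit outcome[t] ~= rate * infected[t - lag] over integer lags >= 0.

    Returns (lag, rate, mse) for the lag with the smallest mean squared error.
    Restricting lag to non-negative integers encodes the 1-day time step and
    the rule that an outcome cannot precede detection.
    """
    best = None
    for lag in range(max_lag + 1):
        xs = infected[: len(infected) - lag]  # cases detected on day t - lag
        ys = outcome[lag:]                    # outcomes counted on day t
        if len(xs) < 2:
            break
        sxx = sum(x * x for x in xs)
        if sxx == 0:
            continue
        rate = sum(x * y for x, y in zip(xs, ys)) / sxx  # least squares through origin
        mse = sum((y - rate * x) ** 2 for x, y in zip(xs, ys)) / len(xs)
        if best is None or mse < best[2]:
            best = (lag, rate, mse)
    return best


def synthetic_infections(outcome, lag, rate):
    """Rebuild an infection curve from an outcome curve: I(t) ~= outcome(t + lag) / rate."""
    return [value / rate for value in outcome[lag:]]


# Synthetic demonstration data (not taken from the study): a logistic
# cumulative case curve and fatalities generated with tau = 6 days, phi = 0.05.
cases = [1000.0 / (1.0 + math.exp(-(t - 20) / 4.0)) for t in range(60)]
deaths = [0.05 * cases[t - 6] if t >= 6 else 0.0 for t in range(60)]

tau, phi, _ = fit_lag_and_rate(cases, deaths)
rebuilt = synthetic_infections(deaths, tau, phi)
```

Note that during purely exponential growth a shifted curve is proportional to the original, so lag and rate cannot be separated; the fit only becomes identifiable once the curve bends.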

Using forward or backward formulas has no influence other than on the time reference. In both cases, the times to onset and to symptoms are neglected and must be added if such dates are to be plotted. There is a gap of two weeks from exposure, through hospital admission, to detection, and the earlier the diagnosis is made, the better the outcome.

Cumulative figures are used to smooth out deviations and to provide the best possible estimator at the present time. The delay factor makes it possible to compare figures that belong to the same date of detection.
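The effect of the delay factor can be illustrated with a small sketch. It assumes that daily counts are first accumulated and that the cumulative fatalities on the latest day are compared with the cumulative cases detected τ days earlier; the function names and the numbers are illustrative and do not come from the study.

```python
from itertools import accumulate


def cumulative(daily_counts):
    """Turn daily counts into a cumulative series."""
    return list(accumulate(daily_counts))


def naive_cfr(cum_cases, cum_deaths):
    """Same-day ratio: compares deaths with cases detected up to the same date."""
    return cum_deaths[-1] / cum_cases[-1]


def delayed_cfr(cum_cases, cum_deaths, tau):
    """Delay-adjusted ratio: compares deaths with cases detected tau days earlier."""
    return cum_deaths[-1] / cum_cases[-1 - tau]


# Illustrative series: cases double daily, and 10% of each day's cases
# are counted as fatalities 3 days later.
daily_cases = [10, 20, 40, 80, 160, 320, 640, 1280]
daily_deaths = [0, 0, 0, 1, 2, 4, 8, 16]

cum_cases = cumulative(daily_cases)
cum_deaths = cumulative(daily_deaths)

same_day = naive_cfr(cum_cases, cum_deaths)
adjusted = delayed_cfr(cum_cases, cum_deaths, tau=3)
```

In this toy series the same-day ratio is far below the 10% used to generate the data, while the delay-adjusted ratio recovers it, which is why figures are compared on the same date of detection.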

Fast, daily models that could be integrated into a filtering stage of the parameter estimator in a more complex approach are out of scope. Continuous models can also be used, but interpolation between data points is another source of noise to be considered, especially when counting methods change suddenly, as is the case with COVID-19.

Countries were grouped as found representative, to illustrate the methodology. Results are discussed and compared across the groups, and potential indicators of this behavior are drawn for further study.

Findings Across the 593,291 cases in the sample and its 7 representative groups, recovery time and local CFR are negatively correlated: the countries with the highest fatality rates (21%, Spain) have the shortest recovery times (11 days, Spain). CFR can also be an indicator of inconsistencies in the infection data (e.g. South Korea, CFR 1%, time to recovery 25 days).

In the review part, the focus is on the inconsistencies detected in the Germany and South Korea datasets, as well as potential misfits for China and Spain.

Overall, the time to fatality ranges between 4 and 8 days, with a mean of 6 days (South Korea, 7 days; Japan, 6 days). Only Germany and France detect cases earlier than other countries, admitting patients 10 days before fatality occurs.

To date, shortening hospital discharge times seems to lead to patient reinfections (COVID-19 positive), and studies are under way along this line.

Interpretation One simple explanation for the correlation between local CFR and recovery time is to treat that rate as a measure of healthcare system overload. Anomalous CFR values point to a stressed healthcare system: the higher the overload, the greater the focus on critical cases and hence the higher the local CFR.

The intrinsic CFR of COVID-19 is unlikely to differ by a factor of 10 between countries with similar lifestyle, GDP per capita and health services (e.g. the Mediterranean Basin, Northern Europe, etc.). For this reason, early CFR values measured before the healthcare system is overwhelmed (COVID-19 free flow) are considered more accurate than CFR values measured while the outbreak is still ongoing.

Finally, the synthetic infection indexes may be a helpful indirect measure of the real population infection rate and can also be used for data quality audits. Any model built upon inconsistent data will be complex to explain and justify.

Funding No specific funding was received.

Competing Interest Statement

The authors have declared no competing interest.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Not applicable

All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Data Availability

For the complete set of data (including ancillary datasets) and elaborated files, contact the corresponding author at oriol.gallemi{at}iqs.edu. The latest data can be retrieved from the JHU GitHub repository:

https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series

https://cnecovid.isciii.es/covid19/#documentaci%C3%B3n-y-datos
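As a rough guide to working with the JHU time-series files, the sketch below assumes the wide CSV layout used in that repository (columns `Province/State`, `Country/Region`, `Lat`, `Long`, followed by one column per date) and sums provincial rows into one cumulative series per country. The function name `country_totals` is hypothetical, and the sample rows are made up purely to show the parsing; they are not real figures.

```python
import csv
import io
from collections import defaultdict


def country_totals(csv_text):
    """Sum province-level rows of a wide time-series CSV into per-country series.

    Assumes the first four columns are Province/State, Country/Region, Lat, Long
    and that every remaining column holds a cumulative count for one date.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    dates = header[4:]
    totals = defaultdict(lambda: [0] * len(dates))
    for row in reader:
        if not row:
            continue  # skip blank lines
        country = row[1]
        for i, value in enumerate(row[4:]):
            totals[country][i] += int(float(value)) if value else 0
    return dates, dict(totals)


# Made-up sample in the assumed layout (placeholder countries and counts).
sample = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Exampleland,0.0,0.0,1,3
North,Federatia,1.0,1.0,2,2
South,Federatia,2.0,2.0,0,5
"""

dates, totals = country_totals(sample)
```

The text of a downloaded file can be passed to the same function in place of `sample`; the resulting per-country cumulative series are the kind of input the lag-based comparisons in the Methods operate on.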

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.