Science Strikes Back at Covid-19

M.R. Temple-Raston, PhD, CIO and Chief Data Scientist at Decision Machine, Precision Alpha and Brandthro

Summary

This article proposes using the publicly available global COVID-19 data on the web, highly developed (data) science, and readily available computational power to better understand the nature, size, and lifecycle of the COVID-19 pandemic. What can we (quickly) learn, and how could it be applied to our recovery?

Size and Shape of a Pandemic

At the time of writing, New York City has seen some slowing in its confirmed COVID-19 cases, and this has led to some optimistic headlines. But at the same time, it is reported that several thousand COVID-19 deaths are to be added to the death totals.

These new, presumed COVID-19 cases did not come from a hospital visit and formal hospital testing. Apparently, and not surprisingly, many COVID-19 deaths occur at home, with no testing, with no formal isolation protocols to prevent further infections, and no tracking. The current NYC process to gather data about confirmed cases has an enormous gap and needs to be improved if testing is to help bring the pandemic under control. To New York’s credit, they have responded by adding the missing data.

Once the contagion is past its early stages, the death toll is a much better measure than confirmed cases. The death toll is the bottom-line measure of our success. By combining the death toll data with data science, we can produce scientific fact.

Machine Learning, at its very foundation, integrates measurement data and states with the mathematical machinery of science to express probabilities, classifiers (ratios of probabilities), expected values, and so on. In some problems, the probabilities can be calculated exactly. In more complex problems, or often simply for ease or convenience, the probabilities are estimated numerically by curve fitting.

In our problem, we can take the death toll and the mortality to calculate exactly (in closed form) the probabilities of contracting the disease, the daily disease source strength, and the daily infection (how many people are subsequently infected by one person). Mortality is defined as the number of people who died from the disease divided by the number of people who contracted it. It is difficult to know the local mortality precisely while the contagion is in progress.

The “most probable value” for the inferred COVID-19 cases can be calculated as a function of the mortality. Assuming a modest and conservative mortality of 3%, we compare New York City’s confirmed cases to the inferred cases from the observed New York City death toll (see plot below) between March 20th and April 14th.

Note that in the early days of the contagion the confirmed cases were larger than the inferred cases, since deaths lag the symptoms and initial testing of the disease. Note also that there was a big jump in deaths between Day 17 and Day 18, due to a change in the way deaths are counted. No corresponding adjustment to the confirmed cases was even attempted.

The plot indicates that our current testing protocols are missing many COVID-19 cases, and that the inferred cases are rising faster than the confirmed cases. The slope of the inferred cases does not depend on the mortality.

If the mortality were less than 3%, then the inferred cases in the plot would be bigger still. Conversely, for the confirmed and inferred cases to agree, the mortality would need to be more than 7.1%.
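The scaling argument above can be sketched in a few lines of Python. This is a minimal sketch, assuming the simplest possible estimate (inferred cases equal the cumulative deaths divided by the mortality), which may differ from the exact closed-form expressions the article relies on. The function names and the sample numbers are hypothetical and chosen only for illustration; they are not New York City data.

```python
def inferred_cases(deaths: float, mortality: float) -> float:
    """Estimate total cases from a death toll, assuming cases = deaths / mortality."""
    if not 0.0 < mortality <= 1.0:
        raise ValueError("mortality must be in (0, 1]")
    return deaths / mortality


def break_even_mortality(deaths: float, confirmed_cases: float) -> float:
    """Mortality at which the inferred cases would equal the confirmed cases."""
    if confirmed_cases <= 0:
        raise ValueError("confirmed_cases must be positive")
    return deaths / confirmed_cases


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    deaths, confirmed = 1_000, 20_000
    for mortality in (0.01, 0.03, 0.05):
        estimate = inferred_cases(deaths, mortality)
        print(f"mortality {mortality:.0%}: about {estimate:,.0f} inferred cases")
    print(f"break-even mortality: {break_even_mortality(deaths, confirmed):.1%}")
```

Under this assumption, a lower mortality always produces a larger inferred case count, and the break-even mortality is simply the ratio of deaths to confirmed cases.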

If testing starts late or is poorly executed, then whatever it discovers may miss the scale of the problem. Without comprehensive and proactive COVID-19 testing, confirmed cases alone can at most tell us that a wave is going to hit, but not its size.

Moreover, if the confirmed cases are not representative, then the behavior of a small sample (the tip of the iceberg) tells us next to nothing about the entire population, and indeed the behavior of the sample can be very misleading. This happens, for example, if we exhaustively test an overly narrow population.

What can data science do to help? We will not create a generic machine learning model with arbitrary parameters to fit the data. Instead, we use the same data science to design and implement a detector specifically for the problem at hand, with no arbitrary parameters.

Effectively, we create a Geiger Counter for COVID-19.[1] You may recall that a Geiger Counter measures the source strength of radioactive material (indirectly) through ionization; we do the same for COVID-19. We measure the source strength and other quantities of the COVID-19 disease (indirectly) through the daily death toll. The efficiency of the detector corresponds to the mortality of the disease.

The clicks-per-second in a Geiger Counter are replaced by the deaths-per-day. Machine learning infers what the source strength of the disease and the infections per day must be in order to agree with the observed death toll.

To establish a basis for comparison between countries and regions, we use a standard mortality of 3%. Exact expressions are known for the expected and most-probable infections. So, when needed, we can scale the results to any value of mortality.
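One way to make the detector analogy concrete is to treat each infection as independently "detected" (ending in death) with probability equal to the mortality, so that the death count is a binomial thinning of the infections. That modelling choice, the flat prior on the number of infections, and the function names below are assumptions made for this sketch; the article does not spell out its own expressions. Under those assumptions, the most probable and expected numbers of infections have simple closed forms:

```python
import math


def _check(deaths: int, mortality: float) -> None:
    if deaths < 0:
        raise ValueError("deaths must be non-negative")
    if not 0.0 < mortality <= 1.0:
        raise ValueError("mortality must be in (0, 1]")


def most_probable_infections(deaths: int, mortality: float) -> int:
    """Most probable number of infections N given k deaths.

    Assumes deaths ~ Binomial(N, mortality). The likelihood
    C(N, k) * p**k * (1 - p)**(N - k) is maximized at N = floor(k / p).
    When k / p is a whole number, N - 1 is equally likely, and
    floating-point rounding can shift the result by one in that case.
    """
    _check(deaths, mortality)
    return math.floor(deaths / mortality)


def expected_infections(deaths: int, mortality: float) -> float:
    """Expected number of infections under a flat prior on N (an assumption).

    The posterior of N - k is then negative binomial, which gives
    E[N] = (k + 1) / p - 1.
    """
    _check(deaths, mortality)
    return (deaths + 1) / mortality - 1


def rescale(value: float, from_mortality: float, to_mortality: float) -> float:
    """Rescale a deaths / mortality style estimate to a different mortality."""
    return value * from_mortality / to_mortality
```

For example, an estimate computed at the standard 3% mortality could be converted to a 6% mortality with `rescale(estimate, 0.03, 0.06)`, which halves it.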

In the plot below, we calculate the source strength and total infections per day for Hubei province in China based on the daily death toll. The death toll data is provided by Johns Hopkins.

Let’s take a moment to understand the general mechanics of a contagion. Compare the COVID-19 infections per day profile calculated above, with a traditional fire development curve that shows the time history of a fuel limited fire (plot below, provided by NIST). We have superimposed the fire development phases on the Hubei plot above to make the comparison clearer. We see a very similar profile. The lifecycle phases (defined by changes in second-derivatives) for infectious diseases are also independent of the mortality.
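The remark that the lifecycle phases are defined by changes in second derivatives, and are therefore independent of the mortality, can be illustrated with a short Python sketch. It assumes a discrete second difference as a stand-in for the second derivative and uses a made-up infections-per-day series; the function name and the data are hypothetical, and real data would be noisy enough to need smoothing first.

```python
from typing import List, Sequence


def curvature_change_points(values: Sequence[float]) -> List[int]:
    """Return the indices where the sign of the second difference flips.

    The second difference at index t is
    values[t + 1] - 2 * values[t] + values[t - 1].
    Zero second differences are skipped, so a flip is reported only
    when the curvature actually changes sign.
    """
    changes: List[int] = []
    previous_sign = 0
    for t in range(1, len(values) - 1):
        second_diff = values[t + 1] - 2 * values[t] + values[t - 1]
        sign = (second_diff > 0) - (second_diff < 0)
        if sign != 0:
            if previous_sign != 0 and sign != previous_sign:
                changes.append(t)
            previous_sign = sign
    return changes


if __name__ == "__main__":
    # Hypothetical bell-shaped infections-per-day series, for illustration only.
    per_day = [1, 2, 4, 8, 13, 17, 19, 19, 17, 13, 8, 4, 2, 1]
    print(curvature_change_points(per_day))
    # Dividing by a different mortality rescales the curve but not its shape.
    print(curvature_change_points([v / 0.03 for v in per_day]))
    print(curvature_change_points([v / 0.07 for v in per_day]))
```

Because dividing by the mortality multiplies every point by the same positive constant, the second differences keep their signs and the change points land on the same days.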

Question: What could go wrong if we stop social distancing at the wrong time?

Think of the coronavirus as the fuel and of us (humans) as the oxygen. As the supply of humans is depleted (through social distancing and self-isolation), the contagion slows or decays.

If humans mix prematurely with an already hot contagious environment, the energy level increases. This human “ventilation” can result in a rapid increase in the contagion, potentially leading to a flashover condition.

Similarly, a backdraft could be created by a sudden ventilation of humans into a fully developed or decaying contagion (an improving environment) too early. Recall that backdrafts are explosive.

In this section, the effort taken to study the contagion through data science has given us crucial insight and caution.

Monitoring and Recovery

By monitoring the shape of COVID-19 with a COVID Geiger Counter using the death toll, we charted the progress of the contagion (www.decision-machine.com/covid19) to the present. By comparing the progress to Hubei (Figure 2 above), we can determine the stage in the contagion lifecycle.

For example, for New York City and many others, it has been straightforward to identify the ignition, development, fully developed, and early decay phases of COVID-19 in the infections per day plot in the chart below. In the first plot below, we see New York City reaching a top and testing a drop into Recovery.

There are two additional measurements in the chart above that are both descriptive, and independent of the mortality value: the death toll energy and the infection ratio.

The death toll energy is analogous to the energy of a ball thrown straight up in the air…with a rocket engine attached to it, with limited fuel. It is not hard to tell through measurement when the rocket engine has stopped burning fuel, when the ball coasts to its maximum height, and when it begins to fall. The death toll energy is simply not a function of the mortality.

Similarly, the infection ratio, the ratio between the disease incidents and the number of people infected by an incident of the disease, is also independent of the mortality.

By comparing charts for countries further along in their lifecycle, critical values for the infection ratio were identified.

The COVID-19 Geiger Counter and the additional measurements provide a reliable, unbiased and scientific judgement of the contagion. Our scientific results, published daily, repeatedly contradicted most of what the media reported out of wishful thinking.

This is the best we can do when the contagion is developing and the opportunity to control the spread through testing has been missed. More lives were lost than should have been. But now we need to look forward, trying not to make it worse, and to figure out how what we have learned can be used and optimized for a full-scale recovery.

How Can Company Testing Produce Safe Workplaces?

In this section we design a new COVID-19 detector for a contagion that is not in ignition, and where testing is still an effective means to prevent ignition. We center our design on companies in a county with a large commuting base of employees from the county and from surrounding counties.

We will use New York City as an example, but the detector design could be applied generally to any concentration of companies in a county.

Data science and disease metrics are computed to support and inform company testing, turning companies into fact-based, data-driven, networked COVID-19 detectors. A successful design would optimize the number of tests and how they are applied in each company to create a safe employee environment with minimal risk.

Assume that when we are ready for testing:

Effective COVID-19 tests will be produced (in the coming weeks and months) in enough quantity and with an economic and stable price point so that businesses can purchase them,
Tests will invariably improve, and there will be a need to integrate new tests, swap out old tests, introduce new processes, and better information to produce safer work environments at an acceptable cost,
Only employees from recovering counties will be eligible to participate in Get Back to Work (GBW) programs, and a recovering county would be defined unambiguously and without bias from the lifecycle (mortality independent) phases defined by data science in the previous sections,
A full protocol from Home to Office, and back to Home can be developed by each company, with minimum safety standards established (example: Take your temperature at Home. If you have a temperature above x, don’t come in, etc.).

The essential idea is illustrated for New York City in the figure below. For communities and surrounding counties, the testing in New York City would supplement and complement local testing. At the time of writing Suffolk, Nassau, Westchester and Rockland counties were dropping into Recovery. The dotted lines to New Jersey and Connecticut denote future integrations into the network as the relevant commuting counties drop into Recovery.

In addition to the lifecycle state, infection energy and infection ratio calculated for each county, data science can also calculate exactly the “coupling” between counties (the edges). From the coupling, data science can calculate the expected impact on New York City of an increase in infections in any county. By expanding the network, we can also calculate the impact of a spike in infections from any one county on the others.
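The network idea can be pictured with a toy Python sketch. The article does not define how the coupling is calculated, so the sketch assumes the simplest possible reading: each directed edge carries a weight giving the expected extra infections in the target county per extra infection in the source county, and impacts add linearly. The function name, the weights, and the spike sizes are all hypothetical and are not measured values.

```python
from typing import Dict, Tuple

# (source county, target county) -> assumed linear coupling weight
Coupling = Dict[Tuple[str, str], float]


def expected_impact(coupling: Coupling, spikes: Dict[str, float], target: str) -> float:
    """Expected extra infections in `target` caused by spikes elsewhere.

    Assumes a first-order linear model: every county's spike contributes
    coupling[(county, target)] * spike, and missing edges count as zero.
    """
    return sum(
        coupling.get((source, target), 0.0) * extra
        for source, extra in spikes.items()
        if source != target
    )


if __name__ == "__main__":
    # Made-up weights, for illustration only.
    coupling: Coupling = {
        ("Nassau", "New York City"): 0.5,
        ("Suffolk", "New York City"): 0.25,
        ("Westchester", "New York City"): 0.5,
    }
    spikes = {"Nassau": 100.0, "Suffolk": 40.0}
    print(expected_impact(coupling, spikes, "New York City"))
```

Adding more counties to the network only means adding more edges to the dictionary; the same function then reports the expected impact on any chosen county.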

Just as importantly, for those hot spots still simmering with new COVID-19 cases, New York City can now focus its testing on communities that are: being hit disproportionately and producing new uncontrolled cases; difficult to reach through employers; and largely invisible to our current testing protocols.

Conclusion

Data Science can provide precise, unbiased and scientific contextual information to manage the containment of infectious diseases. COVID-19 provides us with an opportunity to focus on what can be accomplished.

In the first two sections of this article we discuss our current environment and what we can learn and conclude from analysis of the publicly available data. The third section is forward looking. It redesigns the data science in the previous sections to create a framework of managed, networked, company-based COVID-19 detectors that can be used to create safer workplaces. A fully integrated testing framework based on science is our best hope of safely creating the new normal.

[1] The relevant science and mathematics have a long history. Hans Geiger worked for Ernest Rutherford as a physicist in Manchester, UK. This period was a high point in science (1908) with theory and experiment intertwined.