Friday, March 20, 2020

Why go for a Poisson regression and not for a log-scale linear regression


The purpose of this post is to clarify some aspects of the analysis of data describing an epidemic. The available data consist of non-negative integers, i.e. counts. They keep track of how many people get sick, how many people recover, how many people die, and so on. In the case of an epidemic, counts are observed over time.
Counts have an important and almost universal characteristic: their trend and their dispersion (i.e. their mean and variance) vary jointly, i.e. they are dependent. With continuous data (for example, measurements) we may have very large values very close to each other, or very small values very far from each other. With count data this is generally not the case: series of generally large counts are more variable than series of generally small counts. This is not an abstract fact; it is an observable and measurable behavior, and it influences the way in which counts should be described mathematically.
For anyone with some elementary math knowledge, the first thing that comes to mind when looking at the way an epidemic spreads from the start is “it grows exponentially!”. It would therefore seem natural to take the logarithm of the counts and then draw a straight line through the data.
The intuition behind this is correct, but it does not account for the issue mentioned above. In a time-evolving count process such as an epidemic, the mean and the variance of the data change together as time goes on. “Drawing a straight line” through the log counts by ordinary least squares instead assumes that, on the log scale, the trend and the variability of the data behave independently of one another.
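The joint behavior of mean and variance is easy to see in a small simulation. The sketch below assumes Poisson-distributed counts (for which the variance equals the mean) and compares a series of small counts with a series of large ones; the means 5 and 200, the sample size and the seed are arbitrary choices made for illustration, and `sample_poisson` is a helper written here, not a library function.

```python
import math
import random
import statistics


def sample_poisson(mean, rng):
    """Draw one Poisson variate with Knuth's multiplication method.

    Adequate for moderate means; exp(-mean) underflows for very large ones.
    """
    threshold = math.exp(-mean)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


rng = random.Random(2020)  # fixed seed so the run is reproducible

# Two simulated series: generally small counts and generally large counts.
low = [sample_poisson(5, rng) for _ in range(5000)]
high = [sample_poisson(200, rng) for _ in range(5000)]

for label, series in (("small counts", low), ("large counts", high)):
    print(label,
          "mean:", round(statistics.mean(series), 1),
          "variance:", round(statistics.variance(series), 1))
```

In both series the sample variance should land close to the sample mean, so the series with larger counts is also the more variable one.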
The appropriate approach when dealing with count data is to use generalized linear models (GLMs), introduced in the 1970s. GLMs take the type and behavior of the data into account and extend the linear regression framework to Poisson or Negative Binomial models, which generally do a better job of describing this type of data.
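To make the contrast concrete, here is a minimal sketch that fits both models to the same series: a straight line through the log counts, and a Poisson GLM with log link, log(mu_t) = a + b*t, fitted by iteratively reweighted least squares. The counts are invented for illustration and are not real epidemic data; the function names are also choices made here, and only the Python standard library is used so the example stays self-contained.

```python
import math

# Hypothetical daily counts, invented for illustration only.
days = list(range(10))
counts = [2, 3, 3, 6, 9, 11, 20, 27, 41, 60]


def fit_log_linear(t, y):
    """Ordinary least squares of log(y) on t; every y must be positive."""
    logs = [math.log(v) for v in y]
    n = len(t)
    t_bar, l_bar = sum(t) / n, sum(logs) / n
    sxy = sum((ti - t_bar) * (li - l_bar) for ti, li in zip(t, logs))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    slope = sxy / sxx
    return l_bar - slope * t_bar, slope


def fit_poisson_glm(t, y, max_iter=100, tol=1e-12):
    """Poisson regression log(mu) = a + b*t via iteratively reweighted least squares."""
    a, b = fit_log_linear(t, [v + 0.5 for v in y])  # starting values
    for _ in range(max_iter):
        eta = [a + b * ti for ti in t]
        mu = [math.exp(e) for e in eta]
        # Working response for the log link. The weights equal the fitted means,
        # which is where the mean-variance dependence enters the fit.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        sw = sum(mu)
        swt = sum(m * ti for m, ti in zip(mu, t))
        swtt = sum(m * ti * ti for m, ti in zip(mu, t))
        swz = sum(m * zi for m, zi in zip(mu, z))
        swtz = sum(m * ti * zi for m, ti, zi in zip(mu, t, z))
        new_b = (sw * swtz - swt * swz) / (sw * swtt - swt ** 2)
        new_a = (swz - new_b * swt) / sw
        converged = abs(new_a - a) + abs(new_b - b) < tol
        a, b = new_a, new_b
        if converged:
            break
    return a, b


a_ols, b_ols = fit_log_linear(days, counts)
a_glm, b_glm = fit_poisson_glm(days, counts)
print("log-linear growth rate:", round(b_ols, 4),
      "doubling time:", round(math.log(2) / b_ols, 2))
print("Poisson GLM growth rate:", round(b_glm, 4),
      "doubling time:", round(math.log(2) / b_glm, 2))
```

The two fits generally give different growth rates, because the log-linear fit weights every day equally on the log scale while the Poisson fit gives more weight to days with larger expected counts. In practice one would likely use an existing implementation, such as R's `glm` with `family = poisson` or the `GLM` class in statsmodels, rather than hand-written iterations.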
Of course, the issues outlined above can be neatly and elegantly formalized. This, however, might not be the place to delve into mathematical rigor.
There is another very relevant point worth mentioning: to assume exponential growth is to assume that counts may grow to infinity. This is true both for linear regressions on log-transformed data and for Poisson or Negative Binomial regressions with a log-linear time trend. This means that these approaches are limited to short-term analyses. In the medium term, cumulative counts will saturate, following a curve with an asymptote. In this case, one should use Poisson or Negative Binomial GLMs with asymptotic growth curves.
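As a rough illustration of this last point, the sketch below compares an exponential mean curve with a logistic one, taken here as one possible asymptotic growth curve (the post does not say which curve to use). All parameter values are hypothetical, the "observed" series is synthetic and noise-free, and treating the cumulative counts as independent Poisson observations is a simplification made only to keep the example short.

```python
import math


def logistic_mean(t, K, r, t0):
    """Expected cumulative count under a logistic curve with upper asymptote K."""
    return K / (1.0 + math.exp(-r * (t - t0)))


def exponential_mean(t, K, r, t0):
    """Early-phase limit of the logistic curve: unbounded exponential growth."""
    return K * math.exp(r * (t - t0))


def poisson_loglik(y, mu):
    """Poisson log-likelihood of counts y given means mu."""
    return sum(yi * math.log(m) - m - math.lgamma(yi + 1) for yi, m in zip(y, mu))


# Hypothetical parameters chosen only for illustration.
K, r, t0 = 10_000, 0.25, 20
days = list(range(40))
observed = [round(logistic_mean(t, K, r, t0)) for t in days]  # synthetic series

for label, window in (("first 10 days", days[:10]), ("all 40 days", days)):
    y = [observed[t] for t in window]
    ll_logistic = poisson_loglik(y, [logistic_mean(t, K, r, t0) for t in window])
    ll_exponential = poisson_loglik(y, [exponential_mean(t, K, r, t0) for t in window])
    print(f"{label}: logistic {ll_logistic:.1f}, exponential {ll_exponential:.1f}")
```

Over the early window the two curves describe the data about equally well, which is why exponential models are acceptable in the short term; over the full period the exponential mean overshoots the asymptote `K` and its log-likelihood collapses.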
