Bridging the gap: Using reservoir ecology and human serosurveys to estimate Lassa virus incidence in West Africa

Abstract
Forecasting how the risk of pathogen spillover changes over space is essential for the effective deployment of interventions such as human or wildlife vaccination. However, due to the sporadic nature of spillover events and limited availability of data, developing and validating robust predictions is challenging. Recent efforts to overcome this obstacle have capitalized on machine learning to predict spillover risk. Past approaches combine infection data from both humans and the reservoir to train models that assess risk across broad geographical regions. In doing so, these models blend data sources that separately describe pathogen risk posed by the reservoir and the realized rate of spillover into the human population. We develop a novel approach that models, as separate stages, 1) the contributions of spillover risk from the reservoir and pathogen distribution, and 2) the resulting incidence of the pathogen in the human population. Our methodology allows for a rigorous assessment of whether forecasts of spillover risk can reliably predict the realized spillover rate into humans, as measured by seroprevalence. In addition to providing a rigorous cross-validation of risk predictions, this methodology could shed light on human habits that modulate or amplify the resultant spillover. We apply our method to Lassa virus, a zoonotic pathogen that poses a high threat of emergence in West Africa. The resulting framework is the first forecast to quantify the extent to which predictions of spillover risk from the reservoir explain regional variation in human seroprevalence. We use predictions generated by the model to revise existing estimates for the annual number of new human Lassa infections. Our model predicts that between 935,200 and 3,928,000 humans are infected by Lassa virus each year, an estimate that exceeds current conventional wisdom.

Author Summary
The 2019 emergence of SARS-2 coronavirus is a grim reminder of the threat animal-borne pathogens pose to human health. Even prior to SARS-2, the spillover of so-called zoonotic pathogens was a persistent problem, with pathogens such as Ebola and Lassa regularly but unpredictably causing outbreaks. Machine-learning models that can anticipate when and where animal-to-human virus transmission is most likely to occur could help guide surveillance efforts, as well as preemptive countermeasures to pandemics, like information campaigns or vaccination programs. We develop a novel machine learning framework that uses data-sets describing the distribution of a virus within its host and the range of its animal host, along with human immunity data, to infer rates of animal-to-human transmission across a focal region. By training the model on data from the animal host, our framework allows rigorous validation of spillover predictions on human data. We apply our framework to Lassa fever, a viral disease of West Africa that is spread to humans by rodents, and update estimates of symptomatic and asymptomatic Lassa virus infections in humans. Our results suggest that Nigeria is most at risk for the emergence of new strains of Lassa virus, and therefore should be prioritized for outbreak surveillance.

Introduction
Emerging infectious diseases (EIDs) pose a deadly threat to public health. Approximately 40% of EIDs are caused by pathogens that circulate in a non-human wildlife reservoir (i.e., zoonotic pathogens) [1]. Prior to full-scale emergence, interaction between humans and wildlife creates opportunities for the occasional transfer, or spillover, of the zoonotic pathogen into human populations [2]. These initial spillover cases, in turn, represent newly established pathogen populations in human hosts that are subject to evolutionary pressures and may potentially lead to increased transmission among humans [2,3]. Consequently, a key step in preempting the threat of EIDs is careful monitoring of when and where spillover into the human population occurs. However, because the majority of EIDs from wildlife originate in low- and middle-income regions with limited health system infrastructure, accurately estimating the rate and geographical range of pathogen spillover, and therefore the risk of new EIDs, is a major challenge [1].
between disease presence and the environment can be extended across a region of interest. Using these techniques, previous studies of Lassa fever (LF) have derived risk maps that assess the likelihood of human LF cases being present in different regions of West Africa [4,5]. Fitted risk maps are often assessed, in turn, by evaluating the ability of a model to discriminate between case data and background data that were left out of the training process [5,7]. Though such models discriminate impressively when evaluated as binary classifiers, these forecasts are not explicitly vetted on their ability to predict the magnitude of pathogen spillover from the reservoir into humans. As a result, the extent to which predicted risk explains the realized variation in human exposure to the pathogen is unclear.

To address this need, we develop a multi-layer machine learning framework that accounts for the different ways in which data from a wildlife reservoir and data from human serosurveys inform spillover risk in people, and that rigorously assesses whether predicted risk quantifies the rate of new cases in humans. Our approach uses machine learning algorithms that, when trained on data from the wildlife reservoir alone, estimate the likelihood that the reservoir and the zoonotic pathogen are present in an area. These predictions are then combined into a composite estimate of spillover risk to humans. Next, our framework uses estimates of human pathogen seroprevalence, as well as estimates of human population density, to translate the composite risk estimate into a prediction of the realized rate of zoonotic spillover into humans. Omitting human seroprevalence data from the training process of the risk layer has several advantages. First, in the case of LF, due to modern transportation and the longevity of Lassa virus antibodies in humans, a general concern is that the reported location of human disease or Lassa virus antibody detection is not the site at which the infection occurred [10–12]. Training the risk layer on reservoir data alone helps avoid these biases. Second, in our framework, human seroprevalence estimates provide an ultimate test of the risk layer's ability to correlate with cumulative human exposure to the pathogen.

rodent-to-human transmission is believed to account for the majority of new LASV infections [11,15]. LASV spreads to humans from its primary reservoir, the multimammate rat Mastomys natalensis, through food contaminated with infected rodent feces and urine, as well as through the hunting of rodents for food consumption [16]. Because M. natalensis has limited dispersal relative to humans, direct LASV detection in the rodents is likely to indicate actual areas of spillover risk.

We evaluate each layer of our framework for its ability to predict different attributes of LASV spillover into humans. Our model demonstrates a clear ability to predict spillover risk as measured by the spatial distribution of the LASV pathogen and reservoir, and a more moderate correlation between the predicted risk and human seroprevalence. We also use our machine learning framework to estimate the total LASV spillover into humans. Data from longitudinal serosurveys have been used to estimate that between 100,000 and 300,000 LASV infections occur each year, and that between 74% and 94% of LASV infections result in sub-clinical febrile illness or are asymptomatic [17].

Though these estimates are often used to describe the magnitude of LASV spillover into humans [11,18,19], their generality is unclear because they are based on extrapolation from serosurveys conducted in the 1980s in Sierra Leone [17]. More recent estimates indicate that as many as 13 million LASV infections may occur each year [20].

and measured seroprevalence of human LASV serosurveys. The first two data-sets generate response variables for the model layers that predict LASV risk. The human seroprevalence data are used to evaluate the combined LASV risk layer for its ability to predict LASV spillover in humans and are also used to calibrate the stage of the model that predicts human LASV spillover. Our full data-set and the script files used to fit the models are available in a GitHub repository [21].

We collected data on documented presences of M. natalensis using the African Mammalia database [22], supplemented with additional presences found in the literature. Because M. natalensis is morphologically similar to other rodents in the area (e.g., Mus baoulei, Mastomys erythroleucus), we only include those presences that have been confirmed with gene sequencing methods. Each presence was verified with the help of a rodent expert (E.F.C).

Fitting the model requires supplementing the presence-only data with background points, also called pseudo-absences [23,24]. Background points serve as an estimate of the distribution of sampling effort for the organism being modeled [25]. We used

Surveys of Mastomys natalensis for Lassa virus
We compiled a data-set from published studies that sampled M. natalensis rodents for indicators of LASV. The majority of the studies used were found using Table 2

Africa that contained a M. natalensis LASV survey was classified into the categories "Lassa positive" or "Lassa negative." Specifically, a pixel was defined as Lassa positive if, at some point, a M. natalensis rodent was captured within the pixel and tested positive for LASV using an RT-PCR assay. Pixels were classified as Lassa negative if five or more M. natalensis rodents in total were tested for LASV infection by RT-PCR or tested for any previous arenavirus exposure using a serological assay, and all rodents tested were negative. This procedure allowed us to classify 74 unique pixels in total: 36 were classified as Lassa negative, and 38 were classified as Lassa positive (Figure 1).
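The pixel classification rule described above can be sketched in a few lines of Python. This is an illustration only, not the authors' script: the `RodentTest` record layout and the function name are hypothetical, and leaving a pixel unclassified when it has a seropositive rodent but no RT-PCR-positive rodent is one reading of the stated rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RodentTest:
    """One tested M. natalensis rodent (hypothetical record layout)."""
    assay: str       # "rt-pcr" or "serology"
    positive: bool


def classify_pixel(tests: List[RodentTest], min_negative: int = 5) -> Optional[str]:
    """Return "positive", "negative", or None (unclassified) for one pixel."""
    # Any RT-PCR-positive rodent makes the pixel Lassa positive.
    if any(t.assay == "rt-pcr" and t.positive for t in tests):
        return "positive"
    # Otherwise require at least `min_negative` tested rodents, all negative.
    if len(tests) >= min_negative and not any(t.positive for t in tests):
        return "negative"
    # Too few rodents, or only serological positives: leave unclassified.
    return None
```

Under this sketch, a pixel with four negative rodents stays unclassified rather than being counted as Lassa negative, which mirrors the five-rodent threshold in the text.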

Human seroprevalence data
Lastly, we collected human arenavirus seroprevalence data. To ensure that the measured seroprevalence was representative of a larger village population, we required that the individuals tested for the study were chosen at random from a village. This criterion

commonly associated with small villages and is considered a serious agricultural pest [44,45]. To allow the model to learn these relationships, we include MODIS land cover features as predictors, and also include human population density within each pixel. We also include elevation in meters.

Because climate seasonality and crop maturation affect the breeding season of M. natalensis, we include various measures of the seasonality of the normalized difference vegetation index (NDVI), precipitation, and temperature [46]. See S1 Appendix for a complete list of environmental variables. LASV is often associated with M. natalensis, so we use the

Lastly, we used a susceptible-infected-recovered-susceptible (i.e., SIRS) model to derive human incidence from the predictions of seroprevalence.

among features [48].

Prior to inclusion in the model-fitting procedure, each feature variable was vetted for its ability to distinguish between presences and absences in each of the layers.

Specifically, for each risk layer's binary response variable, we performed a Mann-Whitney U-test on each candidate feature. In doing so, we test the null hypothesis that the distribution of a feature is the same between pixels that are classified as a presence or (pseudo) absence. We only include predictors for which the null hypothesis is rejected at the α = 0.05 level.
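The screening step above can be sketched as follows. This is a minimal standard-library illustration, not the authors' R code: it uses the large-sample normal approximation to the Mann-Whitney U statistic with a tie correction and no continuity correction, so p-values for small samples may differ from those of an exact test. The function names and the dictionary layout of features are assumptions.

```python
import math
from typing import Dict, List, Sequence


def mann_whitney_p(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided p-value for the Mann-Whitney U-test (normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * n
    tie_term = 0.0
    start = 0
    while start < n:
        end = start
        while end + 1 < n and pooled[end + 1][0] == pooled[start][0]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # average 1-based rank of the tie group
        for k in range(start, end + 1):
            ranks[pooled[k][1]] = avg_rank
        t = end - start + 1
        tie_term += t ** 3 - t
        start = end + 1
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0  # all values identical: no evidence of a difference
    z = (u1 - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))


def screen_features(presence: Dict[str, List[float]],
                    absence: Dict[str, List[float]],
                    alpha: float = 0.05) -> List[str]:
    """Keep features whose presence and (pseudo) absence values differ at level alpha."""
    return [name for name in presence
            if mann_whitney_p(presence[name], absence[name]) < alpha]
```

A feature whose values overlap completely between the two classes is dropped, while one that separates presences from pseudo-absences is retained as a candidate predictor.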

For a given training set, we fit the BCT model using the gbm.step function of the "dismo" package in the statistical language R [49]. This function uses 10-fold cross-validation to determine the number of successive trees that best models the relationship between response and features without over-fitting the data [49]. The

composed of many small incremental improvements. A smaller learning rate was used in the D_L layer because the corresponding data-set was smaller. The parameter that describes the maximum number of allowable trees was set to a large value (10^7) to ensure that the cross-validation fitting process was able to add trees until no further improvement occurred [47].

background pixels that are available in the overall data-set [24].

[50].

The D_L layer is generated by the averaged predictions of 25 boosted classification tree models, each of which is trained to discriminate between pixels that are Lassa

For each pixel across West Africa, the equations that describe the number of humans in each of the classes are:

By dividing R* by the total population size at steady-state, b/d, we find that the long-term seroprevalence, denoted Ω*:

Next, we use Eq (5) to estimate the LASV spillover rate F S*, given that the steady-state LASV seroprevalence is Ω*. Solving Eq (5) for F in terms of Ω* yields:

The rate of new cases is given by

These analyses were derived using Mathematica. The notebook file is available in the GitHub repository [21].
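As a rough illustration of this steady-state argument, consider a reduced version of an SIRS model in which the infected class is short-lived enough to be collapsed and LF-induced mortality is neglected. These simplifications, and the equations below, are assumptions made for illustration and should not be read as Eqs (5)–(7) themselves. With birth rate b, per-capita death rate d, force of infection F, and seroreversion rate λ:

```latex
\begin{align}
\frac{dS}{dt} &= b - dS - FS + \lambda R, \\
\frac{dR}{dt} &= FS - (d + \lambda) R .
\end{align}
% Steady state: S^* + R^* = b/d and F S^* = (d + \lambda) R^*, so that
\begin{align}
\Omega^* &= \frac{R^*}{b/d} = \frac{F}{F + d + \lambda}, \\
F &= \frac{(d + \lambda)\,\Omega^*}{1 - \Omega^*}, \\
\eta &= F S^* = \frac{b}{d}\,(d + \lambda)\,\Omega^* .
\end{align}
```

In this reduced model, setting λ = 0 gives η_0 = bΩ*, so the ratio η/η_0 equals 1 + λ/d: faster seroreversion implies that more new infections are needed each year to sustain the same observed seroprevalence.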

By substituting our prediction of human LASV seroprevalence for Ω*, we can estimate the total human infection rate using Eq (7). Calculating these estimates
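The substitution of predicted seroprevalence into a rate-of-new-cases formula can be sketched as below. The formula used here, η = N (d + λ) Ω*, comes from a simplified steady-state SIRS model (infected class collapsed, no disease-induced mortality, N standing in for the steady-state population b/d); it is an assumption for illustration and is not claimed to be Eq (7). The function names and any parameter values passed in are hypothetical.

```python
from typing import Iterable, Tuple


def new_cases_per_year(seroprev: float, population: float,
                       death_rate: float, seroreversion_rate: float) -> float:
    """Yearly new infections in one pixel under the simplified model.

    eta = N * (d + lambda) * Omega, with rates expressed per year.
    """
    if not 0.0 <= seroprev < 1.0:
        raise ValueError("seroprevalence must lie in [0, 1)")
    return population * (death_rate + seroreversion_rate) * seroprev


def total_new_cases(pixels: Iterable[Tuple[float, float]],
                    death_rate: float, seroreversion_rate: float) -> float:
    """Sum pixel-level estimates; each pixel is (predicted seroprevalence, population)."""
    return sum(
        new_cases_per_year(omega, n, death_rate, seroreversion_rate)
        for omega, n in pixels
    )
```

Summing over pixels in this way makes explicit that a region's total depends on both its predicted seroprevalence and its population size.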

The general effect of seroreversion can be understood by comparing estimates of new cases, η, with those obtained by the same equation with λ = 0, denoted η_0. We find that with our parameter values,

when seroreversion is included in the model, relative to estimates obtained when seroreversion does not occur.

We also derive a null estimate of the yearly number of LASV spillover cases in humans from Eq (7). This estimate assumes that the incidence of LASV in humans is the same everywhere in the West African study region. Specifically, we calculate Ω* as

natalensis [45], all West African countries likely harbor this primary rodent reservoir of LASV (Fig 3a). Similar to other Lassa risk maps [4,5], our D_L layer predictions indicate that the risk of LASV in rodents is primarily concentrated in the eastern and western extremes of West Africa (Fig 3b). The combined risk, shown in Fig 3c, indicates that environmental features suitable for rodent-to-human LASV transmission are primarily located in Sierra Leone, Guinea, and Nigeria.

July 14, 2020

White dots indicate locations for which measured seroprevalence fell within 0.1 of the prediction. Measured seroprevalence at red dots was 0.1 or more greater than that predicted, and seroprevalence at blue dots was 0.1 or more below the prediction.
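The color coding described in the caption amounts to a three-way threshold on the difference between measured and predicted seroprevalence. A minimal sketch, with a hypothetical function name, is given below; assigning a difference of exactly 0.1 to the red or blue class is an assumption, since the caption does not settle that boundary case.

```python
def residual_color(measured: float, predicted: float, tol: float = 0.1) -> str:
    """Classify a serosurvey location by how far measurement departs from prediction."""
    diff = measured - predicted
    if diff >= tol:
        return "red"    # measured seroprevalence well above the prediction
    if diff <= -tol:
        return "blue"   # measured seroprevalence well below the prediction
    return "white"      # within the tolerance of the prediction
```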

Machine learning approaches that forecast the risk of emerging infectious diseases such as LF are often not assessed on their ability to predict proxies of pathogen spillover into human populations [5,27]. Our forecasting framework advances these approaches by

Because our framework traces the spillover risk into humans back to the spatial heterogeneity in Lassa risk and human density across West Africa, our approach allows us to predict which countries have the highest per-capita risk of LASV infection (e.g., Guinea, Sierra Leone) due to attributes of the reservoir and those that have the highest number of human cases because of their large human population size (e.g., Nigeria).

Clarifying and distinguishing these two different types of risk helps to manage

Using this framework, we are able to generate predictions of the number of new cases of LASV infection within different regions of West Africa. Our results indicate that

magnitude of new cases in Nigeria is driven primarily by its greater human population density, rather than an increased per-capita risk. If these predictions are correct, Nigeria is likely to represent the greatest risk of LASV emergence because the large number of annual spillover events allows for extensive sampling of viral strain diversity and repeated opportunities for viral adaptation to human populations [53].

In addition to identifying the countries most at risk for viral emergence, our model framework provides updated estimates for the rate of LASV spillover across West Africa. Previous estimates of 100,000 to 300,000 cases per year were based on longitudinal studies from communities in Sierra Leone conducted in the 1980s [17]. Using the

population of West Africa has increased by a factor of 2.4 since that time, making these estimates outdated [56]. Later estimates that were partially based on the same longitudinal serosurveys derived an upper bound of 13 million LASV infections, but only considered the number of cases in Nigeria, Guinea, and Sierra Leone [20]. Furthermore, these later estimates are derived from the maximum observed human LASV seroconversion rate in the Sierra Leone study, which likely does not apply across West Africa. In contrast, our estimates are based on human seroprevalence data that come from six countries in West Africa and span a 47-year time period. Because our data-set was obtained from a broader spatial and temporal range, our estimates are less likely to be biased by sporadic extremes in LASV spillover.

Our modeling framework has the benefit of being extendable, thereby giving

inside a domestic dwelling. The incidence of LF is generally believed to peak in the dry season, when M. natalensis migrate into domestic settings [44,57]. Temporal fluctuations in population density, due to seasonal rainfall, would provide another important insight into the seasonal burden of human LF cases [11]. Understanding this ecological connection is important because distributing vaccines at seasonal population lows in wildlife demographic cycles can, in theory, substantially increase the probability of pathogen elimination [58,59]. Incorporating these temporal layers will become more feasible as more time-series data on population density in the focal reservoir species become available.

Other potentially important risk layers that could be added are geographic distributions for other known reservoirs of LASV. Specifically, several species of rodents are known to be capable of harboring the virus [27]. Though M. natalensis is believed to be the primary reservoir that contributes to human infection, it is unknown whether this holds across all regions of West Africa. Understanding the relationship between the habitat suitability of different rodent reservoirs and human LF burden may also help determine whether M. natalensis is the host at which intervention strategies should always be directed. Finally, additional virus sequence data could be used to train a risk layer that forecasts the presence or absence of specific genomic variants that are more likely to cause either severe disease or more efficient human-to-human transmission cycles.

Although the methods we have used here make efficient use of available data, the accuracy of our risk forecasts remains difficult to rigorously evaluate due to the limited availability of current data from human populations across West Africa. The sparseness of modern human data arises for two reasons: 1) the lack of robust surveillance and testing across much of the region where LASV is endemic and 2) the absence of publicly available databases reporting human cases in those countries that do have sophisticated surveillance in place. Improving surveillance for LASV across West Africa and developing publicly available resources for sharing the resulting data would allow more robust risk predictions to be developed and facilitate the targeting of effective risk-reducing interventions. Despite these limitations of existing data, the structured machine-learning models we develop here provide insight into what aspects of environment, reservoir, and virus contribute to spillover, and the potential risk of subsequent emergence into the human population. By understanding these connections, we can design and deploy more effective intervention and surveillance strategies that work in tandem to reduce disease burden and enhance global health security.

Supporting information captions
S1 Appendix. Details on the predictors used in the model and model fits.