Forecasting SARS-CoV-2 outbreak through wastewater analysis: a success in wastewater-based epidemiology
Rubén Cañas Cañas1,2,3,4, Raimundo Seguí López-Peñalver1, Jorge Casaña Mohedo1,5, José Vicente Benavent Cervera1, Julio Fernández Garrido6, Raúl Juárez Vela7, Ana Pellín Carcelén1, Óscar García-Algar3,4,8, Vicente Gea Caballero1, Vicente Andreu-Fernández1,3,9
1. Faculty of Health Sciences, Valencian International University (VIU), Valencia 46002, Spain
2. Global Omnium, Valencia 46005, Spain
3. Grup de Recerca Infancia i Entorn (GRIE), Institut d'Investigacions Biomèdiques August Pi i Sunyer (IDIBAPS), Barcelona 08036, Spain
4. Department de Cirurgia i Especialitats Mèdico-Quirúrgiques, Universidad de Barcelona, Barcelona 08036, Spain
5. Faculty of Health Sciences, Universidad Católica de Valencia San Vicente Mártir, Valencia 46001, Spain
6. Department of Nursing, University of Valencia, Valencia 46001, Spain
7. Faculty of Health Sciences, La Rioja University, Logroño 26006, Spain
8. Department of Neonatology, Instituto Clínic de Ginecología, Obstetricia y Neonatología (ICGON), Hospital Clínic-Maternitat, BCNatal, Barcelona 08028, Spain
9. Biosanitary Research Institute, Valencian International University (VIU), Valencia 46002, Spain
The COVID-19 pandemic, caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), triggered a global emergency that exposed the urgent need for surveillance approaches to monitor the dynamics of viral transmission. Several epidemiological tools that may help anticipate outbreaks have been developed. Wastewater-based epidemiology is a non-invasive, population-wide methodology for tracking the epidemiological evolution of the virus. However, its limitations, robustness, and intricacies must still be thoroughly evaluated and understood before this strategy can be used effectively. The aim of this study was to train highly accurate predictive models using SARS-CoV-2 concentrations in wastewater in a region consisting of several municipalities. The chosen region was Catalonia (Spain), given the availability of wastewater SARS-CoV-2 quantification from the Catalan surveillance network and healthcare data (clinical cases) from the regional government. Using various feature engineering and machine learning methods, we developed a model that predicts clinical cases accurately and generalizes successfully across the municipalities that make up Catalonia. Explainable machine learning frameworks were also used, which allowed us to understand the factors that influence the model's decision-making. Our findings support wastewater-based epidemiology as a potential surveillance tool to assist public health authorities in anticipating and monitoring outbreaks.
Fig.1 Summary of Data Processing steps (A) and Machine Learning Workflow (B).
Fig.2 Distribution of SARS-CoV-2 concentrations in wastewater and COVID-19 cases without transformation (A) and Box-Cox transformation (B). The data are represented in each situation as a scatterplot and a histogram distribution for the two variables.
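The Box-Cox transformation referred to in Fig.2 can be sketched as follows. This is a minimal illustration that assumes a hand-picked λ; the function name `box_cox` and the sample values are hypothetical, and the λ and fitting procedure actually used in the study are not stated here.

```python
import math


def box_cox(values, lam):
    """Apply the one-parameter Box-Cox transform to strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive inputs")
    if lam == 0:
        # The limit of (v**lam - 1) / lam as lam -> 0 is the natural logarithm.
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


# Hypothetical right-skewed values (not data from the study).
skewed = [1.0, 2.0, 4.0, 8.0, 1000.0]
log_transformed = box_cox(skewed, 0)     # lam = 0: plain log transform
sqrt_like = box_cox(skewed, 0.5)         # lam = 0.5: shifted, scaled square root
```

In practice λ is usually estimated from the data rather than fixed by hand; for example, `scipy.stats.boxcox` returns both the transformed values and a maximum-likelihood estimate of λ when no λ is supplied.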
Fig.3 Heatmap depicting the qualitative progression of clinical cases (top) and SARS-CoV-2 wastewater treatment plant concentrations over time (weekly resolution). Municipalities were grouped into five bins and ordered from top to bottom based on population size for visualization. Darker colors indicate higher values of the represented variable.
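| Model | Fit time (s) | Score time (s) | R² | RMSE |
|---|---|---|---|---|
| LGBM | 0.04 | 0.00 | 0.78 | 0.480 |
| LGBM | 0.03 | 0.00 | 0.79 | 0.458 |
| LGBM | 0.03 | 0.00 | 0.81 | 0.447 |
| LGBM | 0.03 | 0.00 | 0.78 | 0.469 |
| LGBM | 0.03 | 0.00 | 0.78 | 0.447 |
| ETRM | 0.56 | 0.02 | 0.78 | 0.480 |
| ETRM | 0.56 | 0.02 | 0.78 | 0.458 |
| ETRM | 0.57 | 0.02 | 0.80 | 0.458 |
| ETRM | 0.56 | 0.02 | 0.77 | 0.480 |
| ETRM | 0.57 | 0.02 | 0.78 | 0.458 |
| MLPNN | 0.14 | 0.00 | 0.71 | 0.100 |
| MLPNN | 0.14 | 0.00 | 0.72 | 0.100 |
| MLPNN | 0.14 | 0.00 | 0.71 | 0.100 |
| MLPNN | 0.14 | 0.00 | 0.68 | 0.100 |
| MLPNN | 0.14 | 0.00 | 0.70 | 0.100 |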
Tab.1 Performance comparison of the top-performing models using 5-fold cross-validation
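The 5-fold cross-validation behind Tab.1 can be illustrated with a small, dependency-free sketch. It assumes contiguous folds without shuffling and substitutes a hypothetical mean predictor for the LGBM, ETRM, and MLPNN models; every function name and the toy targets are illustrative and do not reproduce the study's pipeline.

```python
import math


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, non-overlapping folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def cross_validate_mean_model(y, k=5):
    """Score a placeholder model (predict the training mean) on each held-out fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        train_mean = sum(y[i] for i in train_idx) / len(train_idx)
        y_test = [y[i] for i in test_idx]
        scores.append(rmse(y_test, [train_mean] * len(y_test)))
    return scores


# Hypothetical targets (not data from the study).
toy_targets = [float(i) for i in range(10)]
fold_rmse = cross_validate_mean_model(toy_targets, k=5)  # one RMSE per fold
```

Each fold yields one score, which is why a 5-fold comparison reports five values per model. Libraries such as scikit-learn offer the same procedure, including fit and score timings, through `sklearn.model_selection.cross_validate`.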
Fig.4 Analysis of correlation and residuals of the final model predictions on the validation data set. (A) Correlation between real and predicted values; the linear fit is shown in the figure as a red dashed line and R² is annotated in the figure. (B) Residuals (predicted values − real values) plotted against the predicted values for clinical cases; MAE and RMSE are annotated in the figure.
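The quantities annotated in Fig.4 can be computed as in the sketch below. It assumes that R² denotes the coefficient of determination (the caption does not say whether it is instead the squared correlation of the linear fit) and follows the caption's sign convention for residuals; the function names and toy values are hypothetical.

```python
import math


def residuals(y_true, y_pred):
    """Residuals with the caption's sign convention: predicted minus real."""
    return [p - t for t, p in zip(y_true, y_pred)]


def mae(y_true, y_pred):
    """Mean absolute error."""
    res = residuals(y_true, y_pred)
    return sum(abs(r) for r in res) / len(res)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    res = residuals(y_true, y_pred)
    return math.sqrt(sum(r ** 2 for r in res) / len(res))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum(r ** 2 for r in residuals(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical real and predicted values (not data from the study).
real = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
summary = {
    "MAE": mae(real, predicted),
    "RMSE": rmse(real, predicted),
    "R2": r_squared(real, predicted),
}
```

Because RMSE squares each residual before averaging, it penalizes large errors more heavily than MAE, so reporting both gives a fuller picture of the error distribution.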
Fig.5 Recreation of epidemiologic curves for three municipalities of Catalonia: Montcada i Reixac, Vallirana, and Vilanova i La Geltrú. The complete time series of data for the municipalities are represented by a blue line and the predictions by a red line, starting July 2021. Values for MAE for each municipality are annotated in the figure.
Fig.6 Model explainability using the SHAP framework. (Left) Feature importance of each variable introduced into the model is represented by the mean absolute SHAP value of each variable across all local predictions. (Right) Beeswarm plot showing the impact of each variable on the model output across all local predictions; the color gradient represents the value for each feature, red for higher values and blue for lower values.
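The Shapley values underlying the SHAP framework in Fig.6 can be illustrated with an exact, brute-force computation on a toy model. This sketch assumes that "absent" features are replaced by a fixed baseline value, which simplifies the expectation over background data that SHAP uses; `shapley_values`, `toy_model`, and all inputs are hypothetical and unrelated to the study's model.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n_features = len(instance)
    totals = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        current = list(baseline)          # start with every feature "absent"
        previous_output = model(current)
        for j in order:
            current[j] = instance[j]      # switch feature j to its real value
            new_output = model(current)
            totals[j] += new_output - previous_output
            previous_output = new_output
    return [total / len(orderings) for total in totals]


def toy_model(x):
    """Hypothetical model: one additive term plus one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
# The contributions add up to model(instance) - model(baseline).
total_effect = toy_model(instance) - toy_model(baseline)
```

Here the additive feature receives its full effect and the two interacting features split their joint effect equally. Averaging the absolute value of each feature's contribution over many predictions gives the kind of global importance shown in the left panel. Enumerating every ordering scales factorially with the number of features, so it is only practical for toy cases; the SHAP library relies on more efficient algorithms.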
Fig.7 SHAP dependence plot between seasonality (Month) and clinical cases (15d-MA). The SHAP values for the variable “Month” in each prediction are represented by a gradient color, with red indicating higher values and blue indicating lower values according to the 2-week moving average of clinical cases. This illustrates how the impact of this contextual feature varies.
COVID-19: Coronavirus Disease 2019
ETRM: Extra-trees regressor model
INE: National Institute of Statistics (Spain)
LGBM: Light gradient boosting model
LOESS: Locally estimated scatterplot smoothing
MAE: Mean absolute error
MLPNN: Multi-layered perceptron neural network
PCR: Polymerase chain reaction
RMSE: Root mean squared error
SARS-CoV-2: Severe acute respiratory syndrome coronavirus 2
SHAP: Shapley additive explanations
WBE: Wastewater-based epidemiology
WHO: World Health Organization
WWTP: Wastewater treatment plant