Optimal machine learning techniques for meteorological modeling of PM2.5 concentration in five major polluted cities of South-East Asia

Shafi, Sedra; Scafetta, Nicola

doi:10.5194/egusphere-egu25-3240

Air quality across Southeast and Western Pacific Asia is declining at an accelerating pace due to population growth and industrial development. The region’s meteorological factors, including monsoon seasonality, exert a significant influence on air pollution levels, particularly PM2.5 concentrations. In this study, we employ a statistical modeling approach to derive daily PM2.5 levels from meteorological parameters in five major polluted cities: Lahore (Pakistan), Delhi (India), Dhaka (Bangladesh), Hanoi (Vietnam), and Shanghai (China). The meteorological parameters used are wind speed, barometric pressure, temperature, and rainfall, all of which are known to affect air pollution levels; the records cover 2020 to 2022. The statistical modeling was based on a comparative analysis of 35 different machine learning (ML) regression techniques, with the purpose of selecting the algorithms that are most efficient at reconstructing and predicting PM2.5 levels from meteorological variables alone. Specifically, each ML regression model was trained to reconstruct daily PM2.5 levels in 2020–2021, and was then used both to fill in missing daily PM2.5 levels in 2020–2021 and to forecast the whole of 2022 using only the 2022 meteorological records. The results indicated that most of the daily and seasonal variability in PM2.5 levels could be reconstructed from meteorological conditions. However, the performance of the various ML models (as assessed by Root Mean Square Error tests) varied considerably. Among the tested models, the Ensembles Boosted Tree ML method performed best during the training period (the first two years, 2020 and 2021) and was also highly efficient in predicting the third year (2022) using only meteorological data.
Additionally, the Trilayer Neural Network ML method was found to be the most effective at reconstructing the data when trained on all three years, and may therefore be preferred for filling in short periods of missing PM2.5 data. In contrast, our comparative analyses showed that the traditional multi-linear regression models underperformed in both reconstructing and predicting PM2.5 data. This study demonstrates the necessity and usefulness of assessing multiple ML regression methodologies in order to select those that perform best at reconstructing the data of interest (in our case PM2.5 records) from their hypothesized constructors (in our case meteorological parameters). In particular, this study highlights the utility of ML regression techniques for forecasting air quality and reconstructing missing pollution data, which is crucial for policy-making across Southeast and Western Pacific Asia, where only limited pollution monitoring infrastructure is available.
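The evaluation protocol described above (train on the first two years, forecast the third from meteorology alone, rank models by Root Mean Square Error) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic rather than the cities' records, years are simplified to 365 days, and the three candidate models (a mean baseline, a multi-linear regression, and a k-nearest-neighbour regressor) are simple stand-ins, not the 35 techniques compared in the study. The helper names (`synthetic_days`, `fit_linear`, `fit_knn`, `fit_mean`) are hypothetical.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root Mean Square Error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_mean(X, y):
    """Baseline: always predict the training mean."""
    mean = sum(y) / len(y)
    return lambda row: mean

def fit_linear(X, y):
    """Multi-linear regression with intercept, via the normal equations."""
    Xb = [[1.0] + list(row) for row in X]
    k = len(Xb[0])
    XtX = [[sum(r[i] * r[j] for r in Xb) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * t for r, t in zip(Xb, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    return lambda row: beta[0] + sum(b * v for b, v in zip(beta[1:], row))

def fit_knn(X, y, k=10):
    """k-nearest-neighbour regressor on features standardized with training statistics."""
    cols = list(zip(*X))
    mu = [sum(c) / len(c) for c in cols]
    sd = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0 for c, m in zip(cols, mu)]
    scale = lambda row: [(v - m) / s for v, m, s in zip(row, mu, sd)]
    Z = [scale(r) for r in X]
    def predict(row):
        z = scale(row)
        nearest = sorted(range(len(Z)),
                         key=lambda i: sum((a - b) ** 2 for a, b in zip(Z[i], z)))[:k]
        return sum(y[i] for i in nearest) / k
    return predict

def synthetic_days(n_days, seed=0):
    """Synthetic daily weather and PM2.5 (illustrative only, not real data)."""
    rng = random.Random(seed)
    X, y = [], []
    for d in range(n_days):
        season = math.cos(2 * math.pi * d / 365)         # +1 mid-winter, -1 mid-summer
        wind = max(0.2, rng.gauss(3.0, 1.0))             # m/s
        pressure = 1010 + 6 * season + rng.gauss(0, 2)   # hPa
        temp = 22 - 10 * season + rng.gauss(0, 2)        # deg C
        rain = max(0.0, rng.gauss(-2 - 6 * season, 6))   # mm/day, wetter in summer
        pm = 60 + 80 / (1 + wind) - 1.5 * (temp - 22) - 1.2 * rain + rng.gauss(0, 8)
        X.append([wind, pressure, temp, rain])
        y.append(max(pm, 1.0))
    return X, y

# Chronological split: first two (365-day) years for training, third for forecasting.
X, y = synthetic_days(3 * 365)
split = 2 * 365
X_train, y_train, X_test, y_test = X[:split], y[:split], X[split:], y[split:]

candidates = {"mean baseline": fit_mean, "multi-linear": fit_linear, "k-NN": fit_knn}
results = {}
for name, fit in candidates.items():
    model = fit(X_train, y_train)
    results[name] = (rmse(y_train, [model(r) for r in X_train]),
                     rmse(y_test, [model(r) for r in X_test]))

# Rank candidates by RMSE on the held-out third year.
ranking = sorted(results, key=lambda n: results[n][1])
for name in ranking:
    print(f"{name:14s} train RMSE={results[name][0]:.2f}  test RMSE={results[name][1]:.2f}")
```

The split is chronological rather than random so that the held-out year plays the role of a genuine forecast; reporting both training and held-out RMSE mirrors the distinction the abstract draws between reconstruction and prediction performance.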

Optimal machine learning techniques for meteorological modeling of PM2.5 concentration in five major polluted cities of South-East Asia / Shafi, Sedra; Scafetta, Nicola. - EGU25-3240:(2025). (EGU General Assembly 2025 Vienna, Austria 27 Apr – 2 May 2025) [10.5194/egusphere-egu25-3240].