Optimized Machine Learning with TPE for Air Quality Classification and Public Health Risk Estimation

Authors

  • Ayun Hapsari, Faculty of Law, Social and Political Science, Universitas Terbuka, Tangerang Selatan, Indonesia

Keywords:

Air Quality Index, Machine Learning, Tree-structured Parzen Estimator, Hyperparameter Optimization, Urban Pollution Prediction

Abstract

Air pollution in rapidly urbanizing cities such as Delhi poses a critical threat to public health because concentrations of particulate matter and gaseous pollutants persistently exceed safe thresholds. Accurate air quality classification and timely health risk estimation are essential to support early warning systems and guide urban policy interventions. This study develops a multi-class Air Quality Index (AQI) classification framework using Logistic Regression, Random Forest, Decision Tree, Support Vector Classifier (SVC), K-Nearest Neighbors (KNN), and Gradient Boosting, applied to a dataset of daily pollutant concentrations (PM2.5, PM10, NO₂, SO₂, CO, and O₃) and meteorological parameters from Delhi. Data preprocessing included outlier removal, feature scaling, and label encoding of AQI categories, followed by an 80:20 train-test split so that each model was evaluated on held-out data. Model performance was assessed using Accuracy, Precision, Recall, and F1-score. The experimental results show that tree-based models achieved the highest predictive accuracy: Random Forest reached an accuracy of 0.7611 and an F1-score of 0.7522, followed closely by Decision Tree and Gradient Boosting with F1-scores above 0.74. Logistic Regression and SVC maintained moderate yet consistent performance, while KNN was more sensitive to data distribution, achieving an F1-score of 0.605. Confusion matrix analysis revealed that misclassifications were mostly confined to adjacent AQI categories, reflecting the inherent difficulty of distinguishing borderline pollution levels. The novelty of this study lies in integrating multi-class AQI classification with a structured machine learning framework that maps environmental conditions directly to health risk levels. By aligning predictions with WHO and US-EPA thresholds, the framework gives public health authorities actionable information for designing early warning systems and targeted interventions for vulnerable populations. These findings contribute to urban air quality management and provide a scalable foundation for health-oriented environmental decision-making in highly polluted megacities.
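The evaluation described in the abstract (Accuracy, Precision, Recall, F1-score, and the share of misclassifications that fall into a neighbouring AQI category) can be illustrated with a minimal Python sketch. It is not the study's code. The six ordered classes, the toy labels, the macro averaging, and the helper names confusion_matrix, macro_scores, and adjacent_error_share are all assumptions made for illustration; the abstract does not state the averaging scheme or the number of classes.

```python
# Illustrative sketch only: toy labels, not the Delhi dataset.
# Classes are integer-encoded and assumed to be ordered from the
# cleanest (0) to the most polluted (5) AQI category.

N_CLASSES = 6  # assumption: six ordered AQI categories


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def macro_scores(matrix):
    """Accuracy plus macro-averaged precision, recall and F1 (assumed averaging)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = matrix[i][i]
        predicted = sum(matrix[r][i] for r in range(n))  # column sum
        actual = sum(matrix[i])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


def adjacent_error_share(matrix):
    """Fraction of misclassifications that land exactly one category away."""
    errors = 0
    adjacent = 0
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j:
                errors += count
                if abs(i - j) == 1:
                    adjacent += count
    return adjacent / errors if errors else 0.0


# Hypothetical true and predicted labels for ten days.
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 5]
y_pred = [0, 1, 1, 1, 2, 3, 3, 1, 4, 5]

cm = confusion_matrix(y_true, y_pred, N_CLASSES)
scores = macro_scores(cm)
share = adjacent_error_share(cm)
print(scores)
print("adjacent error share:", share)
```

A high adjacent error share indicates that the classifier mostly confuses neighbouring pollution levels rather than distant ones, which is the pattern the abstract reports.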

Published

2025-08-02

How to Cite

[1]
A. Hapsari, “Optimized Machine Learning with TPE for Air Quality Classification and Public Health Risk Estimation”, Journal of Artificial Intelligence and Legal Technology, vol. 1, no. 1, pp. 9–14, Aug. 2025.

Section

Articles