ORCID
- Shang-Ming Zhou: 0000-0002-0719-9353
Abstract
Objective: The rising incidence and mortality of bladder cancer (BC) underscore the importance of identifying associated features. Current reliance on haematuria as the primary indicator of BC is inadequate. While mining electronic health records (EHRs) offers the potential to identify BC-related signals, traditional data-driven methods struggle with high-dimensional datasets. This study aims to uncover novel BC-associated clinical signals by developing the Parsimony-driven cAtegory-balaNced binary Signal extractor for Primary Care EHRs (PanSPICE), tailored to extremely high-dimensional data linked across multiple centres.
Methods: We collected BC cases and control patients (n = 64,884) linked at patient level from Welsh nationwide databases, yielding 48,261 features in primary care settings. The PanSPICE approach begins with information gain to pre-rank features, then applies Retentive Stickiness Binary Particle Swarm Optimisation (RSBPSO) combined with a C5.0 classification tree to overcome computational barriers in feature selection. A two-layer optimisation treated clinical signals in care processes (POC), diagnoses (DIAG), and medications (MED) separately to prevent feature masking. A tailored fitness function was designed for RSBPSO to simultaneously optimise model performance and feature sparsity. Associations of the selected features were interpreted using logistic regression models adjusted for deprivation indices.
Results: PanSPICE identified 38 optimal features (area under the curve (AUC) = 0.81, 95 % CI: 0.80–0.82), including urinary tract infections (OR = 2.19, 95 % CI: 2.05–2.14) and inverse associations with stroke (OR = 0.64, 95 % CI: 0.54–0.74) and dementia (OR = 0.25, 95 % CI: 0.17–0.35). Gender stratification revealed a female-specific association with urine glucose testing (OR = 1.24, 95 % CI: 1.08–1.43). Certain medications, such as trimethoprim, were positively associated with BC, while others, including ramipril and prednisolone, showed protective effects.
Conclusion: PanSPICE enables efficient high-dimensional EHR analysis, revealing under-recognised potential BC risk profiles and protective comorbidities. Gender-specific differences in BC associations highlight the importance of gender-stratified analyses, while the computational advances provide a template for EHR-based clinical discovery. The findings warrant further mechanistic research into neurological protective pathways.
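As a rough illustration of two ideas named in the abstract, the sketch below computes information gain for a binary feature (the pre-ranking step) and a fitness score that rewards both model performance and feature sparsity. It is not the authors' implementation: the function names, the weighted-sum form of the fitness, and the weight `alpha` are all assumptions made for this example, and the RSBPSO search and C5.0 classifier are not reproduced.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a binary (0/1) feature."""
    n = len(labels)
    present = [y for x, y in zip(feature, labels) if x == 1]
    absent = [y for x, y in zip(feature, labels) if x == 0]
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def parsimony_fitness(score, mask, alpha=0.9):
    """Hypothetical fitness: weighted sum of a performance score in [0, 1]
    and the fraction of features left unselected. `alpha` is an assumed
    trade-off weight, not a value taken from the paper."""
    sparsity = 1 - sum(mask) / len(mask)
    return alpha * score + (1 - alpha) * sparsity


# Toy data: the first feature separates cases from controls, the second does not.
labels = [1, 1, 0, 0]
informative = [1, 1, 0, 0]
uninformative = [1, 0, 1, 0]
ranked = sorted(
    {"informative": informative, "uninformative": uninformative}.items(),
    key=lambda item: information_gain(item[1], labels),
    reverse=True,
)
```

In a binary particle swarm search, each particle would carry a 0/1 mask over the pre-ranked features, and a function like `parsimony_fitness` would score the classifier trained on the masked subset, so that among equally accurate subsets the smaller one is preferred.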
DOI Link
Publication Date
2025-11-15
Publication Title
Journal of Biomedical Informatics
Volume
172
ISSN
1532-0464
Acceptance Date
2025-11-14
Deposit Date
2025-11-19
Funding
This project was funded by the Faculty of Health PhD Studentship at the University of Plymouth, UK (Ref.: GD105249-110).
Keywords
Bladder cancer, Electronic health records, Feature selection, Machine Learning, Parsimony, Particle Swarm Optimisation, Primary care, Sex differences in bladder cancer
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Recommended Citation
Wang, X., Preston, A., Aning, J., & Zhou, S. (2025) 'Unveiling novel bladder cancer associations from multicentred primary and secondary care electronic health records by machine learning: a case-control study', Journal of Biomedical Informatics, 172. Available at: 10.1016/j.jbi.2025.104959
