Abstract

This research explores the application of Natural Language Processing (NLP) methods to better understand and improve customer service feedback in the telecom industry. We collect novel real-life datasets of customer reviews and social media comments from the British telecommunications industry over one year. We then apply three main modelling approaches. First, we study topic modelling for short text: we propose a new evaluation metric for the GSDMM model and demonstrate that it helps to select meaningful topics when texts are very brief. Second, we perform an extensive comparative study of word embedding models for text classification: we test nine word embedding and feature engineering methods, including Word2Vec, FastText, BERT, Doc2Vec, and TF-IDF, together with seven classifiers on small, medium, and large datasets. We propose a feature engineering method that uses the first principal component in place of the average of word embeddings, which has been commonplace in practical applications of Word2Vec and FastText. We also measure the energy consumption and training time of each model to assess the trade-off between accuracy and efficiency. Third, we study the same word embedding methods applied to text clustering: we compare Self-Organising Map (SOM), K-means, K-medoids, BIRCH, and Gaussian Mixture models on different embeddings, and study the effect of various feature engineering approaches on clustering results. We also propose new, effective formulas for class-based TF-IDF for cluster representation. Our results show that the newly proposed hyperparameter tuning method for GSDMM achieves better topic coherence for short reviews than other clustering approaches. Moreover, the empirical studies demonstrate the superior performance of our proposed PCA-based feature engineering method for Word2Vec and FastText in both short-text classification and text clustering.
The comparison of word embedding models shows that BERT often gives the highest classification accuracy but at a much higher energy cost, while BIRCH and K-means are robust clustering choices across embedding models. Finally, we present practical guidelines for telecom analysts: choose parsimonious features for faster inference, balance model complexity against energy use, and adopt our evaluation metric when dealing with short feedback from telecom customers. This work therefore contributes both methodological improvements for short-text analytics and actionable insights for industry practitioners.
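
The abstract does not give the formula behind the PCA-based feature engineering method, so the following is only a minimal sketch of one plausible reading: each document is represented by the first principal direction of its word-vector matrix rather than by the mean of its word vectors. The function names, the use of an uncentred SVD, and the sign convention are all assumptions made for illustration and are not taken from the thesis.

```python
import numpy as np


def mean_pooling(word_vectors):
    """Baseline: represent a document by the average of its word vectors."""
    X = np.asarray(word_vectors, dtype=float)
    return X.mean(axis=0)


def first_pc_pooling(word_vectors):
    """Assumed variant: represent a document by the first principal
    direction of its (n_words, dim) word-vector matrix.

    Assumption: the matrix is not centred before the SVD, so a
    one-word document maps to its normalised word vector.
    """
    X = np.asarray(word_vectors, dtype=float)
    if X.ndim != 2 or X.shape[0] == 0:
        raise ValueError("expected a non-empty (n_words, dim) array")
    # Rows of vt are the right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    pc = vt[0]
    # The sign of a singular vector is arbitrary; orient it towards the
    # mean vector so that features are comparable across documents.
    if np.dot(pc, X.mean(axis=0)) < 0:
        pc = -pc
    return pc


# Toy "document" of three 2-dimensional word vectors (made-up values).
toy_doc = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
doc_mean = mean_pooling(toy_doc)
doc_pc = first_pc_pooling(toy_doc)
```

In practice the rows of the input would be Word2Vec or FastText vectors for the tokens of one review, and either pooled vector would then be passed to a classifier or clustering model. Unlike the mean, the vector returned here has unit length, which is one design difference to keep in mind when comparing the two.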

Awarding Institution(s)

University of Plymouth

Supervisor

Craig McNeile, Malgorzata Wojtys

Document Type

Thesis

Publication Date

2026

Embargo Period

2026-04-30

Deposit Date

April 2026

Creative Commons License

Creative Commons Attribution-NonCommercial 4.0 International License
