Evaluating Generative AI for telecom
- Large Language Models (LLMs) have transformed how businesses operate, particularly in the telecom industry. Companies use these models to enhance customer service and streamline processes, offering significant advantages such as help with answering questions.
- This blog post explores the challenges of using LLMs in telecom and discusses how to measure their success within the telecom domain.
Principal AI Technology Leader, Business area cloud and software services

What is Generative AI? What are LLMs?
Generative AI (GenAI) is a technology that can create new content by learning patterns from existing data. It works with different types of data, such as words, pictures, sounds, or numbers in structured data.
Language models (LMs) are revolutionizing how we interact with technology. They are designed to predict the next word in a sentence by analyzing the preceding words, making communication more effective and intuitive. Some models use simple statistics, while others use complex systems called neural networks. When a model has many parameters, on the order of billions or trillions, it is called a Large Language Model (LLM).
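As an illustration of next-word prediction, the minimal sketch below uses the small, publicly available GPT-2 checkpoint via the Hugging Face transformers library. This is only to show the mechanics; real telecom applications would use much larger, domain-adapted models.

```python
# Minimal sketch of next-token prediction with a small causal language model.
# GPT-2 is used purely for illustration; it is not telecom-adapted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "A language model predicts the next"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits           # shape: (1, sequence_length, vocab_size)

next_token_id = int(logits[0, -1].argmax())   # most likely next token
print(tokenizer.decode([next_token_id]))
```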
In the telecom industry, LLMs are used for chatbots, intelligent searches, and creating synthetic data for network simulations. These use cases involve answering users' telecom-related questions, commonly known as the Question-and-Answer (QA) task for the telecom domain.
LLMs trained on domain data are not widely available. To use them for QA applications, one approach is to provide domain-specific information relevant to the question. This technique is known as retrieval augmented generation (RAG). The aim is to provide additional context to the LLM via the prompts by retrieving domain-specific information and augmenting it with the query to generate the response.
Evaluating LLMs for telecom solutions involves two steps: (i) selecting the right model and (ii) assessing its performance for specific tasks (referred to as system performance evaluation). The process might need to be repeated, depending on the complexity of the task and the data availability.
Choosing the right LLM involves considering factors like scalability, stability, and robustness with minimal errors, as well as how well the model can be adapted to the telecom domain. To improve the chosen model, it is necessary to track its performance on the specific tasks for which it has been trained.
It’s also important to evaluate the model systematically to ensure it meets business needs and provides a good return on investment (ROI).
LLMs are primarily trained on data sourced from the internet; because they generate new content, they may not always give reliable answers, especially in specific domains like telecom. They can produce incorrect or inconsistent content, known as hallucinations. Evaluating the quality of LLMs is important for businesses, especially in telecom, where reliability is crucial. Therefore, it is important to train or adapt these models to be telecom-aware.
In this blog, we will explore the challenges in selecting and evaluating LLMs for QA tasks in the telecom domain. We will examine three key challenges pertaining to data, model selection/adaptation, and choice of metrics for evaluating LLMs for QA, with a detailed focus on metrics for the RAG approach.
Our top five takeaways, based on our research, for choosing LLMs for QA tasks in telecom
- Models (LLMs) trained on telecom data can perform better on question-answering tasks than those that are not.
- When a model is trained on a specific domain, for example telecom, we also need a benchmark dataset from that domain to evaluate its performance.
- Model evaluation often involves using other LLMs as judges, referred to as Oracle LLMs. The assumption is that the Oracle LLM is aware of the domain and hence can judge the performance of a trained model. Oracle LLMs, however, are not foolproof.
- Be aware of the cost and compute requirements for the choice of LLMs during training and inference.
- Evaluation of generative AI applications is an evolving field, and metrics must be chosen after careful consideration based on the task at hand.
In the upcoming section, we explore the intricacies of AI systems, focusing on how they handle various types of data, including text, images, audio, and numerical information.
Decoding AI: Tokens, embeddings, and the RAG method
AI systems work with different types of data, like words in a sentence, pixels in a picture, sounds in music, or numbers in structured data. These basic units are often called tokens. These tokens are converted into numerical vectors. The models that produce these vectors are called embedding models, and the vectors are known as embeddings. The embeddings help models identify semantically similar content by measuring the closeness of these vectors.
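As a minimal sketch of this idea, the snippet below embeds a question and two text chunks and compares them with cosine similarity. It assumes the sentence-transformers library and a general-purpose embedding model; as discussed later, a telecom-adapted embedding model would be preferable for telecom text.

```python
# Minimal sketch: convert text into embeddings and measure semantic closeness
# with cosine similarity. The model name is a general-purpose example; a
# telecom-adapted embedding model would normally be used for telecom content.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

query = "When does the Serving RNS send a message to the new RNS?"
chunks = [
    "The Serving RNS sends an IUR-ENHANCED-RELOCATION-REQUEST message to the new RNS.",
    "An IMEI may be traced to find the location of faulty or stolen equipment.",
]

query_vec = embedder.encode(query, convert_to_tensor=True)
chunk_vecs = embedder.encode(chunks, convert_to_tensor=True)

# Higher cosine similarity means the texts are semantically closer.
print(util.cos_sim(query_vec, chunk_vecs))
```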
In the RAG approach for QA tasks, data is stored as vectors using embedding models in a special database called the vector database, as shown in Figure 1. When a user asks a question, it’s converted into an embedding. Similar data chunks are retrieved from the vector database by comparing the similarity of embeddings. The retrieved data chunks are augmented in the prompt along with the user query and sent to the LLM to generate answers. These steps correspond to retrieval, augmentation, and generation processes in the RAG approach.
Figure 1: A basic workflow for retrieval augmented generation (RAG) with telecom-adapted models
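The minimal sketch below walks through the retrieval and augmentation steps of Figure 1, with an in-memory list standing in for the vector database and a general-purpose embedding model used for illustration; the assembled prompt would then be sent to the (ideally telecom-adapted) LLM for generation.

```python
# Conceptual sketch of the retrieval and augmentation steps in Figure 1.
# An in-memory list stands in for the vector database, and a general-purpose
# embedding model is used for illustration.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

chunks = [
    "The Serving RNS sends an IUR-ENHANCED-RELOCATION-REQUEST message to the new RNS.",
    "IMEI tracing can locate faulty or stolen equipment reported via the EIR.",
    "Mobile phones synchronize with the network using the PSS and SSS signals.",
]
chunk_vecs = embedder.encode(chunks, convert_to_tensor=True)  # the "vector database"


def build_rag_prompt(question: str, top_k: int = 2) -> str:
    """Retrieve the most similar chunks and augment them into a prompt."""
    q_vec = embedder.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_vec, chunk_vecs)[0]
    top = scores.argsort(descending=True)[:top_k]
    context = "\n".join(chunks[int(i)] for i in top)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# The assembled prompt would now be sent to the LLM for the generation step.
print(build_rag_prompt("When does the Serving RNS send a message to the new RNS?"))
```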
Challenges in evaluation
One of the main challenges with selecting LLMs is assessing the quality of their outputs. This includes challenges related to the data used, the metrics for evaluation, and the availability of factually correct outputs to compare against. In the following subsections, we elaborate on a few of these challenges.
Data-related challenges
When evaluating machine learning models, it is crucial to prevent data leakage between the training and test datasets. This ensures that the test results are accurate and reliable. However, this becomes more complicated with many LLMs because they are usually trained on data sourced from the internet, making it harder to completely separate the two datasets. Many commercial and open-source LLMs do not document where their training data comes from. While the model weights may be available to everyone, the actual data and processes used to train them often are not shared publicly. This makes it hard to evaluate the models accurately without risking data leakage. Additionally, the outputs from LLMs can sometimes be biased because the training data might have biases.
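As a simple illustration of such a check, the sketch below flags near-duplicate questions shared between a training set and an evaluation set using word n-gram overlap; the threshold and helper names are illustrative assumptions rather than an established procedure.

```python
# Illustrative leakage check: flag evaluation questions that overlap heavily
# with training questions, using word n-gram Jaccard similarity. The 0.8
# threshold is an arbitrary example value.
def ngrams(text: str, n: int = 3) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flag_potential_leakage(train_questions, eval_questions, threshold=0.8):
    flagged = []
    for q_eval in eval_questions:
        for q_train in train_questions:
            if jaccard(ngrams(q_eval), ngrams(q_train)) >= threshold:
                flagged.append((q_eval, q_train))
    return flagged
```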
To test how well LLMs work, we need custom evaluation datasets that match specific domains and tasks. Leaderboards (such as Hugging Face's) typically benchmark LLMs on open-source datasets, and publicly available telecom datasets exist, such as TeleQnA and the GSMA Open-Telco LLM Benchmarks. However, these may not suffice for all business needs, so it is often necessary to create domain- and task-specific datasets. Like any AI dataset, such a dataset should cover a wide range of sources, types, and task variations. When no relevant dataset exists for an application, datasets are created manually. TeleQuAD is one such customized dataset, containing QA pairs curated from 3GPP standards. While this approach is preferred, it is time-consuming and not easily scalable. Advances in LLMs can help create synthetic datasets to augment such efforts: methods such as prompt-based instructions, providing few-shot examples in the prompt, or building a taxonomy-based instruction dataset by sampling from different nodes of the taxonomy all help with synthetic dataset creation, as sketched below.
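One way to bootstrap such a dataset is to prompt an LLM with a few curated seed examples. The sketch below shows a simple few-shot prompt template for generating QA pairs from a document passage; the seed example and the commented-out llm_generate call are illustrative placeholders, not part of any specific toolchain.

```python
# Sketch of a few-shot prompt for synthetic QA generation from a telecom
# passage. The seed example is illustrative, and llm_generate() stands in for
# whichever LLM is used to produce the synthetic pairs.
FEW_SHOT_EXAMPLES = """\
Passage: The Serving RNS sends an IUR-ENHANCED-RELOCATION-REQUEST message to the new RNS.
Question: Which message does the Serving RNS send to the new RNS?
Answer: The IUR-ENHANCED-RELOCATION-REQUEST message.
"""


def build_qa_generation_prompt(passage: str) -> str:
    """Assemble a few-shot prompt that asks the LLM for a new QA pair."""
    return (
        "You generate question-answer pairs from 3GPP standard passages.\n\n"
        f"{FEW_SHOT_EXAMPLES}\n"
        f"Passage: {passage}\n"
        "Question:"
    )


# synthetic_qa = llm_generate(build_qa_generation_prompt(new_passage))
```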
Metric-related challenges
The LLM-based RAG approach has several AI models as building blocks. The metrics for the retrieval components are well established; however, measuring the effectiveness of the generator, and of the RAG system as a whole, is an area of active and evolving research. Metrics for evaluating LLMs involve checking the factual accuracy, relevance, and semantic similarity of generated answers. Models need to be domain-aware, especially in telecom, and should be trained or adapted accordingly. For example, semantic similarity is measured by using embedding models to generate vectors that are compared for closeness; here, telecom (domain) adapted embedding models should be used. Similarly, the current literature uses larger LLMs as expert evaluators, or Oracles, to judge the answers, and these too must be domain-aware. Another challenge is that LLMs often hallucinate and give long, verbose answers, so it becomes important to evaluate them based on how much detail is expected for the task or application.
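To make the Oracle-LLM idea concrete, the sketch below asks a judge model to grade the factual correctness of a generated answer against the ground truth. The judge call is passed in as a parameter because the choice of (ideally telecom-aware) Oracle LLM is left open; the 1 to 5 rubric is an illustrative assumption, not a standard.

```python
# Sketch of Oracle-LLM-based scoring for factual correctness. The oracle_llm
# argument is any callable that sends a prompt to the chosen judge model and
# returns its text reply; the 1-5 rubric is only an example.
JUDGE_TEMPLATE = """\
You are a telecom domain expert. Rate the factual correctness of the
candidate answer against the reference answer on a scale of 1 (wrong)
to 5 (fully correct). Reply with the number only.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Score:"""


def judge_factual_correctness(question, reference, candidate, oracle_llm):
    prompt = JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )
    reply = oracle_llm(prompt)   # judge models can themselves hallucinate,
    return int(reply.strip())    # so treat the score as one signal among many
```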
LLM evaluation criteria
Model selection criteria
Leaderboards such as the Hugging Face leaderboard report the overall macro-level performance of LLMs on a variety of tasks using corresponding standardized benchmark datasets. Developers and engineers might pick an LLM based on these scores. However, factors such as organizational business requirements, risk assessments, and compliance frameworks often guide the overall LLM selection process. Some important criteria for choosing a model include license type, data handling, model size, ease of retraining, and deployment needs such as infrastructure, latency, configurability, reliability, scalability, consistency, and cost.
System (task) evaluation criteria
Typically, LLMs are domain-adapted or fine-tuned to improve performance for the considered tasks. System evaluation criteria focus on how well LLMs respond to user prompts and how stable and reliable their performance is when context changes. The process of selecting and evaluating models can be iterative to improve performance for specific tasks.
RAG evaluation metrics
Assessing LLM-based RAG applications for QA tasks requires a diverse set of metrics, some examples of which are shown in Figure 2. These metrics help us understand how well the system is performing and can be broadly classified into classical (statistical) methods and language model-based methods. For traditional tasks like retrieval or classification, metrics such as accuracy, precision, and recall are common. However, for assessing QA using the RAG system, we might need the following metrics:
- Lexical metrics: These look at individual characters or words using methods like edit distance, BLEU, ROUGE, and WER.
- Embedding-based metrics: These include BERTScore and BARTScore, which measure how similar the sentences are. Other language-based metrics include MENLI and BLEURT.
- Oracle LLM-based metrics: These might use measures such as GPTScore and SelfCheckGPT to evaluate aspects like factual correctness and relevance.
Using multiple metrics can give a better picture, and we recommend a combination of metrics to effectively assess the LLMs for any application/task.
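As an illustration, the minimal sketch below computes a few of these metrics for a single candidate answer. It assumes the sacrebleu, rouge-score, and bert-score Python packages with their default models, which are assumptions for illustration; in practice, a telecom-adapted model could be plugged into the embedding-based scores.

```python
# Sketch: computing a small bundle of lexical and embedding-based metrics for
# one generated answer. Assumes the sacrebleu, rouge-score, and bert-score
# packages; the default (general-purpose) BERTScore model is used here,
# although a domain-adapted model could be substituted.
import sacrebleu
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The Serving RNS sends the IUR-ENHANCED-RELOCATION-REQUEST message to the new RNS."
candidate = "The Serving RNS sends an IUR-ENHANCED-RELOCATION-REQUEST message to RNS-B."

# Lexical metric: sentence-level BLEU (0-100 scale in sacrebleu).
bleu = sacrebleu.sentence_bleu(candidate, [reference]).score

# Lexical metric: ROUGE-L F-measure.
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure

# Embedding-based metric: BERTScore F1.
_, _, f1 = bert_score([candidate], [reference], lang="en")

print(f"BLEU: {bleu:.1f}, ROUGE-L: {rouge_l:.2f}, BERTScore F1: {float(f1[0]):.2f}")
```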
Table 1 below shows how different components of the RAG-based system can be evaluated, along with a view of the corresponding metrics. In technical domains, such as telecom, the answers from an LLM and those from domain-adapted models using RAG can vary, which can significantly impact their associated metrics.
| Evaluation component | Metric type | System/task-based metrics |
|---|---|---|
| Retrieval | | Prec@K, Recall@K, F1-score, NDCG, MRR, … |
| Generation | Lexical-based | Edit distance, n-gram BLEU, ROUGE, METEOR, … |
| Generation | Embedding- and language-based | BERTScore, BARTScore, MoverScore, cosine similarity, MENLI, BLEURT, … |
| Generation | Oracle LLM-based | Faithfulness, answer relevancy, factual correctness, GPTScore, summarization score, SelfCheckGPT, … |
Table 1: Evaluation components and representative metrics for an exemplary QA task using RAG architecture
The example below presents a sample question paired with one relevant and two non-relevant contexts, along with the similarity scores from a general (publicly available) embedding model and a telecom domain-adapted one.
Question: When does the Serving RNS send a message to the new RNS?
Relevant Context: Enhanced SRNS relocation. The successful operation of the Enhanced SRNS relocation procedure is as follows. When the Serving RNS (RNS-A) decides to perform the Enhanced SRNS Relocation procedure, it will send an IUR-ENHANCED- RELOCATION-REQUEST message to the new RNS (RNSB). The IUR-ENHANCED RELOCATION-REQUEST message shall contain the necessary information to set up a CS Radio Access Bearer in RNS-B.
Non-relevant context 1: If the tracing of IMEIs is implemented, then the way the trace facility is used and organized, including restrictions due to national laws and regulations, will be a matter for the PLMN operator. An IMEI may be traced to find out the current IMSI or the location or behavior of faulty or stolen equipment reported via the EIR. This TS describes one method of handling IMEI tracing specifically tracing via the VLR.
Non-relevant context 2: The mobile phones synchronize with the mobile network by listening to the primary and secondary synchronization signals (PSS and SSS). The horizontal separations are the OFDM symbols.
Similarity scores:

| Embedding model | Relevant context | Non-relevant context 1 | Non-relevant context 2 |
|---|---|---|---|
| Telecom domain-adapted | 0.62 | 0.21 | 0.22 |
| General-purpose | 0.72 | 0.57 | 0.60 |
The domain-adapted model scores higher when paired with the relevant context (0.62) and lower with the irrelevant contexts (0.21 and 0.22). The generic model, although it scores higher on the relevant context (0.72), does not distinguish well between relevant (0.72) and non-relevant contexts (0.57 and 0.60). This could lead to irrelevant context being included in the top-K retrieved results, which highlights the importance of using domain-specific models for better accuracy.
The table below shows a few RAG evaluation metrics (such as BLEU, BERTScore, factual correctness, and answer correctness) for a few sample questions. The table lists the questions, correct answers (ground truth), answers from a general-purpose LLM (GPT-4 Turbo) that has not been adapted to the telecom domain, the same general-purpose LLM in a RAG setting, a telecom-adapted LLM (without RAG), and RAG output with telecom-adapted retrievers.

Table 2: Two telecom domain questions and answers using LLMs directly and with the RAG approach, with and without domain adaptation, using Mistral 12B as the Oracle LLM for some metrics.
The responses generated by a general-purpose LLM do not align with ground truth answers. Hence, metric scores reflect lower values. However, when using RAG with models adapted to the telecom domain (when the retriever or generator models are trained with telecom information), the answers generated are closer to the ground truth, and the scores are also higher. These examples show how training models specifically on telecom data can improve the accuracy of answers for telecom-related questions.
Sometimes, lexical and similarity-based metrics, which reflect the closeness of word choices, do not show how well the generated answer matches the ground truth (see the BLEU and BERTScore values in row 1 of Table 2). In other cases, the response matches the ground truth closely and these metric scores are higher (see the BLEU scores in row 2 of Table 2). We notice that metrics like BERTScore might not always reflect the true accuracy of the answer. Oracle-based metrics, which look at factual correctness, offer a clearer picture of how accurate the responses are. Thus, we conclude that no single metric can fully assess the performance of a system, so it is crucial to use various metrics to evaluate different aspects of the responses.
It is clear from the above results that domain-aware models have improved performance. However, it is important to remember that adapting/training/fine-tuning these models requires significant effort and cost. These factors are important to consider when deciding which model and method to use for a particular task.
Conclusions
The first part of this blog series focuses on text-based question-and-answer tasks. We have discussed why it is important to evaluate LLMs well and the difficulties these models present. Measuring the effectiveness of RAG approaches for QA tasks can be challenging due to the various factors we must consider for each application. Also, Oracle LLMs are not perfect judges (they also hallucinate), so it is important to pick metrics carefully. Hence, the original problem of evaluating generator models is not completely solved. In future blog posts, we will explore metrics for other tasks like text-to-SQL and code generation, as well as new generative AI paradigms, including AI agents.
Read more
Read the abstracts and full papers of some of our peer-reviewed research papers on this topic
- Evaluation of RAG metrics for question answering in the telecom domain from the Workshop on Foundation Models in the Wild, ICML, 2024.
- Investigating Distributions of Telecom Adapted Sentence Embeddings for Document Retrieval from the Workshop on Next-Gen Networks through LLMs, Action Models, and Multi-Agent Systems, International Conference on Communications (ICC 2025).
- Using large language models to understand telecom standards, from the IEEE International Conference on Machine Learning for Communication and Networking (ICMLCN), 2024.
- Generative AI in mobile networks: a survey, vol. 79 (1), Annals of Telecommunications, 2024, pp. 15-33.
- An Evaluation Survey of Knowledge-Based Approaches in Telecommunication Applications, ADMP, 2024.
Read the blog post Adopting neural language models for the telecom domain