Specializing Large Language Models for Telecom Networks
Introduction
The advent of transformers [1] made it possible to train models on large datasets efficiently. Before their arrival, deep learning for Natural Language Processing (NLP) relied on Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) networks [2]. Because these models process the words in a sentence sequentially, they were slow to train and often lost context in long texts and paragraphs. Transformers, on the other hand, operate on an entire passage of text at once, so they train more efficiently and preserve context across long texts and paragraphs. Due to the large number of parameters and the volume of data involved in their training, transformer-based models for NLP applications are usually termed Large Language Models (LLMs).
Although LLMs can generate human-like text and perform complex tasks such as text summarization and question answering, they come with limitations. They store some knowledge in their parameters, but the amount stored may not be sufficient for specialized applications such as medicine or engineering. They are also prone to producing false information, a behavior known as hallucination. The instructions given to an LLM can be modified to improve performance and reduce these limitations; this process is called prompt engineering. A pre-trained model can also be trained further to perform better at specific tasks. Due to the number of parameters involved, fine-tuning an LLM requires substantial compute resources and memory. There is also the risk of losing some capabilities of the pre-trained model during fine-tuning, a phenomenon called catastrophic forgetting. Low-Rank Adaptation (LoRA) [3] allows us to freeze all of the model parameters and train a new, much smaller set of parameters instead. This reduces the memory and compute resources required to specialize an LLM for a specific application. LoRA is one of a collection of techniques, called Parameter-Efficient Fine-Tuning (PEFT) [4], that aim to fine-tune LLMs with few trainable parameters.
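As a brief illustration of the idea behind LoRA (following [3]), a frozen pre-trained weight matrix $W_0$ is augmented with a trainable low-rank update, so only the two small factor matrices $A$ and $B$ are trained:

$$ W = W_0 + \Delta W = W_0 + BA, \qquad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}, \quad r \ll \min(d, k). $$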
The growing adoption of artificial intelligence in applications has created a need to compare vectors efficiently. This has led to the development of vector stores: storage systems designed to hold various data types, including vectors. They make it possible to compare a vector against many stored vectors and fetch the top $k$ stored items most similar to the input vector. They support algorithms such as cosine similarity, $k$-nearest neighbors (KNN), and Maximal Marginal Relevance (MMR) [5]. These algorithms are essential in RAG applications because they are used to fetch documents relevant to a query.
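As a minimal sketch of the kind of similarity search a vector store performs (not the implementation used in this work), the snippet below computes cosine similarity between a query vector and a matrix of stored vectors and returns the indices of the top $k$ most similar items:

```python
import numpy as np

def cosine_top_k(query_vec, stored_vecs, k=5):
    """Return the indices of the k stored vectors most similar to the query.

    query_vec has shape (d,); stored_vecs has shape (n, d).
    """
    q = query_vec / np.linalg.norm(query_vec)
    m = stored_vecs / np.linalg.norm(stored_vecs, axis=1, keepdims=True)
    scores = m @ q                         # cosine similarity of each stored vector with the query
    return np.argsort(scores)[::-1][:k]    # indices of the k highest-scoring vectors
```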
This work is an endeavor to specialize a large language model for telecommunication network specifications; however, the approach can be extended to other domains such as law and medicine. In the following sections, we begin with a review of relevant literature in this area, then present our methodology and findings, followed by a discussion of the results.
Background
Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is a system that provides external knowledge as context to an LLM during generation. In general, RAG consists of a retriever and a generator. The retriever obtains information relevant to a query from storage, and that information is passed to the generator as context for producing the appropriate response. This mitigates hallucination because the LLM is provided up-to-date context about what it has to generate. It also reduces the size of the model required for certain tasks in specific domains, because the system does not rely as heavily on the knowledge stored in the LLM's parameters during pre-training.
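The retriever–generator interaction can be summarized with the minimal sketch below, where `retriever` and `llm` are hypothetical placeholders for a concrete vector-store query and LLM call rather than the exact interfaces used in this work:

```python
def answer_with_rag(question, retriever, llm, k=5):
    # Retrieval: fetch the top-k stored chunks most relevant to the question.
    docs = retriever(question, k=k)
    # Ground the generator by placing the retrieved chunks in the prompt.
    context = "\n".join(docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    # Generation: the LLM answers using the supplied context.
    return llm(prompt)
```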
The retriever usually sits on top of a vector store or vector database, which it queries for relevant information. These stores are populated with data and their corresponding vector representations. The process of generating a vector representation of data is called embedding. Embedding generates a fixed-dimensional vector, which may be less accurate for large bodies of text. In such situations, larger texts are broken down into chunks, and each chunk is embedded and stored separately. Chunking also introduces the risk of losing the overall context of the larger text. To mitigate this risk, chunks are often overlapped. Overlapping chunks increases the space and memory required to store texts.
The quality of the documents retrieved is essential to the overall performance of the RAG system. In most cases, similarity algorithms such as cosine similarity, $k$-nearest neighbors ($k$-NN), or BM25 [6] are applied to retrieve documents that are semantically similar to the query. In some applications, relevant documents may not necessarily be semantically similar to the input query; in such cases, algorithms like Maximal Marginal Relevance (MMR) [5] may be used. MMR improves the diversity of the retrieved documents by de-prioritizing documents that are semantically similar to documents already selected. Other techniques, such as context re-ranking, are used to further refine the output of the retriever.
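For reference, MMR selects the next document $D_i$ from the candidate set $R$ by balancing its similarity to the query $Q$ against its similarity to the documents already selected in $S$, as formulated in [5]:

$$ \mathrm{MMR} = \arg\max_{D_i \in R \setminus S} \left[ \lambda \, \mathrm{Sim}_1(D_i, Q) - (1 - \lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \right] $$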
The LLM used as a generator may have a limit on the number of tokens it can process accurately, and this limit should not be exceeded if the model's performance is to be guaranteed. While prompt engineering is generally relied on to elicit the appropriate response, it may be necessary to fine-tune the model to perform certain specific tasks, such as answering multiple choice questions.
Literature Review
Retrieval-Augmented Generation (RAG) was first introduced by Lewis et al. [7]. They note that although models store some information in their parameters, that information can become outdated over time. They propose a RAG model consisting of a retriever and a generator. The retriever is based on Dense Passage Retrieval (DPR) [8], which uses a bi-encoder: one encoder for documents and another for the query. In their implementation, they use a pre-trained BERT [9] encoder for both the query and the documents, and the generator is a BART-large [10] model. They demonstrated that RAG achieves superior performance on question answering and can answer questions correctly even when the retrieved documents do not contain the answer.
Liu et al. [11] demonstrate that although LLMs can take in long contexts, their question-answering performance is affected by the location of the relevant context. They observe that models perform best when the relevant information is at the beginning or the end of the context, even for models designed to handle very long contexts.
Methodology
We approached the problem of getting an LLM to answer multiple choice questions (MCQs) in the telecommunication domain by breaking it into three tasks: data preparation, large language model fine-tuning, and inference. These tasks come together as a RAG system.
Experimental Setup
Our work was done on an HP Z840 server with $64$ GB RAM, $50$ GB swap memory, $2$ processors with $24$ CPU cores each, and $2$ NVIDIA GeForce GTX $1080$ Ti GPUs. We used microsoft/phi-2 [12] as the base model for PEFT fine-tuning. The LoRA rank was $8$ and the model was trained for $6$ epochs at a learning rate of $0.00005$. This rank was chosen because it allowed training to fit into the limited memory of the GPUs. all-MiniLM-L6-v2 was used for embedding chunks and BAAI/bge-reranker-base [13] was used for context ranking; these models were chosen because they have the same maximum context length. The Chroma [14] database was used as the vector store.
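The sketch below shows how the base model could be wrapped for LoRA fine-tuning with the Hugging Face `transformers` and `peft` libraries, using the rank, epoch count, and learning rate reported above; the `lora_alpha`, `lora_dropout`, `target_modules`, and batch size values are illustrative assumptions rather than our exact configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

# LoRA: freeze the base weights and train small low-rank adapter matrices instead.
lora_config = LoraConfig(
    r=8,                                             # LoRA rank used in our experiments
    lora_alpha=16,                                   # assumed scaling factor
    lora_dropout=0.05,                               # assumed dropout
    target_modules=["q_proj", "k_proj", "v_proj"],   # assumed attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # shows that only a small fraction of weights is trainable

training_args = TrainingArguments(
    output_dir="phi2-telecom-lora",
    num_train_epochs=6,
    learning_rate=5e-5,
    per_device_train_batch_size=1,  # assumed, to fit the GTX 1080 Ti memory
)
```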
Data Preparation
The training documents were split into chunks; each chunk was embedded into a vector, and both the vector and the chunk text were stored in a vector store. The chunk size was set to $512$ and the all-MiniLM-L6-v2 sentence transformer [15] was used for embedding; it outputs $384$-dimensional dense vectors. all-MiniLM-L6-v2 has a $512$-token input limit, so we set our chunk size to that value using LangChain's recursive character text splitter [16], even though the character length of a chunk is not equal to its token length. This was safe because tokens are usually chunks of several characters, so with our setup the token count of a chunk was always less than or equal to $512$.
Figure 1: Data preparation process.
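A condensed sketch of this data-preparation pipeline is shown below; the file path, collection name, and chunk overlap are illustrative assumptions, since only the chunk size and models are specified above:

```python
import chromadb
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

doc_text = open("spec_document.txt").read()  # hypothetical path to a training document

# Split the document into character-based chunks of at most 512 characters.
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)  # overlap is assumed
chunks = splitter.split_text(doc_text)

# Embed each chunk into a 384-dimensional dense vector.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(chunks)

# Store both the vectors and the chunk texts in a Chroma collection.
client = chromadb.PersistentClient(path="telecom_store")
collection = client.get_or_create_collection("telecom_specs")
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    embeddings=embeddings.tolist(),
    documents=chunks,
)
```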
Fine-tuning
We used the Low-Rank Adaptation (LoRA) method from Parameter-Efficient Fine-Tuning (PEFT) to train the base model because it reduces the number of parameters to be trained and thus the resources required. We retrieved relevant documents (chunks) from the vector store using only the question of each MCQ to generate prompts. Our prompt required the LLM to select the answer and also explain why it chose that option. The retrieval process is described in Figure 2. Documents whose embeddings are similar to the question's are queried from the vector store using two algorithms: similarity search based on L2 distance and Maximal Marginal Relevance (MMR). The results are merged and duplicates are dropped. The remaining documents are ranked according to their relevance to the question and the top $k$ are selected, where $k$ is set to $2$ during training and $7$ at inference time. The selected documents are reordered so that the most relevant ones appear at the beginning and the end. The final list is joined with a period and a space (". "), and the result becomes the context prompting the LLM to answer the question. The model was trained for $6$ epochs.
Figure 2: Relevant document retrieval process.
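A hedged sketch of the retrieval step in Figure 2, using LangChain's Chroma wrapper for the two searches and a sentence-transformers cross-encoder for relevance ranking; the candidate counts, collection name, and helper names are illustrative rather than the exact ones in our code:

```python
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from sentence_transformers import CrossEncoder

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
store = Chroma(collection_name="telecom_specs", embedding_function=embeddings,
               persist_directory="telecom_store")
reranker = CrossEncoder("BAAI/bge-reranker-base")

def retrieve_context(question, k=2):
    # Query the vector store with two algorithms: L2 similarity and MMR.
    similar = store.similarity_search(question, k=2 * k)              # candidate count assumed
    diverse = store.max_marginal_relevance_search(question, k=2 * k)  # candidate count assumed
    # Merge the two result sets and drop duplicate chunks.
    merged, seen = [], set()
    for doc in similar + diverse:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            merged.append(doc.page_content)
    # Rank the remaining chunks by relevance to the question and keep the top k.
    scores = reranker.predict([(question, text) for text in merged])
    ranked = [text for _, text in sorted(zip(scores, merged), reverse=True)][:k]
    # Reorder so the most relevant chunks sit at the beginning and the end of the context.
    reordered = ranked[::2] + ranked[1::2][::-1]
    return ". ".join(reordered)
```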
Inference
This stage uses a retrieval process similar to the one at the fine-tuning stage; however, the top $7$ documents are selected. To improve the inference time, only $4$ tokens are generated, since we only require the option number. We expect the model to state the correct option number within this number of tokens.
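A short sketch of this generation step, reusing the `model` and `tokenizer` objects from the fine-tuning sketch and assuming `prompt` already contains the retrieved context, the question, and its options:

```python
# Generate only a handful of tokens, enough for the model to state the option number.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=4)
answer = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)
print(answer)  # expected to contain the chosen option number
```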
Results
Our fine-tuned model achieved accuracy scores of $0.77$ and $0.75$ on the private and public tests respectively. Since the model had a maximum context length of $2048$ tokens, it was necessary that only relevant contexts be retrieved. One approach the team explored was to use an LLM capable of handling a larger context length to answer the question and provide an explanation, thereby compressing a lengthy context into a short one. The LLM with the limited maximum context length would then use the resulting context to select the right answer. Although this approach showed potential, it defeated the purpose of using a smaller model, since it required more compute resources to run the larger model and also increased the inference time.
Conclusion
We have shown how we specialized a large language model to answer multiple choice questions about telecommunication networks. We discussed how we retrieved relevant documents from the vector store and also explored how large contexts could be compressed with more powerful LLMs. Because the retrieval of relevant documents is so essential to this endeavor, we recommend that further research investigate techniques such as multi-query retrieval and graph-based representations for storing and retrieving documents.
Source Code
The source code for our work and instructions on using it are available on GitHub.
References
[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2023). Attention Is All You Need.
[2] Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory.
[3] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models.
[4] Xu, L., Xie, H., Qin, S. J., Tao, X., & Wang, F. L. (2023). Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment.
[5] Carbonell, J., & Goldstein, J. (2017). The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. ACM SIGIR Forum, 51(2), 209-210.
[6] Robertson, S., Walker, S., Jones, S., Hancock-Beaulieu, M. M., & Gatford, M. (1995). Okapi at TREC-3. Overview of the Third Text REtrieval Conference (TREC-3).
[7] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-T., Rocktäschel, T., Riedel, S., & Kiela, D. (2021). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
[8] Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W.-T. (2020). Dense Passage Retrieval for Open-Domain Question Answering.
[9] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[10] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2019). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.
[11] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts.
[12] Microsoft Research. (2023). Phi-2: The Surprising Power of Small Language Models. Accessed: 2024-08-09.
[13] Xiao, S., Liu, Z., Zhang, P., & Muennighoff, N. (2023). C-Pack: Packaged Resources To Advance General Chinese Embedding.
[14] Chroma. (2024). Chroma: The AI-native Open-Source Embedding Database. Accessed: 2024-08-09.
[15] Reimers, N., & Gurevych, I. (2020). Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation.
[16] LangChain Contributors. (2024). LangChain: Build Context-Aware Reasoning Applications, Version 0.2.