Wednesday, Jul 17, 2024
Testing Embedding Models for RAG
Retrieval-Augmented Generation (RAG) is a powerful approach that combines traditional information retrieval techniques with advanced generative models to enhance the quality and relevance of generated responses. The primary objective of this project was to evaluate the performance of various embedding models in RAG and to assess their processing speed in executing this task.
The Dataset
Before testing different models, it was imperative to identify a suitable dataset. Specifically, we sought a dataset rich in text chunks that could serve as answers, accompanied by a substantial collection of questions relevant to those text chunks.
For this purpose, we utilized the SQuAD (Stanford Question Answering Dataset), made accessible by P. Rajpurkar et al. This dataset comprises over 100,000 questions, each paired with one of 20,000 text chunks. More specifically, the answer to each question was contained inside exactly one chunk (the golden chunk). The text chunks were generated from a wide array of Wikipedia articles.
For further details about this dataset, please refer to this link.
We loaded the dataset using HuggingFace’s `datasets` library and focused exclusively on the train portion of the dataset, which contained around 19,000 text chunks and around 87,000 questions. This dataset was also subject to a slight reduction during the preparation phase of the project. Further details regarding this reduction can be found in a later section.
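A minimal sketch of this loading step, assuming the `datasets` library; variable names are illustrative:

```python
from datasets import load_dataset

# Load the train split of SQuAD (~87,000 question rows).
squad = load_dataset("squad", split="train")

# Every row repeats its context paragraph, so deduplicate to recover
# the ~19,000 unique text chunks.
unique_chunks = list(dict.fromkeys(row["context"] for row in squad))
questions = [(row["question"], row["context"]) for row in squad]

print(len(unique_chunks), len(questions))
```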
Embedding Models
With the dataset selected, we proceeded to identify embedding models for evaluation. We consulted HuggingFace, specifically the Leaderboard, to compile a list of Sentence Transformers under the Retrieval section.
Due to specific constraints, we filtered the models to meet the following criteria:
- Each model must possess fewer than one billion parameters, as larger models would necessitate excessive memory and computational power, increasing the embedding time drastically.
- Each model’s embedding dimensions must not exceed 2,000; this will be elaborated upon shortly.
- Each model must accommodate at least 512 maximum tokens as input.
- Each model must be monolingual, trained on English data.
Due to the impracticality of testing every model that met these criteria, we selected a single model to serve as a baseline. We chose all-MiniLM-L12-v2 for this purpose, as it demonstrated a balance of strong retrieval performance, minimal memory requirements, and efficient embedding times during some of our previous research. From the previously filtered models, we further refined our selection to include only those that ranked above our chosen baseline.
Additionally, we opted to evaluate one of OpenAI’s models, specifically the text-embedding-3-small model, which operates under a commercial license, unlike the open-source models sourced from HuggingFace. Given the model’s relatively low pricing, we were prepared to consider its adoption should its performance demonstrate a notable improvement over that of the open-source alternatives.
A comprehensive list of the models employed is available in the table located in the Results section.
The Database
To facilitate the retrieval of answers to each question, we established a local PostgreSQL database equipped with the `pgvector` extension. Since `pgvector` supports the storage of vectors up to 2,000 dimensions, we ensured that all selected embedding models adhered to this constraint.
Indexing
To enhance querying speed, we implemented a Hierarchical Navigable Small World (HNSW) index with default parameters (`m=16`, `ef_construction=64`) for each column that represents an embedding vector.
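A minimal sketch of the schema and index setup, assuming `psycopg2` and a single embedding column; the table and column names are illustrative, not the project’s actual schema:

```python
import psycopg2

# Connection parameters are placeholders for a local PostgreSQL instance.
conn = psycopg2.connect("dbname=rag_eval user=postgres")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id         serial PRIMARY KEY,
        content    text,
        emb_minilm vector(384)  -- one such column per model and text form
    );
""")

# HNSW index with the default parameters, using cosine distance.
cur.execute("""
    CREATE INDEX ON chunks USING hnsw (emb_minilm vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
""")
conn.commit()
```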
Text Chunks (“Answers”)
Each text chunk was stored in the database alongside its lemmatized and no-stopwords versions:
- Lemmatized Form: This version of the text has undergone lemmatization, a process that reduces words to their base or dictionary form. For example, the words “running,” “ran,” and “runs” would all be reduced to the lemma “run.” This normalization helps to unify different forms of a word, thereby improving the model’s ability to retrieve relevant information during queries.
- No-Stopwords Form: This version excludes common stop words such as “and,” “the,” “is,” and “in,” which do not carry significant meaning and are often filtered out in natural language processing tasks. By removing these words, the focus shifts to the more informative content of the text, enhancing the embedding process and improving retrieval effectiveness.
Each of these forms was embedded using all of the models we picked earlier, and the embeddings were also stored in the database.
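A minimal sketch of how the two derived forms can be produced, assuming spaCy’s en_core_web_sm model and NLTK’s English stopword list (the exact preprocessing settings are not specified here):

```python
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt") and nltk.download("stopwords")

nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words("english"))

def lemmatize(text: str) -> str:
    # Reduce every token to its dictionary form, e.g. "running" -> "run".
    return " ".join(token.lemma_ for token in nlp(text))

def remove_stopwords(text: str) -> str:
    # Drop common words such as "and", "the", "is", "in".
    return " ".join(w for w in word_tokenize(text) if w.lower() not in stop_words)
```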
Questions
We replicated the aforementioned process for the questions, with one small modification: each question was linked to its corresponding chunk (the golden chunk), with the chunk’s ID serving as a foreign key.
Each question was also converted to its lemmatized and no-stopwords versions, and the process used for embedding and storing the questions was exactly the same as it was for the chunks.
Technical Specifications
For the purposes of this project, we used a PC with the following specifications:
- OS: Windows 10 (Ubuntu 22 via WSL)
- CPU: Intel Core i5-10500 @ 3.10GHz
- RAM: 32GB (DIMM 2400MHz)
Project Workflow
Extracting the Text Chunks
This process utilized HuggingFace’s `datasets` library and was rather simple: we loaded all chunks into memory, removed the duplicates, and saved all remaining chunks to an external file.
Additionally, we tokenized each chunk with each of the embedding models and excluded all chunks containing over 512 tokens when tokenized with any of the models. Note that this step resulted in the removal of only a few dozen entries from the dataset, which was insignificant given the initial count of almost 19,000 chunks.
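A minimal sketch of the length filter, assuming the HuggingFace tokenizers of the candidate models; the model names listed here are only examples:

```python
from transformers import AutoTokenizer

model_names = [
    "sentence-transformers/all-MiniLM-L12-v2",
    "intfloat/e5-base-v2",
    # ... tokenizers of the remaining candidate models
]
tokenizers = [AutoTokenizer.from_pretrained(name) for name in model_names]

def fits_all(chunk: str, limit: int = 512) -> bool:
    # Keep a chunk only if no tokenizer produces more than `limit` tokens.
    return all(len(tok(chunk)["input_ids"]) <= limit for tok in tokenizers)

# `unique_chunks` comes from the dataset-loading sketch above.
filtered_chunks = [c for c in unique_chunks if fits_all(c)]
```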
Extracting the Questions
The question extraction process closely mirrored that of the chunk extraction, with the additional requirement of identifying and recording the golden chunk’s ID for each question in the external file.
Embedding the Chunks
Embedding was conducted using three different libraries: `sentence-transformers`, `transformers`, and `langchain-openai`, all of which are available via `pip`. The embedding process entailed the following steps:
Data and model loading
We loaded our filtered dataset (the one with chunks larger than 512 tokens removed) into memory. After that, we loaded all of the selected embedding models into memory (with the exception of text-embedding-3-small, which requires an external connection).
While this process required a lot of memory, it improved the performance of the embedding process itself.
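A minimal sketch of the loading step, assuming `sentence-transformers` for the open-source models and `langchain-openai` for text-embedding-3-small (an OPENAI_API_KEY environment variable is assumed to be set); only two local models are shown:

```python
from sentence_transformers import SentenceTransformer
from langchain_openai import OpenAIEmbeddings

# Open-source models are held in memory for the whole run.
local_models = {
    "all-MiniLM-L12-v2": SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2"),
    "e5-base-v2": SentenceTransformer("intfloat/e5-base-v2"),
    # ... the remaining HuggingFace models from the Results table
}

# The OpenAI model is called remotely instead of being loaded locally.
openai_model = OpenAIEmbeddings(model="text-embedding-3-small")
```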
Conversion and embedding
Each chunk was converted to its no-stopwords form (using the `nltk` library) and its lemmatized form (using the `spacy` library). Each of these forms was embedded using all models, and all forms, together with their respective embeddings, were then inserted into the database.
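A minimal sketch of this step for a single model, reusing the connection, helper functions, and models from the earlier sketches; the `pgvector` Python package is assumed for passing vectors to PostgreSQL:

```python
from pgvector.psycopg2 import register_vector

register_vector(conn)  # allows numpy arrays to be passed as vector values

for chunk in filtered_chunks:
    forms = {
        "original": chunk,
        "lemmatized": lemmatize(chunk),
        "no_stopwords": remove_stopwords(chunk),
    }
    # In the actual run, every form is embedded with every model; only the
    # original form and one model are shown here.
    emb = local_models["all-MiniLM-L12-v2"].encode(forms["original"])
    cur.execute(
        "INSERT INTO chunks (content, emb_minilm) VALUES (%s, %s)",
        (chunk, emb),
    )
conn.commit()
```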
Attempts at performance improvement
Due to the large amount of textual data, the embedding process was quite slow. In an attempt to speed it up, we tried embedding the chunks in batches rather than individually, but this actually resulted in increased CPU usage and slower processing times (most likely due to the large size of individual chunks), so we reverted to embedding all chunks individually.
Embedding the Questions
The embedding of questions followed a similar methodology as the chunk embedding, with one notable difference: the attempted performance improvement.
Initially, we aimed to replicate the exact process used for the chunks; however, we soon discovered that the estimated time for this approach was going to be even greater than the time required for chunk embedding, most likely due to the large number of questions.
Given that each question has a significantly lower token count compared to the chunks, we hypothesized that the “batch embedding” technique would yield better performance in this context. Fortunately, this assumption proved correct, reducing the total processing time by about 75%.
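A minimal sketch of the batched variant with `sentence-transformers`, reusing the models and questions from the earlier sketches; the batch size is illustrative:

```python
question_texts = [q for q, _ in questions]

# Questions are short, so encoding them in batches pays off here,
# unlike with the much longer chunks.
question_embs = local_models["e5-base-v2"].encode(
    question_texts,
    batch_size=64,
    show_progress_bar=True,
)
```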
Generating the Indexes
To enhance querying speed, we implemented a HNSW index utilizing cosine distance across each vector column. Currently, we have only tested it with default parameters, though we intend to experiment with additional configurations in the future.
Querying
With the database populated with the necessary data, we proceeded to conduct tests. The querying procedure itself was quite simple: we looped through all of the questions and, for each embedding column, retrieved the list of the top 10 chunks most likely to be the golden chunk.
Retrieving the chunks
Each question and each chunk were embedded using different models, and the results of those embeddings were high-dimensional vectors. Once in vector form, we could measure how similar they are to each other using distance functions. In our case, the vectors were compared using the cosine distance function.
For example, let’s say our current question-model pair is represented by the vector x. To retrieve the list of top 10 chunks, we would need to loop through every single chunk in the database and find the 10 vectors with the lowest cosine distance to x. While this would give us the exact solution for each vector, it would simply be too slow. For this reason, we added an HNSW index to each column in our database, as previously mentioned in the Indexing section. This index only gives us an approximate solution, but does so in significantly less time.
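A minimal sketch of the lookup with pgvector’s cosine-distance operator (`<=>`), reusing the cursor and schema from the earlier sketches; the HNSW index on the queried column turns this into a fast, approximate nearest-neighbour search:

```python
def top_10_chunks(question_emb):
    # Order all chunks by cosine distance to the question vector and keep the 10 closest.
    cur.execute(
        """
        SELECT id, content
        FROM chunks
        ORDER BY emb_minilm <=> %s
        LIMIT 10;
        """,
        (question_emb,),
    )
    return cur.fetchall()
```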
Data Analysis
Upon completing the querying process, the final step was to analyze the results. The aim was to examine each embedding model’s performance based on the quality of answers retrieved for each question (the “top 10” list).
To achieve this, we employed the Hit rate at rank r metric, which counts how many questions had their golden chunk located at the r-th position within the top-10 list. Any questions without a retrieved golden chunk were categorized as “misses.”
Using the results of the Hit rate metric, we could also calculate the Top 3, Top 5, Top 10, and Mean reciprocal rank (MRR) metrics.
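A minimal sketch of how these metrics can be computed from the ranked lists of retrieved chunk IDs; function and variable names are illustrative:

```python
def evaluate(retrieved_ids_per_question, golden_ids):
    hits_at_rank = [0] * 10        # hit counts for ranks 1..10
    reciprocal_ranks = []

    for retrieved, golden in zip(retrieved_ids_per_question, golden_ids):
        if golden in retrieved:
            rank = retrieved.index(golden)          # 0-based position
            hits_at_rank[rank] += 1
            reciprocal_ranks.append(1 / (rank + 1))
        else:
            reciprocal_ranks.append(0)              # a "miss"

    n = len(golden_ids)
    top_k = lambda k: sum(hits_at_rank[:k]) / n
    return {
        "rank_1": top_k(1),
        "top_3": top_k(3),
        "top_5": top_k(5),
        "top_10": top_k(10),
        "miss": 1 - top_k(10),
        "mrr": sum(reciprocal_ranks) / n,
    }
```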
Results
The conclusive results of this project are summarized in the table below. Further testing remains necessary; however, our current focus will likely be on models 5, 6, and 7, aiming to balance performance with size, memory usage, embedding dimensions, and max token constraints, as well as their embedding speed.
| No. | Model Name | Model Size (Mil. Params) | Memory Usage (GB, FP32) | Embedding Dimensions | Max Tokens | Rank 1 | Top 3 | Top 5 | Top 10 | Miss | MRR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | stella_en_400M_v5 | 435 | 1.62 | 1024 | 8192 | 65.19% | 80.36% | 84.72% | 88.86% | 11.14% | 0.74 |
| 2 | nomic-embed-text-v1 | 137 | 0.51 | 768 | 8192 | 63.82% | 78.32% | 82.56% | 86.71% | 13.29% | 0.72 |
| 3 | nomic-embed-text-v1.5 | 137 | 0.51 | 768 | 8192 | 59.29% | 74.66% | 79.4% | 84.4% | 15.6% | 0.68 |
| 4 | bge-small-en-v1.5 | 33 | 0.12 | 384 | 512 | 59.27% | 74.57% | 79.26% | 84.19% | 15.81% | 0.68 |
| 5 | e5-large-v2 | 335 | 1.25 | 1024 | 512 | 75.97% | 89.48% | 92.5% | 94.89% | 5.11% | 0.83 |
| 6 | e5-base-v2 | 110 | 0.41 | 768 | 512 | 73.48% | 88.2% | 91.55% | 94.34% | 5.66% | 0.81 |
| 7 | e5-base-4k | 112 | 0.42 | 768 | 4096 | 73.47% | 88.17% | 91.53% | 94.31% | 5.69% | 0.81 |
| 8 | e5-large | 335 | 1.25 | 1024 | 512 | 58.84% | 74.41% | 79.14% | 83.96% | 16.04% | 0.68 |
| 9 | e5-base | 109 | 0.41 | 768 | 512 | 56.75% | 71.91% | 76.82% | 82.19% | 17.81% | 0.65 |
| 10 | jina-embeddings-v2-base-en | 137 | 0.51 | 768 | 8192 | 61.13% | 76.24% | 80.89% | 85.57% | 14.43% | 0.70 |
| 11 | instructor-large | 335 | 1.25 | 768 | 512 | 62.02% | 77.78% | 82.51% | 87.2% | 12.8% | 0.71 |
| 12 | e5-small | 33 | 0.12 | 384 | 512 | 53.47% | 68.61% | 73.53% | 78.98% | 21.02% | 0.62 |
| 13 | instructor-base | 110 | 0.41 | 768 | 512 | 58.15% | 74.03% | 79.08% | 84.35% | 15.65% | 0.67 |
| 14 | gtr-t5-base | 110 | 0.41 | 768 | 512 | 54.65% | 69.85% | 74.92% | 80.51% | 19.49% | 0.63 |
| 15 | all-MiniLM-L12-v2 | 33 | 0.12 | 384 | 512 | 57.32% | 73.74% | 78.97% | 84.56% | 15.43% | 0.67 |
| 16 | text-embedding-3-small | ? | ? | 1536 | 8191 | 61.83% | 77.34% | 82.07% | 86.81% | 13.19% | 0.71 |
Note: The no-stopwords and lemmatized forms were excluded from the final results, as all models demonstrated optimal performance with the original text form.
Speed Testing
Moving forward, our focus will shift to the models e5-large-v2, e5-base-v2, and e5-base-4k. Additionally, we will continue testing our baseline model, all-MiniLM-L12-v2, as it is the model we have already run several tests on during our previous research. With the selection of models narrowed down, we will evaluate their embedding speeds relative to different chunk sizes, where “size” refers to the number of tokens contained within each chunk.
Dataset
For the purposes of these tests, we created a dummy dataset consisting of 500 text chunks, each comprising `n` repetitions of the word “hello,” where `n` ranges from 1 to 500. Thus, the first chunk consists of the word “hello,” the second chunk contains “hello hello,” the third chunk has “hello hello hello,” and so on. The specific embedded forms of these chunks are not critical; our primary concern is the time required for embedding.
Timing the Models
We iterated through the dataset, measuring the embedding time for each model-chunk pair. Due to significant variations in embedding times, we repeated this process 15 times and calculated the average time for each model-chunk combination. The results are illustrated in the graph below.
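A minimal sketch of this timing loop, reusing the locally loaded models from the earlier sketches; timings will of course vary with hardware:

```python
import time

dummy_chunks = ["hello " * n for n in range(1, 501)]
timings = {}

for name, model in local_models.items():
    per_chunk_times = []
    for chunk in dummy_chunks:
        runs = []
        for _ in range(15):                     # repeat to smooth out variance
            start = time.perf_counter()
            model.encode(chunk)
            runs.append(time.perf_counter() - start)
        per_chunk_times.append(sum(runs) / len(runs))
    timings[name] = per_chunk_times
```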
As observed, the embedding time for each model is proportional to its size, with smaller models, such as all-MiniLM-L12-v2, demonstrating the fastest performance, requiring less than 0.05 seconds even for chunks containing 500 tokens. Conversely, larger models, like e5-large-v2, exhibited markedly slower times, often exceeding one second once chunk sizes reached 400 tokens. This disparity grows with chunk size; however, the larger chunks (those serving as answers) need to be embedded only once, since their embeddings are stored in the database, whereas questions, which are typically much shorter than their respective answers, must be embedded rapidly at query time. Thus, while low embedding times are generally important, they are particularly critical for shorter chunks.
Embedding Time Percentage
Although all three models identified as “optimal” exhibit slower performance compared to our baseline model, it is essential to assess the impact of this difference on answer generation. To facilitate this assessment, we measured their embedding time percentage and relative answering speed.
During our previous research, we assessed that our baseline model takes approximately three seconds to generate an answer, encompassing the question embedding process and subsequent operations. By combining this three-second baseline with the difference between a new model’s embedding time and our baseline model’s embedding time, we can estimate the total time required to generate an answer using the new model. Our focus is on the ratio of a model’s embedding time to the overall answer generation time. The graph below illustrates this ratio.
For example, if we consider a question containing 100 tokens and employ the e5-large-v2 model, approximately 9% of the time will be spent embedding the question. Conversely, using either of the two e5-base models results in an approximate 2% embedding time. When dealing with a 500-token question, the e5-large-v2 model increases the embedding time percentage to about 27.5%, while the two e5-base models maintain around 8-9%.
In comparison, our baseline model allocates merely 1% of the time for question embedding.
Since dedicating up to 9% of the total time to question embedding is generally acceptable, we should consider the adoption of either the e5-base-v2 or e5-base-4k models.
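A minimal sketch of this estimate, assuming the roughly three-second baseline answer time measured with all-MiniLM-L12-v2:

```python
BASELINE_ANSWER_TIME = 3.0  # seconds, question embedding included

def embedding_time_percentage(model_embed_time, baseline_embed_time):
    # Total answer time for the new model: the 3-second baseline plus the
    # extra time its question embedding needs over the baseline model's.
    total = BASELINE_ANSWER_TIME + (model_embed_time - baseline_embed_time)
    return model_embed_time / total * 100
```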
Relative Answering Speed
As previously mentioned, the average time required to generate an answer using our baseline model is approximately three seconds. By calculating the expected time needed for a new model to generate an answer and normalizing it against the three seconds, we can determine the percentage increase in answer generation time. The graph below presents these findings.
For instance, with a 100-token chunk, employing the e5-large-v2 model will result in a roughly 9% increase in answer generation time, while utilizing either of the two e5-base models will yield an increase of approximately 2%. For a 500-token chunk, the e5-large-v2 model’s usage will lead to an approximate 38% increase in answer generation time, while the two e5-base models will introduce around a 10-11% increase.
The 10-11% increase is deemed acceptable, warranting consideration of either the e5-base-v2 or e5-base-4k models.
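A minimal sketch of this second estimate, built on the same assumptions as the previous one:

```python
def answer_time_increase(model_embed_time, baseline_embed_time):
    # Expected answer time with the new model, normalised against the
    # 3-second baseline and expressed as a percentage increase.
    total = BASELINE_ANSWER_TIME + (model_embed_time - baseline_embed_time)
    return (total / BASELINE_ANSWER_TIME - 1) * 100
```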
Final Verdict
As we narrow our selection to two models, both exhibiting comparable performance and embedding speeds, the question arises: which model should we choose? The primary distinction between the two lies in their max_tokens parameter. The e5-base-4k model accommodates 4,096 tokens as input, while the e5-base-v2 model supports a maximum of only 512 tokens. More detailed information about these models is available here. In summary, the 4k version serves as an upgraded iteration of the v2 version, functioning identically for inputs of 512 tokens or fewer, while being fine-tuned for inputs with token counts ranging from 512 to 4,096.
Given that the 4k version represents an enhancement of e5-base-v2, we have opted to proceed with the e5-base-4k model.