Benchmarks: 2.5x better than conventional RAG
Seqtra Benchmarks

David Lee



Mean Average Precision (mAP)
mAP is the mean of the average precision of each query in the test set. The average precision of a query denotes how well the relevant “documents” are ranked within the top-k retrievals of the system. In our case, a document is a chunk; “document” is the standard term used in information retrieval for each item retrieved by the system.
Average Precision is calculated as follows:
First, consider a query that retrieves a set of items.
Compute the precision for each position where a relevant item is retrieved in the ranked results list.
Average these precision values, but only for the positions where relevant items were retrieved.
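As an illustration, here is a minimal Python sketch of this calculation, assuming binary relevance labels. The function names are ours, and the normalisation follows the description above (averaging over the relevant positions actually retrieved); note that the standard IR definition of AP normalises by the total number of relevant documents instead.

```python
from typing import List


def average_precision(ranked_ids: List[str], relevant_ids: set, k: int) -> float:
    """Average precision of one query over its top-k retrieved chunks.

    Precision is computed at every rank where a relevant chunk appears,
    and those values are averaged. The standard IR definition would
    normalise by the total number of relevant chunks (capped at k)
    rather than only those retrieved.
    """
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists: List[List[str]],
                           relevant_sets: List[set],
                           k: int) -> float:
    """mAP@k: the mean of the per-query average precisions."""
    scores = [average_precision(ranked, rel, k)
              for ranked, rel in zip(ranked_lists, relevant_sets)]
    return sum(scores) / len(scores) if scores else 0.0
```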
2WikiMultihopQA
Dataset link: https://github.com/Alab-NII/2wikimultihop
2WikiMultihopQA is a QA dataset in which each question is asked over a collection of Wikipedia articles, and answering it correctly requires retrieving paragraphs from multiple such articles.
So, each request in our analysis consists of the following fields:
Question/query: For example, “Who is the mother of the director of film Polish-Russian War (Film)?”
Context: the titled paragraphs from which a subset must be selected as the retrieved documents. For example,
[
['Xawery Żuławski', ['Xawery Żuławski (born 22 December 1971 in Warsaw) is a Polish film director.',
'In 1995 he graduated National Film School in Łódź.',
'He is the son of actress Małgorzata Braunek and director Andrzej Żuławski.',
…]],
['Polish-Russian War (film)',
['Polish-Russian War', '(Wojna polsko-ruska) is a 2009 Polish film directed by Xawery Żuławski based on the novel Polish-Russian War under the white-red flag by Dorota Masłowska.']],
…,
['Snow White and the Seven Dwarfs (1955 film)',
['Snow White and the Seven Dwarfs( USA:" Snow White") is a 1955 German film, directed by Erich Kobler, based on the story of Schneewittchen by the Brothers Grimm.']],
]
Ground Truth: ['Polish-Russian War (film)', 'Xawery Żuławski']
As you can see from point 3 (the ground truth), the relevant paragraphs are only a subset of the context listed in point 2.
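To make the evaluation setup concrete, the sketch below shows how relevance labels can be derived from such an example: each titled paragraph in the context becomes a candidate chunk, and a chunk counts as relevant if its title appears in the ground truth. The dictionary layout here is only an illustration of the fields shown above, not the dataset's actual on-disk format.

```python
# One example, in the shape shown above: a question, a context of
# (title, sentences) pairs, and the ground-truth titles.
example = {
    "question": "Who is the mother of the director of film Polish-Russian War (Film)?",
    "context": [
        ["Xawery Żuławski",
         ["Xawery Żuławski (born 22 December 1971 in Warsaw) is a Polish film director.",
          "He is the son of actress Małgorzata Braunek and director Andrzej Żuławski."]],
        ["Polish-Russian War (film)",
         ["Polish-Russian War (Wojna polsko-ruska) is a 2009 Polish film directed by Xawery Żuławski."]],
        ["Snow White and the Seven Dwarfs (1955 film)",
         ["Snow White and the Seven Dwarfs is a 1955 German film directed by Erich Kobler."]],
    ],
    "ground_truth_titles": ["Polish-Russian War (film)", "Xawery Żuławski"],
}

# Each titled paragraph becomes one candidate chunk ("document" in IR terms).
chunks = {title: " ".join(sentences) for title, sentences in example["context"]}

# A chunk is relevant iff its title appears in the ground truth.
relevant = set(example["ground_truth_titles"])
labels = {title: (title in relevant) for title in chunks}
print(labels)  # only 2 of the 3 candidate titles are marked relevant
```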
Another column that requires further explanation is the “type” column. There are four question types in the dataset, which are explained below:
(The following descriptions are taken from Section 2.2 of the dataset paper.)
Comparison question is a type of question that compares two or more entities from the same group in some aspects of the entity. For instance, a comparison question compares two or more people with the date of birth or date of death (e.g., Who was born first, Albert Einstein or Abraham Lincoln?).
Inference question is created from the two triples (e, r1, e1) and (e1, r2, e2) in the Knowledge Base (KB). They utilized the logical rule to acquire the new triple (e, r, e2), where r is the inference relation obtained from the two relations r1 and r2. A question–answer pair is created by using the new triple (e, r, e2), its question is created from (e, r) and its answer is e2. For instance, using two triples (Abraham Lincoln, mother, Nancy Hanks Lincoln) and (Nancy Hanks Lincoln, father, James Hanks), we obtain a new triple (Abraham Lincoln, maternal grandfather, James Hanks). A question is: Who is the maternal grandfather of Abraham Lincoln? An answer is James Hanks.
Compositional question is created from the two triples (e, r1, e1) and (e1, r2, e2) in the KB. Compared with inference question, the difference is that no inference relation r exists from the two relations r1 and r2. For instance, there are two triples (La La Land, distributor, Summit Entertainment) and (Summit Entertainment, founded by, Bernd Eichinger). There is no inference relation r from the two relations distributor and founded-by. In this case, a question is created from the entity e and the two relations r1 and r2: Who is the founder of the company that distributed La La Land film? An answer is the entity e2 of the second triple: Bernd Eichinger.
Bridge-comparison question is a type of question that combines the bridge question with the comparison question. It requires both finding the bridge entities and doing comparisons to obtain the answer. For instance, instead of directly comparing two films, we compare the information of the directors of the two films, e.g., Which movie has the director born first, La La Land or Tenet? To answer this type of question, the model needs to find the bridge entity that connects the two paragraphs, one about the film and one about the director, to get the date of birth information. Then, it makes a comparison to obtain the final answer.
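To illustrate how a compositional question chains two triples through a bridge entity, here is a tiny sketch; the question template is ours and far cruder than the dataset's actual annotation process.

```python
# Two KB triples sharing the bridge entity "Summit Entertainment".
triple_1 = ("La La Land", "distributor", "Summit Entertainment")
triple_2 = ("Summit Entertainment", "founded by", "Bernd Eichinger")


def compose(first, second):
    """Chain (e, r1, e1) and (e1, r2, e2) into a two-hop question/answer pair."""
    e, r1, e1 = first
    bridge, r2, e2 = second
    assert e1 == bridge, "the second triple must start at the bridge entity"
    question = f"Who is related by '{r2}' to the '{r1}' of {e}?"
    return question, e2  # the answer is the tail entity of the second triple


print(compose(triple_1, triple_2))
# ("Who is related by 'founded by' to the 'distributor' of La La Land?", 'Bernd Eichinger')
```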
Methods
Since we want to show that the connections established in the graph cause more relevant items to be fetched in the top-k retrieval, we compare the following methods:
Chunking by title: Each document (in IR terms) is a paragraph in the context, i.e., the content under each title. During retrieval, k such paragraphs are fetched as independent paragraphs, without considering any links. The retriever is vector-based RAG.
Chunking by Seqtra: Each document is again a paragraph, but for each of the top-k retrievals the system also fetches the paragraphs connected to it through shared topics. The retrieval system is Seqtra’s late chunker (a minimal sketch of both setups follows).
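The contrast between the two setups can be sketched as follows. The ranking function and the topic-link map are placeholders standing in for the actual vector index and Seqtra's graph, not its real API.

```python
from typing import Callable, Dict, List, Sequence


def retrieve_by_title(query: str,
                      rank_chunks: Callable[[str], List[str]],
                      k: int) -> List[str]:
    """Baseline: fetch the top-k paragraphs independently from a
    vector index; no links between paragraphs are considered."""
    return rank_chunks(query)[:k]


def retrieve_with_links(query: str,
                        rank_chunks: Callable[[str], List[str]],
                        links: Dict[str, Sequence[str]],
                        k: int) -> List[str]:
    """Graph-assisted retrieval: fetch k seed paragraphs, then pull in
    the paragraphs connected to each seed through shared topics."""
    seeds = rank_chunks(query)[:k]
    results = list(seeds)
    for seed in seeds:
        for neighbour in links.get(seed, []):
            if neighbour not in results:
                results.append(neighbour)
    return results
```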
Metrics
Hypothesis: Retrieving chunks through their linkages performs better than retrieving them as if they had no connections.

Chunking by title: A single chunk is a paragraph belonging to the section defined by its section header, retrieved using vector-based RAG.
Chunking by Seqtra: A single chunk is a chunk produced and retrieved by Seqtra.
mAP@k: Mean average precision over the top-k retrievals.
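Putting the pieces together, mAP@k for each method can be computed with a loop like the one below, reusing the `average_precision` helper sketched earlier; the retrieval functions and data layout are the same illustrative placeholders as above.

```python
def map_at_k(queries, retrieve, relevant_sets, k):
    """mAP@k over a query set for one retrieval method.

    `retrieve(query, k)` should return a ranked list of chunk titles
    (e.g. one of the functions above wrapped with its index and link map),
    and `relevant_sets[i]` holds the ground-truth titles for query i.
    """
    scores = [average_precision(retrieve(q, k), rel, k)
              for q, rel in zip(queries, relevant_sets)]
    return sum(scores) / len(scores)


# Hypothetical usage, comparing both methods at several cut-offs:
# for k in (1, 3, 5, 10):
#     print(k,
#           map_at_k(questions, title_retriever, relevant_titles, k),
#           map_at_k(questions, seqtra_retriever, relevant_titles, k))
```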
Here, we observe that Seqtra outperforms conventional vector-based retrieval. We also see that the jump between successive top-k scores is smaller for Seqtra: it exceeds the top-10 performance of vector-based RAG while retrieving fewer seed nodes. The cost of graph traversal is thus offset by needing fewer seed node retrievals.