Using Code

The following description also applies to the fields you see in the UI.
In the config/request.yaml file:

  1. Change the value of api_token to the Seqtra API token provided to you.

  2. Change the value of project_name to create a separate file collection. Do this so that your own files are not mixed with the existing test files, or to segregate the knowledge base by data or application domain. It also avoids unintentionally using up the available page limit.

  3. Replace the folder path in the file_dir key with the path to the folder containing the PDF files to upload. Note that there is currently a limit of 100 pages in total, across one or more PDFs. If you are using the UI, there is a file upload field instead.

  4. Replace the query in the query key with your own query.

  5. You can skip steps 6 and 7 if you set chunk_only to true.

  6. Set llm to your preferred choice, either "claude" or "openai".

  7. Add your LLM API key to llm_key.

  8. Run "python client.py". For the UI, the command is given in the previous section.

  9. The results are saved as JSON in the results folder. When using the UI, you need to select the checkbox to save the results.
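
The dependency between steps 5 to 7 can be checked before running the client. The snippet below is only a sketch: check_request_config is a hypothetical helper, not part of the Seqtra client. It works on a plain dictionary holding the same keys as request.yaml, and it assumes that a missing chunk_only is treated as false.

```python
ALLOWED_LLMS = {"claude", "openai"}


def check_request_config(cfg: dict) -> list[str]:
    """Return a list of problems found in a request config (empty if none)."""
    problems = []

    # Steps 1-4: these keys are always needed.
    for key in ("api_token", "project_name", "file_dir", "query"):
        if not cfg.get(key):
            problems.append(f"{key} is missing or empty")

    # Steps 5-7: llm and llm_key only matter when chunk_only is false.
    # Assumption: a missing chunk_only is treated as false here.
    if not cfg.get("chunk_only", False):
        if cfg.get("llm") not in ALLOWED_LLMS:
            problems.append('llm must be "claude" or "openai"')
        if not cfg.get("llm_key"):
            problems.append("llm_key is missing or empty")

    return problems


# Placeholder values for illustration only.
example = {
    "api_token": "YOUR_SEQTRA_API_TOKEN",
    "project_name": "my_project",
    "file_dir": "./my_pdfs",
    "query": "What is the notice period?",
    "chunk_only": True,
}

print(check_request_config(example))  # prints []
```

With chunk_only set to true, the llm and llm_key entries are not checked, which mirrors step 5.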


The remaining parameters in the request.yaml file are described below:

  1. chunk_only: Setting this to false provides the answer to the query, generated by the LLM, along with the retrieved chunks. Setting it to true provides only the relevant chunks.

  2. strategy: We currently provide four strategies to chunk and retrieve relevant context for the given query:

     a) seed_only: This is similar to conventional vector-based retrieval: it retrieves only chunks that are relevant to the given query and that are, by definition, independent of each other. These are labeled as "chunk" during retrieval, but within the database their actual categories are the text-related class labels of DocLayNet, including "Text", "List-item" and so on.

     b) seed_extended: In addition to a), it also retrieves further context, i.e. the other paragraphs and list items of the document section in which the given seed chunk is located.

     c) graph: Along with the chunks from "seed_only", it also retrieves chunks that are related to the seed chunk, providing additional context. These relationships are established during the ingestion phase, either through conceptual linkages or through hyperlink linkages internal to the document (for example, text pointing to another paragraph or section within the document).

     d) graph_extended: This combines the "graph" strategy with the "seed_extended" strategy: it retrieves the sibling texts of the seed chunk along with the graph linkages.

    You may explore the strategies and adopt the one best suited to your use case and the nature of your documents. For example, the "graph" strategy might suffice for paragraph-heavy documents, while legal documents with list-heavy clauses might require the "graph_extended" strategy. Strategies c) and d) are our main offerings; a) and b) are provided as additional options, which you may also find in other services.

  3. num_seed_nodes: This is equivalent to the top-k parameter in RAG. It is named this way in our service because of the graph linkages and traversal involved in chunking and retrieval. You may tune it for your use case.
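
How the four strategies relate to each other can be summarized in a small lookup table. This is an illustrative sketch only: the table, the feature names graph_links and section_context, and the summarize helper are hypothetical and not part of the service, and the check that num_seed_nodes is at least 1 is an assumption.

```python
# What each strategy adds on top of the seed chunks, per the descriptions above.
STRATEGIES = {
    "seed_only": {"graph_links": False, "section_context": False},
    "seed_extended": {"graph_links": False, "section_context": True},
    "graph": {"graph_links": True, "section_context": False},
    "graph_extended": {"graph_links": True, "section_context": True},
}


def summarize(strategy: str, num_seed_nodes: int) -> str:
    """Describe in words what a strategy/num_seed_nodes pair asks for."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Assumption: at least one seed chunk must be requested.
    if num_seed_nodes < 1:
        raise ValueError("num_seed_nodes must be at least 1")

    features = STRATEGIES[strategy]
    parts = [f"{num_seed_nodes} seed chunk(s)"]
    if features["section_context"]:
        parts.append("sibling text from the same section")
    if features["graph_links"]:
        parts.append("chunks linked in the graph")
    return " + ".join(parts)


print(summarize("graph_extended", 3))
# prints: 3 seed chunk(s) + sibling text from the same section + chunks linked in the graph
```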

How to Interpret the output JSON

(These are visualized if you use the UI)
Keys:

  1. "chunks": A JSON object of ("chunk_i", "chunk_id") key-value pairs, where i runs from 1 to n (the number of chunks retrieved). "chunk_id" is the node id in the graph database.

  2. "answer": The answer to the given query, based on the retrieved chunks. It is an empty string if "chunk_only" is set to true.

  3. "graph": A JSON object with "nodes" and "edges" keys. Each is a list of JSON objects: every object in "nodes" represents a graph node, and every object in "edges" represents a graph edge. This graph represents the relationships among the chunks in the "chunks" key. If, for example, num_seed_nodes is set to 1 and you have used one of the graph strategies, one of the chunks is the seed node, and the additional nodes are retrieved because of their links to the seed node, as extracted during the ingestion stage. The "nodes" data also contains the PDF name, page number and bounding box needed to locate the exact section of the chunk in the PDF. The bounding box is in the format (left, top, width, height).

You may further rerank and filter the retrieved chunks if it fits your use case.
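
As a rough sketch of how the output could be post-processed, the snippet below joins each entry in "chunks" to its node and converts the (left, top, width, height) bounding box into corner coordinates. Only "chunks", "answer", "graph", "nodes" and "edges" are documented above; the per-node keys "id", "pdf_name", "page" and "bbox", the edge keys, and the sample values are all assumptions made for illustration, so adjust them to the actual JSON you receive.

```python
def bbox_to_corners(bbox):
    """Convert (left, top, width, height) to (left, top, right, bottom)."""
    left, top, width, height = bbox
    return (left, top, left + width, top + height)


def locate_chunks(result: dict) -> list[dict]:
    """Pair each retrieved chunk with its location in the source PDF."""
    # Hypothetical node keys: "id", "pdf_name", "page", "bbox".
    nodes_by_id = {node["id"]: node for node in result["graph"]["nodes"]}
    located = []
    for label, chunk_id in result["chunks"].items():
        node = nodes_by_id.get(chunk_id)
        if node is None:
            continue  # no location data for this chunk
        located.append({
            "label": label,
            "pdf_name": node["pdf_name"],
            "page": node["page"],
            "corners": bbox_to_corners(node["bbox"]),
        })
    return located


# Made-up sample in the documented top-level shape, for illustration only.
sample = {
    "chunks": {"chunk_1": "n1", "chunk_2": "n2"},
    "answer": "",
    "graph": {
        "nodes": [
            {"id": "n1", "pdf_name": "contract.pdf", "page": 4, "bbox": [10, 20, 100, 50]},
            {"id": "n2", "pdf_name": "contract.pdf", "page": 5, "bbox": [15, 40, 200, 30]},
        ],
        "edges": [{"source": "n1", "target": "n2"}],
    },
}

for item in locate_chunks(sample):
    print(item["label"], item["pdf_name"], item["page"], item["corners"])
```

The same list could then be filtered or reranked, for example by keeping only chunks from a given page range.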

Delete project

The UI has a delete project button; to delete a project through code, use the following:

from omegaconf import OmegaConf

from src.seqtra_client import SeqtraClient

# Load the same request file used for uploads and queries.
req_cfg = OmegaConf.load("./config/request.yaml")
# Remove the project named in project_name.
SeqtraClient.remove(url=req_cfg.url, project_name=req_cfg.project_name, api_token=req_cfg.api_token)