Emerging Architecture for Generative AI on Textual Data

NandaKishore Joshi
8 min read · Aug 21, 2023

This article explains the basics of using an LLM on custom documents, with code. It covers creating, storing, and retrieving vector embeddings from documents to use as custom context for LLMs.

Applications of Generative AI are at the forefront following the LLM boom. LLMs (Large Language Models) are used on structured and unstructured data to generate sensible, smart answers to user questions. What makes them so effective compared with traditional chatbots or NLP models is that they can answer user queries based on custom context provided dynamically by the user. This capability has resulted in applications such as question answering, summarization, and text generation.

LLMs by themselves can answer almost any generic question because they are trained on huge amounts of data. But for a particular use case, it is important to get context-based answers. Today's LLMs can understand long input sequences (roughly 32k to 100k tokens) passed by the user and answer within that context. Passing the right context along with the user prompt therefore becomes very important to get answers without the model hallucinating.

This article describes a general architecture for storing textual data from large documents and extracting the right context, based on the user prompt, to pass to an LLM.

Fig 1 : Architecture for Generative AI

The image above shows the general Gen AI architecture for unstructured data. Each component, along with its importance, is explained below.

Documents :

Documents are the files containing text data. They can be emails, txt files, PDFs, or Word documents. There may be a single document or many, ranging in size from one to thousands of pages. Collecting the relevant documents from different sources is important for passing the right context to the LLM.

Text Splitter :

Documents can be of any size, from one page to thousands. LLMs restrict the length of the input sequence, so an entire document cannot be used as context. Hence, documents should be split into chunks of text.

Text splitters from LlamaIndex or LangChain can be used to split text from documents. The user can specify the chunk length along with the overlap length, where overlap is the amount of text shared between two consecutive chunks.
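As a minimal sketch (using LangChain's CharacterTextSplitter with an illustrative chunk size; document_text is a hypothetical variable holding the raw text of a document), splitting might look like this:

from langchain.text_splitter import CharacterTextSplitter

# split the raw document text into ~500-character chunks with 50 characters of overlap
splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(document_text)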

When the document has hundreds of pages, it is a challenge to find the right chunk to pass to the LLM as context for a given user input. This can be addressed to some extent by extracting metadata from the chunks and using it in the indexes. LlamaIndex has a feature to extract metadata for each chunk; for more information, refer to the link below.

Extracting Metadata for Better Document Indexing and Understanding — LlamaIndex 🦙 0.7.22 (gpt-index.readthedocs.io)

Prompts :

Prompts are one of the main components of working with LLMs. Prompts can be of two types.

First, prompts that take the user question and pass it to the LLM with little or no change. For example:

prompt="You area a sales agent responsible to answer the user queries.
Please provide appropriate answer to the user query enclosed between
triple single quotes in a formal and official language
'''query'''
"

In the above prompt, the LLM is given only a very basic instruction about its identity, and the user query is passed through almost unchanged. These types of prompts are used to answer generic queries when custom context is not required.

Second, prompting can be used to make the LLM perform a series of actions to answer the user query. For example, suppose we have a database of Indian companies and the LLM needs to answer questions about this data.

user_question="Which are the top 5 richest companies in India?"

The solution can be broken into multiple steps.

Step 1 : Find the most appropriate tables and columns to answer the user question

Step 2 : Create the SQL query to answer the question

Step 3 : Use an appropriate tool to run the query and get the required data

Step 4 : Frame an appropriate answer using the query results and the question

Prompts for various steps would be

#step 1
prompt_1="""Find the most appropriate table names and columns which can
answer the question mentioned below:
{question}
Provide the answer in the format below, enclosed in triple single quotes:

'''
Table_1 with columns column1, column2, column3 ..
Table_2 with columns column1, column2 ..
'''
Provide an answer only if sure. Don't provide wrong answers.
"""

Once the correct table and column names are acquired, the step 2 prompt would be:

#step 2
prompt_2="""Using the tables and their columns provided in the context below
{context}
write a SQL query to answer the question
{question}
Only provide the complete, usable SQL query as the output.
"""

In the above prompt, the context is the output of step 1. In some cases, steps 1 and 2 can be combined into a single step that returns the SQL query directly.

In step 3, the SQL query obtained can be run on any query engine to get the data.
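As a minimal sketch (assuming the data lives in a hypothetical local SQLite file named indian_companies.db and sql_query holds the output of step 2), step 3 could look like this:

#step 3 (illustrative sketch, not from the original article)
import sqlite3
import pandas as pd

conn = sqlite3.connect("indian_companies.db")   # hypothetical database file
data = pd.read_sql_query(sql_query, conn)       # sql_query is the output of step 2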

Step 4 is to present the answer in a suitable form. For example:

#step 4
prompt_4 = """You are a business analyst. Based on the data provided below
{data}
frame a response in formal English to answer the user question
{question}
"""

In the above step, data is the output of SQL query execution.

In this case the prompts are designed to perform multiple activities one after the other.

(The chain of prompts shown above is overkill for the example chosen, but the intention is to show the power of chained prompts.)

This kind of prompting technique is used when the LLM already knows the context or when a context is provided to the LLM along with the prompt.
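As a minimal sketch of how these step prompts could be wired together with LangChain (assuming llm_gpt is an already initialized chat model and prompt_1 / prompt_2 are the templates shown above), the chaining might look like this:

# illustrative chaining sketch, not from the original article
from langchain import LLMChain, PromptTemplate

user_question = "Which are the top 5 richest companies in India?"

# step 1: ask the model for the relevant tables and columns
step1_chain = LLMChain(llm=llm_gpt, prompt=PromptTemplate(template=prompt_1, input_variables=["question"]))
tables = step1_chain.run(question=user_question)

# step 2: feed the step 1 output in as context and ask for the SQL query
step2_chain = LLMChain(llm=llm_gpt, prompt=PromptTemplate(template=prompt_2, input_variables=["context", "question"]))
sql_query = step2_chain.run(context=tables, question=user_question)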

Embedding Models:

Embedding models convert text into vectors called embeddings. The quality of these embeddings depends on the model's weights and what it has learned about general language, so some models generate better embeddings than others, much as ChatGPT outperforms many open-source models.

Open-source embedding models or OpenAI embedding models can be used to create the vectors/embeddings. Each open-source LLM also ships with its own tokenizer and embedding layer.

Vectors are generated for two kinds of data:

  1. Vector generation for prompts
  2. Vector generation for documents — after chunking, the chunks of text are passed to the embedding model to generate vectors

Document embeddings are generated so that a vector DB can be used. The embeddings are indexed and stored in the vector DB, which helps with retrieval.
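As a minimal sketch (using LangChain's HuggingFaceEmbeddings wrapper with the all-MiniLM-L6-v2 model, the same model used in the RAG example later), generating embeddings for a prompt and for document chunks might look like this:

# illustrative sketch, not from the original article
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# vector for a user prompt
query_vector = embeddings.embed_query("What is my travel date?")
# vectors for document chunks
chunk_vectors = embeddings.embed_documents(["chunk one text", "chunk two text"])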

Storage and Search :

For a question-answering application on a PDF, it is important to retrieve the part of the PDF that answers the user's question. For example, if the user asks for the travel date in a travel-itinerary PDF, it is crucial to find the part of the PDF that contains this information. Plain keyword search across the text is not reliable enough for this.

The document is divided into small chunks, and an embedding is generated for each chunk. These embeddings are indexed and stored in a vector DB such as FAISS, Milvus, or Pinecone. Vector DBs have built-in retrieval mechanisms based on similarity algorithms. Whenever a user asks a question, the embedding of the question is matched against the stored vectors (or against a particular index in the DB) to retrieve the top n matches.

These top n matches are used as context for the user query and passed to the LLM along with the prompt.
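As a minimal sketch (using LangChain's FAISS wrapper; texts and embeddings are assumed to be the chunked documents and the embedding model from the previous steps), storing and searching might look like this:

# illustrative sketch, not from the original article
from langchain.vectorstores import FAISS

# index the chunk embeddings in an in-memory FAISS store
db = FAISS.from_documents(texts, embeddings)
# retrieve the top 3 chunks most similar to the user question
matches = db.similarity_search("What is my travel date?", k=3)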

LLM :

LLM’s are the heart and brain of generative AI applications on text. LLM from OpenAI like ChatGPT — 3.5, GPT — 4 or some of the best opensource models like Falcon 7B or 40B, MPT 7B or 30B, Llama-v2 7B or 13B or 70B and many other models can be used to extract answers from any document.

Once the appropriate context is extracted for a prompt, there are multiple ways to pass it to the LLM. Below are two examples using the LangChain framework.

  1. Use the context in the input prompt along with the user question and pass it to LLMChain from langchain
from langchain.chat_models import ChatOpenAI
from langchain.agents import create_pandas_dataframe_agent
import pandas as pd
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain import HuggingFaceHub,LLMChain,PromptTemplate
import torch
from langchain.chat_models import AzureChatOpenAI


#initialize Azure OpenAI model
llm_gpt = AzureChatOpenAI(
    openai_api_base=BASE_URL,
    openai_api_version=VERSION,
    deployment_name=DEPLOYMENT_NAME,
    openai_api_key=API_KEY,
    openai_api_type="azure",
)

prompt="""
Optimize the code {query}
"""
query="""
Replace the below code with more efficient code. Write all the steps to have working code:
df_result = pd.DataFrame()
for r in regions:
for a in archetypes:
df_ra=df1[(df1['Region']==r)&(df1['Archetype']==a)]
print(r,a)
print(df_ra.shape)
if df_ra.shape[0]>1:
X,y=df_ra[['Inbound_Volume', 'Outbound_Volume','Unsold_Volume','Inbound_Productivity','Outbound_Productivity']],df_ra[['Profit']]
categorical_features_indices = np.where(X.dtypes == np.object)[0]
cbr=CatBoostRegressor()
model=cbr.fit( X, y , cat_features = categorical_features_indices , eval_set = None ,plot=False, verbose=False)
feats = {} # a dict to hold feature_name: feature_importance
for feature, importance in zip(X.columns, model.feature_importances_):
feats[feature] = importance
feats['Region']=r
feats['Type']=a
df_r = pd.DataFrame([feats])
df_result=df_result.append(df_r)
print(feats)
"""

prompt_template=PromptTemplate(template=prompt,input_variables=['query'])

question_chain_gpt = LLMChain(llm=llm_gpt,prompt=prompt_template)
question_chain_gpt.run(query)

In the above code snippet, the query is the context that is inserted into the prompt and sent to the OpenAI model.

2. Using Retrieval-Augmented Generation (RAG) :

This is the approach explained in the architecture above: a document is split into chunks, vectors of these chunks are created and stored in a vector DB, and the chunks most similar to the user prompt are retrieved as context for the LLM to answer the user's question.

#initialize Azure OpenAI model
llm_gpt = AzureChatOpenAI(
    openai_api_base=BASE_URL,
    openai_api_version=VERSION,
    deployment_name=DEPLOYMENT_NAME,
    openai_api_key=API_KEY,
    openai_api_type="azure",
)

string="Data.txt"
embeddings = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')
#load the document
loader = TextLoader(string)
documents = loader.load()
#split the text into chunks of 100 characters with an overlap of 20 characters
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=20)
texts = text_splitter.split_documents(documents)
# create the vectorstore to use as the index
db = Chroma.from_documents(texts, embeddings)
# expose this index in a retriever interface returning the top 2 most similar chunks
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k":2})
qa = RetrievalQA.from_chain_type(llm=llm_gpt, chain_type="stuff", retriever=retriever, return_source_documents=True, verbose=True)

#answer the user question by calling the retriever to get the most similar chunks
prompt="Why should a business increase revenue?"
result = qa({"query": prompt})
print(result)

Data:

This is the last step, which deals with extracting the response from the LLM in the required format. The format of the response becomes very important when it has to be used elsewhere. LLMs can respond in a specific format; the instructions just have to be provided in the prompt.

For example, we can ask the LLM to respond in JSON, or as a string delimited by start and stop markers, by providing the right instructions:

prompt_json="""
Answer the below question
Question : {question}
Output the response in JSON with key as mentioned below
Output :
{country : 'Bharath',
comment : 'Oldest civilization in the World'}
"""

prompt_string="""
Answer the below question
Question : {question}
output the response in the string format as mentioned below
Output :
###Response_Start###
Bharath is the oldest civilization in the world
###Response_End###
"""

Conclusion

This article walks through the basic components needed to create an LLM application on textual documents.

It is important to understand that, when dealing with custom documents, providing the right context with the prompt is key to getting an effective response. Another way to customize an LLM is fine-tuning. Details of fine-tuning OpenAI and open-source LLMs will be shared in the next article.

