Building A Python Script For Natural Language Q&A Using LlamaIndex

Zhijing Eu
8 min read · Mar 12, 2023

In this article I walk through a simple Python script that ingests multiple text-based documents and enables natural-language Q&A over their contents via a Python library called LlamaIndex.

Introduction — What Motivated This Article Post?

Last year I shared a simple workflow that combined a Python library for parsing VTT (video transcript) files with a nifty dashboard in Power BI that allows for some basic text analysis such as keyword extraction, word frequencies, who spoke the most, etc.

Unfortunately, at that time, I didn’t manage to include a way to naturally query the text data, whether by generating automatic summaries or by asking natural questions and getting consolidated answers from the text. This led me to revisit the problem of how to query large volumes of text beyond simple text analytics, against the backdrop of Q4 2022-Q1 2023 when ChatGPT really started to pick up…

OpenAI GPT-3 Models

Unless you’ve been living under a rock over the last 6 months, you’ve likely already heard of ChatGPT and other similar tools like Microsoft’s Bing Chat. These are obviously amazing tools, as they make it easy for almost anyone to use these powerful models by simply typing in a question, without the need for any prior coding experience.

However, at the time of writing, there are still some limitations with these tools. For example, ChatGPT and Bing Chat both limit how long prompts and responses can be. As of early March 2023, the limit is 4,096 tokens(*) for ChatGPT and 2,000 tokens for Bing Chat.

Therefore, pasting large volumes of text into a prompt window in the hope of getting an answer will just produce an error message.


(* A token is a sequence of characters that Large Language Models use to learn the statistical relationships between pieces of text. Or, if you prefer a simpler rule of thumb, OpenAI suggests that 75 words is roughly 100 tokens. See “GPT-3 tokens explained — what they are and how they work” on the Quickchat Blog.)
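That rule of thumb (roughly 100 tokens per 75 words) can be turned into a quick back-of-envelope check before pasting text into a prompt. The sketch below is only a heuristic, not an exact tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using OpenAI's rule of thumb:
    ~100 tokens per 75 words (about 4/3 tokens per word)."""
    word_count = len(text.split())
    return round(word_count * 100 / 75)


def fits_in_prompt(text: str, limit: int = 4096) -> bool:
    """Quick check against a model's context window (e.g. 4096 for GPT-3)."""
    return estimate_tokens(text) <= limit
```

For example, a 75-word passage estimates to about 100 tokens, while a few thousand words will blow past the 4,096-token window.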

Furthermore, there may be use cases where it would be helpful to customize the LLM’s parameters, such as the temperature or the maximum response length.

OpenAI does provide libraries that allow users to call their models via a simple API. This definitely helps with fine-tuning the model parameters. Unfortunately, users will still run into issues with data ingestion, as there are limits on the number of tokens per prompt.


There are workarounds for these token limits. However, they tend to be somewhat unwieldy and typically involve chunking the text into smaller blocks, summarizing each block, and then summarizing all the individual block summaries.
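That chunk-then-summarize workaround can be sketched as a simple “map-reduce” loop. Here `summarize` is a hypothetical stand-in for an actual LLM call, not a real API:

```python
def chunk_text(text: str, max_words: int = 500) -> list:
    """Split text into blocks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def summarize_long_text(text: str, summarize, max_words: int = 500) -> str:
    """Summarize each chunk, then summarize the combined chunk summaries.
    `summarize` is a placeholder for a call to an LLM."""
    chunks = chunk_text(text, max_words)
    partial_summaries = [summarize(chunk) for chunk in chunks]
    return summarize(" ".join(partial_summaries))
```

The awkward part is that quality degrades with each summarization pass, and you still have to tune the chunk size against the model’s token limit, which is exactly the bookkeeping LlamaIndex takes off your hands.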

Llama Index

This is where LlamaIndex comes in: a really useful Python library that offers users a set of data structures to index large volumes of data for various LLM tasks, removing concerns over prompt size limitations and data ingestion.

In this walkthrough example, we will be using multiple transcript files downloaded from videos on the awesome Tom Scott YouTube channel. They look a bit like the samples below (truncated for brevity):

SAMPLE ONE
0:00
- New Zealand has no native land mammals other than bats.
0:03
For around 85 million years, these islands were so remote
0:06
that the rest of the world's mammals just never made it here.
0:09
All the parts of the ecosystem that mammals would fill
........
9:02
Towns, regions, sure. But never a country.
9:04
If anyone can pull this off, it's going to be New Zealand.
SAMPLE TWO
0:00
- Wellington, the capital of New Zealand,
0:02
is a city built on steep hills.
0:03
If you're a tourist, you will probably stop by
0:05
the Wellington Cable Car at some point
.......
4:06
that it's got a human element to it.
4:07
I guess that's why I've been doing it for so long
4:09
and still enjoy it.
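Since these transcripts interleave timestamp lines with the spoken text, it may help retrieval quality to strip the timestamps before indexing. A minimal pre-processing sketch, assuming every timestamp sits on its own line in M:SS or H:MM:SS form:

```python
import re

# Matches lines that contain only a timestamp like "0:03" or "1:02:15"
TIMESTAMP_LINE = re.compile(r"^\d{1,2}(:\d{2}){1,2}$")


def strip_timestamps(transcript: str) -> str:
    """Drop timestamp-only lines, keeping just the spoken text."""
    lines = [line.strip() for line in transcript.splitlines()]
    return "\n".join(line for line in lines
                     if line and not TIMESTAMP_LINE.match(line))
```

You could run each file in the ‘data’ folder through this before handing it to LlamaIndex; in practice the indexing works on the raw files too, so this step is optional.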

First, we’ll install LlamaIndex:

pip install llama-index

Next, we’ll store all the transcript files in a single folder called ‘data’ and then use LlamaIndex’s GPTSimpleVectorIndex class, a good general-purpose tool for document retrieval, to ingest all the content.

import os
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_API_KEY'

from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# read every file in the 'data' folder
documents = SimpleDirectoryReader('data').load_data()

# build the vector index over the ingested documents
index = GPTSimpleVectorIndex(documents)

After that, we can save the index to disk and customize the LLM predictor settings:

from llama_index import LLMPredictor
from langchain import OpenAI

# save to disk (to avoid having to re-index the text later)
index.save_to_disk('index.json')

# customize the LLM settings (deterministic output, longer responses)
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-003", max_tokens=1024))

# load from disk and set the LLM predictor
index = GPTSimpleVectorIndex.load_from_disk('index.json', llm_predictor=llm_predictor)

And then we are all set ….

from IPython.display import Markdown, display

response = index.query("Which city has a hundred cable cars and why?")

display(Markdown(f"<b>{response}</b>"))

A: “Wellington, New Zealand has a hundred cable cars because they are a popular form of transportation for locals. The city is built on steep hills, making it difficult to access certain areas without the use of a cable car. The cable cars are also used to help people with mobility issues access their homes, and to allow people to develop properties on hillsides that would otherwise be inaccessible”

response = index.query("Which predator does the box kill ?")

display(Markdown(f"<b>{response}</b>"))

A: “The box kills rats because they are an invasive species that threaten the unique native species of New Zealand. The box is designed to be humane and instant, using a powerful trap and a toxic wax block bait.”

Hopefully this will get you started with LlamaIndex. The link below is the GitHub page for LlamaIndex, which has a ton of other examples and more advanced techniques you can use too. For example, besides the Vector Store Index, there are other data structures, such as the Tree Index and the List Index, that you can experiment with.

For the more advanced users reading this article, you may be pleased to know that LlamaIndex also supports integrations with LangChain, a Python library that connects OpenAI’s models to the external world. (Something that I will eventually get around to writing a separate article about.)

In the demo file below, LlamaIndex is used as a memory module to insert arbitrary amounts of conversation history into a LangChain chatbot.

Conclusion

All the examples above still require a bit of coding knowledge to implement, which may be worth it if you have a complex use case and some programming know-how. However, as major software companies begin to incorporate this tech ‘under the hood’ of their main products, the likely continuing trend is a lowering of the barriers for general non-technical users to access and make use of these powerful AI models.

For example, I was still messing around trying to figure out how to link Power BI to Microsoft’s Azure OpenAI models when I realised that Microsoft itself had announced some upcoming features for Microsoft Teams that will provide similar native functionality. (Unfortunately, this is going to be available only for Premium subscribers.)

Microsoft has also been quietly infusing its other flagship products with LLM-powered features. For example, today it’s even possible to use MS Word to natively auto-summarize documents! (Again, this is only for Premium customers.)

The technology behind generative AI models is itself developing at an accelerating pace. Beyond text-to-text generative models, there already exist other models such as OpenAI’s DALL-E 2, which turns text into images, and Meta’s Make-A-Video, which converts text prompts into short videos.

Very recently, Microsoft also announced a model called Kosmos-1 which is MULTI-MODAL, meaning that it handles not just text but also other modalities, such as images, within a single model.

While there has been a lot of (somewhat sensationalist) news coverage about how these recent AI developments will lead to robots taking over ALL our jobs, and in general a lot of apprehension about where the tech is heading, I personally maintain a view of cautious optimism.

There is so much potential for these tools not to replace but to enhance human capabilities and to help us solve the problems that matter. Ultimately, where we go from here really depends on how we CHOOSE to use these AI models and the ethical principles we follow.

