Hung's Notebook

Prototype RAG in Jupyter Notebook

What's this?

This is a blog post for Lesson 1 of LLM Zoomcamp: https://github.com/DataTalksClub/llm-zoomcamp/tree/main/01-intro.

The Retrieval part of RAG is covered in depth in a separate pre-course workshop, Implement a Search Engine. I intend to write another blog post about it, but it isn't required for following this lesson.

Introduction

For the last three years, no technology has garnered more attention than large language models (LLMs). You type what you want into the chat box and get back exactly the response you want. It seems perfect, until you peek under the hood.

At an oversimplified level, an LLM is a probability model. It recasts the question "What's the correct output for this input?" as a conditional probability problem: "Given this input, what output has the highest chance of coming after it?". Playing with probability means that things do not work all the time1. When they don't, the most common failure is "hallucination": the model spewing out untruthful or incorrect output. So after ChatGPT's debut, improvements to LLMs have focused on improving that probability. RAG is one such technique.
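In notation (simplified, since real models decode token by token rather than picking a whole output at once), the model chooses the output $\hat{y}$ that maximizes the conditional probability given the input $x$:

$$\hat{y} = \operatorname*{arg\,max}_{y} \, P(y \mid x)$$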

Retrieval-Augmented Generation (RAG) is a mouthful, but the idea is simple. LLMs pay more attention to the conditioning part of the probability, i.e., the input, than to what they have memorized in their weights. So, if we can put the answer in the input itself, the probability that the LLM gives the correct output should increase dramatically. And it does.

How do we achieve RAG? As the name suggests, the system has two parts: Retrieval, i.e., a search engine, and Generation, i.e., an LLM. When a user submits input, the search engine finds relevant information in a database and appends it to the input before submitting it to the LLM, which produces the output2.

In this lesson, we build a RAG prototype in a Jupyter notebook. No wheel will be re-invented. The search engine will be Elasticsearch, the documents will be the FAQ collections of all DataTalks.Club courses, and the LLM will be served through the Groq API3.

Setup

Here's my folder setup:

.
├─ 01-intro/
│  ├─ rag-intro.ipynb
│  ├─ documents.json
│  ├─ minsearch.py
├─ .env
├─ requirements.txt

The content of the lesson can be done in a single Jupyter notebook. documents.json and minsearch.py will be created while running it. requirements.txt is my own file for environment creation. For module 1 it contains:

tqdm
python-dotenv
ipykernel
openai
elasticsearch
pandas
scikit-learn
groq

Note that I replaced jupyter with ipykernel because I work inside VS Code, not Jupyter Notebook.
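To recreate the environment, a standard pip install into a fresh virtual environment does the job:

pip install -r requirements.txt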

.env is used to store environment secrets, in this case my Groq API key. python-dotenv is used to read them.
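For reference, the .env file is just key-value pairs; the only key the notebook needs is GROQ_API_KEY (placeholder value shown):

GROQ_API_KEY=<your-api-key>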

Index documents

  1. Get the data file
wget https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/01-intro/documents.json
  2. Spin up an Elasticsearch container
docker run -it \
    --rm \
    --name elasticsearch \
    -m 4GB \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3
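Once the container is up, a quick curl from another terminal confirms Elasticsearch is responding (no credentials needed, since security is disabled):

curl http://localhost:9200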
  3. Ingest the documents
from elasticsearch import Elasticsearch
from tqdm.auto import tqdm

# Health check
es_client = Elasticsearch("http://127.0.0.1:9200")
es_client.info()

# Create index
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}
index_name = "course_questions"
es_client.indices.create(index=index_name, body=index_settings)

# Ingest documents
# documents.json nests each course's FAQ entries under a "documents" key,
# so flatten it and tag each record with its course
import json

with open("documents.json", "rt") as f_in:
    docs_raw = json.load(f_in)

documents = []
for course_dict in docs_raw:
    for doc in course_dict["documents"]:
        doc["course"] = course_dict["course"]
        documents.append(doc)

for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)
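A quick sanity check that all documents made it into the index:

es_client.count(index=index_name)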

Doing RAG (modularized)

For the development up to this point, see the lecture video. I also modified the code to use Groq instead of OpenAI and made other minor changes.

from groq import Groq
from dotenv import dotenv_values

def elastic_search(query: str) -> list:
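    # Boost the question field 3x over text/section and filter to a single course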
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }
    response = es_client.search(index=index_name, body=search_query)
    result_docs = []

    for hit in response["hits"]["hits"]:
        result_docs.append(hit["_source"])
    return result_docs

def build_prompt(query: str, search_results: list) -> str:
    context = ""

    for doc in search_results:
        context += f'section: {doc["section"]}\nquestion: {doc["question"]}\nanswer: {doc["text"]}\n\n'

    prompt_template = """You're a course teaching assistant. You will answer QUESTION using information from CONTEXT only.

    QUESTION: {question}

    CONTEXT:
    {context}
    """
    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

config = dotenv_values("../.env")
client = Groq(
    api_key=config["GROQ_API_KEY"],
)

def llm(prompt: str, model: str) -> str | None:
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model=model,
    )

    return chat_completion.choices[0].message.content

def rag(query: str, model: str = "llama3-8b-8192") -> str | None:
    search_results = elastic_search(query)
    prompt = build_prompt(query, search_results)
    return llm(prompt, model)

Testing:

rag("How do I run Kafka?")

Conclusion

That was a good introduction for week 1. Right now, the prototype cannot do much since it's just a Jupyter Notebook + Docker container. For week 2, Alexey will introduce open-source options to replace the APIs for the generation part4.


  1. And it's not specific to LLMs - nothing works 100% of the time, especially artificial things.

  2. An early example is Meta's ATLAS paper: https://www.jmlr.org/papers/volume24/23-0037/23-0037.pdf. Its generation part uses a far less capable model than ChatGPT. A later example is today's Bing Copilot, which is the Bing search engine + GPT-4 generation.

  3. The lecture uses the OpenAI API, which is paid. I use the Groq API because it's capable enough while free. The syntax is practically the same, though.

  4. For this, a GPU is needed.

#llm #post #study