Hung's Notebook

Open Source LLMs

What's this?

This is a blog post for Lesson 2 of LLM Zoomcamp: https://github.com/DataTalksClub/llm-zoomcamp/blob/main/02-open-source/.

For this week's lesson, we will try to replace the 3rd-party API provider used for the Generation part of RAG with open-source models running locally.

The blog post is divided into 2 parts, corresponding to the lesson content: model inference on GPU and model inference on CPU.

Model inference on GPU with HuggingFace 🤗

HuggingFace 🤗 is the platform for open-source models. The platform supports many things, but for this lesson we need to pay attention to two: the Model Repository and the transformers Python library.

The Model Repository is where contributors share model weights and related materials, e.g., training source code. Contributors range from enthusiasts who help quantize the models for accessibility to corporations like Meta or Mistral AI.

transformers is the official Python library to download the models and work with them, from fine-tuning to running inference. After tremendous growth and support from the community, the library has become the authoritative choice for anyone who wants to work with open-source LLMs, maybe even all open-source models1.

For fast inference, we need a GPU. If you have a budget, there are many GPU vendors, from big clouds like Google Cloud or AWS to niche ones like Jarvis Labs, vast.ai, or Lambda Labs. For a free GPU option, we can use Lightning AI Studio or Saturn Cloud2, which provide a full IDE and allow SSH into the running server. If you just need a Jupyter Notebook, which is enough to complete the lesson, you can use Google Colab or SageMaker Studio Lab, though GPUs on the latter are much, much scarcer.

Before running, additional packages need installing:

# Use uv for faster installation
pip install uv
uv pip install -U transformers accelerate bitsandbytes sentencepiece
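
To confirm that a GPU is actually visible to PyTorch, a quick sanity check (assuming torch is already installed, which it is on Colab and most GPU images):

import torch

# True if a CUDA GPU is visible to PyTorch
print(torch.cuda.is_available())
# Name of the first GPU, e.g. a T4 on free Colab
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))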

The models explored in the lesson are FLAN-T5 XL, Phi-3 Mini, and Mistral 7B, but I will skip Mistral 7B since it's similar to Phi-3 Mini while being heavier but weaker (on benchmarks)3.

FLAN-T5 XL

FLAN-T5 is an improved version of T5, or "Text-to-Text Transfer Transformer", developed by Google. The latter introduced the notion of a single text-to-text language model for every NLP task, while the former explored scaling and instruction fine-tuning. The model is less capable than current state-of-the-art open-source (SOTA OS) models such as Meta's Llama 3, even at only 8B parameters. However, it's still strong enough to get the work done.

To download and load the model:

from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl", device_map="auto")

Every model from transformers must be loaded together with its tokenizer, which translates between text and tokens. The device_map argument automatically places the model on the fastest hardware available ("auto"), or pins it to a specific device.
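
As a tiny illustration of what the tokenizer does (the question is arbitrary), you can encode text into token IDs and decode it back:

# Quick look at the tokenizer: text -> token IDs -> text
ids = tokenizer("How do I run Kafka?", return_tensors="pt").input_ids
print(ids)                       # tensor of integer token IDs
print(tokenizer.decode(ids[0]))  # original text plus an end-of-sequence token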

The llm function can be rewritten to use the response from this model.

def llm(prompt):
    # Tokenize the prompt and move the token IDs to the GPU
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
    # Generate with default settings, which cap the output length
    outputs = model.generate(input_ids)
    result = tokenizer.decode(outputs[0])
    return result

When running the RAG workflow from the previous lesson, the output is just the relevant part of the context, chopped short. That's because the default max length of FLAN-T5 XL's generation is just 50 tokens. To receive longer output, we need to change a parameter in the .generate method. We also want to remove special tokens such as padding from the output4.

def llm(prompt, generate_params=None):
    if generate_params is None:
        generate_params = {}

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(
        input_ids,
        max_length=generate_params.get("max_length", 500),
        num_beams=generate_params.get("num_beams", 5),
        do_sample=generate_params.get("do_sample", False),
        temperature=generate_params.get("temperature", 1.0),
        top_k=generate_params.get("top_k", 50),
    )
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return result

The pipeline higher-level API could be used instead of the separate tokenizer and model. It automatically loads the tokenizer and model behind the scenes and applies the correct sequence of steps and settings, such as skipping special tokens, during inference.

from transformers import pipeline

pipe = pipeline("text2text-generation", model="google/flan-t5-xl", device_map="auto")

def llm(prompt, generate_params=None):
    if generate_params is None:
        generate_params = {}
    outputs = pipe(
        prompt,
        max_length=generate_params.get("max_length", 100),
        num_beams=generate_params.get("num_beams", 5),
        do_sample=generate_params.get("do_sample", False),
        temperature=generate_params.get("temperature", 1.0),
        top_k=generate_params.get("top_k", 50),
    )
    return outputs[0]['generated_text']
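
Either version of llm plugs into the same rag wrapper from the previous lesson. A rough sketch, with search and build_prompt standing in for the helper functions defined there:

def rag(query):
    # search() and build_prompt() are the helpers from the previous lesson
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    return llm(prompt)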

Phi-3 Mini

Phi-3 Mini is a model that is exceptionally strong for its size, partly thanks to the careful curation of its training dataset.

To use the model, we can adapt the example on the model card:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct", 
    device_map="auto", 
    torch_dtype="auto", 
    trust_remote_code=True, 
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
def llm(prompt):
    messages = [
        {"role": "user", "content": prompt},
    ]
    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        "temperature": 0.0,
        "do_sample": False,
    }
    output = pipe(messages, **generation_args)
    return output[0]['generated_text'].strip()

The output is terse due to the low temperature, which makes generation more deterministic, so the model will likely just copy a part of the context verbatim. To see more varied and eloquent responses, increase the temperature.
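
Note that temperature only matters when sampling is enabled: with do_sample=False decoding is greedy and the temperature value is ignored. A sampling run could look like this (the values and the question are just an illustration):

# Enable sampling so the temperature setting actually takes effect
messages = [{"role": "user", "content": "Explain RAG in two sentences."}]
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.7,   # higher values -> more varied, less literal answers
    "do_sample": True,
}
output = pipe(messages, **generation_args)
print(output[0]['generated_text'].strip())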

Model inference on CPU with Ollama

Developers have worked tirelessly to run LLMs on CPU alone: quantizing the weights from 32 bits down to 8 or even 4 bits to reduce RAM usage, porting the weights to C/C++ data types, and rearranging the architecture to use C++ parallel algorithms. Popular projects are llama.cpp and Ollama; we use the latter in this course with GitHub Codespaces.
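
To get a feel for why quantization matters, here is a back-of-the-envelope estimate of the memory needed just for the weights of a 7B-parameter model at different precisions (ignoring activations and runtime overhead):

# Rough weight-only memory footprint of a 7B-parameter model
params = 7e9
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: ~{params * bits / 8 / 1e9:.1f} GB")
# 32-bit: ~28.0 GB, 16-bit: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB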

On Linux, the easiest way to install the CLI package is:

curl -fsSL https://ollama.com/install.sh | sh

However, I prefer the official Docker image, which will be used with ElasticSearch later.

# Run the Ollama server in the background
docker run -d \
    -v ollama:/root/.ollama \
    -p 11434:11434 \
    --name ollama \
    ollama/ollama
# Pull the Phi-3 model inside the container
docker exec -it ollama ollama pull phi3

Ollama exposes an OpenAI-compatible API, which can be used with the OpenAI client.

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama',
)

def llm(prompt):
    response = client.chat.completions.create(
        model='phi3',
        messages=[{"role": "user", "content": prompt}],
    )
    
    return response.choices[0].message.content

To run with ElasticSearch on the same Codespace, we need to prepare a Docker Compose file.

version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.4.3
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    mem_limit: 4GB
    ports:
      - "9200:9200"
      - "9300:9300"

  ollama:
    image: ollama/ollama
    container_name: ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"

volumes:
  ollama:

I added the mem_limit parameter for the ElasticSearch container so that it does not crash the Codespace with its memory usage. Judging from the (4-bit quantized) model sizes listed by Ollama, the 4-core, 16 GB RAM Codespace instance is enough to run even the Phi-3 Medium model.
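
After starting the stack with docker compose up -d and pulling the model (docker exec -it ollama ollama pull phi3), a quick way to confirm both services are reachable from the notebook is a pair of HTTP requests (a small sanity check using the requests library):

import requests

# Ollama's root endpoint answers with "Ollama is running"
print(requests.get("http://localhost:11434").text)
# Elasticsearch returns cluster info as JSON on port 9200
print(requests.get("http://localhost:9200").json()["version"]["number"])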

Streamlit Front-end

For the last part of the lesson, constructing a Streamlit front-end, it's more convenient to refer to the source code and the video.
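
For orientation only, the core of such an app is small. A minimal sketch, assuming the rag function from this lesson has been moved into an importable module (rag_module is a hypothetical name):

import streamlit as st

from rag_module import rag  # hypothetical module holding the rag() function

st.title("RAG with a local LLM")

query = st.text_input("Ask a question about the course")

if st.button("Ask") and query:
    with st.spinner("Generating answer..."):
        answer = rag(query)
    st.write(answer)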

Conclusion

After week 2, the RAG prototype has now become a Streamlit web app. I am looking forward to week 3 🤗!

And for you, reader, I hope these notes have been useful. See you next week ✋!


  1. The growth of the community is both fast and fascinating. You can come back after 1 month, and while the front-end may still look tantalizingly the same, the back-end has been completely refactored or supports 10+ more cutting-edge features.

  2. Saturn Cloud has a long waitlist, but students from LLM Zoomcamp are whitelisted and can use it immediately.

  3. It's also available for free through the Groq API, and I used it in the 1st lesson.

  4. I removed the top_p parameter since it would conflict with the other token-sampling settings.

#llm #post #study