How to Build a Retrieval-Augmented Generation (RAG) System Locally with RLAMA and Ollama

RLAMA, available for macOS, Linux, and Windows

Implementing and Refining RAG with rlama

Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) by incorporating document segments that ground responses in relevant data. The rlama framework provides a completely local, self-contained RAG solution, eliminating dependency on external cloud services while keeping the underlying data confidential. Although rlama supports both large and small LLMs, it is tuned for smaller models without giving up the ability to scale to more capable ones.

Introduction to RAG and rlama

Under the RAG paradigm, a knowledge repository is queried to fetch contextually appropriate documents, which are then integrated into the model prompt. This mechanism roots the model’s output in verifiable and contemporaneous information. Conventional RAG pipelines rely on multiple separate modules (e.g., document loaders, text splitters, vector databases), but rlama unifies these activities within a single command-line interface (CLI). It carries out:

  • Document ingestion and segmentation.
  • Embedding generation through local models (via Ollama).
  • Archival in a hybrid vector store supporting both semantic and textual queries.
  • Query-based retrieval that provides contextual data for response generation.

Because rlama operates entirely on local infrastructure, it delivers a secure, performant, and streamlined environment.

Step-by-Step Guide to Implementing RAG with rlama

Installing Ollama

First, install Ollama. Download from https://ollama.com/download

Once installed, check the available LLMs on https://ollama.com/search. I suggest starting with:

  • deepseek-r1: DeepSeek's first generation of reasoning models, with performance comparable to OpenAI-o1, including six dense models distilled from DeepSeek-R1 based on Llama and Qwen.
  • llama3.2: Meta's Llama 3.2, available in small 1B and 3B sizes.
  • phi4: Phi-4 is a 14B-parameter, state-of-the-art open model from Microsoft.

Install them using:

ollama run <model-name>:<model-version>

Ollama downloads the model from the internet and drops you into an interactive chat with it. To exit the chat, simply type /bye.
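
For example, to download and chat with the 8B DeepSeek-R1 distillation used later in this guide:

ollama run deepseek-r1:8b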

While Ollama is running on your system, you can also reach it through a REST API exposed at http://localhost:11434/api/generate. To learn more, see the Ollama documentation at https://github.com/ollama/ollama/blob/main/docs/api.md
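
For instance, a minimal non-streaming generation request against a model you have already pulled, such as llama3.2, looks like this:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'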

You also need to install the following text embedding model so that RLAMA can build the vector database for the RAG system:

  • bge-m3: BGE-M3 is a new model from BAAI distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
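
Download it with:

ollama pull bge-m3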

Installing RLAMA

Download and run the RLAMA installation script from the GitHub repo:

curl -fsSL https://raw.githubusercontent.com/dontizi/rlama/main/install.sh | sh

Check the installation with:

rlama --version

Creating a RAG System

Create an indexed repository (hybrid vector store) from your documents:

rlama rag <model> <rag-name> <folder-path>

For instance, employing a model such as deepseek-r1:8b:

rlama rag deepseek-r1:8b mydocs ./docs

This process:

  • Recursively enumerates the specified directory for compatible files.
  • Converts documents to plain text and partitions them into manageable segments.
  • Generates embeddings for each segment, utilizing the chosen model.
  • Saves both segments and metadata in a local hybrid vector store (e.g., ~/.rlama/mydocs).
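
The same command works with smaller models. For instance, a lighter-weight index over a different folder could be built like this (the notes name and ./notes path are illustrative placeholders; llama3.2:3b must already be pulled):

rlama rag llama3.2:3b notes ./notes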

Managing Documents in the RAG System

Preserve accuracy by keeping the index updated:

Add Documents:

rlama add-docs mydocs ./new_docs --exclude-ext=.log

List Documents:

rlama list-docs mydocs

Inspect Chunks:

rlama list-chunks mydocs --document=filename

Update the Model:

rlama update-model mydocs <new-model>

Configuring Chunking and Retrieval

Chunk Size & Overlap:

Each partitioned chunk spans approximately 300–500 tokens, enabling fine-grained retrieval. Reducing chunk size increases retrieval precision, while slightly larger segments help maintain context. A partial overlap of roughly 10–20% preserves continuity across segment boundaries.

Context Size:

The --context-size parameter manages the number of chunks fetched per query. A default of 20 chunks suffices for many queries, but smaller or larger values are equally feasible depending on the breadth of the question. Be mindful of the cumulative token capacity of the LLM when adjusting context size.
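
As a rough budget check, assuming the ~300–500-token chunks described above: 20 chunks × ~400 tokens ≈ 8,000 tokens of retrieved context, which must fit inside the model's context window together with your question and the generated answer.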

Hybrid Retrieval:

Although rlama primarily employs dense semantic search, it also retains the original text to allow direct string-based queries. Consequently, retrieval can leverage both vector embeddings and literal text matching.

Running Queries

To perform interactive queries:

rlama run mydocs --context-size=20

At the prompt, specify your question:

How do I install the project?

Behind the scenes, rlama converts the query into an embedding, retrieves the best-matching chunks from the repository, and leverages a local LLM (via Ollama) to produce a context-informed response.

Terminate the session with the exit command or by pressing CTRL+C.

Using the rlama API

For programmatic applications, launch the API service:

rlama api --port 11249

Then submit queries over HTTP:

curl -X POST http://localhost:11249/rag -H "Content-Type: application/json" -d '{
  "rag_name": "mydocs",
  "prompt": "How do I install the project?",
  "context_size": 20
}'

The service responds with a JSON payload containing the generated answer and ancillary information.

Optimization Tips

Retrieval Speed:

  • Adjust context_size to strike an optimal balance between speed and completeness.
  • Favor smaller embedding models for quick turnarounds, or adopt specialized embedding architectures when needed.
  • Preemptively exclude non-relevant files from indexing to reduce overhead.

Retrieval Accuracy:

  • Calibrate chunk size and overlap for precise retrieval.
  • Choose the most appropriate model for your dataset. rlama update-model seamlessly switches embeddings.
  • Modify prompts as necessary to reduce off-topic generation.

Local Performance:

  • Match hardware resources (RAM, CPU, GPU) to your chosen model.
  • Use SSDs for superior I/O speed and enable multithreading for faster inference.
  • For bulk queries, prefer the persistent API mode over repeated CLI invocations, as sketched below.
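
A minimal sketch of such a bulk-query loop, assuming the API from the previous section is running on port 11249, the mydocs RAG exists, and questions.txt holds one plain question per line with no special characters:

# Send each question in questions.txt to the running rlama API
while IFS= read -r question; do
  curl -s -X POST http://localhost:11249/rag \
    -H "Content-Type: application/json" \
    -d "{\"rag_name\": \"mydocs\", \"prompt\": \"$question\", \"context_size\": 20}"
  echo
done < questions.txt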

Next Steps

  • Enhanced Chunking: Refine chunking methods to enhance RAG outcomes, particularly for smaller language models.
  • Performance Monitoring: Continuously test different model configurations to identify the best arrangement for various hardware capabilities.
  • Future Directions: Look forward to improvements in advanced retrieval methods and adaptive chunking strategies.

Conclusion

By emphasizing confidentiality, speed, and ease of use, rlama provides a robust local RAG solution. Whether supporting quick lookups with compact models or detailed analyses with large-scale LLMs, it offers a flexible and powerful platform. Its hybrid vector store, richer metadata structures, and improved RagSystem component together raise retrieval fidelity. Best of luck with your local indexing and querying endeavors.
