How to Build a Retrieval-Augmented Generation (RAG) System Locally with RLAMA and Ollama

Implementing and Refining RAG with rlama
Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) by retrieving relevant document segments and using them to ground the model's responses. The rlama framework facilitates a completely local, self-contained RAG solution, eliminating dependency on external cloud services and keeping the underlying data confidential. Although rlama supports both large and small LLMs, it is optimized for smaller models without giving up the ability to scale to more capable alternatives.
Introduction to RAG and rlama
Under the RAG paradigm, a knowledge repository is queried to fetch contextually appropriate documents, which are then integrated into the model prompt. This mechanism roots the model’s output in verifiable and contemporaneous information. Conventional RAG pipelines rely on multiple separate modules (e.g., document loaders, text splitters, vector databases), but rlama unifies these activities within a single command-line interface (CLI). It carries out:
- Document ingestion and segmentation.
- Embedding generation through local models (via Ollama).
- Archival in a hybrid vector store supporting both semantic and textual queries.
- Query-based retrieval that provides contextual data for response generation.
Because rlama operates entirely on local infrastructure, it delivers a secure, performant, and streamlined environment.
Step-by-Step Guide to Implementing RAG with rlama
Installing Ollama
First, install Ollama. Download from https://ollama.com/download
Once installed, check the available LLMs on https://ollama.com/search. I suggest starting with:
- deepseek-r1: DeepSeek's first generation of reasoning models, with performance comparable to OpenAI-o1, including six dense models distilled from DeepSeek-R1 based on Llama and Qwen.
- llama3.2: Meta's Llama 3.2 goes small with 1B and 3B models.
- phi4: Phi-4 is a 14B-parameter, state-of-the-art open model from Microsoft.
Install them with:
ollama run <model-name>:<model-version>
Ollama will download the model from the internet and drop you into a chat with it. To exit the chat, simply type /bye.
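For example, to download and start chatting with the 8B DeepSeek-R1 distillation used later in this guide:
ollama run deepseek-r1:8b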
While Ollama is running on your system, you can also access it through a REST API exposed at http://localhost:11434/api/generate. To learn more, see the Ollama documentation: https://github.com/ollama/ollama/blob/main/docs/api.md
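As a quick sanity check, you can send a prompt straight to that endpoint with curl (this assumes you pulled deepseek-r1:8b as above; substitute whichever model you installed):
curl http://localhost:11434/api/generate -d '{
"model": "deepseek-r1:8b",
"prompt": "Explain retrieval-augmented generation in one sentence.",
"stream": false
}'
With "stream": false, Ollama returns a single JSON object instead of a stream of partial responses.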
You also need to install the following text-embedding model so that RLAMA can build the vector database for the RAG system:
- bge-m3: BGE-M3 is a new model from BAAI distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
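Since bge-m3 is only used to compute embeddings, there is no need to chat with it; pull it without opening a session:
ollama pull bge-m3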
Installing RLAMA
Download and run the RLAMA installation script from the GitHub repo:
curl -fsSL https://raw.githubusercontent.com/dontizi/rlama/main/install.sh | sh
Check the installation with:
rlama --version
Creating a RAG System
Create an indexed repository (hybrid vector store) from your documents:
rlama rag <model> <rag-name> <folder-path>
For instance, using deepseek-r1:8b:
rlama rag deepseek-r1:8b mydocs ./docs
This process:
- Recursively scans the specified directory for compatible files.
- Converts documents to plain text and partitions them into manageable segments.
- Generates embeddings for each segment, utilizing the chosen model.
- Saves both segments and metadata in a local hybrid vector store (e.g., ~/.rlama/mydocs).
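As a quick sanity check that the index was built, you can immediately ask the new store a throwaway question with the rlama run command covered in more detail below:
rlama run mydocs
Then type a question about one of your documents at the prompt.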
Managing Documents in the RAG System
Preserve accuracy by keeping the index updated:
Add Documents:
rlama add-docs mydocs ./new_docs --exclude-ext=.log
List Documents:
rlama list-docs mydocs
Inspect Chunks:
rlama list-chunks mydocs --document=filename
Update the Model:
rlama update-model mydocs
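Taken together, a typical refresh pass after new files arrive might look like this (reusing the mydocs store and ./new_docs folder from the examples above):
# index the new files, skipping log output, then confirm they are present
rlama add-docs mydocs ./new_docs --exclude-ext=.log
rlama list-docs mydocs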
Configuring Chunking and Retrieval
Chunk Size & Overlap:
Each chunk spans approximately 300–500 tokens, enabling fine-grained retrieval. Reducing chunk size increases retrieval precision, while slightly larger segments help maintain context. A partial overlap of roughly 10–20% preserves continuity across segment boundaries; for example, a 400-token chunk with 15% overlap shares about 60 tokens with its neighbor, so a sentence cut at a boundary still appears intact in one of the two chunks.
Context Size:
The --context-size parameter manages the number of chunks fetched per query. A default of 20 chunks suffices for many queries, but smaller or larger values are equally feasible depending on the breadth of the question. Be mindful of the cumulative token capacity of the LLM when adjusting context size.
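For example, a narrowly scoped question usually needs fewer chunks, which also keeps the prompt well within the model's token budget:
rlama run mydocs --context-size=10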
Hybrid Retrieval:
Although rlama primarily employs dense semantic search, it also retains the original text to allow direct string-based queries. Consequently, retrieval can leverage both vector embeddings and literal text matching.
Running Queries
To perform interactive queries:
rlama run mydocs --context-size=20
At the prompt, specify your question:
How do I install the project?
Behind the scenes, rlama converts the query into an embedding, retrieves the best-matching chunks from the repository, and leverages a local LLM (via Ollama) to produce a context-informed response.
Terminate the session with the exit command, or press CTRL+C.
Using the rlama API
For programmatic applications, launch the API service:
rlama api --port 11249
Then submit queries over HTTP:
curl -X POST http://localhost:11249/rag -H "Content-Type: application/json" -d '{
"rag_name": "mydocs",
"prompt": "How do I install the project?",
"context_size": 20
}'
The service responds with a JSON payload containing the generated answer and ancillary information.
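For scripting, the same request can be piped through jq to pretty-print the returned JSON; the exact field names in the payload may vary between rlama versions, so inspect the output before parsing it:
curl -s -X POST http://localhost:11249/rag \
  -H "Content-Type: application/json" \
  -d '{"rag_name": "mydocs", "prompt": "How do I install the project?", "context_size": 20}' | jq .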
Optimization Tips
Retrieval Speed:
- Adjust context_size to strike an optimal balance between speed and completeness.
- Favor smaller embedding models for quick turnarounds, or adopt specialized embedding architectures when needed.
- Preemptively exclude non-relevant files from indexing to reduce overhead.
Retrieval Accuracy:
- Calibrate chunk size and overlap for precise retrieval.
- Choose the most appropriate model for your dataset. rlama update-model seamlessly switches embeddings.
- Modify prompts as necessary to reduce off-topic generation.
Local Performance:
- Match hardware resources (RAM, CPU, GPU) to your chosen model.
- Use SSDs for superior I/O speed and enable multithreading for faster inference.
- For bulk queries, prefer persistent API mode over repeated CLI executions.
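As an illustration of the last point, a simple batch loop against the running API server (started with rlama api --port 11249 as above) avoids the per-invocation startup overhead of repeated CLI calls; questions.txt is a hypothetical file with one question per line:
# post each question in questions.txt to the rlama API
# note: questions containing double quotes would need extra escaping
while IFS= read -r question; do
  curl -s -X POST http://localhost:11249/rag \
    -H "Content-Type: application/json" \
    -d "{\"rag_name\": \"mydocs\", \"prompt\": \"$question\", \"context_size\": 20}"
  echo
done < questions.txt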
Next Steps
- Enhanced Chunking: Refine chunking methods to enhance RAG outcomes, particularly for smaller language models.
- Performance Monitoring: Continuously test different model configurations to identify the best arrangement for various hardware capabilities.
- Future Directions: Look forward to improvements in advanced retrieval methods and adaptive chunking strategies.
Conclusion
By emphasizing confidentiality, speed, and user-friendliness, rlama provides a robust local RAG solution. Whether supporting quick lookups with compact models or detailed analyses with large-scale LLMs, rlama offers a flexible and powerful platform. Its hybrid storage, metadata structures, and RagSystem implementation work together to keep retrieval accurate. Best of luck with your local indexing and querying endeavors.
References
- RLAMA GitHub Repository: https://github.com/DonTizi/rlama
- RLAMA Website: https://rlama.dev/