Leveraging Ollama in Python: Advanced Offline GenAI Solutions

With the growing demand for sophisticated Generative AI (GenAI) solutions, developers increasingly seek tools that are powerful, flexible, and able to function offline. Ollama is a standout option, allowing users to run and interact with AI models locally. This article explores how to use the ollama Python library for streaming and asynchronous responses, how to integrate Ollama with LangChain, and how to build offline GenAI workflows, featuring Microsoft's Phi-4 model, which is designed for complex reasoning.

Getting Started with Ollama's Python Library

The ollama Python library simplifies interactions with locally hosted AI models. It supports both synchronous and asynchronous operations, as well as real-time streaming of responses. You can install it via pip:

pip install ollama

See the ollama-python repository on GitHub for full documentation.

Setting Up and Using Models

After installing the library, make sure the Ollama server is running locally and that the model has been pulled (for example, ollama pull phi4). You can then interact with models like this:

import asyncio

import ollama
from ollama import AsyncClient

# Query a model synchronously
response = ollama.generate(model="phi4", prompt="What is the Microsoft PHI4 model?")
print(response["response"])

# Query asynchronously, streaming the reply chunk by chunk
async def query_model():
    async for chunk in await AsyncClient().generate(
        model="phi4", prompt="Tell me about generative AI.", stream=True
    ):
        print(chunk["response"], end="", flush=True)

asyncio.run(query_model())

The ability to stream responses asynchronously is particularly valuable when dealing with large outputs or real-time applications.

Streaming and Asynchronous Capabilities

Streaming Responses

The streaming API allows incremental output, perfect for scenarios where real-time feedback is needed, such as chatbots or live data analysis. Here’s how to implement it:

# Stream the response from the model, printing each chunk as it arrives
for chunk in ollama.generate(model="phi4", prompt="Explain the benefits of offline AI models.", stream=True):
    print(chunk["response"], end="", flush=True)

Asynchronous Queries

Asynchronous support enables non-blocking calls, ideal for integrating into applications requiring concurrent tasks:

# Perform multiple asynchronous queries concurrently
async def main():
    client = AsyncClient()
    tasks = [
        client.generate(model="phi4", prompt="What are transformers in AI?"),
        client.generate(model="phi4", prompt="Explain attention mechanisms."),
    ]
    results = await asyncio.gather(*tasks)
    for result in results:
        print(result["response"])

asyncio.run(main())

Integrating Ollama with LangChain

LangChain enhances the capabilities of Ollama by providing a framework for building complex pipelines. The integration is seamless, allowing you to leverage Ollama's local models within a LangChain workflow. See LangChain's documentation for details.

Installation

Install LangChain and the Ollama integration:

pip install langchain langchain-ollama

Using Ollama in LangChain

Here’s an example of using Ollama as an LLM within LangChain:

from langchain_ollama import OllamaLLM

# Wrap the locally hosted model as a LangChain LLM
llm = OllamaLLM(model="phi4")
response = llm.invoke("Generate a summary of the PHI4 model.")
print(response)
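
Because OllamaLLM is a standard LangChain runnable, it composes directly with prompt templates using the pipe operator. Here is a minimal sketch; the template wording and topic are illustrative:

from langchain_core.prompts import PromptTemplate
from langchain_ollama import OllamaLLM

# Pipe a prompt template into the local model to form a simple chain
prompt = PromptTemplate.from_template("Explain {topic} in two sentences.")
chain = prompt | OllamaLLM(model="phi4")

print(chain.invoke({"topic": "retrieval-augmented generation"}))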

Advanced Pipelines

LangChain allows chaining Ollama with other tools, such as retrieval-augmented generation (RAG):

# Requires: pip install langchain-community faiss-cpu
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_ollama import OllamaEmbeddings, OllamaLLM

# Load documents and build a local FAISS index with Ollama embeddings
# (assumes an embedding model such as nomic-embed-text has been pulled locally)
loader = TextLoader("docs/my_data.txt")
documents = loader.load()
vectorstore = FAISS.from_documents(documents, OllamaEmbeddings(model="nomic-embed-text"))
retriever = vectorstore.as_retriever()

# Create the retrieval QA chain
qa_chain = RetrievalQA.from_chain_type(llm=OllamaLLM(model="phi4"), retriever=retriever)
response = qa_chain.invoke({"query": "What insights can be drawn from this data?"})
print(response["result"])

Running GenAI Solutions Offline

One of Ollama's standout features is the ability to run completely offline, making it ideal for scenarios that require data privacy or operate in restricted environments. This is particularly valuable with models like Microsoft's Phi-4, a compact model with strong reasoning capabilities that can run entirely on local hardware, keeping sensitive workloads private and secure.
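
As a concrete illustration, the ollama library exposes pull and list calls, so a typical offline workflow is to download the weights once while connected and then run everything locally. A minimal sketch follows; the model name and prompt are illustrative:

import ollama

# One-time step while connectivity is available: download the model weights
ollama.pull("phi4")

# Later, fully offline: confirm what is available locally and run inference
for model in ollama.list()["models"]:
    print(model)

response = ollama.generate(model="phi4", prompt="Summarize the advantages of local inference.")
print(response["response"])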

Why Offline?

  • Data Privacy: Sensitive information never leaves your environment.
  • Reduced Latency: Requests never leave the machine, so there are no network round trips and response time depends only on local hardware.
  • No Dependency on Connectivity: Ensures continuous operation in environments without internet access.

Conclusion

Ollama's Python library, combined with LangChain, enables the creation of sophisticated GenAI solutions that operate entirely offline. Whether you're building a real-time application, integrating complex pipelines, or prioritizing privacy, Ollama and models like Microsoft's Phi-4 provide a robust foundation for innovation. Dive into these tools today and unlock the full potential of offline GenAI.