
Over the past month, I’ve been exploring the rapidly evolving world of local Large Language Models (LLMs). These models have become so accessible that it is now possible to run one on an average-spec desktop or laptop. Running these models locally offers a business several advantages:

  • Offline Use
  • Privacy
  • No use restrictions
  • No registration
  • No cost – other than the hardware and someone to set it up

The technology is advancing at an extraordinary pace, making information outdated within months. This article reflects my practical, hands-on experiences and knowledge.

For context, LLMs are a type of artificial intelligence designed to understand and generate human-like text. They are built using neural networks, which are algorithms loosely inspired by the way the human brain processes information, and they allow users to engage in dialogue with an AI that can be remarkably human-like.

So what could you use a local LLM for beyond the things you might ask ChatGPT to do? The biggest use case I can think of for a business like yours is a local document repository with a local LLM running against it. This would, for example, allow employees to query and interact with internal knowledge documents, contracts, HR information, company handbooks and so on, safely, privately, and efficiently.

When using a local LLM to query a document repository, the model doesn’t simply scan for keywords; it actively interprets and understands the content. First, documents such as contracts, HR policies, or manuals are converted into machine-readable text and broken into manageable sections. Each section is transformed into a numerical representation, or “embedding,” that captures its meaning, allowing the system to locate relevant passages even when the user’s question doesn’t exactly match the wording in the documents.

When a user asks a question, the LLM uses these embeddings to retrieve the most relevant sections and then synthesises the information to produce a clear, human-readable response. Rather than dumping raw text, the model summarises and organises the information in natural language, providing concise answers, highlighting key points, and even combining details from multiple sources when necessary. The result is an experience much like consulting a knowledgeable colleague: the model reads, understands, and communicates the documents’ contents in a way that is accessible and immediately useful to the user.
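To make the embedding step concrete, here is a tiny illustrative sketch (not part of the project we build below) using made-up three-dimensional vectors. A real embedding model produces vectors with hundreds of dimensions, but the idea of comparing a question vector against chunk vectors is the same:

from math import sqrt

# Made-up "embeddings" for two document chunks. In practice an embedding
# model such as nomic-embed-text produces these vectors automatically.
chunks = {
    "Customer X is entitled to two onsite visits per year.": [0.9, 0.1, 0.3],
    "The office kitchen is cleaned every Friday.": [0.1, 0.8, 0.2],
}

# Made-up embedding of the question "Is customer X covered for onsite visits?"
question_vector = [0.85, 0.15, 0.25]

def cosine_similarity(a, b):
    """Higher score = closer in meaning, even if the wording differs."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Retrieve the chunk whose embedding is closest to the question's embedding.
best_chunk = max(chunks, key=lambda text: cosine_similarity(chunks[text], question_vector))
print(best_chunk)  # the onsite-visits sentence, which is then handed to the LLM as context

The retrieved chunks, not the whole repository, are what the LLM reads before writing its answer.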

For our own use case at Advancery, customer contracts can be complex and bespoke. Using a local LLM to interpret all these documents quickly for support staff is invaluable.

For example, asking the LLM “Is customer X covered for onsite visits?”, “How many servers are covered on break-fix at customer X?”, or “Do we support customer X’s Azure compute?”. The list goes on, and all of this information is already documented; it’s just not always quick to retrieve.

I’m sure you can imagine how this would be of use in your own business. If you think it would have a good application there, please do give us a call. In the interests of the article, though, we will fully document how to set up and use your own.

For this project I used the LLM Llama 3.1, which was trained by Meta (Facebook). Llama 3.1 is the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, maths, tool use, and multilingual translation. The upgraded versions of the 8B and 70B models are multilingual and have a significantly longer context length of 128K, state-of-the-art tool use, and overall stronger reasoning capabilities. This enables Meta’s latest models to support advanced use cases, such as long-form text summarisation, multilingual conversational agents, and coding assistants.

Hardware used

Apple MacBook Pro (Apple Silicon)
CPU – Apple M3 Pro
RAM – 18GB
OS – macOS 15.1.1 (24B91)

Software needed

Ollama
LLM (Llama3.1)
Homebrew
Python 3.8 or newer
FastAPI

Setting up your environment

  1. First things first, download and install Ollama from ollama.com.
  2. Open a terminal window and run the following command to download the Llama 3.1 LLM:

    ollama pull llama3.1

    This downloads a few GB, so grab a brew while you wait.

  3. When it’s done, test it by typing the following command:
    ollama run llama3.1

    Then ask a simple question like “What is the capital of England?” and press Enter. You should get a response. Type /bye to exit.

Project Structure and Model

Create a new folder for your project in the terminal window:

mkdir llm-project

Then go into the folder.

cd llm-project

Let’s make some files and folders for the project:

mkdir documents
mkdir src
touch src/main.py
touch src/config.py
touch src/server.py
touch __init__.py

In the documents folder, put some PDF files for testing.
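For reference, the layout at this point should look roughly like this:

llm-project/
├── __init__.py
├── documents/        # PDFs to query go in here
└── src/
    ├── config.py     # settings: model names, paths, chunk sizes
    ├── main.py       # document loading, embedding and Q&A logic
    └── server.py     # optional web API (see the web interface section)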

 

Install Homebrew

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Install Python

brew install python

Then create a Python virtual environment:

python3 -m venv venv

Then activate it:

source venv/bin/activate

Installing libraries

We need to install some library dependencies. These packages handle different parts of our system:

  • langchain: The main framework for building AI applications
  • langchain-community: Community integrations used by LangChain, including the PDF loader and the Chroma vector store
  • langchain-ollama: Connects LangChain to Ollama
  • chromadb: Stores our documents in a searchable format
  • pypdf: Reads PDF files
  • python-dotenv: Manages configuration

In the terminal window, enter the following:

pip install langchain langchain-community langchain-ollama chromadb pypdf python-dotenv

 

Building Our Scripts

Let’s open the src/config.py file in a text editor and paste in the following code.

import os
# Base directory of the src folder
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
# Model settings
MODEL_NAME = "llama3.1"
EMBEDDING_MODEL = "nomic-embed-text"
# Database settings
PERSIST_DIRECTORY = os.path.join(BASE_DIR, "..", "chroma_db")
COLLECTION_NAME = "documents"
# Document settings
DOCUMENTS_PATH = os.path.join(BASE_DIR, "..", "documents")
CHUNK_SIZE = 500
CHUNK_OVERLAP = 100
# Ollama settings
OLLAMA_BASE_URL = "http://localhost:11434"

These settings control how our system behaves. The chunk size determines how we split documents: 500-character chunks with a 100-character overlap ensure we don’t lose context between chunks.
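To see what that means in practice, here is a small illustrative snippet (not one of the project files) that runs the same splitter on a short string, with deliberately tiny chunk settings so the overlap is easy to see:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Tiny chunk settings purely for demonstration; the project uses 500/100.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=60,
    chunk_overlap=20,
    length_function=len,
)

text = (
    "Customer X is entitled to two onsite visits per year. "
    "Break-fix cover applies to a maximum of ten servers. "
    "Azure compute is supported under the premium tier only."
)

for i, chunk in enumerate(splitter.split_text(text), 1):
    print(f"Chunk {i}: {chunk!r}")

# Consecutive chunks repeat a few trailing words (up to the overlap length),
# so a detail that falls on a chunk boundary isn't lost.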

Connecting LangChain with Ollama


# src/main.py

import os
import sys
import shutil

from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Load config
sys.path.append("src")
from config import *

class DocumentQA:
    def __init__(self):
        print("🚀 Starting Advancery Document System...")
        self.llm = None
        self.embeddings = None
        self.vectorstore = None
        self.qa_chain = None

    def setup_models(self):
        """Initialize the language model and embeddings"""
        print("📡 Connecting to Ollama...")
        self.llm = OllamaLLM(model=MODEL_NAME, base_url=OLLAMA_BASE_URL, temperature=0.1)
        try:
            os.system(f"ollama pull {EMBEDDING_MODEL}")
        except Exception:
            pass
        self.embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL, base_url=OLLAMA_BASE_URL)
        print("✅ Models ready!")

    def load_documents(self):
        """Load all PDFs and split them into chunks with metadata"""
        print("📂 Loading documents...")

        if not os.path.exists(DOCUMENTS_PATH):
            os.makedirs(DOCUMENTS_PATH, exist_ok=True)
            print(f"No documents found. Add PDFs to {DOCUMENTS_PATH}")
            return []

        loader = PyPDFDirectoryLoader(DOCUMENTS_PATH)
        documents = loader.load()
        if not documents:
            print(f"No PDFs found in {DOCUMENTS_PATH}")
            return []

        print(f"Loaded {len(documents)} document pages")

        splitter = RecursiveCharacterTextSplitter(
            chunk_size=CHUNK_SIZE,
            chunk_overlap=CHUNK_OVERLAP,
            length_function=len
        )
        split_docs = splitter.split_documents(documents)

        # Preserve source filename metadata
        for doc in split_docs:
            if "source" not in doc.metadata:
                doc.metadata["source"] = "Unknown"

        print(f"Split into {len(split_docs)} chunks")
        return split_docs

    def setup_vectorstore(self, documents):
        """Rebuild vectorstore from current documents"""
        print("🔍 Setting up vector database (rebuilding)...")

        if os.path.exists(PERSIST_DIRECTORY):
            shutil.rmtree(PERSIST_DIRECTORY)

        self.vectorstore = Chroma.from_documents(
            documents=documents,
            embedding=self.embeddings,
            persist_directory=PERSIST_DIRECTORY,
            collection_name=COLLECTION_NAME
        )

        try:
            self.vectorstore.persist()
        except Exception:
            pass

        print("✅ Vector database rebuilt")
        return self.vectorstore

    def setup_qa_chain(self):
        """Create the question-answering chain"""
        print("🔗 Setting up Question chain...")
        template = """Use the following context to answer the question.
If you don't know the answer based on the context, say so.

Context: {context}
Question: {question}
Answer: """
        prompt = PromptTemplate(template=template, input_variables=["context", "question"])
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",
            retriever=self.vectorstore.as_retriever(search_kwargs={"k": 3}),
            chain_type_kwargs={"prompt": prompt},
            return_source_documents=True
        )
        print("✅ Q&A chain ready!")

    def ask_question(self, question):
        """Ask a question and return unique PDF sources"""
        if not self.qa_chain:
            print("⚠️ System not initialized. Run initialize() first.")
            return

        print(f"\n❓ Question: {question}")
        print("🤖 Thinking...")
        try:
            result = self.qa_chain({"query": question})
            answer = result.get("result") or result.get("answer") or ""
            sources = result.get("source_documents", [])

            # Deduplicate source filenames
            unique_sources = list({os.path.basename(doc.metadata.get("source", "Unknown")) for doc in sources})

            print(f"💡 Answer: {answer}")
            if unique_sources:
                print("\n📖 Sources:")
                for i, src in enumerate(unique_sources, 1):
                    print(f" {i}. {src}")

        except Exception as e:
            print(f"Error: {type(e).__name__}: {e}")

    def run_interactive(self):
        """Run interactive Q&A session"""
        print("\n" + "=" * 50)
        print("🤖 Advancery Document System Ready!")
        print("Ask questions about your documents.")
        print("Type 'quit' to exit.")
        print("=" * 50 + "\n")

        while True:
            try:
                question = input("Your question: ").strip()
            except (KeyboardInterrupt, EOFError):
                print("\nGoodbye!")
                break

            if question.lower() in ["quit", "exit", "q"]:
                print("Goodbye!")
                break
            if question:
                self.ask_question(question)
            else:
                print("Please enter a question.")

    def initialize(self):
        """Initialize system"""
        self.setup_models()
        documents = self.load_documents()
        if not documents:
            return False
        self.setup_vectorstore(documents)
        self.setup_qa_chain()
        return True

if __name__ == "__main__":
    qa_system = DocumentQA()
    if qa_system.initialize():
        qa_system.run_interactive()
    else:
        print("❌ Failed to initialize system. Add PDFs to the documents folder.")

Testing

Run the following in the terminal:

python src/main.py

 

The script should run and you should get a prompt asking for questions.

Try asking:

  • “What is this document about?”
  • “What are the main points?”
  • “Can you summarise the key findings?”

The system will search through your documents and provide answers based on the content. For my test I could ask questions about the PDFs I had loaded.

Examples

 

Web interface

You can even go one step further and create a web front end so other people on your network can access the LLM document project. We have implemented this internally as a knowledge base for staff to access key information quickly and easily.
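The HTML page below expects a small API with an /ask endpoint. Here is a minimal sketch of what src/server.py could look like using FastAPI; this is one possible implementation, and it assumes you also run pip install fastapi uvicorn (not included in the earlier install command). It simply wraps the DocumentQA class from main.py:

# src/server.py - a minimal sketch of a web API for the document Q&A system.
import os
import sys

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel

sys.path.append("src")
from main import DocumentQA

app = FastAPI()

# Allow the static HTML page to call the API from another origin.
# Tighten this to your internal hostnames in a real deployment.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

# Build the models, vector store and Q&A chain once at startup.
qa_system = DocumentQA()
qa_system.initialize()


class Question(BaseModel):
    query: str


@app.post("/ask")
def ask(question: Question):
    """Answer a question and return deduplicated source filenames."""
    if qa_system.qa_chain is None:
        return {"answer": "System not initialised. Add PDFs to the documents folder.", "sources": []}

    result = qa_system.qa_chain({"query": question.query})
    sources = list({
        os.path.basename(doc.metadata.get("source", "Unknown"))
        for doc in result.get("source_documents", [])
    })
    return {"answer": result.get("result", ""), "sources": sources}

Start it from the project root with uvicorn server:app --app-dir src --host 0.0.0.0 --port 8000, then point the fetch URL in the HTML below at that machine’s IP address and port.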

HTML Code


<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Document Q&A</title>
  <style>
    body {
      font-family: Arial, sans-serif;
      max-width: 800px;
      margin: 40px auto;
      line-height: 1.6;
    }
    input[type="text"] {
      width: 80%;
      padding: 10px;
      margin-right: 10px;
    }
    button {
      padding: 10px;
    }
    #results {
      margin-top: 20px;
    }
    .source {
      font-size: 0.9em;
      color: #555;
    }
  </style>
</head>
<body>
  <h1>Advancery Document System</h1>
  <input type="text" id="question" placeholder="Ask a question...">
  <button onclick="askQuestion()">Ask</button>

  <div id="results"></div>

  <script>
    async function askQuestion() {
      const question = document.getElementById('question').value.trim();
      if (!question) return;

      const resultsDiv = document.getElementById('results');
      resultsDiv.innerHTML = "🤖 Thinking...";

      try {
        const response = await fetch('http://YOUR-IP/ask', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ query: question })
        });

        const data = await response.json();

        resultsDiv.innerHTML = `<strong>Answer:</strong> ${data.answer}<br><br>`;
        if (data.sources && data.sources.length > 0) {
          resultsDiv.innerHTML += "<strong>Sources:</strong><ul>";
          data.sources.forEach(src => {
            resultsDiv.innerHTML += `<li class="source">${src}</li>`;
          });
          resultsDiv.innerHTML += "</ul>";
        }

      } catch (err) {
        resultsDiv.innerHTML = "❌ Error contacting server";
        console.error(err);
      }
    }
  </script>
</body>
</html>

Conclusion

The groundwork you’ve established can evolve into advanced document intelligence solutions. To recap, here’s what’s going on behind the scenes.
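In simplified form, every question follows the same path through the pieces built above:

PDFs in the documents/ folder
        │  PyPDFDirectoryLoader reads each page
        ▼
Pages split into 500-character chunks (100-character overlap)
        │  nomic-embed-text turns each chunk into an embedding (via Ollama)
        ▼
Chroma vector database on disk
        │  retriever returns the 3 most relevant chunks for the question
        ▼
Llama 3.1 (via Ollama) answers using those chunks as context
        │
        ▼
Answer plus source filenames, in the terminal or on the web page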

You now understand the key principles behind modern AI applications, while maintaining full control over your data. Begin experimenting with your own documents to discover the valuable insights they can reveal. Or, if you prefer, give us a call and we can implement one for you.