Experiments running offline LLMs locally in Python and Rust, using Ollama and llama.cpp
A collection of local AI experiments that should run on a recent home computer. The examples use local Ollama and llama.cpp servers to run completion and chat tasks with Gemma4 and Mistral models. Code is written in Python and Rust, and each example has a short description detailing how to download the model and run the code.
Ollama is an open-source tool for running large language models locally, well suited to rapid prototyping, education, and research in artificial intelligence (AI). Ollama:
- has a simple API;
- does not require a Python environment; and
- has a model library, making it easy to discover and download new models.
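As an illustration of how simple the API is, here is a minimal Python sketch of a completion request to a local Ollama server. It assumes Ollama is running on its default port (11434) and that a model such as mistral has already been pulled; it is a sketch, not code from this repository's examples.

```python
import json
import urllib.request

# Ollama listens on localhost:11434 by default.
OLLAMA_URL = "http://localhost:11434/api/generate"

payload = {
    "model": "mistral",  # any model tag you have pulled with Ollama
    "prompt": "Explain quantisation in one sentence.",
    "stream": False,  # return the whole completion in a single JSON object
}

request = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    body = json.loads(response.read())

print(body["response"])
```

Setting stream to False keeps the sketch short; the examples in this repository may stream tokens instead.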
llama.cpp is a C/C++ inference library originally built for LLaMA-family models. It:
- is extremely memory-efficient thanks to quantisation;
- works well on CPU-only setups; and
- is available as a library (and bundled HTTP server) for integration with other applications.
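By way of illustration, the sketch below posts a completion request to the bundled HTTP server's /completion endpoint from Python. It assumes llama-server is already running on its default port (8080) with a GGUF model loaded; it is only a sketch, not code from this repository.

```python
import json
import urllib.request

# llama-server listens on localhost:8080 by default.
LLAMA_CPP_URL = "http://localhost:8080/completion"

payload = {
    "prompt": "The three main benefits of quantisation are",
    "n_predict": 64,  # maximum number of tokens to generate
}

request = urllib.request.Request(
    LLAMA_CPP_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    body = json.loads(response.read())

print(body["content"])
```

The same server also exposes an OpenAI-compatible /v1/chat/completions endpoint, which many client libraries can target directly.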
Source: Running Local LLMs
- Introduction
- Setup
- Examples
- Why Run Local LLMs?
- Issues and Support
- Contributions
- Acknowledgements
- License
The examples run on Ollama or llama.cpp. Here’s a quick guide to getting them set up on macOS with Homebrew. Follow the links for more detailed instructions and for other operating systems. You will also need Rust or Python set up on your system (depending on which examples you want to run).
brew install ollama
For other operating systems, or more details, see the Official Ollama Quickstart Guide.
brew install llama.cpp
For other operating systems, or more details, see the LLaMA.cpp HTTP Server Quick Start Guide.
Nothing to install beyond the prerequisites.
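Before running the examples, it can be useful to confirm the local servers are reachable. The following Python sketch assumes the default local ports (11434 for Ollama, 8080 for llama-server); adjust the URLs if you start the servers differently.

```python
import urllib.error
import urllib.request

# Default local endpoints; change these if you run the servers on other ports.
SERVERS = {
    "Ollama": "http://localhost:11434/",
    "llama.cpp server": "http://localhost:8080/health",
}

for name, url in SERVERS.items():
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            print(f"{name}: reachable (HTTP {response.status})")
    except OSError as error:  # includes urllib.error.URLError
        print(f"{name}: not reachable ({error})")
```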
- llamacpp-gemma4-e4b-completion
- Gemma4 LLM completion demo calling a local llama.cpp server from Rust code.
- llamacpp_tts
- Large Language Model text-to-speech (TTS) demo with voice cloning.
- ollama-mistral-instruct-chat
- Mistral instruct chat demo calling a local Ollama server (a rough request sketch follows the examples list).
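As a rough illustration of the kind of request the chat example above makes (not its actual code), here is a minimal Python sketch posting a single chat turn to Ollama's /api/chat endpoint. It assumes a Mistral instruct model has already been pulled locally.

```python
import json
import urllib.request

# Ollama's chat endpoint accepts an OpenAI-style list of messages.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

payload = {
    "model": "mistral",  # use whichever Mistral instruct tag you pulled locally
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is a GGUF file?"},
    ],
    "stream": False,
}

request = urllib.request.Request(
    OLLAMA_CHAT_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    body = json.loads(response.read())

print(body["message"]["content"])
```

The messages list follows the familiar role/content format, so multi-turn chat just means appending the assistant's reply and the next user message before the following request.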
Why run local LLMs?
- Data sovereignty: you have more control over your data.
- Offline support: great if you have an unstable connection or are temporarily offline.
- Model fine-tuning: you also have more control over the model you run.
You don't need the latest GPU: with llama.cpp or Ollama, smaller models (up to around 7 billion parameters) can run comfortably on a typical home computer.
For balance, though, running locally you pay the one-off cost of downloading the model you want to run, and you might not be able to run the largest models, depending on your machine’s spec. A cloud service would also be more scalable if you needed to step up model usage.
Open an issue if something does not work as expected or if you have suggestions for improvements.
Feel free to jump into the Rodney Lab matrix chat room.
New feature suggestions are always welcome and will be considered, though please keep in mind that some of them may be out of scope for what the project is trying to achieve (or is reasonably capable of). If you have an idea for a new feature and would like to share it, you can create a feature request.
Feature requests are tagged with one of the following:
- Roadmap - will be implemented in a future release
- Backlog - may be implemented in the future but needs further feedback or interest from the community
- Icebox - no plans to implement as it doesn't currently align with the project's goals or capabilities; may be revised at a later date

Contributions are welcome; write a short issue with your idea before spending too much time on more involved additions.
- Before working on a new feature, it's preferable to submit a feature request first and state that you'd like to implement it yourself
- Please don't submit PRs for feature requests that are either in the roadmap[1], backlog[2] or icebox[3]
- Avoid introducing new dependencies
- Avoid making backwards-incompatible configuration changes
[1] The feature likely already has work put into it that may conflict with your implementation
[2] The demand, implementation or functionality for this feature is not yet clear
[3] No plans to add this feature for the time being
Inspired by:
The project is licensed under the BSD 3-Clause License; see the LICENSE file for details.
