Self-Hosting Text Generation Models

Ollama makes it easier to run LLMs locally by offering a framework with pre-quantized open-source models, eliminating the need for in-depth knowledge of tools such as git or transformers. This guide explains how to set up Ollama, download Mistral NeMo, and run inference directly on your device.

Ollama is an excellent solution for beginners who want to run large language models (LLMs) locally. Unlike Hugging Face, which requires familiarity with git, transformers, and quantization, Ollama simplifies the process. It provides pre-quantized open-source models through its hub, so you can run smaller models (roughly 7 to 14 billion parameters) directly on your current machine. In this article, I will walk through setting up a local Ollama instance, downloading Mistral NeMo, and inferencing with it on your local machine.

Install Ollama and download a model

Download the Ollama installer for macOS from the Ollama website.

Run the installer and then open your terminal. Once installed, a new command, ollama, should be available on your machine. Running ollama ls will show an empty list, as shown below.

root@blogger1:/home/blogger1# ollama ls
NAME                    ID              SIZE     MODIFIED 
root@blogger1:/home/blogger1#

To install a model, we need to locate one in the Ollama library. Unlike Hugging Face, where anyone can upload models, Ollama curates top-performing open-source models and organizes them in a user-friendly library. This guide uses Mistral NeMo, a 12-billion-parameter model from Mistral AI, the French AI company that made the 7-billion-parameter class competitive with larger models through Mistral 7B. To install Mistral NeMo, run the following command:

ollama pull mistral-nemo

The command above might take several minutes, since it downloads roughly 7 GB of model weights.
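
If you later want to script this step, the Ollama Python SDK (covered toward the end of this article) can pull and list models as well. Here's a minimal sketch, assuming the SDK is installed with pip install ollama and the Ollama service is running locally:

import ollama

# Pull Mistral NeMo programmatically (equivalent to `ollama pull mistral-nemo`)
ollama.pull('mistral-nemo')

# List the locally available models to confirm the download finished
for model in ollama.list()['models']:
    print(model)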

Inferencing with Mistral NeMo

To begin inferencing with the model, run the following command:

ollama run mistral-nemo

This may take a moment to start since Ollama is loading Mistral NeMo into your GPU's memory, but once it's finished, you'll be greeted with an interface that looks like the following:

root@blogger1:/home/blogger1# ollama run mistral-nemo
>>>

Now, we can start a conversation with the model, and Ollama will automatically tokenize the input and begin generating NeMo's response.

root@blogger1:/home/blogger1# ollama run mistral-nemo
>>> Hi
Hello! How are you today? Is there anything specific you'd like to talk 
about or learn? I'm here to help. ☺

To leave the chat, type the command /bye. This exits the chat interface, and Ollama will offload the model from your GPU's memory after a period of inactivity.

Customizing Mistral NeMo

Text-to-text models expose various parameters you can customize, including the temperature and the system prompt. In Ollama, we set these values with a Modelfile. Here's how we'll do this:

nano "Python Assistant.Modelfile"

This opens an editor where we can begin crafting our Modelfile. Every Modelfile starts with a FROM keyword, which tells Ollama which model to run. After that, we can set parameters with the PARAMETER keyword; in this example, I set the temperature to 0.3. Ollama also lets us change the system prompt, which shapes how the model responds. In this example, I took the Python bug buster prompt from the Anthropic system prompts library. Another great place to find system prompts is LangSmith.

FROM mistral-nemo
PARAMETER temperature 0.3
SYSTEM """
Your task is to analyze the provided Python code snippet, identify any bugs or errors present, and provide a corrected version of the code that resolves these issues. Explain the problems you found in the original code and how your fixes address them. The corrected code should be functional, efficient, and adhere to best practices in Python programming.
"""

To create our model, run the following command:

ollama create py-assist -f "Python Assistant.Modelfile"

To run our model, we use the same command as before, but instead of using mistral-nemo we'll use py-assist.

ollama run py-assist

Now, when we send the model Python code, we get a much more helpful response back:

root@blogger1:/home/blogger1# ollama run py-assist
>>> def calculate_average(nums):
...     sum = 0
...     for num in nums:
...         sum += num
...     average = sum / len(nums)
...     return average
...
... numbers = [10, 20, 30, 40, 50]
... result = calculate_average(numbers)
... print("The average is:", results)
There are a couple of issues with the original code:

1. The variable name `sum` shadows the built-in Python function `sum()`. This can lead to unexpected behavior or
errors because you're not actually using the built-in function, but rather your custom variable.

2. There's a typo in the print statement. It should be `result` instead of `results`.

Here's the corrected version of the code:

```python
def calculate_average(nums):
    total = 0
    for num in nums:
        total += num
    average = total / len(nums)
    return average

numbers = [10, 20, 30, 40, 50]
result = calculate_average(numbers)
print("The average is:", result)
```

In the corrected code:

- I've renamed `sum` to `total` to avoid shadowing the built-in function.
- I've fixed the typo in the print statement by changing `results` to `result`.
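
A Modelfile isn't the only way to customize a model. The same settings can also be applied per request through the Ollama Python SDK (introduced in the next section), which accepts a system message and an options dictionary. Here's a minimal sketch under that assumption, reusing the temperature from the Modelfile above and a shortened version of its system prompt:

import ollama

# Shortened version of the Modelfile's SYSTEM block
SYSTEM_PROMPT = (
    "Your task is to analyze the provided Python code snippet, identify any bugs "
    "or errors present, and provide a corrected version of the code."
)

response = ollama.chat(
    model='mistral-nemo',
    messages=[
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': 'print("The total is:" total)'},  # deliberately buggy snippet
    ],
    options={'temperature': 0.3},  # same value as the PARAMETER line above
)
print(response['message']['content'])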

Using the API

Text-to-text models are useful well beyond chatting. Sometimes you want to use them for content filtering; other times you want to build your own chat UI, or connect a model to your IDE. Ollama supports these use cases with an API and SDKs that you can wire into your own applications. To begin using your local API, you'll need your machine's IP address, which you can find by running:

ip addr

Once you have your IP address, you'll need to open port 11434. On Ubuntu, you can do this with ufw:

ufw allow 11434

Once you have opened port 11434, you can connect to the Ollama API at http://<your-ip>:11434/api. Note that Ollama listens only on localhost by default, so to accept requests from other machines you may also need to set the OLLAMA_HOST environment variable to 0.0.0.0 before starting the service. Here is an example chat request to your Ollama API:

root@blogger1:/home/blogger1# curl http://<your-ip>:11434/api/chat -d '{
  "model": "mistral-nemo",
  "stream": false,
  "messages": [
    { "role": "user", "content": "Hello, world!" }
  ]
}'

Replace <your-ip> with the IP address of your machine. The "stream": false field tells Ollama to return a single JSON object; without it, the API streams the response as a series of JSON chunks.

You should get a response similar to the following:

{
  "model": "mistral-nemo",
  "created_at": "2025-00-00T00:00:00.000000Z",
  "message": {
    "role": "assistant",
    "content": "Hello! How can I assist you today? Let me know if you have any questions or topics you'd like to discuss. 😊"
  },
  "done": true,
  "total_duration": 442612,
  "load_duration": 0,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 383809000,
  "eval_count": 298,
  "eval_duration": 4799921000
}
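
If you'd rather call the HTTP API from Python without the official SDK, any HTTP client will do. Here's a minimal sketch using the requests library, assuming Ollama is reachable at the address and port used above; replace the host with your own IP:

import requests

# Replace <your-ip> with your machine's IP address
OLLAMA_URL = 'http://<your-ip>:11434/api/chat'

payload = {
    'model': 'mistral-nemo',
    'stream': False,  # return one JSON object instead of a stream of chunks
    'messages': [
        {'role': 'user', 'content': 'Hello, world!'},
    ],
}

resp = requests.post(OLLAMA_URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()['message']['content'])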

Ollama also offers Python and JavaScript SDKs. You might use one of these if you want the model to act more like an agent, executing code through tool calling. While tool calling and agent loops are better implemented in production with a framework like LangGraph, Ollama performs well for simple tasks using the tool-calling tokens the model provides.

import ollama

response = ollama.chat(
    model='mistral-nemo',
    messages=[{'role': 'user', 'content':
        'What is the weather in Strasbourg, France?'}],
    tools=[{
      'type': 'function',
      'function': {
        'name': 'get_weather',
        'description': 'Gets the current weather for a city',
        'parameters': {
          'type': 'object',
          'properties': {
            'city': {
              'type': 'string',
              'description': 'The name of the city',
            },
          },
          'required': ['city'],
        },
      },
    },
  ],
)

print(response['message']['tool_calls'])

This will return the following response:

[{'function': {'name': 'get_weather', 'arguments': {'city': 'Strasbourg'}}}]
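
The SDK only reports which tool the model wants to call; actually executing it is up to you. Here's a minimal sketch of that next step, assuming a hypothetical get_weather implementation and the same tool schema as above; the tool's output is appended as a 'tool' role message so the model can produce a final answer on a second call:

import ollama

def get_weather(city: str) -> str:
    # Hypothetical implementation; a real version would call a weather API
    return f"It is 18°C and partly cloudy in {city}."

# Same tool schema as in the previous snippet
tools = [{
    'type': 'function',
    'function': {
        'name': 'get_weather',
        'description': 'Gets the current weather for a city',
        'parameters': {
            'type': 'object',
            'properties': {'city': {'type': 'string', 'description': 'The name of the city'}},
            'required': ['city'],
        },
    },
}]

messages = [{'role': 'user', 'content': 'What is the weather in Strasbourg, France?'}]
response = ollama.chat(model='mistral-nemo', messages=messages, tools=tools)

# Run each requested tool call and append its output as a 'tool' message
messages.append(response['message'])
for call in response['message']['tool_calls']:
    if call['function']['name'] == 'get_weather':
        result = get_weather(**call['function']['arguments'])
        messages.append({'role': 'tool', 'content': result})

# A second call lets the model turn the tool output into a final answer
final = ollama.chat(model='mistral-nemo', messages=messages)
print(final['message']['content'])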

Conclusion

While Ollama may not be production-ready, and likely never will be given missing features like request concurrency and scalability, it remains a powerful framework for people self-hosting models, developers who want to test and implement features quickly, and anyone looking to become more familiar with LLMs. The extensive community around Ollama has fostered numerous integrations and use cases, such as OpenWebUI, which lets you self-host your own ChatGPT-like website. Ollama won't suit everyone, though, especially those with robust hardware, which is why I still prefer vLLM for my own needs. Even so, I genuinely can't think of a better tool for anyone eager to experiment with the latest open-source models from Mistral, Meta, and Microsoft. Ollama is also cross-platform: installation on Linux is straightforward, since the installer handles most Nvidia, Intel, and AMD drivers, and macOS is supported as well. If you have an M-series chip, you can take advantage of the unified memory shared between CPU and GPU, which is ideal for ML workloads.

If you liked this article, you won't want to miss my guide on locally training a text classification model with Hugging Face datasets. Hugging Face is where most of the models in the Ollama library are originally published. Trust me, Hugging Face is an important technology to understand since it's the leading repository for AI models and datasets. Just click here to read it now, and I'll see you there shortly. Cheers!