Unless you’ve been living under a rock, you’ve seen how OpenAI’s ChatGPT has revolutionized many industries and changed how people fetch knowledge from computer systems. One interesting aspect of this approach is that every query and piece of data you send to ChatGPT travels to OpenAI’s servers, which may introduce privacy concerns. That creates a desire to self-host an equivalent large language model (LLM), so you can be sure the data you send to the model stays local and safe.

The tool Ollama provides an incredibly easy backend for downloading models, switching between them, and interacting with them.

I’m partial to running software in a Dockerized environment, specifically with Docker Compose. For that reason, I typically run Ollama with this docker-compose.yaml file:

version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
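    # Expose Ollama's HTTP API on its default port, 11434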
    ports:
      - 11434:11434
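    # Persist downloaded models and configuration on the host so they survive container restarts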
    volumes:
      - /home/ghilston/config/ollama:/root/.ollama
    tty: true
    # If you have an Nvidia GPU, define this section, otherwise remove it to use
    # your CPU
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      # Assumes curl is available inside the image; /api/version is a lightweight
      # endpoint that returns the running Ollama version
      test: ["CMD-SHELL", "curl -f http://localhost:11434/api/version || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped

Then I can start Ollama by running $ docker compose up, which will download the image from the remote repository and start the container.
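Before you can chat, a model needs to be pulled into the container. Here is a minimal sketch, assuming the container name ollama from the compose file above and the mistral:7b model used later in this post:

$ docker compose up -d                        # start Ollama in the background
$ docker compose logs -f ollama               # watch startup (and GPU detection) logs
$ docker exec ollama ollama pull mistral:7b   # download a model into the mounted volume
$ docker exec ollama ollama list              # confirm the model is available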

Warning
If you are running on a Mac, be aware that the GPU is not accessible to Docker containers. For GPU support, install the native application instead.
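One way to do that, assuming you use Homebrew, is to install the CLI formula and run the server natively, which lets Ollama use the Mac’s GPU:

$ brew install ollama
$ ollama serve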

When the application starts, take note of whether Ollama detected your GPU, if you have one. You’ll see output that looks like this:

time=2024-04-13T02:17:59.463Z level=INFO source=gpu.go:268 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-04-13T02:17:59.463Z level=INFO source=gpu.go:314 msg="Discovered GPU libraries: []"
time=2024-04-13T02:17:59.463Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-13T02:17:59.463Z level=INFO source=routes.go:1164 msg="no GPU detected"
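If you expected a GPU but see "no GPU detected", one quick sanity check (assuming an Nvidia card and the NVIDIA Container Toolkit installed on the host) is to confirm the GPU is visible inside the container at all:

$ docker exec ollama nvidia-smi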

Once running, Ollama can be used in one of two ways:

Directly, at the command line interactively

When Ollama is run with a command like this: $ ollama run mistral:7b, you’ll get a prompt of:

>>> Send a message (/? for help)

At this point, you can interact with the selected mistral:7b model by simply typing in your message. For example, this interaction is possible:

>>> Write me a Haiku about why large language models are so cool
Model's vast expanse,
Speaking words with grace and sense,
Intellect takes flight.

>>>

From here, you’re able to interact with the self-hosted LLM with ease.
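Since I run Ollama inside the Docker container defined above, I reach this prompt by attaching to the container; the sketch below assumes the container name ollama from the compose file. Typing /bye exits the interactive session:

$ docker exec -it ollama ollama run mistral:7b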

As a backend for other applications

Ollama also exposes a REST API that other applications can take advantage of. If this interests you, see my other posts in this series.
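To give a quick taste, here is a minimal sketch of calling the API with curl, assuming the container from above is reachable on localhost:11434 and that mistral:7b has already been pulled. Setting stream to false returns the whole completion as a single JSON object instead of a stream of chunks:

$ curl http://localhost:11434/api/generate -d '{
  "model": "mistral:7b",
  "prompt": "Write me a haiku about why large language models are so cool",
  "stream": false
}'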