Ollama Benchmarks: The Server (GPU) vs The Laptop (CPU)

Posted on Nov 16, 2024

Intro

This post collects initial benchmarks of Ollama running LLM inference on my server and my laptop: the server is armed with a Radeon RX 6900 XT GPU, while the laptop uses CPU-only processing. Both setups run Arch Linux, and ROCm provides AMD GPU acceleration on the server.

The benchmark focuses on token generation speeds (tokens/s) for various models.

The Setup

  • The Server (GPU):
    • Radeon RX 6900 XT
    • 16GB GDDR6 RAM (~448 GB/s)
  • The Laptop (CPU):
    • 11th Gen Intel i7-1185G7 @ 3.00GHz
    • 32GB DDR4 RAM (~26 GB/s)
  • OS & Setup:
    • Arch Linux with ROCm for GPU acceleration (see archwiki)
    • ollama v0.4.2

Benchmark Results

Moving from the Intel i7 CPU to the Radeon GPU gave a speedup of roughly 38% to 114%, with greater gains generally coming from the larger models (qwen2.5-coder:7b being the exception).

Model               GPU   CPU   Ratio
llama3.2:1b         33    24    1.375
llama3.2:3b         21    14    1.5
llama3.1:8b         15    7     2.1
qwen2.5-coder:7b    11    8     1.375

GPU and CPU values are in tokens/s. Ratio is GPU divided by CPU.
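For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the ratio column from the GPU and CPU numbers in the table. The names results and speedup are just illustrative, and the inputs are the rounded tokens/s values above, so the percentages are approximate.

```python
# Tokens/s from the table above: model -> (GPU, CPU).
results = {
    "llama3.2:1b": (33, 24),
    "llama3.2:3b": (21, 14),
    "llama3.1:8b": (15, 7),
    "qwen2.5-coder:7b": (11, 8),
}


def speedup(gpu_tps, cpu_tps):
    """Return how many times faster the GPU run was than the CPU run."""
    return gpu_tps / cpu_tps


ratios = {model: speedup(gpu, cpu) for model, (gpu, cpu) in results.items()}

for model, ratio in ratios.items():
    # (ratio - 1) * 100 turns a ratio into a "percent faster" figure.
    print(f"{model:<18} {ratio:.3f}x  ({(ratio - 1) * 100:.0f}% faster)")
```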

12/5/24 update: qwq could only run 71% of its inference on the GPU, with the remaining 29% on the CPU, so I'm leaving it out of the comparison.

Methods

Each benchmark was run using ollama run MODEL with the prompt: Please recite Neal Stephenson’s “In the Beginning was the Command Line”.

The quality of the answer wasn’t evaluated, though I’ll say none did well and the responses were quite diverse, even for the same model.
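The numbers above came from ollama run, not from a script. For repeatable runs, though, the same measurement could probably be taken through Ollama's HTTP API. The sketch below is one way to do that, not what I actually ran. It assumes the server is listening on the default local port 11434 and relies on the eval_count and eval_duration (nanoseconds) fields that /api/generate returns. The function names are my own.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(eval_count, eval_duration_ns):
    """Generation speed from Ollama's counters (duration is in nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)


def benchmark(model, prompt):
    """Run one non-streamed generation and return its tokens/s."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        stats = json.load(response)
    return tokens_per_second(stats["eval_count"], stats["eval_duration"])
```

Calling benchmark("llama3.2:1b", prompt) for each model, a few times each, would also give a feel for the run-to-run variance.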

Future Work

This post can be improved with:

  • increased model coverage, particularly llama 3.2 vision
  • time to first token benchmark
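As a starting point for the time to first token item, here is a rough Python sketch. It assumes the streaming form of Ollama's /api/generate endpoint, which returns newline-delimited JSON chunks with a response field, on the default local port. The helper names first_token_delay and measure_ttft are hypothetical. Note that the measured delay would include model load time if the model isn't already in memory.

```python
import json
import time
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def first_token_delay(chunks, start, clock=time.perf_counter):
    """Seconds from `start` until the first streamed chunk that carries text.

    `chunks` is any iterable of JSON lines (bytes or str). Returns None if
    no chunk contains generated text.
    """
    for raw in chunks:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        if json.loads(raw).get("response"):
            return clock() - start
    return None


def measure_ttft(model, prompt):
    """Send a streaming request and time the arrival of the first token."""
    body = json.dumps({"model": model, "prompt": prompt}).encode()
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        # Iterating an HTTP response yields it line by line.
        return first_token_delay(response, start)
```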