Ollama Benchmarks: The Server (GPU) vs The Laptop (CPU)
Intro
This post collects initial benchmarks of Ollama running LLM inference on my server and my laptop: the server armed with a Radeon RX 6900 XT GPU, and the laptop using CPU-only processing. Both machines run Arch Linux, and ROCm provides the AMD GPU acceleration.
The benchmark focuses on token generation speed (tokens/s) across several models.
The Setup
- The Server (GPU):
- Radeon RX 6900 XT
- 16GB GDDR6 VRAM (~448 GB/s memory bandwidth)
- The Laptop (CPU):
- 11th Gen Intel i7-1185G7 @ 3.00GHz
- 32GB DDR4 RAM (~26 GB/s memory bandwidth)
- OS & Setup:
- Arch Linux with ROCm for GPU acceleration (see the Arch Wiki); a quick sanity check is sketched below
- Ollama v0.4.2
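For reference, here is a minimal sketch of how this setup can be sanity-checked, assuming the ROCm packages and Ollama are installed from the usual Arch repositories:

```sh
# Confirm ROCm detects the GPU; the RX 6900 XT reports as gfx1030.
rocminfo | grep -i gfx

# Confirm the Ollama version used for these runs.
ollama --version

# With a model loaded, `ollama ps` shows whether it is running
# on the GPU or on the CPU (the PROCESSOR column).
ollama ps
```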
Benchmark Results
Moving from the Intel i7 CPU to the Radeon GPU yielded a roughly 1.4x–2.1x speedup, with the larger models generally seeing the bigger gains (qwen2.5-coder:7b being the exception).
| Model | GPU (tokens/s) | CPU (tokens/s) | GPU/CPU ratio |
|---|---|---|---|
| llama3.2:1b | 33 | 24 | 1.375 |
| llama3.2:3b | 21 | 14 | 1.5 |
| llama3.1:8b | 15 | 7 | 2.1 |
| qwen2.5-coder:7b | 11 | 8 | 1.375 |
Methods
The benchmark was run with `ollama run MODEL` and the prompt “Please recite Neal Stephenson’s ‘In the Beginning was the Command Line’”.
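Ollama's `--verbose` flag prints timing statistics after each response, including the generation speed as an "eval rate" in tokens/s. A minimal sketch of how such numbers can be collected across the four models (not necessarily the exact procedure used for the table above):

```sh
#!/bin/sh
# Run each benchmarked model once with the same prompt and pull out
# the timing lines that --verbose prints.
PROMPT='Please recite Neal Stephenson’s “In the Beginning was the Command Line”.'

for model in llama3.2:1b llama3.2:3b llama3.1:8b qwen2.5-coder:7b; do
  echo "== $model =="
  # "eval rate" is the generation speed; "prompt eval rate" (prompt
  # processing speed) also matches this grep.
  ollama run "$model" "$PROMPT" --verbose 2>&1 | grep 'eval rate'
done
```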
The quality of the answers wasn't evaluated, though I'll note that none of the models did well, and the responses varied widely, even for the same model.
Future Work
This post can be improved with:
- increased model coverage, particularly llama 3.2 vision
- a time-to-first-token benchmark (a possible starting point is sketched below)
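For that future measurement, Ollama's HTTP API looks like a reasonable route: with `"stream": false`, `/api/generate` returns per-request counters such as `load_duration`, `prompt_eval_duration`, `eval_count`, and `eval_duration` (durations in nanoseconds), from which both generation speed and an approximate time to first token can be derived. A sketch, assuming the server listens on the default port 11434 and `jq` is installed:

```sh
# Generation speed in tokens/s, plus an approximation of time to first
# token (model load + prompt processing), from one non-streaming request.
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Please recite Neal Stephenson’s “In the Beginning was the Command Line”.",
  "stream": false
}' | jq '{
  tokens_per_s: (.eval_count / .eval_duration * 1e9),
  approx_ttft_s: ((.load_duration + .prompt_eval_duration) / 1e9)
}'
```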