Running Local LLMs
It's been a while since I've had a technology topic I felt strongly enough about to write on. After some experimentation with Ollama and testing out local LLMs, the spark returned. As of this writing, August 2025, the novelty of ChatGPT has worn off. What most people don't realize about LLMs is that the ChatGPT experience of mid-2023 can now be had on your desktop or laptop computer! Why is this important? Because as LLMs get better and better, running local language models will become more and more viable. In addition, local models increase privacy for those who use and find value in LLMs. It's my strong opinion that a niche has opened in the AI space for those who understand local LLMs and can find use cases for them.
My contribution to this technology shift is to show that with modest hardware, individuals can still get a GPT-4-level experience, as it existed circa mid-2023. Additionally, I'll share what I observed in my home lab with local LLMs ranging from 8B to 70B parameters. The goal is to help people build a mental map between model size and the hardware required to run it.
What this blog won't do is give you step-by-step guidance for installing LLM software such as Ollama or Open WebUI. You can find installation guides via Google, YouTube, or your preferred cloud-based LLM. Once you install Ollama, which is available on Windows, macOS, and Linux, you can download LLMs of various sizes to run on your computer.
For ease, I’ll list one guide here for installing Ollama: https://www.youtube.com/watch?v=3W-trR0ROUY
The Ollama website can be found here: https://ollama.com/
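To give a flavor of how simple it is once Ollama is installed, downloading and chatting with a model takes two commands (shown here with the 8B model featured later in this post):

ollama pull qwen3:8b    # download the model weights
ollama run qwen3:8b     # open an interactive chat session with it
ollama list             # show every model you've downloaded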
With Ollama out of the way, let's talk about compute resources. For the last few weeks I've tested various language models on three pieces of computer hardware I own. To put my performance numbers into context, I think it's important to list the specifications of those computers.
First, I have a desktop PC I built back in 2010. I wrote about my experience building PCs and why I think such projects can help people in their IT careers: https://networknative.blogspot.com/2023/07/my-pc-build-why-building-your-own-pc.html
December 2010 PC specs as of August 2025:
The Nvidia GPU was about $230; the computer itself is 15 years old, and a newer PC of equal compute power could be had for around $500 or less.
Intel Core i7-950 (3.06 GHz, quad core with Hyper-Threading)
Asus Sabertooth X58 motherboard (PCIe Gen 3)
12 GB DDR3 memory
Nvidia GeForce RTX 3060 video card (12 GB VRAM)
120 GB SSD connected via SATA
Asus DVD/CD burner drive
Next, the PC I built more recently, in December 2022:
Estimated at $1,900 as of August 2025 ($2,300 in December 2022)
AMD Ryzen 7 7700X (8 cores / 16 threads at 5.5 GHz)
MSI MAG B650 Tomahawk WiFi motherboard (PCIe Gen 5)
32 GB DDR5 memory
Nvidia GeForce RTX 3080 video card (10 GB VRAM)
2 TB NVMe M.2 storage
Fractal Design Celsius S36 AIO cooler
Finally, I have a home lab server I used to teach myself VMware NSX. There are no GPUs here, but it still provides value when it comes to understanding what's needed for running local LLMs.
Dell PowerEdge R630 with:
~$500 on eBay as of August 2025
2 × Intel Xeon E5-2660 v3 (10 cores / 20 threads each with Hyper-Threading)
PCIe Gen 3 motherboard
20 physical cores / 40 logical CPUs total with Hyper-Threading
128 GiB RAM
2 NUMA nodes, one per socket
Notably, no GPU.
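One practical tip before the numbers: once a model is loaded, Ollama's ps command reports the model's in-memory size and how the load is split between CPU and GPU, which is the quickest way to tell whether a model actually fit in VRAM or partially spilled to system RAM:

ollama ps    # lists loaded models with their size and CPU/GPU split

Partial spillover matters: a model that doesn't quite fit in VRAM can produce slower token rates on a nominally faster GPU, which is worth keeping in mind when reading the results below.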
After two weeks of experimenting, the local LLM I like best for running on my primary desktop computer is Qwen3 from Alibaba, followed closely by Gemma3 from Google DeepMind, because Gemma3 is also a vision model. That said, let's dig into the numbers and my testing of various local LLMs. To better understand the memory footprint of an LLM, I'll be using Qwen3 8B, Qwen3 14B, Qwen3 30B, and DeepSeek-R1 70B, the last of which runs only on the one machine I have with enough memory for it. Let's see how it turned out.
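For reference, these are the exact Ollama tags I'm testing with throughout the rest of this post; note that deepseek-r1:70b is a roughly 40+ GB download at its default quantization (size approximate), so plan accordingly:

ollama pull qwen3:8b
ollama pull qwen3:14b
ollama pull qwen3:30b
ollama pull deepseek-r1:70b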
Let's look at a benchmark comparing Qwen3 30B to other versions of Qwen3 (thinking vs. non-thinking), Gemini 2.5 Flash, and GPT-4o. We can see that Qwen3 is competitive with both OpenAI's and Google's cloud-hosted LLMs.
So, how much compute does each model consume? Well, it depends on model size and quantization. The Qwen models I'm using today are quantized to Q4_K_M with a context window of 40,960 tokens. Another reason Qwen3 30B performs so well is its architecture: MoE, or Mixture of Experts. During inference, only a fraction of the model's neural network is activated for each token (Qwen3 30B activates roughly 3B of its 30B parameters), which keeps generation fast for the model's size.
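Both claims are easy to check yourself: ollama show prints a model's parameter count, context length, and quantization. And as a back-of-envelope sketch, assuming Q4_K_M averages roughly 4.5 bits per weight, the weights alone for a 30B model come to about 17 GB:

ollama show qwen3:30b            # prints architecture, parameter count, context length, quantization
echo "30 * 4.5 / 8" | bc -l      # ≈ 16.9 GB of weights: 30B params × ~4.5 bits each ÷ 8 bits/byte

The KV cache for the context window adds to that total, so actual memory use runs higher than the weight estimate.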
I'll be comparing Qwen3 8B, 14B, and 30B, plus DeepSeek-R1 70B, on both token rate and memory footprint. We've all watched Nvidia's stock price rocket over the last three years, making it the most valuable company in the world. This will partly be a demonstration of why, as you observe the difference between LLM performance on a CPU versus a GPU.
For the purpose of a consistent token rate test, I'll use the following prompt with each local LLM: “Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril.”
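Rather than pasting the prompt by hand each time, the runs can be scripted. A minimal bash sketch that loops the same prompt over the three Qwen models:

for m in qwen3:8b qwen3:14b qwen3:30b; do
  ollama run "$m" --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."
done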
I ran the prompt on all three machines:
2010 PC (Nvidia RTX 3060 – 12 GB VRAM), Ubuntu Linux
2022 PC (Nvidia RTX 3080 – 10 GB VRAM), Windows 11 (WSL)
Dell PowerEdge R630 (no GPU), Ubuntu Linux
2022 PC (Nvidia RTX 3080 – 10 GB VRAM), Windows 11 (WSL)
ollama run qwen3:8b --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."
total duration: 16.5389157s
load duration: 1.4681324s
prompt eval count: 36 token(s)
prompt eval duration: 109.5674ms
prompt eval rate: 328.56 tokens/s
eval count: 1344 token(s)
eval duration: 14.960607s
eval rate: 89.84 tokens/s
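A quick note on reading Ollama's --verbose stats: "prompt eval" covers ingesting the 36-token prompt, "eval" covers generating the response, and the eval rate is simply eval count divided by eval duration, which you can verify from the numbers above:

echo "1344 / 14.960607" | bc -l    # ≈ 89.84 tokens/s, matching the reported eval rate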
ollama run qwen3:14b --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."
total duration: 2m17.3188573s
load duration: 4.4674632s
prompt eval count: 36 token(s)
prompt eval duration: 232.6438ms
prompt eval rate: 154.74 tokens/s
eval count: 1694 token(s)
eval duration: 2m12.6181516s
eval rate: 12.77 tokens/s

ollama run qwen3:30b --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."
total duration: 2m32.9570292s
load duration: 10.3321836s
prompt eval count: 36 token(s)
prompt eval duration: 1.1533504s
prompt eval rate: 31.21 tokens/s
eval count: 1922 token(s)
eval duration: 2m21.4709785s
eval rate: 13.59 tokens/s
2010 PC (Nvidia RTX 3060 – 12 GB VRAM), Ubuntu Linux
ollama run qwen3:8b --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."
total duration: 47.849802496s
load duration: 13.68441981s
prompt eval count: 36 token(s)
prompt eval duration: 608.403072ms
prompt eval rate: 59.17 tokens/s
eval count: 1612 token(s)
eval duration: 33.555681261s
eval rate: 48.04 tokens/s
ollama run qwen3:14b --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."
total duration: 1m26.329137766s
load duration: 27.393260443s
prompt eval count: 36 token(s)
prompt eval duration: 311.143989ms
prompt eval rate: 115.70 tokens/s
eval count: 1704 token(s)
eval duration: 58.623095437s
eval rate: 29.07 tokens/s
ollama run qwen3:30b --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."
total duration: 10m50.742991014s
load duration: 1m3.536882378s
prompt eval count: 36 token(s)
prompt eval duration: 3.03080791s
prompt eval rate: 11.88 tokens/s
eval count: 1740 token(s)
eval duration: 9m44.11665244s
eval rate: 2.98 tokens/s
Dell PowerEdge R630
ollama run qwen3:8b --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."
total duration: 11m59.928418343s
load duration: 13.798537341s
prompt eval count: 36 token(s)
prompt eval duration: 2.372850941s
prompt eval rate: 15.17 tokens/s
eval count: 1440 token(s)
eval duration: 11m43.753674245s
eval rate: 2.05 tokens/s

ollama run deepseek-r1:70b --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."
total duration: 56m35.155970193s
load duration: 6m22.306749506s
prompt eval count: 29 token(s)
prompt eval duration: 21.512059957s
prompt eval rate: 1.35 tokens/s
eval count: 1243 token(s)
eval duration: 49m51.334762972s
eval rate: 0.42 tokens/s