Running Local LLMs



It's been a while since I've had a topic in technology that I felt strongly enough to write about. After some experimentation with Ollama and testing local LLMs, I felt the spark return. When this blog post was originally written, in August 2025, the novelty of ChatGPT had worn off. What most people don't realize about LLMs is that the ChatGPT experience of mid-2023 can now be had on your desktop or laptop computer. Why is this important? It's important because as LLMs get better and better, running local language models will become more and more viable. In addition, it increases privacy for those who use and find value in LLMs. It's my strong opinion that a niche has opened in the AI space for those who can understand and find use cases for local LLMs.

My unique contribution to this technology shift is to show that with modest hardware, individuals can still access a GPT-4-level experience as it existed circa mid-2023. Additionally, I'll show what I observed in my home lab with local LLMs of various sizes, from 8B to 70B parameters. The goal is to help people build a mental map between LLM size and the hardware required to run such models.

What this blog post won't do is give you step-by-step guidance for installing LLM software such as Ollama or Open WebUI. You can find out how to install these tools by searching Google or YouTube, or by asking your preferred cloud-based LLM. Once you install Ollama, which is available on Windows, macOS, and Linux, you can download LLMs of various sizes to run on your computer.

For ease, I’ll list one guide here for installing Ollama: https://www.youtube.com/watch?v=3W-trR0ROUY

The Ollama website can be found here: https://ollama.com/
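Once Ollama is installed, the basic workflow is only a few commands. As a minimal sketch, using one of the model tags I test later in this post:

ollama pull qwen3:8b     # download the model weights to your machine
ollama run qwen3:8b      # start an interactive chat session with the model
ollama list              # show which models you have downloaded locally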

With Ollama out of the way, let's talk about compute resources. For the last few weeks I've tested various language models on three pieces of computer hardware I own. To put my performance numbers into context, I think it's important to list the specifications of those computers.

First, I have a desktop PC I built back in 2010. I wrote about my experience building PCs and why I think such projects can help people in their IT careers: https://networknative.blogspot.com/2023/07/my-pc-build-why-building-your-own-pc.html

December 2010 PC specs as of August 2025:

  • The Nvidia GPU was about $230; the computer itself is 15 years old, and a newer PC of equal compute power could be had for around $500 or less

  • Intel Core i7-950 (3.06 GHz, quad core with Hyper-Threading)

  • 12 GB of DDR3 memory

  • Nvidia GeForce RTX 3060 video card (12 GB VRAM)

  • Asus Sabertooth X58 motherboard (PCIe Gen 3)

  • 120 GB SATA SSD

  • Asus DVD/CD burner drive


Next, I built a computer more recently in December 2022:

PC Specs:

  • Estimated at $1,900 in August 2025 ($2,300 when built in December 2022)

  • AMD Ryzen 7 7700X (8 cores / 16 threads, up to 5.5 GHz)

  • 32 GB of DDR5 memory

  • B650 Tomahawk WiFi motherboard (PCIe Gen 5.0)

  • Nvidia GeForce RTX 3080 video card (10 GB VRAM)

  • 2 TB NVMe M.2 storage

  • Celsius S36 AIO cooler

Finally, I have a home lab server I used to teach myself VMware NSX. There are no GPUs here, but it still provides value when it comes to understanding what's needed for running local LLMs.

Dell PowerEdge R630 with:

  • ~$500 on eBay as of August 2025

  • 2 × Intel Xeon E5-2660 v3 (10 cores / 20 threads each with Hyper-Threading)

  • PCIe Gen 3 motherboard

  • 20 physical cores / 40 logical CPUs total with Hyper-Threading

  • 128 GiB RAM

  • 2 NUMA nodes, one per socket

  • Notably, no GPU


After two weeks of experimenting, I've determined that the local LLM I like best for running on my primary desktop computer is Qwen3 from Alibaba, followed closely by Gemma3 from Google DeepMind, because Gemma3 is also a vision model. That said, let's dig into the numbers and my testing of various local LLMs. To better understand the memory footprint of an LLM, I'll be using Qwen3 8B, Qwen3 14B, and Qwen3 30B, plus DeepSeek-R1 70B on the only machine I have with enough memory to run it. Let's see how it turned out.

Let's look at a benchmark comparing Qwen3 30B to other versions of Qwen3 (thinking vs. non-thinking), Gemini 2.5 Flash, and GPT-4o. We can see that Qwen3 is competitive with both OpenAI's and Google's cloud-hosted LLMs.


qwen3 benchmarks

So, how much compute does each model consume? Well, it depends on model size and quantization. The Qwen3 models I'm using here are quantized to Q4_K_M with a context window of 40,960 tokens. Another reason Qwen3 30B performs so well is its Mixture of Experts (MoE) architecture: during inference, only a portion of the model's neural network (a small subset of its experts) is activated for each token, so the per-token compute cost is far lower than the raw parameter count suggests.
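You don't have to take the quantization or context numbers on faith; once a model has been pulled, Ollama will print them for your local copy:

ollama show qwen3:30b    # prints architecture, parameter count, context length, and quantization level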

I'll be comparing Qwen3 8B, 14B, and 30B, plus DeepSeek-R1 70B, for both token rate and memory footprint. We've all watched Nvidia's stock price rocket over the last three years to make it the most valuable company in the world. This post is partly a demonstration of why, as you observe the difference between LLM performance on a CPU versus a GPU.
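While a model is loaded, a quick way to see its actual memory footprint and whether it landed on the GPU, on the CPU, or got split between the two is Ollama's built-in process listing (run it from a second terminal while the model is answering):

ollama ps    # lists loaded models with their size in memory and the CPU/GPU split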

For the purpose of a consistent token rate test, I'll use the following prompt with each local LLM: “Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril.”
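Each run uses the --verbose flag, which is what makes Ollama print the token counts and rates shown below. A small shell loop along these lines (just a sketch, using the model tags from this post) makes it easy to repeat the same test across sizes:

for model in qwen3:8b qwen3:14b qwen3:30b; do
  echo "=== $model ==="
  ollama run "$model" --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."
done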

Qwen3 8B

Test platforms:

  • 2010 PC (Nvidia RTX 3060 – 12 GB VRAM), Ubuntu Linux

  • 2022 PC (Nvidia RTX 3080 – 10 GB VRAM), Windows 11 (WSL)

  • Dell PowerEdge R630, Ubuntu Linux (CPU only)

ollama run qwen3:8b --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."

promt_8b_model


2022 PC (Nvidia RTX 3080 – 10 GB VRAM)

total duration: 16.5389157s

load duration: 1.4681324s

prompt eval count: 36 token(s)

prompt eval duration: 109.5674ms

prompt eval rate: 328.56 tokens/s

eval count: 1344 token(s)

eval duration: 14.960607s

eval rate: 89.84 tokens/s

ollama run qwen3:14b --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."

2022 PC (Nvidia RTX 3080 – 10 GB VRAM)

total duration: 2m17.3188573s

load duration: 4.4674632s

prompt eval count: 36 token(s)

prompt eval duration: 232.6438ms

prompt eval rate: 154.74 tokens/s

eval count: 1694 token(s)

eval duration: 2m12.6181516s

eval rate: 12.77 tokens/s


qwen3 14b in action

2022 PC (Nvidia RTX 3080 – 10 GB VRAM)

ollama run qwen3:30b --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."

total duration: 2m32.9570292s

load duration: 10.3321836s

prompt eval count: 36 token(s)

prompt eval duration: 1.1533504s

prompt eval rate: 31.21 tokens/s

eval count: 1922 token(s)

eval duration: 2m21.4709785s

eval rate: 13.59 tokens/s

GPU resource utilization Qwen3 30b
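If you want to watch VRAM fill up and GPU utilization bounce around while a model like this generates (a view similar to the screenshot above), refreshing nvidia-smi from a second terminal works fine; nvidia-smi ships with the Nvidia driver:

watch -n 1 nvidia-smi    # refresh GPU memory usage and utilization every second during a run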




2010 PC (Nvidia RTX 3060 – 12 GB VRAM)

ollama run qwen3:8b --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."

total duration: 47.849802496s

load duration: 13.68441981s

prompt eval count: 36 token(s)

prompt eval duration: 608.403072ms

prompt eval rate: 59.17 tokens/s

eval count: 1612 token(s)

eval duration: 33.555681261s

eval rate: 48.04 tokens/s

ollama run qwen3:14b --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."

total duration: 1m26.329137766s

load duration: 27.393260443s

prompt eval count: 36 token(s)

prompt eval duration: 311.143989ms

prompt eval rate: 115.70 tokens/s

eval count: 1704 token(s)

eval duration: 58.623095437s

eval rate: 29.07 tokens/s

Qwen3_14b_nvidia_3060


ollama run qwen3:30b --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."

total duration: 10m50.742991014s

load duration: 1m3.536882378s

prompt eval count: 36 token(s)

prompt eval duration: 3.03080791s

prompt eval rate: 11.88 tokens/s

eval count: 1740 token(s)

eval duration: 9m44.11665244s

eval rate: 2.98 tokens/s

LLM_loading_into_3060




Dell PowerEdge R630

ollama run qwen3:8b --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."

total duration: 11m59.928418343s

load duration: 13.798537341s

prompt eval count: 36 token(s)

prompt eval duration: 2.372850941s

prompt eval rate: 15.17 tokens/s

eval count: 1440 token(s)

eval duration: 11m43.753674245s

eval rate: 2.05 tokens/s

8b_memory_footprint


deepseek_logo


Dell PowerEdge R630

ollama run deepseek-r1:70b --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."

total duration: 56m35.155970193s

load duration: 6m22.306749506s

prompt eval count: 29 token(s)

prompt eval duration: 21.512059957s

prompt eval rate: 1.35 tokens/s

eval count: 1243 token(s)

eval duration: 49m51.334762972s

eval rate: 0.42 tokens/s

Deepseek_70b_memory_footprint


deepseek_70b_CPU_load
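On the GPU-less R630, watching the CPU load during a run like the one captured above is just a matter of opening an interactive process viewer in another terminal; on Ubuntu, htop is a common choice (plain top works too):

sudo apt install htop    # if it isn't already installed
htop                     # shows per-core CPU utilization while the model generates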


For fun, here is the answer given by DeepSeek-R1 70B:

daryl@LLMmachine:~$ ollama run deepseek-r1:70b --verbose "Steelman the case that local language models are the future of AI and too many people overlook this technology at their own economic peril."
Thinking...
Okay, so I need to make the case that local language models are the future of AI and that ignoring them could be 
economically risky. Hmm, where do I start? Well, first, what exactly are local language models? From what I 
understand, they're AI models that run on individual devices rather than relying on remote servers. That probably 
means they can work offline, which is a big plus.

Let me think about the benefits. Privacy is a huge concern these days. If the model is local, then all the 
processing happens on your own device. So, sensitive data doesn't get sent over the internet where it could be 
intercepted or misused. That makes sense. Companies dealing with personal info would definitely benefit from this 
to avoid data breaches.

Another point is latency. If you don't have to send requests to a server and wait for a response, applications can 
run much faster. Imagine using a virtual assistant that responds instantly without any lag. That would be awesome 
for real-time applications like gaming or voice assistants.

Cost savings could also be significant. Using cloud-based models means paying for server time, bandwidth, and 
maintenance. Shifting to local models might reduce these costs over time, especially as hardware gets more efficient. 
Plus, not relying on the cloud means you don't have to deal with service outages or depend on external providers.

Customization is another angle. Local models can be tailored to specific needs, whether it's industry jargon for 
businesses or personal preferences for users. This could make AI tools more effective and user-friendly since they're 
optimized for particular contexts.

Sustainability is something I haven't thought about much. Training big models uses a lot of energy. If local models are 
smaller and run on devices that people already use, maybe the overall environmental impact is less. Plus, as edge 
computing grows, using local resources more efficiently could be better than centralized data centers.

Compliance with regulations like GDPR is important too. By keeping data on-device, companies can more easily meet 
privacy standards without worrying about cross-border data transfers. This reduces legal risks and makes it easier to 
operate in regions with strict data laws.

Resilience against outages or service shutdowns is another benefit. If a cloud service goes down, any application 
relying on it would be affected. Local models don't have this dependency, making systems more robust and reliable.

Economic perils of ignoring local models... Well, if competitors adopt them and you don't, you might fall behind 
in efficiency or customer satisfaction. There's also the risk of missing out on new markets where internet access 
is limited but AI demand is growing.

Putting it all together, I need to structure this into a coherent argument. Start with an introduction stating the 
importance of local language models. Then outline each benefit clearly, making sure to explain how they translate 
to economic advantages and potential risks if overlooked. Maybe wrap up by emphasizing that adopting local models 
now can lead to innovation and better competitiveness.

I should make sure each point flows logically and supports the main thesis. Maybe use examples or analogies where 
possible to make it more relatable. Also, address any counterarguments, like initial costs of developing local 
models, but then counter them with long-term savings and benefits.

Overall, I think I have a good outline here. Now I just need to flesh out each point with some details and 
examples to make the case strong.
...done thinking.

**The Rise of Local Language Models: A Strategic Imperative for Future-Proofing AI**

In the rapidly evolving landscape of artificial intelligence, local language models (LLMs) are emerging as a 
transformative technology that promises to reshape how we interact with AI. These models operate on individual 
devices, enabling offline functionality and offering significant advantages over their cloud-based counterparts. 
Ignoring this shift could place businesses at a competitive disadvantage, making it imperative to embrace LLMs for 
economic resilience.

**1. Enhanced Privacy and Security:**
Local language models process data entirely on-device, eliminating the need to transmit sensitive information over the 
internet. This reduces the risk of data breaches and interception, which is crucial for industries handling personal or 
confidential data. Companies can thus enhance their compliance with privacy regulations while building trust with 
consumers.

**2. Reduced Latency for Real-Time Applications:**
By eliminating the need for server roundtrips, LLMs enable instantaneous responses, making them ideal for real-time 
applications such as virtual assistants, gaming, and autonomous systems. This speed enhancement can significantly 
improve user experience and operational efficiency.

**3. Cost Efficiency:**
Shifting from cloud-based to local models reduces reliance on expensive server infrastructure, cutting costs related to 
bandwidth, maintenance, and service outages. Over time, this shift can lead to substantial savings, especially as 
hardware efficiency improves.

**4. Customization and Contextual Relevance:**
LLMs can be tailored to specific contexts, whether it's integrating industry-specific jargon or personalizing user 
experiences. This customization enhances effectiveness, making AI tools more adaptable and valuable for diverse use 
cases.

**5. Environmental Sustainability:**
Training large-scale models consumes significant energy. Local models, being smaller and running on existing devices, 
offer a more sustainable alternative. As edge computing grows, leveraging local resources efficiently aligns with 
broader environmental goals.

**6. Regulatory Compliance:**
By keeping data on-device, companies can more easily comply with regulations like GDPR, avoiding legal risks associated 
with cross-border data transfers and ensuring smoother operations in regions with stringent data laws.

**7. Operational Resilience:**
Local models reduce dependency on cloud services, safeguarding against outages and shutdowns. This resilience ensures 
continuous operation, critical for applications where downtime can have significant consequences.

**Economic Perils of Inaction:**
Ignoring LLMs risks competitive disadvantage as others adopt these technologies to enhance efficiency and user 
satisfaction. Moreover, missing opportunities in emerging markets with limited internet access could hinder growth and 
innovation.

**Conclusion:**
The shift towards local language models represents a strategic imperative for businesses aiming to stay competitive. 
Embracing this technology now can drive innovation, improve operational efficiency, and ensure long-term 
competitiveness. As the AI landscape evolves, those who adapt will thrive, while others risk being left behind.








