It is hardly news that we live in an era when readily accessible AIs are catalysing a period of economic disruption. When used thoughtfully, Large Language Models allow for dramatic streamlining of production pipelines, reducing the cost of drafting, coding, summarising, analysing and automating routine knowledge work.

Yet most discussion still centres on frontier cloud models: ChatGPT, Claude, Gemini and their peers. These systems are powerful, polished and increasingly embedded into professional workflows. However, dependence on external APIs has real drawbacks. Proprietary closed-weight models may come with a high-quality ecosystem of readily callable tools and capabilities, not to mention parameter counts that run into the hundreds of billions, but users may experience degraded performance without warning, and costs can balloon at a moment’s notice. The alternative? Local LLMs.

By local LLMs, I mean models that run directly on your own machine rather than through a cloud API. Instead of sending prompts to an external server, the model runs on your laptop, desktop or private server. This changes the economics and privacy assumptions of AI use. It also shifts more responsibility onto the user, because local models vary significantly in speed, reliability and instruction-following ability.
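To make that concrete, here is a minimal sketch of what ‘running locally’ looks like from a program’s point of view. Most local runtimes, including Ollama and LM Studio, expose an OpenAI-compatible HTTP endpoint on localhost; the port and model identifier below are assumptions you would replace with whatever your own runtime reports.

```python
# Minimal sketch: prompting a locally hosted model over HTTP.
# Assumes an OpenAI-compatible local server (LM Studio defaults to
# port 1234, Ollama to 11434); the model name is a placeholder.
import requests

LOCAL_ENDPOINT = "http://localhost:1234/v1/chat/completions"

payload = {
    "model": "gemma-4-e4b",  # hypothetical local model identifier
    "messages": [
        {"role": "user", "content": "Summarise this paragraph in one sentence: ..."}
    ],
    "temperature": 0,
    "stream": False,
}

response = requests.post(LOCAL_ENDPOINT, json=payload, timeout=300)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```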

For this article, I tested a set of local models on a MacBook Pro with an M4 Pro chip and 48 GB of unified memory, using Ollama and LM Studio. The models were:

Model | Size | USP
Qwen 3.6 35B | 35B (A3B active) | Largest general-purpose model tested
Gemma 4 31B | 31B | Large dense general assistant model
Gemma 4 26B | 26B (A4B active) | Gemma 4 tested with mixture-of-experts architecture
Qwen 3.5 9B | 9B | Small Qwen speed baseline
Gemma 4 4B | 4B | Lightweight model baseline

I’m not trying to create a definitive academic benchmark – more than enough of those are already out there. Instead, I’m trying to answer a practical question: which models provide the most real-world utility, where do they start to break, and what practical differences can we notice between small 4-billion-parameter models designed to run on consumer-grade hardware and larger 20-to-30-billion-parameter models that stretch the limits of laptop inference?

Why this model set is interesting

This group of models creates a useful contrast.

At one end are Qwen 3.6 35B, Gemma 4 31B and Gemma 4 26B. These are the latest and ‘highest scoring’ open-weight LLMs on many benchmarks as of April 2026. They’re large enough to be genuinely interesting as local alternatives to cloud models, but also large enough to test the limits of laptop-based inference. They should, in theory, be better at reasoning, instruction-following and complex writing than smaller models. But that extra capability comes at a cost: more memory pressure, slower generation, longer load times and a greater chance that the model feels heavy in ordinary use. It’s also worth noting the presence of Gemma 4 31B, a ‘dense’ LLM. Some models use a ‘mixture of experts’ approach: for each query they activate only a small subset of their parameters, chosen according to what the model deduces the query requires. In contrast, Gemma 4 31B uses all of its parameters to respond to every query, which may yield greater ‘depth’, albeit at a significantly slower inference rate.

At the other end are Qwen 3.5 9B and Gemma 4 4B. These are not expected to beat the larger models on raw intelligence. Their purpose is different. They test whether a small local model can be fast enough, responsive enough and sufficiently competent for lightweight tasks such as summarisation, drafting, simple classification or coding assistance – in short, whether the average consumer can derive real benefit from today’s locally hosted open-weight LLMs.

This is the central trade-off in local AI. The best model is not necessarily the largest. A model that is theoretically stronger but slow enough to interrupt workflow may be less useful than a smaller model that gives acceptable answers instantly.

The testing setup

The tests were run on:

GPU: Apple M4 Pro (20 core)
Memory: 48 GB unified memory
Harness: LM Studio

I chose LM Studio as the harness for running these LLMs, rather than alternatives such as Ollama, owing to its superior machine-protection safeguards and its GUI with built-in tokens-per-second tracking and temperature adjustment. You can try it yourself here.
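If you prefer to verify the speed figures yourself rather than relying on the GUI readout, a rough tokens-per-second measurement is easy to script against the local server. The sketch below assumes LM Studio’s default port and that the server reports token usage in the standard OpenAI-style `usage` field; it also lumps prompt processing in with generation, so treat the result as an approximation.

```python
# Rough tokens-per-second measurement against a local OpenAI-compatible
# server (LM Studio's default port is assumed). Elapsed time includes
# prompt processing, so this slightly understates pure generation speed.
import time
import requests

ENDPOINT = "http://localhost:1234/v1/chat/completions"

def measure_tps(model: str, prompt: str) -> float:
    start = time.perf_counter()
    r = requests.post(ENDPOINT, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }, timeout=600)
    r.raise_for_status()
    elapsed = time.perf_counter() - start
    body = r.json()
    completion_tokens = body.get("usage", {}).get("completion_tokens")
    if completion_tokens is None:
        # Crude fallback if the server omits usage stats.
        completion_tokens = len(body["choices"][0]["message"]["content"].split())
    return completion_tokens / elapsed

print(f"{measure_tps('gemma-4-e4b', 'Explain quantisation in two sentences.'):.1f} tokens/sec")
```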

Important note – I chose to download and run all models at a ‘4-bit’ quantisation level. This means the weights that make up the model are stored as 4-bit numbers rather than in their original 16- or 32-bit format. Quantising this aggressively typically costs a few percentage points of benchmark accuracy (on the order of 2–5%), but the memory savings are massive, and it is what makes it possible to run the entire suite of publicly available Gemma 4 and Qwen 3.6 LLMs on my machine.
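A back-of-the-envelope calculation shows why this matters so much. Weight memory scales linearly with parameter count and bits per weight; note that real 4-bit formats spend closer to 4.5 bits per weight once scaling factors are included, and the KV cache for the context window comes on top, which is why the memory-pressure figures later in this article are higher than these raw numbers.

```python
# Rough weight-memory estimate: parameters x bits-per-weight / 8.
# Only counts the weights; KV cache and runtime overhead come on top.
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("Gemma 4 E4B", 4), ("Gemma 4 26B", 26), ("Qwen 3.6 35B", 35)]:
    print(f"{name}: {weight_memory_gb(params, 16):.1f} GB at 16-bit, "
          f"{weight_memory_gb(params, 4.5):.1f} GB at ~4-bit")
```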

The comparison focused on practical use rather than abstract benchmark scores. I looked at:

Category (weighting) | What I was judging
Response format (20%) | Whether the response is valid JSON
Adherence to schema (20%) | Whether it obeyed detailed constraints
Factual correctness (40%) | Whether the substance of the response is in accordance with reality
Compliance with constraints (20%) | Adherence to the safety layer, resistance to attempted prompt injections, and following of the instructions specified in the prompt

I asked each model to execute the same 10 tasks, weighting each task equally. The tasks range from simple key-term extraction from JSON files to prompt-injection resistance testing and reasoning/justification stress testing. The full list of prompts is available here. All models were run in ‘reasoning’ mode, which improves accuracy at the expense of longer inference times.
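For transparency, this is roughly how the weighted marking works in practice. The criterion names, the JSON-validity check and the per-criterion marks below are illustrative; only the 20/20/40/20 weights come from the rubric above.

```python
# Sketch of the weighted marking scheme (weights from the rubric above;
# criterion names and example marks are illustrative placeholders).
import json

WEIGHTS = {
    "response_format": 0.20,       # valid JSON
    "schema_adherence": 0.20,      # obeys detailed constraints
    "factual_correctness": 0.40,   # substance matches reality
    "constraint_compliance": 0.20, # safety, injection resistance, instructions
}

def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def task_score(marks: dict) -> float:
    """Combine per-criterion marks (0-10) into a weighted score out of 10."""
    return sum(marks[c] * w for c, w in WEIGHTS.items())

marks = {
    "response_format": 10 if is_valid_json('{"answer": 42}') else 0,
    "schema_adherence": 8,
    "factual_correctness": 9,
    "constraint_compliance": 10,
}
print(f"Task score: {task_score(marks):.2f}/10")  # 9.20/10
```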

Model 1: Gemma 4 E4B

The Gemma 4 4B model is Google’s smallest latest-generation open-weight model. On paper, we’d expect it to struggle, breaking down slightly as the prompts become more complex.

Detail | Result
Model | Gemma 4 E4B
Context length | 128,000 tokens
Temperature | 0
Speed | ~52 tokens/second
Memory pressure | 12.69 GB with full context window (5.89 GB minimum)

Scorecard | Score
Question 1 | 10
Question 2 | 9.5
Question 3 | 7
Question 4 | 9
Question 5 | 6
Question 6 | 9
Question 7 | 6
Question 8 | 8.75
Question 9 | 9
Question 10 | 7
Total | 81.25/100

General impression

What a model! On paper, at 4 billion parameters this ought to be the least capable of the models tested today, yet it achieved a respectable score of 81.25 out of 100. It did struggle in some areas: in Question 5 it was unable to identify the locations and nature of some bugs while debugging a test Python program, and in Question 7 it incorrectly described itself as capable of answering a question on ambiguous gender pronouns, despite going on to clearly state the reasons it was unable to do so. Nevertheless, at a size that fits onto many consumer-range laptops, and operating at over 50 tokens per second on my machine, the model provided a solid baseline. It’s also noteworthy how well the model used its reasoning budget: thinking times stayed low (often under 20 seconds), so responses felt snappy, and it was both token- and time-efficient – answers rarely exceeded 1,000 tokens.

Strengths

  • Rapid inference
  • Relatively token-efficient
  • Reasonable memory constraints
  • Excellent explanations

Weaknesses

  • Sometimes gets the easy stuff wrong, even if it understands the reasoning
  • Still needs a very thorough reader

Best use case

To me, Gemma 4 E4B represents the frontier of a critical niche in local inference AI – a model small enough to run on most desktop devices and laptops (with a small enough context window), whilst demonstrating a reasonable baseline of reasoning. It excels in data analysis and retrieval in paragraph-sized documents.

Provisional verdict

Don’t go giving it root access, but this LLM can really speed up routine data retrieval tasks – with heavy human supervision.


Model 2: Qwen 3.5 9B

The next model I tested was Qwen 3.5 9B, a 9-billion-parameter model. Although slightly older than the Gemma 4 or Qwen 3.6 families, Qwen 3.5 provides a useful bridge for analysing how model capabilities evolve between the 4-billion-parameter edge class and the multi-dozen-billion-parameter ‘flagship’ models of the latest open-weight families.

Detail | Result
Model | Qwen 3.5 9B
Context length | 262,000 tokens
Temperature | 0
Speed | ~35 tokens/sec
Memory pressure | 16.8 GB at maximum context (6.1 GB minimum)

Scorecard | Score
Question 1 | 9
Question 2 | 9
Question 3 | 7
Question 4 | 8
Question 5 | 9
Question 6 | 9.5
Question 7 | 10
Question 8 | 8.5
Question 9 | 9
Question 10 | 7
Total | 86/100

General impression

I had some mixed feelings here. Whilst it’s true that the paper score of 86 out of 100 is marginally higher than that of the Gemma 4 E4B model, this came at a cost: massively increased inference times. As expected, the 9-billion-parameter model conducted inference roughly 30% slower than the nimble 4-billion-parameter option. What I didn’t expect, however, was the huge increase in reasoning intensity. Whereas the nimbler E4B model was able to conduct all of its reasoning in no more than about 1,200 tokens per question, Qwen 3.5 9B regularly needed several times that.

And if we look at the reasoning that Qwen 3.5 conducts, it can get a bit… repetitive. When responding to question 3, Gemma 4 E4B required a total of 1,145 tokens and 16.5 seconds of thought. Qwen 3.5 9B took 7,568 tokens and 3 minutes and 45 seconds of thought. The models scored the same.
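The arithmetic makes the gap obvious. A model’s effective ‘thinking time’ is roughly its reasoning-token count divided by its generation speed, so even at broadly similar tokens-per-second rates a terser model finishes an order of magnitude sooner. (The estimates below won’t match the measured wall-clock times exactly, since prompt processing runs faster than generation, but the shape of the comparison holds.)

```python
# Rough thinking-time estimate: reasoning tokens / generation speed.
# Token counts are the Question 3 figures reported above; speeds are the
# averages measured for each model.
def thinking_seconds(reasoning_tokens: int, tokens_per_second: float) -> float:
    return reasoning_tokens / tokens_per_second

gemma_e4b = thinking_seconds(1_145, 52)  # ~22 s of generation
qwen_9b = thinking_seconds(7_568, 35)    # ~216 s, well over three minutes

print(f"Gemma 4 E4B : ~{gemma_e4b:.0f} s")
print(f"Qwen 3.5 9B : ~{qwen_9b:.0f} s ({qwen_9b / gemma_e4b:.0f}x longer)")
```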

Strengths

  • Larger context window allows longer conversations than Gemma 4 E4B
  • Has marginally more reasoning power

Weaknesses

  • Roughly 30% slower inference than the 4B-class models tested
  • Needs anywhere from 40% more to eight times as many reasoning tokens as Gemma 4 E4B

Best use case

In all honesty, I’m not sure what to make of this model. It exists in a sort of ‘uncanny valley’: it is slower than the 4-billion-parameter models, too large to run comfortably on consumer-grade devices with 16 GB of memory at any context length, and it guzzles tokens in a way that the Gemma models don’t. Although it is technically capable, you’re much better off sticking either to one of the smaller Gemma models for light work or, if your system permits, to a larger model in the 20–30-billion-parameter range for true heavy reasoning.

Provisional verdict

Best avoided

Model 3: Gemma 4 26B

Gemma 4 26B is particularly interesting because it’s the first of the two models I tested to use a ‘mixture of experts’ architecture. What this means in practice is that, although it has access to the full knowledge base of its 26 billion parameters, it actively uses only the roughly 4 billion parameters of ‘specialised’ knowledge most relevant to each query, giving it the breadth of knowledge of a 26-billion-parameter model whilst enjoying the inference speeds of its much smaller counterparts.
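A crude rule of thumb illustrates why active parameters, not total parameters, dominate generation speed: per-token compute in a transformer is roughly proportional to the number of parameters actually used for that token (often approximated as two FLOPs per active parameter per token). The figures below are illustrative only.

```python
# Illustrative only: per-token compute is roughly ~2 FLOPs per ACTIVE
# parameter, so a 26B mixture-of-experts model with ~4B active parameters
# generates with compute costs closer to a 4B dense model than a 26B one.
def flops_per_token(active_params_billion: float) -> float:
    return 2 * active_params_billion * 1e9

dense_26b = flops_per_token(26)
moe_26b_a4b = flops_per_token(4)
print(f"Dense 26B    : ~{dense_26b:.1e} FLOPs/token")
print(f"26B A4B (MoE): ~{moe_26b_a4b:.1e} FLOPs/token "
      f"(~{dense_26b / moe_26b_a4b:.1f}x less per token)")
```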

Detail | Result
Model | Gemma 4 26B A4B
Context length | 262,144 tokens
Temperature | 0
Speed | ~57 tokens/sec in optimal conditions
Memory pressure | 35 GB with full context window (16.4 GB minimum)

Scorecard | Score
Question 1 | 10
Question 2 | 9
Question 3 | 10
Question 4 | 8.5
Question 5 | 7
Question 6 | 10
Question 7 | 10
Question 8 | 9
Question 9 | 8
Question 10 | 10
Total | 91.5/100

General impression

Now we’re getting somewhere. If you had to choose between Qwen 3.5 9B and Gemma 4 26B, the Gemma model is a no-brainer if you have the memory to spare. It scores a full ten points higher than the smaller E4B model. Although this comes at the cost of slightly more reasoning tokens per answer, it is nowhere near as much reasoning as Qwen 3.5 9B conducts – and the return to the rapid inference enjoyed by models with around 4 billion active parameters makes the experience feel even more fluid.

Strengths

  • Rapid inference
  • Efficient reasoning
  • High accuracy

Weaknesses

  • Marginally less token efficient than Gemma 4 E4B
  • The first of the ‘RAM guzzlers’ – takes roughly triple the memory of the 4-billion-parameter model

Best use case

Gemma 4 26B is very much a ‘best of both worlds’ workhorse for my use case – its depth of knowledge combined with its speed of inference makes it utterly perfect for my daily workflows on this machine.

Provisional verdict

A strong model, suitable for most day-to-day work on higher-end consumer devices.


Model 4: Gemma 4 31B

Gemma 4 31B is the ‘densest’ model I tested – in other words, the model with the largest number of parameters active at any one time. Whilst this should in theory make it even more capable than its ‘sparse’ mixture-of-experts sibling, it comes at the cost of far greater memory demands, particularly for context (to the point that my machine was unable to load the model with its full context window).

Detail | Result
Model | Gemma 4 31B
Context length | 262,144 tokens
Temperature | 0
Speed | 11 tokens/sec
Memory pressure | 19.5–87 GB (dependent on context window)

Scorecard | Score
Question 1 | 10
Question 2 | 9
Question 3 | 10
Question 4 | 10
Question 5 | 8.5
Question 6 | 10
Question 7 | 10
Question 8 | 10
Question 9 | 9.5
Question 10 | 10
Total | 97/100

General impression

As you can see, on paper this is our most capable model yet, cruising through the question set. It also happens to be the most token-efficient, with its reasoning taking significantly fewer tokens than its counterparts – one question took Gemma 4 31B just 567 tokens to answer, compared with 944 from Gemma 4 26B.

However, the cripplingly low inference speed on my device means it’s very hard to take this model seriously as a day-to-day workhorse.

Strengths

  • Extremely thorough reasoning
  • Token-efficient response

Weaknesses

  • Much slower than anything else tested

Best use case

For my hardware, the best use case of Gemma 4 31B is clear. It’s far too slow (not to mention memory-intensive) for most questions. However, when a particularly complex or large data-retrieval or analysis task comes up that other models may struggle with, switching to Gemma 4 31B for that one issue and then switching back gives me pinpoint accuracy at a fraction of the token cost of other models, and reserving it for those rare questions keeps the rest of my workflow moving at a reasonable rate.

Provisional verdict

A heavy hitter, best used when real firepower is needed.


Model 5: Qwen 3.6 35B A3B

This model is, paradoxically, both the largest and the smallest of the models tested: it has the largest number of total parameters but the smallest number of active parameters. It’s also the only model tested from the Qwen 3.6 family, the latest open-weight family released by Alibaba Cloud. The deal-making (or deal-breaking) question for me was: is this model’s token efficiency superior to that of the smaller and older Qwen 3.5 9B, and perhaps on par with the Gemma models?

Detail | Result
Model | Qwen 3.6 35B A3B
Context length | 262,144 tokens
Temperature | 0
Speed | ~51 tokens/sec
Memory pressure | 21.6–26.6 GB

Scorecard | Score
Question 1 | 10
Question 2 | 10
Question 3 | 10
Question 4 | 9
Question 5 | 10
Question 6 | 10
Question 7 | 7
Question 8 | 10
Question 9 | 10
Question 10 | 10
Total | 96/100

General impression

A decent score. However, the mixture-of-experts architecture at larger sizes does appear to come with some trade-offs. Despite having more total parameters than Gemma 4 31B and being of the ‘same generation’, Qwen 3.6 35B scored lower (albeit marginally) – when not all parameters are brought to bear on a problem, there is always a small chance that some potentially relevant ones were excluded from the active subset, hurting performance.

But the main question that’s relevant to me – what were inference times like?

Comparing inference times side by side, Qwen 3.6 35B against Gemma 4 31B was impressive indeed. Although it’s undeniable that Qwen 3.6’s reasoning tended to use more tokens, it was more token-efficient than its last-generation sibling tested earlier. This, combined with inference speeds of around 50 tokens per second, meant that the sparse Qwen model delivered substantially shorter thinking periods than Gemma 4 31B in 9 out of the 10 questions in the set.

Strengths

  • Fastest in terms of ‘tokens per parameter’
  • Gemma 4 31B level performance

Weaknesses

  • Highest base memory requirements
  • Thinking is a little verbose and hard to parse

Best use case

High-speed, large-scale automated processing where rapid response time is prioritised over deep reasoning.

Provisional verdict

A highly efficient specialist for large-throughput tasks.

Large models versus small models

The most useful comparison was not only model against model, but class against class.

The large models – Qwen 3.6 35B, Gemma 4 31B and Gemma 4 26B – were tested as serious local assistants. Their advantage should be depth: better reasoning, better writing, stronger coding support and more reliable handling of complex prompts.

The small models – Qwen 3.5 9B and Gemma 4 E4B – were tested as lightweight tools. Their advantage should be speed: fast responses, lower memory use and less friction.

The key finding to explore is whether the large models were merely better, or whether the smaller models had a genuine role. In local AI, speed is not a minor factor. A model that answers in two seconds may be more useful for routine tasks than a smarter model that takes thirty seconds to produce an answer.

This is especially relevant for business use. Many business tasks do not require deep reasoning. Classification, tagging, rough summaries, first drafts and simple extraction may not need a dense, multi-dozen billion parameter model. But complex analysis, coding and source-grounded reasoning probably do.

Results summary

Scorecard Summary

Model Comparison Profile

Model | Raw capability | Practical speed | Best use case | Main weakness
Qwen 3.6 35B | High | Fast (~51 tps) | Automated throughput | Verbose / memory-intensive
Gemma 4 31B | Highest | Slow (11 tps) | Complex analysis | Extreme latency
Gemma 4 26B | High | Very fast (~57 tps) | Daily workhorse | RAM requirements
Qwen 3.5 9B | Moderate | Moderate (~35 tps) | Lightweight tasks | Inefficient reasoning tokens
Gemma 4 4B | Low | Fast (~52 tps) | Simple extraction/retrieval | Lower intelligence

Best model by task

Use case | Best model | Reason
Best overall local assistant | Gemma 4 26B A4B | Best balance of speed and accuracy
Best reasoning model | Gemma 4 31B | Highest raw score in the test set
Best coding model | Gemma 4 31B | Superior intelligence for complex logic
Best writing model | Gemma 4 31B | Highest linguistic capability
Best structured-output model | Gemma 4 31B | Most reliable adherence to JSON schema
Fastest usable model | Gemma 4 E4B | High speed with acceptable accuracy
Best small model | Gemma 4 E4B | Remarkably high score for 4B size
Best balance of speed and quality | Gemma 4 26B A4B | High performance without the massive latency
Most capable but least practical | Gemma 4 31B | Too slow for real-time interaction
Biggest surprise | Gemma 4 E4B | Achieved >80% score despite tiny size
Biggest disappointment | Qwen 3.5 9B | The ‘uncanny valley’ of inefficient token usage

Business implications

The results have broader implications beyond laptop experimentation.

The first is that local AI is becoming practical. Running models in the 26B–35B range on a consumer laptop would once have seemed unrealistic. Now, with sufficient unified memory and good local tooling, it is possible to run models that are genuinely useful.

The second implication is that businesses should not think about local LLMs as a single category. A small local model and a large local model serve different purposes. A 4B model may be suitable for fast internal classification, simple summarisation or low-stakes automation. A 30B-class model may be more appropriate for reasoning-heavy analysis, coding or document review.

The third implication is that model deployment will become more hybrid. Cloud models will remain important for frontier-level reasoning, reliability and polished user-facing applications. But local models will increasingly make sense where privacy, cost control or offline access matter.

A company might use a small local model to triage documents, a larger local model to draft internal summaries, and a frontier cloud model only for the most difficult or high-stakes tasks. This kind of routing could become one of the most important practical patterns in AI adoption.
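A routing layer like that does not need to be sophisticated to be useful. The sketch below is a hedged illustration of the idea rather than a production design: the tiers, task categories and the `call_model` stub are all hypothetical placeholders you would replace with your own local endpoints and cloud client.

```python
# Hedged sketch of tiered model routing: cheap local model for triage-style
# work, larger local model for drafting, cloud model for high-stakes tasks.
# Tier names, categories and the call_model stub are illustrative only.
def call_model(tier: str, prompt: str) -> str:
    # In practice: hit the local server for "small"/"large" tiers and a
    # cloud API for "frontier". Stubbed out here.
    return f"[{tier} model response to: {prompt[:40]}...]"

def route(task: dict) -> str:
    """Pick a model tier from simple, auditable task metadata."""
    if task["kind"] in {"classification", "tagging", "extraction"}:
        return "small"    # e.g. a 4B-class local model
    if task["kind"] in {"summary", "draft", "review"} and not task["high_stakes"]:
        return "large"    # e.g. a 26B-class local model
    return "frontier"     # cloud model for difficult or high-stakes work

task = {"kind": "summary", "high_stakes": False, "text": "Quarterly sales memo..."}
print(call_model(route(task), task["text"]))
```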

The fourth implication is that evaluation matters. The model that feels most impressive in conversation may not be the best operational model. For businesses, the important questions are:

  • Can it follow instructions?
  • Can it produce valid structured output?
  • Can it avoid hallucinating?
  • Can it run at acceptable speed?
  • Can it handle the actual documents or workflows involved?
  • Can its mistakes be detected?

The answers vary by model.

Final comparison

My provisional final ranking is:

  1. Gemma 4 26B A4B
  2. Qwen 3.6 35B A3B
  3. Gemma 4 31B
  4. Gemma 4 E4B
  5. Qwen 3.5 9B

The best overall model was Gemma 4 26B A4B, because it offers the highest intelligence-to-speed ratio for daily workflows.

The best large model was Gemma 4 31B, because it achieved the absolute highest accuracy score in the test set.

The best small model was Gemma 4 E4B, because its performance at only 4 billion parameters is remarkably robust.

The most practical model was Gemma 4 26B A4B, because it provides near-top-tier intelligence without the crippling latency of denser models.

The most capable but least convenient model was Gemma 4 31B, because its inference speed makes real-time interaction frustrating.

The biggest surprise was Gemma 4 E4B’s high score relative to its small size.

The biggest disappointment was Qwen 3.5 9B due to its inefficient and repetitive reasoning process.

Conclusion

Testing these five promising models showed me just how quickly local AI is becoming practical.

The larger models demonstrate that laptop-based inference is no longer limited to toy examples. They can be genuinely useful for writing, reasoning, coding and analysis. But they also show the cost of local capability: slower responses, greater memory pressure and more friction.

The smaller models show the opposite trade-off. They are less capable, but often much faster and more convenient. For many routine tasks, that’s more than enough.

That is the real lesson from local LLM testing. The question is not simply which model is “best”. The question is which model is best for a given job.

For individuals, local models offer privacy, experimentation and control. For businesses, they suggest a future in which AI workloads are distributed across different model sizes and deployment environments, with routine and commercially sensitive tasks running locally on workstations, whilst larger tasks remain in the cloud or on proprietary servers.

The local LLM market is therefore not just a technical curiosity. It is part of a broader shift in AI economics. As models become easier to run, the advantage will move from merely having access to AI towards knowing how to choose, test and deploy the right model.

The next stage of AI adoption will not be defined only by the largest models. It will also be defined by the models that are fast enough, private enough and reliable enough to be useful where the work actually happens.
