Published: Jun 9, 2026 by Isaac Johnson

Just this last week Google rolled out Gemma4 12b which is really interesting to me because the 20+ billion models are a bit much for my hardware, but the 8b and less have some limitations.

12 billion models are a real butter zone for the hardware I have. It is not direct, but I try to find a model where billion matches my Gbs of VRAM.

I didn’t want to jump in right away because initially it didn’t work in released Ollama until the end of the week. And moreover, I don’t like to rush articles just to catch news cycles.

This weekend I put it through the paces. The first half was written using spotty WiFi during a swim meet at the Freeman Aquatic Center at the U of M over a hidden port routed back to my gaming laptop. The second half (adding sort order) was locally in network.

usage

First, I had to wait till around Friday when I could finally get a released Ollama that worked (the 0.30.3 was fixed on the releases and the download of 0.4 failed as a pre-release).

builder@builder-Lenny16:~$ !52
curl -fsSL https://ollama.com/install.sh | sh
>>> Cleaning up old version at /usr/local/lib/ollama
>>> Installing ollama to /usr/local
>>> Downloading ollama-linux-amd64.tar.zst
######################################################################## 100.0%
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
>>> NVIDIA GPU installed.
builder@builder-Lenny16:~$ ollama --version
ollama version is 0.30.5
Warning: client version is 0.30.4-rc0
builder@builder-Lenny16:~$ ollama pull gemma4:12b
pulling manifest
pulling 1278394b6936: 100% ▕████████████████████████████████████████████████████████████████████████████▏ 7.4 GB
pulling 675ad6e68101: 100% ▕████████████████████████████████████████████████████████████████████████████▏ 175 MB
pulling 0d542e0c8804: 100% ▕████████████████████████████████████████████████████████████████████████████▏  10 KB
pulling 56380ca2ab89: 100% ▕████████████████████████████████████████████████████████████████████████████▏   42 B
pulling c805f5b265d8: 100% ▕████████████████████████████████████████████████████████████████████████████▏  548 B
verifying sha256 digest
writing manifest
success

Next, I started down the path of using it to update a basic TUI I started (without AI) using a TUI framework.

As you can see it worked through a plan and came back successful:

When done, I can “compact” the current task

I got some bugs and started with cline CLI

But it just wouldn’t move forward with a fix. I tried the same model on the same host, but this time using the Cline extension in vs code.

This time I saw it actively reviewing and fixing the file. Till the thought process started writing into the file…

I canceled then realized the real issue was just a bad regexp at some point left one errant line in the file (replace_in_file). i removed it myself and got past that error.

I tried a new prompt in the UI and found it would get a failure at times, but retry which worked (as it stands, I’m trying this in a sports venue with poor signal so it could be an affect of the network)

and completed:

When a change is done, I can click “Explain changes”

I did have a complicated problem and was able to use GCA (since they haven’t shut it off yet) to sort out

Continue vs Cline

Here we can see adding a sort option via Continue.dev.

It seemed to update the backend index.ts as I pointed the file to it, but neglected to add to the UI layer even when i set it to use the index.ts file.

let’s now try in Continue.dev

Here we see it worked but did time out on responses a few times.

Now I have working reverse sort order toggle:

Cline CLI

I tried the CLI again but couldn’t get it to get past the thinking block

$ cline --thinking none --auto-approve ture --compaction agentic --retries 6 "the TUI in ./vikt
ui shows tasks. If i click a task, i would like to see details. Consult swagger in ./viktui/swagger.json"
[thinking] The user wants to implement a feature where clicking on a task in the TUI (Terminal User Interface) displays its details. They also mentioned consulting `swagger.json` for details about how tasks are structured or what fields are available.

First, I need to understand the current TUI implementation and how tasks are displayed.
Then, I'll examine `viktui/swagger.json` to find information about the "task" entity and its detail fields.
Finally, I will implement the logic to fetch/display details

$ cline "the TUI in ./viktui shows tasks. If i click a task, i would like to see details. Consult swagger in ./viktui/swagger.json"
[thinking] The user wants to implement a feature where clicking a task in the TUI displays its details. I need to understand how the TUI handles task selection and what "details" are available from the backend (as defined in the `swagger.json`).

Plan:
1. Explore the project structure to find relevant files for the TUI logic and data models.
2. Examine `viktui/swagger.json` to understand the API schema for tasks and their details.
3. Identify where task list items are rendered and where

or in interactive mode:

Hosted Gemma

I saw there are some Fireworks.AI hosted models via Azure AI Foundry but just for chat completion and they are locked out for me:

You can get a hosted endpoint in Google Vertex AI (Agent Platform) for $5.18/hour which seems really high to me:

There is no Gemma hosted model in AliCloud (though they have their own Qwen3.7)

And AWS Bedrock just has Gemma3 at the time of this writing

But the pricing is token based, not hourly so it could be a valid option if you lake GPU

I do see some up-and-commers with Gemma4. For instance, featherless.ai has all the Gemma4 models

They have an unlimited buffet style pricing:

The $10 model looks like one could do small work - its not the model limitation that would be hard, its the 16k context limit that would make it pretty limiting for coding work. But the 32k context $25 plan would be interesting.

Looking at their status page does show they have some uptime issues:

Gemma for not coding

I’ll try OpenWebUI which runs on that host to access Ollama

I’m eager to see any reason I would want multi-agent setups for real world scenarios.

I’ll ask:

I would like to know why one would choose a multi-agent architecture (A2A, for instance) for a production setup when agents are inherently non-deterministic, hallucinate, require GPUs (so more cost), are often changing and are very hard to test. What values do they provide that would override the more controlled nature of microservice design, kubernetes with it’s controls and monitoring, and the deterministic nature of proper language based applications?

I kept the argument going, following it up with

If agents can write the code, why not just use agents to write the complicated regexp and if-then-else logic - we would have more testable code that does not require GPUs nor persistently running agents. If the answer is to add more guardrails like judges, that seems just to inflate costs further with the hosts that adding more agents on agents will improve the quality. Why not just avoid the issue an only use agents for the human language interpretation (which I agree agents are uniquely adept to process)?

Twice it timed out in roughly the same spot so we’ll leave it there.

For comparison, I fired this at Gemini Pro (both queries together):

I actually want to share what Gemini 3.1 Pro replied with as it both confirmed my hypothesis but also challenged it as well:

Gemini 3.1 Pro on the question:

You are hitting on the exact friction point that seasoned architects are wrestling with right now. The short answer is: you are mostly right. For the vast majority of traditional enterprise workloads, replacing a deterministic, highly observable Kubernetes microservice with an LLM-driven agent is an architectural downgrade. It introduces latency, massive cost overhead, and non-determinism where predictability is required.

If a workflow can be expressed as a Directed Acyclic Graph (DAG), a state machine, or a complex tree of if-then-else logic, it should be.

However, there is a specific threshold where multi-agent (A2A) architectures become necessary. Here is the engineering rationale for when and why that trade-off is actually worth the GPU tax and testing headaches.

The Limits of Deterministic Code in a Fuzzy World

Your point about using agents to write the complicated Regex and conditional logic is valid—up to a point. That works beautifully when the input data has a bounded number of variations.

Multi-agent architectures are chosen when the state space of the problem is effectively infinite or highly unstructured, making it impossible to pre-compute the routing or write enough Regex to catch the edge cases.

Example: Triaging unstructured, cross-system incident alerts. If a system relies on PagerDuty, Datadog/OTel metrics, and custom application logs, the variety of how failures manifest is boundless. Static logic breaks the moment a log format changes or a novel failure mode occurs. Agents can adapt to schema drift and contextual ambiguity at runtime in a way that compiled code simply cannot.

Why Multi-Agent (A2A) Specifically?

If you accept that you have to use an LLM to handle “fuzzy” logic, the next question is why use multiple agents instead of one large prompt. The answer parallels why we moved from monoliths to microservices: Separation of concerns.

Single agents with massive, generalized prompts degrade quickly. Their context windows get polluted, leading to the exact hallucinations you want to avoid. A2A architectures apply microservice principles to non-deterministic systems:

Scoped Context: A “Planner” agent only has prompts and context relevant to breaking down a task. A “Coder” agent only sees the specific function it needs to write.
Constrained Output: By forcing agents to communicate through strict JSON contracts (Agent A outputs a JSON schema that Agent B consumes), you force temporary state checkpoints. This makes the system slightly easier to monitor and test than a single black-box LLM.

The “Judge” Cost Inflation

You are spot-on that adding “Judge” agents inflates costs and latency. The architectural justification for this pattern comes down to comparing the cost of the compute versus the cost of human labor, not CPU cycles.

If an A2A loop costs $0.80 in API calls and takes 45 seconds to run, it is a catastrophic failure if it’s replacing a $0.0001 deterministic Lambda function. But if that same loop is autonomously researching a zero-day vulnerability, writing a patch, and opening a PR—a task that takes a Senior Engineer four hours—that $0.80 and non-deterministic overhead is a massive win.

The Hybrid Sweet Spot

Your instinct to restrict agents solely to human language interpretation is widely considered the most mature, production-ready pattern right now.

This is exactly why bridging protocols are gaining so much traction. By using an LLM strictly as the semantic routing layer, you can parse human intent or unstructured data, and then immediately hand the execution off to deterministic, rigorously tested code via tools like Model Context Protocol (MCP) servers, n8n workflows, or Terraform/OpenTofu pipelines. The agent decides what tool to call, but the tool itself executes predictably.

Agents shouldn’t replace the control plane; they should just be a highly adaptive translation layer sitting on top of it.

Given your focus on controlled, observable infrastructure, have you experimented with exposing any of your existing deterministic tools or IaC pipelines to an LLM via MCP or similar tool-calling frameworks, rather than letting the agent write the execution logic itself?

Above was Gemini 3.1 Pro, as I mentioned. The points I think it raises are good:

If you can control/know your paths, using LLM just as the language front door makes sense (see this unhelpful agent I have running in n8n)
If your agent is about pulling signal from noise - especially if its about detecting problems (think security, fraud, production downtimes), it only makes sense to allow something that could help into the mix (that is, a false suggestion or alert on a prod monitor once in a while would be totally fine… but denying cancer coverage over hallucinations would not – how is the agent being used is very very very important).

There is something just a tiny bit creepy in the Gemini 3.1 Pro’s reply in that it clearly knows I write a lot about MCP, n8n, and OpenTofu.. those are strange things to toss into a normal reply - i felt it was rather tailoring it to me (fine, just feels a bit weird).

And I guess this rings well with me. The only “AI” systems that are catching my eyes lately are the autodetect and remediation ones I’ve seen by way of Dash in Datadog and Klaudia in Komodor. I can recall back in 2018 or so seeing BigPanda which was pretty much promising the same thing prior to full AI (I only pushed back because pricing and usage was behind a sales call experience - thanks but no thanks).

Okay, wow, I got derailed there. #SQUIRREL… the point of this section was to test Gemma4:12b (compared to say Gemini 3.1 Pro) on a topical question for which I’m strongly opinionated and see if it’s answer held true. I feel it gave a pretty solid answer, but we can see how Gemini 3.1 Pro answered a bit better.

Summary

Using a Lenovo Legion with a 12Gb nVidia RTX 5070 I managed to pull in Gemma4:12b and use it successfully to update a TUI to something actually working.

I did find network issues pushed me, at times, to use GCA and later Antigravity, but that was entirely due to remote network issues, not the model.

I’m finding it really quite solid. I’ve thrown mostly typescript at it thus far, but it has yet to fail me.

I’m leveraging two plugins for VS Code, Continue.dev and Cline. On the command line, I was sticking with just Cline. However, when having network issues, I switched to agy which is the Antigravity CLI from Google which is meant to replace the well-loved Gemini CLI.

Lastly, in asking non-coding (but about software architecture) questions, I felt the 12b model gave a very reasoned and sane reply. Compared to the Gemini 3.1 Pro, it was pretty good (though we see Pro is better, as I would expect). However, consider the scenerio of being on a plane, or in a corperate office that blocks Gemini endpoints, using Gemma4:12b would be a nice ‘trusted advisor’ (it could be my “Paw”… just saw Hoppers this weekend).