Building a Zero-Cost, Local AI Personal Assistant with Gemma 4 — driven by one simple truth: cloud AI is genuinely great, but at 3–4 million tokens a week it's genuinely expensive. Here's how I built a fully private, autonomous alternative that costs nothing to run.
My path wasn't a straight line. Here's the honest timeline of what actually happened.
| Model | Code Gen | Tool Calling | Speed (M3 Pro) | Verdict |
|---|---|---|---|---|
| Llama 3.1 8B | ✅ Good | ❌ Breaks | ✅ Fast | Too small |
| Qwen 2.5 14B | ✅ Good | ⚠️ Needs prompting | ✅ OK | Close, not reliable |
| Qwen3 14B / 32B | ✅ Strong | ⚠️ Brittle | ❌ Slow | Too much prompt work |
| Qwen3-Coder 30B | ⭐ Excellent | ❌ Weak reasoning | ⚠️ Moderate | Wrong tool for agents |
| Gemma 4 27B ⭐ | ✅ Strong | ✅ Native, reliable | ✅ Good | Current — winner |
Three components. Zero API keys. Complete sovereignty.
Ollama spins up optimized, quantized model builds natively on local hardware. No GPU meltdown.
Running the 27B-parameter variant (gemma4:27b in Ollama) on a MacBook Pro M3 Pro with 32GB of unified memory. Comfortable at ~8–12 tok/sec for background agent work. See the Modelfile below for the exact parameters I use.
Open WebUI is the frontend and function-calling layer. It registers custom tools, giving the LLM hands to interact with my environment.
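In Open WebUI, a tool is just a Python module with a `Tools` class; each public method (its type hints and docstring drive the schema the model sees) becomes a callable function. A minimal sketch, with a hypothetical JQL-builder method standing in for my real Jira tooling:

```python
# Minimal sketch of an Open WebUI tool module. Open WebUI exposes every
# public method on a `Tools` class as a callable function for the model.
# The method below is illustrative, not one of my actual seven tools.

class Tools:
    def build_jql(self, project: str, status: str = "Open") -> str:
        """Build a JQL query for tickets in a project.

        :param project: Jira project key, e.g. "GENAI"
        :param status: ticket status to filter on
        """
        return f'project = {project} AND status = "{status}" ORDER BY updated DESC'
```

Once the module is registered in Open WebUI's tool workspace, the model can call `build_jql` directly instead of guessing at query syntax.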
Scheduled automation. Jarvis works in the background — I just review the output.
Model Context Protocol connects Jarvis to live Atlassian (Jira + Confluence), Slack, and GitHub — giving it real-time read/write access without any data leaving the local chain.
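MCP clients typically declare their servers in a small JSON config. The exact file location and server commands vary by client, so the entries below are placeholders showing the shape of the wiring, not my real setup:

```json
{
  "mcpServers": {
    "atlassian": { "command": "npx", "args": ["-y", "some-atlassian-mcp-server"] },
    "slack":     { "command": "npx", "args": ["-y", "some-slack-mcp-server"] },
    "github":    { "command": "npx", "args": ["-y", "some-github-mcp-server"] }
  }
}
```

Each server process runs locally, so the read/write traffic to Jira, Confluence, Slack, and GitHub never routes through a third-party AI service.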
The temperature of 0.3 is the single most impactful setting for reliable tool calling — most people leave it at 0.8+ and then wonder why their agent hallucinates tool arguments. Lower is better for agents.
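That setting lives in the Ollama Modelfile. This is an illustrative version, not my exact file: only the temperature (0.3) comes from the discussion above, while the base tag and context window are assumptions you should swap for your own:

```
# Illustrative Modelfile. Only temperature 0.3 is the documented choice;
# the base tag and num_ctx are assumptions to adapt to your setup.
FROM gemma4:27b
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
```

Build it once with `ollama create jarvis -f Modelfile` and the low temperature is baked into every session.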
An LLM isolated in a terminal is just a calculator. These Python scripts and API wrappers are what transform it into My Jarvis — a personal assistant built specifically around how I work.
The 7 tools above are version 1 of My Jarvis — built in spare time, around a single use case. Working in the Gen AI product space at Elastic, I sit at the intersection of AI capability and real product delivery every day. The possibilities I can see from here are enormous.
The constraint isn't the model. The constraint is time to build. Every tool I add makes Jarvis exponentially more useful — and this is still chapter one.
Every agent has a system prompt. Most people use two lines. Mine is an operations manual — and the difference shows. This is the actual Open WebUI system prompt that defines how My Jarvis thinks, decides, and acts on every single message.
"The model is the engine. The system prompt is the driver. Get the driver wrong and it doesn't matter how fast the engine is."
— Sanjay V.

Without this rule, LLMs narrate everything: "I will now search Jira for tickets matching your query." Useless. Forcing immediate tool use cuts response latency and removes the friction of watching the model think out loud instead of just doing the work.
Reads are free — no confirmation needed for searching, fetching, or analyzing. Writes require exactly one confirmation. This balance keeps the agent fast for queries while protecting against accidental mutations in real production systems.
Hallucination killer. When the model is explicitly instructed that tool-returned data is ground truth — not a suggestion to reason over — it stops fabricating ticket states, ticket counts, and field values. This one rule made agent reliability jump noticeably.
Local models default to writing Python scripts when they're unsure how to call a tool. Explicitly banning this forces correct tool use — and prevents the hallucinated-script rabbit hole where the model writes code to simulate an action instead of just calling the function.
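Condensed into prompt language, the four rules above look something like this. This is an illustrative paraphrase, not the verbatim system prompt:

```
- Act immediately: call the relevant tool first. Never narrate what you
  are about to do.
- Reads are free. Any write (create, update, delete) requires exactly one
  user confirmation before executing.
- Tool output is ground truth. Never invent or adjust ticket states,
  counts, or field values beyond what a tool returned.
- Never write a script to simulate an action. If a tool exists for the
  task, call the tool.
```

Short, imperative rules like these survive long contexts better than paragraphs of prose guidance.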
A common question: "If it's zero cost, what tokens is it using?" The answer is — it's all my tokens. No third-party AI billing. Every credential here is an OAuth token or API key I personally own and control.
The Slack bot needs only `channels:read`, `chat:write`, and DM scopes, all free within Slack's standard plan.

A capability benchmark — honest, not hype.
A brutally honest breakdown for anyone making this decision.
| Dimension | ☁️ Cloud AI (GPT-4, Claude…) | 🏠 Local AI (Gemma 4 + Ollama) |
|---|---|---|
| 💸 Monthly Cost | ✗ $20–$200+/mo per seat | ✓ $0.00 — truly zero |
| 🔒 Data Privacy | ✗ Data sent to 3rd-party servers | ✓ 100% on-device, never leaves machine |
| 🔑 Token Ownership | ✗ You pay per token to their cloud | ✓ Only your own service OAuth tokens |
| 🌐 Internet Required | ✗ Yes — offline = broken | ✓ No — fully air-gapped capable |
| ⚡ Latency | ~ Network round-trip overhead | ✓ Near-zero local inference |
| 🧠 Raw Intelligence | ✓ Best-in-class for complex reasoning | ~ 85–92% parity for most daily tasks |
| 🔧 Customization | ~ Limited by API constraints | ✓ Full access — modify everything |
| 📄 Compliance Risk | ✗ Potential violations with sensitive data | ✓ Zero compliance risk |
| 🖥️ Hardware Needed | ✓ Any device with a browser | ~ High-end Mac or AI PC recommended |
| 🔄 Tool Orchestration | ✓ Mature plugin ecosystem | ✓ Open WebUI + custom Python tools |
| 🚀 Sequential Reasoning | ✓ Excellent across all tasks | ✓ Excellent with Gemma 4 (new!) |
"I can feed it highly sensitive internal tickets, unreleased roadmaps, and architecture diagrams without violating a single compliance policy."
— Sanjay V.

I'm not going to tell you this is perfect. Here's what I actually ran into — because the failures are as instructive as the wins.
I asked Jarvis to do something that felt totally reasonable: cross-reference 47 open Jira stories across three Gen AI epics, identify thematic overlaps, flag any duplicate efforts, and suggest a consolidation strategy for our Q3 roadmap.
It started strong. Fetched all 47 tickets. Correctly grouped the first two epics. Then — about 60% of the way through the third epic — it quietly lost the thread. It began conflating two entirely separate stories, attributed work from Epic A to Epic C, and delivered a consolidation recommendation that was confident, well-formatted, and wrong.
I caught it because I know these tickets. A stakeholder reading the output wouldn't have.
These aren't generic "wouldn't it be cool if" ideas. Each one is the direct next step from something already running — a specific gap I've hit in my workflow, and the exact local tool that fills it.
Moving from chat to ambient computing — where you don't prompt the AI, the AI just runs your world. Let me make this concrete, not philosophical.
"In the 0-token era, you don't prompt the AI. The AI lives in the background of your OS — observing your screen, your calendar, your Slack, your backlog — pre-computing solutions proactively."
— Sanjay V.

It's 8:47am. You haven't opened your laptop yet.
Every tool and integration in this stack already exists. The vision below is what happens when you add a scheduler and a few webhooks. This is where I'm taking it.
The GitHub-to-Jira sync is already doing a primitive version of this on demand. What's missing is the scheduling layer and the trigger wiring — not the AI capability. The model can already do all of it.
That's not a chatbot. That's ambient infrastructure. And this stack is one cron job away from it.
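Concretely, that one cron job can start as a single crontab entry. The script path and schedule here are placeholders for whatever entry point drives the tools:

```
# Run the morning briefing pipeline at 07:30 every weekday.
# morning_brief.py is a placeholder name for the pipeline entry point.
30 7 * * 1-5 /usr/bin/python3 /opt/jarvis/morning_brief.py >> /var/log/jarvis.log 2>&1
```

Everything downstream of that line is the same tool stack already described; only the trigger changes from "me typing a prompt" to "the clock".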
My Jarvis is personal by design — it knows my workflows, my tokens, my quirks. But the architecture is completely universal. Lately I've been thinking seriously about how to package this so others can spin up their own version in under an hour, without needing to understand everything under the hood.
"The goal isn't to give everyone my Jarvis. It's to give everyone a Jarvis that's theirs — tuned to their tools, their role, their data. Mine is just the blueprint."
— Sanjay V.

A single `docker-compose up` that spins up Ollama (with Gemma 4 27B pre-pulled), Open WebUI, and the Python tool server — all pre-wired. You bring your .env file with your tokens. Everything else just works.
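A sketch of what that compose file could look like. The image tags and the tool-server build context are assumptions to pin down before relying on this:

```yaml
# Sketch of the one-command stack. Image tags and the ./tools build
# context are assumptions, not a tested release.
services:
  ollama:
    image: ollama/ollama
    volumes: ["ollama:/root/.ollama"]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports: ["3000:8080"]
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on: [ollama]
  tools:
    build: ./tools        # the custom Python tool server
    env_file: .env        # your personal OAuth tokens / API keys
volumes:
  ollama:
```

The only machine-specific pieces are the .env file and the model pull; everything else is identical across installs, which is what makes the starter-kit idea viable.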
Not everyone's Jarvis should look the same. I'm thinking about packaging three variants: an IT PO Kit (Jira + Slack + GitHub tools pre-built), an Engineering Manager Kit (PR reviews, standup drafts, incident triage), and a Solo Builder Kit (writing, research, and task management). Each is a curated set of tools on top of the same core stack.
Open WebUI has a native tool/plugin system. The cleanest packaging path is publishing My Jarvis tools as an Open WebUI plugin bundle — one install, all tools registered automatically. Others can install it, add their own API tokens, and start running the same automations I use. No Python setup required.
A single powerful machine (Mac Studio, workstation, or a mini server) running Ollama + Open WebUI as a shared local service. Team members connect via browser — no local setup needed on their laptops at all. One model, shared inference, all data stays on-premises. Works especially well for teams that need the same automations and don't want each person running their own hardware.
Publish the full tool library as a public GitHub repo alongside this blog post — Confluence tool, Jira tool, GitHub sync, Slack bot, Scrum Analytics, Modelfile, system prompt, and the PO Skill. A real, working starter kit that anyone can fork, wire up with their own tokens, and run in under an hour.
The blog makes the case. The repo makes it real. And every person who builds a new tool on top of it — for Linear, Notion, Salesforce, whatever — makes the whole kit better. An "awesome-jarvis-tools" list waiting to happen.
My Jarvis is deeply personal — it knows my Jira projects, my Slack workspace, my standup format, my writing voice. The more useful it becomes for me, the more specific to me it is. That's a good problem to have.
But the tools underneath are universal. Everyone has a backlog. Everyone has a daily update. Everyone has stakeholders who want to know what shipped last week. The packaging question is really: how do I give people the 80% without requiring them to rebuild the 20% that's uniquely mine?
That's what I'm figuring out next. If you're thinking about this too — let's talk.
Connect on LinkedIn — let's build this together.

This setup acts as a massive force multiplier. For the first time, I have a personal assistant that is highly capable, infinitely customizable, and 100% private — running entirely on my own hardware with no monthly bill, no compliance anxiety, and no tokens I don't own.
The access model is simple: every credential is mine. Jira OAuth, Slack bot token, GitHub PAT — all personal, all scoped, all stored locally in a .env file that never touches git. The LLM never sees raw credentials. The data stays on my machine.
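The credential flow can be as simple as loading the .env file into the process environment at startup and having each tool read its own token at call time. A minimal sketch; the variable names are placeholders, not my actual keys:

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Load KEY=VALUE lines from a local .env file into os.environ.

    The file stays on disk and gitignored; the LLM only ever sees tool
    *results*, never these raw values.
    """
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Inside a tool, read the token at call time:
# token = os.environ["JIRA_OAUTH_TOKEN"]   # placeholder variable name
```

Because tools read tokens from the environment rather than receiving them as arguments, credentials never appear in the model's context window or conversation logs.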
The gap between local and cloud AI is real, but it's narrowing fast. For my daily grind, Gemma 4 + Ollama + Open WebUI is not a compromise. It's the upgrade. And the roadmap — voice, memory, vision, multi-agent, fine-tuning — is all achievable without ever giving a cloud AI company a single dollar or a single byte of my private data.