You’re 30 minutes from a production deployment. You have a 50MB production JSON blob, and you just need to extract one field.
You reach for your AI assistant… and you stop.
Are you really about to paste sensitive, internal production data into a third-party API?
So, what’s the alternative? Spend 10 minutes manually wrestling with jq syntax, breaking your flow and killing your velocity?
This exact moment highlights a broken choice developers face daily: sacrifice privacy for productivity, or sacrifice productivity for privacy.
I set out to build a tool that fixes this. My goal: a CLI that translates natural language into valid jq syntax, running 100% locally.
What began as a simple project quickly became a deep exploration of model control. This post details that technical journey and the resulting architecture. It’s a Proof-of-Concept for building reliable, private, and context-aware AI tools that actually work.
Phase 1: The Privacy Wall & Syntactical Hallucination
The first prototype was a simple wrapper around upstream APIs. The problem was immediate: sending potentially sensitive production JSON to a third-party API is a security and privacy nightmare. And even if you are willing to send your data to ChatGPT, Claude, or Gemini, you still face a real chance of getting a wrong query, because the LLM can simply invent a function name or a field that doesn’t exist in your actual data.
To address the privacy issue, I went local-first: a local model served via Ollama. This solved the privacy concern, but exposed a deeper, more critical failure: syntactical hallucinations increased compared to the heavier vendor-hosted models. The models that fit on a typical laptop are small and often weak at reasoning. I settled on Gemma3:4b for my experiments, since it offered good instruction following, reasonable reasoning capabilities, and acceptable speed. But after the switch, any request above trivial complexity produced an incorrect jq expression. It was a dead end.
Phase 2: From “Prompting” to “Controlling”
I tried several approaches to fight this problem: prompt engineering, few-shot learning, and context injection. I even implemented few-shot Chain-of-Thought prompting, but found it hard to generalize effectively for Gemma3. These experiments made the core issue clear: the problem wasn’t the prompt; it was the fundamental lack of control. I needed to force the model to be deterministic.
This realization led me to research how “JSON Mode” and other constrained-output features actually work at a low level. After nearly giving up, I discovered that the answer lies in GBNF (GGML Backus–Naur Form).
GBNF is a formal grammar passed to the model at inference time. It acts as a set of guardrails, forcing the LLM to only select tokens that conform to the defined syntax.
This was the “aha” moment. I didn’t need a better prompt; I needed a grammar.
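To make the mechanism concrete, here is a toy sketch of the idea. This is my illustration, not llama.cpp’s actual implementation, which is far more general: at every decoding step the grammar acts as a mask, and the sampler may only pick tokens that keep the output valid. The “model” here is just a ranked token list per step, and the “grammar” is a tiny state machine accepting jq-like paths of the form `.field(.field)*`.

```python
# Toy sketch of grammar-constrained decoding (not llama.cpp's real engine).
FIELDS = {"users", "name", "items", "id"}

def allowed(state):
    """Tokens the grammar permits in the current state."""
    if state == "start":
        return {"."}            # a path must begin with '.'
    if state == "after_dot":
        return FIELDS           # '.' must be followed by a real field
    if state == "after_field":
        return {".", "<eos>"}   # extend the path or stop
    raise ValueError(state)

def advance(state, token):
    """State transition after emitting `token`."""
    return "after_dot" if token == "." else "after_field"

def constrained_decode(per_step_rankings):
    """Greedy decoding with the grammar mask applied at every step."""
    out, state = [], "start"
    for ranking in per_step_rankings:
        # pick the highest-ranked token that the grammar still allows
        token = next(t for t in ranking if t in allowed(state))
        if token == "<eos>":
            break
        out.append(token)
        state = advance(state, token)
    return "".join(out)

# The fake model repeatedly "prefers" hallucinated tokens ('usrs', 'nme');
# the grammar mask silently discards them.
rankings = [
    ["usrs", ".", "users"],
    ["usrs", "users", "name"],
    [".", "<eos>"],
    ["nme", "name", "id"],
    ["<eos>", "."],
]
print(constrained_decode(rankings))  # prints: .users.name
```

Even when the model’s top choice is a hallucination, the output can only ever be a string the grammar accepts.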
This discovery led to another technical hurdle. Ollama, for valid security reasons, does not expose the GBNF API. To get the granular control I needed, I had to pivot again. I moved to llama-cpp-python, which provides direct, low-level access to the model and its inference parameters.
Phase 3: The Architecture of a Reliable Tool
This research-driven pivot resulted in the current architecture. It’s a robust technical showcase built on three key components.
1. Local-First Inference (llama-cpp-python)
This is the foundation. By using llama-cpp-python directly, the tool runs entirely on-device. This guarantees privacy and provides the low-level API access that a higher-level abstraction like Ollama conceals.
2. Dynamic Context-Awareness
A query is useless if the model doesn’t know the schema. The tool first reads the JSON from stdin and uses the genson library to dynamically generate a JSON Schema.
import sys
import json
import genson

# Read all JSON from stdin
json_in = sys.stdin.read()

# Build a JSON Schema describing the input's structure
builder = genson.SchemaBuilder()
builder.add_object(json.loads(json_in))
json_schema = builder.to_schema()
This schema is then injected directly into the prompt. The model isn’t guessing field names; it’s reading a machine-generated blueprint of the exact data it’s being asked to query. This approach is still not ideal, since the schema can get large for complex JSON structures, but it’s a solid first step. I have several ideas for optimizing it in future iterations, such as schema summarization or selective field inclusion.
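To show how the schema reaches the model, here is a sketch of the injection step. The `build_prompt` helper and its wording are my illustration, not the tool’s actual internals:

```python
import json

def build_prompt(json_schema: dict, request: str) -> str:
    """Embed the machine-generated schema directly in the prompt."""
    return (
        "You translate natural-language requests into jq queries.\n"
        "The input JSON conforms to this JSON Schema:\n"
        f"{json.dumps(json_schema, indent=2)}\n\n"
        f"Request: {request}\n"
        "jq query:"
    )

# A schema like the one genson would emit for {"users": [{"name": "a"}]}
schema = {
    "type": "object",
    "properties": {
        "users": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"name": {"type": "string"}},
            },
        }
    },
}
print(build_prompt(schema, "get every user's name"))
```

The model now sees the exact field names present in the data, which removes the main source of made-up paths.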
3. Deterministic Output (GBNF)
This is the core of the solution. I intentionally wrote a stripped-down jq GBNF grammar that defines a focused subset of the jq query language.
This is a critical, pragmatic decision. The goal isn’t to create an all-powerful jq expert, but to build a tool that a small, local model can use reliably. It’s designed to help developers and DevOps engineers filter, extract, and perform simple transformations on complex, nested schemas like Kubernetes CRDs or AWS CLI outputs. By reducing the grammar’s complexity, the model can focus on mastering a smaller set of operations without getting overwhelmed.
A small part of the grammar looks like this:
root ::= expression
# The core of jq is a pipeline of expressions
expression ::= ws term (ws "|" ws term)* ws
# A term can be a simple value/path or a more complex operation
term ::=
    (if-expression |
     assignment-expression |
     logic-or) (ws "as" ws variable)?
...
This grammar is loaded and passed to the model at runtime. The result is that the LLM’s output is guaranteed to be syntactically valid jq.
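Wiring this together with llama-cpp-python looks roughly like the sketch below. The embedded grammar text is a heavily trimmed stand-in for the full jq.gbnf, and MODEL_PATH is a placeholder; `LlamaGrammar.from_string` and the `grammar=` parameter of the completion call are the library’s actual grammar hooks:

```python
import os

# Heavily trimmed stand-in for the real jq.gbnf grammar
JQ_GRAMMAR = r'''
root ::= expression
expression ::= ws term (ws "|" ws term)* ws
term ::= "." ident?
ident ::= [a-zA-Z_] [a-zA-Z0-9_]*
ws ::= " "*
'''

MODEL_PATH = "models/gemma-3-4b-it-Q4_K_M.gguf"  # placeholder path

def translate(prompt: str) -> str:
    """Run grammar-constrained completion against a local GGUF model."""
    from llama_cpp import Llama, LlamaGrammar  # imported lazily

    grammar = LlamaGrammar.from_string(JQ_GRAMMAR)
    llm = Llama(model_path=MODEL_PATH, n_ctx=8192, verbose=False)
    out = llm(prompt, grammar=grammar, max_tokens=128, temperature=0.0)
    return out["choices"][0]["text"].strip()

# Only run inference if a model file is actually present
if os.path.exists(MODEL_PATH):
    print(translate("Translate to jq: list all user names\njq query:"))
```

Every token the model emits is checked against the grammar during sampling, so the completion that comes back is valid jq by construction.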
This is not a silver bullet. A syntactically valid query can still be logically incorrect or fail to produce the user’s intended results. But what it does do is completely eliminate syntactical hallucinations. It cannot produce a “close enough” answer or an invalid string. It is forced to produce valid jq every time.
From Concept to Tool: The Path Forward
This jq translator is a robust concept, and the project is actively in progress. It’s not a polished, downloadable release yet, but the core architecture is sound and demonstrates a clear path to a real solution.
More importantly, it’s a demonstration of a philosophy. In a tech scene obsessed with the easy path (just wrapping the next big OpenAI API), this project explores an alternative that has been gaining traction: small language models running locally, solving specific problems while preserving privacy and security.
I’m continuing to work on this project and refine the context-injection pipeline. My goal is to turn this concept into a tool for any DevOps engineer or developer who needs to work with complex JSON data securely and efficiently.
You can follow the progress and explore the full repository, including the jq.gbnf grammar, here.