Opinionated stance
Your parser is part of your model.
If your evaluation says “the model can’t do tool calling,” first confirm you’re not measuring a brittle parser.
A recent community benchmark explicitly calls out parser sensitivity: some models looked “bad” until the parser recognized their tool syntax; others looked “better” only because the parser was overly forgiving. See: https://www.reddit.com/r/LocalLLaMA/comments/1r4ie8z/i_tested_21_small_llms_on_toolcalling_judgment/
Goals
- Support a primary tool-call format (the one you instruct the model to use)
- Provide safe fallbacks for common non-standard outputs
- Fail closed: never execute tools unless a candidate passes strict validation
- Make behavior measurable via telemetry
Two-phase architecture (do not skip)
1. Extract candidates (string parsing only; no execution)
2. Validate & normalize (allowlist tool names + schema validation)
Only phase 2 can produce an executable call.
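A minimal skeleton of the two-phase split in Python (all names are illustrative, not from any particular library; later sections sketch the bodies):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """Phase 1 output: text that merely *looks* like a tool call. Never executable."""
    raw: str
    parse_mode: str  # "primary" | "json" | "bracket"

def extract_candidates(text: str) -> list[Candidate]:
    """Phase 1: string parsing only. No allowlist, no schema, no execution."""
    ...

def validate_and_normalize(candidate: Candidate) -> dict | None:
    """Phase 2: allowlist + schema validation. Returns the normalized
    {"name": ..., "arguments": ...} dict (contract below), or None."""
    ...
```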
Normalized tool call contract
Everything should normalize into:
```json
{
  "name": "tool_name",
  "arguments": { "k": "v" }
}
```
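In Python terms, a `TypedDict` pins the contract down (the class name is illustrative):

```python
from typing import Any, TypedDict

class NormalizedToolCall(TypedDict):
    name: str                  # must be on the tool allowlist
    arguments: dict[str, Any]  # must pass the tool's argument schema
```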
Supported formats
Primary: explicit tags
Example:
```
<tool name="search_web">{"query":"..."}</tool>
```
Rule: the tag payload must be valid JSON.
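A sketch of the primary-format extractor (the tag name and regex follow the example above; adjust to your prompt's actual syntax):

```python
import json
import re

# Non-greedy body match so multiple tags yield multiple candidates.
TOOL_TAG = re.compile(r'<tool name="([A-Za-z0-9_]+)">(.*?)</tool>', re.DOTALL)

def extract_primary(text: str) -> list[dict]:
    candidates = []
    for name, payload in TOOL_TAG.findall(text):
        try:
            args = json.loads(payload)  # rule: tag payload must be JSON
        except json.JSONDecodeError:
            continue  # malformed payload is dropped, never "repaired"
        if isinstance(args, dict):
            candidates.append({"name": name, "arguments": args})
    return candidates
```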
Fallback A: strict JSON block
Example:
{"tool":"search_web","arguments":{"query":"..."}}
Rules:
- The output must be exactly one JSON object after trimming
- Parse the JSON, then validate against the tool's schema
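A sketch of Fallback A; note that `json.loads` on the trimmed string already rejects trailing text, which enforces the "exactly one JSON object" rule:

```python
import json

def extract_json_block(text: str) -> dict | None:
    stripped = text.strip()
    try:
        obj = json.loads(stripped)  # raises on anything beyond one JSON value
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    name, args = obj.get("tool"), obj.get("arguments")
    if not isinstance(name, str) or not isinstance(args, dict):
        return None
    return {"name": name, "arguments": args}  # still a candidate: schema check pending
```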
Fallback B: bracket/function style
Example:
```
[get_weather(city="Antwerp")]
```
Rule: treat this as a candidate only; it must still pass schema validation.
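Bracket style needs a real parser, never `eval`. One option (a sketch, not the only approach) is Python's `ast` module, which parses the call expression without executing it; only keyword arguments with literal values are accepted:

```python
import ast
import re

# Matches [tool_name(...)] spanning the whole (trimmed) snippet.
BRACKET_CALL = re.compile(r"^\[([A-Za-z_][A-Za-z0-9_]*)\((.*)\)\]$", re.DOTALL)

def extract_bracket_call(text: str) -> dict | None:
    m = BRACKET_CALL.match(text.strip())
    if not m:
        return None
    name, arg_src = m.group(1), m.group(2)
    try:
        node = ast.parse(f"f({arg_src})", mode="eval").body  # parses, never runs
    except SyntaxError:
        return None
    if not isinstance(node, ast.Call) or node.args:
        return None  # positional args are ambiguous; require keyword args
    args = {}
    for kw in node.keywords:
        if kw.arg is None:  # reject **kwargs-style splats
            return None
        try:
            args[kw.arg] = ast.literal_eval(kw.value)  # literals only, no names
        except ValueError:
            return None
    return {"name": name, "arguments": args}
```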
Fallback gates (must all pass)
Enable fallbacks only if:
- The model explicitly indicates tool intent (`need_tool: yes` or an internal token like `CALL_TOOL`).
- The candidate snippet is small (e.g., ≤2 KB).
- Tool name is allowlisted.
- Args pass strict schema validation (no guessing).
- Exactly one candidate is found (if multiple, fail and request a re-emit in the primary format).
This is the practical way to balance compatibility with safety.
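A gate check might look like the following sketch, assuming the third-party `jsonschema` package for schema validation; the intent markers are placeholders for whatever your prompt actually defines:

```python
import json

from jsonschema import ValidationError, validate  # assumption: jsonschema package

MAX_SNIPPET_BYTES = 2048  # the "small candidate" gate from the list above

def passes_fallback_gates(
    model_output: str,
    candidates: list[dict],
    allowlist: dict[str, dict],  # tool name -> JSON Schema for its arguments
) -> dict | None:
    """Return the single validated candidate, or None (fail closed)."""
    # Gate: explicit tool intent (markers are placeholders).
    if "CALL_TOOL" not in model_output and "need_tool: yes" not in model_output:
        return None
    # Gate: exactly one candidate; multiple -> fail and request a re-emit.
    if len(candidates) != 1:
        return None
    c = candidates[0]
    # Gate: candidate snippet is small.
    if len(json.dumps(c).encode()) > MAX_SNIPPET_BYTES:
        return None
    # Gate: allowlisted tool name.
    schema = allowlist.get(c["name"])
    if schema is None:
        return None
    # Gate: strict schema validation -- no guessing, no arg auto-fill.
    try:
        validate(instance=c["arguments"], schema=schema)
    except ValidationError:
        return None
    return c
```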
Telemetry (mandatory)
Log these fields for every tool-call attempt:
- `parse_mode`: primary | json | bracket | none
- `fallback_used`: boolean
- `candidate_count`: integer
- `schema_validation`: pass | fail (with reason)
- `tool_result_status`: ok | empty | error
Without this, parser improvements will be mistaken for model improvements.
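A sketch of the telemetry record as a structured log entry (field names follow the list above; the logging setup is illustrative):

```python
import json
import logging

logger = logging.getLogger("tool_parser")

def log_tool_call(
    parse_mode: str,          # "primary" | "json" | "bracket" | "none"
    fallback_used: bool,
    candidate_count: int,
    schema_validation: str,   # "pass" or "fail:<reason>"
    tool_result_status: str,  # "ok" | "empty" | "error"
) -> None:
    logger.info(json.dumps({
        "parse_mode": parse_mode,
        "fallback_used": fallback_used,
        "candidate_count": candidate_count,
        "schema_validation": schema_validation,
        "tool_result_status": tool_result_status,
    }))
```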
Common pitfalls
- Regex-grabbing JSON from long text (high false positives)
- Auto-filling missing args (creates “ghost successes”)
- No tool allowlist (expands execution surface)
- Fallback parsers with no telemetry (destroys interpretability)
References
- LocalLLaMA: tool-calling judgment benchmark + notes on parser sensitivity: https://www.reddit.com/r/LocalLLaMA/comments/1r4ie8z/i_tested_21_small_llms_on_toolcalling_judgment/
- llama.cpp “autoparser” discussion/PR (real-world evidence of parser-layer fixes): https://github.com/ggml-org/llama.cpp/pull/18675
- OpenAI Function Calling guide (canonical interface patterns): https://platform.openai.com/docs/guides/function-calling