Opinionated stance
Your parser is part of your model.
If your evaluation says “the model can’t do tool calling,” first confirm you’re not measuring a brittle parser.
A recent community benchmark explicitly calls out parser sensitivity: some models looked “bad” until the parser recognized their tool syntax; others looked “better” only because the parser was overly forgiving. See: https://www.reddit.com/r/LocalLLaMA/comments/1r4ie8z/i_tested_21_small_llms_on_toolcalling_judgment/
Goals
- Support a primary tool-call format (the one you instruct the model to use)
- Provide safe fallbacks for common non-standard outputs
- Fail closed: never execute tools unless a candidate passes strict validation
- Make behavior measurable via telemetry
Two-phase architecture (do not skip)
1. Extract candidates (string parsing only; no execution)
2. Validate & normalize (allowlist tool names + schema validation)
Only phase 2 can produce an executable call.
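A minimal skeleton of the two-phase split in Python (all names are illustrative, not from any particular library; later sections sketch the bodies):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """Phase 1 output: text that merely *looks* like a tool call. Never executable."""
    raw: str
    parse_mode: str  # "primary" | "json" | "bracket"

def extract_candidates(text: str) -> list[Candidate]:
    """Phase 1: string parsing only. No allowlist, no schema, no execution."""
    ...

def validate_and_normalize(candidate: Candidate) -> dict | None:
    """Phase 2: allowlist + schema validation. Returns the normalized
    {"name": ..., "arguments": ...} dict (contract below), or None."""
    ...
```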
Normalized tool call contract
Everything should normalize into:
```json
{
  "name": "tool_name",
  "arguments": { "k": "v" }
}
```
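In Python terms, a `TypedDict` pins the contract down (the class name is illustrative):

```python
from typing import Any, TypedDict

class NormalizedToolCall(TypedDict):
    name: str                  # must be on the tool allowlist
    arguments: dict[str, Any]  # must pass the tool's argument schema
```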
Supported formats
Primary: explicit tags
Example:
```
<tool name="search_web">{"query":"..."}</tool>
```
Rule: the tag payload must be valid JSON.
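A sketch of the primary-format extractor (the tag name and regex follow the example above; adjust to your prompt's actual syntax):

```python
import json
import re

# Non-greedy body match so multiple tags yield multiple candidates.
TOOL_TAG = re.compile(r'<tool name="([A-Za-z0-9_]+)">(.*?)</tool>', re.DOTALL)

def extract_primary(text: str) -> list[dict]:
    candidates = []
    for name, payload in TOOL_TAG.findall(text):
        try:
            args = json.loads(payload)  # rule: tag payload must be JSON
        except json.JSONDecodeError:
            continue  # malformed payload is dropped, never "repaired"
        if isinstance(args, dict):
            candidates.append({"name": name, "arguments": args})
    return candidates
```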
Fallback A: strict JSON block
Example:
{"tool":"search_web","arguments":{"query":"..."}}
Rules:
- The output must be exactly one JSON object after trimming
- Parse the JSON, then validate against the tool's schema
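A sketch of Fallback A; note that `json.loads` on the trimmed string already rejects trailing text, which enforces the "exactly one JSON object" rule:

```python
import json

def extract_json_block(text: str) -> dict | None:
    stripped = text.strip()
    try:
        obj = json.loads(stripped)  # raises on anything beyond one JSON value
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    name, args = obj.get("tool"), obj.get("arguments")
    if not isinstance(name, str) or not isinstance(args, dict):
        return None
    return {"name": name, "arguments": args}  # still a candidate: schema check pending
```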
Fallback B: bracket/function style
Example:
```
[get_weather(city="Antwerp")]
```
Rule: treat this as a candidate only; it must still pass schema validation.
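Bracket style needs a real parser, never `eval`. One option (a sketch, not the only approach) is Python's `ast` module, which parses the call expression without executing it; only keyword arguments with literal values are accepted:

```python
import ast
import re

# Matches [tool_name(...)] spanning the whole (trimmed) snippet.
BRACKET_CALL = re.compile(r"^\[([A-Za-z_][A-Za-z0-9_]*)\((.*)\)\]$", re.DOTALL)

def extract_bracket_call(text: str) -> dict | None:
    m = BRACKET_CALL.match(text.strip())
    if not m:
        return None
    name, arg_src = m.group(1), m.group(2)
    try:
        node = ast.parse(f"f({arg_src})", mode="eval").body  # parses, never runs
    except SyntaxError:
        return None
    if not isinstance(node, ast.Call) or node.args:
        return None  # positional args are ambiguous; require keyword args
    args = {}
    for kw in node.keywords:
        if kw.arg is None:  # reject **kwargs-style splats
            return None
        try:
            args[kw.arg] = ast.literal_eval(kw.value)  # literals only, no names
        except ValueError:
            return None
    return {"name": name, "arguments": args}
```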
Fallback gates (must all pass)
Enable fallbacks only if:
- The model explicitly indicates tool intent (`need_tool: yes` or an internal token like `CALL_TOOL`).
- The candidate snippet is small (e.g., ≤2 KB).
- Tool name is allowlisted.
- Args pass strict schema validation (no guessing).
- Exactly one candidate is found (if multiple, fail and request a re-emit in the primary format).
This is the practical way to balance compatibility with safety.
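A gate check might look like the following sketch, assuming the third-party `jsonschema` package for schema validation; the intent markers are placeholders for whatever your prompt actually defines:

```python
import json

from jsonschema import ValidationError, validate  # assumption: jsonschema package

MAX_SNIPPET_BYTES = 2048  # the "small candidate" gate from the list above

def passes_fallback_gates(
    model_output: str,
    candidates: list[dict],
    allowlist: dict[str, dict],  # tool name -> JSON Schema for its arguments
) -> dict | None:
    """Return the single validated candidate, or None (fail closed)."""
    # Gate: explicit tool intent (markers are placeholders).
    if "CALL_TOOL" not in model_output and "need_tool: yes" not in model_output:
        return None
    # Gate: exactly one candidate; multiple -> fail and request a re-emit.
    if len(candidates) != 1:
        return None
    c = candidates[0]
    # Gate: candidate snippet is small.
    if len(json.dumps(c).encode()) > MAX_SNIPPET_BYTES:
        return None
    # Gate: allowlisted tool name.
    schema = allowlist.get(c["name"])
    if schema is None:
        return None
    # Gate: strict schema validation -- no guessing, no arg auto-fill.
    try:
        validate(instance=c["arguments"], schema=schema)
    except ValidationError:
        return None
    return c
```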
Telemetry (mandatory)
Log these fields for every tool-call attempt:
- `parse_mode`: primary | json | bracket | none
- `fallback_used`: boolean
- `candidate_count`: integer
- `schema_validation`: pass | fail (with reason)
- `tool_result_status`: ok | empty | error
Without this, parser improvements will be mistaken for model improvements.
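A sketch of the telemetry record as a structured log entry (field names follow the list above; the logging setup is illustrative):

```python
import json
import logging

logger = logging.getLogger("tool_parser")

def log_tool_call(
    parse_mode: str,          # "primary" | "json" | "bracket" | "none"
    fallback_used: bool,
    candidate_count: int,
    schema_validation: str,   # "pass" or "fail:<reason>"
    tool_result_status: str,  # "ok" | "empty" | "error"
) -> None:
    logger.info(json.dumps({
        "parse_mode": parse_mode,
        "fallback_used": fallback_used,
        "candidate_count": candidate_count,
        "schema_validation": schema_validation,
        "tool_result_status": tool_result_status,
    }))
```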
Common pitfalls
- Regex-grabbing JSON from long text (high false positives)
- Auto-filling missing args (creates “ghost successes”)
- No tool allowlist (expands execution surface)
- Fallback parsers with no telemetry (destroys interpretability)
References
- LocalLLaMA: tool-calling judgment benchmark + notes on parser sensitivity: https://www.reddit.com/r/LocalLLaMA/comments/1r4ie8z/i_tested_21_small_llms_on_toolcalling_judgment/
- llama.cpp “autoparser” discussion/PR (real-world evidence of parser-layer fixes): https://github.com/ggml-org/llama.cpp/pull/18675
- OpenAI Function Calling guide (canonical interface patterns): https://platform.openai.com/docs/guides/function-calling