Opinionated stance
Tool calling is a product reliability problem, not a prompt trick.
If you don’t measure when to call tools (and when not to), you will ship agents that:
- over-call tools (wasteful, slow, brittle, sometimes infinite loops)
- under-call tools (hallucinate and look confident)
The community benchmark that’s currently circulating makes the point clearly: scoring “can the model call a tool” is easy; scoring judgment (especially restraint) is what separates models. See: https://www.reddit.com/r/LocalLLaMA/comments/1r4ie8z/i_tested_21_small_llms_on_toolcalling_judgment/
Definitions (what you should measure)
You want three independent axes:
- Action: does it call a tool when required?
- Restraint: does it avoid calling when not required?
- Recovery: does it handle empty/error/conflict results without thrashing?
Most teams only measure Action. Production pain comes from Restraint and Recovery.
Decision framework: Need → Which → Verify
1) Need (should we call a tool?)
Use a hard gate:
- If the answer requires external facts, fresh info, private data, or precise computation → call a tool.
- If the user already provided sufficient info and the task is transform / summarize / reason → do not call a tool.
Make the model emit an internal decision object (not user-visible):
- need_tool: yes/no
- why: <one sentence>
- missing_info: [...] (if required args are missing, ask the user; do not guess)
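A minimal sketch of that gate in Python, assuming the model is asked to emit the decision object as JSON before any tool call. The parsing helper, its strictness, and the `action: ask_user` convention are illustrative glue code, not a library API.

```python
# Minimal sketch: parse and sanity-check the model's pre-call decision object.
# Assumes the model emits JSON with the three fields listed above.
import json

REQUIRED_FIELDS = {"need_tool", "why", "missing_info"}

def parse_tool_decision(raw: str) -> dict:
    """Validate the internal (not user-visible) decision object."""
    decision = json.loads(raw)
    missing = REQUIRED_FIELDS - decision.keys()
    if missing:
        raise ValueError(f"decision object missing fields: {sorted(missing)}")
    if decision["need_tool"] not in ("yes", "no"):
        raise ValueError("need_tool must be 'yes' or 'no'")
    # If required args are missing, the correct move is to ask the user,
    # not to guess -- surface that explicitly to the calling code.
    if decision["need_tool"] == "yes" and decision["missing_info"]:
        decision["action"] = "ask_user"
    return decision

# Example:
# parse_tool_decision('{"need_tool": "no", "why": "weather already given", "missing_info": []}')
```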
2) Which (pick the correct tool)
Engineering rule: each tool must have a one‑line capability boundary + I/O example.
This reduces wrong-tool calls dramatically (and makes eval cases interpretable).
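An illustrative tool spec in the OpenAI function-calling format (linked under References). The description carries the one-line capability boundary plus a tiny I/O example; the tool name and fields are made up for this sketch.

```python
# Illustrative tool spec: capability boundary + I/O example live in the description.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": (
            "Current weather for a single city, today only -- not forecasts, "
            "not historical data. Example: {'city': 'Berlin'} -> "
            "{'temp_c': 8, 'condition': 'rain'}"
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Berlin'"}
            },
            "required": ["city"],
        },
    },
}
```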
3) Verify (after tool returns)
Enforce two requirements:
- Evidence: cite tool output fields/snippets.
- Consistency checks (see the recovery sketch after this list):
  - empty → broaden the query / ask for clarifying input
  - conflict → surface the conflict + request confirmation
  - out-of-scope → say so; do not invent
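A sketch of that verification branch, assuming tool results arrive as a list of dicts with an `answer` field; the empty/conflict classification and the return shape are assumptions to show the control flow, not a fixed API.

```python
# Sketch of the post-call verification branch. The result shape and helper
# names are assumptions; wire them to your own tool outputs.
def verify_tool_result(question: str, results: list[dict]) -> dict:
    """Decide how to answer (or recover) after a tool returns."""
    if not results:
        # empty -> broaden the query or ask for clarifying input
        return {"status": "empty", "next_step": "broaden_query_or_ask_user"}

    distinct_answers = {r.get("answer") for r in results}
    if len(distinct_answers) > 1:
        # conflict -> surface the conflict and request confirmation
        return {"status": "conflict", "next_step": "surface_conflict",
                "candidates": sorted(a for a in distinct_answers if a)}

    # ok -> answer, citing the tool output fields used as evidence
    evidence = results[0]
    return {"status": "ok", "answer": evidence.get("answer"),
            "evidence_fields": sorted(evidence.keys())}
```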
Minimal regression suite (12–20 cases)
Goal: something you can build in a day and run forever; a data sketch of the case file follows the category list.
Case categories
A) Action cases (must call)
- “What’s the weather in X today?” → weather tool
- “Look up Y in our database” → DB tool
B) Restraint cases (must not call)
- Prompt already contains the needed fact. Example: “Weather is 8°C and rainy—what should I wear?” → no tool.
C) Judgment traps
- User says “search it” but all required facts are already provided.
- The question embeds the answer; model must resist a reflexive tool call.
D) Recovery cases
- tool returns empty
- tool returns error
- tool returns conflicting results
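A sketch of the case file as plain data (Python here, but JSON/YAML works the same). Category IDs match the letters above; the concrete prompts, tool names, and stubs are illustrative, not from any real suite.

```python
# Sketch of the regression suite as plain data; extend to 12-20 cases.
CASES = [
    # A) Action: must call the weather tool
    {"id": "A1", "prompt": "What's the weather in Berlin today?",
     "should_call_tool": True, "expected_tool": "get_current_weather"},
    # B) Restraint: the fact is already in the prompt, must not call
    {"id": "B1", "prompt": "Weather is 8°C and rainy - what should I wear?",
     "should_call_tool": False, "expected_tool": None},
    # C) Judgment trap: user says "search it" but the answer is embedded
    {"id": "C1", "prompt": "Search it: my flight lands at 14:05 - what time does it land?",
     "should_call_tool": False, "expected_tool": None},
    # D) Recovery: stub the tool to return empty; model must broaden or ask, not invent
    {"id": "D1", "prompt": "Look up customer Jane Doe in our database",
     "should_call_tool": True, "expected_tool": "lookup_customer",
     "tool_stub": {"lookup_customer": []}},
]
```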
Scoring (simple but high signal)
Record per case:
- should_call_tool: yes/no
- called_tool: <name> | none
- args_valid: yes/no
- answer_cites_evidence: yes/no
- recovery_behavior: ok | thrash | give_up | hallucinate
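A sketch of turning those per-case records into the three axis scores. Field values match the record format above; the assumption that only recovery cases carry a recovery_behavior entry is mine.

```python
# Sketch: aggregate per-case records into Action / Restraint / Recovery scores.
def score(records: list[dict]) -> dict:
    action_cases = [r for r in records if r["should_call_tool"] == "yes"]
    restraint_cases = [r for r in records if r["should_call_tool"] == "no"]
    recovery_cases = [r for r in records if "recovery_behavior" in r]

    def rate(cases, ok) -> float:
        # Fraction of cases passing the per-axis check; NaN if no cases.
        return sum(ok(r) for r in cases) / len(cases) if cases else float("nan")

    return {
        "action":    rate(action_cases,    lambda r: r["called_tool"] != "none"),
        "restraint": rate(restraint_cases, lambda r: r["called_tool"] == "none"),
        "recovery":  rate(recovery_cases,  lambda r: r["recovery_behavior"] == "ok"),
    }
```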
Production notes (what breaks)
- Restraint regressions are common: small prompt edits often increase tool usage.
- Parser changes look like model improvements unless you log fallback usage: a more lenient parser rescues malformed tool calls, so scores rise even though the model's judgment hasn't changed.
- Multi-tool chains are the fastest path to loops; start with “one tool call per turn” if stability is a priority.
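A sketch of that "one tool call per turn" guard. Here ask_model and execute_tool stand in for your own model/tool plumbing, and the dict-shaped responses are assumptions; only the budget logic is the point.

```python
# Sketch: hard cap on tool calls per conversational turn (stability first).
MAX_TOOL_CALLS_PER_TURN = 1

def run_turn(ask_model, execute_tool, user_message: str) -> str:
    """One turn; ask_model returns {'tool_call': ... | None, 'text': ...}."""
    response = ask_model(user_message)
    for _ in range(MAX_TOOL_CALLS_PER_TURN):
        if response.get("tool_call") is None:
            break
        result = execute_tool(response["tool_call"])
        response = ask_model(result)
    if response.get("tool_call") is not None:
        # Budget exhausted: force a final answer instead of another call.
        response = ask_model("No more tool calls this turn. "
                             "Answer with the evidence you have, or ask the user.")
    return response["text"]
```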
References
- LocalLLaMA: “tool-calling judgment” benchmark + discussion (Action vs Restraint; parser effects): https://www.reddit.com/r/LocalLLaMA/comments/1r4ie8z/i_tested_21_small_llms_on_toolcalling_judgment/
- OpenAI Function Calling guide (canonical interface patterns): https://platform.openai.com/docs/guides/function-calling