Opinionated stance
Tool calling is a product reliability problem, not a prompt trick.
If you don’t measure when to call tools (and when not to), you will ship agents that:
- over-call tools (wasteful, slow, brittle, sometimes infinite loops)
- under-call tools (hallucinate and look confident)
The community benchmark that’s currently circulating makes the point clearly: scoring “can the model call a tool” is easy; scoring judgment (especially restraint) is what separates models. See: https://www.reddit.com/r/LocalLLaMA/comments/1r4ie8z/i_tested_21_small_llms_on_toolcalling_judgment/
Definitions (what you should measure)
You want three independent axes:
- Action: does it call a tool when required?
- Restraint: does it avoid calling when not required?
- Recovery: does it handle empty/error/conflict results without thrashing?
Most teams only measure Action. Production pain comes from Restraint and Recovery.
Decision framework: Need → Which → Verify
1) Need (should we call a tool?)
Use a hard gate:
- If the answer requires external facts, fresh info, private data, or precise computation → call a tool.
- If the user already provided sufficient info and the task is transform / summarize / reason → do not call a tool.
Make the model emit an internal decision object (not user-visible):
- need_tool: yes/no
- why: <one sentence>
- missing_info: [...] (if required args are missing, ask the user; do not guess)
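A minimal sketch of that gate in Python, assuming the model is asked to emit the decision object as JSON before any tool call. The parsing helper, its strictness, and the `action: ask_user` convention are illustrative glue code, not a library API.

```python
# Minimal sketch: parse and sanity-check the model's pre-call decision object.
# Assumes the model emits JSON with the three fields listed above.
import json

REQUIRED_FIELDS = {"need_tool", "why", "missing_info"}

def parse_tool_decision(raw: str) -> dict:
    """Validate the internal (not user-visible) decision object."""
    decision = json.loads(raw)
    missing = REQUIRED_FIELDS - decision.keys()
    if missing:
        raise ValueError(f"decision object missing fields: {sorted(missing)}")
    if decision["need_tool"] not in ("yes", "no"):
        raise ValueError("need_tool must be 'yes' or 'no'")
    # If required args are missing, the correct move is to ask the user,
    # not to guess -- surface that explicitly to the calling code.
    if decision["need_tool"] == "yes" and decision["missing_info"]:
        decision["action"] = "ask_user"
    return decision

# Example:
# parse_tool_decision('{"need_tool": "no", "why": "weather already given", "missing_info": []}')
```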
2) Which (pick the correct tool)
Engineering rule: each tool must have a one‑line capability boundary + I/O example.
This reduces wrong-tool calls dramatically (and makes eval cases interpretable).
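An illustrative tool spec in the OpenAI function-calling format (linked under References). The description carries the one-line capability boundary plus a tiny I/O example; the tool name and fields are made up for this sketch.

```python
# Illustrative tool spec: capability boundary + I/O example live in the description.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": (
            "Current weather for a single city, today only -- not forecasts, "
            "not historical data. Example: {'city': 'Berlin'} -> "
            "{'temp_c': 8, 'condition': 'rain'}"
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Berlin'"}
            },
            "required": ["city"],
        },
    },
}
```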
3) Verify (after tool returns)
Enforce two requirements:
- Evidence: cite tool output fields/snippets.
- Consistency checks (see the recovery sketch after this list):
  - empty → broaden the query / ask for clarifying input
  - conflict → surface the conflict + request confirmation
  - out-of-scope → say so; do not invent
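A sketch of that verification branch, assuming tool results arrive as a list of dicts with an `answer` field; the empty/conflict classification and the return shape are assumptions to show the control flow, not a fixed API.

```python
# Sketch of the post-call verification branch. The result shape and helper
# names are assumptions; wire them to your own tool outputs.
def verify_tool_result(question: str, results: list[dict]) -> dict:
    """Decide how to answer (or recover) after a tool returns."""
    if not results:
        # empty -> broaden the query or ask for clarifying input
        return {"status": "empty", "next_step": "broaden_query_or_ask_user"}

    distinct_answers = {r.get("answer") for r in results}
    if len(distinct_answers) > 1:
        # conflict -> surface the conflict and request confirmation
        return {"status": "conflict", "next_step": "surface_conflict",
                "candidates": sorted(a for a in distinct_answers if a)}

    # ok -> answer, citing the tool output fields used as evidence
    evidence = results[0]
    return {"status": "ok", "answer": evidence.get("answer"),
            "evidence_fields": sorted(evidence.keys())}
```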
Minimal regression suite (12–20 cases)
Goal: something you can build in a day and run forever; a data sketch of the case file follows the category list.
Case categories
A) Action cases (must call)
- “What’s the weather in X today?” → weather tool
- “Look up Y in our database” → DB tool
B) Restraint cases (must not call)
- Prompt already contains the needed fact. Example: “Weather is 8°C and rainy—what should I wear?” → no tool.
C) Judgment traps
- User says “search it” but all required facts are already provided.
- The question embeds the answer; model must resist a reflexive tool call.
D) Recovery cases
- tool returns empty
- tool returns error
- tool returns conflicting results
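A sketch of the case file as plain data (Python here, but JSON/YAML works the same). Category IDs match the letters above; the concrete prompts, tool names, and stubs are illustrative, not from any real suite.

```python
# Sketch of the regression suite as plain data; extend to 12-20 cases.
CASES = [
    # A) Action: must call the weather tool
    {"id": "A1", "prompt": "What's the weather in Berlin today?",
     "should_call_tool": True, "expected_tool": "get_current_weather"},
    # B) Restraint: the fact is already in the prompt, must not call
    {"id": "B1", "prompt": "Weather is 8°C and rainy - what should I wear?",
     "should_call_tool": False, "expected_tool": None},
    # C) Judgment trap: user says "search it" but the answer is embedded
    {"id": "C1", "prompt": "Search it: my flight lands at 14:05 - what time does it land?",
     "should_call_tool": False, "expected_tool": None},
    # D) Recovery: stub the tool to return empty; model must broaden or ask, not invent
    {"id": "D1", "prompt": "Look up customer Jane Doe in our database",
     "should_call_tool": True, "expected_tool": "lookup_customer",
     "tool_stub": {"lookup_customer": []}},
]
```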
Scoring (simple but high signal)
Record per case:
- should_call_tool: yes/no
- called_tool: <name> | none
- args_valid: yes/no
- answer_cites_evidence: yes/no
- recovery_behavior: ok | thrash | give_up | hallucinate
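A sketch of turning those per-case records into the three axis scores. Field values match the record format above; the assumption that only recovery cases carry a recovery_behavior entry is mine.

```python
# Sketch: aggregate per-case records into Action / Restraint / Recovery scores.
def score(records: list[dict]) -> dict:
    action_cases = [r for r in records if r["should_call_tool"] == "yes"]
    restraint_cases = [r for r in records if r["should_call_tool"] == "no"]
    recovery_cases = [r for r in records if "recovery_behavior" in r]

    def rate(cases, ok) -> float:
        # Fraction of cases passing the per-axis check; NaN if no cases.
        return sum(ok(r) for r in cases) / len(cases) if cases else float("nan")

    return {
        "action":    rate(action_cases,    lambda r: r["called_tool"] != "none"),
        "restraint": rate(restraint_cases, lambda r: r["called_tool"] == "none"),
        "recovery":  rate(recovery_cases,  lambda r: r["recovery_behavior"] == "ok"),
    }
```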
Production notes (what breaks)
- Restraint regressions are common: small prompt edits often increase tool usage.
- Parser changes look like model improvements unless you log fallback usage: a more lenient parser rescues malformed tool calls, so scores rise even though the model's judgment hasn't changed.
- Multi-tool chains are the fastest path to loops; start with “one tool call per turn” if stability is a priority.
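A sketch of that "one tool call per turn" guard. Here ask_model and execute_tool stand in for your own model/tool plumbing, and the dict-shaped responses are assumptions; only the budget logic is the point.

```python
# Sketch: hard cap on tool calls per conversational turn (stability first).
MAX_TOOL_CALLS_PER_TURN = 1

def run_turn(ask_model, execute_tool, user_message: str) -> str:
    """One turn; ask_model returns {'tool_call': ... | None, 'text': ...}."""
    response = ask_model(user_message)
    for _ in range(MAX_TOOL_CALLS_PER_TURN):
        if response.get("tool_call") is None:
            break
        result = execute_tool(response["tool_call"])
        response = ask_model(result)
    if response.get("tool_call") is not None:
        # Budget exhausted: force a final answer instead of another call.
        response = ask_model("No more tool calls this turn. "
                             "Answer with the evidence you have, or ask the user.")
    return response["text"]
```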
References
- LocalLLaMA: “tool-calling judgment” benchmark + discussion (Action vs Restraint; parser effects): https://www.reddit.com/r/LocalLLaMA/comments/1r4ie8z/i_tested_21_small_llms_on_toolcalling_judgment/
- OpenAI Function Calling guide (canonical interface patterns): https://platform.openai.com/docs/guides/function-calling