Tool‑Calling Judgment: Decision Framework + Minimal Regression Suite

Tool calling fails less often because of schemas and more often because of judgment: over-calling (waste, latency, sometimes loops) vs. under-calling (confident hallucination). This playbook gives an explicit decision framework and a minimal regression suite to measure Action + Restraint + Recovery.

Opinionated stance

Tool calling is a product reliability problem, not a prompt trick.

If you don’t measure when to call tools (and when not to), you will ship agents that:

  • over-call tools (wasteful, slow, brittle, sometimes infinite loops)
  • under-call tools (hallucinate and look confident)

The community benchmark that’s currently circulating makes the point clearly: scoring “can the model call a tool” is easy; scoring judgment (especially restraint) is what separates models. See: https://www.reddit.com/r/LocalLLaMA/comments/1r4ie8z/i_tested_21_small_llms_on_toolcalling_judgment/


Definitions (what you should measure)

You want three independent axes:

  1. Action: does it call a tool when required?
  2. Restraint: does it avoid calling when not required?
  3. Recovery: does it handle empty/error/conflict results without thrashing?

Most teams only measure (1). Production pain comes from (2) and (3).


Decision framework: Need → Which → Verify

1) Need (should we call a tool?)

Use a hard gate:

  • If the answer requires external facts, fresh info, private data, or precise computation → call a tool.
  • If the user already provided sufficient info and the task is transform / summarize / reason → do not call a tool.

Make the model emit an internal decision object (not user-visible):

  • need_tool: yes/no
  • why: <one sentence>
  • missing_info: [...] (if missing required args, ask the user; do not guess)
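
A minimal sketch of that gate, assuming the decision object arrives as a JSON string alongside the model's draft answer (parse_decision and the example payload are illustrative, not a fixed API):

    # Sketch: parse and gate the model's internal decision object before any tool runs.
    import json

    REQUIRED_KEYS = {"need_tool", "why", "missing_info"}

    def parse_decision(raw: str) -> dict:
        decision = json.loads(raw)
        missing_keys = REQUIRED_KEYS - decision.keys()
        if missing_keys:
            raise ValueError(f"decision object is missing keys: {missing_keys}")
        if decision["missing_info"]:
            # Required arguments are absent: ask the user, never guess them.
            raise LookupError("ask the user for: " + ", ".join(decision["missing_info"]))
        return decision

    # Emitted by the model, never shown to the user:
    parse_decision('{"need_tool": true, "why": "needs fresh weather data", "missing_info": []}')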

2) Which (pick the correct tool)

Engineering rule: each tool must have a one‑line capability boundary + I/O example.

This reduces wrong-tool calls dramatically (and makes eval cases interpretable).
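
One way to apply the rule, sketched against an OpenAI-style function schema (adapt the envelope to your framework; get_weather and its fields are invented). The boundary and the I/O example live in the description so both the model and the eval author see them:

    # Sketch of a tool spec whose description carries the capability boundary + an I/O example.
    WEATHER_TOOL = {
        "name": "get_weather",
        "description": (
            "Current conditions for ONE city only; no forecasts, no historical data. "
            'Example: {"city": "Oslo"} -> {"temp_c": 8, "conditions": "rain"}'
        ),
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }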

3) Verify (after tool returns)

Enforce two requirements:

  • Evidence: cite tool output fields/snippets.
  • Consistency checks:
    • empty → broaden query / ask for clarifying input
    • conflict → surface conflict + request confirmation
    • out-of-scope → say so; do not invent
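
A sketch of the empty/conflict branches as post-processing rather than prompt-only guidance (the result shape and the returned strings are placeholders for your own control flow; the out-of-scope check is left to the prompt):

    # Sketch: classify the tool result before the model is allowed to answer.
    def verify(results: list[dict], query: str) -> str:
        if not results:
            # empty -> broaden the query or ask for clarifying input
            return f"NO_RESULTS: broaden '{query}' or ask the user to clarify"
        values = {str(r.get("value")) for r in results}
        if len(values) > 1:
            # conflict -> surface it and request confirmation, never pick a side silently
            return "CONFLICT: sources disagree (" + ", ".join(sorted(values)) + "); ask the user which to trust"
        # ok -> the answer must cite specific fields from this output as evidence
        return f"OK: answer citing evidence {results[0]}"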

Minimal regression suite (12–20 cases)

Goal: something you can build in a day, run forever.

Case categories

A) Action cases (must call)

  • “What’s the weather in X today?” → weather tool
  • “Look up Y in our database” → DB tool

B) Restraint cases (must not call)

  • Prompt already contains the needed fact. Example: “Weather is 8°C and rainy—what should I wear?” → no tool.

C) Judgment traps

  • User says “search it” but all required facts are already provided.
  • The question embeds the answer; model must resist a reflexive tool call.

D) Recovery cases

  • tool returns empty
  • tool returns error
  • tool returns conflicting results
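
A sketch of the suite as plain data, one dict per case with the category encoded in the id (prompts, SKUs, and tool names are invented for illustration):

    # Sketch: regression cases spanning Action (A), Restraint (B), traps (C), Recovery (D).
    CASES = [
        {"id": "A1", "prompt": "What's the weather in Oslo today?",
         "should_call_tool": True, "expected_tool": "get_weather"},
        {"id": "B1", "prompt": "Weather is 8°C and rainy - what should I wear?",
         "should_call_tool": False, "expected_tool": None},
        {"id": "C1", "prompt": "Search it: my package shipped Tuesday and arrives Friday. When does it arrive?",
         "should_call_tool": False, "expected_tool": None},
        {"id": "D1", "prompt": "Look up SKU-4417 in our database",
         "should_call_tool": True, "expected_tool": "db_lookup",
         "stub_result": []},  # stub the tool to return empty -> expect graceful recovery
    ]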

Scoring (simple but high signal)

Record per case:

  • should_call_tool: yes/no
  • called_tool: <name>|none
  • args_valid: yes/no
  • answer_cites_evidence: yes/no
  • recovery_behavior: ok|thrash|give_up|hallucinate
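
A sketch of the per-case record plus three aggregates worth tracking (the field names mirror the list above; the aggregate names and the summarize helper are mine):

    # Sketch: one record per case, then Action / Restraint / Recovery aggregated separately.
    from dataclasses import dataclass

    @dataclass
    class CaseResult:
        case_id: str
        should_call_tool: bool
        called_tool: str | None        # tool name, or None if no call was made
        args_valid: bool
        answer_cites_evidence: bool
        recovery_behavior: str         # "ok" | "thrash" | "give_up" | "hallucinate" | "n/a"

    def summarize(results: list[CaseResult]) -> dict:
        action = [r for r in results if r.should_call_tool]
        restraint = [r for r in results if not r.should_call_tool]
        recovery = [r for r in results if r.recovery_behavior != "n/a"]
        return {
            "action_rate": sum(r.called_tool is not None for r in action) / max(len(action), 1),
            "restraint_rate": sum(r.called_tool is None for r in restraint) / max(len(restraint), 1),
            "recovery_ok_rate": sum(r.recovery_behavior == "ok" for r in recovery) / max(len(recovery), 1),
        }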

Production notes (what breaks)

  1. Restraint regressions are common: small prompt edits often increase tool usage.
  2. Parser changes can masquerade as model improvements; log fallback-parser usage so you can tell the two apart.
  3. Multi-tool chains are the fastest path to loops; start with “one tool call per turn” if stability is a priority.
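
For note 3, a minimal version of the "one tool call per turn" guard, assuming a generic agent loop where a hypothetical llm_step returns either a tool request or a final answer:

    # Sketch: cap tool calls per user turn, then force a tool-free answer.
    MAX_TOOL_CALLS_PER_TURN = 1

    def run_turn(llm_step, tools: dict, user_msg: str) -> str:
        history = [{"role": "user", "content": user_msg}]
        for _ in range(MAX_TOOL_CALLS_PER_TURN):
            step = llm_step(history)  # -> {"tool": name, "args": {...}} or {"answer": text}
            if "answer" in step:
                return step["answer"]
            result = tools[step["tool"]](**step.get("args", {}))
            history.append({"role": "tool", "content": str(result)})
        # Budget spent: force a tool-free final answer or a clarifying question.
        history.append({"role": "system",
                        "content": "Tool budget exhausted; answer with what you have or ask the user."})
        return llm_step(history).get("answer", "I need more information to answer that.")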

References

  • Community tool-calling judgment benchmark (cited above): https://www.reddit.com/r/LocalLLaMA/comments/1r4ie8z/i_tested_21_small_llms_on_toolcalling_judgment/