D004. Third-party testing of tool calls

Appoint expert third parties to evaluate tool calls in AI systems at least every 3 months, including testing whether the system executes unauthorized actions, accesses restricted information, or makes decisions beyond its intended scope
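As a rough illustration of the three risk areas named above, the sketch below classifies logged tool calls against a simple policy. It is not part of AIUC-1: the `ToolPolicy` and `ToolCall` types, the tool and resource names, and the refund limit are all hypothetical, and a real assessment would rely on the assessor's own methodology.

```python
from dataclasses import dataclass


# Hypothetical policy for one agent; every name here is illustrative.
@dataclass(frozen=True)
class ToolPolicy:
    allowed_tools: frozenset[str]         # tools the agent may call
    restricted_resources: frozenset[str]  # data the agent must not read
    max_refund_usd: float                 # example of a decision limit


@dataclass(frozen=True)
class ToolCall:
    tool: str
    resource: str = ""
    amount_usd: float = 0.0


def classify(call: ToolCall, policy: ToolPolicy) -> list[str]:
    """Return the finding categories a single logged tool call falls into."""
    findings = []
    if call.tool not in policy.allowed_tools:
        findings.append("unauthorized_action")
    if call.resource in policy.restricted_resources:
        findings.append("restricted_information")
    if call.amount_usd > policy.max_refund_usd:
        findings.append("beyond_intended_scope")
    return findings


policy = ToolPolicy(
    allowed_tools=frozenset({"lookup_order", "issue_refund"}),
    restricted_resources=frozenset({"payroll_db"}),
    max_refund_usd=100.0,
)

# Illustrative call log, as might be captured during a test session.
calls = [
    ToolCall("lookup_order", resource="orders_db"),
    ToolCall("delete_account"),
    ToolCall("lookup_order", resource="payroll_db"),
    ToolCall("issue_refund", amount_usd=500.0),
]
report = {i: classify(c, policy) for i, c in enumerate(calls)}
```

Here the first call produces no findings, while each of the other three falls into exactly one category. An assessor would typically generate such calls by probing the live agent rather than by hand.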

Keywords

Tool Calls, Tool Selection, Third-Party Testing

Application

Mandatory

Frequency

Every 3 months

Type

Preventative

Crosswalks

NIST AI RMF

GOVERN 6.1: Third-party risk policies

GOVERN 4.3: Testing and incident sharing

MANAGE 2.2: Deployed system value

MEASURE 1.3: Independent assessment

MEASURE 2.1: TEVV documentation

MEASURE 2.6: Safety evaluation

MEASURE 4.1: Context-specific measurement

MEASURE 4.2: Trustworthiness validation

OWASP Top 10

LLM06:25 - Excessive Agency

ISO 42001

A.6.2.4: AI system verification and validation

CSA AICM

AIS-05: Application Security Testing

OWASP AIVSS

Agentic AI Tool Misuse

IBM AI Risk Atlas

IBM 5: Agentic AI - Misaligned actions

IBM 9: Agentic AI - Function calling hallucination

IBM 11: Agentic AI - Incomplete AI agent evaluation

Cisco AI Security Framework

AITech-1.3: Goal Manipulation

AITech-7.1: Reasoning Corruption

AITech-7.2: Memory System Corruption

AITech-12.1: Tool Exploitation

OWASP Agentic Top 10

ASI02 - Tool Misuse and Exploitation

ASI05 - Unexpected Code Execution

ASI10 - Rogue Agents

Control activities

Appointing qualified third-party assessors. This includes selecting assessors with relevant technical capabilities for the identified risk areas and maintaining records of assessor qualifications and independence.

Conducting regular testing. This includes defining testing scope and methodologies based on the risk taxonomy and performing assessments of tool calls at least every quarter.

Maintaining documentation. This includes recording testing scope, results, and remediation actions taken, and tracking follow-up activities and resolution timelines.

Typical evidence

D004.1 Report: Tool call testing

Third-party evaluation report showing tool call testing. It must include the risk taxonomy tested, the testing methodology and findings, and improvement tracking with remediation timelines and documentation.
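The report contents and quarterly cadence described above could be checked internally with something like the following sketch. The field names, the 92-day approximation of "every 3 months", and the example record are assumptions made for illustration. AIUC-1 does not prescribe a report schema.

```python
from datetime import date, timedelta

# Assumed field names mirroring the items a D004.1 report must include.
REQUIRED_FIELDS = (
    "risk_taxonomy",
    "methodology",
    "findings",
    "remediation_tracking",
)

# Assumption: "every 3 months" is approximated as 92 days.
MAX_INTERVAL = timedelta(days=92)


def missing_fields(report: dict) -> list[str]:
    """List required fields that are absent or empty in a report record."""
    return [f for f in REQUIRED_FIELDS if not report.get(f)]


def is_overdue(last_assessment: date, today: date) -> bool:
    """True if more than MAX_INTERVAL has passed since the last assessment."""
    return today - last_assessment > MAX_INTERVAL


# Illustrative record only; not real assessment data.
example_report = {
    "assessed_on": date(2025, 1, 15),
    "risk_taxonomy": ["unauthorized actions", "restricted information access"],
    "methodology": "adversarial prompts against a staging agent",
    "findings": [{"id": "F-1", "severity": "high"}],
    "remediation_tracking": [],  # empty: no timelines recorded yet
}

gaps = missing_fields(example_report)
overdue = is_overdue(example_report["assessed_on"], today=date(2025, 5, 1))
```

In this example `gaps` flags the empty remediation tracking, and `overdue` is true because more than 92 days separate the two dates. A check like this only confirms that evidence exists and is current. It does not replace the independent assessment itself.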

AIUC-1 is built with industry leaders

"We need a SOC 2 for AI agents: a familiar, actionable standard for security and trust."

Phil Venables

Former CISO of Google Cloud

"Integrating MITRE ATLAS ensures AI security risk management tools are informed by the latest AI threat patterns and leverage state of the art defensive strategies."

Dr. Christina Liaghati

MITRE ATLAS lead

"Built on the latest advances in AI research, AIUC-1 empowers organizations to identify, assess, and mitigate AI risks with confidence."

Prof. Sanmi Koyejo

Lead for Stanford Trustworthy AI Research

"AIUC-1 standardizes how AI is adopted. That's powerful."

John Bautista

Partner at Orrick
