The first standards-aligned assessment framework that measures both how well your agents perform and what they are capable of, and reports both on a single verifiable badge.
Organizations are deploying AI agents into production without any standardized way to measure, compare, or certify their quality. The result is uncertainty at every level.
There is no industry-accepted way to verify that an AI agent can reliably do what its vendor claims. Vendor benchmarks are self-reported and non-comparable.
An agent that handles many domains poorly is different from one that handles one domain brilliantly. Current evaluations collapse these into a single, meaningless number.
NIST, ISO, and IEEE have published AI frameworks, but no one has mapped agent assessment to them. Compliance teams have no certification to point to.
Performance and Capability are orthogonal. An agent can be a narrow expert with elite execution, or a versatile generalist still building reliability. Neither axis dominates.
How well does the agent execute the tasks it is given?
What range and depth of work can the agent handle?
Every score traces to observable, reproducible evidence. No subjective impressions. Code-based graders run first, model-based graders handle nuance, and human calibration ensures accuracy.
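For illustration, a minimal sketch of that tiered flow, assuming a normalized 0-to-1 score and hypothetical grader interfaces (not the shipped API):

```python
# A minimal sketch of the tiered grading flow described above. The Grade type,
# function names, and 0-1 scale are illustrative assumptions, not the shipped API.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Grade:
    score: float                               # normalized 0.0-1.0
    evidence: str                              # pointer to the observable artifact the score traces to
    grader: str                                # "code", "model", or "human"
    calibration_delta: Optional[float] = None  # drift from the human baseline, if measured

def grade_task(
    output: str,
    criteria: dict,
    code_grader: Callable[[str, dict], Optional[Grade]],
    model_grader: Callable[[str, dict], Grade],
    human_baseline: Optional[Grade] = None,
) -> Grade:
    # Code-based graders run first; they return None when the criteria need
    # judgment that a deterministic check cannot provide.
    grade = code_grader(output, criteria) or model_grader(output, criteria)
    # Human calibration: record how far the automated grade drifts from a
    # human-scored baseline so graders can be re-tuned over time.
    if human_baseline is not None:
        grade.calibration_delta = grade.score - human_baseline.score
    return grade
```

In practice every grade would carry a link back to the task log or tool trace it was derived from, which is what makes the score reproducible.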
Proportion of tasks completed to acceptance criteria without human intervention.
Correctness of output, measured by revision rate and severity-weighted error density.
Time from assignment to acceptable completion, normalized by task complexity.
Low variance in performance across repeated similar tasks. Reliable behavior across contexts.
Adherence to organizational review processes, output format standards, and verification protocols.
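To make these dimensions concrete, here is an illustrative computation of four of them from raw task records. Field names, severity weights, and normalization choices are assumptions for the sketch, not the framework's published formulas.

```python
# Illustrative computation of performance dimensions from task records. Field names,
# severity weights, and normalization choices are assumptions, not published formulas.
from statistics import mean, pstdev

SEVERITY_WEIGHTS = {"minor": 1, "major": 3, "critical": 9}   # assumed weighting

def performance_scores(tasks: list[dict]) -> dict:
    # Task completion: done to acceptance criteria without human intervention.
    completed = [t for t in tasks if t["met_criteria"] and not t["human_intervened"]]
    completion_rate = len(completed) / len(tasks)

    # Accuracy: revision rate and severity-weighted error density.
    revision_rate = mean(t["revisions"] for t in tasks)
    error_density = mean(
        sum(SEVERITY_WEIGHTS[e] for e in t["errors"]) / max(t["output_size"], 1)
        for t in tasks
    )

    # Latency: time to acceptable completion, normalized by task complexity.
    normalized_latency = mean(t["minutes_to_done"] / t["complexity"] for t in tasks)

    # Consistency: lower variance across repeated similar tasks scores higher.
    consistency = 1 / (1 + pstdev(t["quality_score"] for t in tasks))

    return {
        "completion_rate": completion_rate,
        "revision_rate": revision_rate,
        "weighted_error_density": error_density,
        "normalized_latency": normalized_latency,
        "consistency": consistency,
    }
```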
Number of distinct task domains the agent operates in with at least Competent-level performance.
Maximum task complexity the agent can complete successfully, from single-step to novel problems.
Effectiveness in using available tools: correct invocation, error recovery, efficiency.
Degree of independent operation sustained, from human-in-the-loop to fully autonomous.
Speed of improvement on repeated exposure to similar task types over time.
Judgment in identifying when to delegate, selecting the right delegate, and managing the outcome.
Coordinating multi-agent workflows, managing dependencies, handling failures, and synthesizing results.
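A companion sketch for two of the capability dimensions, domain breadth and complexity ceiling. The Competent threshold and the complexity scale are assumptions.

```python
# Illustrative derivation of domain breadth and complexity ceiling from the same
# task records. The Competent threshold and the 1-5 complexity scale are assumptions.
COMPETENT_THRESHOLD = 0.7   # assumed minimum average quality for "Competent"

def capability_scores(tasks: list[dict]) -> dict:
    by_domain: dict[str, list[float]] = {}
    for t in tasks:
        by_domain.setdefault(t["domain"], []).append(t["quality_score"])

    # Domain Breadth: distinct domains with at least Competent-level performance.
    domain_breadth = sum(
        1 for scores in by_domain.values()
        if sum(scores) / len(scores) >= COMPETENT_THRESHOLD
    )

    # Complexity Ceiling: maximum task complexity completed successfully,
    # from single-step (1) up to novel problems (5).
    complexity_ceiling = max(
        (t["complexity"] for t in tasks if t["met_criteria"]),
        default=0,
    )
    return {"domain_breadth": domain_breadth, "complexity_ceiling": complexity_ceiling}
```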
Every assessment dimension maps directly to published standards from NIST, ISO, and IEEE. This is not a proprietary scoring system. It is an implementation of existing AI governance frameworks.
GOVERN, MAP, MEASURE, MANAGE functions. Our assessment maps directly to the MEASURE function for valid and reliable AI behavior.
PDCA lifecycle with 38 controls. Our Review Compliance dimension maps to Control A.6.2.6 for monitoring and measurement.
Canonical vocabulary for agent, system, model, transparency. Our autonomy levels use the standard's human oversight definitions.
Capability description semantics and interoperability levels. Our Domain Breadth and Orchestration dimensions map to this standard.
Quality characteristics including accuracy, robustness, fairness. Our Accuracy and Complexity Ceiling dimensions derive from this model.
Agent authentication, identity, security evaluation, interoperability. Our Delegation and Orchestration dimensions align with this initiative.
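Because the mapping is meant to be something compliance teams can point to, it can also be published as machine-readable data. A sketch, with structure and key names assumed; the references are the ones listed above.

```python
# The dimension-to-standard mappings above as machine-readable data. Structure and
# key names are assumptions; the cited functions and controls come from the list above.
DIMENSION_STANDARDS_MAP = {
    "All dimensions":     "MEASURE function (valid and reliable AI behavior)",
    "Review Compliance":  "Control A.6.2.6 (monitoring and measurement)",
    "Autonomy":           "human oversight definitions",
    "Accuracy":           "quality characteristics model",
    "Complexity Ceiling": "quality characteristics model",
    "Domain Breadth":     "capability description semantics and interoperability levels",
    "Orchestration":      "capability description semantics and interoperability levels",
    "Delegation":         "agent authentication, identity, and interoperability initiative",
}
```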
Two phases, designed to scale from internal tooling to industry-wide adoption.
Provide your agent inventory with domain tags, tool access, and autonomy level.
The framework monitors task logs, tool calls, revision history, and delegation patterns.
Code-based graders assess quantitative dimensions. Model-based graders handle nuance. Calibrated against human baselines.
A verifiable, dual-axis badge is generated in SVG, PNG, JSON, and Markdown formats.
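A sketch of the agent inventory handed over in the first step. Agent names, field names, and the autonomy labels are hypothetical.

```python
# A minimal sketch of the Phase 1 agent inventory (step 1 above). Agent names,
# field names, and the autonomy labels are hypothetical.
AGENT_INVENTORY = [
    {
        "agent_id": "support-triage-bot",            # hypothetical agent
        "domains": ["customer-support", "billing"],  # domain tags
        "tools": ["crm.search", "ticket.update"],    # tool access
        "autonomy": "human-in-the-loop",             # up to "fully-autonomous"
    },
    {
        "agent_id": "code-review-agent",
        "domains": ["software-engineering"],
        "tools": ["git.diff", "ci.status"],
        "autonomy": "supervised",
    },
]
```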
Connect your agent infrastructure via REST API. Multi-tenant support for any organization running AI agents.
Rolling 30-day windows with recency weighting. Automatic re-assessment triggers. Real-time dashboards.
Shareable badge URLs with full breakdown. Open Badges v3.0 compatible. Third-party verifiable.
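As an illustration of the continuous scoring model, a sketch of a recency-weighted 30-day window. The exponential half-life is an assumed weighting scheme, not the published one.

```python
# Sketch of Phase 2's rolling 30-day window with recency weighting. The exponential
# half-life below is an assumed weighting scheme, not the framework's published one.
import math
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(days=30)
HALF_LIFE_DAYS = 10.0   # assumed: an observation's weight halves every 10 days

def windowed_score(observations: list[tuple[datetime, float]],
                   now: datetime | None = None) -> float:
    """Recency-weighted mean of (timestamp, score) pairs from the last 30 days."""
    now = now or datetime.now(timezone.utc)
    recent = [(ts, s) for ts, s in observations if now - ts <= WINDOW]
    if not recent:
        return 0.0   # no recent evidence; would trigger a re-assessment
    age_days = [(now - ts).total_seconds() / 86400 for ts, _ in recent]
    weights = [math.exp(-math.log(2) * age / HALF_LIFE_DAYS) for age in age_days]
    return sum(w * s for w, (_, s) in zip(weights, recent)) / sum(weights)
```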
Every badge displays two independent scores, the certifying organization, validity period, and a verification link. Machine-readable JSON follows Open Badges v3.0.
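For orientation, a simplified example of what that payload could look like. The overall shape follows the Open Badges v3.0 (Verifiable Credentials) pattern, but the dual-score fields, issuer, and URLs shown here are assumptions; the specification governs the real structure.

```python
# Simplified, illustrative badge payload expressed as a Python dict. The general shape
# follows the Open Badges v3.0 / Verifiable Credentials pattern; the dual-score fields,
# issuer, and URLs are assumptions, not a spec-complete credential.
BADGE_CREDENTIAL = {
    "type": ["VerifiableCredential", "OpenBadgeCredential"],
    "issuer": {"id": "https://example.org/issuers/acme", "name": "Acme Corp"},  # certifying organization (hypothetical)
    "validFrom": "2025-01-01T00:00:00Z",    # validity period
    "validUntil": "2025-04-01T00:00:00Z",
    "credentialSubject": {
        "type": ["AchievementSubject"],
        "achievement": {
            "type": ["Achievement"],
            "name": "Agent Assessment: dual-axis Performance / Capability badge",
        },
    },
    # Assumed extension carrying the two independent scores and the verification link.
    "performanceScore": 82,
    "capabilityScore": 64,
    "verificationUrl": "https://example.org/verify/badge/123",   # hypothetical
}
```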
Phase 1 launches as an open-source assessment tool. Join the early access list to be notified at launch and help shape the standard.
No spam. Only launch updates and framework previews.
You are on the list. We will be in touch.