The Agent Experience Score measures how well AI agents implement Auth0 across different models and frameworks. It allows you to compare current scores from agents implementing Auth0 services and features — such as Multi-factor Authentication (MFA) or Auth0 Actions — in development and testing environments to review how Auth0 tools improve agent performance. Use this resource to learn about the scoring methodology, including how scores are calculated, what dimensions are measured, and how grades are assigned.

Test specifications

AI agents — Claude Code, GitHub Copilot, Gemini CLI — run Auth0 integration tasks in isolated development environments. Each agent uses the same tools a developer would in a realistic environment: a workspace, a shell, file tools, and the Auth0 CLI. The prompts are short and realistic: “add authentication to my Next.js app,” not step-by-step recipes. Each model is tested with and without Auth0 tools (MCP Server and Agent Skills). The difference between those scores is the measurable impact of Auth0’s AI tooling on the developer experience.
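The with/without comparison above can be sketched as a simple score delta. The numbers below are illustrative placeholders, not real benchmark results:

```python
# Sketch of the tools-on vs. tools-off comparison. Scores are hypothetical.

def tooling_impact(score_with_tools: float, score_without_tools: float) -> float:
    """Return the score delta attributable to Auth0 MCP Server + Agent Skills."""
    return score_with_tools - score_without_tools

# Same model, same prompt, run once with Auth0 tools and once without.
delta = tooling_impact(score_with_tools=88.0, score_without_tools=71.0)
print(delta)  # 17.0
```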

Score dimensions

Every run is scored across seven dimensions split into two categories: four process dimensions evaluate the agent’s end-to-end workflow with Auth0 tools, and three output dimensions score the final code. Each dimension is scored 0–100 individually, then weighted and combined into the overall score.
| Dimension | Category | Description |
| --- | --- | --- |
| Setup Friction | Process | Measures the agent’s ability to complete the task autonomously. The score decreases if the agent pauses to ask questions or encounters errors. |
| Setup Speed | Process | Measures the agent’s active execution time. Results are comparable across environments. |
| Efficiency | Process | Measures the number of tool calls required to complete the task. Fewer tool calls mean less cost and less complexity. |
| Error Recovery | Process | Measures infrastructure errors (rate limits, timeouts) that disrupt execution. |
| Correctness | Output | Measures whether the generated code imports real packages, calls real methods, and wires components correctly. |
| Hallucination | Output | Measures whether the agent invented packages that don’t exist or used incorrect SDK variants. |
| Security | Output | Measures whether the agent hardcoded secrets, stored tokens insecurely, or committed credentials to source code. |
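A weighted combination of the seven dimension scores might look like the sketch below. The weights are assumptions chosen for illustration; the actual weighting used by the Agent Experience Score is not published here:

```python
# Illustrative sketch: combine seven 0-100 dimension scores into one overall
# score. The weights are placeholders, not the official weighting.

WEIGHTS = {
    "setup_friction": 0.15,
    "setup_speed": 0.10,
    "efficiency": 0.10,
    "error_recovery": 0.10,
    "correctness": 0.25,
    "hallucination": 0.15,
    "security": 0.15,
}

def overall_score(dimension_scores: dict) -> float:
    """Weighted average of per-dimension scores (each 0-100)."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)

scores = {
    "setup_friction": 90, "setup_speed": 80, "efficiency": 85,
    "error_recovery": 100, "correctness": 92, "hallucination": 95,
    "security": 88,
}
print(round(overall_score(scores), 1))
```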

Grades

Overall scores map to letter grades:
| Grade | Min score | Description |
| --- | --- | --- |
| A | 90 | Production-ready. Minimal issues. |
| B | 75 | Solid, but with a few gaps to fix. |
| C | 60 | Usable, but needs cleanup. |
| D | 40 | Significant problems. |
| F | < 40 | Not useful — faster to start from scratch. |
Grades are calibrated to match developer intuition. A score of 91 should feel like code you’d accept with minimal review. A score of 55 should feel like something that needs real work to fix.
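The grade thresholds in the table above map directly to a small lookup:

```python
# Score-to-grade mapping, following the thresholds in the grades table.

def grade(score: float) -> str:
    """Map an overall score (0-100) to a letter grade."""
    if score >= 90:
        return "A"
    if score >= 75:
        return "B"
    if score >= 60:
        return "C"
    if score >= 40:
        return "D"
    return "F"

print(grade(91), grade(55))  # A D
```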

Result validation

Every grader verifies generated code — not prose or explanations. Graders check that code compiles, imports real packages, calls actual SDK methods, and doesn’t introduce security vulnerabilities. Results are validated at multiple levels:
  • Presence checks: Required SDK symbols, imports, and config keys exist in the output.
  • Hallucination detection: Invented packages, wrong SDK variants, and fabricated API methods are caught.
  • Security checks: Hardcoded credentials, tokens in insecure storage, and secrets in source code are flagged.
  • Structural validation: Code is correctly wired — right components in right files, lifecycle hooks handled, middleware in the correct order.
  • Version correctness: The agent uses current APIs, not deprecated patterns (only checked when the agent has access to current docs).
  • Holistic review: An LLM judge evaluates overall correctness of the implementation.
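Two of the validation layers above, presence checks and security checks, can be sketched as below. The required symbol and the secret-detection regex are simplified illustrations, not the actual graders:

```python
# Simplified sketch of a presence check (required SDK import exists) and a
# security check (hardcoded secrets flagged). Symbols and patterns are
# illustrative only.
import re

REQUIRED_SYMBOLS = ["@auth0/nextjs-auth0"]  # example required import
SECRET_PATTERN = re.compile(
    r"""(client_secret|AUTH0_CLIENT_SECRET)\s*[:=]\s*['"][^'"]+['"]"""
)

def presence_check(source: str) -> bool:
    """True if every required SDK symbol appears in the generated code."""
    return all(sym in source for sym in REQUIRED_SYMBOLS)

def security_check(source: str) -> list:
    """Return lines that appear to hardcode a secret."""
    return [line for line in source.splitlines() if SECRET_PATTERN.search(line)]

code = (
    'import { Auth0Client } from "@auth0/nextjs-auth0/server"\n'
    'const config = { client_secret: "hardcoded-value" }\n'
)
print(presence_check(code), len(security_check(code)))  # True 1
```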

Estimated cost and time

The results page displays estimated cost and estimated time for each configuration. These values represent a single eval run with Auth0 MCP + Skills enabled.

Estimated cost

Cost is calculated from the total tokens consumed during the eval run (input tokens + output tokens) multiplied by the model provider’s published per-token pricing. Auth0 does not charge for running evals — the cost reflects what you would pay your model provider for equivalent token usage. Token pricing varies by model and provider. For current rates, refer to your provider’s pricing page.
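The cost formula above reduces to a few lines. The per-million-token prices here are placeholders; substitute your provider’s published rates:

```python
# Sketch of the cost estimate: total tokens times per-token pricing.
# Prices below are placeholders, not any provider's actual rates.

def estimated_cost(input_tokens: int, output_tokens: int,
                   input_price_per_mtok: float,
                   output_price_per_mtok: float) -> float:
    """Cost in USD, given per-million-token prices."""
    return (input_tokens / 1e6) * input_price_per_mtok \
         + (output_tokens / 1e6) * output_price_per_mtok

# Hypothetical run: 450k input and 60k output tokens at $3 / $15 per Mtok.
print(round(estimated_cost(450_000, 60_000, 3.0, 15.0), 2))  # 2.25
```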

Estimated time

Time is the wall-clock duration of the eval run from prompt submission to final output. It includes all agent activity: reading files, making tool calls, waiting for API responses, and writing code. Time may vary based on:
  • Model provider API latency and rate limits
  • Number of tool calls required (varies by task complexity)
  • Network conditions between the eval environment and the model provider
  • Provider-side queue depth and load
Time is not normalized across providers. A faster time reflects both model efficiency and provider infrastructure performance.
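The wall-clock measurement described above amounts to timing the entire run, tool calls and API waits included. In this sketch, `run_eval` stands in for the agent run itself and is purely hypothetical:

```python
# Sketch of wall-clock timing from prompt submission to final output.
# run_eval is a hypothetical placeholder for the full agent run.
import time

def timed_run(run_eval) -> float:
    """Return elapsed wall-clock seconds for one eval run."""
    start = time.monotonic()
    run_eval()
    return time.monotonic() - start

elapsed = timed_run(lambda: time.sleep(0.1))
print(elapsed >= 0.1)  # True
```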
