Test specifications
AI agents — Claude Code, GitHub Copilot, Gemini CLI — run Auth0 integration tasks in isolated development environments. Each agent works with the same tools a developer would use: a workspace, a shell, file tools, and the Auth0 CLI. The prompts are short and realistic: “add authentication to my Next.js app,” not step-by-step recipes. Each model is tested with and without Auth0 tools (MCP Server and Agent Skills). The difference between those two scores is the measurable impact of Auth0’s AI tooling on the developer experience.
Score dimensions
Every run is scored across seven dimensions split into two categories: four process dimensions that assess the agent’s end-to-end workflow with Auth0 tools, and three output dimensions that score the final code. Each dimension is scored 0–100 individually, then weighted and combined into the overall score.
| Dimension | Category | Description |
|---|---|---|
| Setup Friction | Process | Score determined by the agent’s ability to complete the task autonomously. If the agent paused to ask questions or encountered errors, the score decreased. |
| Setup Speed | Process | Score determined by the agent’s active execution time. Results are comparable across environments. |
| Efficiency | Process | Score determined by the number of tool calls required to complete the task. Fewer tool calls means less cost and less complexity. |
| Error Recovery | Process | Score determined by how well the agent recovered from infrastructure errors (rate limits, timeouts) that disrupted execution. |
| Correctness | Output | Score determined by whether the generated code imports real packages, calls real methods, and wires components correctly. |
| Hallucination | Output | Score determined by whether the agent invented packages that don’t exist or used incorrect SDK variants. |
| Security | Output | Score determined by whether the agent hardcoded secrets, stored tokens insecurely, or committed credentials to source code. |
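The weighting step can be sketched as a simple weighted average. The per-dimension weights below are illustrative placeholders — the actual weights Auth0 uses are not stated in this document:

```python
# Sketch of the weighted combination. The weights below are illustrative
# assumptions, NOT Auth0's published per-dimension weights.
WEIGHTS = {
    "setup_friction": 0.15,
    "setup_speed": 0.10,
    "efficiency": 0.10,
    "error_recovery": 0.10,
    "correctness": 0.25,
    "hallucination": 0.15,
    "security": 0.15,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Combine seven 0-100 dimension scores into one weighted 0-100 score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)

# Example run: strong output scores, a slower and chattier process.
scores = {
    "setup_friction": 90, "setup_speed": 80, "efficiency": 85,
    "error_recovery": 100, "correctness": 95, "hallucination": 90,
    "security": 100,
}
print(overall_score(scores))  # a single 0-100 value fed into the grade table
```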
Grades
Overall scores map to letter grades:
| Grade | Min score | Description |
|---|---|---|
| A | 90 | Production-ready. Minimal issues. |
| B | 75 | Solid, but with a few gaps to fix. |
| C | 60 | Usable, but needs cleanup. |
| D | 40 | Significant problems. |
| F | < 40 | Not useful — faster to start from scratch. |
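The grade table above is a straightforward threshold mapping; a minimal sketch:

```python
# Threshold mapping taken directly from the "Min score" column above.
def grade(score: float) -> str:
    if score >= 90:
        return "A"  # production-ready, minimal issues
    if score >= 75:
        return "B"  # solid, a few gaps to fix
    if score >= 60:
        return "C"  # usable, needs cleanup
    if score >= 40:
        return "D"  # significant problems
    return "F"      # not useful

print(grade(92.25), grade(58), grade(12))  # → A C F
```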
Result validation
Every grader verifies generated code — not prose or explanations. Graders check that code compiles, imports real packages, calls actual SDK methods, and doesn’t introduce security vulnerabilities. Results are validated at multiple levels:
- Presence checks: Required SDK symbols, imports, and config keys exist in the output.
- Hallucination detection: Invented packages, wrong SDK variants, and fabricated API methods are caught.
- Security checks: Hardcoded credentials, tokens in insecure storage, and secrets in source code are flagged.
- Structural validation: Code is correctly wired — right components in right files, lifecycle hooks handled, middleware in the correct order.
- Version correctness: The agent uses current APIs, not deprecated patterns (only checked when the agent has access to current docs).
- Holistic review: An LLM judge evaluates overall correctness of the implementation.
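A few of these checks can be approximated with plain string and regex rules. This is a minimal sketch: the required-symbol list, known-package set, and secret pattern are illustrative stand-ins, not the real grader’s rules.

```python
import re

# Illustrative stand-ins: the real grader's symbol lists, package registry,
# and secret patterns are not published; these names are assumptions.
REQUIRED_SYMBOLS = ["Auth0Provider", "useUser"]     # presence checks
KNOWN_PACKAGES = {"@auth0/nextjs-auth0", "auth0"}   # hallucination detection
SECRET_PATTERN = re.compile(
    r"(client_secret|api_key)\s*[:=]\s*['\"][^'\"]+['\"]", re.IGNORECASE
)

def validate(source: str, imported_packages: list[str]) -> list[str]:
    """Return a list of findings; an empty list means these checks passed."""
    findings = []
    for symbol in REQUIRED_SYMBOLS:
        if symbol not in source:
            findings.append(f"missing required symbol: {symbol}")
    for package in imported_packages:
        if package not in KNOWN_PACKAGES:
            findings.append(f"possibly hallucinated package: {package}")
    if SECRET_PATTERN.search(source):
        findings.append("hardcoded credential detected")
    return findings

good = "import { Auth0Provider } from '@auth0/nextjs-auth0'; useUser();"
bad = "const client_secret = 'abc123';"
print(validate(good, ["@auth0/nextjs-auth0"]))  # → []
print(validate(bad, ["fake-auth-pkg"]))         # several findings
```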
Estimated cost and time
The results page displays estimated cost and estimated time for each configuration. These values represent a single eval run with Auth0 MCP + Skills enabled.
Estimated cost
Cost is calculated from the total tokens consumed during the eval run (input tokens + output tokens) multiplied by the model provider’s published per-token pricing. Auth0 does not charge for running evals — the cost reflects what you would pay your model provider for equivalent token usage. Token pricing varies by model and provider; for current rates, refer to your provider’s pricing page.
Estimated time
Time is the wall-clock duration of the eval run from prompt submission to final output. It includes all agent activity: reading files, making tool calls, waiting for API responses, and writing code. Time may vary based on:
- Model provider API latency and rate limits
- Number of tool calls required (varies by task complexity)
- Network conditions between the eval environment and the model provider
- Provider-side queue depth and load
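The cost arithmetic described above can be sketched as follows; the token counts and per-million-token prices are made-up examples, not any provider’s actual rates:

```python
# Cost = total tokens x the provider's per-token price. The figures used
# below are illustrative only, not real provider pricing.
def estimated_cost(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of a run at the given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# e.g. 450k input + 60k output tokens at $3.00 / $15.00 per million tokens
print(f"${estimated_cost(450_000, 60_000, 3.00, 15.00):.2f}")  # → $2.25
```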