User Guide

Everything you need to know to evaluate your prompts effectively with EvalPilot.

Getting Started

Your First Evaluation in 90 Seconds

  1. Paste Your Prompt

    Go to the evaluation page and paste your LLM prompt. Our system automatically detects what type of task your prompt performs (classification, extraction, summarization, etc.).

  2. Confirm Evaluation Criteria

    We suggest relevant criteria based on your task type. Toggle on/off what matters for your use case. No long interviews - just confirm and go.

  3. Run the Evaluation

    We generate 50+ test cases and run your prompt against each one. Watch progress in real-time with live streaming updates.

  4. Review Results

    Get a comprehensive breakdown of how your prompt performed: pass rate, category stats, failure examples, and actionable improvement suggestions.

Test Case Categories

We automatically generate test cases across three categories to thoroughly evaluate your prompt:

Happy Path (40%)

Standard inputs that your prompt should handle well. These test basic functionality.

Edge Cases (40%)

Boundary conditions, unusual formats, and corner cases that might trip up your prompt.

Adversarial (20%)

Malicious inputs, prompt injection attempts, and deliberately confusing inputs.
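
As an illustration of the split above, here is a minimal sketch of how a 50-case run could be divided across the three categories. The function name, rounding approach, and category labels are assumptions for illustration, not the production generator.

```typescript
type Category = "happy_path" | "edge_case" | "adversarial";

// Illustrative split for a 50+ case run: 40% happy path, 40% edge cases, 20% adversarial.
function planTestCases(total: number): Record<Category, number> {
  const happy = Math.round(total * 0.4);
  const edge = Math.round(total * 0.4);
  const adversarial = total - happy - edge; // remainder keeps the sum exact
  return { happy_path: happy, edge_case: edge, adversarial };
}

// planTestCases(50) → { happy_path: 20, edge_case: 20, adversarial: 10 }
```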

Why This Methodology?

Why 50+ Test Cases?

Based on Anthropic's evaluation best practices, volume beats precision when testing LLM prompts.

Statistical Confidence — 50+ tests give you confidence that your prompt handles diverse inputs, not just the 3-5 examples you manually tested.

Edge Case Discovery — More tests = higher chance of catching failures you'd never think to test manually.

Automated Grading — We use code-based grading where possible (exact match, JSON validation), so large volumes are fast and reliable.

Anthropic Principle: "100 automated tests with 80% accuracy beats 10 hand-graded tests with 100% accuracy."

Why Three Grading Methods?

Different evaluation criteria require different grading approaches. We use a hierarchy to maximize confidence:

Code: High Confidence (Preferred)

Objective criteria like JSON format, exact matches, regex patterns. Fast, deterministic, no ambiguity.

LLM: AI Judgment (When Needed)

Subjective criteria like factuality, tone, relevance. We use detailed rubrics and reasoning-first scoring for consistency.

Human: Your Review (Final Say)

For highly subjective criteria (creativity, brand voice), you approve/reject flagged tests. We prioritize these for your attention.

Transparency: Every result shows which grading method was used, so you know how much to trust each score.
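
As a rough sketch of what the code-based tier of this hierarchy looks like in practice, here is an illustrative grader for the objective criteria mentioned above. The criterion shapes and function name are assumptions, not EvalPilot's internal API.

```typescript
type CodeCriterion =
  | { kind: "exact_match"; expected: string }
  | { kind: "valid_json" }
  | { kind: "regex"; pattern: string };

// Deterministic, code-based checks: fast and unambiguous where they apply.
function gradeWithCode(output: string, criterion: CodeCriterion): boolean {
  switch (criterion.kind) {
    case "exact_match":
      return output.trim() === criterion.expected.trim();
    case "valid_json":
      try { JSON.parse(output); return true; } catch { return false; }
    case "regex":
      return new RegExp(criterion.pattern).test(output);
  }
}

// Example: gradeWithCode('{"label":"spam"}', { kind: "valid_json" }) → true
```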

Agent Eval (Preview)

What is Agent Eval?

While Prompt Eval tests individual prompts that generate text responses, Agent Eval tests AI agents that take actions—calling tools, making API requests, or executing multi-step workflows.

Perfect for testing booking assistants, customer support bots, code generation agents, and any AI system that does more than just respond.

Getting Started with Agent Eval

  1. Describe Your Agent

    Go to Agent Eval and describe what your agent does. Be specific about its purpose and capabilities.

  2. Define Available Actions

    List the actions your agent can take—API calls, tool uses, database queries. Include parameter schemas if relevant.

  3. Create Test Scenarios

    Define test scenarios for your agent. Each scenario includes what the customer says, the expected action sequence, and success criteria. Generate more with AI.

  4. Configure Integration

    Choose how to capture transcripts: Manual (paste), n8n, Make, Zapier, OpenAI Assistant, or Custom API. Integrations let you run your agent live and capture the conversation automatically.

  5. Capture Transcripts (Per Scenario)

    Each scenario needs its own transcript. Use "Run All Scenarios" to execute your integration for all scenarios at once, or capture each one individually. For manual mode, paste transcripts one at a time per scenario.

  6. Get Detailed Results

    Each scenario is evaluated against its own transcript independently. We score trajectory (did the agent take the right actions?) and response quality (was the final response helpful?).

Why Is Agent Eval Different?

Testing agents that take actions requires a fundamentally different approach than testing prompts that generate text.

1. Text-Only vs. Tool-Using Agents

Text-Only Agents (Prompt Eval): "Summarize this article" → Output text → Grade text quality

Tool-Using Agents (Agent Eval): "Book a flight" → Calls APIs (search_flights, book_flight, send_confirmation) → Output text + actions taken

Agent Eval handles both! We extract actions from transcripts using explicit markers ("[Agent uses lookup_order]") or LLM-based inference ("I've looked up your order..." → infers lookup_order action).
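
For illustration, here is a minimal sketch of the explicit-marker path. The marker format comes from the example above; the function name and regex are assumptions, and a real pipeline would fall back to LLM-based inference when no markers are present.

```typescript
// Extract explicit action markers like "[Agent uses lookup_order]" from a transcript.
function extractActions(transcript: string): string[] {
  const markerPattern = /\[Agent uses ([a-zA-Z0-9_]+)\]/g;
  return [...transcript.matchAll(markerPattern)].map((m) => m[1]);
}

// extractActions("Hi! [Agent uses lookup_order] Your order ships tomorrow.")
// → ["lookup_order"]
```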

2. Six ADK-Style Metrics

Four core metrics determine pass/fail. Two optional rubric-based metrics provide quality insights without affecting the result.

Trajectory Score (0-100%)

Did the agent take the right actions in the right order? Uses ordered sequence matching; a short sketch of this matching follows the six metrics below.

Response Score (0-100%)

Was the final response helpful? Uses LLM semantic evaluation via Gemini.

Hallucination Score (0-100%)

Did the agent stay grounded in facts? LLM-based grounding check via Gemini.

Safety Score (0-100%)

Is the response safe and appropriate? Checks for harmful or inappropriate content.

Response Quality (0-100%, informational)

Custom criteria for response quality. Define your own criteria like "Response is concise" or "Uses professional tone". Shown as quality insights.

Tool Use Quality (0-100%, informational)

Custom criteria for tool usage. Define rules like "Looks up order before changes". Shown as quality insights; they don't affect pass/fail.
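
To make "ordered sequence matching" concrete, here is a minimal sketch of one way a trajectory score could be computed from expected and actual action names. It is an illustration under that assumption; the actual scoring logic may differ.

```typescript
// Score how much of the expected action sequence appears in the actual
// actions, in order (a simple ordered-subsequence match, scaled to 0-100).
function trajectoryScore(expected: string[], actual: string[]): number {
  let matched = 0;
  let cursor = 0;
  for (const step of expected) {
    const found = actual.indexOf(step, cursor);
    if (found !== -1) {
      matched++;
      cursor = found + 1; // later expected steps must appear after this one
    }
  }
  return expected.length === 0 ? 100 : Math.round((matched / expected.length) * 100);
}

// trajectoryScore(["search_flights", "book_flight"], ["search_flights", "book_flight", "send_confirmation"]) → 100
// trajectoryScore(["search_flights", "book_flight"], ["book_flight"]) → 50
```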

3. Per-Scenario Evaluation (Not Batch)

Why per-scenario? Each scenario tests a different user intent ("Book a flight" vs "Cancel reservation"). Testing "book a flight" with a transcript about "cancel reservation" wouldn't make sense.

Follows Google ADK pattern: Each test case triggers a separate agent invocation, evaluated independently against expected behavior.

Parallel execution: All scenarios with transcripts are evaluated simultaneously for faster results.
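
Conceptually, running all captured scenarios in parallel looks something like the sketch below. The types and the evaluate callback are illustrative assumptions, not the real data model.

```typescript
interface Scenario { id: string; transcript?: string }
interface ScenarioResult { id: string; passed: boolean }

// Only scenarios that already have a transcript are evaluated, and all of
// them run concurrently rather than one after another.
async function evaluateAll(
  scenarios: Scenario[],
  evaluate: (s: Scenario) => Promise<ScenarioResult>,
): Promise<ScenarioResult[]> {
  const ready = scenarios.filter((s) => s.transcript !== undefined);
  return Promise.all(ready.map((s) => evaluate(s)));
}
```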

4. Integration-First Design

We integrate directly with your agent platform (n8n, Make, OpenAI Assistants) to capture live executions. Use "Run All Scenarios" to automatically execute your agent for every scenario at once.

For OpenAI Assistants: We extract tool_calls directly from run steps. For custom APIs: We use LLM-based action extraction from response patterns.
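
For the OpenAI Assistants path, run steps returned by the API already label tool calls explicitly, so extraction is mostly a matter of walking that structure. The sketch below works on a simplified version of the run-step shape and is an illustration, not EvalPilot's extraction code.

```typescript
// Simplified shapes for the relevant parts of an Assistants API run step.
interface RunStep {
  step_details:
    | { type: "message_creation" }
    | { type: "tool_calls"; tool_calls: ToolCall[] };
}
type ToolCall =
  | { type: "function"; function: { name: string; arguments: string } }
  | { type: "code_interpreter" }
  | { type: "file_search" };

// Pull the function tool calls (name + raw JSON arguments) out of a run's steps.
function extractToolCalls(steps: RunStep[]): { name: string; args: string }[] {
  const calls: { name: string; args: string }[] = [];
  for (const step of steps) {
    if (step.step_details.type !== "tool_calls") continue;
    for (const call of step.step_details.tool_calls) {
      if (call.type === "function") {
        calls.push({ name: call.function.name, args: call.function.arguments });
      }
    }
  }
  return calls;
}
```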

Supported Integrations

Connect directly to your agent's platform to capture live conversations:

n8n Workflow

Trigger workflows via API and capture execution logs

Make (Integromat)

Run scenarios and analyze operation logs

Zapier Webhook

Trigger Zaps via webhook

OpenAI Assistant

Test Assistants and capture tool calls automatically

Custom API

Call any endpoint and extract actions via LLM

Manual

Paste transcripts directly (no setup needed)

Understanding Agent Results (ADK-Style Metrics)

Agent Eval uses Google ADK-style evaluation metrics. Four core metrics determine pass/fail. Two optional rubric-based metrics provide quality insights without affecting the test result:

Trajectory Score

Did the agent take the right actions in the right order? Compares expected steps vs. actual steps extracted from the transcript. Uses ordered sequence matching.

Response Score

Was the final response helpful and appropriate? Uses LLM semantic evaluation via Gemini to understand meaning, not just word overlap.

Hallucination Score

Did the agent stay grounded in facts? Uses Gemini LLM to check if the response is factually consistent with the transcript. Higher = less hallucination.

Safety Score

Is the response safe and appropriate? Checks for harmful, unethical, or inappropriate content. Higher = safer response.

Response Quality (optional)

Define custom criteria for response quality. Each criterion is evaluated with yes/no scoring. Great for brand-specific rules like "Uses professional tone".

Tool Use Quality (optional)

Define custom criteria for tool usage patterns. Enforce rules like "Always looks up order before processing refunds".

ADK Best Practice: Following Google ADK methodology, the four core metrics (Trajectory, Response, Hallucination, Safety) determine pass/fail. Rubric-based quality scores are informational only - they provide insights for improvement without blocking test results.
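
To make the pass/fail logic concrete, here is one way the four core metrics could gate a scenario result. The thresholds below are placeholders for illustration, not EvalPilot's actual values.

```typescript
interface CoreScores {
  trajectory: number;    // 0-100
  response: number;      // 0-100
  hallucination: number; // 0-100, higher = less hallucination
  safety: number;        // 0-100, higher = safer
}

// Hypothetical thresholds: every core metric must clear its bar for the
// scenario to pass. Rubric-based quality scores are reported separately
// and never gate the result.
const THRESHOLDS: CoreScores = { trajectory: 70, response: 70, hallucination: 80, safety: 90 };

function scenarioPasses(scores: CoreScores): boolean {
  return (Object.keys(THRESHOLDS) as (keyof CoreScores)[]).every(
    (metric) => scores[metric] >= THRESHOLDS[metric],
  );
}
```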

Per-Scenario Evaluation

Unlike traditional testing where one transcript is evaluated against all scenarios, Agent Eval uses per-scenario evaluation—each scenario gets its own transcript and is evaluated independently.

Why per-scenario? Each scenario represents a different user intent. Testing "book a flight" with a transcript about "cancel reservation" wouldn't make sense.

Run All Scenarios — If you have an integration configured, use the "Run All Scenarios" button to execute your agent for every scenario at once. Progress is shown in real-time.

Parallel evaluation — All scenarios with transcripts are evaluated simultaneously for faster results.

Tip: Only scenarios with transcripts are evaluated. You can run a partial evaluation with just the scenarios you've captured.

Agent Eval vs Prompt Eval

             Prompt Eval                           Agent Eval
Tests        Text responses                        Action sequences
Input        Prompt text                           Agent description + actions
Evaluates    Response quality                      Goal achievement + action correctness
Use Case     Chatbots, summarizers, classifiers    Assistants, bots, workflow agents
Free Tier    2 evaluations                         1 evaluation

Preview Notice: Agent Eval is in early access. We're adding features based on user feedback. Have ideas? Let us know at feedback@evalpilot.co

Understanding Results

Grading Methods

We use a hierarchy of grading methods to ensure accurate and transparent evaluation:

Code-Based Grading (High Confidence)

Exact match, JSON structure validation, regex patterns. Used for objective criteria like format compliance. Most reliable.

LLM-Based Grading (AI Judgment)

Uses Claude to evaluate subjective criteria like relevance, factuality, and tone. Includes detailed reasoning for each judgment.

Manual Review (Human Review)

Some test cases are flagged for your review. You can approve or reject and provide your own reasoning. Your judgment updates the evaluation stats.

Key Metrics

Pass Rate

Percentage of test cases that passed. A good target is 80%+ for happy path cases and 60%+ for edge cases.

Average Score

Mean score across all test cases (0-100). Provides nuance beyond pass/fail.

Category Breakdown

See how your prompt performs on happy path vs edge cases vs adversarial inputs.

Failure Examples

Real examples of where your prompt failed, with the input, expected output, and actual output for debugging.
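
For clarity, here is a minimal sketch of how these summary numbers relate to per-test results. The field names are illustrative assumptions, not the real data model.

```typescript
interface TestResult {
  category: "happy_path" | "edge_case" | "adversarial";
  passed: boolean;
  score: number; // 0-100
}

// Derive pass rate, average score, and a per-category breakdown from raw results.
function summarize(results: TestResult[]) {
  const passRate = (results.filter((r) => r.passed).length / results.length) * 100;
  const averageScore = results.reduce((sum, r) => sum + r.score, 0) / results.length;
  const byCategory: Record<string, { total: number; passed: number }> = {};
  for (const r of results) {
    const bucket = (byCategory[r.category] ??= { total: 0, passed: 0 });
    bucket.total++;
    if (r.passed) bucket.passed++;
  }
  return { passRate, averageScore, byCategory };
}
```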

Taking Action

Review flagged cases - Check the human review queue and approve/reject ambiguous results

Study failure patterns - Look at what categories fail most to identify systematic issues

Iterate and compare - Update your prompt and run V1 vs V2 comparison to measure improvement

Pro Features

Saved Test Suites

Save test cases from evaluations or create custom suites with your own test cases. Perfect for regression testing and consistent prompt validation.

Create custom suites with your own test cases from scratch
Save test cases from completed evaluations
Edit test cases anytime - add, modify, or remove
Run the same tests on updated prompts
Duplicate suites for A/B testing variants
Track suite usage and last run dates
Up to 50 test cases per suite

V1 vs V2 Comparison

Compare two versions of your prompt side-by-side to measure improvement or regression across all metrics.

Overall pass rate delta with trend indicators
Category-by-category breakdown
Test-by-test comparison showing improved/regressed cases
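
As a rough illustration of what the comparison computes, the sketch below derives a pass-rate delta and counts improved/regressed cases from two runs. The data shapes are assumptions, not the actual export format.

```typescript
interface ComparableResult { testId: string; passed: boolean }

// Classify each shared test case as improved or regressed between two runs,
// and compute the overall pass-rate delta in percentage points.
function compareRuns(v1: ComparableResult[], v2: ComparableResult[]) {
  const v1ById = new Map(v1.map((r) => [r.testId, r.passed]));
  let improved = 0;
  let regressed = 0;
  for (const r of v2) {
    const before = v1ById.get(r.testId);
    if (before === undefined) continue; // only compare tests present in both runs
    if (!before && r.passed) improved++;
    if (before && !r.passed) regressed++;
  }
  const passRateDelta =
    (v2.filter((r) => r.passed).length / v2.length -
      v1.filter((r) => r.passed).length / v1.length) * 100;
  return { passRateDelta, improved, regressed };
}
```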

PDF Export

Generate professional PDF reports to share with stakeholders, clients, or team members.

Executive summary with key metrics
Detailed test case results
Recommendations and next steps

Auto-Improve & Retest

Get AI-powered suggestions to improve your prompt, then apply and retest with one click. Uses 2 evaluation credits for higher-quality rewrites with GPT-4o.

Analyzes your actual failed test cases
Generates targeted improvements using GPT-4o
One-click apply and rerun evaluation
Uses 2 credits for premium model quality

Team Collaboration (Team Plan)

What's Included in Team Plan?

The Team plan ($99/mo) is designed for organizations that need to collaborate on AI evaluations.

5 team seats - Owner + 4 members
200 pooled prompt evals/month
40 pooled agent evals/month
Shared workspace for team evaluations
Unlimited BYOK evaluations
All Pro features included

Creating a Team

  1. Subscribe to Team Plan

    Go to Pricing and select the Team plan. After payment, a team is automatically created for you.

  2. Access Team Settings

    Go to Settings → Team to manage your team, view usage, and invite members.

Inviting Team Members

  1. Send an Invite

    In Team Settings, enter the email address of the person you want to invite and click "Send Invite". They'll receive an email with an invite link.

  2. Invited User Creates Account

    Important: If the invited person doesn't have an EvalPilot account, they should create one first, then click the invite link from their email.

  3. Accept the Invite

    Once logged in, clicking the invite link shows the "Accept Invite" button. Click it to join the team.

Note: Invites expire after 7 days. If an invite expires, the team owner can send a new one from Team Settings.

How Pooled Usage Works

Team evaluations draw from a shared pool, not individual limits.

200 prompt evals/month — Shared across all 5 team members

40 agent evals/month — Also shared across the team

BYOK unlimited — Each member can add their own API key for unlimited evals

Resets monthly — Usage resets on your billing date

Team Roles

Owner

The person who created the team. Can invite members, remove members, manage team settings, and view all team evaluations.

Member

Can run evaluations using team credits, view team evaluations, and leave the team. Cannot invite or remove other members.

Bring Your Own Key (BYOK) (Pro and Team)

Pro and Team users can use their own OpenAI or Anthropic API key for unlimited evaluations. Your key is encrypted and never stored in plain text.

How to add your API key:

  1. Go to Settings (from the user menu)
  2. Navigate to the API Keys section
  3. Add your OpenAI or Anthropic API key
  4. Your key is encrypted with AES-256-GCM before storage

Security note: Keys are encrypted at rest and only decrypted server-side when making API calls. We never log or expose your keys.
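
For readers curious what AES-256-GCM encryption looks like in code, here is a generic Node.js sketch. It is illustrative only, not EvalPilot's implementation; production key management (key rotation, secrets storage) is more involved.

```typescript
import { randomBytes, createCipheriv, createDecipheriv } from "node:crypto";

// Encrypt a secret with AES-256-GCM: a fresh random IV per message plus an
// authentication tag that lets decryption detect tampering.
function encrypt(plaintext: string, key: Buffer) {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { iv, ciphertext, authTag: cipher.getAuthTag() };
}

function decrypt(data: { iv: Buffer; ciphertext: Buffer; authTag: Buffer }, key: Buffer): string {
  const decipher = createDecipheriv("aes-256-gcm", key, data.iv);
  decipher.setAuthTag(data.authTag);
  return Buffer.concat([decipher.update(data.ciphertext), decipher.final()]).toString("utf8");
}

// The key must be 32 bytes, e.g. randomBytes(32) loaded from a secrets manager.
```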

Frequently Asked Questions

How many free evaluations do I get?

Free users get 2 prompt evaluations and 1 agent evaluation total. Pro and Team users get 50+ prompt evals and 10+ agent evals per month, plus unlimited evaluations with their own API key (BYOK).

What LLM models are used?

We use Claude 3.5 Sonnet for test generation, evaluation, and LLM-based grading. When you bring your own key, you can use OpenAI GPT-4 or Anthropic Claude.

How long does an evaluation take?

Most evaluations complete in 60-90 seconds for 50 test cases. You'll see real-time progress as each test runs.

Can I edit test cases after generation?

Yes! You can review and modify test cases before running evaluations. For saved suites (Pro), you can edit test cases anytime - add new ones, modify existing ones, or remove them. You can also create custom suites with your own test cases from scratch. Each suite can have up to 50 test cases.

Is there a limit on test cases per suite?

Yes, each test suite can have a maximum of 50 test cases. This limit applies to both automatically generated suites and custom suites you create. If you need to add new test cases to a suite that already has 50, you'll need to remove some existing ones first.

What's the difference between pass rate and average score?

Pass rate is a binary percentage (how many tests passed vs failed). Average score is a 0-100 scale that captures nuance - a test case might score 70 but still 'pass' if it meets the threshold.

How does human review work?

Some test cases are flagged for your review when the automated grading is uncertain. You can approve or reject these cases, and your judgment is incorporated into the final evaluation stats.

Can I compare evaluations from different prompts?

Yes! The V1 vs V2 comparison feature (Pro) lets you compare any two evaluations, even if they're for different prompts. This is useful for A/B testing prompt variations.

How do I create a custom test suite?

Go to the Suites page and click 'Create Custom Suite'. You can add your own test cases with inputs, expected outputs, and categories (happy path, edge case, or adversarial). Custom suites are a Pro feature.

How does auto-improve work and why does it cost 2 credits?

Auto-improve analyzes your actual failed test cases and generates targeted prompt improvements using GPT-4o. It costs 2 evaluation credits because we use a premium model for higher-quality rewrites that are more likely to improve your results.

Is my prompt data secure?

Yes. Your prompts are stored in your account and never shared. API keys are encrypted with AES-256-GCM. We don't use your data for training.

What happens when I cancel my subscription?

When you cancel, you keep access to Pro/Team features until the end of your current billing period. No prorated refunds—you've already paid for that time, so enjoy it! After the period ends, you'll be moved to the free tier.

What integrations does Agent Eval support?

Agent Eval supports n8n, Make (Integromat), Zapier webhooks, OpenAI Assistants, and custom API endpoints. You can also paste transcripts manually. Integrations let you run your agent live and capture conversations automatically.

Why does each scenario need its own transcript?

Agent Eval uses per-scenario evaluation—each scenario represents a different user intent (e.g., 'book a flight' vs 'cancel reservation'). Evaluating one transcript against all scenarios wouldn't make sense. Use 'Run All Scenarios' with an integration to capture transcripts for all scenarios at once.

How does Agent Eval's evaluation engine work?

Agent Eval uses a TypeScript ADK-style evaluator with four metrics: Trajectory Score (ordered sequence matching), Response Score (LLM semantic evaluation), Hallucination Score (Gemini-based grounding check), and Safety Score (Gemini-based safety check). The evaluator runs entirely serverless—no external Python service required. It uses Google's Gemini API for all LLM-based evaluations.

How do I invite someone to my team?

Go to Settings → Team, enter their email address, and click 'Send Invite'. They'll receive an email with a link. Important: If they don't have an EvalPilot account yet, they should create one first, then click the invite link to accept.

How does pooled team usage work?

Team plans share evaluations across all members. Your team gets 200 prompt evals and 40 agent evals per month total—not per person. Usage resets on your billing date. Each member can also add their own API key (BYOK) for unlimited evaluations.

Can I be on multiple teams?

No, each user can only be a member of one team at a time. To join a different team, you'd need to leave your current team first.

Ready to evaluate your prompt?

Get started in 90 seconds. No credit card required.

Start Evaluation