Testing
Eval tests validate that your skills, agents, and commands produce quality outputs. Run them against any LLM provider to check behavior before publishing or after installing.
Who is it for
Publishers
Write .eval.yaml files alongside your elements. They are the test suite for the artifact: use them to validate quality before publishing, catch regressions between versions, and enforce quality gates in CI.
Consumers
You don't write tests, but you can run grekt eval on installed artifacts to verify everything works with your provider and environment.
INFO
You can use npx promptfoo@latest to avoid a permanent install. If promptfoo is not detected, grekt eval will offer to run it via npx automatically.
How it works
- Create a .eval.yaml file next to any element .md file
- Define test cases with inputs and assertions
- Run grekt eval: grekt discovers tests, sends prompts to the provider, and checks assertions
grekt reads the system prompt from the .md file automatically. You only define inputs and expectations.
File structure
.grekt/artifacts/@acme/support/
├── skills/
│   ├── tone-checker.md          # the skill
│   └── tone-checker.eval.yaml   # tests for this skill
├── agents/
│   ├── support-agent.md
│   └── support-agent.eval.yaml

The .eval.yaml file must share the same base name as its .md sibling.
Writing eval files
tests:
  - description: "handles refund request"
    vars:
      input: "I want a refund, this is not what I expected"
    assert:
      - type: contains-any   # output includes at least one of these
        value: ["refund", "process", "help"]
      - type: not-icontains  # output must NOT include this (case-insensitive)
        value: "that's not possible"
      - type: llm-rubric     # LLM judges output against these criteria
        value: "acknowledges frustration, explains refund process clearly"

Each test case has:
| Field | Required | Description |
|---|---|---|
| description | No | Human-readable test label |
| vars | Yes | Input variables passed to the prompt |
| assert | Yes | One or more assertions to check the output |
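As a sketch of how these fields combine, the file below holds two test cases, the second without a description since that field is optional. The inputs and expected strings are hypothetical; only the field names and assertion types come from this page.

```yaml
tests:
  # Labeled test: the description shows up as the human-readable test name.
  - description: "handles refund request"
    vars:
      input: "I want a refund, this is not what I expected"
    assert:
      - type: contains-any
        value: ["refund", "process", "help"]

  # Unlabeled test: description is omitted, vars and assert are still required.
  - vars:
      input: "Where is my order?"   # hypothetical input
    assert:
      - type: not-icontains
        value: "i don't know"
```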
Assertions
Assertions are passed directly to the eval engine. Common types:
| Type | Description |
|---|---|
| contains | Output contains the value |
| contains-any | Output contains any of the values |
| not-contains | Output does not contain the value |
| icontains | Case-insensitive contains |
| not-icontains | Case-insensitive not-contains |
| llm-rubric | LLM judges output against the given criteria |
| similar | Cosine similarity above threshold |
| cost | Response cost below threshold |
| latency | Response time below threshold |
Full assertion reference: promptfoo assertions
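The threshold-based types take their limit in a separate threshold field in promptfoo's assertion format. The sketch below assumes grekt forwards that field unchanged, which follows from assertions being passed directly to the eval engine but is not confirmed here. All numbers and strings are illustrative, not recommendations.

```yaml
tests:
  - description: "answers quickly, cheaply, and on topic"
    vars:
      input: "How do I request a refund?"   # hypothetical input
    assert:
      # Semantic closeness to a reference answer; relies on embeddings,
      # which may need access to an embedding-capable provider.
      - type: similar
        value: "You can request a refund from your order page."
        threshold: 0.8      # illustrative minimum similarity
      # Upper bound on the cost reported for the call.
      - type: cost
        threshold: 0.002    # illustrative value
      # Upper bound on response time, in milliseconds.
      - type: latency
        threshold: 5000     # illustrative value
```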
WARNING
llm-rubric should not be used for deterministic outputs. Since another LLM evaluates the result, it can produce inconsistent results across runs and degrade over time as models change. See this research log for a real-world example of LLM-as-judge masking regressions.
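For outputs that should be deterministic, string-level checks can replace the rubric. The sketch below assumes a hypothetical skill that answers with a single tone label; the label set and input are invented for illustration. The regex type comes from promptfoo's assertion reference rather than the table above.

```yaml
tests:
  - description: "classifies a complaint as negative"
    vars:
      input: "This is the third time my order arrived broken."
    assert:
      # The expected label must appear, regardless of case.
      - type: icontains
        value: "negative"
      # The whole output must be exactly one allowed label (hypothetical set).
      # Single quotes keep YAML from interpreting the backslashes.
      - type: regex
        value: '^\s*(positive|neutral|negative)\s*$'
```

These checks give the same verdict on every run for the same output, so a failure points at the skill rather than at the judge.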
Provider override
Each eval file can override the default provider:
provider: openai:gpt-4.1-mini
tests:
  - vars:
      input: "test"
    assert:
      - type: contains
        value: "ok"

Provider configuration
Set the default provider in .grekt/config.yaml (gitignored):
eval:
  providers:
    - openai:gpt-4.1-mini

API keys are set via environment variables. The provider string follows promptfoo's provider format.
Common providers:
| Provider | Format |
|---|---|
| OpenAI | openai:gpt-4.1-mini |
| Anthropic | anthropic:messages:claude-sonnet-4-20250514 |
| Ollama (local) | ollama:chat:llama3.3 |
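Any format from the table can be used in the per-file provider key. As a sketch, the eval file below points one skill's tests at a local Ollama model while the rest of the project keeps the default. It assumes Ollama is running locally with that model available; the test content is hypothetical.

```yaml
# tone-checker.eval.yaml (file name taken from the example tree above)
provider: ollama:chat:llama3.3   # overrides the default from .grekt/config.yaml
tests:
  - description: "mentions the refund when asked about one"
    vars:
      input: "Can I get my money back?"   # hypothetical input
    assert:
      - type: icontains
        value: "refund"
```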
Evaluable elements
Only these element types support evals:
- Skills
- Agents
- Commands
Other types (rules, hooks, MCPs, and so on) have their own testing strategies. An .eval.yaml file next to an unsupported element produces a warning.
See also
- grekt eval - Command reference