Intelligence

Measure agent quality.
Scientifically.

Run structured evaluations across multiple models. Compare quality, cost, and speed. Detect regressions before they reach production.

Request Access

SWE-BENCH VERIFIED 500 real GitHub issues — live evaluation run

Instances

django__django-15648 ✓ RESOLVED

sympy__sympy-21379 ✓ RESOLVED

astropy__astropy-14309 ✓ RESOLVED

matplotlib__matplotlib-26011 RUNNING...

scikit-learn__scikit-19851 RUNNING...

flask__flask-4045 QUEUED

371 / 500 instances

Score

74.2% (meter scale: 0% to 100%)

Cost Tracker

TOKENS USED 1.24M

COST SO FAR $4.82

ESTIMATED TOTAL ~$47.50

Model Throughput

Haiku: 312 instances

Sonnet: 47 instances

Opus: 12 instances

6 CONNECTORS ACTIVE

① 500 GitHub issues — real SWE-bench instances queued and solved in parallel

② Score climbs live — resolved instances feed the score meter in real time

③ Cost tracked per model — Haiku, Sonnet, and Opus throughput lanes shown with live spend

Test Suites

Define evaluation cases with expected behavior. Run them against any model or expert agent configuration.
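As a rough illustration of what an evaluation case might look like, here is a minimal Python sketch. The names `EvalCase`, `run_suite`, and `echo_model` are hypothetical and are not Orqista's API; the "model" is a stand-in function so the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One evaluation case: an input plus a check for the expected behavior."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_suite(cases: list[EvalCase], model: Callable[[str], str]) -> dict[str, bool]:
    """Run every case against one model callable and record pass/fail per case."""
    return {case.name: case.check(model(case.prompt)) for case in cases}


# Stand-in "model" for illustration only; a real run would call an LLM or agent.
def echo_model(prompt: str) -> str:
    return prompt.upper()


suite = [
    EvalCase("uppercases", "hello", lambda out: out == "HELLO"),
    EvalCase("mentions_refund", "hello", lambda out: "REFUND" in out),
]
results = run_suite(suite, echo_model)
```

Because the model is passed in as a plain callable, the same suite can be pointed at a different model or agent configuration without changing the cases.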

Side-by-Side Comparison

See how different models handle the same job. Compare token usage, latency, cost, and output quality.

Scoring

Structured scoring on a 0-100 scale. Track pass/fail, quality metrics, and guardrail compliance per run.

Trend Analysis

Track quality scores over time. Detect when model updates cause regressions in your specific use cases.

Cost Analysis

See the exact cost per response for each model. Make informed decisions about model routing and budget.
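Per-response cost usually comes down to token counts multiplied by per-token prices. The sketch below assumes separate input and output prices quoted per million tokens; the model names and prices are placeholders chosen for easy arithmetic, not real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in USD per million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def response_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single response."""
    total = input_tokens * pricing.input_per_mtok + output_tokens * pricing.output_per_mtok
    return total / 1_000_000


# Placeholder price table; real prices vary by provider and change over time.
PRICES = {
    "model-small": Pricing(input_per_mtok=1.0, output_per_mtok=5.0),
    "model-large": Pricing(input_per_mtok=15.0, output_per_mtok=75.0),
}

# The same request (2,000 input tokens, 500 output tokens) priced on both models.
cost_small = response_cost(PRICES["model-small"], 2_000, 500)
cost_large = response_cost(PRICES["model-large"], 2_000, 500)
```

Pricing the same request on each model makes the routing trade-off concrete: the question becomes whether the quality gain justifies the cost ratio.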

Scheduled Evals

Run evaluations on a schedule. Get notified when scores drop below your thresholds.
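A threshold alert and a simple regression check could be sketched as follows. This is one possible rule, not a description of how Orqista decides: the baseline is assumed to be the mean of all earlier runs, and the 5-point `max_drop` default is an arbitrary illustration.

```python
from statistics import mean


def below_threshold(latest_score: float, threshold: float) -> bool:
    """True when the most recent score falls under the alert threshold."""
    return latest_score < threshold


def regressed(history: list[float], max_drop: float = 5.0) -> bool:
    """True when the latest score sits more than max_drop below the mean of earlier runs."""
    if len(history) < 2:
        return False  # nothing to compare against yet
    baseline = mean(history[:-1])
    return baseline - history[-1] > max_drop


# Scores from four scheduled runs on a 0-100 scale, oldest first.
history = [82.0, 84.0, 83.0, 74.0]
alert = below_threshold(history[-1], threshold=80.0)
drop = regressed(history)
```

Here the earlier runs average 83.0 and the latest is 74.0, so both the absolute threshold and the relative drop would trigger.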

SWE-bench Verified

Run against the industry-standard coding benchmark. Orqista integrates the official SWE-bench Docker evaluation harness, originally developed at Princeton, for ground-truth pass/fail scoring. This is the same methodology used to rank frontier AI models.

Warm / Cold Mode

Measure whether Skill Documents improve agent quality over time. Cold runs use a fresh agent; warm runs let the agent draw on learned procedural knowledge. Compare scores to prove improvement.
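One simple way to compare a cold run with a warm run is a paired comparison over the cases both runs share. The function name and the sample scores below are hypothetical and only illustrate the arithmetic.

```python
def compare_runs(cold: dict[str, float], warm: dict[str, float]) -> tuple[float, int]:
    """Return (mean score change, number of improved cases) over shared cases."""
    shared = sorted(cold.keys() & warm.keys())
    if not shared:
        raise ValueError("the two runs share no cases")
    deltas = [warm[name] - cold[name] for name in shared]
    improved = sum(1 for d in deltas if d > 0)
    return sum(deltas) / len(deltas), improved


# Made-up per-case scores on a 0-100 scale.
cold_scores = {"case_a": 60.0, "case_b": 70.0, "case_c": 80.0}
warm_scores = {"case_a": 75.0, "case_b": 70.0, "case_c": 85.0}

mean_delta, improved_count = compare_runs(cold_scores, warm_scores)
```

Pairing by case keeps the comparison fair: a difference in the averages cannot come from the two runs covering different cases.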

How scoring works

Each evaluation case defines the criteria for expected behavior. The grader scores LLM output on a 0-100 scale across correctness, completeness, and guardrail compliance. Results are tracked over time to detect regressions before they reach production. For SWE-bench Verified runs, the official Docker harness applies each patch and runs the real repository's test suite: an instance counts as resolved only when the issue's failing tests pass and previously passing tests still pass.
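To make the two kinds of score concrete, here is a minimal sketch. How the three criteria are combined is not specified above, so the equal-weight average is an assumption, and the function names are hypothetical. The resolve rate for benchmark runs is simply resolved instances over total instances.

```python
def overall_score(
    correctness: float,
    completeness: float,
    guardrails: float,
    weights: tuple[float, float, float] = (1.0, 1.0, 1.0),  # assumed equal weighting
) -> float:
    """Weighted mean of three 0-100 sub-scores, itself on a 0-100 scale."""
    parts = (correctness, completeness, guardrails)
    if any(not 0.0 <= p <= 100.0 for p in parts):
        raise ValueError("sub-scores must be between 0 and 100")
    if any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")
    return sum(p * w for p, w in zip(parts, weights)) / sum(weights)


def resolve_rate(resolved: int, total: int) -> float:
    """Percentage of benchmark instances whose patch passed the harness."""
    if total <= 0 or not 0 <= resolved <= total:
        raise ValueError("need 0 <= resolved <= total and total > 0")
    return 100.0 * resolved / total


case_score = overall_score(correctness=90.0, completeness=80.0, guardrails=100.0)
rate = resolve_rate(resolved=3, total=4)
```

The graded score is continuous, while the benchmark result is binary per instance, which is why the two are kept as separate functions.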

Related Features

Play Workbench

Multi-Model Support

Expert Agents

See benchmark results for your codebase

Request access to run evaluations against your real projects.

Request Early Access
