ArkCore Evals

Offline regression evaluation workflow for ResearchArk LLM prompts and retrieval quality gates.

ArkCore Evals is the offline quality gate for ResearchArk prompt and retrieval changes. It scores candidate outputs against version-pinned golden datasets, compares the result with the main-branch scorecard, and blocks regressions before prompt or embedding changes ship.

When To Run It

Run an evaluation before merging changes that affect:

arkcore-llm prompts, model routing, or structured-output behavior
arkcore-vector ranking or indexed document shape
arkcore-text-processor embedding or reranking behavior
ArkSearch retrieval, partner recommendations, opportunity briefings, consortium scoring, expert matching, or funding-opportunity AI remarks

Walkthrough

Build the reviewed release input from an offline sampled-event handoff. If the row-level handoff is blocked, refresh scorecards/release-source-counts.json with the operator helper, then run these steps in order:

source-counts-check
release-handoff-report
source-collection-check
source-sampling-plan
build_release_source_exports.py
source-exports-check
build-sampled-event-label-packet
build-draft-sampled-event-labels
sampled-event-labels-check
build-sampled-events

Use the focused source-collection report to archive the remaining missing persisted-row actions before attempting sampled-event validation, and use the source-sampling plan to archive read-only SQL templates for persisted segments with available rows.

Use the handoff report's sampled-event sampling plan to see the target rows, sampleable rows, selected source segments, source allocation across shared sources, source shortfalls per surface, source-collection actions for missing persisted rows, and unused-source guidance. The handoff report can be regenerated from scorecards/release-source-counts.json, which must contain aggregate counts only, match the dataset version, avoid PII-bearing source names, and include every mapped source key.

The refresh helper uses live read-only count(*) queries from separate user-data and funding DSNs; it is not eval runtime code and does not export rows. source-sampling-plan also does not connect to a database or export row-level traffic; it exits non-zero while persisted-row shortfalls remain and omits SQL for blocked surfaces unless --allow-partial is used for available-row SQL templates. scripts/build_release_source_exports.py executes only audited single-SELECT templates against explicit read-only DSNs, PII-scans rows before writing the ignored packet, and leaves strict validation to source-exports-check.

source-exports-check validates the offline row packet created from those SQL templates, matches rows to the sampling plan, rejects partial source-sampling plans unless --allow-partial is explicit, and rejects PII or raw source keys before sampled-event assembly. Source-export rows wrap source columns as traffic_id, surface, source, pii_scrubbed: true, and source_payload; schemas/source-export.schema.json documents the wrapper shape, while source-exports-check enforces plan-specific output columns and row counts.

If only available persisted sources have been exported, --allow-partial on source-sampling-plan, build_release_source_exports.py, source-exports-check, build-sampled-event-label-packet, build-draft-sampled-event-labels, sampled-event-labels-check, build-sampled-events, and sampled-events-check validates, labels, and assembles those rows early while preserving partial: true and the blocked-surface list so the reports cannot satisfy release readiness; when partial work is possible, release-handoff-report lists those optional commands separately from the strict release-facing command list.

build-sampled-event-label-packet renders an ignored Markdown worksheet from valid exports so operators can author labels without the harness inventing expected outputs. build-draft-sampled-event-labels writes a draft label seed from validated exports, warns where reviewers must replace retrieval relevance ids or expand opportunity briefing outputs, and still does not create human-review evidence. sampled-event-labels-check verifies one label per export row, no extra labels, valid retrieval or structured expected-output shape for the matched surface, and the same PII/raw-key guardrails before assembly. build-sampled-events joins that validated export with operator-authored sampled-event-labels.jsonl, copies source and surface from the export rather than the label file, requires one label per export row, rejects extra labels, and never adds human-review metadata.

Endpoint-specific LLM usage counts can confirm volume, latency, or cost attribution, but they are not sample-ready golden rows unless paired with persisted, PII-scrubbed surface outputs and human review. Each sampled row must carry one of those non-PII source keys so the reviewer packet stays traceable to the handoff plan; --handoff-report treats the plan as a target contract, so stale dataset-version plans are rejected, ready entries cannot lower sampled rows below the plan target or requested surface threshold, shared source segments cannot allocate more rows than their available count, and non-ready plan rows repeat the source-collection shortfall at validation time. The sampled-event report carries the requested dataset version for reviewer traceability. Start with sampled-traffic-events.jsonl, which must come from validated exports plus labels and must not contain raw production database rows, runtime logs, user identifiers, raw queries, or human-review claims.
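To make the source-export wrapper shape described above concrete, here is a minimal Python sketch of a per-line wrapper check. It is not the source-exports-check implementation: the function name, the error strings, and the assumption that source_payload is a JSON object are all illustrative, and the authoritative shape lives in schemas/source-export.schema.json.

```python
import json

# Wrapper keys named in the walkthrough; everything else here is illustrative.
REQUIRED_KEYS = {"traffic_id", "surface", "source", "pii_scrubbed", "source_payload"}


def wrapper_errors(line: str) -> list[str]:
    """Return the problems found in one source-export JSONL line."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(row, dict):
        return ["row is not a JSON object"]
    errors = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - row.keys())]
    if row.get("pii_scrubbed") is not True:
        errors.append("pii_scrubbed must be true")
    # Assumption: the payload is a JSON object holding the source columns.
    if "source_payload" in row and not isinstance(row["source_payload"], dict):
        errors.append("source_payload must be an object")
    return errors


good = json.dumps({
    "traffic_id": "t-001",
    "surface": "opportunity_briefings",
    "source": "example_source_key",  # hypothetical non-PII source key
    "pii_scrubbed": True,
    "source_payload": {"title": "example"},
})
print(wrapper_errors(good))  # []
print(wrapper_errors('{"traffic_id": "t-002"}'))
```

A check like this only covers the wrapper; plan-specific output columns and row counts are still the job of source-exports-check.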

ArkSphere expert, consortium, and partner surfaces use role-expanded source rows derived from persisted required_roles JSON and sphere-opportunity links, while endpoint-specific LLM usage remains telemetry only.

uv run python scripts/refresh_release_source_counts.py \
  --user-data-dsn "$ARKCORE_EVALS_USER_DATA_DSN" \
  --funding-dsn "$ARKCORE_EVALS_FUNDING_DSN" \
  --dataset-version v0.1.0 \
  --out scorecards/release-source-counts.json

uv run arkcore-evals source-counts-check \
  --input scorecards/release-source-counts.json \
  --dataset-version v0.1.0 \
  --json-out scorecards/release-source-counts-check.json \
  --markdown-out scorecards/release-source-counts-check.md

uv run arkcore-evals release-handoff-report \
  --root . \
  --dataset-version v0.1.0 \
  --source-counts-file scorecards/release-source-counts.json \
  --target-total 500 \
  --json-out scorecards/release-handoff-readiness.json \
  --markdown-out scorecards/release-handoff-readiness.md

uv run arkcore-evals source-collection-check \
  --handoff-report scorecards/release-handoff-readiness.json \
  --dataset-version v0.1.0 \
  --json-out scorecards/source-collection.json \
  --markdown-out scorecards/source-collection.md

uv run arkcore-evals source-sampling-plan \
  --handoff-report scorecards/release-handoff-readiness.json \
  --dataset-version v0.1.0 \
  --json-out scorecards/source-sampling-plan.json \
  --markdown-out scorecards/source-sampling-plan.md

uv run python scripts/build_release_source_exports.py \
  --source-sampling-plan scorecards/source-sampling-plan.json \
  --user-data-dsn "$ARKCORE_EVALS_USER_DATA_DSN" \
  --funding-dsn "$ARKCORE_EVALS_FUNDING_DSN" \
  --dataset-version v0.1.0 \
  --out source-exports.jsonl \
  --json-out scorecards/source-exports-build.json \
  --markdown-out scorecards/source-exports-build.md

uv run arkcore-evals source-exports-check \
  --input source-exports.jsonl \
  --source-sampling-plan scorecards/source-sampling-plan.json \
  --dataset-version v0.1.0 \
  --json-out scorecards/source-exports.json \
  --markdown-out scorecards/source-exports.md

uv run arkcore-evals build-sampled-event-label-packet \
  --source-exports source-exports.jsonl \
  --source-sampling-plan scorecards/source-sampling-plan.json \
  --dataset-version v0.1.0 \
  --json-out scorecards/sampled-event-label-packet.json \
  --markdown-out scorecards/sampled-event-label-packet.md

uv run arkcore-evals build-draft-sampled-event-labels \
  --source-exports source-exports.jsonl \
  --source-sampling-plan scorecards/source-sampling-plan.json \
  --out sampled-event-labels.jsonl \
  --dataset-version v0.1.0 \
  --json-out scorecards/sampled-event-labels-draft.json \
  --markdown-out scorecards/sampled-event-labels-draft.md

uv run arkcore-evals sampled-event-labels-check \
  --labels sampled-event-labels.jsonl \
  --source-exports source-exports.jsonl \
  --source-sampling-plan scorecards/source-sampling-plan.json \
  --dataset-version v0.1.0 \
  --json-out scorecards/sampled-event-labels.json \
  --markdown-out scorecards/sampled-event-labels.md

uv run arkcore-evals build-sampled-events \
  --source-exports source-exports.jsonl \
  --labels sampled-event-labels.jsonl \
  --out sampled-traffic-events.jsonl \
  --source-sampling-plan scorecards/source-sampling-plan.json \
  --handoff-report scorecards/release-handoff-readiness.json \
  --dataset-version v0.1.0 \
  --json-out scorecards/sampled-events-build.json \
  --markdown-out scorecards/sampled-events-build.md
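The join rules for build-sampled-events (one label per export row, no extra labels, source and surface copied from the export) can be sketched in Python as follows. This is a hedged illustration rather than the harness code: joining on traffic_id and the expected label field are assumptions made for the example.

```python
def join_exports_with_labels(exports: list[dict], labels: list[dict]) -> list[dict]:
    """Join export rows with labels, enforcing a one-to-one match."""
    labels_by_id: dict[str, dict] = {}
    for label in labels:
        tid = label["traffic_id"]  # assumed join key
        if tid in labels_by_id:
            raise ValueError(f"duplicate label for {tid}")
        labels_by_id[tid] = label

    export_ids = {row["traffic_id"] for row in exports}
    missing = sorted(export_ids - labels_by_id.keys())
    extra = sorted(labels_by_id.keys() - export_ids)
    if missing:
        raise ValueError(f"export rows without a label: {missing}")
    if extra:
        raise ValueError(f"labels without an export row: {extra}")

    events = []
    for row in exports:
        label = labels_by_id[row["traffic_id"]]
        events.append({
            "traffic_id": row["traffic_id"],
            # surface and source always come from the export, never the label file
            "surface": row["surface"],
            "source": row["source"],
            "expected": label["expected"],  # hypothetical label field name
        })
    return events


exports = [{"traffic_id": "t-001", "surface": "arksearch_retrieval", "source": "seg_a"}]
labels = [{"traffic_id": "t-001", "surface": "ignored", "expected": {"relevant_ids": ["d1"]}}]
print(join_exports_with_labels(exports, labels)[0]["surface"])  # arksearch_retrieval
```

Note that the sketch adds no review metadata to the joined rows, mirroring the rule that assembly never creates human-review evidence.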

uv run arkcore-evals sampled-events-check \
  --input sampled-traffic-events.jsonl \
  --handoff-report scorecards/release-handoff-readiness.json \
  --dataset-version v0.1.0 \
  --json-out scorecards/sampled-events.json \
  --markdown-out scorecards/sampled-events.md

uv run arkcore-evals build-review-candidates \
  --input sampled-traffic-events.jsonl \
  --out review-candidates.jsonl \
  --dataset-version v0.1.0 \
  --handoff-report scorecards/release-handoff-readiness.json \
  --json-out scorecards/review-candidates-build.json \
  --markdown-out scorecards/review-candidates-build.md

uv run arkcore-evals review-candidates-check \
  --input review-candidates.jsonl \
  --dataset-version v0.1.0 \
  --json-out scorecards/review-candidates.json \
  --markdown-out scorecards/review-candidates.md \
  --review-packet-out scorecards/review-candidate-packet.md

The review-candidate packet is for a human reviewer. It does not approve rows. Create a controlled worksheet if useful:

uv run arkcore-evals build-review-decision-template \
  --candidates review-candidates.jsonl \
  --out review-decisions.template.jsonl \
  --dataset-version v0.1.0 \
  --json-out scorecards/review-decision-template.json \
  --markdown-out scorecards/review-decision-template.md

Template rows start with approved: false and blank reviewer fields, so they do not validate as review evidence. Capture completed reviewer approvals in review-decisions.jsonl; reviewer identifiers must not be AI, automation, fixture, or test markers. Then build and validate the reviewed sample file:

uv run arkcore-evals review-decisions-check \
  --candidates review-candidates.jsonl \
  --decisions review-decisions.jsonl \
  --dataset-version v0.1.0 \
  --json-out scorecards/review-decisions.json \
  --markdown-out scorecards/review-decisions.md

uv run arkcore-evals build-reviewed-samples \
  --candidates review-candidates.jsonl \
  --decisions review-decisions.jsonl \
  --out reviewed-traffic-samples.jsonl \
  --dataset-version v0.1.0 \
  --json-out scorecards/reviewed-samples-build.json \
  --markdown-out scorecards/reviewed-samples-build.md

uv run arkcore-evals reviewed-samples-check \
  --input reviewed-traffic-samples.jsonl \
  --dataset-version v0.1.0 \
  --json-out scorecards/reviewed-samples.json \
  --markdown-out scorecards/reviewed-samples.md
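The review-evidence rules above (template rows start unapproved with blank reviewer fields, and reviewer identifiers must not be AI, automation, fixture, or test markers) can be illustrated with a small Python sketch. The field names approved and reviewer, the marker list, and the token-based matching are assumptions for this example; the real check is review-decisions-check.

```python
import re

# Illustrative marker list based on the rule stated in this walkthrough.
BLOCKED_REVIEWER_MARKERS = ("ai", "automation", "fixture", "test")


def is_review_evidence(decision: dict) -> bool:
    """Return True only for an approved decision with a plausible human reviewer."""
    reviewer = str(decision.get("reviewer", "")).strip().lower()
    if decision.get("approved") is not True or not reviewer:
        return False
    # Split on non-alphanumerics so "test-bot" is blocked but "alpha912" is not.
    tokens = re.split(r"[^a-z0-9]+", reviewer)
    return not any(marker in tokens for marker in BLOCKED_REVIEWER_MARKERS)


print(is_review_evidence({"approved": False, "reviewer": ""}))         # False (template row)
print(is_review_evidence({"approved": True, "reviewer": "test-bot"}))  # False
print(is_review_evidence({"approved": True, "reviewer": "alpha912"}))  # True
```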

Curate a versioned golden dataset under data/golden/<dataset-version>/ from the offline JSONL export of sampled, PII-scrubbed, human-reviewed traffic rows. Curation validates every reviewed input row before selection, so clean selected rows cannot mask an invalid export.

uv run arkcore-evals curate-dataset \
  --input reviewed-traffic-samples.jsonl \
  --out-dir data/golden/v0.1.0 \
  --dataset-version v0.1.0 \
  --traffic-from 2026-05-01 \
  --traffic-to 2026-05-18 \
  --sampling-method "stratified production traffic sample" \
  --pii-scrubbing-method "automated PII scan plus human review" \
  --reviewer alpha912 \
  --approved-by alpha912 \
  --approved-at 2026-05-18

Select the golden dataset version for the surface under test. Scorecard runs, live judge runs, daily drift, and the reusable CI workflow reject datasets outside data/golden/; candidate scoring and live judging also require reviewed traffic-sample provenance before scoring or model calls.

For a new production baseline, validate the raw reviewed input, baseline prediction, and judge files. The preflight rejects non-JSONL paths and obvious live database or log locations before parsing. Standalone baseline checks also report unsafe or missing baseline artifact paths before trusting reviewed rows, so malformed reviewed samples do not hide bad artifact locations. The repo checklist in docs/release-input-format.md and the schemas/*.schema.json files define the exact JSONL shape for reviewed rows, baseline predictions, Claude Opus/Sonnet/Haiku 4.x plus Gemini candidate model coverage, per-surface and per-model prediction counts, Claude Opus plus Gemini Pro judge scores, the release_requirements threshold summary, the required_files handoff summary, and the missing_files report for absent handoff files:

uv run arkcore-evals release-inputs-check \
  --reviewed-input reviewed-traffic-samples.jsonl \
  --baseline-predictions baseline-predictions.jsonl \
  --judge-scores baseline-judges.jsonl \
  --dataset-version v0.1.0 \
  --json-out scorecards/release-inputs.json

Build the canonical release bundle from offline reviewed inputs:

uv run arkcore-evals build-release-bundle \
  --reviewed-input reviewed-traffic-samples.jsonl \
  --baseline-predictions baseline-predictions.jsonl \
  --judge-scores baseline-judges.jsonl \
  --root . \
  --dataset-version v0.1.0 \
  --traffic-from 2026-05-01 \
  --traffic-to 2026-05-18 \
  --sampling-method "stratified production traffic sample" \
  --pii-scrubbing-method "automated PII scan plus human review" \
  --reviewer alpha912 \
  --approved-by alpha912 \
  --approved-at 2026-05-18

The command reruns the raw input validator, stages curation, filters baseline predictions and two-judge scores to the selected golden examples, generates scorecards/main.json, runs release-check, and writes the canonical dataset, manifest, scorecards/main.json, and scorecards/main-judges.jsonl only after the staged bundle is release-ready. Add --dry-run --json-out scorecards/release-dry-run.json to validate those release inputs without writing canonical artifacts.

Generate candidate predictions from that dataset only. Do not query production data during the evaluation run.

Run dataset validation:

uv run arkcore-evals validate-dataset \
  data/golden/v0.1.0/golden.jsonl \
  --manifest data/golden/v0.1.0/manifest.json

Check release readiness when adopting a production dataset or main scorecard. The golden manifest must include reviewed-input provenance: export SHA-256, input row count, selection seed, difficulty counts, and the release minimums used by curation. The baseline artifacts must be scorecards/main.json and scorecards/main-judges.jsonl; the judge file must use the exact release Claude Opus and Gemini Pro ids, and the scorecard run id must match the candidate hash, dataset hash, dataset version, seed, and full-surface scope so the release can be reproduced.

uv run arkcore-evals release-check \
  --dataset data/golden/v0.1.0/golden.jsonl \
  --manifest data/golden/v0.1.0/manifest.json \
  --baseline-scorecard scorecards/main.json \
  --judge-scores scorecards/main-judges.jsonl
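As an illustration of the reviewed-input provenance the manifest records, the following Python sketch computes a SHA-256 and a row count for a JSONL export. Hashing the raw file bytes and counting non-blank lines are assumptions about the convention, and the key names sha256 and input_row_count are hypothetical; the curation command defines the real manifest fields.

```python
import hashlib
import tempfile
from pathlib import Path


def export_fingerprint(path: Path) -> dict:
    """Hash the raw bytes of a JSONL export and count its non-blank rows."""
    data = path.read_bytes()
    rows = [line for line in data.splitlines() if line.strip()]
    return {
        "sha256": hashlib.sha256(data).hexdigest(),  # hypothetical key name
        "input_row_count": len(rows),                # hypothetical key name
    }


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "reviewed-traffic-samples.jsonl"
    sample.write_text('{"traffic_id": "t-001"}\n{"traffic_id": "t-002"}\n', encoding="utf-8")
    fingerprint = export_fingerprint(sample)

print(fingerprint["input_row_count"])  # 2
```

Recomputing a fingerprint like this against the handoff file is one way to spot a manifest that no longer matches the export it claims to describe.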

Service deployments can run the same readiness gate through POST /v1/release-check. The endpoint accepts only golden dataset and manifest paths under data/golden/, with baseline scorecards and judge-score JSONL under the configured scorecard directory; missing in-scope artifacts return a readiness report instead of a transport error.
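The data/golden/ boundary mentioned above is a path-containment check. A minimal Python sketch of that idea follows, assuming Python 3.9+ for Path.is_relative_to; the helper name is_under is hypothetical and this is not the endpoint's actual code.

```python
from pathlib import Path


def is_under(path: str, allowed_root: str) -> bool:
    """Return True when path resolves to a location inside allowed_root."""
    root = Path(allowed_root).resolve()
    candidate = Path(path).resolve()  # collapses ".." segments before comparing
    return candidate.is_relative_to(root)


print(is_under("data/golden/v0.1.0/golden.jsonl", "data/golden"))  # True
print(is_under("data/golden/../../etc/passwd", "data/golden"))     # False
print(is_under("data/golden-copy/golden.jsonl", "data/golden"))    # False
```

Resolving both paths before comparing is what keeps traversal segments and look-alike sibling directories from passing a simple string-prefix test.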

Before calling the harness objective complete, run the completion audit with uv run arkcore-evals completion-audit. It maps the source-export and sampled-event label handoff contracts, raw release handoff files, canonical release files, metric implementation, FastAPI route and API-test evidence, reusable and caller CI workflows, caller exporter golden-path guards, prompt registry, admin promotion path, observability, docs, and coverage gate to concrete evidence and fails until scorecards/release-inputs.json, the real handoff files, and release artifacts are present.

It also checks that the manifest curation hash and row count match the reviewed traffic handoff, that the golden rows were selected from that file, that the main scorecard candidate hash and accounting totals match the supplied baseline predictions, and that the canonical judge file matches the supplied judge handoff filtered to the golden dataset.

The required scorecards/release-inputs.json, scorecards/source-sampling-plan.json, scorecards/source-exports.json, scorecards/sampled-event-label-packet.json, scorecards/sampled-event-labels.json, and scorecards/sampled-events-build.json reports must be valid, non-partial, free of errors, match the audit dataset version, and meet release row thresholds. If scorecards/release-handoff-readiness.json exists, the audit includes any source-collection shortfalls from that report. The JSON output includes a checklist that maps each gate to expected artifacts, commands, evidence, and missing items.

Run the two-judge rubric pass:

uv run arkcore-evals judge \
  --dataset data/golden/v0.1.0/golden.jsonl \
  --predictions scorecards/candidate-predictions.jsonl \
  --out scorecards/judges.jsonl

Service deployments can generate the same cache-backed judge JSONL through POST /v1/judge-scores; the endpoint enforces the same data/golden/ dataset boundary as the CLI, writes artifacts only under the configured scorecard directory, and keeps judge cache writes inside the configured judge-cache directory.

For a service-owned subset, repeat --surface on the judge, run, and compare commands so the same rows are used through the whole gate:

uv run arkcore-evals judge \
  --dataset data/golden/v0.1.0/golden.jsonl \
  --predictions scorecards/candidate-predictions.jsonl \
  --out scorecards/judges.jsonl \
  --surface opportunity_briefings \
  --surface funding_opportunity_ai_remarks

Run the candidate scorecard:

uv run arkcore-evals run \
  --dataset data/golden/v0.1.0/golden.jsonl \
  --predictions scorecards/candidate-predictions.jsonl \
  --candidate-id pr-123 \
  --judge-scores scorecards/judges.jsonl \
  --out scorecards/pr-123.json \
  --surface opportunity_briefings \
  --surface funding_opportunity_ai_remarks

Compare with the main-branch scorecard:

uv run arkcore-evals compare \
  --base scorecards/main.json \
  --candidate scorecards/pr-123.json \
  --markdown-out scorecards/eval-comment.md \
  --surface opportunity_briefings \
  --surface funding_opportunity_ai_remarks

Publish accepted scorecards for admin review:

uv run arkcore-evals publish-scorecard \
  --scorecard scorecards/pr-123.json \
  --database-url "$EVALS_DATABASE_URL" \
  --status passed

Check the run envelope:

uv run arkcore-evals check-budget \
  --scorecard scorecards/pr-123.json \
  --max-cost-usd 15 \
  --max-duration-seconds 1500

Review blocked regressions. The default gate fails when deterministic score drops by more than 0.05 or judge score drops by more than 0.5.

Promote accepted prompts through /admin/evals, which writes a versioned prompt_registry row only when the run and selected surface passed gates. Promotion is blocked unless the selected prompt text is present in the scorecard prompt_candidates list, so the admin can only promote text that was evaluated. The completion audit checks these gate tests directly. arkcore-llm reads promoted prompts at startup and falls back to prompt files if the database is unavailable.
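The default regression thresholds stated above (a deterministic drop of more than 0.05 or a judge drop of more than 0.5) can be sketched in Python as below. The score field names and the rounding step are assumptions for the example; the compare command is the authority on how scores are aggregated and compared.

```python
def regression_reasons(
    base: dict,
    candidate: dict,
    max_deterministic_drop: float = 0.05,
    max_judge_drop: float = 0.5,
) -> list[str]:
    """Return the reasons a candidate scorecard would be blocked, if any."""
    reasons = []
    # Round so floating-point noise at the boundary does not trigger the gate.
    det_drop = round(base["deterministic_score"] - candidate["deterministic_score"], 6)
    judge_drop = round(base["judge_score"] - candidate["judge_score"], 6)
    if det_drop > max_deterministic_drop:
        reasons.append(f"deterministic score dropped by {det_drop}")
    if judge_drop > max_judge_drop:
        reasons.append(f"judge score dropped by {judge_drop}")
    return reasons


base = {"deterministic_score": 0.80, "judge_score": 4.0}  # hypothetical field names
print(regression_reasons(base, {"deterministic_score": 0.70, "judge_score": 3.8}))
print(regression_reasons(base, {"deterministic_score": 0.78, "judge_score": 4.1}))  # []
```

A drop of exactly 0.05 or 0.5 passes in this sketch, matching the "more than" wording of the default gate.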

CI Adoption

The reusable arkcore-evals GitHub workflow runs in the caller repository, installs the harness, enforces data/golden/ dataset scope, runs release-check, generates or reads candidate predictions, runs live judges or supplied judge scores, checks budget, compares scorecards, comments on the PR, and uploads artifacts. Callers must provide the golden manifest and main judge-score file, and their path filters must cover the runtime prompt, LLM client, vector, and text-processing files that feed eval predictions.

Caller workflows should pass surfaces as a space-separated list matching the repo ownership. For example, arkcore-llm owns prompt surfaces such as opportunity_briefings, while arkcore-vector and arkcore-text-processor own retrieval and matching surfaces such as arksearch_retrieval, expert_to_cluster_matching, and partner_search_recommendations.

Set publish_scorecard: "true" only for trusted scheduled runs with the EVALS_DATABASE_URL secret. PR runs should produce artifacts and comments, not publish to the admin review tables.

Daily Drift Worker

Run daily provider-drift checks through the dedicated eval queue:

celery -A arkcore_evals.worker worker -Q evals --loglevel=INFO
celery -A arkcore_evals.worker beat --loglevel=INFO

By default, the worker reads only golden datasets. Prediction exporters must generate candidate JSONL from those inputs and must not query production data during the eval run.

Use these guards for scheduled runs:

ARKCORE_EVALS_DAILY_REQUIRE_GOLDEN_DATASET=true rejects daily datasets outside data/golden/.
ARKCORE_EVALS_DAILY_MAX_COST_USD caps candidate plus judge spend before publish.
ARKCORE_EVALS_DAILY_MAX_DURATION_SECONDS caps full-run duration before publish.
ARKCORE_EVALS_DAILY_BASELINE_SCORECARD points to the main scorecard used for provider-drift comparison.
ARKCORE_EVALS_DAILY_COMPARISON_JSON_OUT and ARKCORE_EVALS_DAILY_COMPARISON_MARKDOWN_OUT store the daily drift comparison artifacts.
ARKCORE_EVALS_DAILY_PUBLISH=true requires EVALS_DATABASE_URL and should be used only after the budget and comparison gates pass.

Daily runs compare production prompt outputs against the baseline scorecard before publish. Regression-blocked runs are published as failed, while non-regressing runs with judge disagreements are routed to review.
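The publish routing described above can be summarized as a small decision function. This Python sketch is illustrative only: "passed" and "failed" appear in this document, while the "needs_review" label and the function signature are hypothetical.

```python
def daily_run_status(regression_blocked: bool, judge_disagreements: int) -> str:
    """Pick the publish status for a daily drift run."""
    if regression_blocked:
        return "failed"
    if judge_disagreements > 0:
        return "needs_review"  # hypothetical label for the review route
    return "passed"


print(daily_run_status(True, 3))   # failed
print(daily_run_status(False, 2))  # needs_review
print(daily_run_status(False, 0))  # passed
```

A regression takes precedence over disagreements here, so a blocked run is never downgraded to review.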

Dataset Rules

Golden datasets must be JSONL, version-pinned, PII-scrubbed, and human-reviewed. Each row needs input, expected output or retrieval labels, source attribution, difficulty, review metadata, and rubric instructions for faithfulness, helpfulness, completeness, and safety. Retrieval rows need non-empty expected.relevant_ids; structured prompt rows need expected.output, non-empty expected.required_fields, and expected.json_schema. Curated production rows must use source_attribution.kind: traffic_sample. The sidecar manifest records dataset version, source sample window, sampling method, PII scrubbing method, reviewer list, approval metadata, dataset hash, total examples, per-surface counts, and per-difficulty counts.
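A few of the row rules above lend themselves to a short Python sketch. It covers only the expected-output and source-attribution rules, takes the row type as a hypothetical is_retrieval argument because this document does not say how the harness distinguishes row types, and is not a replacement for validate-dataset.

```python
def golden_row_errors(row: dict, is_retrieval: bool) -> list[str]:
    """Check the expected-output and attribution rules for one golden row."""
    errors = []
    expected = row.get("expected") or {}
    attribution = row.get("source_attribution") or {}
    # Rule for curated production rows, as stated in the dataset rules.
    if attribution.get("kind") != "traffic_sample":
        errors.append("source_attribution.kind must be traffic_sample")
    if is_retrieval:
        if not expected.get("relevant_ids"):
            errors.append("expected.relevant_ids must be non-empty")
    else:
        if "output" not in expected:
            errors.append("expected.output is required")
        if not expected.get("required_fields"):
            errors.append("expected.required_fields must be non-empty")
        if "json_schema" not in expected:
            errors.append("expected.json_schema is required")
    return errors


retrieval_row = {
    "expected": {"relevant_ids": ["doc-1"]},
    "source_attribution": {"kind": "traffic_sample"},
}
print(golden_row_errors(retrieval_row, is_retrieval=True))  # []
print(golden_row_errors({"expected": {"output": {}}}, is_retrieval=False))
```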

Scorecard Contents

Scorecards include retrieval metrics, structured-output metrics, latency, run duration, total and per-call token/cost metrics, two-judge rubric scores, Cohen's kappa, and any judge disagreements that require human review. Release judge files must include both Claude Opus and Gemini Pro scores for every evaluated example.
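For readers unfamiliar with Cohen's kappa, the standard two-rater formula is kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is agreement expected by chance. The Python sketch below applies it to categorical labels; how the harness turns rubric scores into categories is not stated here, so the pass/fail labels are an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Compute Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of labels")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical category; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


judge_one = ["pass", "pass", "fail", "fail"]  # hypothetical binned judge labels
judge_two = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(judge_one, judge_two))  # 0.5
```

In the example, the judges agree on 3 of 4 rows (p_o = 0.75) and chance agreement is 0.5, which gives a kappa of 0.5.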

Live judge runs require pricing environment variables for Anthropic and Gemini input/output dollars per 1M tokens. The budget gate fails if scorecards omit cost or duration totals, and uses total_cost_usd, which includes candidate and judge spend. run_duration_seconds covers candidate latency, judge latency, and local scoring time.
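The budget gate behavior described above can be sketched in Python as follows. The field names total_cost_usd and run_duration_seconds come from this document; treating them as top-level scorecard keys and the error wording are assumptions, and check-budget remains the real gate.

```python
def budget_errors(scorecard: dict, max_cost_usd: float, max_duration_seconds: float) -> list[str]:
    """Fail when cost or duration totals are missing or exceed the limits."""
    errors = []
    cost = scorecard.get("total_cost_usd")  # candidate plus judge spend
    duration = scorecard.get("run_duration_seconds")
    if cost is None:
        errors.append("scorecard omits total_cost_usd")
    elif cost > max_cost_usd:
        errors.append(f"cost {cost} exceeds {max_cost_usd}")
    if duration is None:
        errors.append("scorecard omits run_duration_seconds")
    elif duration > max_duration_seconds:
        errors.append(f"duration {duration} exceeds {max_duration_seconds}")
    return errors


print(budget_errors({"total_cost_usd": 9.5, "run_duration_seconds": 1200}, 15, 1500))  # []
print(budget_errors({"total_cost_usd": 18.0}, 15, 1500))
```

The limits in the usage lines mirror the --max-cost-usd 15 and --max-duration-seconds 1500 values used in the walkthrough.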

Admin Review

publish-scorecard stores run summaries, per-surface scores, prompt candidates, and judge disagreements for /admin/evals. Admins can mark disagreements as in_review, resolved, or dismissed without changing scorecard history. Prompt promotion remains separate from disagreement triage and only writes a new prompt_registry row after the selected run, surface, and evaluated prompt candidate pass validation.
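The promotion precondition can be pictured as a three-part guard: the run passed, the selected surface passed, and the prompt text was among the evaluated candidates. In this Python sketch only prompt_candidates is a name taken from this document; the status, surfaces, passed, and text keys are hypothetical stand-ins for the stored scorecard shape.

```python
def can_promote(scorecard: dict, surface: str, prompt_text: str) -> bool:
    """Allow promotion only for evaluated prompt text on a passing run and surface."""
    run_passed = scorecard.get("status") == "passed"
    surface_passed = scorecard.get("surfaces", {}).get(surface, {}).get("passed") is True
    evaluated = any(
        candidate.get("text") == prompt_text
        for candidate in scorecard.get("prompt_candidates", [])
    )
    return run_passed and surface_passed and evaluated


card = {
    "status": "passed",
    "surfaces": {"opportunity_briefings": {"passed": True}},
    "prompt_candidates": [{"text": "Summarize the opportunity for the reader."}],
}
print(can_promote(card, "opportunity_briefings", "Summarize the opportunity for the reader."))  # True
print(can_promote(card, "opportunity_briefings", "Unevaluated prompt text"))  # False
```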