ArkCore Evals
Offline regression evaluation workflow for ResearchArk LLM prompts and retrieval quality gates.
ArkCore Evals is the offline quality gate for ResearchArk prompt and retrieval changes. It scores candidate outputs against version-pinned golden datasets, compares the result with the main-branch scorecard, and blocks regressions before prompt or embedding changes ship.
When To Run It
Run an evaluation before merging changes that affect:
- arkcore-llm prompts, model routing, or structured-output behavior
- arkcore-vector ranking or indexed document shape
- arkcore-text-processor embedding or reranking behavior
- ArkSearch retrieval, partner recommendations, opportunity briefings, consortium scoring, expert matching, or funding-opportunity AI remarks
Walkthrough
- Build the reviewed release input from an offline sampled-event handoff. If the row-level handoff is blocked, refresh scorecards/release-source-counts.json with the operator helper, then run source-counts-check, release-handoff-report, source-collection-check, source-sampling-plan, build_release_source_exports.py, source-exports-check, build-sampled-event-label-packet, build-draft-sampled-event-labels, sampled-event-labels-check, and build-sampled-events. Use the focused source-collection report to archive the remaining missing persisted-row actions before attempting sampled-event validation, and use the source-sampling plan to archive read-only SQL templates for persisted segments with available rows.
Use the handoff report's sampled-event sampling plan to see the target rows, sampleable rows, selected source segments, source allocation across shared sources, source shortfalls per surface, source-collection actions for missing persisted rows, and unused-source guidance. The handoff report can be regenerated from scorecards/release-source-counts.json, which must contain aggregate counts only, match the dataset version, avoid PII-bearing source names, and include every mapped source key.
The refresh helper uses live read-only count(*) queries from separate user-data and funding DSNs; it is not eval runtime code and does not export rows. source-sampling-plan also does not connect to a database or export row-level traffic; it exits non-zero while persisted-row shortfalls remain and omits SQL for blocked surfaces unless --allow-partial is used for available-row SQL templates.
scripts/build_release_source_exports.py executes only audited single-SELECT templates against explicit read-only DSNs, PII-scans rows before writing the ignored packet, and leaves strict validation to source-exports-check. source-exports-check validates the offline row packet created from those SQL templates, matches rows to the sampling plan, rejects partial source-sampling plans unless --allow-partial is explicit, and rejects PII or raw source keys before sampled-event assembly.
Source-export rows wrap source columns as traffic_id, surface, source, pii_scrubbed: true, and source_payload; schemas/source-export.schema.json documents the wrapper shape, while source-exports-check enforces plan-specific output columns and row counts. If only available persisted sources have been exported, --allow-partial on source-sampling-plan, build_release_source_exports.py, source-exports-check, build-sampled-event-label-packet, build-draft-sampled-event-labels, sampled-event-labels-check, build-sampled-events, and sampled-events-check validates, labels, and assembles those rows early while preserving partial: true and the blocked-surface list so the reports cannot satisfy release readiness; when partial work is possible, release-handoff-report lists those optional commands separately from the strict release-facing command list.
build-sampled-event-label-packet renders an ignored Markdown worksheet from valid exports so operators can author labels without the harness inventing expected outputs. build-draft-sampled-event-labels writes a draft label seed from validated exports, warns where reviewers must replace retrieval relevance ids or expand opportunity briefing outputs, and still does not create human-review evidence. sampled-event-labels-check verifies one label per export row, no extra labels, valid retrieval or structured expected-output shape for the matched surface, and the same PII/raw-key guardrails before assembly. build-sampled-events joins that validated export with operator-authored sampled-event-labels.jsonl, copies source and surface from the export rather than the label file, requires one label per export row, rejects extra labels, and never adds human-review metadata.
Endpoint-specific LLM usage counts can confirm volume, latency, or cost attribution, but they are not sample-ready golden rows unless paired with persisted, PII-scrubbed surface outputs and human review.
Each sampled row must carry one of those non-PII source keys so the reviewer packet stays traceable to the handoff plan; --handoff-report treats the plan as a target contract, so stale dataset-version plans are rejected, ready entries cannot lower sampled rows below the plan target or requested surface threshold, shared source segments cannot allocate more rows than their available count, and non-ready plan rows repeat the source-collection shortfall at validation time. The sampled-event report carries the requested dataset version for reviewer traceability. Start with sampled-traffic-events.jsonl, which must come from validated exports plus labels and must not contain raw production database rows, runtime logs, user identifiers, raw queries, or human-review claims.
ArkSphere expert, consortium, and partner surfaces use role-expanded source rows derived from persisted required_roles JSON and sphere-opportunity links, while endpoint-specific LLM usage remains telemetry only.
uv run python scripts/refresh_release_source_counts.py \
--user-data-dsn "$ARKCORE_EVALS_USER_DATA_DSN" \
--funding-dsn "$ARKCORE_EVALS_FUNDING_DSN" \
--dataset-version v0.1.0 \
--out scorecards/release-source-counts.json
uv run arkcore-evals source-counts-check \
--input scorecards/release-source-counts.json \
--dataset-version v0.1.0 \
--json-out scorecards/release-source-counts-check.json \
--markdown-out scorecards/release-source-counts-check.md
uv run arkcore-evals release-handoff-report \
--root . \
--dataset-version v0.1.0 \
--source-counts-file scorecards/release-source-counts.json \
--target-total 500 \
--json-out scorecards/release-handoff-readiness.json \
--markdown-out scorecards/release-handoff-readiness.md
uv run arkcore-evals source-collection-check \
--handoff-report scorecards/release-handoff-readiness.json \
--dataset-version v0.1.0 \
--json-out scorecards/source-collection.json \
--markdown-out scorecards/source-collection.md
uv run arkcore-evals source-sampling-plan \
--handoff-report scorecards/release-handoff-readiness.json \
--dataset-version v0.1.0 \
--json-out scorecards/source-sampling-plan.json \
--markdown-out scorecards/source-sampling-plan.md
uv run python scripts/build_release_source_exports.py \
--source-sampling-plan scorecards/source-sampling-plan.json \
--user-data-dsn "$ARKCORE_EVALS_USER_DATA_DSN" \
--funding-dsn "$ARKCORE_EVALS_FUNDING_DSN" \
--dataset-version v0.1.0 \
--out source-exports.jsonl \
--json-out scorecards/source-exports-build.json \
--markdown-out scorecards/source-exports-build.md
uv run arkcore-evals source-exports-check \
--input source-exports.jsonl \
--source-sampling-plan scorecards/source-sampling-plan.json \
--dataset-version v0.1.0 \
--json-out scorecards/source-exports.json \
--markdown-out scorecards/source-exports.md
uv run arkcore-evals build-sampled-event-label-packet \
--source-exports source-exports.jsonl \
--source-sampling-plan scorecards/source-sampling-plan.json \
--dataset-version v0.1.0 \
--json-out scorecards/sampled-event-label-packet.json \
--markdown-out scorecards/sampled-event-label-packet.md
uv run arkcore-evals build-draft-sampled-event-labels \
--source-exports source-exports.jsonl \
--source-sampling-plan scorecards/source-sampling-plan.json \
--out sampled-event-labels.jsonl \
--dataset-version v0.1.0 \
--json-out scorecards/sampled-event-labels-draft.json \
--markdown-out scorecards/sampled-event-labels-draft.md
uv run arkcore-evals sampled-event-labels-check \
--labels sampled-event-labels.jsonl \
--source-exports source-exports.jsonl \
--source-sampling-plan scorecards/source-sampling-plan.json \
--dataset-version v0.1.0 \
--json-out scorecards/sampled-event-labels.json \
--markdown-out scorecards/sampled-event-labels.md
uv run arkcore-evals build-sampled-events \
--source-exports source-exports.jsonl \
--labels sampled-event-labels.jsonl \
--out sampled-traffic-events.jsonl \
--source-sampling-plan scorecards/source-sampling-plan.json \
--handoff-report scorecards/release-handoff-readiness.json \
--dataset-version v0.1.0 \
--json-out scorecards/sampled-events-build.json \
--markdown-out scorecards/sampled-events-build.md
uv run arkcore-evals sampled-events-check \
--input sampled-traffic-events.jsonl \
--handoff-report scorecards/release-handoff-readiness.json \
--dataset-version v0.1.0 \
--json-out scorecards/sampled-events.json \
--markdown-out scorecards/sampled-events.md
uv run arkcore-evals build-review-candidates \
--input sampled-traffic-events.jsonl \
--out review-candidates.jsonl \
--dataset-version v0.1.0 \
--handoff-report scorecards/release-handoff-readiness.json \
--json-out scorecards/review-candidates-build.json \
--markdown-out scorecards/review-candidates.md
uv run arkcore-evals review-candidates-check \
--input review-candidates.jsonl \
--dataset-version v0.1.0 \
--json-out scorecards/review-candidates.json \
--markdown-out scorecards/review-candidates.md \
--review-packet-out scorecards/review-candidate-packet.md
The review-candidate packet is for a human reviewer. It does not approve rows. Create a controlled worksheet if useful:
uv run arkcore-evals build-review-decision-template \
--candidates review-candidates.jsonl \
--out review-decisions.template.jsonl \
--dataset-version v0.1.0 \
--json-out scorecards/review-decision-template.json \
--markdown-out scorecards/review-decision-template.md
Template rows start with approved: false and blank reviewer fields, so they do not validate as review evidence. Capture completed reviewer approvals in review-decisions.jsonl; reviewer identifiers must not be AI, automation, fixture, or test markers. Then build and validate the reviewed sample file:
uv run arkcore-evals review-decisions-check \
--candidates review-candidates.jsonl \
--decisions review-decisions.jsonl \
--dataset-version v0.1.0 \
--json-out scorecards/review-decisions.json \
--markdown-out scorecards/review-decisions.md
uv run arkcore-evals build-reviewed-samples \
--candidates review-candidates.jsonl \
--decisions review-decisions.jsonl \
--out reviewed-traffic-samples.jsonl \
--dataset-version v0.1.0 \
--json-out scorecards/reviewed-samples-build.json \
--markdown-out scorecards/reviewed-samples.md
uv run arkcore-evals reviewed-samples-check \
--input reviewed-traffic-samples.jsonl \
--dataset-version v0.1.0 \
--json-out scorecards/reviewed-samples.json \
--markdown-out scorecards/reviewed-samples.md
- Curate a versioned golden dataset under data/golden/<dataset-version>/ from the offline JSONL export of sampled, PII-scrubbed, human-reviewed traffic rows. Curation validates every reviewed input row before selection, so clean selected rows cannot mask an invalid export.
uv run arkcore-evals curate-dataset \
--input reviewed-traffic-samples.jsonl \
--out-dir data/golden/v0.1.0 \
--dataset-version v0.1.0 \
--traffic-from 2026-05-01 \
--traffic-to 2026-05-18 \
--sampling-method "stratified production traffic sample" \
--pii-scrubbing-method "automated PII scan plus human review" \
--reviewer alpha912 \
--approved-by alpha912 \
--approved-at 2026-05-18
- Select the golden dataset version for the surface under test. Scorecard runs, live judge runs, daily drift, and the reusable CI workflow reject datasets outside data/golden/; candidate scoring and live judging also require reviewed traffic-sample provenance before scoring or model calls.
- For a new production baseline, validate the raw reviewed input, baseline prediction, and judge files. The preflight rejects non-JSONL paths and obvious live database or log locations before parsing. Standalone baseline checks also report unsafe or missing baseline artifact paths before trusting reviewed rows, so malformed reviewed samples do not hide bad artifact locations. The repo checklist in docs/release-input-format.md and the schemas/*.schema.json files define the exact JSONL shape for reviewed rows, baseline predictions, Claude Opus/Sonnet/Haiku 4.x plus Gemini candidate model coverage, per-surface and per-model prediction counts, Claude Opus plus Gemini Pro judge scores, the release_requirements threshold summary, the required_files handoff summary, and the missing_files report for absent handoff files:
uv run arkcore-evals release-inputs-check \
--reviewed-input reviewed-traffic-samples.jsonl \
--baseline-predictions baseline-predictions.jsonl \
--judge-scores baseline-judges.jsonl \
--dataset-version v0.1.0 \
--json-out scorecards/release-inputs.json
- Build the canonical release bundle from offline reviewed inputs:
uv run arkcore-evals build-release-bundle \
--reviewed-input reviewed-traffic-samples.jsonl \
--baseline-predictions baseline-predictions.jsonl \
--judge-scores baseline-judges.jsonl \
--root . \
--dataset-version v0.1.0 \
--traffic-from 2026-05-01 \
--traffic-to 2026-05-18 \
--sampling-method "stratified production traffic sample" \
--pii-scrubbing-method "automated PII scan plus human review" \
--reviewer alpha912 \
--approved-by alpha912 \
--approved-at 2026-05-18
The command reruns the raw input validator, stages curation, filters baseline predictions and two-judge scores to the selected golden examples, generates scorecards/main.json, runs release-check, and writes the canonical dataset, manifest, scorecards/main.json, and scorecards/main-judges.jsonl only after the staged bundle is release-ready. Add --dry-run --json-out scorecards/release-dry-run.json to validate those release inputs without writing canonical artifacts.
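The filtering step that build-release-bundle performs can be pictured with a minimal sketch. This is not the harness's code: the helper name `filter_to_golden` and the `example_id` join field are assumptions made for illustration, and the real row shape is defined by the schemas/*.schema.json files.

```python
def filter_to_golden(golden_rows, artifact_rows, id_field="example_id"):
    """Keep only artifact rows whose id was selected into the golden dataset.

    `id_field` is a hypothetical join key; the real schema may name it differently.
    """
    selected_ids = {row[id_field] for row in golden_rows}
    return [row for row in artifact_rows if row[id_field] in selected_ids]


# Two examples were selected by curation; the prediction file covers three.
golden = [{"example_id": "ex-1"}, {"example_id": "ex-3"}]
predictions = [
    {"example_id": "ex-1", "output": "a"},
    {"example_id": "ex-2", "output": "b"},
    {"example_id": "ex-3", "output": "c"},
]
kept = filter_to_golden(golden, predictions)
kept_ids = [row["example_id"] for row in kept]
```

Filtering by the selected ids keeps the scorecard and judge file aligned with exactly the rows that curation chose, which is what lets later checks compare them row for row.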
- Generate candidate predictions from that dataset only. Do not query production data during the evaluation run.
- Run dataset validation:
uv run arkcore-evals validate-dataset \
data/golden/v0.1.0/golden.jsonl \
--manifest data/golden/v0.1.0/manifest.json
- Check release readiness when adopting a production dataset or main scorecard. The golden manifest must include reviewed-input provenance: export SHA-256, input row count, selection seed, difficulty counts, and the release minimums used by curation. The baseline artifacts must be scorecards/main.json and scorecards/main-judges.jsonl; the judge file must use the exact release Claude Opus and Gemini Pro ids, and the scorecard run id must match the candidate hash, dataset hash, dataset version, seed, and full-surface scope so the release can be reproduced.
uv run arkcore-evals release-check \
--dataset data/golden/v0.1.0/golden.jsonl \
--manifest data/golden/v0.1.0/manifest.json \
--baseline-scorecard scorecards/main.json \
--judge-scores scorecards/main-judges.jsonl
Service deployments can run the same readiness gate through POST /v1/release-check. The endpoint accepts only golden dataset and manifest paths under data/golden/, with baseline scorecards and judge-score JSONL under the configured scorecard directory; missing in-scope artifacts return a readiness report instead of a transport error.
Before calling the harness objective complete, run uv run arkcore-evals completion-audit. It maps the source-export and sampled-event label handoff contracts, raw release handoff files, canonical release files, metric implementation, FastAPI route and API-test evidence, reusable and caller CI workflows, caller exporter golden-path guards, prompt registry, admin promotion path, observability, docs, and coverage gate to concrete evidence and fails until scorecards/release-inputs.json, the real handoff files, and release artifacts are present. It also checks that the manifest curation hash and row count match the reviewed traffic handoff, that the golden rows were selected from that file, that the main scorecard candidate hash and accounting totals match the supplied baseline predictions, and that the canonical judge file matches the supplied judge handoff filtered to the golden dataset. The required scorecards/release-inputs.json, scorecards/source-sampling-plan.json, scorecards/source-exports.json, scorecards/sampled-event-label-packet.json, scorecards/sampled-event-labels.json, and scorecards/sampled-events-build.json reports must be valid, non-partial, empty of errors, match the audit dataset version, and meet release row thresholds. If scorecards/release-handoff-readiness.json exists, the audit includes any source-collection shortfalls from that report. The JSON output includes a checklist that maps each gate to expected artifacts, commands, evidence, and missing items.
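The provenance part of the audit (manifest hash and row count matching the reviewed traffic handoff) might look roughly like the sketch below. The manifest keys `export_sha256` and `input_row_count` and both helper names are assumptions for illustration; the actual manifest layout may differ.

```python
import hashlib


def export_fingerprint(jsonl_bytes: bytes) -> dict:
    """Return the SHA-256 digest and non-empty row count of a JSONL payload."""
    rows = [line for line in jsonl_bytes.splitlines() if line.strip()]
    return {
        "sha256": hashlib.sha256(jsonl_bytes).hexdigest(),
        "row_count": len(rows),
    }


def provenance_mismatches(manifest: dict, jsonl_bytes: bytes) -> list:
    """List which recorded provenance values disagree with the handoff file."""
    actual = export_fingerprint(jsonl_bytes)
    problems = []
    # Hypothetical manifest keys; the real names live in the manifest schema.
    if manifest.get("export_sha256") != actual["sha256"]:
        problems.append("export_sha256 mismatch")
    if manifest.get("input_row_count") != actual["row_count"]:
        problems.append("input_row_count mismatch")
    return problems


reviewed = b'{"id": 1}\n{"id": 2}\n'
fingerprint = export_fingerprint(reviewed)
manifest = {
    "export_sha256": fingerprint["sha256"],
    "input_row_count": fingerprint["row_count"],
}
clean = provenance_mismatches(manifest, reviewed)
tampered = provenance_mismatches(manifest, reviewed + b'{"id": 3}\n')
```

Hashing the raw bytes rather than parsed rows means that even a whitespace-only change to the handoff file shows up as a mismatch.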
- Run the two-judge rubric pass:
uv run arkcore-evals judge \
--dataset data/golden/v0.1.0/golden.jsonl \
--predictions scorecards/candidate-predictions.jsonl \
--out scorecards/judges.jsonl
Service deployments can generate the same cache-backed judge JSONL through POST /v1/judge-scores; the endpoint enforces the same data/golden/ dataset boundary as the CLI, writes artifacts only under the configured scorecard directory, and keeps judge cache writes inside the configured judge-cache directory.
For a service-owned subset, repeat --surface on the judge, run, and compare commands so the same rows are used through the whole gate:
uv run arkcore-evals judge \
--dataset data/golden/v0.1.0/golden.jsonl \
--predictions scorecards/candidate-predictions.jsonl \
--out scorecards/judges.jsonl \
--surface opportunity_briefings \
--surface funding_opportunity_ai_remarks
- Run the candidate scorecard:
uv run arkcore-evals run \
--dataset data/golden/v0.1.0/golden.jsonl \
--predictions scorecards/candidate-predictions.jsonl \
--candidate-id pr-123 \
--judge-scores scorecards/judges.jsonl \
--out scorecards/pr-123.json \
--surface opportunity_briefings \
--surface funding_opportunity_ai_remarks
- Compare with the main-branch scorecard:
uv run arkcore-evals compare \
--base scorecards/main.json \
--candidate scorecards/pr-123.json \
--markdown-out scorecards/eval-comment.md \
--surface opportunity_briefings \
--surface funding_opportunity_ai_remarks
- Publish accepted scorecards for admin review:
uv run arkcore-evals publish-scorecard \
--scorecard scorecards/pr-123.json \
--database-url "$EVALS_DATABASE_URL" \
--status passed
- Check the run envelope:
uv run arkcore-evals check-budget \
--scorecard scorecards/pr-123.json \
--max-cost-usd 15 \
--max-duration-seconds 1500
- Review blocked regressions. The default gate fails when the deterministic score drops by more than 0.05 or the judge score drops by more than 0.5 relative to the main-branch scorecard.
- Promote accepted prompts through /admin/evals, which writes a versioned prompt_registry row only when the run and selected surface passed gates. Promotion is blocked unless the selected prompt text is present in the scorecard prompt_candidates list, so the admin can only promote text that was evaluated. The completion audit checks these gate tests directly. arkcore-llm reads promoted prompts at startup and falls back to prompt files if the database is unavailable.
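The default regression gate from the review step can be sketched as a pair of threshold comparisons. The scorecard keys `deterministic_score` and `judge_score` and the function name are assumptions for illustration; only the 0.05 and 0.5 thresholds come from the workflow above.

```python
def regression_failures(base, candidate, max_det_drop=0.05, max_judge_drop=0.5):
    """Return the names of scores that dropped by more than the allowed margin."""
    failures = []
    # A positive drop means the candidate scored lower than the baseline.
    det_drop = base["deterministic_score"] - candidate["deterministic_score"]
    judge_drop = base["judge_score"] - candidate["judge_score"]
    if det_drop > max_det_drop:
        failures.append("deterministic_score")
    if judge_drop > max_judge_drop:
        failures.append("judge_score")
    return failures


main = {"deterministic_score": 0.80, "judge_score": 4.0}
candidate = {"deterministic_score": 0.70, "judge_score": 3.8}
blocked_on = regression_failures(main, candidate)
```

In this example the deterministic score falls by 0.10, which exceeds the 0.05 margin, while the judge score falls by only 0.2, so the gate blocks on the deterministic score alone.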
CI Adoption
The reusable arkcore-evals GitHub workflow runs in the caller repository, installs the harness, enforces data/golden/ dataset scope, runs release-check, generates or reads candidate predictions, runs live judges or supplied judge scores, checks budget, compares scorecards, comments on the PR, and uploads artifacts. Callers must provide the golden manifest and main judge-score file, and their path filters must cover the runtime prompt, LLM client, vector, and text-processing files that feed eval predictions.
Caller workflows should pass surfaces as a space-separated list matching the repo ownership. For example, arkcore-llm owns prompt surfaces such as opportunity_briefings, while arkcore-vector and arkcore-text-processor own retrieval and matching surfaces such as arksearch_retrieval, expert_to_cluster_matching, and partner_search_recommendations.
Set publish_scorecard: "true" only for trusted scheduled runs with the EVALS_DATABASE_URL secret. PR runs should produce artifacts and comments, not publish to the admin review tables.
Daily Drift Worker
Run daily provider-drift checks through the dedicated eval queue:
celery -A arkcore_evals.worker worker -Q evals --loglevel=INFO
celery -A arkcore_evals.worker beat --loglevel=INFO
The worker reads golden datasets only by default. Prediction exporters must generate candidate JSONL from those inputs and must not query production data during the eval run.
Use these guards for scheduled runs:
- ARKCORE_EVALS_DAILY_REQUIRE_GOLDEN_DATASET=true rejects daily datasets outside data/golden/.
- ARKCORE_EVALS_DAILY_MAX_COST_USD caps candidate plus judge spend before publish.
- ARKCORE_EVALS_DAILY_MAX_DURATION_SECONDS caps full-run duration before publish.
- ARKCORE_EVALS_DAILY_BASELINE_SCORECARD points to the main scorecard used for provider-drift comparison.
- ARKCORE_EVALS_DAILY_COMPARISON_JSON_OUT and ARKCORE_EVALS_DAILY_COMPARISON_MARKDOWN_OUT store the daily drift comparison artifacts.
- ARKCORE_EVALS_DAILY_PUBLISH=true requires EVALS_DATABASE_URL and should be used only after the budget and comparison gates pass.
Daily runs compare production prompt outputs against the baseline scorecard before publish. Regression-blocked runs are published as failed, while non-regressing runs with judge disagreements are routed to review.
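The publish routing described above could be expressed as a small decision function. This is a sketch, not the worker's code: the function name is hypothetical, and while "failed" and "passed" appear in this document, the "review" status string is an assumed label for runs routed to review.

```python
def daily_publish_status(regression_blocked: bool, judge_disagreements: int) -> str:
    """Choose the status a daily drift run would be published with."""
    if regression_blocked:
        # Regressions take priority over any judge disagreements.
        return "failed"
    if judge_disagreements > 0:
        # Assumed label for non-regressing runs that still need human triage.
        return "review"
    return "passed"


statuses = [
    daily_publish_status(True, 3),
    daily_publish_status(False, 2),
    daily_publish_status(False, 0),
]
```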
Dataset Rules
Golden datasets must be JSONL, version-pinned, PII-scrubbed, and human reviewed. Each row needs input, expected output or retrieval labels, source attribution, difficulty, review metadata, and rubric instructions for faithfulness, helpfulness, completeness, and safety. Retrieval rows need non-empty expected.relevant_ids; structured prompt rows need expected.output, non-empty expected.required_fields, and expected.json_schema. Curated production rows must use source_attribution.kind: traffic_sample. The sidecar manifest records dataset version, source sample window, sampling method, PII scrubbing method, reviewer list, approval metadata, dataset hash, total examples, per-surface counts, and per-difficulty counts.
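A few of these row rules can be illustrated with a minimal checker. It is a sketch only: the `is_retrieval` flag stands in for however the harness distinguishes retrieval rows from structured prompt rows, and the function name is hypothetical. The field paths under `expected` and `source_attribution` follow the rules stated above.

```python
def golden_row_errors(row: dict, is_retrieval: bool) -> list:
    """Check the expected-output and attribution rules for one curated row."""
    errors = []
    expected = row.get("expected", {})
    if is_retrieval:
        if not expected.get("relevant_ids"):
            errors.append("expected.relevant_ids must be non-empty")
    else:
        if "output" not in expected:
            errors.append("expected.output is required")
        if not expected.get("required_fields"):
            errors.append("expected.required_fields must be non-empty")
        if "json_schema" not in expected:
            errors.append("expected.json_schema is required")
    # Applies to curated production rows, per the rule above.
    if row.get("source_attribution", {}).get("kind") != "traffic_sample":
        errors.append("source_attribution.kind must be traffic_sample")
    return errors


retrieval_row = {
    "expected": {"relevant_ids": ["doc-7"]},
    "source_attribution": {"kind": "traffic_sample"},
}
structured_row = {
    "expected": {"output": {"title": "x"}, "required_fields": []},
    "source_attribution": {"kind": "traffic_sample"},
}
retrieval_errors = golden_row_errors(retrieval_row, is_retrieval=True)
structured_errors = golden_row_errors(structured_row, is_retrieval=False)
```

The structured row fails twice here: its required_fields list is empty and it carries no json_schema.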
Scorecard Contents
Scorecards include retrieval metrics, structured-output metrics, latency, run duration, total and per-call token/cost metrics, two-judge rubric scores, Cohen's kappa, and any judge disagreements that require human review. Release judge files must include both Claude Opus and Gemini Pro scores for every evaluated example.
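For readers unfamiliar with Cohen's kappa, the sketch below computes it for two judges over categorical labels. How the harness buckets rubric scores into categories is not specified here, so the pass/fail labels are an assumption made only for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both judges must label the same non-empty example set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both raters pick the same category independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical category throughout.
    return (observed - expected) / (1 - expected)


# Hypothetical bucketed verdicts from the two judges on four examples.
judge_one = ["pass", "pass", "fail", "fail"]
judge_two = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(judge_one, judge_two)
```

Here the judges agree on three of four examples (0.75 observed) against 0.5 expected by chance, giving a kappa of 0.5.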
Live judge runs require pricing environment variables for Anthropic and Gemini input/output dollars per 1M tokens. The budget gate fails if scorecards omit cost or duration totals, and uses total_cost_usd, which includes candidate and judge spend. run_duration_seconds covers candidate latency, judge latency, and local scoring time.
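The budget gate's behavior can be sketched as follows. The field names total_cost_usd and run_duration_seconds come from the description above; the flat scorecard layout, the function name, and the failure messages are assumptions for illustration.

```python
def budget_failures(scorecard: dict, max_cost_usd: float, max_duration_seconds: float) -> list:
    """Fail on missing totals as well as on totals that exceed the limits."""
    failures = []
    cost = scorecard.get("total_cost_usd")  # candidate plus judge spend
    duration = scorecard.get("run_duration_seconds")
    if cost is None:
        failures.append("total_cost_usd missing")
    elif cost > max_cost_usd:
        failures.append("total_cost_usd over budget")
    if duration is None:
        failures.append("run_duration_seconds missing")
    elif duration > max_duration_seconds:
        failures.append("run_duration_seconds over budget")
    return failures


# Limits mirror the check-budget example: 15 USD and 1500 seconds.
slow_run = {"total_cost_usd": 12.4, "run_duration_seconds": 1620}
slow_run_failures = budget_failures(slow_run, 15, 1500)
incomplete_failures = budget_failures({"run_duration_seconds": 100}, 15, 1500)
```

Treating a missing total as a failure, rather than as zero, keeps a scorecard that never recorded its spend from passing the gate by accident.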
Admin Review
publish-scorecard stores run summaries, per-surface scores, prompt candidates, and judge disagreements for /admin/evals. Admins can mark disagreements as in_review, resolved, or dismissed without changing scorecard history. Prompt promotion remains separate from disagreement triage and only writes a new prompt_registry row after the selected run, surface, and evaluated prompt candidate pass validation.