Omar Protocol / Robotics
Omar Protocol
Reliability before action.
Civilization Payload
Install what must outlive us.
Action Gate
Validate before motion.
Orbital Infrastructure
OmarAGI Evidence Board
Live OmarAGI benchmark lanes, restyled for Omar Protocol. Each row keeps the same-sample boundary and routes Play into the BYOK (bring-your-own-key) replay harness.
Hard-exam lane for expert knowledge plus exact-answer and multiple-choice routing.
Hard reasoning stress row for constraints, shortcut resistance, and anti-family traps.
Short factual QA with concise verified answers and F1-style factual reliability.
Big-Bench Hard symbolic, logical, linguistic, and multi-step reasoning lane.
Multi-step reasoning holdout for narrative and logical state preservation.
Prompt-injection resistance check for untrusted external content and conflicting instructions.
Hallucination and factuality check for false or unsupported content pressure.
Truthfulness benchmark for popular misconceptions and false-premise traps.
Agent/tool-use safety row for real-user objectives under malicious tool content.
Numeric and constants-oriented research math lane with auto-checkable answers.
Olympiad-style math with exact final answers and review-bound exactness status.
Graduate-level science QA for specialist knowledge and careful choice selection.
Professional breadth benchmark for broad knowledge and distractor handling.
Simple factual question answering that checks short answers, abstention, and avoidance of unsupported filler.
Grounded factuality lane for staying inside supplied evidence without invented claims.
Difficult medical reliability slice for clinical constraints and safety-sensitive reasoning.
Main medical response reliability lane for evidence-following and objective preservation.
Medical consensus-stability lane, separate from hard and main HealthBench rows.
Frozen embodied QA smoke test for DSL grammar, next-step boundaries, and conservative progress checks.
Local robotics policy smoke test over 100 asset-light cases; this is not an official ManiSkill leaderboard claim.
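
The board does not specify what "F1-style" scoring means for the short factual QA row. As a minimal sketch, assuming it resembles the token-overlap F1 commonly used for short-answer QA, the scoring could look like the following. The names `normalize` and `token_f1` are hypothetical and are not part of OmarAGI or its replay harness.

```python
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer.

    Assumption: this mirrors common short-answer QA scoring, not
    necessarily the metric the board actually uses.
    """
    pred = normalize(prediction)
    ref = normalize(reference)
    # If either side is empty, score 1.0 only when both are empty.
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("Paris", "paris."))
    print(token_f1("the city of Paris", "Paris"))
```

A metric of this shape rewards concise answers: padding a correct answer with extra words lowers precision and therefore the score, which fits the row's emphasis on concise verified answers.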