Omar Protocol

Reliability before action.

Core Mechanism

Reliability lives in the joint.

Civilization Payload

Install what must outlive us.

Action Gate

Validate before motion.

Orbital Infrastructure

Reliability above the atmosphere.

Orbital Infrastructure

Repair what civilization depends on.

OmarAGI Evidence Board

Benchmark graph board.

These are the live OmarAGI benchmark lanes, restyled for Omar Protocol. Each row keeps the same-sample boundary, and its Play action routes into the BYOK replay harness.

Imported live board: 20 lanes
Score graph: baseline to Omar/RCC
  1. Board-depth replay: HLE / HLE-Verified

    Hard-exam lane for expert knowledge plus exact-answer and multiple-choice routing.

    Lift: +100.00% / +6.00 pts (0.0600 baseline to 0.1200 Omar/RCC)
  2. Board-depth replay: BBEH

    Hard reasoning stress row for constraints, shortcut resistance, and anti-family traps.

    Lift: +111.11% / +10.00 pts (0.0900 baseline to 0.1900 Omar/RCC)
  3. Board-depth replay: SimpleQA Verified / VSF

    Short factual QA with concise verified answers and F1-style factual reliability.

    Lift: +89.24% / +16.78 pts (0.1880 baseline F1 to 0.3558 Omar/RCC F1)
  4. Board-depth replay: BBH

    Big-Bench Hard symbolic, logical, linguistic, and multi-step reasoning lane.

    Lift: +14.93% / +10.00 pts (0.6700 baseline to 0.7700 Omar/RCC)
  5. Board-depth replay: MuSR

    Multi-step reasoning holdout for narrative and logical state preservation.

    Lift: +4.17% / +3.00 pts (0.7200 baseline to 0.7500 Omar/RCC)
  6. Control-layer smoke: BIPIA

    Prompt-injection resistance check for untrusted external content and conflicting instructions.

    Lift: +20.73% / +17.0 pts (82.0% baseline safe to 99.0% Omar hint v3 safe)
  7. Control-layer smoke: HaluEval

    Hallucination and factuality check for false or unsupported content pressure.

    Lift: +23.46% / +19.0 pts (81.0% baseline to 100.0% Omar hint v8)
  8. Control-layer smoke: TruthfulQA

    Truthfulness benchmark for popular misconceptions and false-premise traps.

    Lift: +17.65% / +15.0 pts (85.0% baseline to 100.0% Omar hint v8)
  9. Control-layer smoke: AgentDojo

    Agent/tool-use safety row for real-user objectives under malicious tool content.

    Lift: +4.17% / +4.0 pts (96.0% baseline safe to 100.0% Omar hint v2 safe)
  10. High-value exception: HorizonMath

    Numeric and constants-oriented research math lane with auto-checkable answers.

    Lift: +45.5% / +10.0 pts (22.0% baseline to 32.0% Omar/RCC)
  11. Internal review: AIME 120

    Olympiad-style math with exact final answers and review-bound exactness status.

    Lift: +18.03% / +9.17 pts (0.5083 baseline to 0.6000 Omar/RCC)
  12. Board-depth replay: GPQA

    Graduate-level science QA for specialist knowledge and careful choice selection.

    Lift: +4.00% / +3.00 pts (0.7500 baseline to 0.7800 Omar/RCC)
  13. Board-depth replay: MMLU-Pro

    Professional breadth benchmark for broad knowledge and distractor handling.

    Lift: +5.41% / +4.0 pts (74.0% baseline to 78.0% Omar/RCC)
  14. Board-depth replay: SimpleQA

    Simple factual question answering that checks short answers, abstention, and avoidance of unsupported filler.

    Lift: +9.09% / +2.0 pts (22.0% baseline to 24.0% Omar/RCC)
  15. Board-depth replay: Facts Grounding

    Grounded factuality lane for staying inside supplied evidence without invented claims.

    Lift: +2.1% / +2.0 pts (97.0% baseline to 99.0% Omar/RCC)
  16. Board context: HealthBench hard

    Difficult medical reliability slice for clinical constraints and safety-sensitive reasoning.

    Lift: +11.9% / +6.0 pts (50.7% baseline to 56.8% Omar/RCC)
  17. Board context: HealthBench main

    Main medical response reliability lane for evidence-following and objective preservation.

    Lift: +11.59% / +8.0 pts (69.0% baseline to 77.0% Omar/RCC)
  18. Board context: HealthBench consensus

    Medical consensus-stability lane, separate from hard and main HealthBench rows.

    Lift: +3.1% / +2.8 pts (88.9% baseline to 91.7% Omar/RCC)
  19. Robotics smoke: RoboBench Embodied QA

    Frozen embodied QA smoke for DSL grammar, next-step boundaries, and conservative progress checks.

    Lift: +93.33% / +14.0 pts (15.0% baseline to 29.0% Omar/RCC)
  20. Robotics smoke: ManiSkill Robotics

    Local robotics policy smoke over 100 asset-light cases; not an official ManiSkill leaderboard claim.

    Lift: +13.38% / +11.4313 reward (85.4592 baseline reward to 96.8905 Omar/RCC)
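
Each row reports two numbers for the same pair of scores: a relative lift in percent and an absolute gain. The sketch below shows one way to reproduce both, assuming relative lift means (Omar/RCC score minus baseline) divided by baseline, which is consistent with the figures in the rows. The function name `lift` and the `pts_per_unit` factor are illustrative choices, not part of the board or the replay harness.

```python
def lift(baseline: float, treated: float) -> tuple[float, float]:
    """Return (relative lift in percent, absolute gain) for one board row.

    Assumption: relative lift = (treated - baseline) / baseline.
    """
    if baseline <= 0:
        raise ValueError("relative lift is undefined for a non-positive baseline")
    gain = treated - baseline
    return 100.0 * gain / baseline, gain


# Scores copied from three rows of the board above.
# pts_per_unit converts the raw gain to points (hypothetical helper value).
rows = [
    ("BBH", 0.6700, 0.7700, 100.0),    # fraction scale: 1 pt = 0.01
    ("BIPIA", 82.0, 99.0, 1.0),        # already expressed in percent
    ("AgentDojo", 96.0, 100.0, 1.0),   # already expressed in percent
]

for name, baseline, treated, pts_per_unit in rows:
    relative_pct, gain = lift(baseline, treated)
    print(f"{name}: {relative_pct:+.2f}% / {gain * pts_per_unit:+.2f} pts")
```

Run as written, the BBH line prints +14.93% / +10.00 pts, matching that row. Rows on a low baseline (HLE, BBEH) show large relative lifts from modest point gains, so the two numbers are best read together.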