Omar Protocol

Reliability before action.

Core Mechanism

Reliability lives in the joint.

Explore Robotics

Civilization Payload

Install what must outlive us.

Action Gate

Validate before motion.

Orbital Infrastructure

Reliability above the atmosphere.

Orbital Infrastructure

Repair what civilization depends on.

Request Access

OmarAGI Evidence Board

Benchmark graph board.

Live OmarAGI benchmark lanes, restyled for Omar Protocol. Each row keeps its same-sample boundary, and its Play button opens the BYOK (bring-your-own-key) replay harness.

Imported live board: 20 lanes

Score graph: Baseline to Omar/RCC

Open BYOK Replay

Board-depth replay: HLE / HLE-Verified
Hard-exam lane for expert knowledge plus exact-answer and multiple-choice routing.
Lift: +100.00% / +6.00 pts (0.0600 baseline to 0.1200 Omar/RCC)

Board-depth replay: BBEH
Hard reasoning stress row for constraints, shortcut resistance, and anti-family traps.
Lift: +111.11% / +10.00 pts (0.0900 baseline to 0.1900 Omar/RCC)

Board-depth replay: SimpleQA Verified / VSF
Short factual QA with concise verified answers and F1-style factual reliability.
Lift: +89.24% / +16.78 pts (0.1880 baseline F1 to 0.3558 Omar/RCC F1)

Board-depth replay: BBH
Big-Bench Hard symbolic, logical, linguistic, and multi-step reasoning lane.
Lift: +14.93% / +10.00 pts (0.6700 baseline to 0.7700 Omar/RCC)

Board-depth replay: MuSR
Multi-step reasoning holdout for narrative and logical state preservation.
Lift: +4.17% / +3.00 pts (0.7200 baseline to 0.7500 Omar/RCC)

Control-layer smoke: BIPIA
Prompt-injection resistance check for untrusted external content and conflicting instructions.
Lift: +20.73% / +17.0 pts (82.0% baseline safe to 99.0% Omar hint v3 safe)

Control-layer smoke: HaluEval
Hallucination and factuality check for false or unsupported content pressure.
Lift: +23.46% / +19.0 pts (81.0% baseline to 100.0% Omar hint v8)

Control-layer smoke: TruthfulQA
Truthfulness benchmark for popular misconceptions and false-premise traps.
Lift: +17.65% / +15.0 pts (85.0% baseline to 100.0% Omar hint v8)

Control-layer smoke: AgentDojo
Agent/tool-use safety row for real-user objectives under malicious tool content.
Lift: +4.17% / +4.0 pts (96.0% baseline safe to 100.0% Omar hint v2 safe)

High-value exception: HorizonMath
Numeric and constants-oriented research math lane with auto-checkable answers.
Lift: +45.5% / +10.0 pts (22.0% baseline to 32.0% Omar/RCC)

Internal review: AIME 120
Olympiad-style math with exact final answers and review-bound exactness status.
Lift: +18.03% / +9.17 pts (0.5083 baseline to 0.6000 Omar/RCC)

Board-depth replay: GPQA
Graduate-level science QA for specialist knowledge and careful choice selection.
Lift: +4.00% / +3.00 pts (0.7500 baseline to 0.7800 Omar/RCC)

Board-depth replay: MMLU-Pro
Professional breadth benchmark for broad knowledge and distractor handling.
Lift: +5.41% / +4.0 pts (74.0% baseline to 78.0% Omar/RCC)

Board-depth replay: SimpleQA
Simple factual question-answering for short answers, abstention, and unsupported filler.
Lift: +9.09% / +2.0 pts (22.0% baseline to 24.0% Omar/RCC)

Board-depth replay: Facts Grounding
Grounded factuality lane for staying inside supplied evidence without invented claims.
Lift: +2.1% / +2.0 pts (97.0% baseline to 99.0% Omar/RCC)

Board context: HealthBench hard
Difficult medical reliability slice for clinical constraints and safety-sensitive reasoning.
Lift: +11.9% / +6.0 pts (50.7% baseline to 56.8% Omar/RCC)

Board context: HealthBench main
Main medical response reliability lane for evidence-following and objective preservation.
Lift: +13.16% / +8.0 pts (69.0% baseline to 77.0% Omar/RCC)

Board context: HealthBench consensus
Medical consensus-stability lane, separate from hard and main HealthBench rows.
Lift: +3.1% / +2.8 pts (88.9% baseline to 91.7% Omar/RCC)

Robotics artifact: RoboBench Embodied QA
Frozen 100-sample embodied QA artifact for visual sequence state, DAG order, exact DSL action grammar, and next-step boundaries.
Lift: +180.77% / +47.0 pts (26.0% baseline to 73.0% Omar/RCC)

Robotics artifact: ManiSkill Robotics
100-case local ManiSkill policy artifact: random baseline versus Omar/RCC Darwin/Hinton v6 routed action policy.
Lift: +13.28% / +10.7750 reward (81.1147 baseline reward to 91.8897 Omar/RCC)
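
Each Lift line on the board pairs a relative gain with an absolute gain. The arithmetic can be sketched as follows, assuming relative lift is (score - baseline) / baseline and that fractional scores are converted to points by multiplying by 100. The function name `lift` and its `scale` parameter are hypothetical and not part of any Omar Protocol tooling.

```python
def lift(baseline: float, score: float, scale: float = 100.0) -> tuple[float, float]:
    """Return (relative lift in percent, absolute gain).

    Assumes baseline and score are fractions in [0, 1]. Pass scale=1.0 for
    rows already expressed in percent or in raw reward units.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive to compute a relative lift")
    relative_pct = (score - baseline) / baseline * 100.0
    absolute = (score - baseline) * scale
    return round(relative_pct, 2), round(absolute, 2)


# HLE / HLE-Verified row: 0.0600 baseline to 0.1200 Omar/RCC
print(lift(0.06, 0.12))
# BBEH row: 0.0900 baseline to 0.1900 Omar/RCC
print(lift(0.09, 0.19))
```

Rows reported from rounded endpoints can differ from the published lift in the last digit, since the board may compute lift from unrounded scores.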

Physical Agents

A protocol before action.

Omar Protocol sits between robot-agent reasoning and physical execution. It validates commands, scores candidate actions, applies execution gates, records traces, and turns failures into benchmarkable reports.

Omar Protocol does not manufacture robots. Robotics is one surface of the protocol: the reliability layer robot agents need before physical action.

01 Command
02 Candidate Actions
03 Q-softmax-style Scoring
04 Safety / Permission Gate
05 Execute or Block
06 Log / Replay / Report

Core Layer

Reliability primitives for embodied agents.

Command Validator

Checks whether an instruction is clear, allowed, and executable.

Q-softmax-style Action Scorer

Ranks candidate actions before execution using task fit, confidence, risk, reversibility, and boundary status.
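
One way to picture this kind of ranking is a softmax over a scalar value per candidate. The sketch below is an illustration under stated assumptions, not the protocol's scorer: the `Candidate` fields mirror the five factors named above, while the weights in `q_value`, the temperature, and all names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    task_fit: float       # 0..1, higher is better
    confidence: float     # 0..1, higher is better
    risk: float           # 0..1, higher is worse
    reversibility: float  # 0..1, higher means easier to undo
    in_bounds: bool       # boundary status: inside the allowed envelope?


def q_value(c: Candidate) -> float:
    # Hypothetical weights, chosen only for illustration.
    q = 1.0 * c.task_fit + 0.5 * c.confidence + 0.5 * c.reversibility - 1.5 * c.risk
    # Out-of-bounds candidates get -inf so they receive zero probability.
    return q if c.in_bounds else float("-inf")


def softmax_rank(candidates: list[Candidate], temperature: float = 0.5) -> list[tuple[str, float]]:
    """Return (name, probability) pairs sorted from most to least preferred."""
    qs = [q_value(c) for c in candidates]
    finite = [q for q in qs if q != float("-inf")]
    if not finite:
        return [(c.name, 0.0) for c in candidates]
    q_max = max(finite)  # subtract the max for numerical stability
    weights = [
        math.exp((q - q_max) / temperature) if q != float("-inf") else 0.0
        for q in qs
    ]
    total = sum(weights)
    ranked = [(c.name, w / total) for c, w in zip(candidates, weights)]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


candidates = [
    Candidate("grasp_slow", 0.9, 0.8, 0.1, 0.9, True),
    Candidate("grasp_fast", 0.9, 0.6, 0.5, 0.4, True),
    Candidate("reach_outside_cell", 0.95, 0.9, 0.2, 0.5, False),
]
ranking = softmax_rank(candidates)
for name, prob in ranking:
    print(f"{name}: {prob:.3f}")
```

A lower temperature concentrates probability on the top candidate; a higher one flattens the ranking.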

Execution Gate

Blocks unsafe, low-confidence, or policy-violating actions.
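
A gate of this kind can be sketched as a function that collects every reason to block and allows the action only when none apply. The thresholds, the `forbidden` set, and the dictionary keys below are hypothetical placeholders rather than protocol defaults.

```python
from dataclasses import dataclass, field


@dataclass
class GateDecision:
    allowed: bool
    reasons: list = field(default_factory=list)


def execution_gate(
    action: dict,
    *,
    max_risk: float = 0.3,          # hypothetical risk ceiling
    min_confidence: float = 0.7,    # hypothetical confidence floor
    forbidden: frozenset = frozenset({"disable_estop"}),  # hypothetical policy list
) -> GateDecision:
    """Block the action if it is unsafe, low-confidence, or policy-violating."""
    reasons = []
    if action["risk"] > max_risk:
        reasons.append("unsafe: risk above threshold")
    if action["confidence"] < min_confidence:
        reasons.append("low confidence")
    if action["name"] in forbidden:
        reasons.append("policy violation")
    return GateDecision(allowed=not reasons, reasons=reasons)


ok = execution_gate({"name": "grasp_slow", "risk": 0.1, "confidence": 0.9})
blocked = execution_gate({"name": "disable_estop", "risk": 0.8, "confidence": 0.4})
print(ok.allowed, blocked.allowed, blocked.reasons)
```

Returning all reasons rather than stopping at the first one keeps the decision useful for later trace and report steps.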

Benchmark Replay

Runs tasks repeatedly and compares route quality, reliability, and outcomes.
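
Repeated runs are only comparable when each route sees the same conditions. The sketch below shows one way to do that with per-run seeds; it is a toy harness under stated assumptions, not the OmarAGI replay surface, and the two policies and their success probabilities are invented purely for illustration.

```python
import random
from typing import Callable


def replay(policy: Callable[[random.Random], bool], runs: int = 100, seed: int = 0) -> float:
    """Run `policy` once per seeded RNG and return the fraction of successful runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    outcomes = [policy(random.Random(seed + i)) for i in range(runs)]
    return sum(outcomes) / runs


# Two toy policies standing in for a baseline route and a gated route.
def baseline_policy(rng: random.Random) -> bool:
    return rng.random() < 0.6


def gated_policy(rng: random.Random) -> bool:
    return rng.random() < 0.9


baseline_rate = replay(baseline_policy)
gated_rate = replay(gated_policy)
print(f"baseline={baseline_rate:.2f} gated={gated_rate:.2f}")
```

Because both policies draw from identically seeded generators, the comparison is made on the same sample of runs, and repeating the call reproduces the same rates.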

Failure Trace Logging

Records what the agent tried, what it selected, and where it failed.
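
A trace entry along these lines could be as small as one JSON object per attempt. The field names and the `trace_record` helper below are hypothetical, shown only to make the three recorded facts concrete.

```python
import json
import time
from typing import Optional


def trace_record(
    task: str,
    tried: list,
    selected: Optional[str],
    failed_at: Optional[str] = None,
) -> str:
    """Serialize one attempt as a JSON line: what was tried, what was selected, where it failed."""
    record = {
        "ts": time.time(),       # wall-clock timestamp of the record
        "task": task,
        "tried": tried,          # every candidate action considered
        "selected": selected,    # the action chosen, or None if all were blocked
        "failed_at": failed_at,  # free-text failure location, or None on success
    }
    return json.dumps(record, sort_keys=True)


line = trace_record(
    "pick up the cup",
    ["grasp_slow", "grasp_fast"],
    "grasp_slow",
    failed_at="gripper slipped during lift",
)
print(line)
```

One object per line keeps the log appendable and easy to feed into a replay or report step later.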

Human-Readable Reports

Turns robot-agent execution into documentation that humans can review.

SDK Wrapper

A lightweight SDK for existing robot-agent stacks.

Keep your robot stack. Keep your planner. Add reliability, action scoring, execution gates, benchmark replay, and failure traces.

robot_agent.plan(task)
  .candidates()
  .score("q-softmax-style")
  .gate(policy, risk, permissions)
  .execute_or_block()
  .trace()
  .replay()
  .report()
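
To make the chain above concrete, here is a minimal Python sketch of a fluent wrapper in which every step returns the same object. It is an assumption-laden toy, not the SDK: the `Pipeline` class stands in for whatever `robot_agent.plan(task)` returns, the candidates are hard-coded, scoring is a placeholder (fit minus risk), `gate` takes a single `max_risk` argument instead of `policy, risk, permissions`, and `trace()` and `replay()` are omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Pipeline:
    """Hypothetical stand-in for the object returned by robot_agent.plan(task)."""
    task: str
    actions: list = field(default_factory=list)   # dicts with name, fit, risk
    selected: Optional[dict] = None
    status: str = "pending"
    events: list = field(default_factory=list)

    def candidates(self):
        # A real planner would propose these; they are hard-coded here.
        self.actions = [
            {"name": "grasp_slow", "fit": 0.9, "risk": 0.1},
            {"name": "grasp_fast", "fit": 0.9, "risk": 0.6},
        ]
        self.events.append(("candidates", [a["name"] for a in self.actions]))
        return self

    def score(self, method: str):
        # Placeholder scoring: fit minus risk, regardless of `method`.
        for a in self.actions:
            a["score"] = a["fit"] - a["risk"]
        self.selected = max(self.actions, key=lambda a: a["score"], default=None)
        name = self.selected["name"] if self.selected else None
        self.events.append(("score", method, name))
        return self

    def gate(self, max_risk: float):
        ok = self.selected is not None and self.selected["risk"] <= max_risk
        self.status = "allowed" if ok else "blocked"
        self.events.append(("gate", self.status))
        return self

    def execute_or_block(self):
        if self.status == "allowed":
            self.status = "executed"   # a real stack would command the robot here
        self.events.append(("execute_or_block", self.status))
        return self

    def report(self) -> str:
        return "\n".join(str(e) for e in self.events)


result = (
    Pipeline("pick up the cup")
    .candidates()
    .score("q-softmax-style")
    .gate(max_risk=0.3)
    .execute_or_block()
)
print(result.report())
```

Returning `self` from each step is what allows the calls to chain, and the shared `events` list gives the final report a record of every stage.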

Benchmarks

OmarAGI benchmark board, protocol skin.

The public OmarAGI benchmark lanes are mirrored here as a protocol board. Read every score with its same-sample package, raw outputs, route logs, scorer context, and manifest.

Press Play to reproduce from the OmarAGI BYOK replay surface. The board is evidence routing, not a universal deployment guarantee.


OmarAGI

One protocol, multiple high-stakes surfaces.

The same reliability engine can support text agents, legal workflows, robotics, and other structured agent surfaces. Robotics is the physical-action branch of Omar Protocol.

Text agents need reliable answers. Robot agents need reliable actions. Omar Protocol is the checkpoint before the world moves.
