GateBench

GateBench is a challenging benchmark for Vision Language Models (VLMs) designed to test visual reasoning and image understanding capabilities. The benchmark requires VLMs to extract boolean algebra expressions from images of logic gate circuits, using this task as a proxy for assessing detailed image understanding and complex visual reasoning.

Leaderboard

Model	Score
gemini-3-pro-preview (high)	53.1%
gpt-5.1 (high)	40.6%
qwen3-vl-235b-a22b-thinking	39.0%
qwen3-vl-235b-a22b-instruct	32.8%
gemini-2.5-flash	32.8%
qwen3-vl-30b-a3b-thinking	20.3%
qwen3-vl-30b-a3b-instruct	15.6%
claude-opus-4.5 (non-thinking)	15.6%
glm-4.5v	15.6%
llama-4-maverick	15.6%
claude-sonnet-4.5 (thinking)	14.1%
claude-opus-4.5 (thinking)	14.0%
nova-2-lite-v1 (thinking)	9.4%
llama-4-scout	7.8%
nova-2-lite-v1 (non-thinking)	6.3%
claude-haiku-4.5	6.2%
grok-4.1-fast	4.6%
claude-sonnet-4.5 (non-thinking)	3.1%
mistral-large-2512	0.0%

Overview

Unlike benchmarks that focus on single-object recognition (e.g. Name the breed of the dog in the image), GateBench is difficult because it requires the model to reason over the entire image. The model must correctly identify multiple logic gates, trace the connections between them (wires), and understand the logical flow to derive the correct boolean expression.

Key capabilities tested:

Visual Reasoning: Tracing complex connections and data flow in a diagram.
Image Understanding: Identifying specific symbols (logic gates) and their spatial relationships.
Symbolic Translation: Converting visual diagrams into structured text (boolean expressions).

Example Question

User:

Extract the boolean algebra expression from the image.

Respond with the single line boolean algebra expression in a code block. Use operators in word form, not symbols. (e.g. "and" instead of "∧")

Example:

not ((A and B) xor C)

Features

Challenging Test Suite: Includes diagrams of varying complexity (different gate counts).
Multi-model Support: Compatible with OpenAI-compatible APIs (OpenAI, OpenRouter, etc.).
Automated Evaluation: Automatically verifies the equivalence of the extracted expression against the ground truth using boolean algebra rules.
Caching: Caches model responses to avoid redundant API calls and costs.

Installation

Prerequisites

Python 3
pip

1. Clone the Repository

git clone https://github.com/johnbean393/GateBench.git
cd GateBench

2. Install Dependencies

pip install -r requirements.txt

Usage

The main entry point for the benchmark is src/run.py.

Basic Usage

Run the benchmark on a specific model using an OpenAI-compatible endpoint (defaults to OpenRouter):

python src/run.py --model "openai/gpt-4o" --api-key "your-api-key"

Command Line Options

--model: The model identifier (e.g., openai/gpt-5.1, anthropic/claude-sonnet-4.5). You can test multiple models by separating them with a semicolon (e.g., "model1;model2").
--api-key: Your API key for the endpoint.
--endpoint: The API endpoint URL (default: https://openrouter.ai/api/v1).
--open-router-api-key: Specific API key for OpenRouter if different.
--reasoning-effort: Set reasoning effort (high, medium, low) for models that support it.
--reasoning-max-tokens: Max tokens for reasoning (for models like Claude that support thinking/reasoning parameters).

Examples

Run with OpenRouter:

python src/run.py --model "anthropic/claude-sonnet-4.5" --api-key "sk-or-..."

Run multiple models:

python src/run.py --model "google/gemini-3-pro-preview;openai/gpt-5.1" --api-key "sk-..."

Project Structure

GateBench/
├── cache/              # Cached model responses
├── questions/          # Dataset files
│   ├── expressions.json # Ground truth expressions
│   └── images/         # Generated logic gate diagrams
├── results/            # Benchmark results
├── src/                # Source code
│   ├── run.py                  # Main execution script
│   ├── llm.py                  # LLM interface handler
│   ├── expression_evaluator.py # Evaluation logic
│   └── diagram_renderer.py     # Logic gate diagram generator
└── requirements.txt    # Python dependencies

Evaluation

The benchmark evaluates the model's accuracy by comparing the extracted boolean expression with the ground truth. Since equivalent boolean expressions can be written in multiple ways (e.g., De Morgan's laws), the evaluator checks for logical equivalence rather than simple string matching.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

GateBench

Leaderboard

Overview

Example Question

Features

Installation

Prerequisites

1. Clone the Repository

2. Install Dependencies

Usage

Basic Usage

Command Line Options

Examples

Project Structure

Evaluation

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
questions		questions
src		src
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

johnbean393/GateBench

Folders and files

Latest commit

History

Repository files navigation

GateBench

Leaderboard

Overview

Example Question

Features

Installation

Prerequisites

1. Clone the Repository

2. Install Dependencies

Usage

Basic Usage

Command Line Options

Examples

Project Structure

Evaluation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages