ImageBreak
A research framework that systematically stress-tests AI content moderation across OpenAI, Gemini, and HuggingFace — generating adversarial prompts, measuring bypass rates, and exporting structured safety reports.
Why AI safety needs systematic red-teaming
Content moderation systems across AI providers are routinely bypassed by adversarial inputs — but evaluating how and where those defenses fail requires a reproducible, automated pipeline. Manual testing is slow, subjective, and impossible to scale.
ImageBreak provides the complete pipeline: generate adversarial prompts, apply evasion transformations, test against image generation APIs, score outputs with BLIP-2, and produce structured reports — all in a single CLI command.
The framework also handles the reality of content policy blocking gracefully: automatic sanitization, progressive retry strategies, and transparent logging of whether sanitized or original prompts succeeded.
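One way such a fallback could look, as a minimal sketch (illustrative only — generate, sanitize_prompt, and the result fields here are hypothetical stand-ins, not ImageBreak's actual internals):

```python
def generate_with_fallback(generate, sanitize_prompt, prompt, max_attempts=3):
    """Try the original prompt first, then fall back to a sanitized version,
    logging which variant finally succeeded. A return of None from generate()
    models a content-policy block."""
    for candidate, label in [(prompt, "original"), (sanitize_prompt(prompt), "sanitized")]:
        for attempt in range(1, max_attempts + 1):
            result = generate(candidate)
            if result is not None:
                return {"image": result, "prompt_used": label, "attempts": attempt}
    return {"image": None, "prompt_used": None, "attempts": 2 * max_attempts}
```

The key property is transparency: the caller always learns whether the original or the sanitized prompt produced the output, which is exactly what the framework's per-prompt logs record.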
Research Paper
CMSC396H Final Paper — Arnav Dadarya & Anushk Pokharna
University of Maryland
Interfaces
Scope
Research only — designed exclusively for AI safety teams, red-teamers, and academic researchers working to make AI systems more robust.
Four-stage evaluation pipeline
generate-prompts — Generate Boundary Prompts
LLM-driven generation of boundary-testing prompts targeting configurable content categories (violence, misinformation, etc.) against your policy document.
Accepts a content policy .txt file and produces structured JSON. Supports custom system instructions via env var BOUNDARY_PROMPT_SYSTEM_INSTRUCTION.
alter-prompts — Alter for Filter Evasion
Takes the generated prompts and applies filter-evasion transformations — paraphrasing, encoding shifts, stylistic reframes — to maximize bypass probability.
Works with OpenAI or Gemini as the alteration model. Custom instructions via PROMPT_ALTERATION_SYSTEM_INSTRUCTION.
test-images — Run Image Generation Tests
Submits altered prompts to image generation endpoints (OpenAI DALL-E, HuggingFace diffusion models) and runs cyclic quality assessment with BLIP-2.
--use-cyclic enables quality-based retry logic. --max-attempts and --quality-threshold are fully configurable. Saves images optionally.
full-test — Export Safety Reports
Compiles results into JSON, CSV, and HTML reports with bypass rates, attempt counts, quality scores, and cross-provider analytics.
Success rate, average quality score (0.0–1.0), per-prompt attempt logs, and whether sanitized or original prompt was used.
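As a sketch of how the headline metrics above could be derived from per-prompt results (the record field names are hypothetical, not the framework's actual report schema):

```python
def summarize(results):
    """Aggregate per-prompt records into report-level metrics.
    Each record: {"success": bool, "attempts": int, "quality": float}."""
    successes = [r for r in results if r["success"]]
    return {
        "success_rate": len(successes) / len(results),
        "total_attempts": sum(r["attempts"] for r in results),
        # Average quality only over successful generations
        "avg_quality": (sum(r["quality"] for r in successes) / len(successes))
        if successes else 0.0,
    }
```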
Full pipeline — one command
imagebreak full-test \
  --policies content_policy.txt \
  --num-prompts 5 \
  --image-model openai \
  --text-model openai \
  --use-cyclic \
  --quality-threshold 0.7
Example output · Successful: 4/5 (80.0%) · Total attempts: 12 · Avg quality: 0.78
Supported models & providers
OpenAI GPT
OPENAI_API_KEY — Text generation, prompt alteration, policy analysis
Google Gemini
GOOGLE_API_KEY — Alternative text model for generation and alteration
HuggingFace (BLIP-2)
HUGGINGFACE_TOKEN — Image quality assessment; scores outputs 0.0–1.0
AWS Rekognition
AWS_ACCESS_KEY_ID — Optional independent content moderation analysis
Custom Model Integration
Abstract base classes make adding a custom model straightforward — subclass BaseModel and implement generate_text() and generate_image().
from imagebreak.models.base import BaseModel

class CustomModel(BaseModel):
    def generate_text(self, prompt: str, **kwargs):
        # your implementation
        pass

    def generate_image(self, prompt: str, **kwargs):
        # your implementation
        pass

framework.add_model("my-model", CustomModel(api_key="..."))

Environment variables
ENABLE_CYCLIC_REGENERATION — Enable quality-based retry loop
MAX_RETRY_ATTEMPTS — Max retries per prompt before marking blocked
QUALITY_THRESHOLD — Minimum BLIP-2 score to count as a successful generation
DEFAULT_HF_MODEL — HuggingFace vision model for quality scoring
USE_AWS_MODERATION — Enable AWS Rekognition cross-validation
BOUNDARY_PROMPT_SYSTEM_INSTRUCTION — Override system prompt for boundary prompt generation
PROMPT_ALTERATION_SYSTEM_INSTRUCTION — Override system prompt for filter-evasion alteration
IMAGE_ANALYSIS_SYSTEM_INSTRUCTION — Override system prompt for image quality analysis
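A minimal sketch of how the retry and threshold knobs interact in a cyclic-regeneration loop (illustrative, with stub generate/score callables — not the framework's actual implementation):

```python
def cyclic_generate(generate, score, max_retry_attempts=5, quality_threshold=0.7):
    """Regenerate until the quality score clears the threshold or retries
    run out, keeping the best attempt seen so far."""
    best = None
    for attempt in range(1, max_retry_attempts + 1):
        image = generate()
        quality = score(image)  # BLIP-2-style score in [0.0, 1.0]
        if best is None or quality > best["quality"]:
            best = {"image": image, "quality": quality, "attempts": attempt}
        if quality >= quality_threshold:
            return {**best, "passed": True}
    return {**best, "passed": False}
```

With ENABLE_CYCLIC_REGENERATION=false the loop degenerates to a single attempt; raising QUALITY_THRESHOLD trades more retries (and API calls) for higher-fidelity outputs.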
Minimal .env
OPENAI_API_KEY=sk-...
ENABLE_CYCLIC_REGENERATION=true
QUALITY_THRESHOLD=0.7
Programmatic usage
Basic usage
from imagebreak import ImageBreakFramework, Config
from imagebreak.models import OpenAIModel

config = Config()
framework = ImageBreakFramework(config)
framework.add_model("openai", OpenAIModel(
    api_key=config.openai_api_key,
    config=config
))

# Generate prompts from your policy
with open("policy.txt") as f:
    policies = f.read()

test_prompts = framework.generate_boundary_prompts(
    policies=policies,
    num_prompts=10,
    topics=["violence", "misinformation"]
)

# Run with cyclic quality assessment
results = framework.test_image_generation_cyclic(
    prompt_data_list=test_prompts,
    save_images=True
)

Advanced config
from imagebreak import Config

config = Config(
    max_retries=3,
    timeout=30,
    batch_size=10,
    output_dir="./results",
    enable_logging=True,
    log_level="INFO",
    enable_cyclic_regeneration=True,
    max_retry_attempts=5,
    quality_threshold=0.8,
    use_aws_moderation=False
)

# Custom HuggingFace vision model
from imagebreak.models import HuggingFaceImageAnalyzer

analyzer = HuggingFaceImageAnalyzer(
    model_name="Salesforce/blip2-flan-t5-xl",
    device="cuda"  # or "cpu"
)

From PyPI
pip install imagebreak==1.0.1
From source
git clone https://github.com/ardada2468/ImageBreak
cd ImageBreak
pip install -e .
Note
HuggingFace image analysis is optional but recommended for full BLIP-2 quality scoring. Install torch, torchvision, transformers, and accelerate separately, or disable cyclic generation with ENABLE_CYCLIC_REGENERATION=false.
Research-only · MIT License
Built for AI safety researchers and red-teamers
ImageBreak is explicitly scoped to responsible disclosure, academic research, and red-team exercises. Explore the codebase, read the research paper, or install via pip and run your first test in minutes.