ImageBreak
A research framework that systematically stress-tests AI content moderation across OpenAI, Gemini, and HuggingFace — generating adversarial prompts, measuring bypass rates, and exporting structured safety reports.
Why AI safety needs systematic red-teaming
Content moderation systems across AI providers are routinely bypassed by adversarial inputs — but evaluating how and where those defenses fail requires a reproducible, automated pipeline. Manual testing is slow, subjective, and impossible to scale.
ImageBreak provides the complete pipeline: generate adversarial prompts, apply evasion transformations, test against image generation APIs, score outputs with BLIP-2, and produce structured reports — all in a single CLI command.
The framework also handles the reality of content policy blocking gracefully: automatic sanitization, progressive retry strategies, and transparent logging of whether sanitized or original prompts succeeded.
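One way such a fallback could look, as a minimal sketch (illustrative only — generate, sanitize_prompt, and the result fields here are hypothetical stand-ins, not ImageBreak's actual internals):

```python
def generate_with_fallback(generate, sanitize_prompt, prompt, max_attempts=3):
    """Try the original prompt first, then fall back to a sanitized version,
    logging which variant finally succeeded. A return of None from generate()
    models a content-policy block."""
    for candidate, label in [(prompt, "original"), (sanitize_prompt(prompt), "sanitized")]:
        for attempt in range(1, max_attempts + 1):
            result = generate(candidate)
            if result is not None:
                return {"image": result, "prompt_used": label, "attempts": attempt}
    return {"image": None, "prompt_used": None, "attempts": 2 * max_attempts}
```

The key property is transparency: the caller always learns whether the original or the sanitized prompt produced the output, which is exactly what the framework's per-prompt logs record.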
Research Paper
CMSC396H Final Paper — Arnav Dadarya & Anushk Pokharna
University of Maryland
Interfaces
Scope
Research only — designed exclusively for AI safety teams, red-teamers, and academic researchers working to make AI systems more robust.
Four-stage evaluation pipeline
generate-prompts — Generate Boundary Prompts
LLM-driven generation of boundary-testing prompts targeting configurable content categories (violence, misinformation, etc.) against your policy document.
Accepts a content policy .txt file and produces structured JSON. Supports custom system instructions via env var BOUNDARY_PROMPT_SYSTEM_INSTRUCTION.
alter-prompts — Alter for Filter Evasion
Takes the generated prompts and applies filter-evasion transformations — paraphrasing, encoding shifts, stylistic reframes — to maximize bypass probability.
Works with OpenAI or Gemini as the alteration model. Custom instructions via PROMPT_ALTERATION_SYSTEM_INSTRUCTION.
test-images — Run Image Generation Tests
Submits altered prompts to image generation endpoints (OpenAI DALL-E, HuggingFace diffusion models) and runs cyclic quality assessment with BLIP-2.
--use-cyclic enables quality-based retry logic. --max-attempts and --quality-threshold are fully configurable. Saves images optionally.
full-test — Export Safety Reports
Compiles results into JSON, CSV, and HTML reports with bypass rates, attempt counts, quality scores, and cross-provider analytics.
Success rate, average quality score (0.0–1.0), per-prompt attempt logs, and whether sanitized or original prompt was used.
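As a sketch of how the headline metrics above could be derived from per-prompt results (the record field names are hypothetical, not the framework's actual report schema):

```python
def summarize(results):
    """Aggregate per-prompt records into report-level metrics.
    Each record: {"success": bool, "attempts": int, "quality": float}."""
    successes = [r for r in results if r["success"]]
    return {
        "success_rate": len(successes) / len(results),
        "total_attempts": sum(r["attempts"] for r in results),
        # Average quality only over successful generations
        "avg_quality": (sum(r["quality"] for r in successes) / len(successes))
        if successes else 0.0,
    }
```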
Full pipeline — one command
imagebreak full-test \
  --policies content_policy.txt \
  --num-prompts 5 \
  --image-model openai \
  --text-model openai \
  --use-cyclic \
  --quality-threshold 0.7
Example output · Successful: 4/5 (80.0%) · Total attempts: 12 · Avg quality: 0.78
Supported models & providers
OpenAI GPT
OPENAI_API_KEY — Text generation, prompt alteration, policy analysis
Google Gemini
GOOGLE_API_KEY — Alternative text model for generation and alteration
HuggingFace (BLIP-2)
HUGGINGFACE_TOKEN — Image quality assessment; scores outputs 0.0–1.0
AWS Rekognition
AWS_ACCESS_KEY_ID — Optional independent content moderation analysis
Custom Model Integration
Abstract base classes make adding a custom model straightforward — subclass BaseModel and implement generate_text() and generate_image().
from imagebreak.models.base import BaseModel

class CustomModel(BaseModel):
    def generate_text(self, prompt: str, **kwargs):
        # your implementation
        pass

    def generate_image(self, prompt: str, **kwargs):
        # your implementation
        pass

framework.add_model("my-model", CustomModel(api_key="..."))

Environment variables
ENABLE_CYCLIC_REGENERATION — Enable quality-based retry loop
MAX_RETRY_ATTEMPTS — Max retries per prompt before marking blocked
QUALITY_THRESHOLD — Minimum BLIP-2 score to count as a successful generation
DEFAULT_HF_MODEL — HuggingFace vision model for quality scoring
USE_AWS_MODERATION — Enable AWS Rekognition cross-validation
BOUNDARY_PROMPT_SYSTEM_INSTRUCTION — Override system prompt for boundary prompt generation
PROMPT_ALTERATION_SYSTEM_INSTRUCTION — Override system prompt for filter-evasion alteration
IMAGE_ANALYSIS_SYSTEM_INSTRUCTION — Override system prompt for image quality analysis
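A minimal sketch of how the retry and threshold knobs interact in a cyclic-regeneration loop (illustrative, with stub generate/score callables — not the framework's actual implementation):

```python
def cyclic_generate(generate, score, max_retry_attempts=5, quality_threshold=0.7):
    """Regenerate until the quality score clears the threshold or retries
    run out, keeping the best attempt seen so far."""
    best = None
    for attempt in range(1, max_retry_attempts + 1):
        image = generate()
        quality = score(image)  # BLIP-2-style score in [0.0, 1.0]
        if best is None or quality > best["quality"]:
            best = {"image": image, "quality": quality, "attempts": attempt}
        if quality >= quality_threshold:
            return {**best, "passed": True}
    return {**best, "passed": False}
```

With ENABLE_CYCLIC_REGENERATION=false the loop degenerates to a single attempt; raising QUALITY_THRESHOLD trades more retries (and API calls) for higher-fidelity outputs.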
Minimal .env
OPENAI_API_KEY=sk-...
ENABLE_CYCLIC_REGENERATION=true
QUALITY_THRESHOLD=0.7
Programmatic usage
Basic usage
from imagebreak import ImageBreakFramework, Config
from imagebreak.models import OpenAIModel

config = Config()
framework = ImageBreakFramework(config)
framework.add_model("openai", OpenAIModel(
    api_key=config.openai_api_key,
    config=config
))

# Generate prompts from your policy
with open("policy.txt") as f:
    policies = f.read()

test_prompts = framework.generate_boundary_prompts(
    policies=policies,
    num_prompts=10,
    topics=["violence", "misinformation"]
)

# Run with cyclic quality assessment
results = framework.test_image_generation_cyclic(
    prompt_data_list=test_prompts,
    save_images=True
)

Advanced config
from imagebreak import Config

config = Config(
    max_retries=3,
    timeout=30,
    batch_size=10,
    output_dir="./results",
    enable_logging=True,
    log_level="INFO",
    enable_cyclic_regeneration=True,
    max_retry_attempts=5,
    quality_threshold=0.8,
    use_aws_moderation=False
)

# Custom HuggingFace vision model
from imagebreak.models import HuggingFaceImageAnalyzer

analyzer = HuggingFaceImageAnalyzer(
    model_name="Salesforce/blip2-flan-t5-xl",
    device="cuda"  # or "cpu"
)

From PyPI
pip install imagebreak==1.0.1
From source
git clone https://github.com/ardada2468/ImageBreak
cd ImageBreak
pip install -e .
Note
HuggingFace image analysis is optional but recommended for full BLIP-2 quality scoring. Install torch, torchvision, transformers, and accelerate separately, or disable cyclic generation with ENABLE_CYCLIC_REGENERATION=false.
Research-only · MIT License
Built for AI safety researchers and red-teamers
ImageBreak is explicitly scoped to responsible disclosure, academic research, and red-team exercises. Explore the codebase, read the research paper, or install via pip and run your first test in minutes.