BENCHMARK COMMANDS
COMPLETE DOCUMENTATION
TECHNICAL BULLETIN NO. 005
BENCHMARK COMMANDS - PERFORMANCE ANALYSIS FIELD OPERATIONS
task-o-matic-cli-benchmark-commands-v2All PersonnelMANDATORY COMPLIANCE: Yes
⚠️ CRITICAL SURVIVAL NOTICE
Citizen, benchmark commands are your intelligence gathering tools in the post-deadline wasteland. Without proper performance analysis, you're flying blind into AI provider selection storms. These commands provide the data needed to make informed decisions about which AI models and tools will keep your projects alive.
The benchmark system has been upgraded with Git Branch Isolation capabilities. This means you can now test different AI models in isolation—each model gets its own temporary branch. No cross-contamination. No overwritten work. Pure, comparative analysis in the safety of your own bunker.
COMMAND ARCHITECTURE OVERVIEW
The benchmark command group represents the performance analysis and optimization layer of Task-O-Matic. It provides comprehensive testing capabilities across multiple AI providers, models, and workflows. The architecture supports concurrent testing, progressive model escalation, and detailed performance metrics collection.
Performance Analysis Components:
- Multi-Provider Testing: Compare performance across OpenRouter, Anthropic, OpenAI, and custom endpoints
- Model Benchmarking: Detailed performance metrics for individual AI models
- Workflow Analysis: End-to-end workflow performance across complete project lifecycles
- Execution Benchmarking: Git Branch Isolation for safe, concurrent model testing
- Loop Benchmarking: Batch task execution with progressive model escalation
- Comprehensive Metrics: Latency, throughput, cost, and quality measurements
COMPLETE BENCHMARK COMMAND DOCUMENTATION
RUN COMMAND
Command: task-o-matic benchmark run
COMMAND SIGNATURE
task-o-matic benchmark run <operation> --models <list> [options]REQUIRED ARGUMENTS
<operation> # Operation to benchmark (prd-parse, task-breakdown, task-create, prd-create, etc.)
<models> # Comma-separated list of models (required)REQUIRED OPTIONS
--models <list> # Comma-separated list of models (provider:model[:reasoning=<tokens>])GENERAL OPTIONS
--file <path> # Input file path (for PRD ops)
--task-id <id> # Task ID (for Task ops)
--concurrency <number> # Max concurrent requests (default: 5)
--delay <number> # Delay between requests in ms (default: 250)
--prompt <prompt> # Override prompt
--message <message> # User message
--tools # Enable filesystem tools
--feedback <feedback> # Feedback (for prd-rework)TASK CREATION OPTIONS
--title <title> # Task title (for task-create)
--content <content> # Task content (for task-create)
--parent-id <id> # Parent task ID (for task-create)
--effort <effort> # Effort estimate: small, medium, large (for task-create)
--force # Force operation (for task-document)PRD CREATION OPTIONS
--description <desc> # Project/PRD description (for prd-create, prd-combine)
--output-dir <dir> # Output directory (for prd-create, prd-combine)
--filename <name> # Output filename (for prd-create, prd-combine)PRD PROCESSING OPTIONS
--prds <list> # Comma-separated list of PRD file paths (for prd-combine)PRD REFINEMENT OPTIONS
--question-mode <mode> # Question mode: user or ai (for prd-refine)
--answers <json> # JSON string of answers (for prd-refine user mode)BENCHMARK RUN EXAMPLES
#### Basic Benchmarking
# Simple PRD parsing benchmark
task-o-matic benchmark run prd-parse \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4"
# Task creation benchmark
task-o-matic benchmark run task-create \
--models "openrouter:anthropic/claude-3.5-sonnet,openrouter:openai/gpt-4o" \
--title "Build radiation detector" \
--content "Implement Geiger counter integration" \
--effort medium
# PRD creation benchmark
task-o-matic benchmark run prd-create \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4,google:gemini-pro" \
--description "Emergency shelter management system"#### Advanced Benchmarking with Reasoning
# Benchmark with reasoning tokens
task-o-matic benchmark run task-breakdown \
--models "openrouter:anthropic/claude-3.5-sonnet:reasoning=2048,openrouter:openai/o1-preview:reasoning=8192" \
--task-id task-complex-123 \
--tools
# Multi-model PRD generation with reasoning
task-o-matic benchmark run prd-create \
--models "anthropic:claude-opus-2024:reasoning=4096,openrouter:anthropic/claude-3.5-sonnet:reasoning=2048" \
--description "AI-powered emergency response system"
# Task enhancement with reasoning
task-o-matic benchmark run task-enhance \
--models "openrouter:anthropic/claude-3.5-sonnet:reasoning=1024" \
--task-id task-123 \
--toolsMODEL FORMAT SPECIFICATIONS
#### Basic Model Format
provider:model#### Model Format Examples
# Anthropic models
anthropic:claude-3.5-sonnet
anthropic:claude-opus-2024
# OpenAI models
openai:gpt-4
openai:gpt-4-turbo
openai:o1-preview
# OpenRouter models
openrouter:anthropic/claude-3.5-sonnet
openrouter:openai/gpt-4o
openrouter:google/gemini-2.0-flash-exp
# Custom endpoints
custom:http://localhost:8080/v1/custom-model#### Advanced Model Format with Reasoning
provider:model:reasoning=<tokens>#### Reasoning Examples
# OpenRouter with reasoning
openrouter:anthropic/claude-3.5-sonnet:reasoning=2048
openrouter:openai/o1-preview:reasoning=8192
openrouter:google/gemini-2.0-flash-exp:reasoning=4096
# OpenAI with reasoning (if supported)
openai:o1-preview:reasoning=4096BENCHMARK RUN OUTPUT FORMAT
#### Real-time Progress Display
Benchmark Progress:
- anthropic:claude-3.5-sonnet: Starting...
- openai:gpt-4: Running... Size: 2048 B, Speed: 15.2 B/s
- anthropic:claude-3.5-sonnet: Completed (2340ms)
- openai:gpt-4: Completed (3120ms)#### Final Results Table
Model | Duration | TTFT | Tokens | TPS | BPS | Size | Cost
-------------------------------------------|----------|----------|----------|---------|---------|----------|----------
anthropic:claude-3.5-sonnet | 2340ms | 120ms | 1250 | 0.53 | 89 | 208 | $0.004250
openai:gpt-4 | 3120ms | 180ms | 980 | 0.31 | 67 | 195 | $0.039200
openrouter:anthropic/claude-3.5-sonnet | 2450ms | 135ms | 1180 | 0.48 | 85 | 202 | $0.004720PERFORMANCE METRICS
#### Measured Metrics
- Duration: Total execution time in milliseconds
- TTFT (Time to First Token): Response latency measurement
- Tokens: Total token usage (prompt + completion)
- TPS (Tokens Per Second): Processing speed measurement
- BPS (Bytes Per Second): Data transfer rate
- Response Size: Output size in bytes
- Cost: Estimated API cost in USD
#### Metric Calculations
# TPS Calculation
TPS = Total Tokens / (Duration / 1000)
# BPS Calculation
BPS = Response Size / (Duration / 1000)
# Cost Estimation
Cost = (Prompt Tokens * Prompt Rate) + (Completion Tokens * Completion Rate)ERROR CONDITIONS
# Invalid model format
Error: Invalid model format: invalid-model. Expected provider:model[:reasoning=<tokens>]
Solution: Use format "provider:model" or "provider:model:reasoning=tokens"
# Operation not found
Error: Benchmark operation 'invalid-op' not found
Solution: Use 'benchmark operations' to see available operations
# Concurrency limit exceeded
Error: Too many concurrent requests
Solution: Reduce --concurrency value or increase --delay
# Rate limiting
Error: Rate limit exceeded for provider
Solution: Increase --delay between requestsLIST COMMAND
Command: task-o-matic benchmark list
COMMAND SIGNATURE
task-o-matic benchmark listLIST COMMAND EXAMPLES
#### Basic Listing
# List all benchmark runs
task-o-matic benchmark listLIST OUTPUT FORMAT
#### Standard Output
Benchmark Runs:
- 2024-01-15_14-30-22_prd-parse (2024-01-15 2:30:22 PM) - prd-parse
- 2024-01-15_13-45-10_task-create (2024-01-15 1:45:10 PM) - task-create
- 2024-01-14_16-20-45_workflow-full (2024-01-14 4:20:45 PM) - workflow-fullBENCHMARK HISTORY ANALYSIS
#### Performance Trends
- Model Comparison: Compare performance across different runs
- Temporal Analysis: Identify performance changes over time
- Cost Tracking: Monitor API usage costs
- Success Rates: Track reliability and consistency
#### Run Filtering
# Filter by operation type
task-o-matic benchmark list | grep prd-parse
# Filter by date range
task-o-matic benchmark list | grep "2024-01-1"
# Filter by model
task-o-matic benchmark list | grep "claude-3.5-sonnet"OPERATIONS COMMAND
Command: task-o-matic benchmark operations
COMMAND SIGNATURE
task-o-matic benchmark operationsOPERATIONS COMMAND EXAMPLES
#### Basic Operations Listing
# List all available operations
task-o-matic benchmark operationsOPERATIONS OUTPUT FORMAT
#### Categorized Operations List
📊 Available Benchmark Operations
Task Operations:
task-create - Create a new task with AI enhancement
task-enhance - Enhance an existing task with AI
task-breakdown - Break task into subtasks using AI
task-document - Fetch and analyze documentation for task
PRD Operations:
prd-create - Generate PRD from product description
prd-parse - Parse PRD into structured tasks
prd-rework - Rework PRD based on feedback
prd-combine - Combine multiple PRDs into master PRD
prd-question - Generate clarifying questions for PRD
prd-refine - Refine PRD by answering questions
Workflow Operations:
workflow-full - Complete workflow execution
Total operations: 11#### Example usage:
task-o-matic benchmark run task-create --models anthropic:claude-3.5-sonnet --title "Example task"
task-o-matic benchmark run prd-parse --models openai:gpt-4 --file ./prd.mdOPERATION CATEGORIES
#### Task Operations
- Creation: AI-powered task generation and enhancement
- Enhancement: Task description improvement with Context7
- Breakdown: Intelligent task decomposition into subtasks
- Documentation: Automatic documentation research and integration
#### PRD Operations
- Generation: AI-powered PRD creation from descriptions
- Parsing: Structured task extraction from PRD documents
- Refinement: PRD improvement through feedback and Q&A
- Combination: Multiple PRD synthesis into master documents
#### Workflow Operations
- End-to-End: Complete project lifecycle benchmarking
- Integration: Cross-component performance measurement
SHOW COMMAND
Command: task-o-matic benchmark show
COMMAND SIGNATURE
task-o-matic benchmark show <id>REQUIRED ARGUMENTS
<id> # Run ID to display details for (required)SHOW COMMAND EXAMPLES
#### Basic Run Display
# Show benchmark run details
task-o-matic benchmark show 2024-01-15_14-30-22_prd-parse
# Show workflow benchmark details
task-o-matic benchmark show 2024-01-14_16-20-45_workflow-fullSHOW OUTPUT FORMAT
#### Standard Run Display
Run: 2024-01-15_14-30-22_prd-parse
Date: 2024-01-15 2:30:22 PM
Command: prd-parse
Input: {"file":"./requirements.md","prompt":"Default PRD parsing prompt","message":"Parse PRD into structured tasks"}
Configuration:
Concurrency: 5
Delay: 250ms
Results:
[anthropic:claude-3.5-sonnet]
Duration: 2340ms
TTFT: 120ms
Tokens: 1250 (Prompt: 800, Completion: 450)
Throughput: 89 B/s
Size: 208 bytes
Estimated Cost: $0.004250
Output: {"tasks":[...]}
[openai:gpt-4]
Duration: 3120ms
TTFT: 180ms
Tokens: 980 (Prompt: 750, Completion: 230)
Throughput: 67 B/s
Size: 195 bytes
Estimated Cost: $0.039200
Output: {"tasks":[...]}COMPARE COMMAND
Command: task-o-matic benchmark compare
COMMAND SIGNATURE
task-o-matic benchmark compare <id>REQUIRED ARGUMENTS
<id> # Run ID to compare results for (required)COMPARE COMMAND EXAMPLES
#### Basic Comparison
# Compare benchmark run results
task-o-matic benchmark compare 2024-01-15_14-30-22_prd-parse
# Compare workflow benchmark
task-o-matic benchmark compare 2024-01-14_16-20-45_workflow-fullCOMPARE OUTPUT FORMAT
#### Comparison Table
┌─────────────────────────────────────────────┬──────────┬──────────┬────────┬─────────┬────────┐
│ Model │ Status │ Duration │ Tokens │ BPS │ Size │
├─────────────────────────────────────────────┼──────────┼──────────┼────────┼─────────┼────────┤
│ anthropic:claude-3.5-sonnet │ SUCCESS │ 2340ms │ 1250 │ 89 │ 208 │
│ openai:gpt-4 │ SUCCESS │ 3120ms │ 980 │ 67 │ 195 │
└─────────────────────────────────────────────┴──────────┴──────────┴────────┴─────────┴────────┘EXECUTION COMMAND ⚡ NEW
Command: task-o-matic benchmark execution
COMMAND SIGNATURE
task-o-matic benchmark execution --task-id <id> --models <list> [options]⚠️ CRITICAL SURVIVAL FEATURE
Citizen, this is where the Git Branch Isolation system shines. Each AI model gets its own isolated git branch. No cross-contamination. No overwriting your work. Each model operates in a quarantine environment, executing the task independently. When complete, you have multiple branches—one per model—showing how each would have solved the problem. Pick the best. Discard the rest. Survive with optimal code.
REQUIRED OPTIONS
--task-id <id> # Task ID to benchmark (required)
--models <list> # Comma-separated list of models (provider:model) (required)EXECUTION OPTIONS
--verify <command> # Verification command (can be used multiple times)
--max-retries <number> # Maximum retries per model (default: 3)
--no-keep-branches # Delete benchmark branches after run (default: keep)EXECUTION COMMAND EXAMPLES
#### Basic Execution Benchmark
# Execute task across multiple models with Git Branch Isolation
task-o-matic benchmark execution \
--task-id task-123 \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4,openrouter:google/gemini-2.0-flash-exp"
# Execute with verification commands
task-o-matic benchmark execution \
--task-id task-123 \
--models "openrouter:anthropic/claude-3.5-sonnet,openrouter:openai/gpt-4o" \
--verify "bun test" \
--verify "bun run build"
# Execute with custom retry settings
task-o-matic benchmark execution \
--task-id task-complex-456 \
--models "anthropic:claude-3.5-sonnet:reasoning=4096" \
--max-retries 5 \
--no-keep-branches#### Verification Scenarios
# Test execution with multiple verification checks
task-o-matic benchmark execution \
--task-id task-auth-789 \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4" \
--verify "bun test --coverage" \
--verify "bun run type-check" \
--verify "bun run lint"
# Quick execution benchmark (cleanup after)
task-o-matic benchmark execution \
--task-id task-simple-001 \
--models "openrouter:google/gemini-2.0-flash-exp" \
--verify "bun test" \
--no-keep-branchesEXECUTION OUTPUT FORMAT
#### Real-time Progress Display
🚀 Starting Execution Benchmark for Task task-123
Models: 3
- anthropic:claude-3.5-sonnet: Waiting...
- openai:gpt-4: Waiting...
- openrouter:google/gemini-2.0-flash-exp: Waiting...
anthropic:claude-3.5-sonnet: Running...
anthropic:claude-3.5-sonnet: PASS (5432ms)
openai:gpt-4: Running...
openai:gpt-4: PASS (6789ms)
openrouter:google/gemini-2.0-flash-exp: Running...
openrouter:google/gemini-2.0-flash-exp: PASS (4521ms)#### Final Results Summary
✓ Execution Benchmark Completed!
Summary:
Model | Status | Branch | Duration
-----------------------------------|--------------|-------------------------------------------------|----------
anthropic:claude-3.5-sonnet | PASS | benchmark-execution-task123-claude-3.5-sonnet | 5432ms
openai:gpt-4 | PASS | benchmark-execution-task123-gpt-4 | 6789ms
openrouter:google/gemini-2.0-flash-exp | PASS | benchmark-execution-task123-gemini-2.0-flash | 4521ms
Run ID: 2024-01-15_14-30-22_execution-task123
To switch to a branch: git checkout <branch_name>GIT BRANCH ISOLATION EXPLAINED
#### How It Works
- Branch Creation: Each model gets its own git branch named
benchmark-execution-<taskId>-<modelName> - Isolation: All AI execution happens on isolated branches—no impact on your working branch
- Verification: Commands run on each branch independently, validating each solution
- Result Comparison: Switch between branches to compare implementations
- Cleanup: Use
--no-keep-branchesto auto-delete after comparison
#### Branch Switching and Comparison
# After benchmark completes, switch to any branch to review
git checkout benchmark-execution-task123-claude-3.5-sonnet
# View the changes
git diff main
# Test the solution
bun test
# Switch to another branch to compare
git checkout benchmark-execution-task123-gpt-4
# Merge the best solution to main
git checkout main
git merge benchmark-execution-task123-claude-3.5-sonnet
# Clean up all benchmark branches (if not auto-deleted)
git branch | grep benchmark-execution | xargs git branch -DEXECUTION PERFORMANCE METRICS
#### Execution Metrics
- Branch Creation: Time to create and switch to isolated environment
- AI Execution: Actual AI task completion time
- Verification Time: Time to run verification commands
- Total Duration: End-to-end execution time including all steps
- Pass/Fail Status: Whether verification passed or failed
- Branch Name: The isolated branch identifier for the model
#### Quality Assessment
- Verification Pass Rate: Number of verification commands that passed
- Code Quality: Subjective assessment from manual branch review
- Execution Success: Whether the task completed without errors
- Solution Completeness: How thoroughly the task was addressed
ERROR CONDITIONS
# Task ID not found
Error: Task task-123 not found
Solution: Verify task ID using 'task-o-matic tasks list'
# Branch creation failure
Error: Failed to create branch for model anthropic:claude-3.5-sonnet
Solution: Check git repository state, ensure no uncommitted changes
# Verification command failure
Error: Verification command failed for openai:gpt-4
Solution: Review branch output, adjust verification commands
# Concurrency limitation
Error: Execution benchmarks must be serial due to git constraints
Solution: Models execute sequentially (automatically enforced)EXECUTE-LOOP COMMAND ⚡ NEW
Command: task-o-matic benchmark execute-loop
COMMAND SIGNATURE
task-o-matic benchmark execute-loop --models <list> [options]⚠️ CRITICAL SURVIVAL FEATURE
The execute-loop benchmark lets you batch execute multiple tasks across multiple AI models with progressive model escalation. When a task fails on one model, it automatically retries with progressively stronger models. This is your automated escalation protocol—start cheap, escalate only when necessary. Maximum efficiency, minimum waste of resources.
REQUIRED OPTIONS
--models <list> # Comma-separated list of models (provider:model) (required)TASK FILTERING OPTIONS
--status <status> # Filter tasks by status (todo/in-progress/completed)
--tag <tag> # Filter tasks by tag
--ids <ids> # Comma-separated list of task IDs to executeEXECUTE-LOOP OPTIONS
--verify <command> # Verification command to run after each task (can be used multiple times)
--max-retries <number> # Maximum number of retries per task (default: 3)
--try-models <models> # Progressive model/executor configs for each retry
--no-keep-branches # Delete benchmark branches after run (default: keep)EXECUTE-LOOP COMMAND EXAMPLES
#### Basic Loop Benchmark
# Execute all todo tasks across multiple models
task-o-matic benchmark execute-loop \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4" \
--status todo
# Execute specific tasks
task-o-matic benchmark execute-loop \
--models "openrouter:anthropic/claude-3.5-sonnet" \
--ids "task-1,task-5,task-9"
# Execute tasks by tag
task-o-matic benchmark execute-loop \
--models "openai:gpt-4o-mini,openai:gpt-4o" \
--tag frontend#### Advanced Loop with Escalation
# Execute with progressive model escalation
task-o-matic benchmark execute-loop \
--models "openrouter:anthropic/claude-3.5-sonnet" \
--status todo \
--max-retries 3 \
--try-models "gpt-4o-mini,gpt-4o,claude-3.5-sonnet" \
--verify "bun test"
# Execute with multiple verification checks
task-o-matic benchmark execute-loop \
--models "anthropic:claude-3.5-sonnet" \
--status in-progress \
--verify "bun test" \
--verify "bun run build" \
--verify "bun run type-check"
# Quick execution with auto-cleanup
task-o-matic benchmark execute-loop \
--models "openrouter:google/gemini-2.0-flash-exp" \
--ids "task-1,task-2,task-3" \
--verify "bun test" \
--no-keep-branchesEXECUTE-LOOP OUTPUT FORMAT
#### Real-time Progress Display
🚀 Starting Execute Loop Benchmark
Models: 2
- anthropic:claude-3.5-sonnet: Waiting...
- openai:gpt-4: Waiting...
anthropic:claude-3.5-sonnet: Running...
anthropic:claude-3.5-sonnet: PASS (4321ms)
openai:gpt-4: Running...
openai:gpt-4: PASS (5678ms)#### Final Results Summary
✓ Execute Loop Benchmark Completed!
Summary:
Model | Status | Branch | Duration
-----------------------------------|--------------|-------------------------------------------------|----------
anthropic:claude-3.5-sonnet | PASS | benchmark-execute-loop-claude-3.5-sonnet-001 | 4321ms
openai:gpt-4 | PASS | benchmark-execute-loop-gpt-4-002 | 5678msPROGRESSIVE MODEL ESCALATION
#### Try-Models Format
--try-models "model1,model2,model3"#### How Escalation Works
- Initial Attempt: Try the cheapest/fastest model first
- Verification: Run verification commands to check quality
- Escalate on Failure: If verification fails, retry with next model
- Max Retries: Stop after max-retries attempts (default: 3)
- Branch Isolation: Each attempt gets its own isolated git branch
#### Escalation Example
task-o-matic benchmark execute-loop \
--models "openrouter:anthropic/claude-3.5-sonnet" \
--status todo \
--max-retries 3 \
--try-models "gpt-4o-mini,gpt-4o,claude-3.5-sonnet" \
--verify "bun test"
# Execution flow:
# Attempt 1: gpt-4o-mini (fast, cheap)
# - If PASS: Done, cost ~$0.0001
# - If FAIL: Escalate to gpt-4o
# Attempt 2: gpt-4o (better quality)
# - If PASS: Done, cost ~$0.0005
# - If FAIL: Escalate to claude-3.5-sonnet
# Attempt 3: claude-3.5-sonnet (best quality)
# - If PASS: Done, cost ~$0.003
# - If FAIL: Mark as failed, review manuallyEXECUTE-LOOP PERFORMANCE METRICS
#### Loop Metrics
- Tasks Processed: Number of tasks executed
- Total Attempts: Sum of all retry attempts across all tasks
- Pass Rate: Percentage of tasks that passed verification
- Average Retries: Average number of retries per task
- Cost Efficiency: Actual cost vs. maximum possible cost (if all tasks used best model)
#### Escalation Analytics
- First Attempt Success: Percentage of tasks that passed on first (cheapest) model
- Escalation Rate: Percentage of tasks that required model escalation
- Cost Savings: Money saved by starting with cheaper models
WORKFLOW COMMAND
Command: task-o-matic benchmark workflow
COMMAND SIGNATURE
task-o-matic benchmark workflow --models <list> [options]REQUIRED OPTIONS
--models <list> # Comma-separated list of models (provider:model[:reasoning=<tokens>]) (required)WORKFLOW CONFIGURATION OPTIONS
--concurrency <number> # Max concurrent requests (default: 3)
--delay <number> # Delay between requests in ms (default: 1000)BENCHMARK SPECIFIC OPTIONS
--temp-dir <dir> # Base directory for temporary projects (default: /tmp)
--execute # Execute generated tasks in the benchmarkPROJECT INITIALIZATION OPTIONS
--project-name <name> # Project name
--init-method <method> # Initialization method: quick, custom, ai
--project-description <desc> # Project description for AI-assisted init
--frontend <framework> # Frontend framework
--backend <framework> # Backend framework
--auth # Include authenticationPRD DEFINITION OPTIONS
--prd-method <method> # PRD method: upload, manual, ai, skip
--prd-file <path> # Path to existing PRD file
--prd-description <desc> # Product description for AI-assisted PRDTASK GENERATION OPTIONS
--skip-refine # Skip PRD refinement
--skip-generate # Skip task generation
--skip-split # Skip task splitting
--generate-instructions <instructions> # Custom task generation instructions
--split-instructions <instructions> # Custom split instructionsWORKFLOW BENCHMARK EXAMPLES
#### Basic Workflow Benchmark
# Simple workflow benchmark
task-o-matic benchmark workflow \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4"
# Workflow with custom settings
task-o-matic benchmark workflow \
--models "openrouter:anthropic/claude-3.5-sonnet,openrouter:openai/gpt-4o" \
--concurrency 2 \
--delay 2000#### Advanced Workflow with Execution
# Multi-model workflow with execution
task-o-matic benchmark workflow \
--models "anthropic:claude-opus-2024:reasoning=4096,openrouter:anthropic/claude-3.5-sonnet:reasoning=2048" \
--concurrency 1 \
--delay 5000 \
--execute
# Workflow with project specification
task-o-matic benchmark workflow \
--models "anthropic:claude-3.5-sonnet" \
--project-name "Wasteland Shelter" \
--init-method ai \
--project-description "Emergency shelter management" \
--skip-refine \
--skip-splitWORKFLOW BENCHMARK OUTPUT
#### Real-time Progress
🚀 Task-O-Matic Workflow Benchmark
Models: 2, Concurrency: 2, Delay: 2000ms
- anthropic:claude-3.5-sonnet: Waiting...
- openai:gpt-4: Waiting...
⚡ Phase 2: Executing Workflows
Running workflow on 2 models...
- anthropic:claude-3.5-sonnet: Starting...
- openai:gpt-4: Starting...
anthropic:claude-3.5-sonnet: Completed (15432ms)
openai:gpt-4: Completed (18921ms)#### Final Results Summary
✅ Workflow benchmark completed! Run ID: 2024-01-15_14-30-22_workflow-full
📊 Workflow Benchmark Results
Model | Duration | Tasks | Steps | Execution
-----------------------------------|----------|--------|----------------|--------------------
anthropic:claude-opus-2024 | 15432ms | 12 | 5/5 | 12 pass, 0 fail
openrouter:anthropic/claude-3.5-sonnet | 18930ms | 10 | 5/5 | 10 pass, 0 fail
openrouter:openai/gpt-4o | 22150ms | 8 | 5/5 | 8 pass, 0 failWORKFLOW PERFORMANCE METRICS
#### Workflow Step Analysis
- Initialization Time: Project setup and configuration
- PRD Generation Time: PRD creation and processing
- Task Generation Time: Task extraction from PRD
- Task Splitting Time: Task decomposition into subtasks
- Total Workflow Time: End-to-end completion time
- Execution Results: If --execute is enabled, shows pass/fail for task execution
#### Quality Assessment
- Step Completion Rate: Percentage of workflow steps completed successfully
- Task Quality: Number and quality of generated tasks
- PRD Coherence: Logical consistency and completeness
- Integration Success: How well components work together
FIELD OPERATIONS PROTOCOLS
#### BENCHMARK LIFECYCLE MANAGEMENT
The benchmark commands implement a complete performance testing lifecycle:
- Test Planning Phase
- Operation selection and parameter configuration
- Model selection and format validation
- Concurrency and delay optimization
- Test data and input preparation
- Execution Phase
- Concurrent model execution with rate limiting
- Real-time progress tracking and status updates
- Error handling and retry logic
- Performance metrics collection
- Git Branch Isolation for execution benchmarks
- Analysis Phase
- Result aggregation and statistical analysis
- Performance comparison and ranking
- Cost analysis and optimization recommendations
- Error pattern identification and reporting
- Branch comparison for execution benchmarks
- Reporting Phase
- Detailed result display with multiple format options
- Historical trend analysis and comparison
- Export capabilities for external analysis
- Branch switching instructions for code review
#### PERFORMANCE OPTIMIZATION STRATEGIES
All benchmark operations support performance optimization through:
- Model Selection: Data-driven model recommendations based on historical performance
- Parameter Tuning: Automatic optimization of concurrency and delay settings
- Cost Management: Real-time cost tracking and budget enforcement
- Quality Balancing: Trade-off analysis between speed, cost, and quality
- Resource Efficiency: Optimal utilization of system resources
- Progressive Escalation: Start cheap, escalate only when necessary
#### INTEGRATION PATTERNS
Benchmark operations integrate with other Task-O-Matic components:
- Service Integration: Direct calls to TaskService, PRDService, WorkflowService
- Configuration Management: Unified configuration system across all operations
- Storage Integration: Results storage in .task-o-matic/benchmark/ directory
- AI Provider Abstraction: Consistent interface across all supported AI providers
- Error Handling: Standardized error reporting and recovery mechanisms
- Git Integration: Branch isolation for safe concurrent execution testing
SURVIVAL SCENARIOS
#### SCENARIO 1: AI Model Selection for Single Task
# Benchmark single task execution with Git Branch Isolation
task-o-matic benchmark execution \
--task-id task-auth-module \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4,openrouter:google/gemini-2.0-flash-exp" \
--verify "bun test" \
--verify "bun run type-check"
# Compare results and pick best implementation
git checkout benchmark-execution-task-auth-module-claude-3.5-sonnet
git diff main
# Review and merge best solution
git checkout main
git merge benchmark-execution-task-auth-module-claude-3.5-sonnet#### SCENARIO 2: Batch Task Execution with Escalation
# Execute all frontend tasks with progressive model escalation
task-o-matic benchmark execute-loop \
--models "openrouter:anthropic/claude-3.5-sonnet" \
--status todo \
--tag frontend \
--max-retries 3 \
--try-models "gpt-4o-mini,gpt-4o,claude-3.5-sonnet" \
--verify "bun test --coverage" \
--verify "bun run build"
# Analyze cost savings
# - Simple tasks pass on gpt-4o-mini ($0.0001 each)
# - Medium tasks escalate to gpt-4o ($0.0005 each)
# - Complex tasks escalate to claude-3.5-sonnet ($0.003 each)
# - Overall cost: 30-50% lower than using best model for all tasks#### SCENARIO 3: Complete Workflow Comparison
# Compare end-to-end workflow across multiple models
task-o-matic benchmark workflow \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4,openrouter:google/gemini-2.0-flash-exp" \
--project-name "Emergency Shelter Manager" \
--init-method ai \
--project-description "Complete shelter management system with resource tracking" \
--execute
# Analyze results based on:
# - Total workflow duration
# - Number of tasks generated
# - Task execution pass rate
# - Estimated cost#### SCENARIO 4: Cost Optimization for Large Task Sets
# Execute 50 tasks with optimal cost strategy
task-o-matic benchmark execute-loop \
--models "openrouter:google/gemini-2.0-flash-exp" \
--max-retries 4 \
--try-models "gemini-2.0-flash-exp,gpt-4o-mini,gpt-4o,claude-3.5-sonnet,claude-opus" \
--verify "bun test" \
--no-keep-branches
# Expected breakdown:
# - 60% pass on gemini-2.0-flash-exp (fastest, cheapest)
# - 25% escalate to gpt-4o-mini
# - 10% escalate to gpt-4o
# - 4% escalate to claude-3.5-sonnet
# - 1% escalate to claude-opus (most expensive, best quality)
# Cost: ~80% lower than using claude-opus for all tasks
# Time: ~40% faster than using claude-opus for all tasksTECHNICAL SPECIFICATIONS
#### BENCHMARK DATA MODEL
interface BenchmarkRun {
id: string; // Unique run identifier
timestamp: Date; // When benchmark was executed
operation: string; // Type of operation benchmarked
command: string; // Command that was run
config: { // Benchmark configuration
models: BenchmarkModelConfig[];
concurrency: number;
delay: number;
};
input: any; // Input parameters for benchmark
results: BenchmarkResult[]; // Results for each model
}
interface BenchmarkResult {
modelId: string; // Model identifier (provider:model)
duration: number; // Execution time in milliseconds
timeToFirstToken?: number; // TTFT in milliseconds
tokenUsage?: { // Token usage breakdown
prompt: number;
completion: number;
total: number;
};
tps?: number; // Tokens per second
bps?: number; // Bytes per second
responseSize?: number; // Response size in bytes
cost?: number; // Estimated cost in USD
output?: any; // Operation output
error?: string; // Error message if failed
}
interface BenchmarkModelConfig {
provider: string; // AI provider name
model: string; // Model name
reasoningTokens?: number; // Reasoning token limit (if applicable)
}#### PERFORMANCE CHARACTERISTICS
- Concurrent Execution: Support for multiple simultaneous model tests (except execution benchmarks)
- Rate Limiting: Configurable delays between requests
- Progress Tracking: Real-time status updates during execution
- Error Recovery: Automatic retry with fallback models
- Metrics Collection: Comprehensive performance data gathering
- Git Branch Isolation: Safe concurrent execution testing for execution benchmarks
#### STORAGE REQUIREMENTS
- Run Storage: 10-50KB per benchmark run
- Results Storage: 5-20KB per model result
- History Storage: 1-5MB for benchmark history
- Cache Storage: 50-200MB for performance optimization cache
#### CONCURRENCY AND SAFETY
- Atomic Operations: Safe concurrent benchmark execution
- Resource Management: Memory and CPU usage optimization
- Error Isolation: Failed model execution doesn't affect others
- Data Integrity: Consistent result storage and retrieval
- Git Safety: Execution benchmarks enforce serial execution to prevent git conflicts
Remember: Citizen, in the resource-constrained wasteland, benchmark commands are your strategic intelligence tools. They provide the data needed to optimize AI usage, control costs, and make informed decisions about which tools will keep your projects running efficiently. The Git Branch Isolation system means you can test fearlessly—no risk to your working code. Use them to measure, analyze, and optimize your AI-powered development workflow.
DOCUMENT STATUS: Updated
NEXT REVIEW: After AI provider updates
CONTACT: Task-O-Matic Benchmark Team
OPEN QUESTIONS / TODO:
- [ ] Add support for custom metric definitions in benchmark runs
- [ ] Implement benchmark result export to CSV/JSON formats
- [ ] Add historical trend visualization
- [ ] Implement cost budget enforcement with alerts
- [ ] Add quality scoring algorithm for comparing model outputs