TASK-O-MATIC

TECHNICAL BULLETIN NO. 005

BENCHMARK COMMANDS - PERFORMANCE ANALYSIS FIELD OPERATIONS

DOCUMENT ID: task-o-matic-cli-benchmark-commands-v2

CLEARANCE: All Personnel

MANDATORY COMPLIANCE: Yes

⚠️ CRITICAL SURVIVAL NOTICE

Citizen, benchmark commands are your intelligence gathering tools in the post-deadline wasteland. Without proper performance analysis, you're flying blind into AI provider selection storms. These commands provide the data needed to make informed decisions about which AI models and tools will keep your projects alive.

The benchmark system has been upgraded with Git Branch Isolation capabilities. This means you can now test different AI models in isolation—each model gets its own temporary branch. No cross-contamination. No overwritten work. Pure, comparative analysis in the safety of your own bunker.

COMMAND ARCHITECTURE OVERVIEW

The benchmark command group represents the performance analysis and optimization layer of Task-O-Matic. It provides comprehensive testing capabilities across multiple AI providers, models, and workflows. The architecture supports concurrent testing, progressive model escalation, and detailed performance metrics collection.

Performance Analysis Components:

Multi-Provider Testing: Compare performance across OpenRouter, Anthropic, OpenAI, and custom endpoints
Model Benchmarking: Detailed performance metrics for individual AI models
Workflow Analysis: End-to-end workflow performance across complete project lifecycles
Execution Benchmarking: Git Branch Isolation for safe, concurrent model testing
Loop Benchmarking: Batch task execution with progressive model escalation
Comprehensive Metrics: Latency, throughput, cost, and quality measurements

COMPLETE BENCHMARK COMMAND DOCUMENTATION

RUN COMMAND

Command: task-o-matic benchmark run

COMMAND SIGNATURE

task-o-matic benchmark run <operation> --models <list> [options]

REQUIRED ARGUMENTS

<operation>                  # Operation to benchmark (prd-parse, task-breakdown, task-create, prd-create, etc.)
<models>                    # Comma-separated list of models (required)

REQUIRED OPTIONS

--models <list>              # Comma-separated list of models (provider:model[:reasoning=<tokens>])

GENERAL OPTIONS

--file <path>                # Input file path (for PRD ops)
--task-id <id>               # Task ID (for Task ops)
--concurrency <number>        # Max concurrent requests (default: 5)
--delay <number>               # Delay between requests in ms (default: 250)
--prompt <prompt>             # Override prompt
--message <message>           # User message
--tools                      # Enable filesystem tools
--feedback <feedback>         # Feedback (for prd-rework)

TASK CREATION OPTIONS

--title <title>              # Task title (for task-create)
--content <content>           # Task content (for task-create)
--parent-id <id>             # Parent task ID (for task-create)
--effort <effort>            # Effort estimate: small, medium, large (for task-create)
--force                      # Force operation (for task-document)

PRD CREATION OPTIONS

--description <desc>          # Project/PRD description (for prd-create, prd-combine)
--output-dir <dir>           # Output directory (for prd-create, prd-combine)
--filename <name>            # Output filename (for prd-create, prd-combine)

PRD PROCESSING OPTIONS

--prds <list>                # Comma-separated list of PRD file paths (for prd-combine)

PRD REFINEMENT OPTIONS

--question-mode <mode>        # Question mode: user or ai (for prd-refine)
--answers <json>              # JSON string of answers (for prd-refine user mode)

BENCHMARK RUN EXAMPLES

#### Basic Benchmarking

# Simple PRD parsing benchmark
task-o-matic benchmark run prd-parse \
  --models "anthropic:claude-3.5-sonnet,openai:gpt-4"

# Task creation benchmark
task-o-matic benchmark run task-create \
  --models "openrouter:anthropic/claude-3.5-sonnet,openrouter:openai/gpt-4o" \
  --title "Build radiation detector" \
  --content "Implement Geiger counter integration" \
  --effort medium

# PRD creation benchmark
task-o-matic benchmark run prd-create \
  --models "anthropic:claude-3.5-sonnet,openai:gpt-4,google:gemini-pro" \
  --description "Emergency shelter management system"

#### Advanced Benchmarking with Reasoning

# Benchmark with reasoning tokens
task-o-matic benchmark run task-breakdown \
  --models "openrouter:anthropic/claude-3.5-sonnet:reasoning=2048,openrouter:openai/o1-preview:reasoning=8192" \
  --task-id task-complex-123 \
  --tools

# Multi-model PRD generation with reasoning
task-o-matic benchmark run prd-create \
  --models "anthropic:claude-opus-2024:reasoning=4096,openrouter:anthropic/claude-3.5-sonnet:reasoning=2048" \
  --description "AI-powered emergency response system"

# Task enhancement with reasoning
task-o-matic benchmark run task-enhance \
  --models "openrouter:anthropic/claude-3.5-sonnet:reasoning=1024" \
  --task-id task-123 \
  --tools

MODEL FORMAT SPECIFICATIONS

#### Basic Model Format

provider:model

#### Model Format Examples

# Anthropic models
anthropic:claude-3.5-sonnet
anthropic:claude-opus-2024

# OpenAI models
openai:gpt-4
openai:gpt-4-turbo
openai:o1-preview

# OpenRouter models
openrouter:anthropic/claude-3.5-sonnet
openrouter:openai/gpt-4o
openrouter:google/gemini-2.0-flash-exp

# Custom endpoints
custom:http://localhost:8080/v1/custom-model

#### Advanced Model Format with Reasoning

provider:model:reasoning=<tokens>

#### Reasoning Examples

# OpenRouter with reasoning
openrouter:anthropic/claude-3.5-sonnet:reasoning=2048
openrouter:openai/o1-preview:reasoning=8192
openrouter:google/gemini-2.0-flash-exp:reasoning=4096

# OpenAI with reasoning (if supported)
openai:o1-preview:reasoning=4096

BENCHMARK RUN OUTPUT FORMAT

#### Real-time Progress Display

Benchmark Progress:
- anthropic:claude-3.5-sonnet: Starting...
- openai:gpt-4: Running... Size: 2048 B, Speed: 15.2 B/s
- anthropic:claude-3.5-sonnet: Completed (2340ms)
- openai:gpt-4: Completed (3120ms)

#### Final Results Table

Model                                     | Duration | TTFT    | Tokens   | TPS     | BPS     | Size     | Cost
-------------------------------------------|----------|----------|----------|---------|---------|----------|----------
anthropic:claude-3.5-sonnet                | 2340ms   | 120ms    | 1250     | 0.53    | 89      | 208      | $0.004250
openai:gpt-4                              | 3120ms   | 180ms    | 980      | 0.31    | 67      | 195      | $0.039200
openrouter:anthropic/claude-3.5-sonnet     | 2450ms   | 135ms    | 1180     | 0.48    | 85      | 202      | $0.004720

PERFORMANCE METRICS

#### Measured Metrics

Duration: Total execution time in milliseconds
TTFT (Time to First Token): Response latency measurement
Tokens: Total token usage (prompt + completion)
TPS (Tokens Per Second): Processing speed measurement
BPS (Bytes Per Second): Data transfer rate
Response Size: Output size in bytes
Cost: Estimated API cost in USD

#### Metric Calculations

# TPS Calculation
TPS = Total Tokens / (Duration / 1000)

# BPS Calculation
BPS = Response Size / (Duration / 1000)

# Cost Estimation
Cost = (Prompt Tokens * Prompt Rate) + (Completion Tokens * Completion Rate)

ERROR CONDITIONS

# Invalid model format
Error: Invalid model format: invalid-model. Expected provider:model[:reasoning=<tokens>]
Solution: Use format "provider:model" or "provider:model:reasoning=tokens"

# Operation not found
Error: Benchmark operation 'invalid-op' not found
Solution: Use 'benchmark operations' to see available operations

# Concurrency limit exceeded
Error: Too many concurrent requests
Solution: Reduce --concurrency value or increase --delay

# Rate limiting
Error: Rate limit exceeded for provider
Solution: Increase --delay between requests

LIST COMMAND

Command: task-o-matic benchmark list

COMMAND SIGNATURE

task-o-matic benchmark list

LIST COMMAND EXAMPLES

#### Basic Listing

# List all benchmark runs
task-o-matic benchmark list

LIST OUTPUT FORMAT

#### Standard Output

Benchmark Runs:
- 2024-01-15_14-30-22_prd-parse (2024-01-15 2:30:22 PM) - prd-parse
- 2024-01-15_13-45-10_task-create (2024-01-15 1:45:10 PM) - task-create
- 2024-01-14_16-20-45_workflow-full (2024-01-14 4:20:45 PM) - workflow-full

BENCHMARK HISTORY ANALYSIS

#### Performance Trends

Model Comparison: Compare performance across different runs
Temporal Analysis: Identify performance changes over time
Cost Tracking: Monitor API usage costs
Success Rates: Track reliability and consistency

#### Run Filtering

# Filter by operation type
task-o-matic benchmark list | grep prd-parse

# Filter by date range
task-o-matic benchmark list | grep "2024-01-1"

# Filter by model
task-o-matic benchmark list | grep "claude-3.5-sonnet"

OPERATIONS COMMAND

Command: task-o-matic benchmark operations

COMMAND SIGNATURE

task-o-matic benchmark operations

OPERATIONS COMMAND EXAMPLES

#### Basic Operations Listing

# List all available operations
task-o-matic benchmark operations

OPERATIONS OUTPUT FORMAT

#### Categorized Operations List

📊 Available Benchmark Operations

Task Operations:
  task-create          - Create a new task with AI enhancement
  task-enhance        - Enhance an existing task with AI
  task-breakdown      - Break task into subtasks using AI
  task-document       - Fetch and analyze documentation for task

PRD Operations:
  prd-create          - Generate PRD from product description
  prd-parse           - Parse PRD into structured tasks
  prd-rework          - Rework PRD based on feedback
  prd-combine         - Combine multiple PRDs into master PRD
  prd-question        - Generate clarifying questions for PRD
  prd-refine          - Refine PRD by answering questions

Workflow Operations:
  workflow-full       - Complete workflow execution

Total operations: 11

#### Example usage:

task-o-matic benchmark run task-create --models anthropic:claude-3.5-sonnet --title "Example task"
task-o-matic benchmark run prd-parse --models openai:gpt-4 --file ./prd.md

OPERATION CATEGORIES

#### Task Operations

Creation: AI-powered task generation and enhancement
Enhancement: Task description improvement with Context7
Breakdown: Intelligent task decomposition into subtasks
Documentation: Automatic documentation research and integration

#### PRD Operations

Generation: AI-powered PRD creation from descriptions
Parsing: Structured task extraction from PRD documents
Refinement: PRD improvement through feedback and Q&A
Combination: Multiple PRD synthesis into master documents

#### Workflow Operations

End-to-End: Complete project lifecycle benchmarking
Integration: Cross-component performance measurement

SHOW COMMAND

Command: task-o-matic benchmark show

COMMAND SIGNATURE

task-o-matic benchmark show <id>

REQUIRED ARGUMENTS

<id>                        # Run ID to display details for (required)

SHOW COMMAND EXAMPLES

#### Basic Run Display

# Show benchmark run details
task-o-matic benchmark show 2024-01-15_14-30-22_prd-parse

# Show workflow benchmark details
task-o-matic benchmark show 2024-01-14_16-20-45_workflow-full

SHOW OUTPUT FORMAT

#### Standard Run Display

Run: 2024-01-15_14-30-22_prd-parse
Date: 2024-01-15 2:30:22 PM
Command: prd-parse
Input: {"file":"./requirements.md","prompt":"Default PRD parsing prompt","message":"Parse PRD into structured tasks"}
Configuration:
  Concurrency: 5
  Delay: 250ms

Results:

[anthropic:claude-3.5-sonnet]
Duration: 2340ms
TTFT: 120ms
Tokens: 1250 (Prompt: 800, Completion: 450)
Throughput: 89 B/s
Size: 208 bytes
Estimated Cost: $0.004250
Output: {"tasks":[...]}

[openai:gpt-4]
Duration: 3120ms
TTFT: 180ms
Tokens: 980 (Prompt: 750, Completion: 230)
Throughput: 67 B/s
Size: 195 bytes
Estimated Cost: $0.039200
Output: {"tasks":[...]}

COMPARE COMMAND

Command: task-o-matic benchmark compare

COMMAND SIGNATURE

task-o-matic benchmark compare <id>

REQUIRED ARGUMENTS

<id>                        # Run ID to compare results for (required)

COMPARE COMMAND EXAMPLES

#### Basic Comparison

# Compare benchmark run results
task-o-matic benchmark compare 2024-01-15_14-30-22_prd-parse

# Compare workflow benchmark
task-o-matic benchmark compare 2024-01-14_16-20-45_workflow-full

COMPARE OUTPUT FORMAT

#### Comparison Table

┌─────────────────────────────────────────────┬──────────┬──────────┬────────┬─────────┬────────┐
│ Model                                      │ Status   │ Duration │ Tokens │ BPS     │ Size   │
├─────────────────────────────────────────────┼──────────┼──────────┼────────┼─────────┼────────┤
│ anthropic:claude-3.5-sonnet                │ SUCCESS  │ 2340ms   │ 1250   │ 89      │ 208    │
│ openai:gpt-4                              │ SUCCESS  │ 3120ms   │ 980    │ 67      │ 195    │
└─────────────────────────────────────────────┴──────────┴──────────┴────────┴─────────┴────────┘

EXECUTION COMMAND ⚡ NEW

Command: task-o-matic benchmark execution

COMMAND SIGNATURE

task-o-matic benchmark execution --task-id <id> --models <list> [options]

⚠️ CRITICAL SURVIVAL FEATURE

Citizen, this is where the Git Branch Isolation system shines. Each AI model gets its own isolated git branch. No cross-contamination. No overwriting your work. Each model operates in a quarantine environment, executing the task independently. When complete, you have multiple branches—one per model—showing how each would have solved the problem. Pick the best. Discard the rest. Survive with optimal code.

REQUIRED OPTIONS

--task-id <id>               # Task ID to benchmark (required)
--models <list>              # Comma-separated list of models (provider:model) (required)

EXECUTION OPTIONS

--verify <command>           # Verification command (can be used multiple times)
--max-retries <number>       # Maximum retries per model (default: 3)
--no-keep-branches          # Delete benchmark branches after run (default: keep)

EXECUTION COMMAND EXAMPLES

#### Basic Execution Benchmark

# Execute task across multiple models with Git Branch Isolation
task-o-matic benchmark execution \
  --task-id task-123 \
  --models "anthropic:claude-3.5-sonnet,openai:gpt-4,openrouter:google/gemini-2.0-flash-exp"

# Execute with verification commands
task-o-matic benchmark execution \
  --task-id task-123 \
  --models "openrouter:anthropic/claude-3.5-sonnet,openrouter:openai/gpt-4o" \
  --verify "bun test" \
  --verify "bun run build"

# Execute with custom retry settings
task-o-matic benchmark execution \
  --task-id task-complex-456 \
  --models "anthropic:claude-3.5-sonnet:reasoning=4096" \
  --max-retries 5 \
  --no-keep-branches

#### Verification Scenarios

# Test execution with multiple verification checks
task-o-matic benchmark execution \
  --task-id task-auth-789 \
  --models "anthropic:claude-3.5-sonnet,openai:gpt-4" \
  --verify "bun test --coverage" \
  --verify "bun run type-check" \
  --verify "bun run lint"

# Quick execution benchmark (cleanup after)
task-o-matic benchmark execution \
  --task-id task-simple-001 \
  --models "openrouter:google/gemini-2.0-flash-exp" \
  --verify "bun test" \
  --no-keep-branches

EXECUTION OUTPUT FORMAT

#### Real-time Progress Display

🚀 Starting Execution Benchmark for Task task-123
Models: 3

- anthropic:claude-3.5-sonnet: Waiting...
- openai:gpt-4: Waiting...
- openrouter:google/gemini-2.0-flash-exp: Waiting...

anthropic:claude-3.5-sonnet: Running...
anthropic:claude-3.5-sonnet: PASS (5432ms)

openai:gpt-4: Running...
openai:gpt-4: PASS (6789ms)

openrouter:google/gemini-2.0-flash-exp: Running...
openrouter:google/gemini-2.0-flash-exp: PASS (4521ms)

#### Final Results Summary

✓ Execution Benchmark Completed!

Summary:
Model                              | Status       | Branch                                          | Duration
-----------------------------------|--------------|-------------------------------------------------|----------
anthropic:claude-3.5-sonnet        | PASS         | benchmark-execution-task123-claude-3.5-sonnet    | 5432ms
openai:gpt-4                       | PASS         | benchmark-execution-task123-gpt-4               | 6789ms
openrouter:google/gemini-2.0-flash-exp | PASS   | benchmark-execution-task123-gemini-2.0-flash    | 4521ms

Run ID: 2024-01-15_14-30-22_execution-task123
To switch to a branch: git checkout <branch_name>

GIT BRANCH ISOLATION EXPLAINED

#### How It Works

Branch Creation: Each model gets its own git branch named benchmark-execution-<taskId>-<modelName>
Isolation: All AI execution happens on isolated branches—no impact on your working branch
Verification: Commands run on each branch independently, validating each solution
Result Comparison: Switch between branches to compare implementations
Cleanup: Use --no-keep-branches to auto-delete after comparison

#### Branch Switching and Comparison

# After benchmark completes, switch to any branch to review
git checkout benchmark-execution-task123-claude-3.5-sonnet

# View the changes
git diff main

# Test the solution
bun test

# Switch to another branch to compare
git checkout benchmark-execution-task123-gpt-4

# Merge the best solution to main
git checkout main
git merge benchmark-execution-task123-claude-3.5-sonnet

# Clean up all benchmark branches (if not auto-deleted)
git branch | grep benchmark-execution | xargs git branch -D

EXECUTION PERFORMANCE METRICS

#### Execution Metrics

Branch Creation: Time to create and switch to isolated environment
AI Execution: Actual AI task completion time
Verification Time: Time to run verification commands
Total Duration: End-to-end execution time including all steps
Pass/Fail Status: Whether verification passed or failed
Branch Name: The isolated branch identifier for the model

#### Quality Assessment

Verification Pass Rate: Number of verification commands that passed
Code Quality: Subjective assessment from manual branch review
Execution Success: Whether the task completed without errors
Solution Completeness: How thoroughly the task was addressed

ERROR CONDITIONS

# Task ID not found
Error: Task task-123 not found
Solution: Verify task ID using 'task-o-matic tasks list'

# Branch creation failure
Error: Failed to create branch for model anthropic:claude-3.5-sonnet
Solution: Check git repository state, ensure no uncommitted changes

# Verification command failure
Error: Verification command failed for openai:gpt-4
Solution: Review branch output, adjust verification commands

# Concurrency limitation
Error: Execution benchmarks must be serial due to git constraints
Solution: Models execute sequentially (automatically enforced)

EXECUTE-LOOP COMMAND ⚡ NEW

Command: task-o-matic benchmark execute-loop

COMMAND SIGNATURE

task-o-matic benchmark execute-loop --models <list> [options]

⚠️ CRITICAL SURVIVAL FEATURE

The execute-loop benchmark lets you batch execute multiple tasks across multiple AI models with progressive model escalation. When a task fails on one model, it automatically retries with progressively stronger models. This is your automated escalation protocol—start cheap, escalate only when necessary. Maximum efficiency, minimum waste of resources.

REQUIRED OPTIONS

--models <list>              # Comma-separated list of models (provider:model) (required)

TASK FILTERING OPTIONS

--status <status>             # Filter tasks by status (todo/in-progress/completed)
--tag <tag>                   # Filter tasks by tag
--ids <ids>                   # Comma-separated list of task IDs to execute

EXECUTE-LOOP OPTIONS

--verify <command>            # Verification command to run after each task (can be used multiple times)
--max-retries <number>        # Maximum number of retries per task (default: 3)
--try-models <models>         # Progressive model/executor configs for each retry
--no-keep-branches           # Delete benchmark branches after run (default: keep)

EXECUTE-LOOP COMMAND EXAMPLES

#### Basic Loop Benchmark

# Execute all todo tasks across multiple models
task-o-matic benchmark execute-loop \
  --models "anthropic:claude-3.5-sonnet,openai:gpt-4" \
  --status todo

# Execute specific tasks
task-o-matic benchmark execute-loop \
  --models "openrouter:anthropic/claude-3.5-sonnet" \
  --ids "task-1,task-5,task-9"

# Execute tasks by tag
task-o-matic benchmark execute-loop \
  --models "openai:gpt-4o-mini,openai:gpt-4o" \
  --tag frontend

#### Advanced Loop with Escalation

# Execute with progressive model escalation
task-o-matic benchmark execute-loop \
  --models "openrouter:anthropic/claude-3.5-sonnet" \
  --status todo \
  --max-retries 3 \
  --try-models "gpt-4o-mini,gpt-4o,claude-3.5-sonnet" \
  --verify "bun test"

# Execute with multiple verification checks
task-o-matic benchmark execute-loop \
  --models "anthropic:claude-3.5-sonnet" \
  --status in-progress \
  --verify "bun test" \
  --verify "bun run build" \
  --verify "bun run type-check"

# Quick execution with auto-cleanup
task-o-matic benchmark execute-loop \
  --models "openrouter:google/gemini-2.0-flash-exp" \
  --ids "task-1,task-2,task-3" \
  --verify "bun test" \
  --no-keep-branches

EXECUTE-LOOP OUTPUT FORMAT

#### Real-time Progress Display

🚀 Starting Execute Loop Benchmark
Models: 2

- anthropic:claude-3.5-sonnet: Waiting...
- openai:gpt-4: Waiting...

anthropic:claude-3.5-sonnet: Running...
anthropic:claude-3.5-sonnet: PASS (4321ms)

openai:gpt-4: Running...
openai:gpt-4: PASS (5678ms)

#### Final Results Summary

✓ Execute Loop Benchmark Completed!

Summary:
Model                              | Status       | Branch                                          | Duration
-----------------------------------|--------------|-------------------------------------------------|----------
anthropic:claude-3.5-sonnet        | PASS         | benchmark-execute-loop-claude-3.5-sonnet-001    | 4321ms
openai:gpt-4                       | PASS         | benchmark-execute-loop-gpt-4-002               | 5678ms

PROGRESSIVE MODEL ESCALATION

#### Try-Models Format

--try-models "model1,model2,model3"

#### How Escalation Works

Initial Attempt: Try the cheapest/fastest model first
Verification: Run verification commands to check quality
Escalate on Failure: If verification fails, retry with next model
Max Retries: Stop after max-retries attempts (default: 3)
Branch Isolation: Each attempt gets its own isolated git branch

#### Escalation Example

task-o-matic benchmark execute-loop \
  --models "openrouter:anthropic/claude-3.5-sonnet" \
  --status todo \
  --max-retries 3 \
  --try-models "gpt-4o-mini,gpt-4o,claude-3.5-sonnet" \
  --verify "bun test"

# Execution flow:
# Attempt 1: gpt-4o-mini (fast, cheap)
#   - If PASS: Done, cost ~$0.0001
#   - If FAIL: Escalate to gpt-4o

# Attempt 2: gpt-4o (better quality)
#   - If PASS: Done, cost ~$0.0005
#   - If FAIL: Escalate to claude-3.5-sonnet

# Attempt 3: claude-3.5-sonnet (best quality)
#   - If PASS: Done, cost ~$0.003
#   - If FAIL: Mark as failed, review manually

EXECUTE-LOOP PERFORMANCE METRICS

#### Loop Metrics

Tasks Processed: Number of tasks executed
Total Attempts: Sum of all retry attempts across all tasks
Pass Rate: Percentage of tasks that passed verification
Average Retries: Average number of retries per task
Cost Efficiency: Actual cost vs. maximum possible cost (if all tasks used best model)

#### Escalation Analytics

First Attempt Success: Percentage of tasks that passed on first (cheapest) model
Escalation Rate: Percentage of tasks that required model escalation
Cost Savings: Money saved by starting with cheaper models

WORKFLOW COMMAND

Command: task-o-matic benchmark workflow

COMMAND SIGNATURE

task-o-matic benchmark workflow --models <list> [options]

REQUIRED OPTIONS

--models <list>              # Comma-separated list of models (provider:model[:reasoning=<tokens>]) (required)

WORKFLOW CONFIGURATION OPTIONS

--concurrency <number>        # Max concurrent requests (default: 3)
--delay <number>             # Delay between requests in ms (default: 1000)

BENCHMARK SPECIFIC OPTIONS

--temp-dir <dir>             # Base directory for temporary projects (default: /tmp)
--execute                    # Execute generated tasks in the benchmark

PROJECT INITIALIZATION OPTIONS

--project-name <name>        # Project name
--init-method <method>       # Initialization method: quick, custom, ai
--project-description <desc> # Project description for AI-assisted init
--frontend <framework>       # Frontend framework
--backend <framework>        # Backend framework
--auth                       # Include authentication

PRD DEFINITION OPTIONS

--prd-method <method>        # PRD method: upload, manual, ai, skip
--prd-file <path>            # Path to existing PRD file
--prd-description <desc>     # Product description for AI-assisted PRD

TASK GENERATION OPTIONS

--skip-refine               # Skip PRD refinement
--skip-generate             # Skip task generation
--skip-split                # Skip task splitting
--generate-instructions <instructions> # Custom task generation instructions
--split-instructions <instructions>    # Custom split instructions

WORKFLOW BENCHMARK EXAMPLES

#### Basic Workflow Benchmark

# Simple workflow benchmark
task-o-matic benchmark workflow \
  --models "anthropic:claude-3.5-sonnet,openai:gpt-4"

# Workflow with custom settings
task-o-matic benchmark workflow \
  --models "openrouter:anthropic/claude-3.5-sonnet,openrouter:openai/gpt-4o" \
  --concurrency 2 \
  --delay 2000

#### Advanced Workflow with Execution

# Multi-model workflow with execution
task-o-matic benchmark workflow \
  --models "anthropic:claude-opus-2024:reasoning=4096,openrouter:anthropic/claude-3.5-sonnet:reasoning=2048" \
  --concurrency 1 \
  --delay 5000 \
  --execute

# Workflow with project specification
task-o-matic benchmark workflow \
  --models "anthropic:claude-3.5-sonnet" \
  --project-name "Wasteland Shelter" \
  --init-method ai \
  --project-description "Emergency shelter management" \
  --skip-refine \
  --skip-split

WORKFLOW BENCHMARK OUTPUT

#### Real-time Progress

🚀 Task-O-Matic Workflow Benchmark

Models: 2, Concurrency: 2, Delay: 2000ms

- anthropic:claude-3.5-sonnet: Waiting...
- openai:gpt-4: Waiting...

⚡ Phase 2: Executing Workflows

Running workflow on 2 models...

- anthropic:claude-3.5-sonnet: Starting...
- openai:gpt-4: Starting...

anthropic:claude-3.5-sonnet: Completed (15432ms)
openai:gpt-4: Completed (18921ms)

#### Final Results Summary

✅ Workflow benchmark completed! Run ID: 2024-01-15_14-30-22_workflow-full

📊 Workflow Benchmark Results

Model                               | Duration | Tasks  | Steps          | Execution
-----------------------------------|----------|--------|----------------|--------------------
anthropic:claude-opus-2024         | 15432ms  | 12     | 5/5            | 12 pass, 0 fail
openrouter:anthropic/claude-3.5-sonnet | 18930ms  | 10     | 5/5            | 10 pass, 0 fail
openrouter:openai/gpt-4o           | 22150ms  | 8      | 5/5            | 8 pass, 0 fail

WORKFLOW PERFORMANCE METRICS

#### Workflow Step Analysis

Initialization Time: Project setup and configuration
PRD Generation Time: PRD creation and processing
Task Generation Time: Task extraction from PRD
Task Splitting Time: Task decomposition into subtasks
Total Workflow Time: End-to-end completion time
Execution Results: If --execute is enabled, shows pass/fail for task execution

#### Quality Assessment

Step Completion Rate: Percentage of workflow steps completed successfully
Task Quality: Number and quality of generated tasks
PRD Coherence: Logical consistency and completeness
Integration Success: How well components work together

FIELD OPERATIONS PROTOCOLS

#### BENCHMARK LIFECYCLE MANAGEMENT

The benchmark commands implement a complete performance testing lifecycle:

Test Planning Phase

- Operation selection and parameter configuration

- Model selection and format validation

- Concurrency and delay optimization

- Test data and input preparation

Execution Phase

- Concurrent model execution with rate limiting

- Real-time progress tracking and status updates

- Error handling and retry logic

- Performance metrics collection

- Git Branch Isolation for execution benchmarks

Analysis Phase

- Result aggregation and statistical analysis

- Performance comparison and ranking

- Cost analysis and optimization recommendations

- Error pattern identification and reporting

- Branch comparison for execution benchmarks

Reporting Phase

- Detailed result display with multiple format options

- Historical trend analysis and comparison

- Export capabilities for external analysis

- Branch switching instructions for code review

#### PERFORMANCE OPTIMIZATION STRATEGIES

All benchmark operations support performance optimization through:

Model Selection: Data-driven model recommendations based on historical performance
Parameter Tuning: Automatic optimization of concurrency and delay settings
Cost Management: Real-time cost tracking and budget enforcement
Quality Balancing: Trade-off analysis between speed, cost, and quality
Resource Efficiency: Optimal utilization of system resources
Progressive Escalation: Start cheap, escalate only when necessary

#### INTEGRATION PATTERNS

Benchmark operations integrate with other Task-O-Matic components:

Service Integration: Direct calls to TaskService, PRDService, WorkflowService
Configuration Management: Unified configuration system across all operations
Storage Integration: Results storage in .task-o-matic/benchmark/ directory
AI Provider Abstraction: Consistent interface across all supported AI providers
Error Handling: Standardized error reporting and recovery mechanisms
Git Integration: Branch isolation for safe concurrent execution testing

SURVIVAL SCENARIOS

#### SCENARIO 1: AI Model Selection for Single Task

# Benchmark single task execution with Git Branch Isolation
task-o-matic benchmark execution \
  --task-id task-auth-module \
  --models "anthropic:claude-3.5-sonnet,openai:gpt-4,openrouter:google/gemini-2.0-flash-exp" \
  --verify "bun test" \
  --verify "bun run type-check"

# Compare results and pick best implementation
git checkout benchmark-execution-task-auth-module-claude-3.5-sonnet
git diff main

# Review and merge best solution
git checkout main
git merge benchmark-execution-task-auth-module-claude-3.5-sonnet

#### SCENARIO 2: Batch Task Execution with Escalation

# Execute all frontend tasks with progressive model escalation
task-o-matic benchmark execute-loop \
  --models "openrouter:anthropic/claude-3.5-sonnet" \
  --status todo \
  --tag frontend \
  --max-retries 3 \
  --try-models "gpt-4o-mini,gpt-4o,claude-3.5-sonnet" \
  --verify "bun test --coverage" \
  --verify "bun run build"

# Analyze cost savings
# - Simple tasks pass on gpt-4o-mini ($0.0001 each)
# - Medium tasks escalate to gpt-4o ($0.0005 each)
# - Complex tasks escalate to claude-3.5-sonnet ($0.003 each)
# - Overall cost: 30-50% lower than using best model for all tasks

#### SCENARIO 3: Complete Workflow Comparison

# Compare end-to-end workflow across multiple models
task-o-matic benchmark workflow \
  --models "anthropic:claude-3.5-sonnet,openai:gpt-4,openrouter:google/gemini-2.0-flash-exp" \
  --project-name "Emergency Shelter Manager" \
  --init-method ai \
  --project-description "Complete shelter management system with resource tracking" \
  --execute

# Analyze results based on:
# - Total workflow duration
# - Number of tasks generated
# - Task execution pass rate
# - Estimated cost

#### SCENARIO 4: Cost Optimization for Large Task Sets

# Execute 50 tasks with optimal cost strategy
task-o-matic benchmark execute-loop \
  --models "openrouter:google/gemini-2.0-flash-exp" \
  --max-retries 4 \
  --try-models "gemini-2.0-flash-exp,gpt-4o-mini,gpt-4o,claude-3.5-sonnet,claude-opus" \
  --verify "bun test" \
  --no-keep-branches

# Expected breakdown:
# - 60% pass on gemini-2.0-flash-exp (fastest, cheapest)
# - 25% escalate to gpt-4o-mini
# - 10% escalate to gpt-4o
# - 4% escalate to claude-3.5-sonnet
# - 1% escalate to claude-opus (most expensive, best quality)

# Cost: ~80% lower than using claude-opus for all tasks
# Time: ~40% faster than using claude-opus for all tasks

TECHNICAL SPECIFICATIONS

#### BENCHMARK DATA MODEL

interface BenchmarkRun {
  id: string;                    // Unique run identifier
  timestamp: Date;               // When benchmark was executed
  operation: string;              // Type of operation benchmarked
  command: string;               // Command that was run
  config: {                     // Benchmark configuration
    models: BenchmarkModelConfig[];
    concurrency: number;
    delay: number;
  };
  input: any;                     // Input parameters for benchmark
  results: BenchmarkResult[];    // Results for each model
}

interface BenchmarkResult {
  modelId: string;               // Model identifier (provider:model)
  duration: number;              // Execution time in milliseconds
  timeToFirstToken?: number;    // TTFT in milliseconds
  tokenUsage?: {                // Token usage breakdown
    prompt: number;
    completion: number;
    total: number;
  };
  tps?: number;                 // Tokens per second
  bps?: number;                 // Bytes per second
  responseSize?: number;          // Response size in bytes
  cost?: number;                 // Estimated cost in USD
  output?: any;                 // Operation output
  error?: string;                // Error message if failed
}

interface BenchmarkModelConfig {
  provider: string;              // AI provider name
  model: string;                // Model name
  reasoningTokens?: number;       // Reasoning token limit (if applicable)
}

#### PERFORMANCE CHARACTERISTICS

Concurrent Execution: Support for multiple simultaneous model tests (except execution benchmarks)
Rate Limiting: Configurable delays between requests
Progress Tracking: Real-time status updates during execution
Error Recovery: Automatic retry with fallback models
Metrics Collection: Comprehensive performance data gathering
Git Branch Isolation: Safe concurrent execution testing for execution benchmarks

#### STORAGE REQUIREMENTS

Run Storage: 10-50KB per benchmark run
Results Storage: 5-20KB per model result
History Storage: 1-5MB for benchmark history
Cache Storage: 50-200MB for performance optimization cache

#### CONCURRENCY AND SAFETY

Atomic Operations: Safe concurrent benchmark execution
Resource Management: Memory and CPU usage optimization
Error Isolation: Failed model execution doesn't affect others
Data Integrity: Consistent result storage and retrieval
Git Safety: Execution benchmarks enforce serial execution to prevent git conflicts

Remember: Citizen, in the resource-constrained wasteland, benchmark commands are your strategic intelligence tools. They provide the data needed to optimize AI usage, control costs, and make informed decisions about which tools will keep your projects running efficiently. The Git Branch Isolation system means you can test fearlessly—no risk to your working code. Use them to measure, analyze, and optimize your AI-powered development workflow.

DOCUMENT STATUS: Updated

NEXT REVIEW: After AI provider updates

CONTACT: Task-O-Matic Benchmark Team

OPEN QUESTIONS / TODO:

[ ] Add support for custom metric definitions in benchmark runs
[ ] Implement benchmark result export to CSV/JSON formats
[ ] Add historical trend visualization
[ ] Implement cost budget enforcement with alerts
[ ] Add quality scoring algorithm for comparing model outputs

BENCHMARK COMMANDS

COMPLETE DOCUMENTATION

TECHNICAL BULLETIN NO. 005

BENCHMARK COMMANDS - PERFORMANCE ANALYSIS FIELD OPERATIONS

⚠️ CRITICAL SURVIVAL NOTICE

COMMAND ARCHITECTURE OVERVIEW

COMPLETE BENCHMARK COMMAND DOCUMENTATION

RUN COMMAND

COMMAND SIGNATURE

REQUIRED ARGUMENTS

REQUIRED OPTIONS

GENERAL OPTIONS

TASK CREATION OPTIONS

PRD CREATION OPTIONS

PRD PROCESSING OPTIONS

PRD REFINEMENT OPTIONS

BENCHMARK RUN EXAMPLES

MODEL FORMAT SPECIFICATIONS

BENCHMARK RUN OUTPUT FORMAT

PERFORMANCE METRICS

ERROR CONDITIONS

LIST COMMAND

COMMAND SIGNATURE

LIST COMMAND EXAMPLES

LIST OUTPUT FORMAT

BENCHMARK HISTORY ANALYSIS

OPERATIONS COMMAND

COMMAND SIGNATURE

OPERATIONS COMMAND EXAMPLES

OPERATIONS OUTPUT FORMAT

OPERATION CATEGORIES

SHOW COMMAND

COMMAND SIGNATURE

REQUIRED ARGUMENTS

SHOW COMMAND EXAMPLES

SHOW OUTPUT FORMAT

COMPARE COMMAND

COMMAND SIGNATURE

REQUIRED ARGUMENTS

COMPARE COMMAND EXAMPLES

COMPARE OUTPUT FORMAT

EXECUTION COMMAND ⚡ NEW

COMMAND SIGNATURE

⚠️ CRITICAL SURVIVAL FEATURE

REQUIRED OPTIONS

EXECUTION OPTIONS

EXECUTION COMMAND EXAMPLES

EXECUTION OUTPUT FORMAT

GIT BRANCH ISOLATION EXPLAINED

EXECUTION PERFORMANCE METRICS

ERROR CONDITIONS

EXECUTE-LOOP COMMAND ⚡ NEW

COMMAND SIGNATURE

⚠️ CRITICAL SURVIVAL FEATURE

REQUIRED OPTIONS

TASK FILTERING OPTIONS

EXECUTE-LOOP OPTIONS

EXECUTE-LOOP COMMAND EXAMPLES

EXECUTE-LOOP OUTPUT FORMAT

PROGRESSIVE MODEL ESCALATION

EXECUTE-LOOP PERFORMANCE METRICS

WORKFLOW COMMAND

COMMAND SIGNATURE

REQUIRED OPTIONS

WORKFLOW CONFIGURATION OPTIONS

BENCHMARK SPECIFIC OPTIONS

PROJECT INITIALIZATION OPTIONS

PRD DEFINITION OPTIONS

TASK GENERATION OPTIONS

WORKFLOW BENCHMARK EXAMPLES

WORKFLOW BENCHMARK OUTPUT

WORKFLOW PERFORMANCE METRICS

FIELD OPERATIONS PROTOCOLS

SURVIVAL SCENARIOS

TECHNICAL SPECIFICATIONS