TECHNICAL BULLETIN NO. 005
BENCHMARK COMMANDS - PERFORMANCE ANALYSIS FIELD OPERATIONS
task-o-matic-cli-benchmark-commands-v1
All Personnel
MANDATORY COMPLIANCE: Yes
⚠️ CRITICAL SURVIVAL NOTICE
Citizen, benchmark commands are your intelligence gathering tools in the post-deadline wasteland. Without proper performance analysis, you're flying blind into AI provider selection storms. These commands provide the data needed to make informed decisions about which AI models and tools will keep your projects alive.
COMMAND ARCHITECTURE OVERVIEW
The benchmark command group represents the performance analysis and optimization layer of Task-O-Matic. It provides comprehensive testing capabilities across multiple AI providers, models, and workflows. The architecture supports concurrent testing, progressive model escalation, and detailed performance metrics collection.
Performance Analysis Components:
- Multi-Provider Testing: Compare performance across OpenRouter, Anthropic, OpenAI, and custom endpoints
- Model Benchmarking: Detailed performance metrics for individual AI models
- Workflow Analysis: End-to-end workflow performance across complete project lifecycles
- Progressive Escalation: Automatic retry with better models on failure (see the sketch after this list)
- Comprehensive Metrics: Latency, throughput, cost, and quality measurements
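The progressive escalation behavior can be pictured as an ordered retry loop, sketched below in TypeScript. This is an illustrative sketch only; the attempt callback and the model ordering are assumptions, not the actual Task-O-Matic implementation.
// Minimal sketch of progressive model escalation (illustrative, not the real CLI code).
type ModelId = string; // e.g. "openrouter:anthropic/claude-3.5-sonnet"

async function runWithEscalation<T>(
  models: ModelId[],                        // ordered from cheapest to most capable
  attempt: (model: ModelId) => Promise<T>,  // hypothetical per-model operation runner
): Promise<T> {
  let lastError: unknown;
  for (const model of models) {
    try {
      return await attempt(model);          // success: stop escalating
    } catch (error) {
      lastError = error;                    // failure: escalate to the next model
    }
  }
  throw lastError;                          // every model in the chain failed
}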
COMPLETE BENCHMARK COMMAND DOCUMENTATION
RUN COMMAND
Command: task-o-matic benchmark run
COMMAND SIGNATURE
task-o-matic benchmark run <operation> --models <list> [options]
REQUIRED ARGUMENTS
<operation> # Operation to benchmark (prd-parse, task-breakdown, task-create, prd-create, etc.)
REQUIRED OPTIONS
--models <list> # Comma-separated list of models (provider:model[:reasoning=<tokens>])
GENERAL OPTIONS
--file <path> # Input file path (for PRD ops)
--task-id <id> # Task ID (for Task ops)
--concurrency <number> # Max concurrent requests (default: 5)
--delay <number> # Delay between requests in ms (default: 250)
--prompt <prompt> # Override prompt
--message <message> # User message
--tools # Enable filesystem tools
--feedback <feedback> # Feedback (for prd-rework)
TASK CREATION OPTIONS
--title <title> # Task title (for task-create)
--content <content> # Task content (for task-create)
--parent-id <id> # Parent task ID (for task-create)
--effort <effort> # Effort estimate: small, medium, large (for task-create)
--force # Force operation (for task-document)
PRD CREATION OPTIONS
--description <desc> # Project/PRD description (for prd-create, prd-combine)
--output-dir <dir> # Output directory (for prd-create, prd-combine)
--filename <name> # Output filename (for prd-create, prd-combine)
PRD PROCESSING OPTIONS
--prds <list> # Comma-separated list of PRD file paths (for prd-combine)
PRD REFINEMENT OPTIONS
--question-mode <mode> # Question mode: user or ai (for prd-refine)
--answers <json> # JSON string of answers (for prd-refine user mode)
BENCHMARK RUN EXAMPLES
#### Basic Benchmarking
# Simple PRD parsing benchmark
task-o-matic benchmark run prd-parse \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4"
# Task creation benchmark
task-o-matic benchmark run task-create \
--models "openrouter:anthropic/claude-3.5-sonnet,openrouter:openai/gpt-4o" \
--title "Build radiation detector" \
--content "Implement Geiger counter integration" \
--effort medium
# PRD creation benchmark
task-o-matic benchmark run prd-create \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4,google:gemini-pro" \
--description "Emergency shelter management system"
#### Advanced Benchmarking with Reasoning
# Benchmark with reasoning tokens
task-o-matic benchmark run task-breakdown \
--models "openrouter:anthropic/claude-3.5-sonnet:reasoning=2048,openrouter:openai/o1-preview:reasoning=8192" \
--task-id task-complex-123 \
--tools
# Multi-model PRD generation with reasoning
task-o-matic benchmark run prd-create \
--models "anthropic:claude-opus-2024:reasoning=4096,openrouter:anthropic/claude-3.5-sonnet:reasoning=2048" \
--description "AI-powered emergency response system"
# Task enhancement with reasoning
task-o-matic benchmark run task-enhance \
--models "openrouter:anthropic/claude-3.5-sonnet:reasoning=1024" \
--task-id task-123 \
--tools
#### Workflow Benchmarking
# Complete workflow benchmark
task-o-matic benchmark run workflow-full \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4" \
--project-name "Wasteland Shelter" \
--init-method ai \
--project-description "Emergency shelter management system"
# Multi-step workflow benchmark
task-o-matic benchmark run workflow-full \
--models "openrouter:anthropic/claude-3.5-sonnet,openrouter:openai/gpt-4o,openrouter:google/gemini-2.0-flash-exp" \
--auto-accept \
--skip-refine \
--generate-method ai \
--split-all
#### Custom Operation Benchmarking
# PRD rework benchmark
task-o-matic benchmark run prd-rework \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4" \
--file ./requirements.md \
--feedback "Add more security requirements and technical specifications"
# PRD question generation benchmark
task-o-matic benchmark run prd-question \
--models "openrouter:anthropic/claude-3.5-sonnet,openai:gpt-4" \
--file ./prd.md \
--tools
# PRD refine benchmark
task-o-matic benchmark run prd-refine \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4" \
--question-mode ai \
--file ./prd.md
MODEL FORMAT SPECIFICATIONS
#### Basic Model Format
provider:model
#### Model Format Examples
# Anthropic models
anthropic:claude-3.5-sonnet
anthropic:claude-opus-2024
# OpenAI models
openai:gpt-4
openai:gpt-4-turbo
openai:o1-preview
# OpenRouter models
openrouter:anthropic/claude-3.5-sonnet
openrouter:openai/gpt-4o
openrouter:google/gemini-2.0-flash-exp
# Custom endpoints
custom:http://localhost:8080/v1/custom-model
#### Advanced Model Format with Reasoning
provider:model:reasoning=<tokens>
#### Reasoning Examples
# OpenRouter with reasoning
openrouter:anthropic/claude-3.5-sonnet:reasoning=2048
openrouter:openai/o1-preview:reasoning=8192
openrouter:google/gemini-2.0-flash-exp:reasoning=4096
# OpenAI with reasoning (if supported)
openai:o1-preview:reasoning=4096
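To make the grammar concrete, here is a small TypeScript sketch of how a provider:model[:reasoning=<tokens>] string could be parsed into the BenchmarkModelConfig shape documented under TECHNICAL SPECIFICATIONS. The function is illustrative, not the CLI's actual parser.
interface BenchmarkModelConfig {
  provider: string;          // e.g. "openrouter"
  model: string;             // e.g. "anthropic/claude-3.5-sonnet"
  reasoningTokens?: number;  // optional reasoning token limit
}

// Illustrative parser; the real CLI validation may differ.
function parseModelSpec(spec: string): BenchmarkModelConfig {
  const firstColon = spec.indexOf(":");
  if (firstColon <= 0) {
    throw new Error(`Invalid model format: ${spec}. Expected provider:model[:reasoning=<tokens>]`);
  }
  const provider = spec.slice(0, firstColon);
  const remainder = spec.slice(firstColon + 1);
  const config: BenchmarkModelConfig = { provider, model: remainder };
  // Only a trailing ":reasoning=<tokens>" suffix is treated specially, because
  // custom endpoint URLs may legitimately contain ":" themselves.
  const match = /:reasoning=(\d+)$/.exec(remainder);
  if (match) {
    config.model = remainder.slice(0, match.index);
    config.reasoningTokens = Number(match[1]);
  }
  if (!config.model) {
    throw new Error(`Invalid model format: ${spec}. Expected provider:model[:reasoning=<tokens>]`);
  }
  return config;
}

// parseModelSpec("openrouter:anthropic/claude-3.5-sonnet:reasoning=2048")
// -> { provider: "openrouter", model: "anthropic/claude-3.5-sonnet", reasoningTokens: 2048 }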
BENCHMARK RUN OUTPUT FORMAT
#### Real-time Progress Display
Benchmark Progress:
- anthropic:claude-3.5-sonnet: Starting...
- openai:gpt-4: Running... Size: 2048 B, Speed: 15.2 B/s
- anthropic:claude-3.5-sonnet: Completed (2340ms)
- openai:gpt-4: Completed (3120ms)
#### Final Results Table
Model | Duration | TTFT | Tokens | TPS | BPS | Size | Cost
-------------------------------------------|----------|----------|----------|---------|---------|----------|----------
anthropic:claude-3.5-sonnet | 2340ms | 120ms | 1250 | 0.53 | 89 | 208 | $0.004250
openai:gpt-4 | 3120ms | 180ms | 980 | 0.31 | 67 | 195 | $0.039200
openrouter:anthropic/claude-3.5-sonnet | 2450ms | 135ms | 1180 | 0.48 | 85 | 202 | $0.004720
PERFORMANCE METRICS
#### Measured Metrics
- Duration: Total execution time in milliseconds
- TTFT (Time to First Token): Response latency measurement
- Tokens: Total token usage (prompt + completion)
- TPS (Tokens Per Second): Processing speed measurement
- BPS (Bytes Per Second): Data transfer rate
- Response Size: Output size in bytes
- Cost: Estimated API cost in USD
#### Metric Calculations
# TPS Calculation
TPS = Total Tokens / (Duration / 1000)
# BPS Calculation
BPS = Response Size / (Duration / 1000)
# Cost Estimation
Cost = (Prompt Tokens * Prompt Rate) + (Completion Tokens * Completion Rate)
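Expressed as code, the same calculations look roughly like the following TypeScript sketch, built directly from the formulas above; the pricing figures in the usage comment are placeholders, not real provider rates.
interface TokenUsage { prompt: number; completion: number; total: number; }

// USD per token; supply real provider pricing here (the values below are placeholders).
interface Pricing { promptRate: number; completionRate: number; }

function tokensPerSecond(totalTokens: number, durationMs: number): number {
  return totalTokens / (durationMs / 1000);
}

function bytesPerSecond(responseSizeBytes: number, durationMs: number): number {
  return responseSizeBytes / (durationMs / 1000);
}

function estimateCost(usage: TokenUsage, pricing: Pricing): number {
  return usage.prompt * pricing.promptRate + usage.completion * pricing.completionRate;
}

// Example with placeholder pricing:
// estimateCost({ prompt: 800, completion: 450, total: 1250 },
//              { promptRate: 0.000003, completionRate: 0.000015 })  // ≈ $0.00915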
ERROR CONDITIONS
# Invalid model format
Error: Invalid model format: invalid-model. Expected provider:model[:reasoning=<tokens>]
Solution: Use format "provider:model" or "provider:model:reasoning=tokens"
# Operation not found
Error: Benchmark operation 'invalid-op' not found
Solution: Use 'benchmark operations' to see available operations
# Concurrency limit exceeded
Error: Too many concurrent requests
Solution: Reduce --concurrency value or increase --delay
# Rate limiting
Error: Rate limit exceeded for provider
Solution: Increase --delay between requests
LIST COMMAND
Command: task-o-matic benchmark list
COMMAND SIGNATURE
task-o-matic benchmark list
LIST COMMAND EXAMPLES
#### Basic Listing
# List all benchmark runs
task-o-matic benchmark list
# List with details
task-o-matic benchmark list --verbose
LIST OUTPUT FORMAT
#### Standard Output
Benchmark Runs:
- 2024-01-15_14-30-22_prd-parse (2024-01-15 2:30:22 PM) - prd-parse
- 2024-01-15_13-45-10_task-create (2024-01-15 1:45:10 PM) - task-create
- 2024-01-14_16-20-45_workflow-full (2024-01-14 4:20:45 PM) - workflow-full
#### Detailed Output with Metadata
Benchmark Runs:
ID: 2024-01-15_14-30-22_prd-parse
Date: 2024-01-15 2:30:22 PM
Operation: prd-parse
Models: anthropic:claude-3.5-sonnet, openai:gpt-4
Duration: 5460ms
Success Rate: 100%
Average TTFT: 145ms
Total Tokens: 2450
Estimated Cost: $0.043650
ID: 2024-01-15_13-45-10_task-create
Date: 2024-01-15 1:45:10 PM
Operation: task-create
Models: openrouter:anthropic/claude-3.5-sonnet
Duration: 2340ms
Success Rate: 100%
Average TTFT: 120ms
Total Tokens: 1250
Estimated Cost: $0.004250
BENCHMARK HISTORY ANALYSIS
#### Performance Trends
- Model Comparison: Compare performance across different runs
- Temporal Analysis: Identify performance changes over time
- Cost Tracking: Monitor API usage costs
- Success Rates: Track reliability and consistency
#### Run Filtering
# Filter by operation type
task-o-matic benchmark list | grep prd-parse
# Filter by date range
task-o-matic benchmark list | grep "2024-01-1"
# Filter by model
task-o-matic benchmark list | grep "claude-3.5-sonnet"
OPERATIONS COMMAND
Command: task-o-matic benchmark operations
COMMAND SIGNATURE
task-o-matic benchmark operations
OPERATIONS COMMAND EXAMPLES
#### Basic Operations Listing
# List all available operations
task-o-matic benchmark operations
# List with categories
task-o-matic benchmark operations --categorized
OPERATIONS OUTPUT FORMAT
#### Categorized Operations List
📊 Available Benchmark Operations
Task Operations:
task-create - Create a new task with AI enhancement
task-enhance - Enhance an existing task with AI
task-breakdown - Break task into subtasks using AI
task-document - Fetch and analyze documentation for task
PRD Operations:
prd-create - Generate PRD from product description
prd-parse - Parse PRD into structured tasks
prd-rework - Rework PRD based on feedback
prd-combine - Combine multiple PRDs into master PRD
prd-question - Generate clarifying questions for PRD
prd-refine - Refine PRD by answering questions
Workflow Operations:
workflow-full - Complete workflow execution
workflow-init - Project initialization step
workflow-prd - PRD creation and refinement
workflow-generate - Task generation from PRD
workflow-split - Task splitting into subtasks
Total operations: 15
#### Operation Details
# Get details for specific operation
task-o-matic benchmark operations --details task-create
# Search operations by keyword
task-o-matic benchmark operations --search "enhance"
# Filter by category
task-o-matic benchmark operations --category task
OPERATION CATEGORIES
#### Task Operations
- Creation: AI-powered task generation and enhancement
- Enhancement: Task description improvement with Context7
- Breakdown: Intelligent task decomposition into subtasks
- Documentation: Automatic documentation research and integration
#### PRD Operations
- Generation: AI-powered PRD creation from descriptions
- Parsing: Structured task extraction from PRD documents
- Refinement: PRD improvement through feedback and Q&A
- Combination: Multiple PRD synthesis into master documents
#### Workflow Operations
- End-to-End: Complete project lifecycle benchmarking
- Step Analysis: Individual workflow component performance
- Integration: Cross-component performance measurement
SHOW COMMAND
Command: task-o-matic benchmark show
COMMAND SIGNATURE
task-o-matic benchmark show <id>
REQUIRED ARGUMENTS
<id> # Run ID to display details for (required)
SHOW COMMAND EXAMPLES
#### Basic Run Display
# Show benchmark run details
task-o-matic benchmark show 2024-01-15_14-30-22_prd-parse
# Show workflow benchmark details
task-o-matic benchmark show 2024-01-14_16-20-45_workflow-full
#### Detailed Analysis Examples
# Show run with full configuration
task-o-matic benchmark show 2024-01-15_13-45-10_task-create --verbose
# Show specific model results
task-o-matic benchmark show 2024-01-15_14-30-22_prd-parse --model anthropic:claude-3.5-sonnet
# Show error analysis
task-o-matic benchmark show 2024-01-15_11-30-15_prd-parse --errors
SHOW OUTPUT FORMAT
#### Standard Run Display
Run: 2024-01-15_14-30-22_prd-parse
Date: 2024-01-15 2:30:22 PM
Operation: prd-parse
Configuration:
Concurrency: 5
Delay: 250ms
Models: anthropic:claude-3.5-sonnet, openai:gpt-4
Input:
File: ./requirements.md
Prompt: Default PRD parsing prompt
Message: Parse PRD into structured tasks
Results:
anthropic:claude-3.5-sonnet:
Duration: 2340ms
TTFT: 120ms
Tokens: 1250 (Prompt: 800, Completion: 450)
TPS: 0.53
BPS: 89
Size: 208 bytes
Cost: $0.004250
Status: Success
openai:gpt-4:
Duration: 3120ms
TTFT: 180ms
Tokens: 980 (Prompt: 750, Completion: 230)
TPS: 0.31
BPS: 67
Size: 195 bytes
Cost: $0.039200
Status: Success
Summary:
Total Duration: 5460ms
Average TTFT: 150ms
Total Tokens: 2230
Total Cost: $0.043450
Success Rate: 100%
#### Error Analysis Display
# Show run with error details
task-o-matic benchmark show 2024-01-15_11-30-15_prd-parse --errors
# Show specific model errors
task-o-matic benchmark show 2024-01-15_14-30-22_prd-parse --model-errors
COMPARE COMMAND
Command: task-o-matic benchmark compare
COMMAND SIGNATURE
task-o-matic benchmark compare <id>
REQUIRED ARGUMENTS
<id> # Run ID to compare results for (required)
COMPARE COMMAND EXAMPLES
#### Basic Comparison
# Compare benchmark run results
task-o-matic benchmark compare 2024-01-15_14-30-22_prd-parse
# Compare workflow benchmark
task-o-matic benchmark compare 2024-01-14_16-20-45_workflow-full
#### Advanced Comparison Analysis
# Compare with detailed metrics
task-o-matic benchmark compare 2024-01-15_13-45-10_task-create --detailed
# Compare specific models
task-o-matic benchmark compare 2024-01-15_14-30-22_prd-parse --models anthropic:claude-3.5-sonnet,openai:gpt-4
COMPARE OUTPUT FORMAT
#### Comparison Table
Comparison for Run: 2024-01-15_14-30-22_prd-parse
Model | Status | Duration | Tokens | Cost | Efficiency
------------------------|--------|----------|---------|-----------|----------
anthropic:claude-3.5-sonnet | SUCCESS | 2340ms | 1250 | $0.004250 | 0.53 TPS
openai:gpt-4 | SUCCESS | 3120ms | 980 | $0.039200 | 0.31 TPS
Performance Analysis:
- Fastest Model: anthropic:claude-3.5-sonnet (2340ms)
- Most Cost-Effective: anthropic:claude-3.5-sonnet ($0.004250)
- Highest Throughput: anthropic:claude-3.5-sonnet (0.53 TPS)
- Best Value: anthropic:claude-3.5-sonnet (Speed + Cost balance)
#### Statistical Analysis
# Performance distribution
Duration: Mean=2730ms, Median=2340ms, Min=2340ms, Max=3120ms
Tokens: Mean=1115, Median=1250, Min=980, Max=1250
Cost: Mean=$0.021725, Median=$0.004250, Min=$0.004250, Max=$0.039200
Success Rate: 100% (2/2 models successful)
Error Rate: 0% (0/2 models failed)
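A compact TypeScript sketch of how such distribution statistics and success rates can be derived from a run's per-model results (illustrative; not the CLI's own aggregation code):
interface Summary { mean: number; median: number; min: number; max: number; }

function summarize(values: number[]): Summary {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median = sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
  return {
    mean: sorted.reduce((sum, v) => sum + v, 0) / sorted.length,
    median,
    min: sorted[0],
    max: sorted[sorted.length - 1],
  };
}

// Success rate across models, e.g. 2/2 successful -> 100.
function successRate(results: { error?: string }[]): number {
  const successes = results.filter((r) => !r.error).length;
  return (successes / results.length) * 100;
}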
WORKFLOW COMMAND
Command: task-o-matic benchmark workflow
COMMAND SIGNATURE
task-o-matic benchmark workflow --models <list> [options]
REQUIRED OPTIONS
--models <list> # Comma-separated list of models (required)
WORKFLOW CONFIGURATION OPTIONS
--concurrency <number> # Max concurrent requests (default: 3)
--delay <number> # Delay between requests in ms (default: 1000)
--stream # Show streaming AI output
--skip-all # Skip all optional steps (use defaults)
--auto-accept # Auto-accept all AI suggestions
--config-file <path> # Load workflow options from JSON file
WORKFLOW STEP OPTIONS
#### Initialization Options
--skip-init # Skip initialization step
--project-name <name> # Project name
--init-method <method> # Initialization method: quick, custom, ai
--project-description <desc> # Project description for AI-assisted init
--use-existing-config # Use existing configuration if found
--frontend <framework> # Frontend framework
--backend <framework> # Backend framework
--database <db> # Database choice
--auth # Include authentication
--no-auth # Exclude authentication
--bootstrap # Bootstrap with Better-T-Stack
--no-bootstrap # Skip bootstrapping
#### PRD Definition Options
--skip-prd # Skip PRD definition
--prd-method <method> # PRD method: upload, manual, ai, skip
--prd-file <path> # Path to existing PRD file
--prd-description <desc> # Product description for AI-assisted PRD
--prd-content <content> # Direct PRD content
#### PRD Refinement Options
--skip-refine # Skip PRD refinement
--refine-method <method> # Refinement method: manual, ai, skip
--refine-feedback <feedback> # Feedback for AI refinement
#### Task Generation Options
--skip-generate # Skip task generation
--generate-method <method> # Generation method: standard, ai
--generate-instructions <instructions> # Custom task generation instructions
#### Task Splitting Options
--skip-split # Skip task splitting
--split-tasks <ids> # Comma-separated task IDs to split
--split-all # Split all tasks
--split-method <method> # Split method: interactive, standard, custom
--split-instructions <instructions> # Custom split instructions
WORKFLOW BENCHMARK EXAMPLES
#### Basic Workflow Benchmark
# Simple workflow benchmark
task-o-matic benchmark workflow \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4"
# Workflow with custom settings
task-o-matic benchmark workflow \
--models "openrouter:anthropic/claude-3.5-sonnet,openrouter:openai/gpt-4o" \
--concurrency 2 \
--delay 2000 \
--auto-accept
# Workflow with specific steps
task-o-matic benchmark workflow \
--models "anthropic:claude-3.5-sonnet" \
--project-name "Wasteland Shelter" \
--init-method ai \
--project-description "Emergency shelter management" \
--skip-refine \
--generate-method ai \
--split-all
#### Advanced Workflow Scenarios
# Multi-model workflow with reasoning
task-o-matic benchmark workflow \
--models "anthropic:claude-opus-2024:reasoning=4096,openrouter:anthropic/claude-3.5-sonnet:reasoning=2048,openrouter:openai/o1-preview:reasoning=8192" \
--concurrency 1 \
--delay 5000
# Workflow from configuration file
task-o-matic benchmark workflow \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4" \
--config-file ./workflow-config.json
# Workflow with selective steps
task-o-matic benchmark workflow \
--models "openrouter:anthropic/claude-3.5-sonnet" \
--skip-init \
--prd-method upload \
--prd-file ./existing-prd.md \
--generate-method standard \
--split-tasks "task-123,task-456"
WORKFLOW BENCHMARK OUTPUT
#### Real-time Progress
🚀 Task-O-Matic Workflow Benchmark
Models: 3, Concurrency: 2, Delay: 2000ms
- anthropic:claude-opus-2024:reasoning=4096: Starting...
- openrouter:anthropic/claude-3.5-sonnet: Starting...
- openrouter:openai/gpt-4o: Starting...
Step 1: Initialize
- anthropic:claude-opus-2024:reasoning=4096: Running... Project setup
- openrouter:anthropic/claude-3.5-sonnet: Running... Project setup
- openrouter:openai/gpt-4o: Running... Project setup
Step 2: Define PRD
- anthropic:claude-opus-2024:reasoning=4096: Completed (5430ms)
- openrouter:anthropic/claude-3.5-sonnet: Completed (6120ms)
- openrouter:openai/gpt-4o: Completed (7890ms)
Step 3: Refine PRD
- anthropic:claude-opus-2024:reasoning=4096: Skipped
- openrouter:anthropic/claude-3.5-sonnet: Skipped
- openrouter:openai/gpt-4o: Skipped
Step 4: Generate Tasks
- anthropic:claude-opus-2024:reasoning=4096: Running... Task generation
- openrouter:anthropic/claude-3.5-sonnet: Running... Task generation
- openrouter:openai/gpt-4o: Running... Task generation
Step 5: Split Tasks
- anthropic:claude-opus-2024:reasoning=4096: Running... Task splitting
- openrouter:anthropic/claude-3.5-sonnet: Running... Task splitting
- openrouter:openai/gpt-4o: Running... Task splitting
#### Final Results Summary
✅ Workflow benchmark completed! Run ID: 2024-01-15_14-30-22_workflow-full
📊 Workflow Benchmark Results
Model | Duration | Tasks | PRD Size | Steps | Cost
-----------------------------------|----------|--------|----------|-------|--------
anthropic:claude-opus-2024:reasoning=4096 | 15420ms | 12 | 2.1KB | 5/5 | $0.156
openrouter:anthropic/claude-3.5-sonnet | 18930ms | 10 | 1.8KB | 5/5 | $0.089
openrouter:openai/gpt-4o | 22150ms | 8 | 1.6KB | 5/5 | $0.124
Performance Analysis:
- Fastest Overall: anthropic:claude-opus-2024:reasoning=4096 (15420ms)
- Best Task Generation: anthropic:claude-opus-2024 (12 tasks)
- Most Cost-Effective: openrouter:anthropic/claude-3.5-sonnet ($0.089)
- Highest Quality: anthropic:claude-opus-2024 (5/5 steps completed)
WORKFLOW PERFORMANCE METRICS
#### Workflow Step Analysis
- Initialization Time: Project setup and configuration
- PRD Generation Time: PRD creation and processing
- Task Generation Time: Task extraction from PRD
- Task Splitting Time: Task decomposition into subtasks
- Total Workflow Time: End-to-end completion time
#### Quality Assessment
- Step Completion Rate: Percentage of workflow steps completed successfully
- Task Quality: Number and quality of generated tasks
- PRD Coherence: Logical consistency and completeness
- Integration Success: How well components work together
#### Cost Analysis
- Per-Step Costs: Individual cost breakdown for each workflow step
- Total Cost: Aggregate cost across entire workflow
- Cost Efficiency: Cost per task or per workflow step
- ROI Analysis: Cost vs. output quality relationship
ERROR CONDITIONS
# Invalid model format in workflow
Error: Invalid model format: invalid-model. Expected provider:model[:reasoning=<tokens>]
Solution: Use correct format for all models in workflow
# Concurrency too high for workflow
Error: Workflow concurrency too high for system resources
Solution: Reduce --concurrency or increase system resources
# Workflow step failure
Error: Workflow step 'initialize' failed for model anthropic:claude-3.5-sonnet
Solution: Check model configuration and retry with different model
# Partial workflow completion
Warning: Some workflow steps failed for certain models
Solution: Review individual step results and adjust configuration
FIELD OPERATIONS PROTOCOLS
#### BENCHMARK LIFECYCLE MANAGEMENT
The benchmark commands implement a complete performance testing lifecycle:
- Test Planning Phase
  - Operation selection and parameter configuration
  - Model selection and format validation
  - Concurrency and delay optimization
  - Test data and input preparation
- Execution Phase
  - Concurrent model execution with rate limiting
  - Real-time progress tracking and status updates
  - Error handling and retry logic
  - Performance metrics collection
- Analysis Phase
  - Result aggregation and statistical analysis
  - Performance comparison and ranking
  - Cost analysis and optimization recommendations
  - Error pattern identification and reporting
- Reporting Phase
  - Detailed result display with multiple format options
  - Historical trend analysis and comparison
  - Export capabilities for external analysis
  - Visual performance charts and graphs
#### PERFORMANCE OPTIMIZATION STRATEGIES
All benchmark operations support performance optimization through:
- Model Selection: Data-driven model recommendations based on historical performance
- Parameter Tuning: Automatic optimization of concurrency and delay settings
- Cost Management: Real-time cost tracking and budget enforcement
- Quality Balancing: Trade-off analysis between speed, cost, and quality
- Resource Efficiency: Optimal utilization of system resources
#### INTEGRATION PATTERNS
Benchmark operations integrate with other Task-O-Matic components:
- Service Integration: Direct calls to TaskService, PRDService, WorkflowService
- Configuration Management: Unified configuration system across all operations
- Storage Integration: Results storage in the .task-o-matic/benchmark/ directory (see the sketch after this list)
- AI Provider Abstraction: Consistent interface across all supported AI providers
- Error Handling: Standardized error reporting and recovery mechanisms
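As an illustration of the storage integration, the sketch below lists and loads stored runs by reading that directory. It assumes each run is persisted as a JSON file named after its run ID, which is an assumption about the on-disk layout rather than a documented guarantee.
import { readdir, readFile } from "node:fs/promises";
import { join } from "node:path";

// Assumed layout: <projectRoot>/.task-o-matic/benchmark/<run-id>.json (illustrative only).
async function listBenchmarkRuns(projectRoot: string): Promise<string[]> {
  const dir = join(projectRoot, ".task-o-matic", "benchmark");
  const entries = await readdir(dir);
  return entries
    .filter((name) => name.endsWith(".json"))
    .map((name) => name.replace(/\.json$/, ""));
}

// Loads one run for analysis; the parsed shape would correspond to the
// BenchmarkRun interface documented under TECHNICAL SPECIFICATIONS.
async function loadRun(projectRoot: string, runId: string): Promise<unknown> {
  const file = join(projectRoot, ".task-o-matic", "benchmark", `${runId}.json`);
  return JSON.parse(await readFile(file, "utf8"));
}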
SURVIVAL SCENARIOS
#### SCENARIO 1: AI Model Selection for Project Setup
# Benchmark PRD parsing models
task-o-matic benchmark run prd-parse \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4,openrouter:anthropic/claude-3.5-sonnet,openrouter:openai/gpt-4o,google:gemini-pro" \
--file ./project-requirements.md
# Analyze results for model selection
task-o-matic benchmark show 2024-01-15_14-30-22_prd-parse
# Compare cost-effectiveness
task-o-matic benchmark compare 2024-01-15_14-30-22_prd-parse --metric cost-per-token
# Select best performing model
# Based on results: anthropic:claude-3.5-sonnet showed best balance of speed and cost
#### SCENARIO 2: Workflow Performance Optimization
# Benchmark complete workflow
task-o-matic benchmark workflow \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4" \
--project-name "Performance Test" \
--auto-accept
# Test different concurrency settings
task-o-matic benchmark workflow \
--models "openrouter:anthropic/claude-3.5-sonnet" \
--concurrency 1 \
--project-name "Concurrency Test 1"
task-o-matic benchmark workflow \
--models "openrouter:anthropic/claude-3.5-sonnet" \
--concurrency 3 \
--project-name "Concurrency Test 3"
task-o-matic benchmark workflow \
--models "openrouter:anthropic/claude-3.5-sonnet" \
--concurrency 5 \
--project-name "Concurrency Test 5"
# Analyze optimal concurrency
task-o-matic benchmark compare 2024-01-15_14-30-22_workflow-full \
--compare-with 2024-01-15_14-30-22_workflow-full_concurrency-1 \
--compare-with 2024-01-15_14-30-22_workflow-full_concurrency-3 \
--compare-with 2024-01-15_14-30-22_workflow-full_concurrency-5
#### SCENARIO 3: Cost Management and Budgeting
# Benchmark with cost tracking
task-o-matic benchmark run task-create \
--models "openrouter:anthropic/claude-3.5-sonnet,openrouter:openai/gpt-4o" \
--title "Cost analysis task" \
--budget 0.10
# Monitor cost over time
task-o-matic benchmark list | grep "Cost analysis"
task-o-matic benchmark show --cost-analysis
# Compare cost efficiency
task-o-matic benchmark compare cost-analysis-run-1 cost-analysis-run-2
# Set cost alerts
task-o-matic benchmark run task-create \
--models "openrouter:anthropic/claude-3.5-sonnet" \
--cost-alert 0.05
#### SCENARIO 4: Quality Assurance and Testing
# Benchmark for quality metrics
task-o-matic benchmark run prd-parse \
--models "anthropic:claude-3.5-sonnet,openai:gpt-4" \
--quality-metrics accuracy,coherence,completeness
# Test with different prompts
task-o-matic benchmark run prd-parse \
--models "openrouter:anthropic/claude-3.5-sonnet" \
--prompt "Detailed technical analysis required" \
--prompt "Quick overview needed" \
--prompt "Focus on security requirements"
# Compare prompt effectiveness
task-o-matic benchmark compare prompt-test-run-1 prompt-test-run-2
# Quality scoring analysis
task-o-matic benchmark show prompt-test-run-1 --quality-scores
TECHNICAL SPECIFICATIONS
#### BENCHMARK DATA MODEL
interface BenchmarkRun {
id: string; // Unique run identifier
timestamp: Date; // When benchmark was executed
operation: string; // Type of operation benchmarked
config: { // Benchmark configuration
models: BenchmarkModelConfig[];
concurrency: number;
delay: number;
};
input: any; // Input parameters for benchmark
results: BenchmarkResult[]; // Results for each model
}
interface BenchmarkResult {
modelId: string; // Model identifier (provider:model)
duration: number; // Execution time in milliseconds
timeToFirstToken?: number; // TTFT in milliseconds
tokenUsage?: { // Token usage breakdown
prompt: number;
completion: number;
total: number;
};
tps?: number; // Tokens per second
bps?: number; // Bytes per second
responseSize?: number; // Response size in bytes
cost?: number; // Estimated cost in USD
output?: any; // Operation output
error?: string; // Error message if failed
}
interface BenchmarkModelConfig {
provider: string; // AI provider name
model: string; // Model name
reasoningTokens?: number; // Reasoning token limit (if applicable)
}
#### PERFORMANCE CHARACTERISTICS
- Concurrent Execution: Support for multiple simultaneous model tests
- Rate Limiting: Configurable delays between requests (see the sketch after this list)
- Progress Tracking: Real-time status updates during execution
- Error Recovery: Automatic retry with fallback models
- Metrics Collection: Comprehensive performance data gathering
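A minimal TypeScript sketch of concurrency-limited execution with a fixed delay between request starts, mirroring --concurrency and --delay; this is an illustrative scheduler, not the actual Task-O-Matic implementation.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function runThrottled<T>(
  models: string[],
  job: (model: string) => Promise<T>,  // hypothetical per-model benchmark job
  concurrency = 5,                     // mirrors --concurrency
  delayMs = 250,                       // mirrors --delay
): Promise<T[]> {
  const results: T[] = new Array(models.length);
  const inFlight = new Set<Promise<void>>();

  for (let i = 0; i < models.length; i++) {
    // Respect the concurrency cap before starting another request.
    while (inFlight.size >= concurrency) {
      await Promise.race(inFlight);
    }
    const tracked: Promise<void> = job(models[i])
      .then((value) => { results[i] = value; })
      .finally(() => inFlight.delete(tracked));
    inFlight.add(tracked);
    // Space out request starts.
    if (i < models.length - 1) await sleep(delayMs);
  }

  await Promise.all(inFlight);
  return results;
}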
#### STORAGE REQUIREMENTS
- Run Storage: 10-50KB per benchmark run
- Results Storage: 5-20KB per model result
- History Storage: 1-5MB for benchmark history
- Cache Storage: 50-200MB for performance optimization cache
#### CONCURRENCY AND SAFETY
- Atomic Operations: Safe concurrent benchmark execution
- Resource Management: Memory and CPU usage optimization
- Error Isolation: Failed model execution doesn't affect others
- Data Integrity: Consistent result storage and retrieval
Remember: Citizen, in the resource-constrained wasteland, benchmark commands are your strategic intelligence tools. They provide the data needed to optimize AI usage, control costs, and make informed decisions about which tools will keep your projects running efficiently. Use them to measure, analyze, and optimize your AI-powered development workflow.
DOCUMENT STATUS: Complete
NEXT REVIEW: After AI provider updates
CONTACT: Task-O-Matic Benchmark Team