MAINFRAME TERMINAL #113

SERVICES DOCUMENTATION

File: workflow-benchmark.md

TECHNICAL BULLETIN NO. 005

WORKFLOW BENCHMARK SERVICE - PERFORMANCE EVALUATION SURVIVAL SYSTEM

DOCUMENT ID: task-o-matic-workflow-benchmark-v1
CLEARANCE: All Personnel

MANDATORY COMPLIANCE: Yes

⚠️ CRITICAL SURVIVAL NOTICE

Citizen, WorkflowBenchmarkService is your performance evaluation system in the post-deadline wasteland. Without proper benchmarking, your AI model choices will be as random as dust storms. This service provides isolated testing environments to compare AI performance across complete workflows.

SYSTEM ARCHITECTURE OVERVIEW

The WorkflowBenchmarkService executes complete project workflows in isolated environments to provide fair, comparable performance metrics across different AI models and configurations. It ensures clean separation between test runs and provides detailed analytics for model selection.

Core Dependencies:

  • WorkflowService: Complete workflow orchestration
  • File System: Temporary directory management and cleanup
  • Storage Layer: Task and project data management
  • Better-T-Stack CLI: Project bootstrapping integration

Benchmark Capabilities:

  • Isolated Execution: Separate environments for each model
  • Complete Workflows: Full project lifecycle testing
  • Performance Metrics: Detailed timing and quality measurements
  • Result Application: Apply selected benchmark to actual projects
  • Cleanup Management: Automatic temporary resource cleanup
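
A minimal end-to-end sketch of the typical lifecycle these capabilities describe: run the same benchmark input against two models, then apply the stronger run to a real project. The service instance name, the candidate models, and the "most tasks wins" heuristic are illustrative assumptions, not prescribed by the API documented below.

// Hypothetical lifecycle sketch; see the full API documentation below for parameter details.
const input = {
  collectedResponses: {
    projectName: "lifecycle-demo",
    initMethod: "ai",
    projectDescription: "Demo project for benchmarking",
    prdMethod: "ai",
    prdDescription: "Demo PRD description",
    generateTasks: true
  }
};

const candidates = [
  { aiProvider: "anthropic", aiModel: "claude-3-5-sonnet" },
  { aiProvider: "openai", aiModel: "gpt-4" }
];

const runs = [];
for (const aiOptions of candidates) {
  const output = await workflowBenchmarkService.executeWorkflow(input, aiOptions);
  runs.push({ modelId: `${aiOptions.aiProvider}:${aiOptions.aiModel}`, output });
}

// Deliberately simple heuristic: keep whichever run produced more tasks.
const best = runs.reduce((a, b) => (b.output.stats.totalTasks > a.output.stats.totalTasks ? b : a));
await workflowBenchmarkService.applyBenchmarkResult(best, "./my-project", input.collectedResponses);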

COMPLETE API DOCUMENTATION

#### executeWorkflow

async executeWorkflow(
  input: WorkflowBenchmarkInput,
  aiOptions: AIOptions,
  streamingOptions?: StreamingOptions
): Promise<WorkflowBenchmarkResult["output"]>

Parameters:

  • input (WorkflowBenchmarkInput, required): Benchmark input configuration
    - collectedResponses (object): User responses for workflow steps
      - projectName (string): Project name
      - initMethod ("quick" | "custom" | "ai"): Initialization method
      - projectDescription (string): Project description
      - stackConfig (object, optional): Stack configuration
      - prdMethod ("upload" | "manual" | "ai" | "skip"): PRD method
      - prdFile (string, optional): PRD file path
      - prdDescription (string, optional): PRD description
      - prdContent (string, optional): PRD content
      - refinePrd (boolean, optional): Whether to refine the PRD
      - refineFeedback (string, optional): PRD refinement feedback
      - generateTasks (boolean, optional): Whether to generate tasks
      - customInstructions (string, optional): Custom task instructions
      - splitTasks (boolean, optional): Whether to split tasks
      - tasksToSplit (string[], optional): Specific tasks to split
      - splitInstructions (string, optional): Splitting instructions
    - tempDirBase (string, optional): Base directory for temp files
  • aiOptions (AIOptions, required): AI configuration for this benchmark run
  • streamingOptions (StreamingOptions, optional): Streaming callbacks

Returns: WorkflowBenchmarkResult["output"] containing:

  • projectDir (string): Temporary project directory path
  • prdFile (string): Generated PRD file path
  • prdContent (string): Generated PRD content
  • tasks (Task[]): Array of generated tasks
  • stats (object): Performance statistics
    - initDuration (number): Initialization duration in ms
    - prdGenerationDuration (number): PRD generation duration in ms
    - prdRefinementDuration (number): PRD refinement duration in ms
    - taskGenerationDuration (number): Task generation duration in ms
    - taskSplittingDuration (number): Task splitting duration in ms
    - totalTasks (number): Total tasks created
    - tasksWithSubtasks (number): Tasks that have subtasks
    - avgTaskComplexity (number): Average task complexity score
    - prdSize (number): PRD content length
    - totalSteps (number): Total workflow steps executed
    - successfulSteps (number): Successfully completed steps

Error Conditions:

  • File system errors during temporary directory creation
  • Workflow execution failures from underlying services
  • Cleanup errors (non-fatal)
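
A hedged error-handling sketch for these conditions, assuming benchmarkInput and aiOptions are defined as in the example below; the service does not document specific error classes, so only the generic Error is inspected:

try {
  const output = await workflowBenchmarkService.executeWorkflow(benchmarkInput, aiOptions);
  console.log(`Benchmark finished with ${output.stats.totalTasks} tasks`);
} catch (error) {
  // Workflow and file-system failures surface here; cleanup errors are non-fatal,
  // are only logged by the service, and will not reach this handler.
  console.error("Benchmark run failed:", error instanceof Error ? error.message : error);
}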

Example: Basic Benchmark Execution

const benchmarkInput = {
  collectedResponses: {
    projectName: "benchmark-test",
    initMethod: "ai",
    projectDescription: "AI-powered task management application",
    prdMethod: "ai",
    prdDescription: "Task management with AI assistance and collaboration features",
    generateTasks: true,
    splitTasks: true
  },
  tempDirBase: "/tmp"
};

const result = await workflowBenchmarkService.executeWorkflow(
  benchmarkInput,
  {
    aiProvider: "anthropic",
    aiModel: "claude-3-5-sonnet"
  },
  {
    onChunk: (chunk) => console.log(`[BENCHMARK] ${chunk}`)
  }
);

console.log(`Benchmark completed:`);
console.log(`- Project: ${result.projectDir}`);
console.log(`- PRD size: ${result.stats.prdSize} chars`);
console.log(`- Tasks created: ${result.stats.totalTasks}`);
console.log(`- Total duration: ${result.stats.initDuration + result.stats.prdGenerationDuration + result.stats.taskGenerationDuration}ms`);

Example: Multi-Model Comparison

const models = [
  { aiProvider: "anthropic", aiModel: "claude-3-5-sonnet" },
  { aiProvider: "openai", aiModel: "gpt-4" },
  { aiProvider: "openrouter", aiModel: "anthropic/claude-3-opus" }
];

const benchmarkInput = {
  collectedResponses: {
    projectName: "model-comparison",
    initMethod: "quick",
    prdMethod: "ai",
    prdDescription: "Real-time collaboration platform",
    generateTasks: true,
    splitTasks: true
  }
};

const results = await Promise.all(
  models.map(model => 
    workflowBenchmarkService.executeWorkflow(benchmarkInput, model)
  )
);

results.forEach((result, index) => {
  const model = models[index];
  console.log(`\n${model.aiProvider}/${model.aiModel}:`);
  console.log(`  Tasks: ${result.stats.totalTasks}`);
  console.log(`  Duration: ${result.stats.initDuration + result.stats.prdGenerationDuration}ms`);
  console.log(`  Success rate: ${result.stats.successfulSteps}/${result.stats.totalSteps}`);
});

#### applyBenchmarkResult

async applyBenchmarkResult(
  selectedResult: WorkflowBenchmarkResult,
  targetProjectDir: string,
  originalResponses: WorkflowBenchmarkInput["collectedResponses"]
): Promise<{ success: boolean; message: string }>

Parameters:

  • selectedResult (WorkflowBenchmarkResult, required): Benchmark result to apply
    - modelId (string): Model identifier (e.g., "anthropic:claude-3-5-sonnet")
    - output (object): Benchmark output data
  • targetProjectDir (string, required): Target project directory
  • originalResponses (object, required): Original user responses for workflow

Returns: Application result containing:

  • success (boolean): Whether application succeeded
  • message (string): Status message

Error Conditions:

  • File system errors during project creation
  • Task creation failures
  • Configuration errors
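
The documentation above does not state whether these conditions throw or are reported through success: false, so a defensive caller might guard both paths. A sketch, assuming selectedResult and originalResponses already exist:

try {
  const outcome = await workflowBenchmarkService.applyBenchmarkResult(
    selectedResult,
    "./target-project",
    originalResponses
  );
  if (outcome.success) {
    console.log(outcome.message);
  } else {
    // Failure reported through the return value rather than an exception
    console.error("Application reported failure:", outcome.message);
  }
} catch (error) {
  // Failure surfaced as a thrown error (e.g., file system problems)
  console.error("Application threw:", error instanceof Error ? error.message : error);
}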

Example: Apply Best Benchmark Result

// Assume we have benchmark results from multiple models
const benchmarkResults = [
  { modelId: "anthropic:claude-3-5-sonnet", output: claudeResult },
  { modelId: "openai:gpt-4", output: gpt4Result },
  { modelId: "openrouter:claude-3-opus", output: opusResult }
];

// Select best result based on criteria
const bestResult = benchmarkResults.reduce((best, current) => {
  const bestScore = calculateScore(best.output.stats);
  const currentScore = calculateScore(current.output.stats);
  return currentScore > bestScore ? current : best;
});

// Apply to actual project
const application = await workflowBenchmarkService.applyBenchmarkResult(
  bestResult,
  "./my-real-project",
  {
    projectName: "my-real-project",
    initMethod: "ai",
    projectDescription: "AI-powered development platform"
  }
);

if (application.success) {
  console.log(`Successfully applied ${bestResult.modelId} results to project`);
  console.log(application.message);
} else {
  console.error("Application failed:", application.message);
}

function calculateScore(stats) {
  // Custom scoring logic
  const taskScore = stats.totalTasks * 10;
  const speedScore = 10000 / (stats.initDuration + stats.prdGenerationDuration);
  const successRate = stats.successfulSteps / stats.totalSteps;
  return taskScore + speedScore + (successRate * 1000);
}

Example: Selective Application

// Apply only PRD and tasks from benchmark, skipping initialization
const selectiveApplication = await workflowBenchmarkService.applyBenchmarkResult(
  {
    modelId: "anthropic:claude-3-5-sonnet",
    output: {
      projectDir: "/tmp/benchmark-123",
      prdContent: "# My Project\n\nThis is the PRD...",
      tasks: [/* task objects */],
      stats: { /* stats */ }
    }
  },
  "./existing-project",
  {
    projectName: "existing-project",
    prdMethod: "ai",
    generateTasks: true
  }
);

console.log("Applied benchmark results to existing project");

#### validateInput

validateInput(input: any): input is WorkflowBenchmarkInput

Parameters:

  • input (any, required): Input to validate

Returns: Boolean indicating if input is valid WorkflowBenchmarkInput

Error Conditions: None - validation only returns a boolean

Example: Input Validation

const userInput = JSON.parse(process.argv[2] || "{}");

if (workflowBenchmarkService.validateInput(userInput)) {
  console.log("Valid benchmark input");
  // Proceed with benchmark
} else {
  console.error("Invalid benchmark input format");
  process.exit(1);
}

// Example of valid input structure
const validInput = {
  collectedResponses: {
    projectName: "test-project",
    initMethod: "ai",
    projectDescription: "Test project description",
    prdMethod: "ai",
    prdDescription: "Test PRD description"
  }
};

console.log("Input valid:", workflowBenchmarkService.validateInput(validInput)); // true

#### createTempProjectDir (Private)

private createTempProjectDir(
  tempBase: string, 
  projectName: string, 
  provider: string, 
  model: string
): string

Parameters:

  • tempBase (string, required): Base directory for temp files
  • projectName (string, required): Project name
  • provider (string, required): AI provider name
  • model (string, required): AI model name

Returns: Created temporary directory path

Error Conditions:

  • File system errors during directory creation

Example: Temp Directory Creation

// This is a private method, shown for understanding
const tempDir = workflowBenchmarkService.createTempProjectDir(
  "/tmp",
  "my-project",
  "anthropic",
  "claude-3-5-sonnet"
);

console.log(`Created temp directory: ${tempDir}`);
// Output: /tmp/task-o-matic-benchmark/benchmark-my-project-anthropic-claude-3-5-sonnet-1234567890

#### cleanupTempProjectDir (Private)

private cleanupTempProjectDir(projectDir: string): void

Parameters:

  • projectDir (string, required): Directory to clean up

Returns: void

Error Conditions: Cleanup errors are logged but don't throw

Example: Cleanup Operation

// This is a private method, shown for understanding
workflowBenchmarkService.cleanupTempProjectDir("/tmp/task-o-matic-benchmark/benchmark-test-123");
// Directory is removed if it contains "task-o-matic-benchmark"
// Errors during cleanup are logged as warnings

INTEGRATION PROTOCOLS

Temporary Directory Structure:

/tmp/task-o-matic-benchmark/
└── benchmark-{projectName}-{provider}-{model}-{timestamp}/
    ├── .task-o-matic/
    │   ├── config.json
    │   ├── tasks/
    │   ├── prd/
    │   └── logs/
    ├── [bootstrapped project files]
    └── [generated PRD and tasks]

Workflow Execution Steps:

  • Project Initialization: Setup temporary project and AI configuration
  • PRD Generation: Create PRD using specified method
  • PRD Refinement: Apply feedback if requested
  • Task Generation: Extract tasks from PRD
  • Task Splitting: Break down complex tasks if requested
  • Cleanup: Remove temporary directories
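
Each step above is timed individually, and a failing step does not prevent the remaining metrics from being collected. A rough sketch of that pattern, not the service's actual implementation (generatePrd and generateProjectTasks are hypothetical step functions):

// Illustrative timing pattern only; the real steps are orchestrated through WorkflowService.
async function timedStep<T>(
  fn: () => Promise<T>
): Promise<{ result: T | null; duration: number; ok: boolean }> {
  const start = Date.now();
  try {
    const result = await fn();
    return { result, duration: Date.now() - start, ok: true };
  } catch {
    return { result: null, duration: Date.now() - start, ok: false };
  }
}

// Hypothetical usage: time two phases independently and count successes.
const prdStep = await timedStep(() => generatePrd(projectDir));
const taskStep = await timedStep(() => generateProjectTasks(projectDir));
const successfulSteps = [prdStep, taskStep].filter((step) => step.ok).length;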

Performance Metrics Collection:

interface BenchmarkStats {
  initDuration: number;            // Project setup time
  prdGenerationDuration: number;   // PRD creation time
  prdRefinementDuration: number;   // PRD improvement time
  taskGenerationDuration: number;  // Task extraction time
  taskSplittingDuration: number;   // Task breakdown time
  totalTasks: number;              // Tasks created
  tasksWithSubtasks: number;       // Tasks split further
  avgTaskComplexity: number;       // Complexity score
  prdSize: number;                 // PRD character count
  totalSteps: number;              // Workflow steps attempted
  successfulSteps: number;         // Steps completed successfully
}
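
The stats expose per-phase durations rather than a precomputed total, so a small helper like the following (an assumption, not part of the service) can aggregate them:

// Hypothetical helper: sum the AI-related phase durations from a BenchmarkStats object.
function totalAiDuration(stats: BenchmarkStats): number {
  return (
    stats.prdGenerationDuration +
    stats.prdRefinementDuration +
    stats.taskGenerationDuration +
    stats.taskSplittingDuration
  );
}

// Adding stats.initDuration to totalAiDuration(stats) gives the end-to-end workflow time.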

Result Application Process:

  • Extract model configuration from modelId
  • Initialize target project with selected model
  • Copy PRD content if available
  • Import tasks into target project
  • Provide feedback on application success
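
For the first step above, the modelId follows the provider:model convention used throughout this document, so splitting on the first colon recovers the configuration. A minimal sketch (the helper name is an assumption):

// Hypothetical parser for a "provider:model" identifier such as "anthropic:claude-3-5-sonnet".
function parseModelId(modelId: string): { aiProvider: string; aiModel: string } {
  const separator = modelId.indexOf(":");
  return {
    aiProvider: modelId.slice(0, separator),
    aiModel: modelId.slice(separator + 1)
  };
}

console.log(parseModelId("openrouter:anthropic/claude-3-opus"));
// => { aiProvider: "openrouter", aiModel: "anthropic/claude-3-opus" }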

SURVIVAL SCENARIOS

Scenario 1: Model Performance Comparison

const models = [
  { aiProvider: "anthropic", aiModel: "claude-3-5-sonnet", name: "Claude 3.5 Sonnet" },
  { aiProvider: "openai", aiModel: "gpt-4", name: "GPT-4" },
  { aiProvider: "openrouter", aiModel: "anthropic/claude-3-opus", name: "Claude 3 Opus" }
];

const benchmarkInput = {
  collectedResponses: {
    projectName: "model-comparison",
    initMethod: "ai",
    projectDescription: "AI-powered code review and collaboration platform",
    prdMethod: "ai",
    prdDescription: "Real-time code review with AI assistance, team collaboration, and automated analysis",
    generateTasks: true,
    splitTasks: true
  }
};

console.log("Running model comparison benchmark...");
const results = await Promise.all(
  models.map(async (model, index) => {
    console.log(`Testing ${model.name}...`);
    const startTime = Date.now();
    
    const result = await workflowBenchmarkService.executeWorkflow(
      benchmarkInput,
      { aiProvider: model.aiProvider, aiModel: model.aiModel }
    );
    
    const duration = Date.now() - startTime;
    console.log(`${model.name} completed in ${duration}ms`);
    
    return {
      model: model.name,
      provider: model.aiProvider,
      modelId: model.aiModel,
      result,
      duration,
      stats: result.stats
    };
  })
);

// Analyze results
console.log("\n=== BENCHMARK RESULTS ===");
results.forEach(result => {
  const { stats } = result;
  console.log(`\n${result.model} (${result.provider}/${result.modelId}):`);
  console.log(`  Total Tasks: ${stats.totalTasks}`);
  console.log(`  Tasks Split: ${stats.tasksWithSubtasks}`);
  console.log(`  PRD Size: ${stats.prdSize} chars`);
  console.log(`  Success Rate: ${stats.successfulSteps}/${stats.totalSteps} (${Math.round(stats.successfulSteps/stats.totalSteps*100)}%)`);
  console.log(`  Timings:`);
  console.log(`    - Init: ${stats.initDuration}ms`);
  console.log(`    - PRD: ${stats.prdGenerationDuration}ms`);
  console.log(`    - Tasks: ${stats.taskGenerationDuration}ms`);
  console.log(`    - Splitting: ${stats.taskSplittingDuration}ms`);
  console.log(`    - Total: ${result.duration}ms`);
});

// Select best model based on custom criteria
const bestModel = results.reduce((best, current) => {
  const bestScore = calculatePerformanceScore(best.stats);
  const currentScore = calculatePerformanceScore(current.stats);
  
  console.log(`${current.model} Score: ${currentScore}`);
  return currentScore > bestScore ? current : best;
});

console.log(`\nBest model: ${bestModel.model}`);
console.log(`Applying results to project...`);

// Apply best results
const application = await workflowBenchmarkService.applyBenchmarkResult(
  {
    modelId: `${bestModel.provider}:${bestModel.modelId}`,
    output: bestModel.result
  },
  "./my-production-project",
  benchmarkInput.collectedResponses
);

console.log(application.message);

function calculatePerformanceScore(stats) {
  // Weighted scoring system
  const taskScore = stats.totalTasks * 20;                    // More tasks = better
  const qualityScore = stats.avgTaskComplexity * 10;             // Higher complexity = better analysis
  const speedScore = 100000 / (stats.initDuration + stats.prdGenerationDuration + stats.taskGenerationDuration); // Faster = better
  const successRate = stats.successfulSteps / stats.totalSteps;   // Higher success rate = better
  
  return taskScore + qualityScore + speedScore + (successRate * 5000);
}

Scenario 2: Iterative Benchmark Testing

// fs is needed for the result files written at the end of each scenario run
import { promises as fs } from "node:fs";

const testScenarios = [
  {
    name: "Simple Project",
    description: "Basic todo list application",
    complexity: "low"
  },
  {
    name: "Complex Platform",
    description: "Enterprise collaboration platform with AI features, real-time editing, and advanced analytics",
    complexity: "high"
  },
  {
    name: "API-First",
    description: "RESTful API service with documentation and testing",
    complexity: "medium"
  }
];

const model = { aiProvider: "anthropic", aiModel: "claude-3-5-sonnet" };

for (const scenario of testScenarios) {
  console.log(`\n=== Testing: ${scenario.name} (${scenario.complexity} complexity) ===`);
  
  const benchmarkInput = {
    collectedResponses: {
      projectName: `test-${scenario.name.toLowerCase().replace(/\s+/g, '-')}`,
      initMethod: "ai",
      projectDescription: scenario.description,
      prdMethod: "ai",
      prdDescription: scenario.description,
      generateTasks: true,
      splitTasks: true
    }
  };

  const result = await workflowBenchmarkService.executeWorkflow(
    benchmarkInput,
    model
  );

  const { stats } = result;
  console.log(`Results for ${scenario.name}:`);
  console.log(`  Tasks: ${stats.totalTasks}`);
  console.log(`  PRD Size: ${stats.prdSize} chars`);
  console.log(`  Total Time: ${stats.initDuration + stats.prdGenerationDuration + stats.taskGenerationDuration}ms`);
  console.log(`  Success Rate: ${Math.round(stats.successfulSteps/stats.totalSteps*100)}%`);
  
  // Store results for comparison
  await fs.writeFile(
    `./benchmark-results-${scenario.name.toLowerCase().replace(/\s+/g, '-')}.json`,
    JSON.stringify({ scenario, model, stats, result }, null, 2)
  );
}

console.log("\nAll benchmark tests completed. Results saved to individual files.");

Scenario 3: Production Model Selection

// Run comprehensive benchmark before production deployment
const productionCandidates = [
  { aiProvider: "anthropic", aiModel: "claude-3-5-sonnet", cost: 3.0, speed: "fast" },
  { aiProvider: "openai", aiModel: "gpt-4", cost: 6.0, speed: "medium" },
  { aiProvider: "openrouter", aiModel: "anthropic/claude-3-haiku", cost: 1.5, speed: "very-fast" }
];

const productionScenario = {
  collectedResponses: {
    projectName: "production-app",
    initMethod: "ai",
    projectDescription: "Customer support chatbot with knowledge base integration",
    prdMethod: "ai",
    prdDescription: "AI-powered customer support with natural language processing, knowledge base search, and multi-channel support",
    generateTasks: true,
    splitTasks: true
  }
};

console.log("Running production model selection benchmark...");

const benchmarkResults = await Promise.all(
  productionCandidates.map(async (candidate) => {
    const result = await workflowBenchmarkService.executeWorkflow(
      productionScenario,
      { aiProvider: candidate.aiProvider, aiModel: candidate.aiModel }
    );
    
    return {
      ...candidate,
      result,
      stats: result.stats,
      performanceScore: calculateProductionScore(result.stats, candidate)
    };
  })
);

// Select optimal model for production
const productionChoice = benchmarkResults.reduce((best, current) => 
  current.performanceScore > best.performanceScore ? current : best
);

console.log(`\n=== PRODUCTION SELECTION ===`);
console.log(`Selected: ${productionChoice.aiProvider}/${productionChoice.aiModel}`);
console.log(`Performance Score: ${productionChoice.performanceScore}`);
console.log(`Cost per 1M tokens: $${productionChoice.cost}`);
console.log(`Speed Rating: ${productionChoice.speed}`);

// Apply to production project
const productionApplication = await workflowBenchmarkService.applyBenchmarkResult(
  {
    modelId: `${productionChoice.aiProvider}:${productionChoice.aiModel}`,
    output: productionChoice.result
  },
  "./production-customer-support",
  productionScenario.collectedResponses
);

if (productionApplication.success) {
  console.log("✅ Production project setup complete");
  console.log(productionApplication.message);
} else {
  console.error("❌ Production setup failed:", productionApplication.message);
}

function calculateProductionScore(stats, candidate) {
  // Production-optimized scoring
  const reliabilityScore = (stats.successfulSteps / stats.totalSteps) * 1000;
  const productivityScore = stats.totalTasks * 25;
  const speedScore = candidate.speed === "very-fast" ? 500 : candidate.speed === "fast" ? 300 : 100;
  const costPenalty = candidate.cost * 50; // Lower cost is better
  
  return reliabilityScore + productivityScore + speedScore - costPenalty;
}

TECHNICAL SPECIFICATIONS

Benchmark Isolation Strategy:

  • Unique temporary directories for each run
  • Clean separation from user projects
  • No interference between concurrent benchmarks
  • Automatic cleanup to prevent disk space issues

Performance Measurement:

  • High-resolution timing with Date.now()
  • Step-by-step duration tracking
  • Success/failure rate calculation
  • Quality metrics (task count, complexity, PRD size)

Error Handling:

  • Non-fatal error handling during cleanup
  • Detailed error logging without interrupting benchmarks
  • Graceful degradation when steps fail
  • Comprehensive error reporting

Resource Management:

  • Memory-efficient temporary file handling
  • Disk space monitoring and cleanup
  • Process isolation for concurrent benchmarks
  • Automatic resource release on completion

Data Collection:

  • Comprehensive workflow execution data
  • Model performance characteristics
  • Quality indicators and metrics
  • Timing and success rate statistics

Integration Points:

  • WorkflowService for complete workflow execution
  • File system for temporary environment management
  • Storage layer for data persistence
  • Better-T-Stack CLI for project bootstrapping

Security Considerations:

  • Isolated execution environments
  • No access to user project data during benchmarks
  • Secure temporary file handling
  • Clean separation of test and production data

Remember: Citizen, WorkflowBenchmarkService is your testing ground in the wasteland of AI model selection. Use it to evaluate performance before committing to production, but remember that benchmarks are controlled conditions. Real-world performance may vary. Test thoroughly, choose wisely, and your projects will stand strong against the challenges of the post-deadline world.