SERVICES DOCUMENTATION
TECHNICAL BULLETIN NO. 005
WORKFLOW BENCHMARK SERVICE - PERFORMANCE EVALUATION SURVIVAL SYSTEM
task-o-matic-workflow-benchmark-v1 | All Personnel | MANDATORY COMPLIANCE: Yes
⚠️ CRITICAL SURVIVAL NOTICE
Citizen, WorkflowBenchmarkService is your performance evaluation system in the post-deadline wasteland. Without proper benchmarking, your AI model choices will be as random as dust storms. This service provides isolated testing environments to compare AI performance across complete workflows.
SYSTEM ARCHITECTURE OVERVIEW
The WorkflowBenchmarkService executes complete project workflows in isolated environments to provide fair, comparable performance metrics across different AI models and configurations. It ensures clean separation between test runs and provides detailed analytics for model selection.
Core Dependencies:
- WorkflowService: Complete workflow orchestration
- File System: Temporary directory management and cleanup
- Storage Layer: Task and project data management
- Better-T-Stack CLI: Project bootstrapping integration
Benchmark Capabilities:
- Isolated Execution: Separate environments for each model
- Complete Workflows: Full project lifecycle testing
- Performance Metrics: Detailed timing and quality measurements
- Result Application: Apply selected benchmark to actual projects
- Cleanup Management: Automatic temporary resource cleanup
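For orientation, the public surface behind these capabilities can be summarized in a short TypeScript sketch. The method names and signatures come from the API documentation below; the referenced types (WorkflowBenchmarkInput, AIOptions, StreamingOptions, WorkflowBenchmarkResult) are only named here, not defined, and the interface name itself is illustrative.
// Illustrative summary of the documented service surface (interface name is hypothetical)
interface WorkflowBenchmarkServiceSurface {
  executeWorkflow(
    input: WorkflowBenchmarkInput,
    aiOptions: AIOptions,
    streamingOptions?: StreamingOptions
  ): Promise<WorkflowBenchmarkResult["output"]>;

  applyBenchmarkResult(
    selectedResult: WorkflowBenchmarkResult,
    targetProjectDir: string,
    originalResponses: WorkflowBenchmarkInput["collectedResponses"]
  ): Promise<{ success: boolean; message: string }>;

  validateInput(input: any): input is WorkflowBenchmarkInput;
}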
COMPLETE API DOCUMENTATION
#### executeWorkflow
async executeWorkflow(
input: WorkflowBenchmarkInput,
aiOptions: AIOptions,
streamingOptions?: StreamingOptions
): Promise<WorkflowBenchmarkResult["output"]>

Parameters:
input(WorkflowBenchmarkInput, required): Benchmark input configuration
- collectedResponses (object): User responses for workflow steps
  - projectName (string): Project name
  - initMethod ("quick" | "custom" | "ai"): Initialization method
  - projectDescription (string): Project description
  - stackConfig (object, optional): Stack configuration
  - prdMethod ("upload" | "manual" | "ai" | "skip"): PRD method
  - prdFile (string, optional): PRD file path
  - prdDescription (string, optional): PRD description
  - prdContent (string, optional): PRD content
  - refinePrd (boolean, optional): Whether to refine PRD
  - refineFeedback (string, optional): PRD refinement feedback
  - generateTasks (boolean, optional): Whether to generate tasks
  - customInstructions (string, optional): Custom task instructions
  - splitTasks (boolean, optional): Whether to split tasks
  - tasksToSplit (string[], optional): Specific tasks to split
  - splitInstructions (string, optional): Splitting instructions
- tempDirBase (string, optional): Base directory for temp files
aiOptions(AIOptions, required): AI configuration for this benchmark run
streamingOptions(StreamingOptions, optional): Streaming callbacks
Returns: WorkflowBenchmarkResult["output"] containing:
projectDir(string): Temporary project directory path
prdFile(string): Generated PRD file path
prdContent(string): Generated PRD content
tasks(Task[]): Array of generated tasks
stats(object): Performance statistics
- initDuration (number): Initialization duration in ms
- prdGenerationDuration (number): PRD generation duration in ms
- prdRefinementDuration (number): PRD refinement duration in ms
- taskGenerationDuration (number): Task generation duration in ms
- taskSplittingDuration (number): Task splitting duration in ms
- totalTasks (number): Total tasks created
- tasksWithSubtasks (number): Tasks that have subtasks
- avgTaskComplexity (number): Average task complexity score
- prdSize (number): PRD content length
- totalSteps (number): Total workflow steps executed
- successfulSteps (number): Successfully completed steps
Error Conditions:
- File system errors during temporary directory creation
- Workflow execution failures from underlying services
- Cleanup errors (non-fatal)
Example: Basic Benchmark Execution
const benchmarkInput = {
collectedResponses: {
projectName: "benchmark-test",
initMethod: "ai",
projectDescription: "AI-powered task management application",
prdMethod: "ai",
prdDescription: "Task management with AI assistance and collaboration features",
generateTasks: true,
splitTasks: true
},
tempDirBase: "/tmp"
};
const result = await workflowBenchmarkService.executeWorkflow(
benchmarkInput,
{
aiProvider: "anthropic",
aiModel: "claude-3-5-sonnet"
},
{
onChunk: (chunk) => console.log(`[BENCHMARK] ${chunk}`)
}
);
console.log(`Benchmark completed:`);
console.log(`- Project: ${result.projectDir}`);
console.log(`- PRD size: ${result.stats.prdSize} chars`);
console.log(`- Tasks created: ${result.stats.totalTasks}`);
console.log(`- Total duration: ${result.stats.initDuration + result.stats.prdGenerationDuration + result.stats.taskGenerationDuration}ms`);

Example: Multi-Model Comparison
const models = [
{ aiProvider: "anthropic", aiModel: "claude-3-5-sonnet" },
{ aiProvider: "openai", aiModel: "gpt-4" },
{ aiProvider: "openrouter", aiModel: "anthropic/claude-3-opus" }
];
const benchmarkInput = {
collectedResponses: {
projectName: "model-comparison",
initMethod: "quick",
prdMethod: "ai",
prdDescription: "Real-time collaboration platform",
generateTasks: true,
splitTasks: true
}
};
const results = await Promise.all(
models.map(model =>
workflowBenchmarkService.executeWorkflow(benchmarkInput, model)
)
);
results.forEach((result, index) => {
const model = models[index];
console.log(`\n${model.aiProvider}/${model.aiModel}:`);
console.log(` Tasks: ${result.stats.totalTasks}`);
console.log(` Duration: ${result.stats.initDuration + result.stats.prdGenerationDuration}ms`);
console.log(` Success rate: ${result.stats.successfulSteps}/${result.stats.totalSteps}`);
});

#### applyBenchmarkResult
async applyBenchmarkResult(
selectedResult: WorkflowBenchmarkResult,
targetProjectDir: string,
originalResponses: WorkflowBenchmarkInput["collectedResponses"]
): Promise<{ success: boolean; message: string }>

Parameters:
selectedResult(WorkflowBenchmarkResult, required): Benchmark result to apply
- modelId (string): Model identifier (e.g., "anthropic:claude-3-5-sonnet")
- output (object): Benchmark output data
targetProjectDir(string, required): Target project directory
originalResponses(object, required): Original user responses for workflow
Returns: Application result containing:
success(boolean): Whether application succeeded
message(string): Status message
Error Conditions:
- File system errors during project creation
- Task creation failures
- Configuration errors
Example: Apply Best Benchmark Result
// Assume we have benchmark results from multiple models
const benchmarkResults = [
{ modelId: "anthropic:claude-3-5-sonnet", output: claudeResult },
{ modelId: "openai:gpt-4", output: gpt4Result },
{ modelId: "openrouter:claude-3-opus", output: opusResult }
];
// Select best result based on criteria
const bestResult = benchmarkResults.reduce((best, current) => {
const bestScore = calculateScore(best.output.stats);
const currentScore = calculateScore(current.output.stats);
return currentScore > bestScore ? current : best;
});
// Apply to actual project
const application = await workflowBenchmarkService.applyBenchmarkResult(
bestResult,
"./my-real-project",
{
projectName: "my-real-project",
initMethod: "ai",
projectDescription: "AI-powered development platform"
}
);
if (application.success) {
console.log(`Successfully applied ${bestResult.modelId} results to project`);
console.log(application.message);
} else {
console.error("Application failed:", application.message);
}
function calculateScore(stats) {
// Custom scoring logic
const taskScore = stats.totalTasks * 10;
const speedScore = 10000 / (stats.initDuration + stats.prdGenerationDuration);
const successRate = stats.successfulSteps / stats.totalSteps;
return taskScore + speedScore + (successRate * 1000);
}

Example: Selective Application
// Apply only PRD and tasks from benchmark, skipping initialization
const selectiveApplication = await workflowBenchmarkService.applyBenchmarkResult(
{
modelId: "anthropic:claude-3-5-sonnet",
output: {
projectDir: "/tmp/benchmark-123",
prdContent: "# My Project\n\nThis is the PRD...",
tasks: [/* task objects */],
stats: { /* stats */ }
}
},
"./existing-project",
{
projectName: "existing-project",
prdMethod: "ai",
generateTasks: true
}
);
console.log("Applied benchmark results to existing project");#### validateInput
validateInput(input: any): input is WorkflowBenchmarkInput

Parameters:
input(any, required): Input to validate
Returns: Boolean indicating if input is valid WorkflowBenchmarkInput
Error Conditions: None - validation only returns boolean
Example: Input Validation
const userInput = JSON.parse(process.argv[2] || "{}");
if (workflowBenchmarkService.validateInput(userInput)) {
console.log("Valid benchmark input");
// Proceed with benchmark
} else {
console.error("Invalid benchmark input format");
process.exit(1);
}
// Example of valid input structure
const validInput = {
collectedResponses: {
projectName: "test-project",
initMethod: "ai",
projectDescription: "Test project description",
prdMethod: "ai",
prdDescription: "Test PRD description"
}
};
console.log("Input valid:", workflowBenchmarkService.validateInput(validInput)); // true#### createTempProjectDir (Private)
private createTempProjectDir(
tempBase: string,
projectName: string,
provider: string,
model: string
): string

Parameters:
tempBase(string, required): Base directory for temp files
projectName(string, required): Project name
provider(string, required): AI provider name
model(string, required): AI model name
Returns: Created temporary directory path
Error Conditions:
- File system errors during directory creation
Example: Temp Directory Creation
// This is a private method, shown for understanding
const tempDir = workflowBenchmarkService.createTempProjectDir(
"/tmp",
"my-project",
"anthropic",
"claude-3-5-sonnet"
);
console.log(`Created temp directory: ${tempDir}`);
// Output: /tmp/task-o-matic-benchmark/benchmark-my-project-anthropic-claude-3-5-sonnet-1234567890

#### cleanupTempProjectDir (Private)
private cleanupTempProjectDir(projectDir: string): void

Parameters:
projectDir(string, required): Directory to clean up
Returns: void
Error Conditions: Cleanup errors are logged but don't throw
Example: Cleanup Operation
// This is a private method, shown for understanding
workflowBenchmarkService.cleanupTempProjectDir("/tmp/task-o-matic-benchmark/benchmark-test-123");
// Directory is removed if it contains "task-o-matic-benchmark"
// Errors during cleanup are logged as warnings

INTEGRATION PROTOCOLS
Temporary Directory Structure:
/tmp/task-o-matic-benchmark/
└── benchmark-{projectName}-{provider}-{model}-{timestamp}/
    ├── .task-o-matic/
    │   ├── config.json
    │   ├── tasks/
    │   ├── prd/
    │   └── logs/
    ├── [bootstrapped project files]
    └── [generated PRD and tasks]

Workflow Execution Steps:
- Project Initialization: Set up temporary project and AI configuration
- PRD Generation: Create PRD using specified method
- PRD Refinement: Apply feedback if requested
- Task Generation: Extract tasks from PRD
- Task Splitting: Break down complex tasks if requested
- Cleanup: Remove temporary directories
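Each of these steps is timed individually and the durations are folded into the stats object described below. A minimal sketch of the measurement pattern, assuming a generic async step helper (timeStep and generatePrd are illustrative names, not part of the service API):
// Illustrative timing helper; the service records per-step durations in a similar way
async function timeStep<T>(step: () => Promise<T>): Promise<{ value: T; duration: number }> {
  const start = Date.now();
  const value = await step();
  return { value, duration: Date.now() - start };
}

// Usage sketch (generatePrd is a stand-in for the real PRD generation step):
// const { value: prd, duration: prdGenerationDuration } = await timeStep(() => generatePrd(input));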
Performance Metrics Collection:
interface BenchmarkStats {
initDuration: number; // Project setup time
prdGenerationDuration: number; // PRD creation time
prdRefinementDuration: number; // PRD improvement time
taskGenerationDuration: number; // Task extraction time
taskSplittingDuration: number; // Task breakdown time
totalTasks: number; // Tasks created
tasksWithSubtasks: number; // Tasks split further
avgTaskComplexity: number; // Complexity score
prdSize: number; // PRD character count
totalSteps: number; // Workflow steps attempted
successfulSteps: number; // Steps completed successfully
}

Result Application Process:
- Extract model configuration from modelId
- Initialize target project with selected model
- Copy PRD content if available
- Import tasks into target project
- Provide feedback on application success
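The modelId follows the "provider:model" convention shown in applyBenchmarkResult (e.g., "anthropic:claude-3-5-sonnet"). A minimal sketch of how the provider and model can be recovered from it; the parseModelId helper is illustrative and not part of the service:
// Illustrative helper: split "provider:model" back into AI options
function parseModelId(modelId: string): { aiProvider: string; aiModel: string } {
  const separator = modelId.indexOf(":");
  return {
    aiProvider: modelId.slice(0, separator),
    aiModel: modelId.slice(separator + 1),
  };
}

// Model names may themselves contain slashes; only the first colon is treated as the separator
const { aiProvider, aiModel } = parseModelId("openrouter:anthropic/claude-3-opus");
// aiProvider === "openrouter", aiModel === "anthropic/claude-3-opus"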
SURVIVAL SCENARIOS
Scenario 1: Model Performance Comparison
const models = [
{ aiProvider: "anthropic", aiModel: "claude-3-5-sonnet", name: "Claude 3.5 Sonnet" },
{ aiProvider: "openai", aiModel: "gpt-4", name: "GPT-4" },
{ aiProvider: "openrouter", aiModel: "anthropic/claude-3-opus", name: "Claude 3 Opus" }
];
const benchmarkInput = {
collectedResponses: {
projectName: "model-comparison",
initMethod: "ai",
projectDescription: "AI-powered code review and collaboration platform",
prdMethod: "ai",
prdDescription: "Real-time code review with AI assistance, team collaboration, and automated analysis",
generateTasks: true,
splitTasks: true
}
};
console.log("Running model comparison benchmark...");
const results = await Promise.all(
models.map(async (model, index) => {
console.log(`Testing ${model.name}...`);
const startTime = Date.now();
const result = await workflowBenchmarkService.executeWorkflow(
benchmarkInput,
{ aiProvider: model.aiProvider, aiModel: model.aiModel }
);
const duration = Date.now() - startTime;
console.log(`${model.name} completed in ${duration}ms`);
return {
model: model.name,
provider: model.aiProvider,
modelId: model.aiModel,
result,
duration,
stats: result.stats
};
})
);
// Analyze results
console.log("\n=== BENCHMARK RESULTS ===");
results.forEach(result => {
const { stats } = result;
console.log(`\n${result.model} (${result.provider}/${result.modelId}):`);
console.log(` Total Tasks: ${stats.totalTasks}`);
console.log(` Tasks Split: ${stats.tasksWithSubtasks}`);
console.log(` PRD Size: ${stats.prdSize} chars`);
console.log(` Success Rate: ${stats.successfulSteps}/${stats.totalSteps} (${Math.round(stats.successfulSteps/stats.totalSteps*100)}%)`);
console.log(` Timings:`);
console.log(` - Init: ${stats.initDuration}ms`);
console.log(` - PRD: ${stats.prdGenerationDuration}ms`);
console.log(` - Tasks: ${stats.taskGenerationDuration}ms`);
console.log(` - Splitting: ${stats.taskSplittingDuration}ms`);
console.log(` - Total: ${result.duration}ms`);
});
// Select best model based on custom criteria
const bestModel = results.reduce((best, current) => {
const bestScore = calculatePerformanceScore(best.stats);
const currentScore = calculatePerformanceScore(current.stats);
console.log(`${current.model} Score: ${currentScore}`);
return currentScore > bestScore ? current : best;
});
console.log(`\nBest model: ${bestModel.model}`);
console.log(`Applying results to project...`);
// Apply best results
const application = await workflowBenchmarkService.applyBenchmarkResult(
{
modelId: `${bestModel.provider}:${bestModel.modelId}`,
output: bestModel.result
},
"./my-production-project",
benchmarkInput.collectedResponses
);
console.log(application.message);
function calculatePerformanceScore(stats) {
// Weighted scoring system
const taskScore = stats.totalTasks * 20; // More tasks = better
const qualityScore = stats.avgTaskComplexity * 10; // Higher complexity = better analysis
const speedScore = 100000 / (stats.initDuration + stats.prdGenerationDuration + stats.taskGenerationDuration); // Faster = better
const successRate = stats.successfulSteps / stats.totalSteps; // Higher success rate = better
return taskScore + qualityScore + speedScore + (successRate * 5000);
}

Scenario 2: Iterative Benchmark Testing
const testScenarios = [
{
name: "Simple Project",
description: "Basic todo list application",
complexity: "low"
},
{
name: "Complex Platform",
description: "Enterprise collaboration platform with AI features, real-time editing, and advanced analytics",
complexity: "high"
},
{
name: "API-First",
description: "RESTful API service with documentation and testing",
complexity: "medium"
}
];
const model = { aiProvider: "anthropic", aiModel: "claude-3-5-sonnet" };
for (const scenario of testScenarios) {
console.log(`\n=== Testing: ${scenario.name} (${scenario.complexity} complexity) ===`);
const benchmarkInput = {
collectedResponses: {
projectName: `test-${scenario.name.toLowerCase().replace(/\s+/g, '-')}`,
initMethod: "ai",
projectDescription: scenario.description,
prdMethod: "ai",
prdDescription: scenario.description,
generateTasks: true,
splitTasks: true
}
};
const result = await workflowBenchmarkService.executeWorkflow(
benchmarkInput,
model
);
const { stats } = result;
console.log(`Results for ${scenario.name}:`);
console.log(` Tasks: ${stats.totalTasks}`);
console.log(` PRD Size: ${stats.prdSize} chars`);
console.log(` Total Time: ${stats.initDuration + stats.prdGenerationDuration + stats.taskGenerationDuration}ms`);
console.log(` Success Rate: ${Math.round(stats.successfulSteps/stats.totalSteps*100)}%`);
// Store results for comparison (assumes fs is imported from "node:fs/promises")
await fs.writeFile(
`./benchmark-results-${scenario.name.toLowerCase().replace(/\s+/g, '-')}.json`,
JSON.stringify({ scenario, model: `${model.aiProvider}/${model.aiModel}`, stats, result }, null, 2)
);
}
console.log("\nAll benchmark tests completed. Results saved to individual files.");Scenario 3: Production Model Selection
// Run comprehensive benchmark before production deployment
const productionCandidates = [
{ aiProvider: "anthropic", aiModel: "claude-3-5-sonnet", cost: 3.0, speed: "fast" },
{ aiProvider: "openai", aiModel: "gpt-4", cost: 6.0, speed: "medium" },
{ aiProvider: "openrouter", aiModel: "anthropic/claude-3-haiku", cost: 1.5, speed: "very-fast" }
];
const productionScenario = {
collectedResponses: {
projectName: "production-app",
initMethod: "ai",
projectDescription: "Customer support chatbot with knowledge base integration",
prdMethod: "ai",
prdDescription: "AI-powered customer support with natural language processing, knowledge base search, and multi-channel support",
generateTasks: true,
splitTasks: true
}
};
console.log("Running production model selection benchmark...");
const benchmarkResults = await Promise.all(
productionCandidates.map(async (candidate) => {
const result = await workflowBenchmarkService.executeWorkflow(
productionScenario,
{ aiProvider: candidate.aiProvider, aiModel: candidate.aiModel }
);
return {
...candidate,
result,
stats: result.stats,
performanceScore: calculateProductionScore(result.stats, candidate)
};
})
);
// Select optimal model for production
const productionChoice = benchmarkResults.reduce((best, current) =>
current.performanceScore > best.performanceScore ? current : best
);
console.log(`\n=== PRODUCTION SELECTION ===`);
console.log(`Selected: ${productionChoice.aiProvider}/${productionChoice.aiModel}`);
console.log(`Performance Score: ${productionChoice.performanceScore}`);
console.log(`Cost per 1M tokens: $${productionChoice.cost}`);
console.log(`Speed Rating: ${productionChoice.speed}`);
// Apply to production project
const productionApplication = await workflowBenchmarkService.applyBenchmarkResult(
{
modelId: `${productionChoice.aiProvider}:${productionChoice.aiModel}`,
output: productionChoice.result
},
"./production-customer-support",
productionScenario.collectedResponses
);
if (productionApplication.success) {
console.log("✅ Production project setup complete");
console.log(productionApplication.message);
} else {
console.error("❌ Production setup failed:", productionApplication.message);
}
function calculateProductionScore(stats, candidate) {
// Production-optimized scoring
const reliabilityScore = (stats.successfulSteps / stats.totalSteps) * 1000;
const productivityScore = stats.totalTasks * 25;
const speedScore = candidate.speed === "very-fast" ? 500 : candidate.speed === "fast" ? 300 : 100;
const costPenalty = candidate.cost * 50; // Lower cost is better
return reliabilityScore + productivityScore + speedScore - costPenalty;
}

TECHNICAL SPECIFICATIONS
Benchmark Isolation Strategy:
- Unique temporary directories for each run
- Clean separation from user projects
- No interference between concurrent benchmarks
- Automatic cleanup to prevent disk space issues
Performance Measurement:
- High-resolution timing with Date.now()
- Step-by-step duration tracking
- Success/failure rate calculation
- Quality metrics (task count, complexity, PRD size)
Error Handling:
- Non-fatal error handling during cleanup
- Detailed error logging without interrupting benchmarks
- Graceful degradation when steps fail
- Comprehensive error reporting
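In practice this means callers only need to guard the benchmark call itself; cleanup failures are logged by the service rather than thrown. A minimal hedged sketch (benchmarkInput is assumed to be defined as in the earlier examples):
// Minimal guard around a benchmark run; only workflow-level failures reach this catch
try {
  const result = await workflowBenchmarkService.executeWorkflow(benchmarkInput, {
    aiProvider: "anthropic",
    aiModel: "claude-3-5-sonnet",
  });
  if (result.stats.successfulSteps < result.stats.totalSteps) {
    console.warn("Benchmark finished with partial step failures", result.stats);
  }
} catch (error) {
  // Cleanup errors are non-fatal and logged as warnings by the service;
  // anything caught here came from the underlying workflow execution.
  console.error("Benchmark failed:", error);
}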
Resource Management:
- Memory-efficient temporary file handling
- Disk space monitoring and cleanup
- Process isolation for concurrent benchmarks
- Automatic resource release on completion
Data Collection:
- Comprehensive workflow execution data
- Model performance characteristics
- Quality indicators and metrics
- Timing and success rate statistics
Integration Points:
- WorkflowService for complete workflow execution
- File system for temporary environment management
- Storage layer for data persistence
- Better-T-Stack CLI for project bootstrapping
Security Considerations:
- Isolated execution environments
- No access to user project data during benchmarks
- Secure temporary file handling
- Clean separation of test and production data
Remember: Citizen, WorkflowBenchmarkService is your testing ground in the wasteland of AI model selection. Use it to evaluate performance before committing to production, but remember that benchmarks are controlled conditions. Real-world performance may vary. Test thoroughly, choose wisely, and your projects will stand strong against the challenges of the post-deadline world.