SERVICES DOCUMENTATION
TECHNICAL BULLETIN NO. 006
BENCHMARK SERVICE - PERFORMANCE MEASUREMENT SURVIVAL SYSTEM
task-o-matic-benchmark-v2 | All Personnel | MANDATORY COMPLIANCE: Yes
⚠️ CRITICAL SURVIVAL NOTICE
Citizen, BenchmarkService is your performance monitoring system in the post-deadline wasteland. Without proper benchmarking, your AI model performance will be as mysterious as radioactive fog. This service provides standardized performance measurement and comparison capabilities for all task-o-matic operations.
SYSTEM ARCHITECTURE OVERVIEW
The BenchmarkService serves as the central performance measurement and storage system for task-o-matic operations. It provides a unified interface for running benchmarks, storing results, and analyzing performance across different AI models and configurations.
Core Components:
- BenchmarkService: Main benchmark execution service with multiple benchmark types
- WorkflowBenchmarkService: Specialized service for complete workflow benchmarks
- BenchmarkRunner: Core benchmark execution engine for registered operations
- BenchmarkStorage: Persistent result storage and retrieval
- BenchmarkRegistry: Operation registration and discovery
Benchmark Capabilities:
- Operation Benchmarking: Run standardized benchmarks on any registered operation
- Task Execution Benchmarking: Isolated task execution with git branch isolation
- Execute Loop Benchmarking: Batch task execution with model comparison
- Workflow Benchmarking: Complete project lifecycle benchmarking
- Performance Tracking: Measure timing, token usage, throughput, and latency
- Result Storage: Persistent storage of benchmark results
- Historical Analysis: Compare performance across multiple runs
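The sections below document each capability in turn. As a quick orientation, here is a minimal sketch tying them together; it assumes a benchmarkService instance has already been constructed and wired to storage (construction details are outside the scope of this bulletin):

// Illustrative only: assumes `benchmarkService` is an already-constructed
// BenchmarkService instance wired to storage.
const run = await benchmarkService.runBenchmark(
  "prd-create",
  { description: "Water purification tracker" },
  {
    models: [{ provider: "anthropic", model: "claude-3-5-sonnet" }],
    concurrency: 1,
    delay: 0
  }
);

// Persisted runs can be listed and retrieved later for historical analysis.
const latest = benchmarkService.listRuns()[0];
const replay = benchmarkService.getRun(latest.id);
console.log(replay?.results.map(r => `${r.modelId}: ${r.duration}ms`));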
BENCHMARKSERVICE API
runBenchmark
Run a benchmark on a registered operation using multiple AI models.
async runBenchmark(
operationId: string,
input: any,
config: BenchmarkConfig,
onProgress?: (event: BenchmarkProgressEvent) => void
): Promise<BenchmarkRun>
Parameters:
operationId (string, required): Unique identifier for the benchmark operation
- Available operations: prd-parse, prd-rework, prd-create, prd-combine, prd-question, prd-refine, task-breakdown, task-create, task-enhance, task-plan, task-document, workflow-full
input (any, required): Input data for the operation being benchmarked
- Varies by operation type (see Operation Registry below)
config (BenchmarkConfig, required): Benchmark configuration
- models (BenchmarkModelConfig[], required): Array of model configurations to benchmark
- concurrency (number, required): Maximum number of models to run in parallel
- delay (number, required): Delay in milliseconds between starting each model
onProgress (function, optional): Progress callback for real-time updates
Returns: BenchmarkRun containing:
- id (string): Unique benchmark run identifier
- timestamp (number): Start timestamp of the benchmark
- command (string): Operation ID that was benchmarked
- input (any): Input data provided to the benchmark
- config (BenchmarkConfig): Configuration used for the benchmark
- results (BenchmarkResult[]): Array of model results
- modelId (string): Provider:model[:reasoning] identifier
- output (any): Operation output
- duration (number): Duration in milliseconds
- error (string, optional): Error message if failed
- timestamp (number): Completion timestamp
- tokenUsage (object, optional): Token usage statistics
- prompt (number): Input tokens
- completion (number): Output tokens
- total (number): Total tokens
- responseSize (number, optional): Response size in bytes
- bps (number, optional): Bytes per second
- tps (number, optional): Tokens per second (output)
- timeToFirstToken (number, optional): Time to first token in ms
- cost (number, optional): Estimated cost in USD
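Built-in summary statistics across models are still on the TODO list at the end of this bulletin, but a user-land aggregation over results is straightforward; a minimal sketch:

// User-land aggregation over a BenchmarkRun; built-in summary statistics
// are not yet implemented (see OPEN QUESTIONS / TODO).
function summarize(run: BenchmarkRun) {
  const durations = run.results.filter(r => !r.error).map(r => r.duration);
  if (durations.length === 0) return null; // every model failed
  return {
    succeeded: durations.length,
    total: run.results.length,
    minMs: Math.min(...durations),
    maxMs: Math.max(...durations),
    meanMs: durations.reduce((a, b) => a + b, 0) / durations.length
  };
}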
Error Conditions:
- Invalid operation ID (operation not registered)
- Invalid input for operation
- Configuration validation errors
- Storage errors during result saving
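These conditions are expected to surface as errors thrown by runBenchmark itself; per-model failures, by contrast, are recorded in results[].error without aborting the run. A defensive-invocation sketch (assuming input and config are prepared as in the example below):

try {
  const run = await benchmarkService.runBenchmark("prd-parse", input, config);
  const failed = run.results.filter(r => r.error);
  if (failed.length > 0) {
    console.warn(`${failed.length} model(s) failed; see results[].error`);
  }
} catch (err) {
  // Run-level failures: unknown operation ID, invalid input/config, storage errors.
  console.error("Benchmark aborted:", err instanceof Error ? err.message : err);
}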
Example: PRD Parsing Benchmark
const benchmarkResult = await benchmarkService.runBenchmark(
"prd-parse",
{
file: "./requirements.md",
workingDirectory: "/path/to/project"
},
{
models: [
{ provider: "anthropic", model: "claude-3-5-sonnet" },
{ provider: "openai", model: "gpt-4o" },
{ provider: "openrouter", model: "anthropic/claude-3-opus" }
],
concurrency: 2,
delay: 1000
},
(event) => {
console.log(`[${event.type}] ${event.modelId}`);
if (event.type === 'progress') {
console.log(` Progress: ${event.currentSize} bytes @ ${event.currentBps} bps`);
} else if (event.type === 'complete') {
console.log(` Completed in ${event.duration}ms`);
} else if (event.type === 'error') {
console.error(` Error: ${event.error}`);
}
}
);
console.log(`Benchmark completed: ${benchmarkResult.id}`);
benchmarkResult.results.forEach(result => {
console.log(`\n${result.modelId}:`);
console.log(` Duration: ${result.duration}ms`);
console.log(` Tokens: ${result.tokenUsage?.total || 'N/A'}`);
console.log(` Throughput: ${result.tps || 'N/A'} tps`);
console.log(` First Token: ${result.timeToFirstToken || 'N/A'}ms`);
console.log(` Status: ${result.error ? 'FAILED' : 'SUCCESS'}`);
});
runExecutionBenchmark
Benchmark task execution with isolated git branches for each model.
async runExecutionBenchmark(
options: ExecutionBenchmarkOptions,
config: BenchmarkConfig,
onProgress?: (event: BenchmarkProgressEvent) => void
): Promise<BenchmarkRun>
Parameters:
options (ExecutionBenchmarkOptions, required): Execution benchmark options
- taskId (string, required): Task ID to execute
- verificationCommands (string[], optional): Commands to verify successful execution
- maxRetries (number, optional): Maximum retry attempts (default: 3)
- keepBranches (boolean, optional): Keep git branches after benchmark (default: true)
config (BenchmarkConfig, required): Benchmark configuration
onProgress (function, optional): Progress callback
Safety Requirements:
- Working directory must be clean (no uncommitted changes)
- Git repository must be initialized
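Both requirements can be checked up front; a minimal pre-flight sketch, assuming a Node/Bun runtime with git on the PATH:

import { execSync } from "node:child_process";

// `git status --porcelain` prints nothing when the worktree is clean;
// execSync throws if the directory is not a git repository at all.
function assertCleanWorktree(cwd: string): void {
  const status = execSync("git status --porcelain", { cwd }).toString();
  if (status.trim().length > 0) {
    throw new Error("Uncommitted changes detected; commit or stash before benchmarking.");
  }
}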
Behavior:
- Creates isolated git branch for each model: bench/{taskId}/{model}-{timestamp}
- Switches to model-specific branch
- Executes task with verification
- Captures results including branch name
- Returns to base branch after each model
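Since keepBranches defaults to true, each model's work survives as a branch you can inspect afterward. A hedged post-run inspection sketch (the checkout target shown is hypothetical; actual names follow the bench/{taskId}/{model}-{timestamp} pattern above):

import { execSync } from "node:child_process";

// List benchmark branches left behind for task 42 after a run with
// keepBranches: true (the default).
console.log(execSync("git branch --list 'bench/42/*'").toString());
// To review one model's changes (branch name shown is hypothetical):
//   git checkout bench/42/claude-3-5-sonnet-1736990400000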
Example: Task Execution Benchmark
const benchmarkResult = await benchmarkService.runExecutionBenchmark(
{
taskId: "42",
verificationCommands: ["bun test", "bun run build"],
maxRetries: 3,
keepBranches: true
},
{
models: [
{ provider: "anthropic", model: "claude-3-5-sonnet" },
{ provider: "openai", model: "gpt-4o" }
],
concurrency: 1, // Sequential execution recommended for task benchmarks
delay: 0
},
(event) => {
console.log(`[${event.type}] ${event.modelId}`);
if (event.type === 'start') {
console.log(` Starting execution...`);
} else if (event.type === 'complete') {
console.log(` Execution complete: ${event.duration}ms`);
} else if (event.type === 'error') {
console.error(` Execution failed: ${event.error}`);
}
}
);
console.log(`\nBenchmark Results:`);
benchmarkResult.results.forEach(result => {
const { modelId, output, duration, error } = result;
console.log(`\n${modelId}:`);
console.log(` Status: ${error ? 'FAILED' : output?.status}`);
console.log(` Duration: ${duration}ms`);
console.log(` Branch: ${output?.branch}`);
if (error) console.log(` Error: ${error}`);
});
runExecuteLoopBenchmark
Benchmark batch task execution with model comparison.
async runExecuteLoopBenchmark(
options: {
loopOptions: ExecuteLoopOptions;
keepBranches?: boolean;
},
config: BenchmarkConfig,
onProgress?: (event: BenchmarkProgressEvent) => void
): Promise<BenchmarkRun>
Parameters:
options (object, required):
- loopOptions (ExecuteLoopOptions, required): Loop execution options
- keepBranches (boolean, optional): Keep git branches after benchmark (default: true)
config (BenchmarkConfig, required): Benchmark configuration
onProgress (function, optional): Progress callback
Safety Requirements:
- Working directory must be clean
- Git repository must be initialized
Behavior:
- Creates isolated git branch for each model: bench/loop/{model}-{timestamp}
- Executes complete task loop for each model
- Captures overall loop results including failed task count
- Returns to base branch after each model
Example: Execute Loop Benchmark
const benchmarkResult = await benchmarkService.runExecuteLoopBenchmark(
{
loopOptions: {
status: "todo",
maxRetries: 3,
tool: "opencode",
config: {
maxRetries: 3,
autoCommit: true
},
filters: {},
dry: false
},
keepBranches: false
},
{
models: [
{ provider: "anthropic", model: "claude-3-5-sonnet" },
{ provider: "openai", model: "gpt-4o" }
],
concurrency: 1,
delay: 5000
},
(event) => {
console.log(`[${event.type}] ${event.modelId}`);
if (event.type === 'complete') {
console.log(` Loop complete in ${event.duration}ms`);
} else if (event.type === 'error') {
console.error(` Loop failed: ${event.error}`);
}
}
);
console.log(`\nLoop Benchmark Results:`);
benchmarkResult.results.forEach(result => {
const { modelId, output, duration, error } = result;
console.log(`\n${modelId}:`);
console.log(` Status: ${error ? 'FAILED' : output?.status}`);
console.log(` Duration: ${duration}ms`);
console.log(` Branch: ${output?.branch}`);
console.log(` Failed Tasks: ${output?.failedTasks || 0}`);
});
runWorkflowBenchmark
Benchmark complete project workflow from initialization through task splitting.
async runWorkflowBenchmark(
input: WorkflowBenchmarkInput,
config: BenchmarkConfig,
onProgress?: (event: BenchmarkProgressEvent) => void
): Promise<BenchmarkRun>
Parameters:
input (WorkflowBenchmarkInput, required): Workflow benchmark input
- collectedResponses (object, required): User responses for consistent execution
- projectName (string): Project name
- initMethod ("quick" | "custom" | "ai"): Initialization method
- projectDescription (string, optional): Project description
- stackConfig (object, optional): Technology stack configuration
- frontend (string, optional): Frontend framework
- backend (string, optional): Backend framework
- database (string, optional): Database type
- auth (boolean, optional): Include authentication
- prdMethod ("upload" | "manual" | "ai" | "skip"): PRD creation method
- prdContent (string, optional): PRD content
- prdDescription (string, optional): PRD description
- prdFile (string, optional): Path to PRD file
- refinePrd (boolean, optional): Whether to refine PRD
- refineFeedback (string, optional): Feedback for PRD refinement
- generateTasks (boolean, optional): Whether to generate tasks
- customInstructions (string, optional): Custom instructions for task generation
- splitTasks (boolean, optional): Whether to split tasks
- tasksToSplit (string[], optional): Specific task IDs to split
- splitInstructions (string, optional): Instructions for task splitting
- workflowOptions (WorkflowAutomationOptions, required): Workflow automation options
- projectDir (string, optional): Target project directory
- tempDirBase (string, optional): Base directory for temporary projects (default: ./benchmarks)
config (BenchmarkConfig, required): Benchmark configuration
onProgress (function, optional): Progress callback
Behavior:
- Creates isolated project directory for each model: {projectName}-{model}-{timestamp}
- Executes complete workflow: init → PRD → tasks → splitting
- Captures detailed timing for each step
- Stores results in temporary directories
Example: Complete Workflow Benchmark
const benchmarkResult = await benchmarkService.runWorkflowBenchmark(
{
collectedResponses: {
projectName: "vault-manager",
initMethod: "quick",
projectDescription: "A system for managing underground bunkers",
stackConfig: {
frontend: "next",
backend: "hono",
database: "postgres",
auth: true
},
prdMethod: "ai",
prdDescription: "Build a comprehensive vault management system",
generateTasks: true,
splitTasks: true
},
workflowOptions: {
executeTasks: false // Don't execute tasks during benchmark
},
tempDirBase: "/tmp/benchmarks"
},
{
models: [
{ provider: "anthropic", model: "claude-3-5-sonnet" },
{ provider: "openai", model: "gpt-4o" },
{ provider: "openrouter", model: "anthropic/claude-3-opus" }
],
concurrency: 1, // Sequential for workflow benchmarks
delay: 0
},
(event) => {
console.log(`[${event.type}] ${event.modelId}`);
if (event.type === 'start') {
console.log(` Starting workflow...`);
} else if (event.type === 'complete') {
console.log(` Workflow complete in ${event.duration}ms`);
} else if (event.type === 'error') {
console.error(` Workflow error: ${event.error}`);
}
}
);
console.log(`\nWorkflow Benchmark Results:`);
benchmarkResult.results.forEach(result => {
const { modelId, output, duration, error } = result;
console.log(`\n${modelId}:`);
console.log(` Status: ${error ? 'FAILED' : 'SUCCESS'}`);
console.log(` Duration: ${duration}ms`);
console.log(` Project Dir: ${output?.projectDir}`);
console.log(` Total Tasks: ${output?.stats?.totalTasks || 0}`);
console.log(` Steps: ${output?.stats?.successfulSteps}/${output?.stats?.totalSteps}`);
});
getRun
Retrieve a previously saved benchmark run.
getRun(id: string): BenchmarkRun | null
Parameters:
id (string, required): Benchmark run identifier
Returns: BenchmarkRun object or null if not found
Error Conditions: Storage access errors
Example: Retrieve Specific Benchmark
const benchmarkRun = benchmarkService.getRun("run-1736990400000");
if (benchmarkRun) {
console.log(`Found benchmark: ${benchmarkRun.id}`);
console.log(`Operation: ${benchmarkRun.command}`);
console.log(`Date: ${new Date(benchmarkRun.timestamp).toISOString()}`);
console.log(`Models: ${benchmarkRun.results.length}`);
benchmarkRun.results.forEach(result => {
console.log(`\n${result.modelId}:`);
console.log(` Duration: ${result.duration}ms`);
console.log(` Tokens: ${result.tokenUsage?.total || 'N/A'}`);
console.log(` Status: ${result.error ? 'FAILED' : 'SUCCESS'}`);
});
} else {
console.log("Benchmark run not found");
}
listRuns
List all saved benchmark runs.
listRuns(): Array<{
id: string;
timestamp: number;
command: string;
}>
Returns: Array of benchmark run metadata, sorted by timestamp (newest first)
Error Conditions: Storage access errors
Example: List All Benchmarks
const runs = benchmarkService.listRuns();
console.log(`Found ${runs.length} benchmark runs:\n`);
runs.forEach(run => {
const date = new Date(run.timestamp);
console.log(`${run.id}: ${run.command} (${date.toLocaleDateString()})`);
});
// Filter by operation type
const prdBenchmarks = runs.filter(run => run.command.startsWith("prd"));
console.log(`\nPRD benchmarks: ${prdBenchmarks.length}`);
// Filter by date range
const recentRuns = runs.filter(run => {
const runDate = new Date(run.timestamp);
const weekAgo = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000);
return runDate > weekAgo;
});
console.log(`\nRecent benchmarks (last 7 days): ${recentRuns.length}`);
WORKFLOWBENCHMARKSERVICE API
executeWorkflow
Execute a complete workflow for benchmarking purposes. Used internally by runWorkflowBenchmark.
async executeWorkflow(
input: WorkflowBenchmarkInput,
aiOptions: AIOptions,
streamingOptions?: StreamingOptions
): Promise<WorkflowBenchmarkResult["output"]>
Parameters:
input (WorkflowBenchmarkInput, required): Workflow benchmark input
aiOptions (AIOptions, required): AI configuration
streamingOptions (StreamingOptions, optional): Streaming configuration
Returns: Workflow output with detailed statistics
Note: This method creates temporary project directories and cleans them up automatically. Use applyBenchmarkResult to apply results to an actual project.
applyBenchmarkResult
Apply the results from a selected benchmark to an actual project.
async applyBenchmarkResult(
selectedResult: WorkflowBenchmarkResult,
targetProjectDir: string,
originalResponses: WorkflowBenchmarkInput["collectedResponses"]
): Promise<{ success: boolean; message: string }>
Parameters:
selectedResult (WorkflowBenchmarkResult, required): Selected benchmark result
targetProjectDir (string, required): Target project directory
originalResponses (object, required): Original user responses
Returns: Success status and message
Behavior:
- Initializes actual project with selected model
- Copies PRD content if available
- Imports tasks from benchmark result
Example: Apply Benchmark Results
const applyResult = await workflowBenchmarkService.applyBenchmarkResult(
  selectedBenchmarkResult,
  "/path/to/actual/project",
  originalUserResponses
);
if (applyResult.success) {
  console.log(`✅ ${applyResult.message}`);
} else {
  console.error(`❌ ${applyResult.message}`);
}
validateInput
Validate workflow benchmark input.
validateInput(input: any): input is WorkflowBenchmarkInput
Parameters:
input (any): Input to validate
Returns: Type guard indicating if input is valid WorkflowBenchmarkInput
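Because the return type is a type guard, a successful check narrows untyped input in place; a short usage sketch (the input file name is hypothetical):

// `raw` might come from a CLI flag, a JSON file, or an HTTP body.
const raw: unknown = JSON.parse(await Bun.file("./benchmark-input.json").text());

if (workflowBenchmarkService.validateInput(raw)) {
  // Narrowed: raw is a WorkflowBenchmarkInput inside this branch.
  console.log(`Benchmarking workflow for ${raw.collectedResponses.projectName}`);
} else {
  console.error("Invalid workflow benchmark input");
}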
TYPE DEFINITIONS
BenchmarkConfig
interface BenchmarkConfig {
models: BenchmarkModelConfig[];
concurrency: number;
delay: number;
}
BenchmarkModelConfig
interface BenchmarkModelConfig {
provider: string;
model: string;
reasoningTokens?: number;
}
BenchmarkProgressEvent
type BenchmarkProgressEvent =
| { type: "start"; modelId: string }
| { type: "progress"; modelId: string; currentSize: number; currentBps: number; chunk?: string; duration: number }
| { type: "complete"; modelId: string; duration: number }
| { type: "error"; modelId: string; error: string };BenchmarkResult
BenchmarkResult
interface BenchmarkResult {
modelId: string;
output: any;
duration: number;
error?: string;
timestamp: number;
tokenUsage?: {
prompt: number;
completion: number;
total: number;
};
responseSize?: number;
bps?: number;
tps?: number;
timeToFirstToken?: number;
cost?: number;
}
BenchmarkRun
interface BenchmarkRun {
id: string;
timestamp: number;
command: string;
input: any;
config: BenchmarkConfig;
results: BenchmarkResult[];
}
ExecutionBenchmarkOptions
interface ExecutionBenchmarkOptions {
taskId: string;
verificationCommands?: string[];
maxRetries?: number;
keepBranches?: boolean;
}
WorkflowBenchmarkInput
interface WorkflowBenchmarkInput {
collectedResponses: {
projectName: string;
initMethod: "quick" | "custom" | "ai";
projectDescription?: string;
stackConfig?: {
frontend?: string;
backend?: string;
database?: string;
auth?: boolean;
};
prdMethod: "upload" | "manual" | "ai" | "skip";
prdContent?: string;
prdDescription?: string;
prdFile?: string;
refinePrd?: boolean;
refineFeedback?: string;
generateTasks?: boolean;
customInstructions?: string;
splitTasks?: boolean;
tasksToSplit?: string[];
splitInstructions?: string;
};
workflowOptions: WorkflowAutomationOptions;
projectDir?: string;
tempDirBase?: string;
}
OPERATION REGISTRY
The following operations are registered for benchmarking via runBenchmark():
| Operation ID | Name | Description | Required Input Fields |
|-------------|------|-------------|----------------------|
| prd-parse | PRD Parsing | Parse a PRD file into tasks | file |
| prd-rework | PRD Rework | Rework a PRD based on feedback | file, feedback |
| prd-create | PRD Creation | Generate PRD from description | description |
| prd-combine | PRD Combination | Combine multiple PRD files | prds, originalDescription |
| prd-question | PRD Question Generation | Generate clarifying questions | file |
| prd-refine | PRD Refinement | Refine PRD with questions | file, questionMode, answers? |
| task-breakdown | Task Breakdown | Break down task into subtasks | taskId |
| task-create | Task Creation | Create task with AI enhancement | title |
| task-enhance | Task Enhancement | Enhance existing task | taskId |
| task-plan | Task Planning | Create implementation plan | taskId |
| task-document | Task Documentation | Generate task documentation | taskId |
| workflow-full | Complete Workflow | Execute full project workflow | Full WorkflowBenchmarkInput |
STORAGE FORMAT
Benchmark results are stored in .task-o-matic/benchmarks/{runId}/ with the following structure:
.task-o-matic/benchmarks/
└── run-{timestamp}/
    ├── metadata.json    # Run metadata
    ├── input.json       # Input data
    └── results/         # Individual model results
        ├── {modelId}.json
        ├── {modelId}.json
        └── ...
metadata.json:
{
"id": "run-1736990400000",
"timestamp": 1736990400000,
"command": "prd-parse",
"config": {
"models": [
{ "provider": "anthropic", "model": "claude-3-5-sonnet" },
{ "provider": "openai", "model": "gpt-4o" }
],
"concurrency": 2,
"delay": 1000
}
}
results/{modelId}.json:
{
"modelId": "anthropic:claude-3-5-sonnet",
"output": {
"tasks": [...],
"success": true
},
"duration": 5234,
"timestamp": 1736990405234,
"tokenUsage": {
"prompt": 1500,
"completion": 800,
"total": 2300
},
"responseSize": 45678,
"bps": 8725,
"tps": 153,
"timeToFirstToken": 234
}
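Because the layout is plain JSON on disk, stored runs can also be read directly without going through the service; a minimal sketch (assuming Node/Bun fs; note the ":" inside modelId may be encoded differently in actual file names):

import { readFileSync } from "node:fs";
import { join } from "node:path";

// Direct read of a stored run, bypassing the service. The run ID and the
// result file name are hypothetical examples.
const runDir = join(".task-o-matic", "benchmarks", "run-1736990400000");
const metadata = JSON.parse(readFileSync(join(runDir, "metadata.json"), "utf8"));
const result = JSON.parse(
  readFileSync(join(runDir, "results", "anthropic:claude-3-5-sonnet.json"), "utf8")
);
console.log(metadata.command, result.duration, result.tokenUsage?.total);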
SURVIVAL SCENARIOS
Scenario 1: Model Performance Comparison
Compare AI models on PRD parsing performance:
const benchmarkResult = await benchmarkService.runBenchmark(
"prd-parse",
{
file: "./requirements.md",
workingDirectory: "/path/to/project"
},
{
models: [
{ provider: "anthropic", model: "claude-3-5-sonnet" },
{ provider: "openai", model: "gpt-4o" },
{ provider: "openrouter", model: "qwen-2.5" }
],
concurrency: 3,
delay: 0
}
);
// Find best performing model
const successfulResults = benchmarkResult.results.filter(r => !r.error);
if (successfulResults.length === 0) throw new Error("All models failed"); // reduce() throws on an empty array
const bestModel = successfulResults.reduce((best, current) =>
  current.duration < best.duration ? current : best
);
console.log(`🏆 Best Model: ${bestModel.modelId}`);
console.log(` Duration: ${bestModel.duration}ms`);
console.log(` Tokens: ${bestModel.tokenUsage?.total}`);
console.log(` Throughput: ${bestModel.tps} tps`);
// Calculate cost per task (if pricing data available)
const costs = successfulResults.map(r => ({
modelId: r.modelId,
cost: r.cost || (r.tokenUsage?.total || 0) * 0.00001 // Rough estimate
}));
console.log(`\nCost Comparison:`);
costs.forEach(c => {
console.log(` ${c.modelId}: $${c.cost.toFixed(4)}`);
});
Scenario 2: Task Execution Regression Testing
Detect performance regressions in task execution:
const baseline = await benchmarkService.runExecutionBenchmark(
{ taskId: "42", maxRetries: 3, keepBranches: false },
{ models: [{ provider: "anthropic", model: "claude-3-5-sonnet" }], concurrency: 1, delay: 0 }
);
// After code changes, run regression test
const regressionTest = await benchmarkService.runExecutionBenchmark(
{ taskId: "42", maxRetries: 3, keepBranches: false },
{ models: [{ provider: "anthropic", model: "claude-3-5-sonnet" }], concurrency: 1, delay: 0 }
);
const baselineDuration = baseline.results[0].duration;
const currentDuration = regressionTest.results[0].duration;
const percentChange = ((currentDuration - baselineDuration) / baselineDuration) * 100;
if (percentChange > 20) {
console.error(`🚨 REGRESSION: Task execution is ${percentChange.toFixed(1)}% slower`);
} else if (percentChange < -10) {
console.log(`✅ IMPROVEMENT: Task execution is ${Math.abs(percentChange).toFixed(1)}% faster`);
} else {
console.log(`✅ STABLE: Performance within acceptable range`);
}
Scenario 3: Workflow Optimization
Find optimal model for complete workflow:
const benchmarkResult = await benchmarkService.runWorkflowBenchmark(
{
collectedResponses: {
projectName: "bunker-manager",
initMethod: "quick",
projectDescription: "Bunker management system",
prdMethod: "ai",
prdDescription: "Build bunker management",
generateTasks: true,
splitTasks: true
},
workflowOptions: { executeTasks: false },
tempDirBase: "/tmp/benchmarks"
},
{
models: [
{ provider: "anthropic", model: "claude-3-5-sonnet" },
{ provider: "openai", model: "gpt-4o-mini" }
],
concurrency: 1,
delay: 0
}
);
// Score models based on multiple factors
const scores = benchmarkResult.results.map(result => {
const stats = result.output?.stats || {};
const speedScore = 10000 / (result.duration || 1); // Shorter duration yields a higher score
const taskCountScore = stats.totalTasks || 0;
const subtasksScore = stats.tasksWithSubtasks || 0;
return {
modelId: result.modelId,
totalScore: speedScore + taskCountScore + subtasksScore,
speedScore,
taskCountScore,
subtasksScore,
duration: result.duration,
totalTasks: stats.totalTasks
};
}).sort((a, b) => b.totalScore - a.totalScore);
console.log(`\nWorkflow Model Rankings:`);
scores.forEach((s, i) => {
console.log(`\n${i + 1}. ${s.modelId}`);
console.log(` Total Score: ${s.totalScore.toFixed(0)}`);
console.log(` Speed Score: ${s.speedScore.toFixed(0)}`);
console.log(` Tasks Generated: ${s.totalTasks}`);
console.log(` Duration: ${s.duration}ms`);
});
// Apply best result to actual project
const bestResult = benchmarkResult.results.find(r => r.modelId === scores[0].modelId);
if (bestResult) {
const applyResult = await workflowBenchmarkService.applyBenchmarkResult(
bestResult,
"/path/to/actual/project",
benchmarkResult.input.collectedResponses
);
if (applyResult.success) {
console.log(`\n✅ Applied best result to project`);
}
}
TECHNICAL SPECIFICATIONS
Performance Metrics:
- Duration: Total execution time in milliseconds
- Token Usage: AI token consumption (prompt, completion, total)
- Response Size: Output size in bytes
- bps: Bytes per second (transfer rate)
- tps: Tokens per second (generation rate)
- Time to First Token: Latency to first output token
- Cost: Estimated cost in USD (if pricing data available)
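The throughput metrics follow from the raw measurements; a sketch of the assumed derivations, handy for sanity-checking stored results (the service's actual windowing, e.g. whether time-to-first-token is excluded, may differ):

// Assumed derivations for the throughput fields (they agree roughly with
// the sample stored result in STORAGE FORMAT above).
function derive(result: BenchmarkResult) {
  const seconds = result.duration / 1000;
  return {
    bps: result.responseSize !== undefined ? result.responseSize / seconds : undefined,
    tps: result.tokenUsage ? result.tokenUsage.completion / seconds : undefined
  };
}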
Concurrency Model:
- Benchmarks support parallel execution via the concurrency parameter
- Models are queued and processed up to the concurrency limit
- Delay between starting models to prevent rate limiting
- Progress events emitted for real-time monitoring
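A sketch of this scheduling model, illustrative rather than the service's actual internals: up to concurrency workers drain a shared queue, with starts staggered by delay:

// Illustrative scheduler matching the documented semantics; not the
// service's actual internals. Result order follows completion, not input order.
async function runWithConcurrency<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  concurrency: number,
  delayMs: number
): Promise<R[]> {
  const queue = [...items];
  const results: R[] = [];
  const lanes = Array.from({ length: concurrency }, async (_, lane) => {
    await new Promise(resolve => setTimeout(resolve, lane * delayMs)); // staggered starts
    while (queue.length > 0) {
      const item = queue.shift()!;
      results.push(await worker(item));
    }
  });
  await Promise.all(lanes);
  return results;
}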
Isolation Strategy:
- Execution benchmarks use git branches for isolation
- Each model gets a dedicated branch: bench/{taskId}/{model}-{timestamp}
- Workflow benchmarks use temporary directories
- Automatic cleanup unless keepBranches: true
Error Handling:
- Per-model error tracking (doesn't fail entire benchmark)
- Detailed error messages stored in results
- Graceful degradation for partial failures
- Critical failures abort benchmark and restore git state
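In practice this means a completed BenchmarkRun can mix successes and failures; a short partitioning sketch (assuming run came back from any of the run* methods):

// A completed run can contain both successes and per-model failures.
const succeeded = run.results.filter(r => !r.error);
const failed = run.results.filter(r => r.error);
console.log(`${succeeded.length} succeeded, ${failed.length} failed`);
failed.forEach(r => console.error(`  ${r.modelId}: ${r.error}`));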
Storage:
- JSON-based storage in .task-o-matic/benchmarks/
- Separate files for metadata, input, and results
- Efficient retrieval via ID lookup
- Sorted listing by timestamp
OPEN QUESTIONS / TODO
- [ ] Cost calculation: Need model-specific pricing data for accurate cost estimation
- [ ] Comparison feature: Historical comparison across runs (mentioned in original docs but not implemented)
- [ ] Summary statistics: Aggregated metrics across models (min, max, average, std dev)
- [ ] Benchmark templates: Pre-configured benchmark scenarios
- [ ] Result visualization: Charts and graphs for performance analysis
- [ ] Alert thresholds: Automatic alerts for performance regressions
- [ ] CI/CD integration: Automated benchmarking in pipelines
Remember: Citizen, BenchmarkService is your performance compass in the wasteland of AI operations. Use it to track performance, compare models, detect regressions, and make data-driven decisions about model selection. Regular benchmarking will keep your projects performing optimally even as the technical landscape shifts. Measure everything, trust the data, and your systems will remain efficient and reliable.
END OF TECHNICAL BULLETIN