Solution Guide (Interviewer Reference)
Bug 1: Missing Environment Variable (Easy)
File: docker-compose.yml
Issue: DEVICE_SERVICE_URL not set for workflow-service
Symptom: workflow-service crashes on startup with KeyError: 'DEVICE_SERVICE_URL'
Fix: Add - DEVICE_SERVICE_URL=http://device-service:5001 to workflow-service environment
Bug 2: Wrong API Endpoint Path (Easy)
File: services/workflow-service/app.py, line ~134
Issue: Uses /device/{id}/reserve instead of /devices/{id}/book
Symptom: Starting a workflow returns 404 from device service
Fix: Change to f"{DEVICE_SERVICE_URL}/devices/{device_id}/book"
Bug 3: Race Condition (Medium)
File: services/device-service/app.py, book_device()
Issue: time.sleep(0.1) between status check and update, no DB locking
Symptom: ./test-race-condition.sh shows multiple bookings for same device
Fix: Use SELECT ... FOR UPDATE:
with db .session .begin ():
device = Device .query .filter_by (id = device_id ).with_for_update ().one ()
if device .status != "available" :
return error
device .status = "busy"
Bug 4: N+1 Query Problem (Medium)
File: services/workflow-service/app.py, list_workflows()
Issue: For each workflow, individually queries WorkflowSample then Sample
Symptom: GET /workflows generates ~150+ SQL queries (visible with SQLALCHEMY_ECHO=true)
Fix: Use eager loading or a joined query:
from sqlalchemy .orm import joinedload
workflows = Workflow .query .options (
joinedload (Workflow .steps )
).all ()
# Then batch-load samples with a single IN query
Bug 5: Missing Index (Medium)
File: services/device-service/models.py, ExecutionLog.started_at
Issue: No index on started_at column, 500k rows
Symptom: GET /execution-logs/search?start=2025-01-01&end=2025-01-02 takes 5-15 seconds
Fix: Add index=True to the column definition or create index via SQL
Bug 6: OOM - Unbounded Query (Hard)
File: services/device-service/app.py, get_all_execution_logs()
Issue: ExecutionLog.query.all() loads 500k rows into memory
Symptom: GET /execution-logs crashes the container (256M memory limit)
Fix: Add pagination:
page = request .args .get ("page" , 1 , type = int )
per_page = min (request .args .get ("per_page" , 50 , type = int ), 100 )
pagination = ExecutionLog .query .order_by (
ExecutionLog .started_at .desc ()
).paginate (page = page , per_page = per_page )
Phase 2: Device Queue Implementation
CREATE TABLE device_queue (
id SERIAL PRIMARY KEY ,
device_id VARCHAR (50 ) NOT NULL ,
workflow_id VARCHAR (36 ) NOT NULL ,
priority VARCHAR (10 ) NOT NULL DEFAULT ' normal' ,
priority_value INTEGER NOT NULL DEFAULT 5 ,
status VARCHAR (20 ) NOT NULL DEFAULT ' queued' ,
queued_at TIMESTAMP NOT NULL DEFAULT NOW(),
booked_at TIMESTAMP
);
POST /devices/{id}/queue - Queue a booking request with priority
GET /devices/{id}/queue - View current queue
GET /devices/{id}/wait-time - Estimated wait time
Proper ordering: priority_value ASC, queued_at ASC
Transaction safety on release + auto-book
Handling edge cases (cancelled queue entries, device in maintenance)
Concurrency handling on queue operations
Systematic approach: Do they check logs first? Follow the error trail?
Understanding: Can they explain WHY each bug exists?
DB knowledge: Do they recognize N+1 pattern? Know about indexes?
Fix quality: Are fixes correct and complete?
Schema design: Appropriate columns, types, constraints?
Priority logic: Correct ordering algorithm?
Concurrency: Do they consider race conditions in the queue?
Code quality: Clean, well-structured, handles edge cases?