Skip to content

Latest commit

 

History

History
101 lines (87 loc) · 3.91 KB

File metadata and controls

101 lines (87 loc) · 3.91 KB

Solution Guide (Interviewer Reference)

Bug Summary

Bug 1: Missing Environment Variable (Easy)

  • File: docker-compose.yml
  • Issue: DEVICE_SERVICE_URL not set for workflow-service
  • Symptom: workflow-service crashes on startup with KeyError: 'DEVICE_SERVICE_URL'
  • Fix: Add - DEVICE_SERVICE_URL=http://device-service:5001 to workflow-service environment

Bug 2: Wrong API Endpoint Path (Easy)

  • File: services/workflow-service/app.py, line ~134
  • Issue: Uses /device/{id}/reserve instead of /devices/{id}/book
  • Symptom: Starting a workflow returns 404 from device service
  • Fix: Change to f"{DEVICE_SERVICE_URL}/devices/{device_id}/book"

Bug 3: Race Condition (Medium)

  • File: services/device-service/app.py, book_device()
  • Issue: time.sleep(0.1) between status check and update, no DB locking
  • Symptom: ./test-race-condition.sh shows multiple bookings for same device
  • Fix: Use SELECT ... FOR UPDATE:
with db.session.begin():
    device = Device.query.filter_by(id=device_id).with_for_update().one()
    if device.status != "available":
        return error
    device.status = "busy"

Bug 4: N+1 Query Problem (Medium)

  • File: services/workflow-service/app.py, list_workflows()
  • Issue: For each workflow, individually queries WorkflowSample then Sample
  • Symptom: GET /workflows generates ~150+ SQL queries (visible with SQLALCHEMY_ECHO=true)
  • Fix: Use eager loading or a joined query:
from sqlalchemy.orm import joinedload
workflows = Workflow.query.options(
    joinedload(Workflow.steps)
).all()
# Then batch-load samples with a single IN query

Bug 5: Missing Index (Medium)

  • File: services/device-service/models.py, ExecutionLog.started_at
  • Issue: No index on started_at column, 500k rows
  • Symptom: GET /execution-logs/search?start=2025-01-01&end=2025-01-02 takes 5-15 seconds
  • Fix: Add index=True to the column definition or create index via SQL

Bug 6: OOM - Unbounded Query (Hard)

  • File: services/device-service/app.py, get_all_execution_logs()
  • Issue: ExecutionLog.query.all() loads 500k rows into memory
  • Symptom: GET /execution-logs crashes the container (256M memory limit)
  • Fix: Add pagination:
page = request.args.get("page", 1, type=int)
per_page = min(request.args.get("per_page", 50, type=int), 100)
pagination = ExecutionLog.query.order_by(
    ExecutionLog.started_at.desc()
).paginate(page=page, per_page=per_page)

Phase 2: Device Queue Implementation

Expected Schema

CREATE TABLE device_queue (
    id SERIAL PRIMARY KEY,
    device_id VARCHAR(50) NOT NULL,
    workflow_id VARCHAR(36) NOT NULL,
    priority VARCHAR(10) NOT NULL DEFAULT 'normal',
    priority_value INTEGER NOT NULL DEFAULT 5,
    status VARCHAR(20) NOT NULL DEFAULT 'queued',
    queued_at TIMESTAMP NOT NULL DEFAULT NOW(),
    booked_at TIMESTAMP
);

Expected Endpoints

  • POST /devices/{id}/queue - Queue a booking request with priority
  • GET /devices/{id}/queue - View current queue
  • GET /devices/{id}/wait-time - Estimated wait time

Key Things to Look For

  • Proper ordering: priority_value ASC, queued_at ASC
  • Transaction safety on release + auto-book
  • Handling edge cases (cancelled queue entries, device in maintenance)
  • Concurrency handling on queue operations

Evaluation Criteria

Phase 1 (Debugging)

  • Systematic approach: Do they check logs first? Follow the error trail?
  • Understanding: Can they explain WHY each bug exists?
  • DB knowledge: Do they recognize N+1 pattern? Know about indexes?
  • Fix quality: Are fixes correct and complete?

Phase 2 (Feature)

  • Schema design: Appropriate columns, types, constraints?
  • Priority logic: Correct ordering algorithm?
  • Concurrency: Do they consider race conditions in the queue?
  • Code quality: Clean, well-structured, handles edge cases?