Video Course: VisionEngine

Episode 1: Two-Layer Analysis Pipeline (12 min)

Architecture: GoCV (mechanical) + LLM Vision (intelligent)
Why two layers: cost, speed, accuracy tradeoffs
Data flow: screenshot → GoCV bounding boxes → LLM context → ScreenAnalysis
Demo: analyzing a sample UI screenshot
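
A minimal sketch of the data flow listed above, assuming the mechanical layer returns bounding boxes that are serialized into text context for the LLM layer. Only ScreenAnalysis is named in this outline; BoundingBox, Detector, Describer, Analyze, and all field names are hypothetical, and both layers are replaced by stubs so the example runs with the standard library alone.

```go
package main

import (
	"fmt"
	"image"
	"strings"
)

// BoundingBox is a hypothetical output type for the mechanical (GoCV) layer.
type BoundingBox struct {
	Rect image.Rectangle
	Kind string // assumed label, e.g. "button"
}

// ScreenAnalysis is named in the outline; its fields here are assumptions.
type ScreenAnalysis struct {
	Elements    []BoundingBox
	Description string
}

// Detector stands in for layer 1 (cheap, local computer vision).
type Detector interface {
	Detect(img image.Image) []BoundingBox
}

// Describer stands in for layer 2 (slower, paid LLM vision call).
type Describer interface {
	Describe(img image.Image, hints string) (string, error)
}

// Analyze runs the cheap layer first and hands its boxes to the LLM layer
// as plain-text hints, so the model does not have to locate elements itself.
func Analyze(img image.Image, d Detector, l Describer) (ScreenAnalysis, error) {
	boxes := d.Detect(img)
	var sb strings.Builder
	for i, b := range boxes {
		fmt.Fprintf(&sb, "element %d: %s at %v\n", i, b.Kind, b.Rect)
	}
	desc, err := l.Describe(img, sb.String())
	if err != nil {
		// Keep the mechanical result even when the LLM layer fails.
		return ScreenAnalysis{Elements: boxes}, err
	}
	return ScreenAnalysis{Elements: boxes, Description: desc}, nil
}

// stubDetector returns one fixed box instead of calling GoCV.
type stubDetector struct{}

func (stubDetector) Detect(image.Image) []BoundingBox {
	return []BoundingBox{{Rect: image.Rect(10, 10, 110, 50), Kind: "button"}}
}

// stubDescriber counts hint lines instead of calling a vision model.
type stubDescriber struct{}

func (stubDescriber) Describe(_ image.Image, hints string) (string, error) {
	n := strings.Count(hints, "\n")
	return fmt.Sprintf("screen with %d detected element(s)", n), nil
}

func main() {
	img := image.NewRGBA(image.Rect(0, 0, 320, 240))
	res, err := Analyze(img, stubDetector{}, stubDescriber{})
	if err != nil {
		panic(err)
	}
	fmt.Println(res.Description)
}
```

Keeping both layers behind small interfaces is one way to let the LLM call be skipped or swapped when cost or latency matters more than accuracy.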

Episode 2: GoCV Operations Deep Dive (15 min)

SSIM screenshot diffing and change masks
Edge detection (Canny) for UI element bounds
Contour detection for element bounding boxes
Color analysis: dominant colors, contrast ratios
Build tags: the //go:build vision constraint and the stub pattern
Demo: detecting UI changes between screenshots
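
As one illustration of the color analysis topic, the sketch below computes a contrast ratio with the WCAG 2.x relative luminance formula. It assumes opaque sRGB colors and uses only the standard library rather than GoCV; the RGB type and the function names are hypothetical and not taken from VisionEngine.

```go
package main

import (
	"fmt"
	"math"
)

// RGB is a hypothetical opaque 8-bit sRGB color.
type RGB struct{ R, G, B uint8 }

// linearize converts one 8-bit sRGB channel to linear light
// using the piecewise formula from WCAG 2.x.
func linearize(c uint8) float64 {
	v := float64(c) / 255
	if v <= 0.03928 {
		return v / 12.92
	}
	return math.Pow((v+0.055)/1.055, 2.4)
}

// RelativeLuminance returns a value in [0, 1]: 0 for black, 1 for white.
func RelativeLuminance(c RGB) float64 {
	return 0.2126*linearize(c.R) + 0.7152*linearize(c.G) + 0.0722*linearize(c.B)
}

// ContrastRatio returns (lighter + 0.05) / (darker + 0.05),
// which ranges from 1 (identical colors) to 21 (black on white).
func ContrastRatio(a, b RGB) float64 {
	la, lb := RelativeLuminance(a), RelativeLuminance(b)
	if la < lb {
		la, lb = lb, la
	}
	return (la + 0.05) / (lb + 0.05)
}

func main() {
	white := RGB{255, 255, 255}
	black := RGB{0, 0, 0}
	gray := RGB{119, 119, 119}
	fmt.Printf("white on black: %.1f\n", ContrastRatio(white, black))
	fmt.Printf("gray on white:  %.2f\n", ContrastRatio(gray, white))
}
```

In a real pipeline the two input colors would likely come from the dominant foreground and background colors of a detected element region.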

Episode 3: LLM Vision Providers (12 min)

VisionProvider interface: AnalyzeImage, CompareImages
OpenAI GPT-4o adapter: base64 image encoding, prompt structure
Anthropic Claude adapter: messages API with image content
Gemini adapter: multimodal content parts
Qwen-VL adapter: vision-language model
FallbackProvider: score-ranked provider chain
Demo: comparing provider responses for same screenshot
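
The provider chain could look roughly like the sketch below. The names VisionProvider, AnalyzeImage, CompareImages, and FallbackProvider come from this outline, but every signature, the Name and Add methods, and the scoring scheme are assumptions made for illustration. Stub providers replace the real adapters so no network access or API key is needed.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sort"
)

// VisionProvider uses the method names from the outline; the parameter
// and return types are assumptions.
type VisionProvider interface {
	Name() string
	AnalyzeImage(ctx context.Context, img []byte, prompt string) (string, error)
	CompareImages(ctx context.Context, a, b []byte, prompt string) (string, error)
}

type scored struct {
	p     VisionProvider
	score float64
}

// FallbackProvider tries providers from highest to lowest score and
// returns the first successful answer.
type FallbackProvider struct {
	chain []scored
}

// Compile-time check that the chain is itself a VisionProvider.
var _ VisionProvider = (*FallbackProvider)(nil)

// Add registers a provider and keeps the chain sorted by descending score.
func (f *FallbackProvider) Add(p VisionProvider, score float64) {
	f.chain = append(f.chain, scored{p, score})
	sort.SliceStable(f.chain, func(i, j int) bool {
		return f.chain[i].score > f.chain[j].score
	})
}

func (f *FallbackProvider) Name() string { return "fallback" }

// try walks the chain in rank order; on total failure it returns the last error.
func (f *FallbackProvider) try(call func(VisionProvider) (string, error)) (string, error) {
	lastErr := errors.New("no providers configured")
	for _, s := range f.chain {
		out, err := call(s.p)
		if err == nil {
			return out, nil
		}
		lastErr = fmt.Errorf("%s: %w", s.p.Name(), err)
	}
	return "", lastErr
}

func (f *FallbackProvider) AnalyzeImage(ctx context.Context, img []byte, prompt string) (string, error) {
	return f.try(func(p VisionProvider) (string, error) { return p.AnalyzeImage(ctx, img, prompt) })
}

func (f *FallbackProvider) CompareImages(ctx context.Context, a, b []byte, prompt string) (string, error) {
	return f.try(func(p VisionProvider) (string, error) { return p.CompareImages(ctx, a, b, prompt) })
}

// stubProvider stands in for a real adapter.
type stubProvider struct {
	name string
	fail bool
}

func (s stubProvider) Name() string { return s.name }

func (s stubProvider) AnalyzeImage(context.Context, []byte, string) (string, error) {
	if s.fail {
		return "", errors.New("unavailable")
	}
	return s.name + ": ok", nil
}

func (s stubProvider) CompareImages(context.Context, []byte, []byte, string) (string, error) {
	if s.fail {
		return "", errors.New("unavailable")
	}
	return s.name + ": same", nil
}

// demo registers the lower-scored provider first to show that rank,
// not insertion order, decides who is tried first.
func demo(failPrimary, failBackup bool) (string, error) {
	f := &FallbackProvider{}
	f.Add(stubProvider{"backup", failBackup}, 0.5)
	f.Add(stubProvider{"primary", failPrimary}, 0.9)
	return f.AnalyzeImage(context.Background(), []byte("fake-image"), "Describe this screen")
}

func main() {
	fmt.Println(demo(false, false))
	fmt.Println(demo(true, false))
}
```

Because the chain satisfies the same interface as a single adapter, callers would not need to know whether they are talking to one provider or several.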

Episode 4: Navigation Graph Building (15 min)

NavigationGraph interface: screens as nodes, actions as edges
Adding screens with visual similarity hashing
BFS pathfinding: PathTo() returns the shortest path
Coverage tracking: visited vs discovered screens
Export formats: DOT (Graphviz), JSON, Mermaid
Thread safety with sync.RWMutex
Demo: building a navigation graph from app exploration
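
The graph topics above can be sketched as a small in-memory structure. PathTo and sync.RWMutex are named in this outline; the Graph type, the other method names, the PathTo signature, and the use of plain string IDs in place of visual similarity hashes are all assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

type edge struct{ action, to string }

// Graph stores screens as nodes and actions as directed edges.
type Graph struct {
	mu      sync.RWMutex
	edges   map[string][]edge
	visited map[string]bool
}

func NewGraph() *Graph {
	return &Graph{edges: map[string][]edge{}, visited: map[string]bool{}}
}

// AddAction records that performing action on screen from leads to screen to.
// Both screens become "discovered" even if they were not known before.
func (g *Graph) AddAction(from, action, to string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.edges[from] = append(g.edges[from], edge{action, to})
	if _, ok := g.edges[to]; !ok {
		g.edges[to] = nil
	}
}

func (g *Graph) MarkVisited(id string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.visited[id] = true
}

// Coverage is visited screens divided by discovered screens.
func (g *Graph) Coverage() float64 {
	g.mu.RLock()
	defer g.mu.RUnlock()
	if len(g.edges) == 0 {
		return 0
	}
	return float64(len(g.visited)) / float64(len(g.edges))
}

// PathTo runs a breadth-first search and returns the shortest action sequence.
func (g *Graph) PathTo(from, to string) ([]string, bool) {
	g.mu.RLock()
	defer g.mu.RUnlock()
	if _, ok := g.edges[from]; !ok {
		return nil, false
	}
	if from == to {
		return []string{}, true
	}
	type step struct{ prev, action string }
	cameFrom := map[string]step{from: {}}
	queue := []string{from}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, e := range g.edges[cur] {
			if _, seen := cameFrom[e.to]; seen {
				continue
			}
			cameFrom[e.to] = step{cur, e.action}
			if e.to == to {
				var path []string // built backwards, then reversed
				for n := to; n != from; n = cameFrom[n].prev {
					path = append(path, cameFrom[n].action)
				}
				for i, j := 0, len(path)-1; i < j; i, j = i+1, j-1 {
					path[i], path[j] = path[j], path[i]
				}
				return path, true
			}
			queue = append(queue, e.to)
		}
	}
	return nil, false
}

// DOT exports the graph for Graphviz with deterministic node order.
// %q quoting is adequate for simple ASCII IDs and labels.
func (g *Graph) DOT() string {
	g.mu.RLock()
	defer g.mu.RUnlock()
	ids := make([]string, 0, len(g.edges))
	for id := range g.edges {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	var sb strings.Builder
	sb.WriteString("digraph nav {\n")
	for _, id := range ids {
		for _, e := range g.edges[id] {
			fmt.Fprintf(&sb, "  %q -> %q [label=%q];\n", id, e.to, e.action)
		}
	}
	sb.WriteString("}\n")
	return sb.String()
}

func main() {
	g := NewGraph()
	g.AddAction("home", "tap settings", "settings")
	g.AddAction("settings", "tap profile", "profile")
	g.AddAction("profile", "tap about", "about")
	g.AddAction("home", "open menu", "menu")
	g.AddAction("menu", "tap about", "about")
	g.MarkVisited("home")
	path, _ := g.PathTo("home", "about")
	fmt.Println(path, g.Coverage())
	fmt.Print(g.DOT())
}
```

Read-only operations take the read lock so several explorers can query paths concurrently while writes stay exclusive.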

Episode 5: Video Frame Extraction (10 min)

VideoProcessor interface: frame extraction, scene changes
Key frame detection at screen transitions
Thumbnail generation for timeline
Integration with SessionRecorder for timestamp linking
Demo: extracting key frames from a test session recording
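
One simple way to approximate key frame detection at screen transitions is to threshold the mean absolute pixel difference between consecutive frames. This is a stand-in technique chosen for illustration, not necessarily what VideoProcessor does; the Frame type, KeyFrames function, and threshold value are hypothetical, and frames are modeled as flat grayscale buffers instead of decoded video.

```go
package main

import (
	"fmt"
	"time"
)

// Frame is a hypothetical decoded frame: a timestamp plus grayscale pixels.
type Frame struct {
	At  time.Duration
	Pix []uint8
}

// meanAbsDiff returns the average per-pixel difference in [0, 255].
// Mismatched or empty buffers are treated as maximally different.
func meanAbsDiff(a, b []uint8) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 255
	}
	sum := 0
	for i := range a {
		d := int(a[i]) - int(b[i])
		if d < 0 {
			d = -d
		}
		sum += d
	}
	return float64(sum) / float64(len(a))
}

// KeyFrames keeps the first frame and every frame that differs from its
// predecessor by more than threshold, i.e. the first frame after a transition.
func KeyFrames(frames []Frame, threshold float64) []Frame {
	var keys []Frame
	for i, f := range frames {
		if i == 0 || meanAbsDiff(frames[i-1].Pix, f.Pix) > threshold {
			keys = append(keys, f)
		}
	}
	return keys
}

// solid builds a uniform frame of n pixels, used here as fake video input.
func solid(at time.Duration, n int, v uint8) Frame {
	pix := make([]uint8, n)
	for i := range pix {
		pix[i] = v
	}
	return Frame{At: at, Pix: pix}
}

func main() {
	frames := []Frame{
		solid(0*time.Second, 64, 10),
		solid(1*time.Second, 64, 12), // minor change: same screen
		solid(2*time.Second, 64, 200), // large change: new screen
		solid(3*time.Second, 64, 201),
		solid(4*time.Second, 64, 40), // large change: new screen
	}
	for _, k := range KeyFrames(frames, 30) {
		fmt.Println("key frame at", k.At)
	}
}
```

The timestamps on the returned frames are what would link each key frame back to a moment in the session recording.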