An AI coding skill that makes your AI assistant production-safe when writing Hive, Impala, and Spark ETL code on HDFS/YARN.
It targets bugs that stay invisible: row counts look correct and no errors are thrown, yet the data is silently wrong or performance collapses.
npx skills add Oak-B/bigdata-analysis-skill@bigdata-analysis

| Rule | Problem It Prevents |
|---|---|
| Rule 0 | DESCRIBE before coding — never guess column names or types |
| Rule 1 | Never hard-code table names in Spark source |
| Rule 2 | Keep long-text fields out of GROUP BY (control characters cause silent row explosion) |
| Rule 3 | Filter first, then aggregate — prevents OOM on billion-row tables |
| Rule 4 | Use Spark SQL, not DataFrame API (real benchmark: 3h → 15min) |
| Rule 5 | Control broadcast JOIN threshold — prevents task explosion |
| Rule 6 | Never use SELECT * in INSERT — prevents silent column shifts |
| Rule 7 | Use LEFT JOIN for optional fields — prevents silent row loss |
| Rule 8 | Refresh metadata after Spark write |
| Rule 9 | UDF type safety — nested collection return types crash at runtime |

The skill also covers date-window off-by-one errors, Scala string interpolation pitfalls, regex engine differences, and more.

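
Rule 6 is easy to see in miniature. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for Hive, Impala, or Spark SQL (an assumption made so the example runs anywhere). The tables `source` and `target` and their columns are hypothetical. It illustrates how `INSERT ... SELECT *` maps columns by position, so a reordered source schema shifts values while the row count stays the same and no error is raised.

```python
import sqlite3

# In-memory database: a stand-in engine, not Hive/Impala/Spark.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical target table with a fixed column order.
cur.execute("CREATE TABLE target (user_id TEXT, city TEXT, amount TEXT)")

# Hypothetical source table whose schema changed upstream:
# `amount` and `city` have swapped positions.
cur.execute("CREATE TABLE source (user_id TEXT, amount TEXT, city TEXT)")
cur.execute("INSERT INTO source VALUES ('u1', '42.5', 'Paris')")

# Anti-pattern: SELECT * copies values by position, not by name.
cur.execute("INSERT INTO target SELECT * FROM source")
shifted = cur.execute("SELECT user_id, city, amount FROM target").fetchall()
shifted_count = len(shifted)

# Safer pattern: name every column on both sides of the INSERT.
cur.execute("DELETE FROM target")
cur.execute(
    "INSERT INTO target (user_id, city, amount) "
    "SELECT user_id, city, amount FROM source"
)
explicit = cur.execute("SELECT user_id, city, amount FROM target").fetchall()
explicit_count = len(explicit)

print("SELECT *      :", shifted)   # city now holds the amount value
print("explicit list :", explicit)  # values land in the intended columns
print("row counts    :", shifted_count, explicit_count)

conn.close()
```

Both inserts produce one row, which is why a row-count check alone does not catch the shift. Comparing a few named columns against the source is a more reliable check.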
| Mode | Behavior |
|---|---|
| Analysis | Run SQL → present numbers → ask the user before making decisions |
| Coding | Follow the 10 rules strictly; never guess types, column order, or table names |
| Symptom | Likely Root Cause |
|---|---|
| New column all NULL / field values shifted | SELECT * + schema change (Rule 6) |
| 45+ Spark Jobs, 3-hour runtime | DataFrame API + multiple .count() (Rule 4) |
| Job timeout, 26k+ tasks | Auto-broadcast on medium table (Rule 5) |
| Row explosion, field misalignment | Control characters in GROUP BY field (Rule 2) |
| OOM on aggregation | Direct GROUP BY on billion-row table (Rule 3) |
| Silent row loss after JOIN | INNER JOIN on optional field (Rule 7) |
| Hive/Impala sees no data after write | Metadata not refreshed (Rule 8) |
| UDF NoClassDefFoundError | Nested Scala collection return type (Rule 9) |
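
The "silent row loss after JOIN" symptom can often be confirmed by comparing counts before and after the join. The following is a minimal sketch under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in for the real engine, and the `orders` and `coupons` tables are hypothetical. It compares the base row count with the INNER JOIN and LEFT JOIN counts, then lists the rows the INNER JOIN would drop.

```python
import sqlite3

# In-memory database: a stand-in engine, not Hive/Impala/Spark.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical tables: coupon_id is an optional field on orders.
cur.execute("CREATE TABLE orders (order_id INTEGER, coupon_id INTEGER)")
cur.execute("CREATE TABLE coupons (coupon_id INTEGER, label TEXT)")
cur.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10), (2, None), (3, 99)],  # order 2 has no coupon, 99 is unknown
)
cur.execute("INSERT INTO coupons VALUES (10, 'WELCOME')")

base_count = cur.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# INNER JOIN keeps only orders with a matching coupon.
inner_count = cur.execute(
    "SELECT COUNT(*) FROM orders o "
    "JOIN coupons c ON o.coupon_id = c.coupon_id"
).fetchone()[0]

# LEFT JOIN keeps every order; unmatched coupon columns become NULL.
left_count = cur.execute(
    "SELECT COUNT(*) FROM orders o "
    "LEFT JOIN coupons c ON o.coupon_id = c.coupon_id"
).fetchone()[0]

# Anti-join: the orders an INNER JOIN would silently drop.
dropped = cur.execute(
    "SELECT o.order_id FROM orders o "
    "LEFT JOIN coupons c ON o.coupon_id = c.coupon_id "
    "WHERE c.coupon_id IS NULL ORDER BY o.order_id"
).fetchall()

print("base:", base_count, "inner:", inner_count, "left:", left_count)
print("dropped by INNER JOIN:", dropped)

conn.close()
```

If the inner count is lower than the base count and the join key is optional, Rule 7 is the likely fix. The anti-join query shows exactly which rows are affected.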
bigdata-analysis/
├── SKILL.md                  # Main skill instructions (10 rules + quick reference)
└── references/
    ├── spark-pitfalls.md     # Deep-dive: root cause analysis & extended examples
    └── sql-patterns.md       # AI-specific SQL anti-patterns
- Data engineers writing Hive/Impala SQL or Spark Scala ETL jobs
- Anyone using AI coding assistants for big data workflows
- Teams that have been bitten by "data looks right but isn't" bugs
License: MIT