Prototype: MCP server to aid debugging#6
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a prototype MCP (Model Context Protocol) server that enables AI-assisted diagnosis of Spark job failures. The server analyzes logs from master, worker, executor, thrift-server, and connect-server components to detect error patterns and provide recovery suggestions.
Changes:
- Added new
sparkctl.mcp_servermodule with tools for log retrieval, failure analysis, recovery suggestions, and application listing - Implemented error pattern matching for common Spark failures (OOM, shuffle, connection, disk, etc.)
- Added recovery suggestion engine with prioritized remediation steps
- Created comprehensive documentation for using the MCP server with AI assistants
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| src/sparkctl/mcp_server/server.py | MCP server implementation with FastMCP, tool wrappers, and entry point |
| src/sparkctl/mcp_server/tools.py | Core tool implementations for log analysis and diagnostics |
| src/sparkctl/mcp_server/models.py | Pydantic models for responses and error categorization |
| src/sparkctl/mcp_server/log_parser.py | Log file discovery and parsing with multiline support |
| src/sparkctl/mcp_server/error_patterns.py | Error pattern registry for classifying Spark failures |
| src/sparkctl/mcp_server/recovery.py | Recovery suggestion engine with strategies per error type |
| src/sparkctl/mcp_server/init.py | Module exports and documentation |
| pyproject.toml | Added mcp optional dependency and sparkctl-mcp-server entry point |
| docs/how_tos/debugging/mcp_server.md | How-to guide for using the MCP server |
| docs/reference/mcp_server_api.md | API reference documentation |
| docs/how_tos/debugging/index.md | Updated debugging index to include MCP server |
| docs/reference/index.md | Updated reference index to include MCP server API |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| for log_file in log_files: | ||
| try: | ||
| lines = parser._read_lines(log_file.path, tail_lines) |
There was a problem hiding this comment.
Accessing a private method (_read_lines) from SparkLogParser. This breaks encapsulation and makes the code harder to maintain. Consider making this method public or providing a public interface for reading log lines with tail support.
| def get_recovery_suggestions( | ||
| error_types: list[str], | ||
| current_config: dict[str, str] | None = None, | ||
| ): |
There was a problem hiding this comment.
The function signature declares a return type but does not include a type annotation for the return value. Add the return type annotation explicitly for consistency with the rest of the codebase.
| ): | |
| ) -> "RecoverySuggestions": |
| def list_spark_applications( | ||
| spark_scratch: str, | ||
| include_completed: bool = True, | ||
| include_failed: bool = True, | ||
| ) -> SparkApplicationList: |
There was a problem hiding this comment.
The parameters include_completed and include_failed are defined but never used in the function implementation. Either implement the filtering logic or remove these unused parameters from the function signature.
| except Exception: | ||
| logger.exception("Failed to analyze file {}", log_file) | ||
| continue |
There was a problem hiding this comment.
The bare except clause suppresses all exceptions, including KeyboardInterrupt and SystemExit, which can make debugging difficult. Catch specific exception types instead, such as Exception or OSError, to avoid suppressing critical system exceptions.
|
|
||
| # Find the highest priority error that occurred | ||
| root_cause = None | ||
| best_priority = float("inf") |
There was a problem hiding this comment.
Using float("inf") for comparison may produce confusing behavior. Consider using a more explicit sentinel value like sys.maxsize or a large integer constant to represent "no priority" or "lowest priority".
| except Exception: | ||
| # Skip files we can't read |
There was a problem hiding this comment.
The bare except clause suppresses all exceptions, including KeyboardInterrupt and SystemExit, which can make debugging difficult. Catch specific exception types instead, such as Exception or OSError, to avoid suppressing critical system exceptions.
| except Exception: | |
| # Skip files we can't read | |
| except OSError as exc: | |
| # Skip files we can't read, but log the failure for debugging | |
| logger.warning( | |
| "Failed to read Spark log file {path}: {error}", | |
| path=log_file.path, | |
| error=str(exc), | |
| ) |
| try: | ||
| start_time = datetime.fromtimestamp(app_dir.stat().st_ctime) | ||
| except OSError: | ||
| pass |
There was a problem hiding this comment.
The bare except clause suppresses all exceptions, including KeyboardInterrupt and SystemExit. Catch specific exception types like OSError or use Exception to avoid suppressing critical system exceptions.
| pass | |
| logger.debug( | |
| "Unable to read creation time for Spark application directory %s (app_id=%s)", | |
| app_dir, | |
| app_id, | |
| exc_info=True, | |
| ) |
| # Handle year ambiguity - assume 20xx | ||
| timestamp = datetime.strptime(f"20{ts_str}", "%Y/%m/%d %H:%M:%S") |
There was a problem hiding this comment.
Hardcoded assumption that the year prefix is "20" (representing 2000-2099). This will break for years beyond 2099. Consider using the current year or a more robust timestamp parsing approach that doesn't make assumptions about the century.
| # Handle year ambiguity - assume 20xx | |
| timestamp = datetime.strptime(f"20{ts_str}", "%Y/%m/%d %H:%M:%S") | |
| # Parse two-digit year using datetime's %y handling (00-68 -> 2000-2068, 69-99 -> 1969-1999) | |
| timestamp = datetime.strptime(ts_str, "%y/%m/%d %H:%M:%S") |
| log_type: Literal["master", "worker", "executor", "connect", "thrift", "all"] = "all", | ||
| app_id: str | None = None, | ||
| executor_id: str | None = None, | ||
| tail_lines: int = 500, |
There was a problem hiding this comment.
The tail_lines parameter lacks validation for negative or zero values. Consider adding input validation to ensure tail_lines is positive, or handle negative/zero values appropriately (e.g., treating 0 or negative as "no limit").
| spark_scratch: str, | ||
| app_id: str | None = None, | ||
| include_stack_traces: bool = True, | ||
| max_errors: int = 50, |
There was a problem hiding this comment.
The max_errors parameter lacks validation for negative or zero values. Consider adding input validation to ensure max_errors is positive, or document the behavior when a non-positive value is provided.
This is currently an auto-generated, untested prototype from Claude. Tasks: