Prototype: MCP server to aid debugging #6
Draft: daniel-thom wants to merge 1 commit into main from feat/mcp-server
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,146 @@ | ||
| (mcp-server)= | ||
| # MCP Server for Log Analysis | ||
|
|
||
| sparkctl includes an MCP (Model Context Protocol) server that provides AI-assisted diagnosis | ||
| of Spark job failures. The server analyzes logs from master, worker, executor, thrift-server, | ||
| and connect-server components to detect error patterns and suggest recovery actions. | ||
|
|
||
| ## Installation | ||
|
|
||
| The MCP server requires the optional `mcp` dependency: | ||
|
|
||
| ```console | ||
| $ pip install 'sparkctl[mcp]' | ||
| ``` | ||
|
|
||
| ## Running the Server | ||
|
|
||
| Start the MCP server: | ||
|
|
||
| ```console | ||
| $ sparkctl-mcp-server | ||
| ``` | ||
|
|
||
| The server communicates over stdio using MCP. It is designed for use with | ||
| AI assistants that support MCP, such as Claude. | ||
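
For orientation, here is a minimal sketch of what a single tool invocation could look like on the wire. MCP is built on JSON-RPC 2.0, and the stdio transport exchanges newline-delimited JSON messages. The helper name `build_tool_call`, the request id, and the argument values are illustrative assumptions and are not part of this PR.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as one newline-terminated JSON line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps escapes newlines inside values, so the message stays on one line.
    return json.dumps(message) + "\n"


# Hypothetical request: the tool name and parameter come from the docs in this PR.
line = build_tool_call(1, "analyze_spark_failure", {"spark_scratch": "./spark_scratch"})
print(line, end="")
```

A real client also performs the `initialize` handshake before calling tools; MCP-capable assistants handle that step themselves, so this sketch is only meant to show the message shape.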
|
|
||
| ## Available Tools | ||
|
|
||
| The MCP server provides four tools: | ||
|
|
||
| ### get_spark_logs | ||
|
|
||
| Retrieve and aggregate Spark logs from the cluster. | ||
|
|
||
| **Parameters:** | ||
| - `spark_scratch` (required): Path to the spark_scratch directory | ||
| - `log_type`: One of "master", "worker", "executor", "connect", "thrift", or "all" (default: "all") | ||
| - `app_id`: Filter executor logs by application ID | ||
| - `executor_id`: Filter by specific executor ID | ||
| - `tail_lines`: Number of lines from end of each log (default: 500) | ||
|
|
||
| **Example use case:** "Show me the last 100 lines of executor logs for app-20240115120000-0000" | ||
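
To illustrate what a `tail_lines` limit means in practice, the following sketch returns the last N lines of a log file while streaming it. The helper `tail_lines_of` and the file name are hypothetical; this is not necessarily how the PR's `get_spark_logs` reads files.

```python
import tempfile
from collections import deque
from pathlib import Path


def tail_lines_of(path: Path, n: int = 500) -> list[str]:
    """Return the last `n` lines of a text file without holding the whole file."""
    with path.open("r", errors="replace") as f:
        # A bounded deque keeps only the most recent `n` lines while iterating.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


# Demonstrate on a synthetic 1000-line log written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "executor.log"
    log.write_text("\n".join(f"line {i}" for i in range(1, 1001)) + "\n")
    last = tail_lines_of(log, n=100)

print(len(last), last[0], last[-1])  # 100 line 901 line 1000
```

Streaming through a bounded `deque` keeps memory use proportional to `n` rather than to the size of the log, which matters for long-running executors.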
|
|
||
| ### analyze_spark_failure | ||
|
|
||
| Analyze logs for error patterns and provide diagnosis. This is the primary diagnostic tool. | ||
|
|
||
| **Parameters:** | ||
| - `spark_scratch` (required): Path to the spark_scratch directory | ||
| - `app_id`: Specific application to analyze | ||
| - `include_stack_traces`: Include full stack traces (default: true) | ||
| - `max_errors`: Maximum errors to return (default: 50) | ||
|
|
||
| **Detected error patterns:** | ||
| - Out of memory (OOM) | ||
| - Shuffle failures (FetchFailedException) | ||
| - Stage and task failures | ||
| - Resource exhaustion | ||
| - Connection/network issues | ||
| - Disk space issues | ||
| - Serialization errors | ||
| - Timeout errors | ||
|
|
||
| **Example use case:** "Analyze why my Spark job failed" | ||
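
As a rough sketch of how pattern-based detection like this can work, the snippet below counts log lines per error category using regular expressions. The category names, the regexes, and the sample lines are all assumptions made for illustration; the PR's `ErrorPatternRegistry` may use different patterns and categories.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration only.
PATTERNS = {
    "oom": re.compile(r"java\.lang\.OutOfMemoryError|GC overhead limit exceeded"),
    "shuffle_failure": re.compile(r"FetchFailedException"),
    "disk_space": re.compile(r"No space left on device"),
    "timeout": re.compile(r"TimeoutException|timed out", re.IGNORECASE),
}


def summarize_errors(lines):
    """Count how many log lines match each error category."""
    counts = Counter()
    for line in lines:
        for category, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[category] += 1
    return dict(counts)


# Synthetic, simplified lines; not copied from a real Spark log.
sample = [
    "ERROR synthetic: java.lang.OutOfMemoryError: Java heap space",
    "WARN synthetic: task lost due to FetchFailedException",
    "INFO synthetic: block manager registered",
    "ERROR synthetic: java.io.IOException: No space left on device",
]
summary = summarize_errors(sample)
print(summary)
```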
|
|
||
| ### get_recovery_suggestions | ||
|
|
||
| Get prioritized recovery suggestions based on detected errors. | ||
|
|
||
| **Parameters:** | ||
| - `error_types` (required): List of error types from analyze_spark_failure | ||
| - `current_config`: Current Spark configuration (optional) | ||
|
|
||
| **Example use case:** "How do I fix the OOM errors you found?" | ||
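
The idea of prioritized suggestions can be sketched as a lookup from error type to a ranked hint. The error-type strings, the `Hint` class, and the hint texts below are hypothetical and do not reflect the PR's `RecoveryEngine` rules; the Spark property names mentioned in the hints are standard Spark configuration keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hint:
    priority: int  # 1 = try first
    title: str


# Hypothetical mapping for illustration only.
HINTS = {
    "oom": Hint(1, "Increase spark.executor.memory or use more, smaller partitions"),
    "shuffle_failure": Hint(2, "Raise spark.sql.shuffle.partitions and check for lost executors"),
    "disk_space": Hint(1, "Free space in the directories configured by spark.local.dir"),
}


def suggest(error_types):
    """Return known hints for the given error types, most urgent first."""
    found = [HINTS[e] for e in error_types if e in HINTS]
    # Sort by priority, then by title so the order is deterministic.
    return sorted(found, key=lambda h: (h.priority, h.title))


hints = suggest(["shuffle_failure", "oom", "unknown_error"])
for h in hints:
    print(f"[{h.priority}] {h.title}")
```

Unknown error types are skipped rather than raising, which keeps the sketch tolerant of categories it has no rule for.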
|
|
||
| ### list_spark_applications | ||
|
|
||
| List Spark applications found in spark_scratch. | ||
|
|
||
| **Parameters:** | ||
| - `spark_scratch` (required): Path to the spark_scratch directory | ||
|
|
||
| **Example use case:** "What applications have run in this cluster?" | ||
|
|
||
| ## Integration with torc | ||
|
|
||
| The sparkctl MCP server is designed to work alongside [torc](https://github.com/NREL/torc)'s | ||
| `analyze_workflow_logs` tool for full-stack diagnostics: | ||
|
|
||
| | Layer | Tool | Diagnostics | | ||
| |-------|------|-------------| | ||
| | Application | sparkctl MCP | Spark-specific: OOM, shuffle, stage failures, serialization | | ||
| | Infrastructure | torc MCP | System-level: Slurm errors, node failures, filesystem issues | | ||
|
|
||
| When sparkctl detects system-level issues (Slurm cancellation, filesystem errors, node health | ||
| problems), it will recommend using torc's analyze_workflow_logs tool for further investigation. | ||
|
|
||
| ## Example Workflow | ||
|
|
||
| 1. A Spark job fails on your HPC cluster | ||
| 2. Ask your AI assistant: "Analyze my failed Spark job in ./spark_scratch" | ||
| 3. The assistant uses `analyze_spark_failure` to detect error patterns | ||
| 4. It identifies OOM errors in executors and shuffle failures | ||
| 5. The assistant uses `get_recovery_suggestions` to get fixes | ||
| 6. You apply the suggested configuration changes and rerun | ||
|
|
||
| ## Direct Python Usage | ||
|
|
||
| The MCP tools can also be used directly in Python without the MCP server: | ||
|
|
||
| ```python | ||
| from sparkctl.mcp_server import ( | ||
|     analyze_spark_failure, | ||
|     get_recovery_suggestions, | ||
|     get_spark_logs, | ||
|     list_spark_applications, | ||
| ) | ||
|
|
||
| # Analyze failures | ||
| analysis = analyze_spark_failure("./spark_scratch") | ||
| print(f"Root cause: {analysis.likely_root_cause}") | ||
| print(f"Errors: {analysis.error_summary}") | ||
|
|
||
| # Get recovery suggestions | ||
| suggestions = get_recovery_suggestions(list(analysis.error_summary.keys())) | ||
| for s in suggestions.suggestions: | ||
|     print(f"[{s.priority}] {s.title}") | ||
|     if s.sparkctl_command: | ||
|         print(f" Run: {s.sparkctl_command}") | ||
| ``` | ||
|
|
||
| ## Claude Code Configuration | ||
|
|
||
| To use the sparkctl MCP server with Claude Code, add it to your MCP configuration. The server | ||
| requires no arguments and communicates over stdio. | ||
|
|
||
| ```json | ||
| { | ||
|   "mcpServers": { | ||
|     "sparkctl": { | ||
|       "command": "sparkctl-mcp-server", | ||
|       "args": [] | ||
|     } | ||
|   } | ||
| } | ||
| ``` | ||
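
Where this JSON lives depends on the MCP client, so the following sketch leaves the file path to the caller. It adds or updates one entry under `mcpServers` without disturbing other servers already configured. The helper name `add_mcp_server` and the file name `mcp_config.json` are assumptions for illustration; consult your client's documentation for the actual config location.

```python
import json
import tempfile
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, command: str, args=None) -> dict:
    """Insert or update one entry under "mcpServers", preserving other entries."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": list(args or [])}
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demonstrate against a temporary file that already lists another (made-up) server.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "mcp_config.json"
    existing = {"mcpServers": {"other": {"command": "other-server", "args": []}}}
    path.write_text(json.dumps(existing))
    add_mcp_server(path, "sparkctl", "sparkctl-mcp-server")
    merged = json.loads(path.read_text())

print(sorted(merged["mcpServers"]))  # ['other', 'sparkctl']
```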
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -8,6 +8,7 @@ | |
| :hidden: | ||
|
|
||
| sparkctl_api | ||
| mcp_server_api | ||
| hpc/index | ||
| cli_reference | ||
| ``` | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,69 @@ | ||
| (mcp-server-api)= | ||
|
|
||
| # MCP Server API | ||
|
|
||
| The MCP server module provides tools for AI-assisted diagnosis of Spark job failures. | ||
|
|
||
| ## Tools | ||
|
|
||
| These functions can be used directly in Python or through the MCP server. | ||
|
|
||
| ```{eval-rst} | ||
| .. autofunction:: sparkctl.mcp_server.get_spark_logs | ||
| ``` | ||
|
|
||
| ```{eval-rst} | ||
| .. autofunction:: sparkctl.mcp_server.analyze_spark_failure | ||
| ``` | ||
|
|
||
| ```{eval-rst} | ||
| .. autofunction:: sparkctl.mcp_server.get_recovery_suggestions | ||
| ``` | ||
|
|
||
| ```{eval-rst} | ||
| .. autofunction:: sparkctl.mcp_server.list_spark_applications | ||
| ``` | ||
|
|
||
| ## Response Models | ||
|
|
||
| ```{eval-rst} | ||
| .. autopydantic_model:: sparkctl.mcp_server.models.SparkLogsResponse | ||
|    :members: | ||
| ``` | ||
|
|
||
| ```{eval-rst} | ||
| .. autopydantic_model:: sparkctl.mcp_server.models.SparkFailureAnalysis | ||
|    :members: | ||
| ``` | ||
|
|
||
| ```{eval-rst} | ||
| .. autopydantic_model:: sparkctl.mcp_server.models.RecoverySuggestions | ||
|    :members: | ||
| ``` | ||
|
|
||
| ```{eval-rst} | ||
| .. autopydantic_model:: sparkctl.mcp_server.models.SparkApplicationList | ||
|    :members: | ||
| ``` | ||
|
|
||
| ## Utilities | ||
|
|
||
| ```{eval-rst} | ||
| .. autoclass:: sparkctl.mcp_server.SparkLogParser | ||
|    :members: | ||
| ``` | ||
|
|
||
| ```{eval-rst} | ||
| .. autoclass:: sparkctl.mcp_server.SparkLogLocator | ||
|    :members: | ||
| ``` | ||
|
|
||
| ```{eval-rst} | ||
| .. autoclass:: sparkctl.mcp_server.ErrorPatternRegistry | ||
|    :members: | ||
| ``` | ||
|
|
||
| ```{eval-rst} | ||
| .. autoclass:: sparkctl.mcp_server.RecoveryEngine | ||
|    :members: | ||
| ``` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,62 @@ | ||
| """MCP Server for sparkctl Spark job failure diagnostics. | ||
|
|
||
| This module provides an MCP (Model Context Protocol) server that diagnoses | ||
| Spark job failures by analyzing logs from master, worker, executor, | ||
| thrift-server, and connect-server components. | ||
|
|
||
| Tools provided: | ||
| - get_spark_logs: Retrieve and aggregate Spark logs | ||
| - analyze_spark_failure: Detect error patterns and diagnose issues | ||
| - get_recovery_suggestions: Get remediation suggestions for detected errors | ||
| - list_spark_applications: List Spark applications in spark_scratch | ||
|
|
||
| Usage: | ||
| Run the MCP server with: sparkctl-mcp-server | ||
|
|
||
| The server is designed to work alongside torc's analyze_workflow_logs tool | ||
| for full-stack diagnostics (Spark + Slurm/system-level). | ||
| """ | ||
|
|
||
| from sparkctl.mcp_server.error_patterns import ErrorPatternRegistry | ||
| from sparkctl.mcp_server.log_parser import SparkLogLocator, SparkLogParser | ||
| from sparkctl.mcp_server.models import ( | ||
|     ErrorCategory, | ||
|     ErrorOccurrence, | ||
|     LogEntry, | ||
|     RecoverySuggestions, | ||
|     SparkApplication, | ||
|     SparkApplicationList, | ||
|     SparkFailureAnalysis, | ||
|     SparkLogsResponse, | ||
|     Suggestion, | ||
| ) | ||
| from sparkctl.mcp_server.recovery import RecoveryEngine | ||
| from sparkctl.mcp_server.tools import ( | ||
|     analyze_spark_failure, | ||
|     get_recovery_suggestions, | ||
|     get_spark_logs, | ||
|     list_spark_applications, | ||
| ) | ||
|
|
||
| __all__ = [ | ||
|     # Models | ||
|     "ErrorCategory", | ||
|     "ErrorOccurrence", | ||
|     "LogEntry", | ||
|     "RecoverySuggestions", | ||
|     "SparkApplication", | ||
|     "SparkApplicationList", | ||
|     "SparkFailureAnalysis", | ||
|     "SparkLogsResponse", | ||
|     "Suggestion", | ||
|     # Tools (can be used directly without MCP) | ||
|     "analyze_spark_failure", | ||
|     "get_recovery_suggestions", | ||
|     "get_spark_logs", | ||
|     "list_spark_applications", | ||
|     # Utilities | ||
|     "ErrorPatternRegistry", | ||
|     "RecoveryEngine", | ||
|     "SparkLogLocator", | ||
|     "SparkLogParser", | ||
| ] |
The documentation example shows using 'Claude Code', but the configuration snippet refers to a generic Claude configuration. Consider clarifying whether this is specifically for Claude Desktop, Claude Code, or other MCP-compatible clients, and providing configuration instructions for each supported client.