12 changes: 12 additions & 0 deletions docs/how_tos/debugging/index.md
Original file line number Diff line number Diff line change
@@ -2,6 +2,18 @@
# Debugging
This page describes how to debug certain Spark errors when using sparkctl.

```{eval-rst}
.. toctree::
    :maxdepth: 1

    mcp_server
```

(ai-assisted-debugging)=
## AI-Assisted Debugging
sparkctl includes an MCP server that provides AI-assisted diagnosis of Spark failures.
See {ref}`mcp-server` for details on using the MCP server with AI assistants like Claude.

(spark-web-ui)=
## Spark web UI
The web UI is a good first place to look for problems. Connect to ports 8080 and 4040 on the nodes
146 changes: 146 additions & 0 deletions docs/how_tos/debugging/mcp_server.md
@@ -0,0 +1,146 @@
(mcp-server)=
# MCP Server for Log Analysis

sparkctl includes an MCP (Model Context Protocol) server that provides AI-assisted diagnosis
of Spark job failures. The server analyzes logs from master, worker, executor, thrift-server,
and connect-server components to detect error patterns and suggest recovery actions.

## Installation

The MCP server requires the optional `mcp` dependency:

```console
$ pip install 'sparkctl[mcp]'
```

## Running the Server

Start the MCP server:

```console
$ sparkctl-mcp-server
```

The server communicates with its client over stdio using MCP. It is designed for AI assistants
that support MCP, such as Claude, which launch the server as a subprocess.
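
As a rough illustration of what travels over stdio: MCP messages are JSON-RPC 2.0 objects, and a client invokes a tool with a `tools/call` request. The sketch below only builds such a message. The helper name `build_tool_call` is hypothetical, and a real client must first complete the MCP `initialize` handshake, which an MCP client library normally handles.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a single-line JSON-RPC message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # The stdio transport delimits messages with newlines, so keep it on one line.
    return json.dumps(message)


request = build_tool_call(
    1, "list_spark_applications", {"spark_scratch": "./spark_scratch"}
)
print(request)
```

In practice the AI assistant builds and sends these requests for you; the sketch is only meant to show the shape of a tool invocation.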

## Available Tools

The MCP server provides four tools:

### get_spark_logs

Retrieve and aggregate Spark logs from the cluster.

**Parameters:**
- `spark_scratch` (required): Path to the spark_scratch directory
- `log_type`: One of "master", "worker", "executor", "connect", "thrift", or "all" (default: "all")
- `app_id`: Filter executor logs by application ID
- `executor_id`: Filter by specific executor ID
- `tail_lines`: Number of lines to return from the end of each log file (default: 500)

**Example use case:** "Show me the last 100 lines of executor logs for app-20240115120000-0000"
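
The effect of `tail_lines` can be pictured with a small sketch. This is not sparkctl's implementation: `tail` is a hypothetical helper that keeps only the last N lines of a log by using a bounded deque, and the in-memory `StringIO` stands in for a real log file.

```python
from collections import deque
from io import StringIO
from typing import Iterable


def tail(lines: Iterable[str], n: int = 500) -> list[str]:
    """Return the last `n` lines, holding at most `n` lines in memory."""
    return list(deque(lines, maxlen=n))


# Stand-in for an executor log file; a real log would be opened with open().
log = StringIO("".join(f"line {i}\n" for i in range(1, 1001)))
last = tail(log, n=100)
print(len(last), last[0].strip(), last[-1].strip())
```

A bounded deque avoids loading a large executor log fully into memory, which matters when logs grow to gigabytes.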

### analyze_spark_failure

Analyze logs for error patterns and provide diagnosis. This is the primary diagnostic tool.

**Parameters:**
- `spark_scratch` (required): Path to the spark_scratch directory
- `app_id`: Specific application to analyze
- `include_stack_traces`: Include full stack traces (default: true)
- `max_errors`: Maximum errors to return (default: 50)

**Detected error patterns:**
- Out of memory (OOM)
- Shuffle failures (FetchFailedException)
- Stage and task failures
- Resource exhaustion
- Connection/network issues
- Disk space issues
- Serialization errors
- Timeout errors

**Example use case:** "Analyze why my Spark job failed"
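
To make the idea of pattern detection concrete, here is a minimal sketch that scans log lines with regular expressions. It is an illustration only: the pattern names, the `count_errors` helper, and the sample lines are assumptions, and sparkctl's actual pattern registry may differ.

```python
import re
from collections import Counter

# Illustrative patterns only; sparkctl's actual registry may differ.
PATTERNS = {
    "oom": re.compile(r"java\.lang\.OutOfMemoryError"),
    "shuffle_failure": re.compile(r"FetchFailedException"),
    "disk_space": re.compile(r"No space left on device"),
}


def count_errors(lines):
    """Count how many log lines match each error pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Simplified, made-up log lines for illustration.
sample = [
    "ERROR Executor: java.lang.OutOfMemoryError: Java heap space",
    "WARN TaskSetManager: FetchFailedException: Failed to connect",
    "INFO BlockManager: Registered block manager",
]
print(dict(count_errors(sample)))
```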

### get_recovery_suggestions

Get prioritized recovery suggestions based on detected errors.

**Parameters:**
- `error_types` (required): List of error types reported by `analyze_spark_failure`
- `current_config`: Current Spark configuration (optional)

**Example use case:** "How do I fix the OOM errors you found?"
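
Prioritized suggestions can be thought of as a lookup from error type to ranked hints. The sketch below assumes a hypothetical `Hint` class, `HINTS` table, and `suggest` function; the hint texts are common Spark tuning starting points, not sparkctl's actual rules.

```python
from dataclasses import dataclass


@dataclass
class Hint:
    priority: int  # 1 = try first
    title: str


# Hypothetical mapping from error type to hints; not sparkctl's real rules.
HINTS = {
    "oom": [Hint(1, "Increase spark.executor.memory")],
    "shuffle_failure": [Hint(2, "Increase spark.sql.shuffle.partitions")],
}


def suggest(error_types):
    """Collect hints for the given error types, most urgent first."""
    hints = [h for e in error_types for h in HINTS.get(e, [])]
    return sorted(hints, key=lambda h: h.priority)


for hint in suggest(["shuffle_failure", "oom", "unknown"]):
    print(f"[{hint.priority}] {hint.title}")
```

Unknown error types are skipped rather than raising, so the caller can pass the full list of detected errors.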

### list_spark_applications

List the Spark applications found in the `spark_scratch` directory.

**Parameters:**
- `spark_scratch` (required): Path to the spark_scratch directory

**Example use case:** "What applications have run in this cluster?"

## Integration with torc

The sparkctl MCP server is designed to work alongside [torc](https://github.com/NREL/torc)'s
`analyze_workflow_logs` tool for full-stack diagnostics:

| Layer | Tool | Diagnostics |
|-------|------|-------------|
| Application | sparkctl MCP | Spark-specific: OOM, shuffle, stage failures, serialization |
| Infrastructure | torc MCP | System-level: Slurm errors, node failures, filesystem issues |

When sparkctl detects system-level issues (Slurm cancellation, filesystem errors, node health
problems), it recommends using torc's `analyze_workflow_logs` tool for further investigation.
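
The split between the two layers can be sketched as a simple routing decision. The marker strings and the `diagnostic_layer` function below are hypothetical and chosen only for illustration; the real detection logic in sparkctl may differ.

```python
# Hypothetical markers; the real detection logic in sparkctl may differ.
INFRASTRUCTURE_MARKERS = ("slurmstepd", "Stale file handle", "Input/output error")


def diagnostic_layer(log_line: str) -> str:
    """Return which tool's layer a log line most likely belongs to."""
    if any(marker in log_line for marker in INFRASTRUCTURE_MARKERS):
        return "infrastructure (torc MCP)"
    return "application (sparkctl MCP)"


# Simplified, made-up log lines for illustration.
print(diagnostic_layer("slurmstepd: error: *** JOB 123 CANCELLED ***"))
print(diagnostic_layer("java.lang.OutOfMemoryError: Java heap space"))
```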

## Example Workflow

1. A Spark job fails on your HPC cluster
2. Ask your AI assistant: "Analyze my failed Spark job in ./spark_scratch"
3. The assistant uses `analyze_spark_failure` to detect error patterns
4. It identifies OOM errors in executors and shuffle failures
5. The assistant uses `get_recovery_suggestions` to get fixes
6. You apply the suggested configuration changes and rerun

## Direct Python Usage

The MCP tools can also be used directly in Python without the MCP server:

```python
from sparkctl.mcp_server import (
    analyze_spark_failure,
    get_recovery_suggestions,
    get_spark_logs,
    list_spark_applications,
)

# Analyze failures
analysis = analyze_spark_failure("./spark_scratch")
print(f"Root cause: {analysis.likely_root_cause}")
print(f"Errors: {analysis.error_summary}")

# Get recovery suggestions
suggestions = get_recovery_suggestions(list(analysis.error_summary.keys()))
for s in suggestions.suggestions:
    print(f"[{s.priority}] {s.title}")
    if s.sparkctl_command:
        print(f" Run: {s.sparkctl_command}")
```

## Claude Code Configuration

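To use the sparkctl MCP server with Claude Code, add the following entry to your MCP
configuration (for example, a project-scoped `.mcp.json` file). The server requires no
arguments and communicates over stdio. Other MCP clients, such as Claude Desktop, accept the
same `mcpServers` format in their own configuration files.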

```json
{
  "mcpServers": {
    "sparkctl": {
      "command": "sparkctl-mcp-server",
      "args": []
    }
  }
}
```
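
If you already have other MCP servers configured, the sparkctl entry should be merged in rather than overwriting them. A minimal sketch of that merge, assuming a hypothetical `add_server` helper and a made-up existing entry named `other`:

```python
import json


def add_server(config: dict, name: str, command: str) -> dict:
    """Return a copy of `config` with one more entry under "mcpServers"."""
    servers = dict(config.get("mcpServers", {}))
    servers[name] = {"command": command, "args": []}
    return {**config, "mcpServers": servers}


# Made-up existing configuration with one unrelated server.
existing = {"mcpServers": {"other": {"command": "other-server", "args": []}}}
updated = add_server(existing, "sparkctl", "sparkctl-mcp-server")
print(json.dumps(updated, indent=2))
```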
Comment on lines +132 to +146

Copilot AI Jan 12, 2026

The documentation example shows using 'Claude Code', but the configuration snippet refers to a generic Claude configuration. Consider clarifying whether this is specifically for Claude Desktop, Claude Code, or other MCP-compatible clients, and providing configuration instructions for each supported client.

1 change: 1 addition & 0 deletions docs/reference/index.md
@@ -8,6 +8,7 @@
    :hidden:

    sparkctl_api
    mcp_server_api
    hpc/index
    cli_reference
```
69 changes: 69 additions & 0 deletions docs/reference/mcp_server_api.md
@@ -0,0 +1,69 @@
(mcp-server-api)=

# MCP Server API

The MCP server module provides tools for AI-assisted diagnosis of Spark job failures.

## Tools

These functions can be used directly in Python or through the MCP server.

```{eval-rst}
.. autofunction:: sparkctl.mcp_server.get_spark_logs
```

```{eval-rst}
.. autofunction:: sparkctl.mcp_server.analyze_spark_failure
```

```{eval-rst}
.. autofunction:: sparkctl.mcp_server.get_recovery_suggestions
```

```{eval-rst}
.. autofunction:: sparkctl.mcp_server.list_spark_applications
```

## Response Models

```{eval-rst}
.. autopydantic_model:: sparkctl.mcp_server.models.SparkLogsResponse
    :members:
```

```{eval-rst}
.. autopydantic_model:: sparkctl.mcp_server.models.SparkFailureAnalysis
    :members:
```

```{eval-rst}
.. autopydantic_model:: sparkctl.mcp_server.models.RecoverySuggestions
    :members:
```

```{eval-rst}
.. autopydantic_model:: sparkctl.mcp_server.models.SparkApplicationList
    :members:
```

## Utilities

```{eval-rst}
.. autoclass:: sparkctl.mcp_server.SparkLogParser
    :members:
```

```{eval-rst}
.. autoclass:: sparkctl.mcp_server.SparkLogLocator
    :members:
```

```{eval-rst}
.. autoclass:: sparkctl.mcp_server.ErrorPatternRegistry
    :members:
```

```{eval-rst}
.. autoclass:: sparkctl.mcp_server.RecoveryEngine
    :members:
```
4 changes: 4 additions & 0 deletions pyproject.toml
@@ -39,6 +39,9 @@ dependencies = [
pyspark = [
    "pyspark == 4.0.0",
]
mcp = [
    "mcp >= 1.7.0",
]
dev = [
    "furo",
    "mypy >=1.13, < 2",
@@ -64,6 +67,7 @@ Source = "https://github.com/NREL/sparkctl"

[project.scripts]
sparkctl = "sparkctl.cli.sparkctl:cli"
sparkctl-mcp-server = "sparkctl.mcp_server.server:main"

[tool.setuptools.packages.find]
where = ["src"]
62 changes: 62 additions & 0 deletions src/sparkctl/mcp_server/__init__.py
@@ -0,0 +1,62 @@
"""MCP Server for sparkctl Spark job failure diagnostics.

This module provides an MCP (Model Context Protocol) server that diagnoses
Spark job failures by analyzing logs from master, worker, executor,
thrift-server, and connect-server components.

Tools provided:
- get_spark_logs: Retrieve and aggregate Spark logs
- analyze_spark_failure: Detect error patterns and diagnose issues
- get_recovery_suggestions: Get remediation suggestions for detected errors
- list_spark_applications: List Spark applications in spark_scratch

Usage:
    Run the MCP server with: sparkctl-mcp-server

The server is designed to work alongside torc's analyze_workflow_logs tool
for full-stack diagnostics (Spark + Slurm/system-level).
"""

from sparkctl.mcp_server.error_patterns import ErrorPatternRegistry
from sparkctl.mcp_server.log_parser import SparkLogLocator, SparkLogParser
from sparkctl.mcp_server.models import (
    ErrorCategory,
    ErrorOccurrence,
    LogEntry,
    RecoverySuggestions,
    SparkApplication,
    SparkApplicationList,
    SparkFailureAnalysis,
    SparkLogsResponse,
    Suggestion,
)
from sparkctl.mcp_server.recovery import RecoveryEngine
from sparkctl.mcp_server.tools import (
    analyze_spark_failure,
    get_recovery_suggestions,
    get_spark_logs,
    list_spark_applications,
)

__all__ = [
    # Models
    "ErrorCategory",
    "ErrorOccurrence",
    "LogEntry",
    "RecoverySuggestions",
    "SparkApplication",
    "SparkApplicationList",
    "SparkFailureAnalysis",
    "SparkLogsResponse",
    "Suggestion",
    # Tools (can be used directly without MCP)
    "analyze_spark_failure",
    "get_recovery_suggestions",
    "get_spark_logs",
    "list_spark_applications",
    # Utilities
    "ErrorPatternRegistry",
    "RecoveryEngine",
    "SparkLogLocator",
    "SparkLogParser",
]