NatLabRockies · daniel-thom · Jan 12, 2026 · Copilot · Jan 12, 2026
diff --git a/docs/how_tos/debugging/index.md b/docs/how_tos/debugging/index.md
@@ -2,6 +2,18 @@
 # Debugging
 This page describes how to debug certain Spark errors when using sparkctl.
 
+```{eval-rst}
+.. toctree::
+    :maxdepth: 1
+
+    mcp_server
+```
+
+(ai-assisted-debugging)=
+## AI-Assisted Debugging
+sparkctl includes an MCP server that provides AI-assisted diagnosis of Spark failures.
+See {ref}`mcp-server` for details on using the MCP server with AI assistants like Claude.
+
 (spark-web-ui)=
 ## Spark web UI
 The web UI is a good first place to look for problems. Connect to ports 8080 and 4040 on the nodes

diff --git a/docs/how_tos/debugging/mcp_server.md b/docs/how_tos/debugging/mcp_server.md
@@ -0,0 +1,146 @@
+(mcp-server)=
+# MCP Server for Log Analysis
+
+sparkctl includes an MCP (Model Context Protocol) server that provides AI-assisted diagnosis
+of Spark job failures. The server analyzes logs from master, worker, executor, thrift-server,
+and connect-server components to detect error patterns and suggest recovery actions.
+
+## Installation
+
+The MCP server requires the optional `mcp` dependency:
+
+```console
+$ pip install 'sparkctl[mcp]'
+```
+
+## Running the Server
+
+Start the MCP server:
+
+```console
+$ sparkctl-mcp-server
+```
+
+The server communicates over stdio using MCP. It is designed for AI assistants that
+support MCP, such as Claude.
+
+## Available Tools
+
+The MCP server provides four tools:
+
+### get_spark_logs
+
+Retrieve and aggregate Spark logs from the cluster.
+
+**Parameters:**
+- `spark_scratch` (required): Path to the spark_scratch directory
+- `log_type`: One of "master", "worker", "executor", "connect", "thrift", or "all" (default: "all")
+- `app_id`: Filter executor logs by application ID
+- `executor_id`: Filter by specific executor ID
+- `tail_lines`: Number of lines to return from the end of each log (default: 500)
+
+**Example use case:** "Show me the last 100 lines of executor logs for app-20240115120000-0000"
+
+### analyze_spark_failure
+
+Analyze logs for error patterns and provide diagnosis. This is the primary diagnostic tool.
+
+**Parameters:**
+- `spark_scratch` (required): Path to the spark_scratch directory
+- `app_id`: Specific application to analyze
+- `include_stack_traces`: Include full stack traces (default: true)
+- `max_errors`: Maximum errors to return (default: 50)
+
+**Detected error patterns:**
+- Out of memory (OOM)
+- Shuffle failures (FetchFailedException)
+- Stage and task failures
+- Resource exhaustion
+- Connection/network issues
+- Disk space issues
+- Serialization errors
+- Timeout errors
+
+**Example use case:** "Analyze why my Spark job failed"
+
+### get_recovery_suggestions
+
+Get prioritized recovery suggestions based on detected errors.
+
+**Parameters:**
+- `error_types` (required): List of error types from `analyze_spark_failure`
+- `current_config`: Current Spark configuration (optional)
+
+**Example use case:** "How do I fix the OOM errors you found?"
+
+### list_spark_applications
+
+List Spark applications found in spark_scratch.
+
+**Parameters:**
+- `spark_scratch` (required): Path to the spark_scratch directory
+
+**Example use case:** "What applications have run in this cluster?"
+
+## Integration with torc
+
+The sparkctl MCP server is designed to work alongside [torc](https://github.com/NREL/torc)'s
+`analyze_workflow_logs` tool for full-stack diagnostics:
+
+| Layer | Tool | Diagnostics |
+|-------|------|-------------|
+| Application | sparkctl MCP | Spark-specific: OOM, shuffle, stage failures, serialization |
+| Infrastructure | torc MCP | System-level: Slurm errors, node failures, filesystem issues |
+
+When sparkctl detects system-level issues (Slurm cancellation, filesystem errors, node health
+problems), it recommends using torc's `analyze_workflow_logs` tool for further investigation.
+
+## Example Workflow
+
+1. A Spark job fails on your HPC cluster
+2. Ask your AI assistant: "Analyze my failed Spark job in ./spark_scratch"
+3. The assistant uses `analyze_spark_failure` to detect error patterns
+4. It identifies OOM errors in executors and shuffle failures
+5. The assistant uses `get_recovery_suggestions` to get fixes
+6. You apply the suggested configuration changes and rerun
+
+## Direct Python Usage
+
+The MCP tools can also be used directly in Python without the MCP server:
+
+```python
+from sparkctl.mcp_server import (
+    analyze_spark_failure,
+    get_recovery_suggestions,
+    get_spark_logs,
+    list_spark_applications,
+)
+
+# Analyze failures
+analysis = analyze_spark_failure("./spark_scratch")
+print(f"Root cause: {analysis.likely_root_cause}")
+print(f"Errors: {analysis.error_summary}")
+
+# Get recovery suggestions
+suggestions = get_recovery_suggestions(list(analysis.error_summary.keys()))
+for s in suggestions.suggestions:
+    print(f"[{s.priority}] {s.title}")
+    if s.sparkctl_command:
+        print(f"    Run: {s.sparkctl_command}")
+```
+
+## Claude Code Configuration
+
+To use the sparkctl MCP server with Claude Code, add it to your MCP configuration. The server
+requires no arguments and communicates over stdio.
+
+```json
+{
+  "mcpServers": {
+    "sparkctl": {
+      "command": "sparkctl-mcp-server",
+      "args": []
+    }
+  }
+}
+```
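
Note on `docs/how_tos/debugging/mcp_server.md`: the page lists the error patterns that `analyze_spark_failure` detects and says `get_recovery_suggestions` returns prioritized fixes, but it does not show how a log line maps to a category. The sketch below is one way that flow could look, using only the standard library. It is not the sparkctl implementation: the category names, regexes, priorities, and suggestion texts are all assumptions made for illustration, and the sample log lines are synthetic.

```python
import re
from collections import Counter

# Hypothetical categories and regexes; sparkctl's ErrorPatternRegistry may differ.
PATTERNS = {
    "oom": re.compile(r"java\.lang\.OutOfMemoryError|GC overhead limit exceeded"),
    "shuffle_failure": re.compile(r"FetchFailedException"),
    "disk_space": re.compile(r"No space left on device"),
}

# Hypothetical suggestions as (priority, title); a lower number means try it first.
SUGGESTIONS = {
    "oom": (1, "Raise spark.executor.memory or increase the partition count"),
    "shuffle_failure": (2, "Check for lost executors; consider raising spark.network.timeout"),
    "disk_space": (3, "Free space in the directories Spark uses for local scratch data"),
}


def summarize_errors(lines):
    """Count how many lines match each category (first matching category wins)."""
    counts = Counter()
    for line in lines:
        for category, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[category] += 1
                break
    return dict(counts)


def suggest(error_types):
    """Return (priority, title) pairs for the known error types, best first."""
    return sorted(SUGGESTIONS[t] for t in error_types if t in SUGGESTIONS)


# Synthetic lines for illustration only; not captured from a real cluster.
SAMPLE_LINES = [
    "synthetic: java.lang.OutOfMemoryError: Java heap space",
    "synthetic: org.apache.spark.shuffle.FetchFailedException: fetch failed",
    "synthetic: java.lang.OutOfMemoryError: GC overhead limit exceeded",
    "synthetic: INFO block manager registered",
]

summary = summarize_errors(SAMPLE_LINES)
for priority, title in suggest(summary):
    print(f"[{priority}] {title}")
```

Keeping detection (regex to category) separate from remediation (category to suggestion) mirrors the two-tool split described in the page, where the keys of the error summary are passed on to the suggestion step.
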
diff --git a/docs/reference/index.md b/docs/reference/index.md
@@ -8,6 +8,7 @@
     :hidden:
 
     sparkctl_api
+    mcp_server_api
     hpc/index
     cli_reference
 ```
diff --git a/docs/reference/mcp_server_api.md b/docs/reference/mcp_server_api.md
@@ -0,0 +1,69 @@
+(mcp-server-api)=
+
+# MCP Server API
+
+The MCP server module provides tools for AI-assisted diagnosis of Spark job failures.
+
+## Tools
+
+These functions can be used directly in Python or through the MCP server.
+
+```{eval-rst}
+.. autofunction:: sparkctl.mcp_server.get_spark_logs
+```
+
+```{eval-rst}
+.. autofunction:: sparkctl.mcp_server.analyze_spark_failure
+```
+
+```{eval-rst}
+.. autofunction:: sparkctl.mcp_server.get_recovery_suggestions
+```
+
+```{eval-rst}
+.. autofunction:: sparkctl.mcp_server.list_spark_applications
+```
+
+## Response Models
+
+```{eval-rst}
+.. autopydantic_model:: sparkctl.mcp_server.models.SparkLogsResponse
+   :members:
+```
+
+```{eval-rst}
+.. autopydantic_model:: sparkctl.mcp_server.models.SparkFailureAnalysis
+   :members:
+```
+
+```{eval-rst}
+.. autopydantic_model:: sparkctl.mcp_server.models.RecoverySuggestions
+   :members:
+```
+
+```{eval-rst}
+.. autopydantic_model:: sparkctl.mcp_server.models.SparkApplicationList
+   :members:
+```
+
+## Utilities
+
+```{eval-rst}
+.. autoclass:: sparkctl.mcp_server.SparkLogParser
+   :members:
+```
+
+```{eval-rst}
+.. autoclass:: sparkctl.mcp_server.SparkLogLocator
+   :members:
+```
+
+```{eval-rst}
+.. autoclass:: sparkctl.mcp_server.ErrorPatternRegistry
+   :members:
+```
+
+```{eval-rst}
+.. autoclass:: sparkctl.mcp_server.RecoveryEngine
+   :members:
+```
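
Note on `docs/reference/mcp_server_api.md`: the response models are documented only through `autopydantic_model`, so the diff itself does not show their shape. The sketch below uses plain dataclasses as stand-ins and includes only the attributes that the "Direct Python Usage" example reads (`likely_root_cause`, `error_summary`, `suggestions`, `priority`, `title`, `sparkctl_command`). The class names ending in `Sketch`, the field types, and the defaults are assumptions; the real Pydantic models may carry more fields and different types.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SuggestionSketch:
    priority: int                           # assumed numeric; lower means more urgent
    title: str
    sparkctl_command: Optional[str] = None  # assumed optional, as the usage example guards on it


@dataclass
class RecoverySuggestionsSketch:
    suggestions: list[SuggestionSketch] = field(default_factory=list)


@dataclass
class FailureAnalysisSketch:
    likely_root_cause: str
    error_summary: dict[str, int] = field(default_factory=dict)  # assumed: error type -> count


def render(analysis, recovery):
    """Format the same fields that the documented usage example prints."""
    lines = [f"Root cause: {analysis.likely_root_cause}"]
    for s in sorted(recovery.suggestions, key=lambda s: s.priority):
        lines.append(f"[{s.priority}] {s.title}")
        if s.sparkctl_command:
            lines.append(f"    Run: {s.sparkctl_command}")
    return lines


analysis = FailureAnalysisSketch("executor out of memory", {"oom": 2})
recovery = RecoverySuggestionsSketch(
    [
        SuggestionSketch(2, "Increase shuffle partitions"),
        SuggestionSketch(1, "Increase executor memory", "<sparkctl command>"),
    ]
)
report = render(analysis, recovery)
print("\n".join(report))
```

The `<sparkctl command>` string is a placeholder, since the diff does not show what commands the recovery engine emits.
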
diff --git a/pyproject.toml b/pyproject.toml
@@ -39,6 +39,9 @@ dependencies = [
 pyspark = [
     "pyspark == 4.0.0",
 ]
+mcp = [
+    "mcp >= 1.7.0",
+]
 dev = [
     "furo",
     "mypy >=1.13, < 2",
@@ -64,6 +67,7 @@ Source = "https://github.com/NREL/sparkctl"
 
 [project.scripts]
 sparkctl = "sparkctl.cli.sparkctl:cli"
+sparkctl-mcp-server = "sparkctl.mcp_server.server:main"
 
 [tool.setuptools.packages.find]
 where = ["src"]

diff --git a/src/sparkctl/mcp_server/__init__.py b/src/sparkctl/mcp_server/__init__.py
@@ -0,0 +1,62 @@
+"""MCP Server for sparkctl Spark job failure diagnostics.
+
+This module provides an MCP (Model Context Protocol) server that diagnoses
+Spark job failures by analyzing logs from master, worker, executor,
+thrift-server, and connect-server components.
+
+Tools provided:
+- get_spark_logs: Retrieve and aggregate Spark logs
+- analyze_spark_failure: Detect error patterns and diagnose issues
+- get_recovery_suggestions: Get remediation suggestions for detected errors
+- list_spark_applications: List Spark applications in spark_scratch
+
+Usage:
+    Run the MCP server with: sparkctl-mcp-server
+
+The server is designed to work alongside torc's analyze_workflow_logs tool
+for full-stack diagnostics (Spark + Slurm/system-level).
+"""
+
+from sparkctl.mcp_server.error_patterns import ErrorPatternRegistry
+from sparkctl.mcp_server.log_parser import SparkLogLocator, SparkLogParser
+from sparkctl.mcp_server.models import (
+    ErrorCategory,
+    ErrorOccurrence,
+    LogEntry,
+    RecoverySuggestions,
+    SparkApplication,
+    SparkApplicationList,
+    SparkFailureAnalysis,
+    SparkLogsResponse,
+    Suggestion,
+)
+from sparkctl.mcp_server.recovery import RecoveryEngine
+from sparkctl.mcp_server.tools import (
+    analyze_spark_failure,
+    get_recovery_suggestions,
+    get_spark_logs,
+    list_spark_applications,
+)
+
+__all__ = [
+    # Models
+    "ErrorCategory",
+    "ErrorOccurrence",
+    "LogEntry",
+    "RecoverySuggestions",
+    "SparkApplication",
+    "SparkApplicationList",
+    "SparkFailureAnalysis",
+    "SparkLogsResponse",
+    "Suggestion",
+    # Tools (can be used directly without MCP)
+    "analyze_spark_failure",
+    "get_recovery_suggestions",
+    "get_spark_logs",
+    "list_spark_applications",
+    # Utilities
+    "ErrorPatternRegistry",
+    "RecoveryEngine",
+    "SparkLogLocator",
+    "SparkLogParser",
+]