Prototype: MCP server to aid debugging by daniel-thom · Pull Request #6 · NatLabRockies/sparkctl

daniel-thom · 2026-01-12T15:31:53Z

This is currently an auto-generated, untested prototype from Claude. Tasks:

Copilot

Pull request overview

This PR introduces a prototype MCP (Model Context Protocol) server that enables AI-assisted diagnosis of Spark job failures. The server analyzes logs from master, worker, executor, thrift-server, and connect-server components to detect error patterns and provide recovery suggestions.

Changes:

Added new sparkctl.mcp_server module with tools for log retrieval, failure analysis, recovery suggestions, and application listing
Implemented error pattern matching for common Spark failures (OOM, shuffle, connection, disk, etc.)
Added recovery suggestion engine with prioritized remediation steps
Created comprehensive documentation for using the MCP server with AI assistants

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 13 comments.

File	Description
src/sparkctl/mcp_server/server.py	MCP server implementation with FastMCP, tool wrappers, and entry point
src/sparkctl/mcp_server/tools.py	Core tool implementations for log analysis and diagnostics
src/sparkctl/mcp_server/models.py	Pydantic models for responses and error categorization
src/sparkctl/mcp_server/log_parser.py	Log file discovery and parsing with multiline support
src/sparkctl/mcp_server/error_patterns.py	Error pattern registry for classifying Spark failures
src/sparkctl/mcp_server/recovery.py	Recovery suggestion engine with strategies per error type
src/sparkctl/mcp_server/__init__.py	Module exports and documentation
pyproject.toml	Added mcp optional dependency and sparkctl-mcp-server entry point
docs/how_tos/debugging/mcp_server.md	How-to guide for using the MCP server
docs/reference/mcp_server_api.md	API reference documentation
docs/how_tos/debugging/index.md	Updated debugging index to include MCP server
docs/reference/index.md	Updated reference index to include MCP server API

Copilot · 2026-01-12T15:41:04Z

+
+    for log_file in log_files:
+        try:
+            lines = parser._read_lines(log_file.path, tail_lines)


Accessing a private method (_read_lines) from SparkLogParser. This breaks encapsulation and makes the code harder to maintain. Consider making this method public or providing a public interface for reading log lines with tail support.
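
One possible shape for such a public interface is sketched below. The function name read_tail_lines and its signature are hypothetical and are not taken from the PR; the sketch only assumes that callers need the last N lines of a log file.

```python
from collections import deque
from pathlib import Path
from typing import Optional


def read_tail_lines(path: Path, tail_lines: Optional[int] = None) -> list:
    """Return the last `tail_lines` lines of `path`, or every line if None.

    Hypothetical public counterpart to a private `_read_lines` helper.
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        if tail_lines is None:
            return [line.rstrip("\n") for line in handle]
        # A deque with maxlen keeps only the most recent lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=tail_lines)]
```

Callers could then use read_tail_lines(log_file.path, tail_lines) without reaching into the parser's internals.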

Copilot · 2026-01-12T15:41:04Z

+def get_recovery_suggestions(
+    error_types: list[str],
+    current_config: dict[str, str] | None = None,
+):


The function signature does not include a return type annotation. Add one explicitly for consistency with the rest of the codebase.

Suggested change

-):
+) -> "RecoverySuggestions":

Copilot · 2026-01-12T15:41:05Z

+def list_spark_applications(
+    spark_scratch: str,
+    include_completed: bool = True,
+    include_failed: bool = True,
+) -> SparkApplicationList:


The parameters include_completed and include_failed are defined but never used in the function implementation. Either implement the filtering logic or remove these unused parameters from the function signature.
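
If the filtering is implemented rather than removed, it might look roughly like the sketch below. The AppInfo class and its status values are hypothetical stand-ins, since the PR's actual model fields are not shown here.

```python
from dataclasses import dataclass


@dataclass
class AppInfo:
    """Hypothetical stand-in for the PR's application model."""

    app_id: str
    status: str  # assumed values: "running", "completed", "failed"


def filter_applications(apps, include_completed=True, include_failed=True):
    """Drop applications whose status the caller asked to exclude."""
    excluded = set()
    if not include_completed:
        excluded.add("completed")
    if not include_failed:
        excluded.add("failed")
    return [app for app in apps if app.status not in excluded]
```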

Copilot · 2026-01-12T15:41:05Z

+        except Exception:
+            logger.exception("Failed to analyze file {}", log_file)
+            continue


The broad except Exception clause catches almost any runtime error, which can hide unexpected bugs. (It does not catch KeyboardInterrupt or SystemExit, which derive from BaseException.) Where possible, catch specific exception types, such as OSError.

Copilot · 2026-01-12T15:41:05Z

+
+    # Find the highest priority error that occurred
+    root_cause = None
+    best_priority = float("inf")


Using float("inf") for comparison may produce confusing behavior. Consider using a more explicit sentinel value like sys.maxsize or a large integer constant to represent "no priority" or "lowest priority".
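
For context, comparing an integer with float("inf") is well defined in Python, so either sentinel works. A third option is to avoid the sentinel altogether with min() and its default argument. The sketch below assumes that a lower number means a higher priority and that errors arrive as (priority, name) pairs; both are assumptions, not details confirmed by the PR.

```python
def pick_root_cause(errors):
    """Return the name of the highest-priority error, or None if there are none.

    `errors` is a list of (priority, name) tuples; lower priority numbers
    are assumed to be more important.
    """
    best = min(errors, key=lambda item: item[0], default=None)
    return None if best is None else best[1]
```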

Copilot · 2026-01-12T15:41:07Z

+        except Exception:
+            # Skip files we can't read


The broad except Exception clause catches almost any runtime error, which can hide unexpected bugs. (It does not catch KeyboardInterrupt or SystemExit, which derive from BaseException.) Where possible, catch specific exception types, such as OSError.

Suggested change

-        except Exception:
-            # Skip files we can't read
+        except OSError as exc:
+            # Skip files we can't read, but log the failure for debugging
+            logger.warning(
+                "Failed to read Spark log file {path}: {error}",
+                path=log_file.path,
+                error=str(exc),
+            )

Copilot · 2026-01-12T15:41:07Z

+            try:
+                start_time = datetime.fromtimestamp(app_dir.stat().st_ctime)
+            except OSError:
+                pass


This handler already catches only OSError, but pass silently discards the failure. Consider logging it at debug level so that a missing start time can be diagnosed.

Suggested change

-                pass
+                logger.debug(
+                    "Unable to read creation time for Spark application directory %s (app_id=%s)",
+                    app_dir,
+                    app_id,
+                    exc_info=True,
+                )

Copilot · 2026-01-12T15:41:08Z

+                # Handle year ambiguity - assume 20xx
+                timestamp = datetime.strptime(f"20{ts_str}", "%Y/%m/%d %H:%M:%S")


Hardcoded assumption that the year prefix is "20" (representing 2000-2099). This will break for years beyond 2099. Consider using the current year or a more robust timestamp parsing approach that doesn't make assumptions about the century.

Suggested change

-                # Handle year ambiguity - assume 20xx
-                timestamp = datetime.strptime(f"20{ts_str}", "%Y/%m/%d %H:%M:%S")
+                # Parse two-digit year using datetime's %y handling (00-68 -> 2000-2068, 69-99 -> 1969-1999)
+                timestamp = datetime.strptime(ts_str, "%y/%m/%d %H:%M:%S")
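
The %y behavior can be checked in isolation with a small sketch. The helper name parse_spark_timestamp and the sample timestamps are illustrative only; the format string is the one from the suggestion.

```python
from datetime import datetime

# Two-digit-year format from the suggested change.
SPARK_TS_FORMAT = "%y/%m/%d %H:%M:%S"


def parse_spark_timestamp(ts_str: str) -> datetime:
    """Parse a timestamp such as '26/01/12 15:31:53' (hypothetical helper)."""
    return datetime.strptime(ts_str, SPARK_TS_FORMAT)
```

Note that the pivot cuts both ways: a two-digit year of 69 or above is read as 19xx, so this approach only defers the ambiguity rather than removing it.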

Copilot · 2026-01-12T15:41:08Z

+    log_type: Literal["master", "worker", "executor", "connect", "thrift", "all"] = "all",
+    app_id: str | None = None,
+    executor_id: str | None = None,
+    tail_lines: int = 500,


The tail_lines parameter lacks validation for negative or zero values. Consider adding input validation to ensure tail_lines is positive, or handle negative/zero values appropriately (e.g., treating 0 or negative as "no limit").

Copilot · 2026-01-12T15:41:09Z

+    spark_scratch: str,
+    app_id: str | None = None,
+    include_stack_traces: bool = True,
+    max_errors: int = 50,


The max_errors parameter lacks validation for negative or zero values. Consider adding input validation to ensure max_errors is positive, or document the behavior when a non-positive value is provided.
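
A minimal sketch of the kind of check suggested for both tail_lines and max_errors follows. The helper name require_positive is hypothetical, and rejecting non-positive values (rather than treating them as "no limit") is just one of the options mentioned in the comments.

```python
def require_positive(name: str, value: int) -> int:
    """Return `value` unchanged if it is a positive int, else raise.

    Hypothetical helper; `name` is only used to build the error message.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"{name} must be an integer, got {type(value).__name__}")
    if value <= 0:
        raise ValueError(f"{name} must be a positive integer, got {value}")
    return value
```

A tool function could then begin with tail_lines = require_positive("tail_lines", tail_lines), and likewise for max_errors.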

Prototype: MCP server to aid debugging

f30c990

daniel-thom requested a review from Copilot January 12, 2026 15:31

Copilot started reviewing on behalf of daniel-thom January 12, 2026 15:32

Copilot AI reviewed Jan 12, 2026

daniel-thom wants to merge 1 commit into main from feat/mcp-server


2 participants
