Skip to content

Prototype: MCP server to aid debugging#6

Draft
daniel-thom wants to merge 1 commit into
mainfrom
feat/mcp-server
Draft

Prototype: MCP server to aid debugging#6
daniel-thom wants to merge 1 commit into
mainfrom
feat/mcp-server

Conversation

@daniel-thom

@daniel-thom daniel-thom commented Jan 12, 2026

Copy link
Copy Markdown
Collaborator

This is currently an auto-generated, untested prototype from Claude. Tasks:

  • Test with real Spark failures
  • Test with torc
  • Enhance the MCP tools
  • Improve the code
  • Address Copilot's comments

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a prototype MCP (Model Context Protocol) server that enables AI-assisted diagnosis of Spark job failures. The server analyzes logs from master, worker, executor, thrift-server, and connect-server components to detect error patterns and provide recovery suggestions.

Changes:

  • Added new sparkctl.mcp_server module with tools for log retrieval, failure analysis, recovery suggestions, and application listing
  • Implemented error pattern matching for common Spark failures (OOM, shuffle, connection, disk, etc.)
  • Added recovery suggestion engine with prioritized remediation steps
  • Created comprehensive documentation for using the MCP server with AI assistants

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
src/sparkctl/mcp_server/server.py MCP server implementation with FastMCP, tool wrappers, and entry point
src/sparkctl/mcp_server/tools.py Core tool implementations for log analysis and diagnostics
src/sparkctl/mcp_server/models.py Pydantic models for responses and error categorization
src/sparkctl/mcp_server/log_parser.py Log file discovery and parsing with multiline support
src/sparkctl/mcp_server/error_patterns.py Error pattern registry for classifying Spark failures
src/sparkctl/mcp_server/recovery.py Recovery suggestion engine with strategies per error type
src/sparkctl/mcp_server/init.py Module exports and documentation
pyproject.toml Added mcp optional dependency and sparkctl-mcp-server entry point
docs/how_tos/debugging/mcp_server.md How-to guide for using the MCP server
docs/reference/mcp_server_api.md API reference documentation
docs/how_tos/debugging/index.md Updated debugging index to include MCP server
docs/reference/index.md Updated reference index to include MCP server API

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


for log_file in log_files:
try:
lines = parser._read_lines(log_file.path, tail_lines)

Copilot AI Jan 12, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accessing a private method (_read_lines) from SparkLogParser. This breaks encapsulation and makes the code harder to maintain. Consider making this method public or providing a public interface for reading log lines with tail support.

Copilot uses AI. Check for mistakes.
def get_recovery_suggestions(
error_types: list[str],
current_config: dict[str, str] | None = None,
):

Copilot AI Jan 12, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function signature declares a return type but does not include a type annotation for the return value. Add the return type annotation explicitly for consistency with the rest of the codebase.

Suggested change
):
) -> "RecoverySuggestions":

Copilot uses AI. Check for mistakes.
Comment on lines +339 to +343
def list_spark_applications(
spark_scratch: str,
include_completed: bool = True,
include_failed: bool = True,
) -> SparkApplicationList:

Copilot AI Jan 12, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameters include_completed and include_failed are defined but never used in the function implementation. Either implement the filtering logic or remove these unused parameters from the function signature.

Copilot uses AI. Check for mistakes.
Comment on lines +243 to +245
except Exception:
logger.exception("Failed to analyze file {}", log_file)
continue

Copilot AI Jan 12, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bare except clause suppresses all exceptions, including KeyboardInterrupt and SystemExit, which can make debugging difficult. Catch specific exception types instead, such as Exception or OSError, to avoid suppressing critical system exceptions.

Copilot uses AI. Check for mistakes.

# Find the highest priority error that occurred
root_cause = None
best_priority = float("inf")

Copilot AI Jan 12, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using float("inf") for comparison may produce confusing behavior. Consider using a more explicit sentinel value like sys.maxsize or a large integer constant to represent "no priority" or "lowest priority".

Copilot uses AI. Check for mistakes.
Comment on lines +93 to +94
except Exception:
# Skip files we can't read

Copilot AI Jan 12, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bare except clause suppresses all exceptions, including KeyboardInterrupt and SystemExit, which can make debugging difficult. Catch specific exception types instead, such as Exception or OSError, to avoid suppressing critical system exceptions.

Suggested change
except Exception:
# Skip files we can't read
except OSError as exc:
# Skip files we can't read, but log the failure for debugging
logger.warning(
"Failed to read Spark log file {path}: {error}",
path=log_file.path,
error=str(exc),
)

Copilot uses AI. Check for mistakes.
try:
start_time = datetime.fromtimestamp(app_dir.stat().st_ctime)
except OSError:
pass

Copilot AI Jan 12, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bare except clause suppresses all exceptions, including KeyboardInterrupt and SystemExit. Catch specific exception types like OSError or use Exception to avoid suppressing critical system exceptions.

Suggested change
pass
logger.debug(
"Unable to read creation time for Spark application directory %s (app_id=%s)",
app_dir,
app_id,
exc_info=True,
)

Copilot uses AI. Check for mistakes.
Comment on lines +255 to +256
# Handle year ambiguity - assume 20xx
timestamp = datetime.strptime(f"20{ts_str}", "%Y/%m/%d %H:%M:%S")

Copilot AI Jan 12, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded assumption that the year prefix is "20" (representing 2000-2099). This will break for years beyond 2099. Consider using the current year or a more robust timestamp parsing approach that doesn't make assumptions about the century.

Suggested change
# Handle year ambiguity - assume 20xx
timestamp = datetime.strptime(f"20{ts_str}", "%Y/%m/%d %H:%M:%S")
# Parse two-digit year using datetime's %y handling (00-68 -> 2000-2068, 69-99 -> 1969-1999)
timestamp = datetime.strptime(ts_str, "%y/%m/%d %H:%M:%S")

Copilot uses AI. Check for mistakes.
log_type: Literal["master", "worker", "executor", "connect", "thrift", "all"] = "all",
app_id: str | None = None,
executor_id: str | None = None,
tail_lines: int = 500,

Copilot AI Jan 12, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tail_lines parameter lacks validation for negative or zero values. Consider adding input validation to ensure tail_lines is positive, or handle negative/zero values appropriately (e.g., treating 0 or negative as "no limit").

Copilot uses AI. Check for mistakes.
spark_scratch: str,
app_id: str | None = None,
include_stack_traces: bool = True,
max_errors: int = 50,

Copilot AI Jan 12, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The max_errors parameter lacks validation for negative or zero values. Consider adding input validation to ensure max_errors is positive, or document the behavior when a non-positive value is provided.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants