Feat: Implement hierarchical split functionality for PDF bookmarks #799

Open
Chapoly1305 wants to merge 2 commits into torakiki:master from Chapoly1305:pdfsplit-tree

Conversation

@Chapoly1305

This pull request introduces support for hierarchical splitting of PDF files by bookmarks, enabling the creation of directory structures based on bookmark levels. The main changes include the addition of a new task and parameters to handle hierarchical output, enhancements to the task execution controller to route hierarchical split requests, and updates to module dependencies.

Demo
In Settings.
[Screenshot 2025-11-13 at 12 47 50 PM]

Exported PDF Destination Folders.
[image]

Tested on macOS.

Hierarchical Split Functionality

  • Added HierarchicalSplitByBookmarksTask, which splits PDFs by bookmarks and creates directories for parent bookmarks, placing child splits within them. It supports overlap and auto-detection of overlapping pages for more accurate splits.
  • Introduced HierarchicalSplitByOutlineLevelParameters, an extension of SplitByOutlineLevelParameters, to configure hierarchical output, overlap pages, and auto-detect overlap options.
  • Updated SplitByOutlineLevelParametersBuilder to include options for hierarchical output, overlap pages, and auto-detect overlap.

Task Execution Routing

  • Modified TaskExecutionController to detect and route hierarchical split requests to the new task using reflection, avoiding direct dependencies and circular references. [1] [2]

Dependency and Module Configuration

  • Added necessary dependencies (sejda-core, sejda-commons, sejda-sambox) to pdfsam-split-by-bookmarks and updated the module descriptor to require these modules, ensuring the new functionality can access required PDF processing features. [1] [2]

Copilot AI review requested due to automatic review settings March 12, 2026 17:10

Copilot AI left a comment

Pull request overview

Adds a new “hierarchical output” mode to the Split by bookmarks tool, including optional overlap-page handling, by extending the parameters and introducing a custom split task that creates a folder per parent bookmark.

Changes:

  • Adds UI controls to enable hierarchical directory output and configure overlap pages, persisting these in workspace state.
  • Extends the split-by-bookmarks parameters builder to emit a new HierarchicalSplitByOutlineLevelParameters type.
  • Introduces HierarchicalSplitByBookmarksTask and routes execution via a special-case path in TaskExecutionController using reflection.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| pdfsam-tools/pdfsam-split-by-bookmarks/src/main/java/org/pdfsam/tools/splitbybookmarks/SplitOptionsPane.java | Adds hierarchical output + overlap UI, reset/apply logic, and workspace persistence. |
| pdfsam-tools/pdfsam-split-by-bookmarks/src/main/java/org/pdfsam/tools/splitbybookmarks/SplitByOutlineLevelParametersBuilder.java | Adds hierarchical/overlap fields and builds hierarchical parameters when enabled. |
| pdfsam-tools/pdfsam-split-by-bookmarks/src/main/java/org/pdfsam/tools/splitbybookmarks/HierarchicalSplitByOutlineLevelParameters.java | New parameters subtype carrying hierarchical + overlap settings. |
| pdfsam-tools/pdfsam-split-by-bookmarks/src/main/java/org/pdfsam/tools/splitbybookmarks/HierarchicalSplitByBookmarksTask.java | New task implementing hierarchical splitting and overlap logic. |
| pdfsam-tools/pdfsam-split-by-bookmarks/src/main/java/module-info.java | Adds module requirements for Sejda/SAMBox + logging needed by the new task. |
| pdfsam-tools/pdfsam-split-by-bookmarks/pom.xml | Adds Sejda dependencies needed by the new task implementation. |
| pdfsam-service/src/main/java/org/pdfsam/service/task/TaskExecutionController.java | Special-cases hierarchical params and executes the hierarchical task via reflection. |

Comment on lines +106 to +112
var overlapLabel = new Label(i18n().tr("Additional overlap pages:"));
GridPane.setValignment(overlapLabel, VPos.BOTTOM);
GridPane.setHalignment(overlapLabel, HPos.LEFT);
add(overlapLabel, 0, 3);
overlapPagesField.setId("overlapPages");
overlapPagesField.setPromptText("0");
overlapPagesField.setPrefWidth(100);

Copilot AI Mar 12, 2026

overlapLabel is not associated with overlapPagesField (missing setLabelFor), which makes the new input harder to use with assistive technologies and breaks the pattern used for the other fields in this pane (e.g., regexLabel.setLabelFor(regexpField)).

Copilot uses AI. Check for mistakes.
Comment on lines +146 to +156
builder.hierarchicalOutput(hierarchicalOutputCheck.isSelected());

// Auto-detect is now always enabled when hierarchical output is selected
builder.autoDetectOverlap(hierarchicalOutputCheck.isSelected());

if (isNotBlank(overlapPagesField.getText())) {
    try {
        int overlap = Integer.parseInt(overlapPagesField.getText().trim());
        if (overlap >= 0) {
            builder.overlapPages(overlap);
        } else {

Copilot AI Mar 12, 2026

The overlap pages input is parsed and applied to the builder even when hierarchical output is not selected. However, SplitByOutlineLevelParametersBuilder.build() ignores overlapPages unless hierarchicalOutput is true, so user input can be silently dropped. Consider disabling/hiding this field unless hierarchical output is enabled, or only calling builder.overlapPages(...) when hierarchicalOutputCheck.isSelected() is true (and possibly surfacing a validation error if the user entered a value while hierarchical mode is off).

if (overlap >= 0) {
    builder.overlapPages(overlap);
} else {
    onError.accept(i18n().tr("Overlap pages must be a positive number"));

Copilot AI Mar 12, 2026

The validation/message for negative overlap values is inconsistent: the code accepts overlap >= 0 (so 0 is valid) but the error says "must be a positive number" (which usually excludes 0). Either change the check to > 0 or update the message to explicitly allow 0 (e.g., non-negative / zero or positive).

Suggested change
- onError.accept(i18n().tr("Overlap pages must be a positive number"));
+ onError.accept(i18n().tr("Overlap pages must be zero or a positive number"));
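
One way to keep the check and the message aligned is to centralize the parsing in a small helper. The sketch below is illustrative only: `OverlapInput` and `parse` are hypothetical names that do not appear in this PR, and it assumes zero is meant to be a valid overlap.

```java
import java.util.OptionalInt;

// Hypothetical helper, not part of the PR: parses the overlap field and
// accepts zero or any positive integer, using one shared error message.
final class OverlapInput {
    static final String ERROR_MESSAGE = "Overlap pages must be zero or a positive number";

    private OverlapInput() {
    }

    /** Returns the parsed overlap, or empty when the field is blank. */
    static OptionalInt parse(String text) {
        if (text == null || text.isBlank()) {
            return OptionalInt.empty();
        }
        int value;
        try {
            value = Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(ERROR_MESSAGE, e);
        }
        if (value < 0) {
            throw new IllegalArgumentException(ERROR_MESSAGE);
        }
        return OptionalInt.of(value);
    }
}
```

A caller could then apply the value only when hierarchical output is selected, which would also address the silently dropped input noted earlier in this review.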

- SplitByOutlineLevelParameters params = new SplitByOutlineLevelParameters(level);
+ SplitByOutlineLevelParameters params;
  if (hierarchicalOutput) {
      params = new HierarchicalSplitByOutlineLevelParameters(level, true, overlapPages, autoDetectOverlap);

Copilot AI Mar 12, 2026

build() passes a constant true for the hierarchicalOutput constructor parameter (new HierarchicalSplitByOutlineLevelParameters(level, true, ...)). This makes the builder state and constructor API inconsistent and can hide bugs if the constructor argument is later used for behavior. Pass the hierarchicalOutput field instead, or remove the constructor parameter/field entirely since the type already implies hierarchical mode.

Suggested change
- params = new HierarchicalSplitByOutlineLevelParameters(level, true, overlapPages, autoDetectOverlap);
+ params = new HierarchicalSplitByOutlineLevelParameters(level, hierarchicalOutput, overlapPages, autoDetectOverlap);

Comment on lines +46 to +77
  void hierarchicalOutput(boolean hierarchicalOutput) {
      this.hierarchicalOutput = hierarchicalOutput;
  }

  boolean isHierarchicalOutput() {
      return hierarchicalOutput;
  }

  void overlapPages(int overlapPages) {
      this.overlapPages = overlapPages;
  }

  int getOverlapPages() {
      return overlapPages;
  }

  void autoDetectOverlap(boolean autoDetectOverlap) {
      this.autoDetectOverlap = autoDetectOverlap;
  }

  boolean isAutoDetectOverlap() {
      return autoDetectOverlap;
  }

  @Override
  public SplitByOutlineLevelParameters build() {
-     SplitByOutlineLevelParameters params = new SplitByOutlineLevelParameters(level);
+     SplitByOutlineLevelParameters params;
      if (hierarchicalOutput) {
          params = new HierarchicalSplitByOutlineLevelParameters(level, true, overlapPages, autoDetectOverlap);
      } else {
          params = new SplitByOutlineLevelParameters(level);
      }

Copilot AI Mar 12, 2026

There is existing unit coverage for this builder, but the newly added hierarchical settings are untested. Please add tests that verify: (1) hierarchicalOutput(true) causes build() to return HierarchicalSplitByOutlineLevelParameters, and (2) overlapPages/autoDetectOverlap are propagated as expected.

Comment on lines +309 to +319
// Calculate the actual end page with overlap
int totalPages = documentHandler.getUnderlyingPDDocument().getNumberOfPages();
int actualEndPage = Math.min(section.endPage + overlapPages, totalPages);

LOG.debug("Splitting section '{}' from page {} to {} (with {} pages overlap)",
        section.title, section.startPage, actualEndPage, overlapPages);

// Extract pages for this section (including overlap)
for (int pageNum = section.startPage; pageNum < actualEndPage; pageNum++) {
    handler.addPage(documentHandler.getUnderlyingPDDocument().getPage(pageNum - 1)); // SAMBox uses 0-based indexing
}

Copilot AI Mar 12, 2026

The end-page math in splitSection is inconsistent with the loop bounds: actualEndPage is clamped to totalPages, and the loop uses < actualEndPage, so page totalPages is never included. This becomes visible for the last section and also when overlap pushes the end past the last page. Use a consistent convention (exclusive end bound usually implies clamping to totalPages + 1 and iterating < endExclusive, or inclusive bound implies clamping to totalPages and iterating <= endInclusive).
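
A minimal sketch of the inclusive-bound convention, assuming `section.endPage` is a 1-based inclusive page number (the diff does not confirm which convention the field uses). `SectionPages.pagesFor` is a hypothetical helper, not code from this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the 1-based page numbers of a section, treating
// endPage as inclusive and clamping the overlap to the document length.
final class SectionPages {
    private SectionPages() {
    }

    static List<Integer> pagesFor(int startPage, int endPage, int overlapPages, int totalPages) {
        // Inclusive end bound: clamp to totalPages and iterate with <=.
        int endInclusive = Math.min(endPage + overlapPages, totalPages);
        List<Integer> pages = new ArrayList<>();
        for (int page = startPage; page <= endInclusive; page++) {
            pages.add(page);
        }
        return pages;
    }
}
```

With this convention, a last section spanning pages 8 to 10 of a 10-page document yields pages 8, 9 and 10, and an overlap that would run past the end is cut off at the final page.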

Comment on lines +95 to +105
File outputDirectory = parameters.getOutput().getDestination();

// Get overlap configuration if this is a hierarchical split
int overlapPages = 0;
boolean autoDetectOverlap = false;
if (parameters instanceof HierarchicalSplitByOutlineLevelParameters) {
    HierarchicalSplitByOutlineLevelParameters hierarchicalParams = (HierarchicalSplitByOutlineLevelParameters) parameters;
    overlapPages = hierarchicalParams.getOverlapPages();
    autoDetectOverlap = hierarchicalParams.isAutoDetectOverlap();
    LOG.debug("Using overlap of {} pages, auto-detect: {}", overlapPages, autoDetectOverlap);
}

Copilot AI Mar 12, 2026

This task writes output PDFs directly to parentDir using sanitizeFilename(section.title) and does not apply several SplitByOutlineLevelParameters settings that the UI/builder sets (e.g., ExistingOutputPolicy, outputPrefix, and parameters.getOutput().accept(...) handling). This can lead to overwriting files even when the policy is SKIP/FAIL and to filenames that ignore the configured prefix. Consider using the same output writing/name-generation approach used by other tasks in pdfsam-service (e.g., OutputWriters.newMultipleOutputWriter(...) + NameGenerator), or otherwise explicitly honor the parameters here.
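
As a rough illustration of checking the prefix and the existing-output policy before writing, here is a sketch under simplifying assumptions. The `Policy` enum is a local stand-in rather than Sejda's `ExistingOutputPolicy`, and `resolve` is a hypothetical helper; the real task would more likely delegate to the shared output writers mentioned above.

```java
import java.io.File;
import java.util.Optional;

final class OutputPolicySketch {
    // Local stand-in for the configured policy; not the Sejda enum.
    enum Policy { OVERWRITE, SKIP, FAIL }

    private OutputPolicySketch() {
    }

    /**
     * Returns the file to write, or empty when an existing file must be skipped.
     * Throws when the policy is FAIL and the target already exists.
     */
    static Optional<File> resolve(File parentDir, String prefix, String title, Policy policy) {
        File target = new File(parentDir, prefix + title + ".pdf");
        if (!target.exists() || policy == Policy.OVERWRITE) {
            return Optional.of(target);
        }
        if (policy == Policy.SKIP) {
            return Optional.empty();
        }
        throw new IllegalStateException("Output file already exists: " + target);
    }
}
```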

Comment on lines +300 to +336
private void splitSection(BookmarkSection section, File parentDir, SplitByOutlineLevelParameters parameters,
        int currentStep, int overlapPages) throws TaskException {
    try {
        PDDocumentHandler handler = new PDDocumentHandler();
        handler.setCreatorOnPDDocument();
        handler.setVersionOnPDDocument(
                parameters.getVersion() != null ? parameters.getVersion() : PdfVersion.VERSION_1_6);
        handler.setCompress(parameters.isCompress());

        // Calculate the actual end page with overlap
        int totalPages = documentHandler.getUnderlyingPDDocument().getNumberOfPages();
        int actualEndPage = Math.min(section.endPage + overlapPages, totalPages);

        LOG.debug("Splitting section '{}' from page {} to {} (with {} pages overlap)",
                section.title, section.startPage, actualEndPage, overlapPages);

        // Extract pages for this section (including overlap)
        for (int pageNum = section.startPage; pageNum < actualEndPage; pageNum++) {
            handler.addPage(documentHandler.getUnderlyingPDDocument().getPage(pageNum - 1)); // SAMBox uses 0-based indexing
        }

        String filename = sanitizeFilename(section.title) + ".pdf";
        File outputFile = new File(parentDir, filename);

        handler.savePDDocument(outputFile);
        handler.close();

        LOG.debug("Created file: {}", outputFile);

        // Report progress
        BigDecimal percentComplete = BigDecimal.valueOf((currentStep / (double) totalSteps) * 100);
        GlobalNotificationContext.getContext()
                .notifyListeners(new PercentageOfWorkDoneChangedEvent(percentComplete, NotifiableTaskMetadata.NULL));
    } catch (Exception e) {
        throw new TaskException("Failed to split section: " + section.title, e);
    }
}

Copilot AI Mar 12, 2026

splitSection only closes the PDDocumentHandler handler on the success path. If any exception is thrown while adding pages or saving, the handler/document may leak. Use a try/finally (or try-with-resources if supported) to ensure the handler is always closed.
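
The pattern can be sketched with a stand-in resource. `FakeHandler` below is hypothetical and only mimics a closeable handler; whether the real `PDDocumentHandler` can be used directly in a try-with-resources statement is an assumption to verify against the Sejda sources.

```java
import java.util.ArrayList;
import java.util.List;

final class CloseOnFailureSketch {
    // Hypothetical stand-in for the document handler; records whether close() ran.
    static final class FakeHandler implements AutoCloseable {
        final List<Integer> pages = new ArrayList<>();
        boolean closed;

        void addPage(int pageNumber) {
            if (pageNumber < 1) {
                throw new IllegalArgumentException("Invalid page: " + pageNumber);
            }
            pages.add(pageNumber);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    private CloseOnFailureSketch() {
    }

    // Copies pages first..last (inclusive); the handler is closed on every path,
    // including when addPage throws.
    static void copyPages(FakeHandler handler, int first, int last) {
        try (FakeHandler h = handler) {
            for (int page = first; page <= last; page++) {
                h.addPage(page);
            }
        }
    }
}
```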

// Replace invalid filename characters with underscores
String sanitized = filename.replaceAll("[\\\\/:*?\"<>|]", "_");
// Remove leading/trailing spaces and dots
sanitized = sanitized.trim().replaceAll("^\\.+", "");

Copilot AI Mar 12, 2026

The comment in sanitizeFilename says it removes "leading/trailing spaces and dots", but the implementation only trims spaces and removes leading dots (replaceAll("^\\.+", "")). Either update the implementation to also strip trailing dots (and possibly trailing spaces already handled by trim()), or correct the comment so it matches behavior.

Suggested change
- sanitized = sanitized.trim().replaceAll("^\\.+", "");
+ sanitized = sanitized.trim().replaceAll("^\\.+", "").replaceAll("\\.+$", "");
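
A slightly stricter variant is sketched below for comparison. It strips any mix of leading or trailing whitespace and dots in one pass, so a title such as `Intro .` does not keep a trailing space. `FilenameSanitizer` is a hypothetical class name, and the `untitled` fallback for an empty result is an assumption, not behavior shown in this PR.

```java
// Hypothetical standalone version of the sanitizer discussed above.
final class FilenameSanitizer {
    private FilenameSanitizer() {
    }

    static String sanitize(String name) {
        // Replace characters that are invalid in common file systems.
        String sanitized = name.replaceAll("[\\\\/:*?\"<>|]", "_");
        // Strip leading and trailing runs of whitespace and dots.
        sanitized = sanitized.replaceAll("^[\\s.]+", "").replaceAll("[\\s.]+$", "");
        // Assumed fallback so that a title made only of dots still yields a name.
        return sanitized.isEmpty() ? "untitled" : sanitized;
    }
}
```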

Comment on lines +121 to +133
    // Use reflection to instantiate and execute the hierarchical task
    Class<?> taskClass = Class.forName(
            "org.pdfsam.tools.splitbybookmarks.HierarchicalSplitByBookmarksTask");
    Object task = taskClass.getDeclaredConstructor().newInstance();

    // Execute the task
    java.lang.reflect.Method executeMethod = taskClass.getMethod("execute",
            SplitByOutlineLevelParameters.class);
    executeMethod.invoke(task, params);
} catch (Exception e) {
    LOG.error("Failed to execute hierarchical split task", e);
    GlobalNotificationContext.getContext()
            .notifyListeners(new TaskExecutionFailedEvent(e, NotifiableTaskMetadata.NULL));

Copilot AI Mar 12, 2026

executeHierarchicalSplit invokes the task via reflection but only calls execute(...). Since HierarchicalSplitByBookmarksTask relies on BaseTask.before(...) to compile the regex (titleMatchingPattern) and on after() to close the opened documentHandler, skipping the lifecycle means: regex filtering won’t work and the input document may never be closed (resource leak). Also, failures will likely emit duplicate TaskExecutionFailedEvents because the task already notifies failures and the controller notifies again when the reflective call throws. Please route hierarchical execution through the standard TaskExecutionService (preferred) or, if reflection is unavoidable, explicitly call before(...) and after() in a finally block and avoid double-notifying on failures (unwrap InvocationTargetException).

Suggested change
-     // Use reflection to instantiate and execute the hierarchical task
-     Class<?> taskClass = Class.forName(
-             "org.pdfsam.tools.splitbybookmarks.HierarchicalSplitByBookmarksTask");
-     Object task = taskClass.getDeclaredConstructor().newInstance();
-     // Execute the task
-     java.lang.reflect.Method executeMethod = taskClass.getMethod("execute",
-             SplitByOutlineLevelParameters.class);
-     executeMethod.invoke(task, params);
- } catch (Exception e) {
-     LOG.error("Failed to execute hierarchical split task", e);
-     GlobalNotificationContext.getContext()
-             .notifyListeners(new TaskExecutionFailedEvent(e, NotifiableTaskMetadata.NULL));
+     // Route hierarchical split through the standard TaskExecutionService so that
+     // task lifecycle (before/after) and notifications are handled correctly.
+     executionService.execute(params);
+ } catch (Exception e) {
+     // Let TaskExecutionService / task infrastructure handle failure notifications
+     // to avoid duplicate TaskExecutionFailedEvent emissions.
+     LOG.error("Failed to execute hierarchical split task", e);
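
If reflection turns out to be unavoidable, the fallback described in this comment could look roughly like the sketch below. `DemoTask` is a hypothetical stand-in with a simplified `before`/`execute`/`after` lifecycle; its method signatures are illustrative and do not match the real task API.

```java
import java.lang.reflect.InvocationTargetException;

final class ReflectiveLifecycleSketch {
    // Hypothetical stand-in for a task with a before/execute/after lifecycle.
    public static final class DemoTask {
        final StringBuilder calls = new StringBuilder();

        public void before() {
            calls.append("before;");
        }

        public void execute(String input) {
            calls.append("execute;");
            if (input.isEmpty()) {
                throw new IllegalStateException("nothing to split");
            }
        }

        public void after() {
            calls.append("after;");
        }
    }

    private ReflectiveLifecycleSketch() {
    }

    // Runs the full lifecycle; after() is invoked even when execute() fails.
    static void run(Object task, String input) {
        try {
            call(task, "before");
            call(task, "execute", input);
        } finally {
            call(task, "after");
        }
    }

    // Invokes a public method by name and rethrows the task's own exception
    // instead of the InvocationTargetException wrapper.
    private static void call(Object task, String name, Object... args) {
        Class<?>[] types = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            types[i] = args[i].getClass();
        }
        try {
            task.getClass().getMethod(name, types).invoke(task, args);
        } catch (InvocationTargetException e) {
            Throwable cause = e.getCause();
            if (cause instanceof RuntimeException) {
                throw (RuntimeException) cause;
            }
            if (cause instanceof Error) {
                throw (Error) cause;
            }
            throw new IllegalStateException(cause);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot call " + name, e);
        }
    }
}
```

Unwrapping the cause lets the caller see the original failure once, which makes it easier to avoid emitting a second failure event for the same error.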

2 participants