Feat: Implement hierarchical split functionality for PDF bookmarks#799
Feat: Implement hierarchical split functionality for PDF bookmarks#799Chapoly1305 wants to merge 2 commits intotorakiki:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new “hierarchical output” mode to the Split by bookmarks tool, including optional overlap-page handling, by extending the parameters and introducing a custom split task that creates a folder per parent bookmark.
Changes:
- Adds UI controls to enable hierarchical directory output and configure overlap pages, persisting these in workspace state.
- Extends the split-by-bookmarks parameters builder to emit a new
HierarchicalSplitByOutlineLevelParameterstype. - Introduces
HierarchicalSplitByBookmarksTaskand routes execution via a special-case path inTaskExecutionControllerusing reflection.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| pdfsam-tools/pdfsam-split-by-bookmarks/src/main/java/org/pdfsam/tools/splitbybookmarks/SplitOptionsPane.java | Adds hierarchical output + overlap UI, reset/apply logic, and workspace persistence. |
| pdfsam-tools/pdfsam-split-by-bookmarks/src/main/java/org/pdfsam/tools/splitbybookmarks/SplitByOutlineLevelParametersBuilder.java | Adds hierarchical/overlap fields and builds hierarchical parameters when enabled. |
| pdfsam-tools/pdfsam-split-by-bookmarks/src/main/java/org/pdfsam/tools/splitbybookmarks/HierarchicalSplitByOutlineLevelParameters.java | New parameters subtype carrying hierarchical + overlap settings. |
| pdfsam-tools/pdfsam-split-by-bookmarks/src/main/java/org/pdfsam/tools/splitbybookmarks/HierarchicalSplitByBookmarksTask.java | New task implementing hierarchical splitting and overlap logic. |
| pdfsam-tools/pdfsam-split-by-bookmarks/src/main/java/module-info.java | Adds module requirements for Sejda/SAMBox + logging needed by the new task. |
| pdfsam-tools/pdfsam-split-by-bookmarks/pom.xml | Adds Sejda dependencies needed by the new task implementation. |
| pdfsam-service/src/main/java/org/pdfsam/service/task/TaskExecutionController.java | Special-cases hierarchical params and executes the hierarchical task via reflection. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| var overlapLabel = new Label(i18n().tr("Additional overlap pages:")); | ||
| GridPane.setValignment(overlapLabel, VPos.BOTTOM); | ||
| GridPane.setHalignment(overlapLabel, HPos.LEFT); | ||
| add(overlapLabel, 0, 3); | ||
| overlapPagesField.setId("overlapPages"); | ||
| overlapPagesField.setPromptText("0"); | ||
| overlapPagesField.setPrefWidth(100); |
There was a problem hiding this comment.
overlapLabel is not associated with overlapPagesField (missing setLabelFor), which makes the new input harder to use with assistive technologies and breaks the pattern used for the other fields in this pane (e.g., regexLabel.setLabelFor(regexpField)).
| builder.hierarchicalOutput(hierarchicalOutputCheck.isSelected()); | ||
|
|
||
| // Auto-detect is now always enabled when hierarchical output is selected | ||
| builder.autoDetectOverlap(hierarchicalOutputCheck.isSelected()); | ||
|
|
||
| if (isNotBlank(overlapPagesField.getText())) { | ||
| try { | ||
| int overlap = Integer.parseInt(overlapPagesField.getText().trim()); | ||
| if (overlap >= 0) { | ||
| builder.overlapPages(overlap); | ||
| } else { |
There was a problem hiding this comment.
The overlap pages input is parsed and applied to the builder even when hierarchical output is not selected. However, SplitByOutlineLevelParametersBuilder.build() ignores overlapPages unless hierarchicalOutput is true, so user input can be silently dropped. Consider disabling/hiding this field unless hierarchical output is enabled, or only calling builder.overlapPages(...) when hierarchicalOutputCheck.isSelected() is true (and possibly surfacing a validation error if the user entered a value while hierarchical mode is off).
| if (overlap >= 0) { | ||
| builder.overlapPages(overlap); | ||
| } else { | ||
| onError.accept(i18n().tr("Overlap pages must be a positive number")); |
There was a problem hiding this comment.
The validation/message for negative overlap values is inconsistent: the code accepts overlap >= 0 (so 0 is valid) but the error says "must be a positive number" (which usually excludes 0). Either change the check to > 0 or update the message to explicitly allow 0 (e.g., non-negative / zero or positive).
| onError.accept(i18n().tr("Overlap pages must be a positive number")); | |
| onError.accept(i18n().tr("Overlap pages must be zero or a positive number")); |
| SplitByOutlineLevelParameters params = new SplitByOutlineLevelParameters(level); | ||
| SplitByOutlineLevelParameters params; | ||
| if (hierarchicalOutput) { | ||
| params = new HierarchicalSplitByOutlineLevelParameters(level, true, overlapPages, autoDetectOverlap); |
There was a problem hiding this comment.
build() passes a constant true for the hierarchicalOutput constructor parameter (new HierarchicalSplitByOutlineLevelParameters(level, true, ...)). This makes the builder state and constructor API inconsistent and can hide bugs if the constructor argument is later used for behavior. Pass the hierarchicalOutput field instead, or remove the constructor parameter/field entirely since the type already implies hierarchical mode.
| params = new HierarchicalSplitByOutlineLevelParameters(level, true, overlapPages, autoDetectOverlap); | |
| params = new HierarchicalSplitByOutlineLevelParameters(level, hierarchicalOutput, overlapPages, autoDetectOverlap); |
| void hierarchicalOutput(boolean hierarchicalOutput) { | ||
| this.hierarchicalOutput = hierarchicalOutput; | ||
| } | ||
|
|
||
| boolean isHierarchicalOutput() { | ||
| return hierarchicalOutput; | ||
| } | ||
|
|
||
| void overlapPages(int overlapPages) { | ||
| this.overlapPages = overlapPages; | ||
| } | ||
|
|
||
| int getOverlapPages() { | ||
| return overlapPages; | ||
| } | ||
|
|
||
| void autoDetectOverlap(boolean autoDetectOverlap) { | ||
| this.autoDetectOverlap = autoDetectOverlap; | ||
| } | ||
|
|
||
| boolean isAutoDetectOverlap() { | ||
| return autoDetectOverlap; | ||
| } | ||
|
|
||
| @Override | ||
| public SplitByOutlineLevelParameters build() { | ||
| SplitByOutlineLevelParameters params = new SplitByOutlineLevelParameters(level); | ||
| SplitByOutlineLevelParameters params; | ||
| if (hierarchicalOutput) { | ||
| params = new HierarchicalSplitByOutlineLevelParameters(level, true, overlapPages, autoDetectOverlap); | ||
| } else { | ||
| params = new SplitByOutlineLevelParameters(level); | ||
| } |
There was a problem hiding this comment.
There is existing unit coverage for this builder, but the newly added hierarchical settings are untested. Please add tests that verify: (1) hierarchicalOutput(true) causes build() to return HierarchicalSplitByOutlineLevelParameters, and (2) overlapPages/autoDetectOverlap are propagated as expected.
| // Calculate the actual end page with overlap | ||
| int totalPages = documentHandler.getUnderlyingPDDocument().getNumberOfPages(); | ||
| int actualEndPage = Math.min(section.endPage + overlapPages, totalPages); | ||
|
|
||
| LOG.debug("Splitting section '{}' from page {} to {} (with {} pages overlap)", | ||
| section.title, section.startPage, actualEndPage, overlapPages); | ||
|
|
||
| // Extract pages for this section (including overlap) | ||
| for (int pageNum = section.startPage; pageNum < actualEndPage; pageNum++) { | ||
| handler.addPage(documentHandler.getUnderlyingPDDocument().getPage(pageNum - 1)); // SAMBox uses 0-based indexing | ||
| } |
There was a problem hiding this comment.
The end-page math in splitSection is inconsistent with the loop bounds: actualEndPage is clamped to totalPages, and the loop uses < actualEndPage, so page totalPages is never included. This becomes visible for the last section and also when overlap pushes the end past the last page. Use a consistent convention (exclusive end bound usually implies clamping to totalPages + 1 and iterating < endExclusive, or inclusive bound implies clamping to totalPages and iterating <= endInclusive).
| File outputDirectory = parameters.getOutput().getDestination(); | ||
|
|
||
| // Get overlap configuration if this is a hierarchical split | ||
| int overlapPages = 0; | ||
| boolean autoDetectOverlap = false; | ||
| if (parameters instanceof HierarchicalSplitByOutlineLevelParameters) { | ||
| HierarchicalSplitByOutlineLevelParameters hierarchicalParams = (HierarchicalSplitByOutlineLevelParameters) parameters; | ||
| overlapPages = hierarchicalParams.getOverlapPages(); | ||
| autoDetectOverlap = hierarchicalParams.isAutoDetectOverlap(); | ||
| LOG.debug("Using overlap of {} pages, auto-detect: {}", overlapPages, autoDetectOverlap); | ||
| } |
There was a problem hiding this comment.
This task writes output PDFs directly to parentDir using sanitizeFilename(section.title) and does not apply several SplitByOutlineLevelParameters settings that the UI/builder sets (e.g., ExistingOutputPolicy, outputPrefix, and parameters.getOutput().accept(...) handling). This can lead to overwriting files even when the policy is SKIP/FAIL and to filenames that ignore the configured prefix. Consider using the same output writing/name-generation approach used by other tasks in pdfsam-service (e.g., OutputWriters.newMultipleOutputWriter(...) + NameGenerator), or otherwise explicitly honor the parameters here.
| private void splitSection(BookmarkSection section, File parentDir, SplitByOutlineLevelParameters parameters, | ||
| int currentStep, int overlapPages) throws TaskException { | ||
| try { | ||
| PDDocumentHandler handler = new PDDocumentHandler(); | ||
| handler.setCreatorOnPDDocument(); | ||
| handler.setVersionOnPDDocument( | ||
| parameters.getVersion() != null ? parameters.getVersion() : PdfVersion.VERSION_1_6); | ||
| handler.setCompress(parameters.isCompress()); | ||
|
|
||
| // Calculate the actual end page with overlap | ||
| int totalPages = documentHandler.getUnderlyingPDDocument().getNumberOfPages(); | ||
| int actualEndPage = Math.min(section.endPage + overlapPages, totalPages); | ||
|
|
||
| LOG.debug("Splitting section '{}' from page {} to {} (with {} pages overlap)", | ||
| section.title, section.startPage, actualEndPage, overlapPages); | ||
|
|
||
| // Extract pages for this section (including overlap) | ||
| for (int pageNum = section.startPage; pageNum < actualEndPage; pageNum++) { | ||
| handler.addPage(documentHandler.getUnderlyingPDDocument().getPage(pageNum - 1)); // SAMBox uses 0-based indexing | ||
| } | ||
|
|
||
| String filename = sanitizeFilename(section.title) + ".pdf"; | ||
| File outputFile = new File(parentDir, filename); | ||
|
|
||
| handler.savePDDocument(outputFile); | ||
| handler.close(); | ||
|
|
||
| LOG.debug("Created file: {}", outputFile); | ||
|
|
||
| // Report progress | ||
| BigDecimal percentComplete = BigDecimal.valueOf((currentStep / (double) totalSteps) * 100); | ||
| GlobalNotificationContext.getContext() | ||
| .notifyListeners(new PercentageOfWorkDoneChangedEvent(percentComplete, NotifiableTaskMetadata.NULL)); | ||
| } catch (Exception e) { | ||
| throw new TaskException("Failed to split section: " + section.title, e); | ||
| } | ||
| } |
There was a problem hiding this comment.
splitSection only closes the PDDocumentHandler handler on the success path. If any exception is thrown while adding pages or saving, the handler/document may leak. Use a try/finally (or try-with-resources if supported) to ensure the handler is always closed.
| // Replace invalid filename characters with underscores | ||
| String sanitized = filename.replaceAll("[\\\\/:*?\"<>|]", "_"); | ||
| // Remove leading/trailing spaces and dots | ||
| sanitized = sanitized.trim().replaceAll("^\\.+", ""); |
There was a problem hiding this comment.
The comment in sanitizeFilename says it removes "leading/trailing spaces and dots", but the implementation only trims spaces and removes leading dots (replaceAll("^\\.+", "")). Either update the implementation to also strip trailing dots (and possibly trailing spaces already handled by trim()), or correct the comment so it matches behavior.
| sanitized = sanitized.trim().replaceAll("^\\.+", ""); | |
| sanitized = sanitized.trim().replaceAll("^\\.+", "").replaceAll("\\.+$", ""); |
| // Use reflection to instantiate and execute the hierarchical task | ||
| Class<?> taskClass = Class.forName( | ||
| "org.pdfsam.tools.splitbybookmarks.HierarchicalSplitByBookmarksTask"); | ||
| Object task = taskClass.getDeclaredConstructor().newInstance(); | ||
|
|
||
| // Execute the task | ||
| java.lang.reflect.Method executeMethod = taskClass.getMethod("execute", | ||
| SplitByOutlineLevelParameters.class); | ||
| executeMethod.invoke(task, params); | ||
| } catch (Exception e) { | ||
| LOG.error("Failed to execute hierarchical split task", e); | ||
| GlobalNotificationContext.getContext() | ||
| .notifyListeners(new TaskExecutionFailedEvent(e, NotifiableTaskMetadata.NULL)); |
There was a problem hiding this comment.
executeHierarchicalSplit invokes the task via reflection but only calls execute(...). Since HierarchicalSplitByBookmarksTask relies on BaseTask.before(...) to compile the regex (titleMatchingPattern) and on after() to close the opened documentHandler, skipping the lifecycle means: regex filtering won’t work and the input document may never be closed (resource leak). Also, failures will likely emit duplicate TaskExecutionFailedEvents because the task already notifies failures and the controller notifies again when the reflective call throws. Please route hierarchical execution through the standard TaskExecutionService (preferred) or, if reflection is unavoidable, explicitly call before(...) and after() in a finally block and avoid double-notifying on failures (unwrap InvocationTargetException).
| // Use reflection to instantiate and execute the hierarchical task | |
| Class<?> taskClass = Class.forName( | |
| "org.pdfsam.tools.splitbybookmarks.HierarchicalSplitByBookmarksTask"); | |
| Object task = taskClass.getDeclaredConstructor().newInstance(); | |
| // Execute the task | |
| java.lang.reflect.Method executeMethod = taskClass.getMethod("execute", | |
| SplitByOutlineLevelParameters.class); | |
| executeMethod.invoke(task, params); | |
| } catch (Exception e) { | |
| LOG.error("Failed to execute hierarchical split task", e); | |
| GlobalNotificationContext.getContext() | |
| .notifyListeners(new TaskExecutionFailedEvent(e, NotifiableTaskMetadata.NULL)); | |
| // Route hierarchical split through the standard TaskExecutionService so that | |
| // task lifecycle (before/after) and notifications are handled correctly. | |
| executionService.execute(params); | |
| } catch (Exception e) { | |
| // Let TaskExecutionService / task infrastructure handle failure notifications | |
| // to avoid duplicate TaskExecutionFailedEvent emissions. | |
| LOG.error("Failed to execute hierarchical split task", e); |
This pull request introduces support for hierarchical splitting of PDF files by bookmarks, enabling the creation of directory structures based on bookmark levels. The main changes include the addition of a new task and parameters to handle hierarchical output, enhancements to the task execution controller to route hierarchical split requests, and updates to module dependencies.
Demo

In Settings.
Exported PDF Destination Folders.

Tested on macOS.
Hierarchical Split Functionality
HierarchicalSplitByBookmarksTask, which splits PDFs by bookmarks and creates directories for parent bookmarks, placing child splits within them. It supports overlap and auto-detection of overlapping pages for more accurate splits.HierarchicalSplitByOutlineLevelParameters, an extension ofSplitByOutlineLevelParameters, to configure hierarchical output, overlap pages, and auto-detect overlap options.SplitByOutlineLevelParametersBuilderto include options for hierarchical output, overlap pages, and auto-detect overlap.Task Execution Routing
TaskExecutionControllerto detect and route hierarchical split requests to the new task using reflection, avoiding direct dependencies and circular references. [1] [2]Dependency and Module Configuration
sejda-core,sejda-commons,sejda-sambox) topdfsam-split-by-bookmarksand updated the module descriptor to require these modules, ensuring the new functionality can access required PDF processing features. [1] [2]