feat(components): add MinerU Document Loader with flash and precision modes #6063

Open
chaserRen wants to merge 4 commits into FlowiseAI:main from chaserRen:feat/mineru

@chaserRen

Summary
Add MinerU Document Loader to Flowise for parsing documents via MinerU APIs in flash (token-free) and precision (token-required) modes.

Changes
Add new node implementation at packages/components/nodes/documentloaders/MinerU/MinerU.ts and icon at packages/components/nodes/documentloaders/MinerU/mineru.svg
Support two input modes: URL and file upload
Support MinerU flash and precision workflows
Support precision options: model (vlm/pipeline/html), OCR, formula, table, language, page range, timeout
Support split-pages behavior for PDF sources when page range is provided
Return Flowise Document/Text outputs with metadata (source, mode, language, model, page info)
Add env fallbacks for configuration: MINERU_TOKEN, MINERU_FLASH_BASE_URL, MINERU_API_BASE_URL, MINERU_SOURCE_HEADER
Include polling, timeout, and explicit error handling for MinerU task lifecycle
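
The environment fallbacks listed above could be resolved along these lines. This is a minimal sketch, not the code in MinerU.ts: the `MinerUConfig` shape, `resolveConfig`, and `assertModeRequirements` are hypothetical names, and only the four environment variable names come from the PR description. No default base URLs are assumed.

```typescript
// Hypothetical config shape; field names are illustrative, not taken from MinerU.ts.
interface MinerUConfig {
    token?: string
    flashBaseUrl?: string
    apiBaseUrl?: string
    sourceHeader?: string
}

type Env = Record<string, string | undefined>

// Return the first value that is non-empty after trimming, or undefined.
function firstNonEmpty(...values: Array<string | undefined>): string | undefined {
    for (const value of values) {
        const trimmed = value?.trim()
        if (trimmed) return trimmed
    }
    return undefined
}

// Node inputs take precedence; environment variables only fill the gaps.
function resolveConfig(inputs: MinerUConfig, env: Env): MinerUConfig {
    return {
        token: firstNonEmpty(inputs.token, env.MINERU_TOKEN),
        flashBaseUrl: firstNonEmpty(inputs.flashBaseUrl, env.MINERU_FLASH_BASE_URL),
        apiBaseUrl: firstNonEmpty(inputs.apiBaseUrl, env.MINERU_API_BASE_URL),
        sourceHeader: firstNonEmpty(inputs.sourceHeader, env.MINERU_SOURCE_HEADER)
    }
}

// Flash mode is token-free; precision mode requires a token.
function assertModeRequirements(mode: 'flash' | 'precision', config: MinerUConfig): void {
    if (mode === 'precision' && !config.token) {
        throw new Error('MinerU precision mode requires a token (node input or MINERU_TOKEN)')
    }
}
```

Passing the environment in as a parameter (for example `resolveConfig(nodeInputs, process.env)`) keeps the resolution logic easy to unit test.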

Why
This integration enables Flowise users to parse PDF/image/office/html documents with MinerU directly in Document Loaders, covering both fast extraction and higher-accuracy extraction scenarios.

Testing
Confirmed branch diff only adds MinerU loader source and icon files
Verified code paths enforce mode-specific token/file-type requirements and produce clear errors for invalid inputs/timeouts
Verified node outputs and metadata mapping logic for both flash and precision modes

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds a MinerU Document Loader to Flowise. Users can call the MinerU API to extract content from PDF, image, office, and HTML documents, choosing between a fast, token-free 'Flash' mode and a token-required 'Precision' mode with additional configuration options.

Highlights

  • New Document Loader: Added the MinerU Document Loader to Flowise, enabling parsing of various document types (PDF, image, office, HTML) via MinerU APIs.
  • Dual Modes Supported: Implemented support for both 'Flash' (token-free, fast extraction) and 'Precision' (token-required, higher-accuracy extraction) modes.
  • Flexible Input Options: Provided options for document input via URL or file upload.
  • Advanced Precision Configuration: Included extensive configuration options for precision mode, such as model selection (VLM, Pipeline, HTML), OCR, formula and table extraction, language hints, page range specification, and timeout settings.
  • Enhanced Document Output: Documents are returned as Flowise Document/Text outputs, enriched with metadata including source, mode, language, model, and page information.
  • Robust API Integration: Incorporated environment variable fallbacks for API configuration (token, base URLs, source header) and implemented polling, timeout, and explicit error handling for the MinerU task lifecycle.
  • PDF Page Splitting: Added functionality to split PDF sources into individual documents per page when a page range is provided.
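
The polling, timeout, and error handling mentioned in the highlights might follow a pattern like the one below. This is a sketch under stated assumptions: the `TaskStatus` shape, the state names, and `pollUntilDone` are hypothetical, and the real MinerU status payload may differ. The status fetcher is injected so the loop can be exercised without network access.

```typescript
// Hypothetical task states; the actual MinerU API may use different names.
type TaskState = 'pending' | 'running' | 'done' | 'failed'

interface TaskStatus {
    state: TaskState
    error?: string
}

const sleep = (ms: number): Promise<void> => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Poll until the task finishes, fails, or the overall timeout would be exceeded.
async function pollUntilDone(
    fetchStatus: () => Promise<TaskStatus>,
    timeoutMs: number,
    intervalMs: number
): Promise<TaskStatus> {
    const deadline = Date.now() + timeoutMs
    for (;;) {
        const status = await fetchStatus()
        if (status.state === 'done') return status
        if (status.state === 'failed') {
            throw new Error(`MinerU task failed: ${status.error ?? 'unknown error'}`)
        }
        // Stop before sleeping past the deadline so the caller gets a clear timeout error.
        if (Date.now() + intervalMs > deadline) {
            throw new Error(`MinerU task timed out after ${timeoutMs} ms`)
        }
        await sleep(intervalMs)
    }
}
```

Distinct error messages for failure and timeout let the node surface which part of the task lifecycle went wrong.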

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new MinerU Document Loader, enabling document parsing in both 'flash' and 'precision' modes, supporting file uploads and URL inputs for various document types. The review identifies a high-severity security concern due to the custom ZIP parsing implementation, recommending the use of a robust, battle-tested library. Additionally, it suggests improving performance by parallelizing task processing and addressing the non-deterministic generation of filenames.

Comment on lines +321 to +325
const results: MinerUTaskResult[] = []
for (const task of tasks) {
    const result = await this.runTask(task, config)
    results.push(result)
}

medium

The current implementation processes tasks sequentially using a for...of loop. This can be inefficient when there are multiple files or URLs to process. To improve performance, you can execute these tasks in parallel using Promise.all.

Suggested change
- const results: MinerUTaskResult[] = []
- for (const task of tasks) {
-     const result = await this.runTask(task, config)
-     results.push(result)
- }
+ const results: MinerUTaskResult[] = await Promise.all(tasks.map((task) => this.runTask(task, config)))


const mimeMatch = contentPart.match(/^data:([^;]+);base64,/i)
const ext = mimeMatch?.[1] ? MIME_EXTENSION_MAP[mimeMatch[1].toLowerCase()] || 'bin' : 'bin'
const guessedName = fileNameFromPayload || `upload_${Date.now()}_${index}.${ext}`

medium

Using Date.now() to generate filenames introduces non-determinism, which can make testing more difficult and less reliable. While collisions are unlikely, it's a good practice to use a more deterministic approach. Consider using a cryptographic hash of the file content for a unique and deterministic name, or a simpler counter if the scope is limited.
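
A content-hash fallback name, as suggested in this comment, could be sketched as follows. The helper name `deterministicUploadName`, the `upload_` prefix layout, and the 16-character truncation are illustrative assumptions; only the idea of hashing the file content comes from the review.

```typescript
import { createHash } from 'node:crypto'

// Derive a stable fallback filename from the file bytes, so the same upload
// always maps to the same name regardless of when it is processed.
function deterministicUploadName(content: Uint8Array, index: number, ext: string): string {
    const digest = createHash('sha256').update(content).digest('hex')
    // Truncating the hex digest keeps names short; the length is an arbitrary choice.
    return `upload_${digest.slice(0, 16)}_${index}.${ext}`
}
```

Including the index keeps two identical files in the same request from receiving the same name.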

@chaserRen
Author

Pushed fixes for the MinerU review feedback in MinerU.ts:

  1. ZIP parsing hardening (Security - High Priority)
    Replaced the custom ZIP parsing logic with yauzl (battle-tested library) in the precision markdown extraction path.
    Fix details:
  • Removed custom low-level ZIP parsing methods.
  • Added safe ZIP handling with entry-based reading.
  • Added size guards:
    • max ZIP buffer size (MAX_ZIP_BUFFER_BYTES)
    • max markdown entry size (MAX_MARKDOWN_BYTES)
  • Only .md entries are accepted for extraction.
  2. Task execution performance (Medium Priority)
    Changed sequential task processing to controlled parallel execution.
    Fix details:
  • Replaced the for...of + await flow with batched concurrent execution via runTasksWithConcurrency(...).
  • Added default concurrency limit (DEFAULT_TASK_CONCURRENCY = 3) to avoid unbounded Promise.all fan-out.
  3. Deterministic fallback filename generation (Medium Priority)
    Removed Date.now()-based fallback naming in inline upload decoding.
    Fix details:
  • Added buildDeterministicUploadName(...) using SHA-256 hash of file content + index + extension.
  • Keeps filenames stable and test-friendly while preserving existing explicit filename behavior.
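
The batched concurrent execution described under "Task execution performance" above might look roughly like this. It is a sketch, not the PR's runTasksWithConcurrency(...): the function name `runInBatches` and its signature are hypothetical, and only the default limit of 3 is taken from the comment.

```typescript
const DEFAULT_TASK_CONCURRENCY = 3

// Run `worker` over `tasks` in fixed-size batches: at most `limit` tasks are
// in flight at once, and results keep the same order as the input.
async function runInBatches<T, R>(
    tasks: readonly T[],
    worker: (task: T, index: number) => Promise<R>,
    limit: number = DEFAULT_TASK_CONCURRENCY
): Promise<R[]> {
    if (!Number.isInteger(limit) || limit < 1) {
        throw new RangeError('limit must be a positive integer')
    }
    const results: R[] = []
    for (let start = 0; start < tasks.length; start += limit) {
        const batch = tasks.slice(start, start + limit)
        // Each batch is awaited as a whole before the next one starts.
        const batchResults = await Promise.all(batch.map((task, offset) => worker(task, start + offset)))
        results.push(...batchResults)
    }
    return results
}
```

Because each batch waits for its slowest task, a worker-pool variant that starts a new task as soon as any slot frees up would give higher throughput; the batched form is simply easier to reason about. As with plain Promise.all, the first rejected task rejects the whole call.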

Additional:

  • Added direct dependencies for ZIP parsing support in packages/components/package.json:
    • yauzl
    • @types/yauzl
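
The size guards and the ".md only" rule from the ZIP hardening fix can be expressed as small pure checks applied before and during entry reading. The sketch below makes several assumptions: the two limit values are placeholders (this page names the constants but not their values), `ZipEntryInfo` is a hypothetical subset of an entry record (yauzl entries expose `fileName` and `uncompressedSize`), and the path-traversal check is an extra precaution not mentioned in the comment.

```typescript
// Placeholder limits; the values used in MinerU.ts are not shown on this page.
const MAX_ZIP_BUFFER_BYTES = 50 * 1024 * 1024
const MAX_MARKDOWN_BYTES = 10 * 1024 * 1024

// Minimal view of a ZIP entry needed for filtering.
interface ZipEntryInfo {
    fileName: string
    uncompressedSize: number
}

// Reject oversized archives before handing the buffer to the ZIP library.
function assertZipBufferSize(byteLength: number): void {
    if (byteLength > MAX_ZIP_BUFFER_BYTES) {
        throw new Error(`ZIP archive is ${byteLength} bytes, limit is ${MAX_ZIP_BUFFER_BYTES}`)
    }
}

// Accept only regular .md entries within the size limit.
function isExtractableMarkdown(entry: ZipEntryInfo): boolean {
    const name = entry.fileName
    if (name.endsWith('/')) return false // directory entry
    if (!name.toLowerCase().endsWith('.md')) return false
    // Assumption: refuse absolute paths and any ".." segment as a traversal precaution.
    if (name.startsWith('/') || name.split(/[\\/]/).includes('..')) return false
    return entry.uncompressedSize <= MAX_MARKDOWN_BYTES
}
```

The declared `uncompressedSize` comes from the archive's own headers, so a careful reader would also count bytes while streaming the entry and abort once the limit is exceeded.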
