A complete styled PDF generation system has been added to PullData, featuring LLM-powered data structuring and three professional visual styles.
File: pulldata/synthesis/report_models.py
- ReportData: Main schema for structured reports
  - title, subtitle, summary
  - metrics: List of MetricItem (label, value, icon)
  - sections: List of ReportSection (heading, content, subsections)
  - references: List of Reference (title, url, page, score)
  - metadata: Additional fields
- System Prompts: Two variants for LLM structuring
  - REPORT_STRUCTURING_PROMPT: General text structuring
  - REPORT_STRUCTURING_PROMPT_WITH_SOURCES: For RAG results with sources
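To make the schema above concrete, here is a minimal sketch of its shape. It uses stdlib dataclasses as a stand-in for the Pydantic models in report_models.py, and the field types and defaults (for example, which fields are optional) are assumptions rather than the actual definitions.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class MetricItem:
    label: str
    value: str
    icon: Optional[str] = None  # assumed optional


@dataclass
class ReportSection:
    heading: str
    content: str
    subsections: List["ReportSection"] = field(default_factory=list)


@dataclass
class Reference:
    title: str
    url: Optional[str] = None      # assumed optional
    page: Optional[int] = None     # assumed optional
    score: Optional[float] = None  # assumed optional


@dataclass
class ReportData:
    title: str
    summary: str
    subtitle: Optional[str] = None
    metrics: List[MetricItem] = field(default_factory=list)
    sections: List[ReportSection] = field(default_factory=list)
    references: List[Reference] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)


report = ReportData(
    title="Q3 Report",
    summary="Strong performance with 15% growth",
    metrics=[MetricItem(label="Revenue", value="$10.5M")],
    sections=[ReportSection(heading="Overview", content="...")],
)
# Nested dataclasses convert recursively into plain dicts and lists.
report_dict = asdict(report)
```

The nested structure (sections containing subsections) is what lets one template walk the whole report regardless of style.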
Template: pulldata/synthesis/templates/report_base.html
- Single master template that adapts based on CSS
- Semantic HTML structure
- Support for all ReportData fields
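The idea of one master template whose look is decided entirely by the CSS can be sketched as follows. This is a stdlib-only illustration: `string.Template` stands in for Jinja2, and the template markup, the `STYLES` dict, and the `render_html` helper are hypothetical, not taken from report_base.html.

```python
import html
from string import Template

# Hypothetical master template; the real report_base.html is a Jinja2 template.
MASTER_TEMPLATE = Template(
    "<!DOCTYPE html>\n"
    "<html><head><style>$css</style></head>\n"
    "<body><article><h1>$title</h1>"
    "<p class=\"summary\">$summary</p></article></body></html>"
)

# Hypothetical inline stylesheets standing in for the per-style CSS files.
STYLES = {
    "executive": "body { font-family: sans-serif; color: #1f3a5f; }",
    "academic": "body { font-family: serif; column-count: 2; }",
}


def render_html(title: str, summary: str, style_name: str) -> str:
    """Fill the single master template, swapping only the CSS per style."""
    return MASTER_TEMPLATE.substitute(
        css=STYLES[style_name],
        title=html.escape(title),      # escape user-supplied text
        summary=html.escape(summary),
    )


executive_html = render_html("Q3 <Report>", "Growth of 15%", "executive")
academic_html = render_html("Q3 <Report>", "Growth of 15%", "academic")
```

Both outputs share identical markup and differ only in the injected stylesheet, which is the property that keeps a single template maintainable across styles.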
Styles: pulldata/synthesis/templates/styles/

Style A: executive.css
- Purpose: Executive summaries, board presentations
- Design: Clean, minimalist, blue/grey palette
- Features: Generous whitespace, emphasized summary box, professional metrics
- Best for: C-level audiences, time-constrained readers

Style B: modernist.css
- Purpose: Marketing materials, innovation reports
- Design: Bold typography, high contrast, dark accents
- Features: Orange/red gradients, asymmetric layout, impact-focused
- Best for: Creative content, making strong impressions

Style C: academic.css
- Purpose: Research reports, white papers
- Design: Traditional serif fonts, two-column layout
- Features: Dense information, classic formatting, bibliography style
- Best for: Formal documentation, technical reports
File: pulldata/synthesis/formatters/styled_pdf.py
StyledPDFFormatter(OutputFormatter)
- Extends existing PullData formatter architecture
- Integrates with Jinja2 and WeasyPrint
- Optional LLM integration for data structuring
Key Methods:
def format(data: OutputData) -> bytes
    # Convert standard OutputData to PDF
def render_styled_pdf(data: ReportData, style_name: str) -> bytes
    # Render ReportData with specified style
def structure_with_llm(raw_text: str, query: str, sources: list) -> ReportData
    # Use LLM to structure raw RAG text into ReportData

Convenience Function:
def render_styled_pdf(data: ReportData, style_name: str) -> bytes
    # Quick PDF generation without creating formatter instance

Updated Files:
- pulldata/synthesis/__init__.py - Exports new models and formatters
- pulldata/synthesis/formatters/__init__.py - Exports StyledPDFFormatter
- requirements.txt - Added weasyprint and jinja2
Documentation: docs/STYLED_PDF_GUIDE.md
- Complete guide (6,000+ words)
- Quick start, API reference, troubleshooting
- Style comparison, customization guide
Examples: examples/styled_pdf_example.py
- 4 complete working examples:
  - Manual ReportData creation
  - LLM-powered structuring
  - PullData query integration
  - Style comparison
from pulldata.synthesis import ReportData, MetricItem, ReportSection, render_styled_pdf
report = ReportData(
    title="Q3 Report",
    summary="Strong performance with 15% growth",
    metrics=[
        MetricItem(label="Revenue", value="$10.5M", icon="💰"),
    ],
    sections=[
        ReportSection(heading="Overview", content="...")
    ]
)
pdf_bytes = render_styled_pdf(report, style_name="executive")

from pulldata import PullData
from pulldata.synthesis import StyledPDFFormatter
pd = PullData(project="my_project")
formatter = StyledPDFFormatter(style="modernist", llm=pd._llm)
structured = formatter.structure_with_llm(
    raw_text="Q3 revenue was $10.5M, up 15%...",
    query="What was Q3 performance?"
)
pdf_bytes = formatter.render_styled_pdf(structured)

# Perform query
result = pd.query("What were Q3 metrics?", k=5, generate_answer=True)
# Structure and generate PDF
formatter = StyledPDFFormatter(style="academic", llm=pd._llm)
raw_text = f"{result.llm_response.text}\n\n"
raw_text += "\n".join([chunk.chunk.text for chunk in result.retrieved_chunks])
structured = formatter.structure_with_llm(raw_text, query=result.query)
pdf_bytes = formatter.render_styled_pdf(structured)

- Converts raw RAG text → structured JSON
- Automatic metric extraction
- Smart section organization
- Source citation handling
- Temperature-controlled (0.3 default)
- page-break-inside: avoid for headers
- page-break-after: avoid for headings
- Intelligent pagination for content sections
- Optional markdown processing in content
- Converts markdown to HTML before PDF generation
- Supports: headers, lists, code blocks, tables
- Automatic reference list generation
- Relevance score display
- Page number tracking
- URL linking
- weasyprint: PDF generation from HTML/CSS
- jinja2: Template rendering
- markdown2: Markdown processing (optional)
- pydantic: Data validation
- Jinja2 for HTML generation
- Autoescape enabled for security
- FileSystemLoader for template directory
- Separate CSS file per style
- Loaded at formatter initialization
- Injected into template at render time
- Print media query optimizations
@page {
    size: A4;
    margin: 2.5cm;
    @bottom-right {
        content: counter(page);
    }
}

pulldata/synthesis/
├── report_models.py              # Pydantic models + system prompts
├── templates/
│   ├── report_base.html          # Master Jinja2 template
│   └── styles/
│       ├── executive.css         # Style A
│       ├── modernist.css         # Style B
│       └── academic.css          # Style C
└── formatters/
    ├── styled_pdf.py             # StyledPDFFormatter class
    └── __init__.py               # Updated exports
examples/
└── styled_pdf_example.py         # 4 complete examples
docs/
└── STYLED_PDF_GUIDE.md           # Complete documentation
Run the example to verify installation:
# Basic test (no LLM required)
python examples/styled_pdf_example.py
# This will generate:
./output/styled_pdfs/financial_report_executive.pdf
./output/styled_pdfs/financial_report_modernist.pdf
./output/styled_pdfs/financial_report_academic.pdf
./output/style_comparison/comparison_*.pdf

- Create new CSS file in templates/styles/
- Add to AVAILABLE_STYLES dict in styled_pdf.py
- Use immediately:
render_styled_pdf(data, style_name="custom")

- Create new Jinja2 template
- Use custom template environment:
formatter = StyledPDFFormatter(style="executive")
template = formatter.jinja_env.get_template("custom_report.html")

Modify prompts in report_models.py:
- REPORT_STRUCTURING_PROMPT - General structuring
- REPORT_STRUCTURING_PROMPT_WITH_SOURCES - With source metadata
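The custom-style steps above rely on an AVAILABLE_STYLES dict, whose exact shape is not shown here. The sketch below assumes it maps a style name to a CSS filename, and the `load_style_css` helper is hypothetical. It runs against a temporary directory rather than the real templates/styles/ folder.

```python
import tempfile
from pathlib import Path

# Assumed shape: style name -> CSS filename inside templates/styles/.
AVAILABLE_STYLES = {
    "executive": "executive.css",
    "modernist": "modernist.css",
    "academic": "academic.css",
}


def load_style_css(style_name: str, styles_dir: Path) -> str:
    """Return the CSS text for a registered style, or raise a clear error."""
    if style_name not in AVAILABLE_STYLES:
        known = ", ".join(sorted(AVAILABLE_STYLES))
        raise ValueError(f"Unknown style {style_name!r}; available: {known}")
    css_path = styles_dir / AVAILABLE_STYLES[style_name]
    return css_path.read_text(encoding="utf-8")


# Walk through the first two steps against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    styles_dir = Path(tmp)
    # Step 1: create the new CSS file.
    (styles_dir / "custom.css").write_text("h1 { color: teal; }", encoding="utf-8")
    # Step 2: register it under a style name.
    AVAILABLE_STYLES["custom"] = "custom.css"
    custom_css = load_style_css("custom", styles_dir)
```

Validating the name before touching the filesystem gives callers a readable error that lists the registered styles, instead of a bare file-not-found.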
- Typical render time: 1-3 seconds per PDF
- Depends on: content length, complexity, style
- WeasyPrint is single-threaded
- Depends on LLM latency (API or local)
- Temperature 0.3 for consistent JSON output
- Max tokens: 4000 (configurable)
- Cache ReportData objects for reuse
- Generate PDFs asynchronously in web apps
- Reuse formatter instances
- Disable markdown processing if not needed
- WeasyPrint Installation: Requires system dependencies on some platforms
- LLM JSON Parsing: Smaller models may struggle with consistent JSON output
- Two-Column Layout: Academic style may have pagination issues with very long tables
- Font Support: Limited to system-installed fonts
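For the LLM JSON parsing limitation, one common mitigation is to extract the JSON object tolerantly before validating it. The following is a sketch of that idea only, not how structure_with_llm is implemented; `extract_json_object` is a hypothetical helper.

```python
import json
import re
from typing import Any, Dict

FENCE = "`" * 3  # Markdown code fence, built here to avoid a literal fence


def extract_json_object(llm_output: str) -> Dict[str, Any]:
    """Pull the first JSON object out of a reply that may contain extra text."""
    # Prefer the contents of a fenced block if the model wrapped its answer.
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    fenced = re.search(pattern, llm_output, re.DOTALL)
    candidate = fenced.group(1) if fenced else llm_output
    start = candidate.find("{")
    if start == -1:
        raise ValueError("No JSON object found in LLM output")
    # raw_decode parses one JSON value and ignores trailing commentary.
    parsed, _ = json.JSONDecoder().raw_decode(candidate[start:])
    return parsed


reply = (
    "Here is the report:\n"
    + FENCE + "json\n"
    + '{"title": "Q3 Report", "summary": "Up 15%"}\n'
    + FENCE
    + "\nLet me know if you need changes."
)
parsed = extract_json_object(reply)
```

This handles fenced replies and trailing chatter, but a stray `{` in the prose before the JSON would still raise a decode error, so a retry or a stricter prompt remains useful with smaller models.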
Potential additions (not yet implemented):
- Charts & Graphs: matplotlib/plotly integration
- Custom Fonts: Embedded font support
- Interactive PDFs: Form fields, hyperlinks
- Batch Generation: Multi-report generation
- Template Gallery: Pre-built templates for common use cases
- Cover Pages: Customizable cover page templates
- Watermarks: Security watermark support
- Digital Signatures: PDF signing capability
- Documentation: docs/STYLED_PDF_GUIDE.md
- Examples: examples/styled_pdf_example.py
- API Reference: See documentation
- Issues: GitHub issues for bug reports
Created: 2024-12-18
Author: PullData Development Team
Version: 1.0
pip install weasyprint jinja2 markdown2

from pulldata.synthesis import ReportData, render_styled_pdf
data = ReportData(title="Report", summary="...", sections=[...])
pdf = render_styled_pdf(data, style_name="executive")

from pulldata.synthesis import StyledPDFFormatter
formatter = StyledPDFFormatter(style="modernist", llm=my_llm)
structured = formatter.structure_with_llm(raw_text)
pdf = formatter.render_styled_pdf(structured)

- executive - Clean & minimal
- modernist - Bold & impactful
- academic - Traditional & formal
Status: ✅ Complete and ready for use