Styled PDF Synthesis System - Implementation Summary

Overview

A complete styled PDF generation system has been added to PullData, featuring LLM-powered data structuring and three professional visual styles.

What Was Created

1. Data Contract (Pydantic Models & System Prompt)

File: pulldata/synthesis/report_models.py

ReportData: Main schema for structured reports
- title, subtitle, summary
- metrics: List of MetricItem (label, value, icon)
- sections: List of ReportSection (heading, content, subsections)
- references: List of Reference (title, url, page, score)
- metadata: Additional fields
System Prompts: Two variants for LLM structuring
- REPORT_STRUCTURING_PROMPT: General text structuring
- REPORT_STRUCTURING_PROMPT_WITH_SOURCES: For RAG results with sources

2. Visual Styles (Jinja2 + CSS)

Template: pulldata/synthesis/templates/report_base.html

Single master template that adapts based on CSS
Semantic HTML structure
Support for all ReportData fields

Styles: pulldata/synthesis/templates/styles/

Style A: "The Executive" (`executive.css`)

Purpose: Executive summaries, board presentations
Design: Clean, minimalist, blue/grey palette
Features: Generous whitespace, emphasized summary box, professional metrics
Best for: C-level audiences, time-constrained readers

Style B: "The Modernist" (`modernist.css`)

Purpose: Marketing materials, innovation reports
Design: Bold typography, high contrast, dark accents
Features: Orange/red gradients, asymmetric layout, impact-focused
Best for: Creative content, making strong impressions

Style C: "The Academic" (`academic.css`)

Purpose: Research reports, white papers
Design: Traditional serif fonts, two-column layout
Features: Dense information, classic formatting, bibliography style
Best for: Formal documentation, technical reports

3. Python Synthesis Logic

File: pulldata/synthesis/formatters/styled_pdf.py

Main Classes

StyledPDFFormatter(OutputFormatter)

Extends existing PullData formatter architecture
Integrates with Jinja2 and WeasyPrint
Optional LLM integration for data structuring

Key Methods:

def format(data: OutputData) -> bytes
    # Convert standard OutputData to PDF

def render_styled_pdf(data: ReportData, style_name: str) -> bytes
    # Render ReportData with specified style

def structure_with_llm(raw_text: str, query: str, sources: list) -> ReportData
    # Use LLM to structure raw RAG text into ReportData

Convenience Function

def render_styled_pdf(data: ReportData, style_name: str) -> bytes
    # Quick PDF generation without creating formatter instance

4. Integration

Updated Files:

pulldata/synthesis/__init__.py - Exports new models and formatters
pulldata/synthesis/formatters/__init__.py - Exports StyledPDFFormatter
requirements.txt - Added weasyprint and jinja2

5. Documentation & Examples

Documentation: docs/STYLED_PDF_GUIDE.md

Complete guide (6,000+ words)
Quick start, API reference, troubleshooting
Style comparison, customization guide

Examples: examples/styled_pdf_example.py

4 complete working examples:
1. Manual ReportData creation
2. LLM-powered structuring
3. PullData query integration
4. Style comparison

Usage Patterns

Pattern 1: Manual Report Creation

from pulldata.synthesis import ReportData, MetricItem, ReportSection, render_styled_pdf

report = ReportData(
    title="Q3 Report",
    summary="Strong performance with 15% growth",
    metrics=[
        MetricItem(label="Revenue", value="$10.5M", icon="💰"),
    ],
    sections=[
        ReportSection(heading="Overview", content="...")
    ]
)

pdf_bytes = render_styled_pdf(report, style_name="executive")

Pattern 2: LLM Structuring

from pulldata import PullData
from pulldata.synthesis import StyledPDFFormatter

pd = PullData(project="my_project")
formatter = StyledPDFFormatter(style="modernist", llm=pd._llm)

structured = formatter.structure_with_llm(
    raw_text="Q3 revenue was $10.5M, up 15%...",
    query="What was Q3 performance?"
)

pdf_bytes = formatter.render_styled_pdf(structured)

Pattern 3: Integration with Queries

# Perform query
result = pd.query("What were Q3 metrics?", k=5, generate_answer=True)

# Structure and generate PDF
formatter = StyledPDFFormatter(style="academic", llm=pd._llm)

raw_text = f"{result.llm_response.text}\n\n"
raw_text += "\n".join([chunk.chunk.text for chunk in result.retrieved_chunks])

structured = formatter.structure_with_llm(raw_text, query=result.query)
pdf_bytes = formatter.render_styled_pdf(structured)

Key Features

LLM Data Structuring

Converts raw RAG text → structured JSON
Automatic metric extraction
Smart section organization
Source citation handling
Temperature-controlled (0.3 default)

Page Break Control

page-break-inside: avoid for headers
page-break-after: avoid for headings
Intelligent pagination for content sections

Markdown Support

Optional markdown processing in content
Converts markdown to HTML before PDF generation
Supports: headers, lists, code blocks, tables

Source Citations

Automatic reference list generation
Relevance score display
Page number tracking
URL linking

Technical Details

Dependencies

weasyprint: PDF generation from HTML/CSS
jinja2: Template rendering
markdown2: Markdown processing (optional)
pydantic: Data validation

Template Engine

Jinja2 for HTML generation
Autoescape enabled for security
FileSystemLoader for template directory

CSS Architecture

Separate CSS file per style
Loaded at formatter initialization
Injected into template at render time
Print media query optimizations

Page Setup

@page {
    size: A4;
    margin: 2.5cm;
    @bottom-right {
        content: counter(page);
    }
}

File Structure

pulldata/synthesis/
├── report_models.py              # Pydantic models + system prompts
├── templates/
│   ├── report_base.html          # Master Jinja2 template
│   └── styles/
│       ├── executive.css         # Style A
│       ├── modernist.css         # Style B
│       └── academic.css          # Style C
└── formatters/
    ├── styled_pdf.py             # StyledPDFFormatter class
    └── __init__.py               # Updated exports

examples/
└── styled_pdf_example.py         # 4 complete examples

docs/
└── STYLED_PDF_GUIDE.md           # Complete documentation

Testing

Run the example to verify installation:

# Basic test (no LLM required)
python examples/styled_pdf_example.py

# This will generate:
./output/styled_pdfs/financial_report_executive.pdf
./output/styled_pdfs/financial_report_modernist.pdf
./output/styled_pdfs/financial_report_academic.pdf
./output/style_comparison/comparison_*.pdf

Extension Points

Adding New Styles

Create new CSS file in templates/styles/
Add to AVAILABLE_STYLES dict in styled_pdf.py
Use immediately: render_styled_pdf(data, style_name="custom")

Custom Templates

Create new Jinja2 template
Use custom template environment:

formatter = StyledPDFFormatter(style="executive")
template = formatter.jinja_env.get_template("custom_report.html")

LLM Prompt Customization

Modify prompts in report_models.py:

REPORT_STRUCTURING_PROMPT - General structuring
REPORT_STRUCTURING_PROMPT_WITH_SOURCES - With source metadata

Performance Considerations

PDF Generation

Typical render time: 1-3 seconds per PDF
Depends on: content length, complexity, style
WeasyPrint is single-threaded

LLM Structuring

Depends on LLM latency (API or local)
Temperature 0.3 for consistent JSON output
Max tokens: 4000 (configurable)

Optimization Tips

Cache ReportData objects for reuse
Generate PDFs asynchronously in web apps
Reuse formatter instances
Disable markdown processing if not needed

Known Limitations

WeasyPrint Installation: Requires system dependencies on some platforms
LLM JSON Parsing: Smaller models may struggle with consistent JSON output
Two-Column Layout: Academic style may have pagination issues with very long tables
Font Support: Limited to system-installed fonts

Future Enhancements

Potential additions (not yet implemented):

Charts & Graphs: matplotlib/plotly integration
Custom Fonts: Embedded font support
Interactive PDFs: Form fields, hyperlinks
Batch Generation: Multi-report generation
Template Gallery: Pre-built templates for common use cases
Cover Pages: Customizable cover page templates
Watermarks: Security watermark support
Digital Signatures: PDF signing capability

Support & Resources

Documentation: docs/STYLED_PDF_GUIDE.md
Examples: examples/styled_pdf_example.py
API Reference: See documentation
Issues: GitHub issues for bug reports

Credits

Created: 2024-12-18 Author: PullData Development Team Version: 1.0

Quick Reference

Install

pip install weasyprint jinja2 markdown2

Generate PDF

from pulldata.synthesis import ReportData, render_styled_pdf

data = ReportData(title="Report", summary="...", sections=[...])
pdf = render_styled_pdf(data, style_name="executive")

With LLM

from pulldata.synthesis import StyledPDFFormatter

formatter = StyledPDFFormatter(style="modernist", llm=my_llm)
structured = formatter.structure_with_llm(raw_text)
pdf = formatter.render_styled_pdf(structured)

All Styles

executive - Clean & minimal
modernist - Bold & impactful
academic - Traditional & formal

Status: ✅ Complete and ready for use

FilesExpand file tree

STYLED_PDF_SUMMARY.md

Latest commit

History