Skip to content

Latest commit

 

History

History
376 lines (276 loc) · 9.85 KB

File metadata and controls

376 lines (276 loc) · 9.85 KB

Styled PDF Synthesis System - Implementation Summary

Overview

A complete styled PDF generation system has been added to PullData, featuring LLM-powered data structuring and three professional visual styles.

What Was Created

1. Data Contract (Pydantic Models & System Prompt)

File: pulldata/synthesis/report_models.py

  • ReportData: Main schema for structured reports

    • title, subtitle, summary
    • metrics: List of MetricItem (label, value, icon)
    • sections: List of ReportSection (heading, content, subsections)
    • references: List of Reference (title, url, page, score)
    • metadata: Additional fields
  • System Prompts: Two variants for LLM structuring

    • REPORT_STRUCTURING_PROMPT: General text structuring
    • REPORT_STRUCTURING_PROMPT_WITH_SOURCES: For RAG results with sources

2. Visual Styles (Jinja2 + CSS)

Template: pulldata/synthesis/templates/report_base.html

  • Single master template that adapts based on CSS
  • Semantic HTML structure
  • Support for all ReportData fields

Styles: pulldata/synthesis/templates/styles/

Style A: "The Executive" (executive.css)

  • Purpose: Executive summaries, board presentations
  • Design: Clean, minimalist, blue/grey palette
  • Features: Generous whitespace, emphasized summary box, professional metrics
  • Best for: C-level audiences, time-constrained readers

Style B: "The Modernist" (modernist.css)

  • Purpose: Marketing materials, innovation reports
  • Design: Bold typography, high contrast, dark accents
  • Features: Orange/red gradients, asymmetric layout, impact-focused
  • Best for: Creative content, making strong impressions

Style C: "The Academic" (academic.css)

  • Purpose: Research reports, white papers
  • Design: Traditional serif fonts, two-column layout
  • Features: Dense information, classic formatting, bibliography style
  • Best for: Formal documentation, technical reports

3. Python Synthesis Logic

File: pulldata/synthesis/formatters/styled_pdf.py

Main Classes

StyledPDFFormatter(OutputFormatter)

  • Extends existing PullData formatter architecture
  • Integrates with Jinja2 and WeasyPrint
  • Optional LLM integration for data structuring

Key Methods:

def format(data: OutputData) -> bytes
    # Convert standard OutputData to PDF

def render_styled_pdf(data: ReportData, style_name: str) -> bytes
    # Render ReportData with specified style

def structure_with_llm(raw_text: str, query: str, sources: list) -> ReportData
    # Use LLM to structure raw RAG text into ReportData

Convenience Function

def render_styled_pdf(data: ReportData, style_name: str) -> bytes
    # Quick PDF generation without creating formatter instance

4. Integration

Updated Files:

  • pulldata/synthesis/__init__.py - Exports new models and formatters
  • pulldata/synthesis/formatters/__init__.py - Exports StyledPDFFormatter
  • requirements.txt - Added weasyprint and jinja2

5. Documentation & Examples

Documentation: docs/STYLED_PDF_GUIDE.md

  • Complete guide (6,000+ words)
  • Quick start, API reference, troubleshooting
  • Style comparison, customization guide

Examples: examples/styled_pdf_example.py

  • 4 complete working examples:
    1. Manual ReportData creation
    2. LLM-powered structuring
    3. PullData query integration
    4. Style comparison

Usage Patterns

Pattern 1: Manual Report Creation

from pulldata.synthesis import ReportData, MetricItem, ReportSection, render_styled_pdf

report = ReportData(
    title="Q3 Report",
    summary="Strong performance with 15% growth",
    metrics=[
        MetricItem(label="Revenue", value="$10.5M", icon="💰"),
    ],
    sections=[
        ReportSection(heading="Overview", content="...")
    ]
)

pdf_bytes = render_styled_pdf(report, style_name="executive")

Pattern 2: LLM Structuring

from pulldata import PullData
from pulldata.synthesis import StyledPDFFormatter

pd = PullData(project="my_project")
formatter = StyledPDFFormatter(style="modernist", llm=pd._llm)

structured = formatter.structure_with_llm(
    raw_text="Q3 revenue was $10.5M, up 15%...",
    query="What was Q3 performance?"
)

pdf_bytes = formatter.render_styled_pdf(structured)

Pattern 3: Integration with Queries

# Perform query
result = pd.query("What were Q3 metrics?", k=5, generate_answer=True)

# Structure and generate PDF
formatter = StyledPDFFormatter(style="academic", llm=pd._llm)

raw_text = f"{result.llm_response.text}\n\n"
raw_text += "\n".join([chunk.chunk.text for chunk in result.retrieved_chunks])

structured = formatter.structure_with_llm(raw_text, query=result.query)
pdf_bytes = formatter.render_styled_pdf(structured)

Key Features

LLM Data Structuring

  • Converts raw RAG text → structured JSON
  • Automatic metric extraction
  • Smart section organization
  • Source citation handling
  • Temperature-controlled (0.3 default)

Page Break Control

  • page-break-inside: avoid for headers
  • page-break-after: avoid for headings
  • Intelligent pagination for content sections

Markdown Support

  • Optional markdown processing in content
  • Converts markdown to HTML before PDF generation
  • Supports: headers, lists, code blocks, tables

Source Citations

  • Automatic reference list generation
  • Relevance score display
  • Page number tracking
  • URL linking

Technical Details

Dependencies

  • weasyprint: PDF generation from HTML/CSS
  • jinja2: Template rendering
  • markdown2: Markdown processing (optional)
  • pydantic: Data validation

Template Engine

  • Jinja2 for HTML generation
  • Autoescape enabled for security
  • FileSystemLoader for template directory

CSS Architecture

  • Separate CSS file per style
  • Loaded at formatter initialization
  • Injected into template at render time
  • Print media query optimizations

Page Setup

@page {
    size: A4;
    margin: 2.5cm;
    @bottom-right {
        content: counter(page);
    }
}

File Structure

pulldata/synthesis/
├── report_models.py              # Pydantic models + system prompts
├── templates/
│   ├── report_base.html          # Master Jinja2 template
│   └── styles/
│       ├── executive.css         # Style A
│       ├── modernist.css         # Style B
│       └── academic.css          # Style C
└── formatters/
    ├── styled_pdf.py             # StyledPDFFormatter class
    └── __init__.py               # Updated exports

examples/
└── styled_pdf_example.py         # 4 complete examples

docs/
└── STYLED_PDF_GUIDE.md           # Complete documentation

Testing

Run the example to verify installation:

# Basic test (no LLM required)
python examples/styled_pdf_example.py

# This will generate:
./output/styled_pdfs/financial_report_executive.pdf
./output/styled_pdfs/financial_report_modernist.pdf
./output/styled_pdfs/financial_report_academic.pdf
./output/style_comparison/comparison_*.pdf

Extension Points

Adding New Styles

  1. Create new CSS file in templates/styles/
  2. Add to AVAILABLE_STYLES dict in styled_pdf.py
  3. Use immediately: render_styled_pdf(data, style_name="custom")

Custom Templates

  1. Create new Jinja2 template
  2. Use custom template environment:
formatter = StyledPDFFormatter(style="executive")
template = formatter.jinja_env.get_template("custom_report.html")

LLM Prompt Customization

Modify prompts in report_models.py:

  • REPORT_STRUCTURING_PROMPT - General structuring
  • REPORT_STRUCTURING_PROMPT_WITH_SOURCES - With source metadata

Performance Considerations

PDF Generation

  • Typical render time: 1-3 seconds per PDF
  • Depends on: content length, complexity, style
  • WeasyPrint is single-threaded

LLM Structuring

  • Depends on LLM latency (API or local)
  • Temperature 0.3 for consistent JSON output
  • Max tokens: 4000 (configurable)

Optimization Tips

  1. Cache ReportData objects for reuse
  2. Generate PDFs asynchronously in web apps
  3. Reuse formatter instances
  4. Disable markdown processing if not needed

Known Limitations

  1. WeasyPrint Installation: Requires system dependencies on some platforms
  2. LLM JSON Parsing: Smaller models may struggle with consistent JSON output
  3. Two-Column Layout: Academic style may have pagination issues with very long tables
  4. Font Support: Limited to system-installed fonts

Future Enhancements

Potential additions (not yet implemented):

  1. Charts & Graphs: matplotlib/plotly integration
  2. Custom Fonts: Embedded font support
  3. Interactive PDFs: Form fields, hyperlinks
  4. Batch Generation: Multi-report generation
  5. Template Gallery: Pre-built templates for common use cases
  6. Cover Pages: Customizable cover page templates
  7. Watermarks: Security watermark support
  8. Digital Signatures: PDF signing capability

Support & Resources

  • Documentation: docs/STYLED_PDF_GUIDE.md
  • Examples: examples/styled_pdf_example.py
  • API Reference: See documentation
  • Issues: GitHub issues for bug reports

Credits

Created: 2024-12-18 Author: PullData Development Team Version: 1.0


Quick Reference

Install

pip install weasyprint jinja2 markdown2

Generate PDF

from pulldata.synthesis import ReportData, render_styled_pdf

data = ReportData(title="Report", summary="...", sections=[...])
pdf = render_styled_pdf(data, style_name="executive")

With LLM

from pulldata.synthesis import StyledPDFFormatter

formatter = StyledPDFFormatter(style="modernist", llm=my_llm)
structured = formatter.structure_with_llm(raw_text)
pdf = formatter.render_styled_pdf(structured)

All Styles

  • executive - Clean & minimal
  • modernist - Bold & impactful
  • academic - Traditional & formal

Status: ✅ Complete and ready for use