eoberortner/news-agent

Biotech News Pipeline

An automation pipeline that turns biotech RSS feeds into a podcast script and LinkedIn posts. It orchestrates the full workflow, from feed collection and date filtering to content generation, and stores each run in its own timestamped output directory with a detailed log.

🚀 Complete Workflow

4-Step Pipeline Process:

  1. 📡 RSS Feed Parsing → Extract articles from biotech RSS feeds
  2. 📅 Date Range Filtering → Focus on specific time periods
  3. 🎙️ Podcast Generation → Create professional podcast scripts
  4. 💼 LinkedIn Post Creation → Generate social media content
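
As a rough illustration of how four such steps could be chained, here is a minimal sketch. It is not the code in pipeline.py: `run_pipeline` and the step callables are hypothetical names, and the only assumptions are that each step receives the previous step's result and that a failure is logged before the run stops.

```python
import logging


def run_pipeline(steps, data=None):
    """Run (name, function) steps in order, feeding each result to the next."""
    log = logging.getLogger("pipeline")
    for name, step in steps:
        log.info("starting %s", name)
        try:
            data = step(data)
        except Exception:
            # Record which step failed, then stop the whole run.
            log.exception("step %s failed", name)
            raise
        log.info("finished %s", name)
    return data


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Stand-in steps; the real ones would parse feeds, filter, and generate text.
    steps = [
        ("parse", lambda _: ["article a", "article b", "article c"]),
        ("filter", lambda articles: articles[:2]),
        ("podcast", lambda articles: " / ".join(articles)),
    ]
    print(run_pipeline(steps))
```

Passing data from step to step keeps each stage testable in isolation, which matches the way the individual scripts can also be run on their own.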

Organized Output Structure:

output/
└── run_20250127_143022/
    ├── raw/                    # Original RSS parsed articles
    │   └── articles_summary.txt
    ├── processed/              # Date-filtered articles
    │   └── filtered_articles.txt
    ├── final/                  # Ready-to-use content
    │   ├── podcast_script.txt
    │   ├── linkedin_post.txt
    │   └── linkedin_post_compact.txt
    ├── pipeline_log.txt        # Detailed execution log
    └── pipeline_summary.txt    # Run summary report
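
The layout above can be reproduced with a few lines of standard-library Python. This is a sketch under the assumption that the run folder name is `run_` followed by a `YYYYMMDD_HHMMSS` timestamp, as the example name suggests; `create_run_dir` is a hypothetical helper, not a function from this repository.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def create_run_dir(base: Path, now: datetime) -> Path:
    """Create base/run_YYYYMMDD_HHMMSS containing raw/, processed/ and final/."""
    run_dir = base / f"run_{now:%Y%m%d_%H%M%S}"
    for sub in ("raw", "processed", "final"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    return run_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = create_run_dir(Path(tmp) / "output", datetime(2025, 1, 27, 14, 30, 22))
        print(run_dir.name)
        print(sorted(p.name for p in run_dir.iterdir()))
```

Because the timestamp has one-second resolution, two runs started within the same second would share a directory; `exist_ok=True` keeps that case from raising.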

📋 Scripts Overview

Core Pipeline Scripts:

| Script | Purpose | Input | Output |
|---|---|---|---|
| pipeline.py | Main orchestrator | RSS feeds, date range | Complete pipeline output |
| rss_parser.py | Parse RSS feeds | sources.txt | Articles with metadata |
| query_articles.py | Filter by date range | Articles summary | Filtered articles |
| podcast_generator.py | Generate podcast script | Filtered articles | Professional podcast |
| linkedin_extractor.py | Create LinkedIn posts | Podcast script | Social media content |

Supporting Files:

  • sources.txt - RSS feed URLs
  • PIPELINE_README.md - Detailed pipeline documentation
  • PODCAST_README.md - Podcast generator documentation
  • LINKEDIN_README.md - LinkedIn extractor documentation

🎯 Quick Start

Prerequisites:

Option 1: Using uv (Recommended - Fast & Modern)

# Install uv if you haven't already
# macOS/Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or via pip:
pip install uv

# Install dependencies and create virtual environment
uv sync

# Activate the virtual environment
source .venv/bin/activate  # On macOS/Linux
# or
.venv\Scripts\activate  # On Windows

# Set up environment variables
# Copy .env.example to .env and add your OpenAI API key
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY

# Ensure sources.txt contains your RSS feed URLs

Option 2: Using pip (Traditional)

# Install Python dependencies
pip install -r requirements.txt

# Set up environment variables
# Copy .env.example to .env and add your OpenAI API key
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY

# Ensure sources.txt contains your RSS feed URLs

Run Complete Pipeline:

# Default: Last 7 days
python pipeline.py

# Custom date range
python pipeline.py --start-date 2025-08-18 --end-date 2025-08-24

# Custom output directory
python pipeline.py --output my_reports --days 14

# Single day analysis
python pipeline.py --output daily_reports --days 1

Individual Scripts:

# RSS parsing only
python rss_parser.py

# Date filtering only
python query_articles.py 2025-08-18 2025-08-24

# Podcast generation only
python podcast_generator.py --input filtered.txt

# LinkedIn post only
python linkedin_extractor.py --podcast podcast.txt --articles filtered.txt

📊 Pipeline Features

RSS Feed Parsing (rss_parser.py)

  • Multi-source aggregation: Combines articles from multiple RSS feeds
  • Duplicate detection: Identifies and handles duplicate articles
  • Metadata extraction: Captures title, URL, publication date, content preview
  • Occurrence tracking: Counts article appearances across sources
  • Error handling: Graceful handling of network issues and malformed feeds
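
To make duplicate detection and occurrence tracking concrete, here is a minimal sketch. It assumes articles are plain dicts with `title`, `url` and `source` keys and that duplicates are matched on a normalized title; the actual matching rule in rss_parser.py may differ, and both function names are hypothetical.

```python
import re


def normalize_title(title: str) -> str:
    """Lower-case and collapse punctuation/whitespace so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def merge_duplicates(articles):
    """Merge articles sharing a normalized title; count occurrences and collect sources."""
    merged = {}
    for art in articles:
        key = normalize_title(art["title"])
        if key not in merged:
            # The first copy seen supplies the title and URL that are kept.
            merged[key] = {**art, "occurrences": 0, "sources": []}
        entry = merged[key]
        entry["occurrences"] += 1
        if art["source"] not in entry["sources"]:
            entry["sources"].append(art["source"])
    return list(merged.values())


if __name__ == "__main__":
    feed_items = [
        {"title": "FDA approves new gene therapy", "url": "https://a.example/1", "source": "Feed A"},
        {"title": "FDA Approves New Gene Therapy!", "url": "https://b.example/9", "source": "Feed B"},
        {"title": "Microbiome study links diet to immunity", "url": "https://a.example/2", "source": "Feed A"},
    ]
    for item in merge_duplicates(feed_items):
        print(item["occurrences"], item["sources"], item["title"])
```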

Date Range Filtering (query_articles.py)

  • Flexible date ranges: Custom start/end dates or relative periods
  • Source analysis: Provides statistics on article sources
  • Content statistics: Analyzes article length and distribution
  • Enhanced metadata: Adds content length and source diversity metrics
  • Filtered output: Clean, organized article list with metadata
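
The filtering and source statistics described above could look roughly like the following. This is a sketch, assuming each article is a dict with a `published` value of type `datetime.date` and a `source` name, and that both range ends are inclusive; query_articles.py may use a different representation.

```python
from collections import Counter
from datetime import date


def filter_by_date(articles, start: date, end: date):
    """Keep articles published within [start, end], newest first."""
    kept = [a for a in articles if a.get("published") and start <= a["published"] <= end]
    return sorted(kept, key=lambda a: a["published"], reverse=True)


def source_counts(articles):
    """Count how many articles each source contributed."""
    return Counter(a["source"] for a in articles)


if __name__ == "__main__":
    sample = [
        {"title": "A", "source": "Feed A", "published": date(2025, 8, 17)},
        {"title": "B", "source": "Feed A", "published": date(2025, 8, 18)},
        {"title": "C", "source": "Feed B", "published": date(2025, 8, 24)},
        {"title": "D", "source": "Feed B", "published": None},  # undated, dropped
    ]
    selected = filter_by_date(sample, date(2025, 8, 18), date(2025, 8, 24))
    print([a["title"] for a in selected])
    print(dict(source_counts(selected)))
```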

Podcast Generation (podcast_generator.py)

  • Impact scoring: Ranks articles by relevance and importance
  • Topic diversity: Ensures coverage across biotech categories
  • Narrative flow: Structures content with opening, main stories, quick hits
  • Professional formatting: Creates broadcast-ready scripts
  • Source attribution: Lists all sources at the end
  • Duration control: Configurable target podcast length
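
Duration control can be approximated by converting a target length into a word budget. The sketch below assumes a speaking rate of 150 words per minute, which is an illustrative figure and not a value taken from podcast_generator.py; both function names are hypothetical.

```python
def estimate_minutes(script: str, words_per_minute: int = 150) -> float:
    """Estimate spoken duration from the word count."""
    return len(script.split()) / words_per_minute


def fit_to_duration(stories, target_minutes: float, words_per_minute: int = 150):
    """Take already-ranked stories in order until the word budget is used up."""
    budget = target_minutes * words_per_minute
    chosen, used = [], 0
    for story in stories:
        words = len(story.split())
        if used + words > budget:
            break  # the next story would overshoot the target length
        chosen.append(story)
        used += words
    return chosen


if __name__ == "__main__":
    print(estimate_minutes("word " * 300))
    print(fit_to_duration(["a b c", "d e", "f g h i"], target_minutes=1, words_per_minute=5))
```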

LinkedIn Post Creation (linkedin_extractor.py)

  • Clickable titles: Article titles as markdown links
  • Multiple formats: Standard and compact versions
  • HTML cleaning: Removes HTML tags from titles
  • Professional hashtags: Industry-relevant tags included
  • Title-URL mapping: Automatic linking of titles to source URLs

🎙️ Podcast Features

Content Selection Algorithm:

  • Impact scoring based on keywords (clinical trial, FDA, breakthrough, etc.)
  • Topic classification into 10 biotech categories:
    • Therapeutics, Diagnostics, Research, Industry, Technology
    • Genetics, Microbiome, Cancer, Rare Disease, Infectious Disease
  • Hybrid selection balancing main stories and quick hits
  • Time-based structuring for optimal flow
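
A keyword-based version of this selection could be sketched as follows. The keyword weights, the topic keyword lists, and the three topics shown are illustrative assumptions (only "clinical trial", "FDA" and "breakthrough" are named above), and the function names are hypothetical; the real generator covers all ten categories and may score differently.

```python
# Illustrative weights; only the keywords themselves come from the README.
IMPACT_KEYWORDS = {"clinical trial": 3, "fda": 3, "breakthrough": 2}

# Assumed keyword lists for three of the ten categories.
TOPIC_KEYWORDS = {
    "Therapeutics": ("therapy", "drug", "treatment"),
    "Cancer": ("cancer", "tumor", "oncology"),
    "Microbiome": ("microbiome", "gut bacteria"),
}


def impact_score(text: str) -> int:
    """Sum the weights of all impact keywords found in the text."""
    lowered = text.lower()
    return sum(weight for kw, weight in IMPACT_KEYWORDS.items() if kw in lowered)


def classify_topic(text: str) -> str:
    """Return the first topic with a matching keyword, else 'Research'."""
    lowered = text.lower()
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(kw in lowered for kw in keywords):
            return topic
    return "Research"


def select_stories(articles, n_main=2, n_quick=2):
    """Pick top-scoring main stories with distinct topics; the rest become quick hits."""
    ranked = sorted(articles, key=lambda a: impact_score(a["title"]), reverse=True)
    main, seen_topics = [], set()
    for art in ranked:
        topic = classify_topic(art["title"])
        if len(main) < n_main and topic not in seen_topics:
            main.append(art)
            seen_topics.add(topic)
    quick = [a for a in ranked if a not in main][:n_quick]
    return main, quick
```

Requiring distinct topics for the main stories is one simple way to get topic diversity: a second high-scoring story on the same topic falls through to the quick hits instead of being dropped.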

Output Format:

🎙️ BIOTECH NEWS PODCAST
[Date Range]

=== OPENING ===
Welcome to this week's biotech news roundup...

=== MAIN STORIES ===
Story 1: [Title]
[Detailed summary with context and implications]

Story 2: [Title]
[Detailed summary with context and implications]

=== QUICK HITS ===
• [Brief summary]
• [Brief summary]

=== TRENDS & INSIGHTS ===
[Analysis of common themes and patterns]

=== CLOSING ===
[Wrap-up and forward-looking statements]

=== SOURCES SUMMARY ===
[Complete list of sources used]

💼 LinkedIn Features

Post Formats:

Standard Format:

🔬 This Week's Top Biotech News

Here are the key developments in biotechnology this week:

1. [Article Title](URL)

2. [Article Title](URL)

#Biotech #Biotechnology #Science #Innovation #Healthcare #Research

Compact Format:

🔬 This Week's Top Biotech News

Key developments in biotechnology:

1. [Article Title](URL)
2. [Article Title](URL)

#Biotech #Biotechnology #Science #Innovation #Healthcare #Research
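
Both layouts can be produced from the same list of articles. The following is a minimal sketch, assuming articles are dicts with `title` and `url` keys; `clean_title` and `build_post` are hypothetical names, and the regex-based tag stripping is a simplification that may not match what linkedin_extractor.py does.

```python
import html
import re

HASHTAGS = "#Biotech #Biotechnology #Science #Innovation #Healthcare #Research"


def clean_title(raw: str) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", "", raw)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


def build_post(articles, compact: bool = False) -> str:
    """Render numbered markdown links in the standard or compact layout."""
    intro = (
        "Key developments in biotechnology:"
        if compact
        else "Here are the key developments in biotechnology this week:"
    )
    items = [
        f"{i}. [{clean_title(a['title'])}]({a['url']})"
        for i, a in enumerate(articles, start=1)
    ]
    # The compact layout drops the blank line between list items.
    separator = "\n" if compact else "\n\n"
    return "\n\n".join(
        ["🔬 This Week's Top Biotech News", intro, separator.join(items), HASHTAGS]
    )


if __name__ == "__main__":
    demo = [
        {"title": "<em>CRISPR</em> &amp; base editing", "url": "https://example.com/1"},
        {"title": "New  antibody\nplatform", "url": "https://example.com/2"},
    ]
    print(build_post(demo, compact=True))
```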

🔧 Configuration

RSS Sources (sources.txt):

https://phys.org/rss-feed/biology-news/
https://www.sciencedaily.com/rss/health_medicine/biotechnology.xml
https://www.labiotech.eu/feed/
https://www.genengnews.com/feed/
https://endpoints.news/feed/
https://www.biopharmadive.com/rss/
https://www.fiercebiotech.com/rss/xml
https://bio.news/feed/
https://www.biotech.ca/news/feed/
https://www.technologyreview.com/topic/biotechnology/feed/
https://biotechexpressmag.com/feed/
https://o2h.com/feed/
https://bioengineer.org/feed/

Command Line Options:

# Pipeline options
--output, -o          Output directory (default: output)
--start-date, -s      Start date for filtering (YYYY-MM-DD)
--end-date, -e        End date for filtering (YYYY-MM-DD)
--days, -d            Number of days to look back (default: 7)

# Podcast options
--input, -i           Input articles file
--output, -o          Output podcast file
--duration, -d        Target duration in minutes (default: 10)

# LinkedIn options
--podcast, -p         Podcast script file
--articles, -a        Articles file with URLs
--output, -o          Output LinkedIn post file
--compact, -c         Generate compact format
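
For reference, the pipeline options listed above map naturally onto `argparse`. This is a sketch, not the parser in pipeline.py: the rule that explicit dates take precedence over `--days`, and that the window ends today, are assumptions, and `build_parser` and `resolve_range` are hypothetical names.

```python
import argparse
from datetime import date, timedelta


def build_parser() -> argparse.ArgumentParser:
    """Declare the four pipeline options with the documented defaults."""
    parser = argparse.ArgumentParser(description="Biotech news pipeline (sketch)")
    parser.add_argument("--output", "-o", default="output", help="Output directory")
    parser.add_argument("--start-date", "-s", type=date.fromisoformat, help="YYYY-MM-DD")
    parser.add_argument("--end-date", "-e", type=date.fromisoformat, help="YYYY-MM-DD")
    parser.add_argument("--days", "-d", type=int, default=7, help="Days to look back")
    return parser


def resolve_range(args, today: date):
    """Explicit dates win; otherwise look back `days` days from the end date."""
    end = args.end_date or today
    start = args.start_date or end - timedelta(days=args.days)
    return start, end


if __name__ == "__main__":
    args = build_parser().parse_args(["--output", "my_reports", "--days", "14"])
    print(args.output, resolve_range(args, date.today()))
```

Using `date.fromisoformat` as the `type` lets argparse reject malformed dates with a usage error before any feed is fetched.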

📈 Usage Examples

Daily Reports:

# Generate daily report
python pipeline.py --output daily_reports --days 1

# Schedule with cron (daily at 9 AM)
0 9 * * * cd /path/to/news-podcast && python pipeline.py --output daily_reports --days 1

Weekly Summaries:

# Generate weekly summary
python pipeline.py --output weekly_reports --days 7

# Schedule with cron (every Monday at 8 AM)
0 8 * * 1 cd /path/to/news-podcast && python pipeline.py --output weekly_reports --days 7

Custom Analysis:

# Monthly analysis
python pipeline.py --output monthly_reports --start-date 2025-08-01 --end-date 2025-08-31

# Event-specific coverage
python pipeline.py --output conference_coverage --start-date 2025-08-15 --end-date 2025-08-20

# Recent developments (last 3 days)
python pipeline.py --output recent_news --days 3

Individual Component Usage:

# Parse RSS feeds only
python rss_parser.py

# Filter articles for specific date range
python query_articles.py 2025-08-18 2025-08-24 --output filtered_articles.txt

# Generate podcast from filtered articles
python podcast_generator.py --input filtered_articles.txt --output podcast_script.txt

# Create LinkedIn post from podcast
python linkedin_extractor.py --podcast podcast_script.txt --articles filtered_articles.txt --output linkedin_post.txt

🛠️ Technical Details

Dependencies:

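  • Python 3.7+
  • feedparser - RSS feed parsing
  • requests - HTTP requests
  • beautifulsoup4 - HTML parsing
  • datetime - Date handling (Python standard library, no installation needed)
  • pathlib - File path management (Python standard library, no installation needed)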

Installation:

# Clone repository
git clone <repository-url>
cd news-agent  # use the directory created by git clone (named after the repository)

# Install dependencies
# Using uv (Recommended):
uv sync
source .venv/bin/activate  # macOS/Linux

# Or using pip:
pip install -r requirements.txt

# Set up RSS sources
# Edit sources.txt with your preferred RSS feeds

File Structure:

news-podcast/
├── pipeline.py              # Main pipeline orchestrator
├── rss_parser.py            # RSS feed parser
├── query_articles.py        # Date range filter
├── podcast_generator.py     # Podcast script generator
├── linkedin_extractor.py    # LinkedIn post creator
├── sources.txt              # RSS feed URLs
├── requirements.txt         # Python dependencies
├── README.md                # This file
├── PIPELINE_README.md       # Detailed pipeline docs
├── PODCAST_README.md        # Podcast generator docs
├── LINKEDIN_README.md       # LinkedIn extractor docs
├── CLEANUP_SUMMARY.md       # Cleanup documentation
└── example_output/          # Example pipeline run
    └── run_20250827_121133/
        ├── raw/
        ├── processed/
        ├── final/
        ├── pipeline_log.txt
        └── pipeline_summary.txt

🎯 Benefits

✅ Complete Automation: End-to-end workflow from RSS to social media
✅ Professional Quality: Production-ready content generation
✅ Organized Output: Structured file organization with clear naming
✅ Comprehensive Logging: Detailed execution tracking and error reporting
✅ Flexible Configuration: Customizable date ranges and output locations
✅ Easy Integration: Works with existing scripts and workflows
✅ Scalable: Handles multiple runs with unique identifiers
✅ Reproducible: Consistent results with detailed logging

🔄 Automation & Scheduling

Cron Job Examples:

# Daily reports at 9 AM
0 9 * * * cd /path/to/news-podcast && python pipeline.py --output daily_reports --days 1

# Weekly summaries every Monday at 8 AM
0 8 * * 1 cd /path/to/news-podcast && python pipeline.py --output weekly_reports --days 7

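# Monthly analysis on the 1st at 7 AM
# Note: in a crontab, % must be escaped as \% (an unescaped % ends the command); date -d requires GNU date
0 7 1 * * cd /path/to/news-podcast && python pipeline.py --output monthly_reports --start-date $(date -d '1 month ago' +\%Y-\%m-01) --end-date $(date +\%Y-\%m-\%d)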
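
If the monthly job is meant to cover exactly the previous calendar month, the date range can also be computed portably in Python rather than with `date -d`. This is a small sketch; `previous_month_range` is a hypothetical helper and not part of the repository.

```python
from datetime import date, timedelta


def previous_month_range(today: date):
    """Return (first day, last day) of the calendar month before `today`."""
    first_of_this_month = today.replace(day=1)
    last_of_previous = first_of_this_month - timedelta(days=1)
    return last_of_previous.replace(day=1), last_of_previous


if __name__ == "__main__":
    start, end = previous_month_range(date.today())
    # Values suitable for --start-date and --end-date.
    print(start.isoformat(), end.isoformat())
```

Stepping back one day from the first of the current month avoids any month-length or leap-year arithmetic.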

Batch Processing:

# Process multiple date ranges
for days in 1 3 7 14 30; do
    python pipeline.py --output batch_reports --days $days
done

📝 Best Practices

For Regular Use:

  1. Schedule runs: Use cron jobs for automated daily/weekly reports
  2. Monitor logs: Check pipeline_log.txt for any issues
  3. Archive results: Keep historical runs for comparison
  4. Customize output: Use meaningful output directory names

For Content Strategy:

  1. Consistent timing: Run at regular intervals for audience expectations
  2. Quality review: Always review generated content before posting
  3. Engagement tracking: Monitor LinkedIn post performance
  4. Iterative improvement: Use feedback to refine the pipeline

For Technical Maintenance:

  1. Update sources: Regularly review and update RSS feed sources
  2. Monitor dependencies: Keep Python packages updated
  3. Backup data: Archive important pipeline runs
  4. Performance monitoring: Track execution times and resource usage

🚀 Getting Started

Step-by-Step Setup:

  1. Install Dependencies:

    Using uv (Recommended):

    uv sync
    source .venv/bin/activate  # macOS/Linux

    Using pip:

    pip install -r requirements.txt
  2. Configure RSS Sources:

    • Edit sources.txt with your preferred RSS feed URLs
    • Ensure URLs are accessible and valid
  3. Test the Pipeline:

    python pipeline.py --output test_run --days 1
  4. Review Output:

    • Check test_run/ directory for generated content
    • Review pipeline_log.txt for execution details
    • Examine pipeline_summary.txt for run overview
  5. Customize Settings:

    • Adjust date ranges as needed
    • Modify output directory names
    • Configure podcast duration preferences
  6. Set Up Automation:

    • Create cron jobs for regular execution
    • Monitor logs for any issues
    • Archive important runs

Example First Run:

# Run pipeline for last 3 days
python pipeline.py --output first_run --days 3

# Check results
ls first_run/
cat first_run/*/pipeline_summary.txt
cat first_run/*/final/linkedin_post.txt

📊 Example Output

The example_output/ directory contains a complete pipeline run demonstrating:

  • 98 articles processed from 1 day
  • Professional podcast script (13.3 minutes)
  • LinkedIn posts with clickable titles (standard & compact)
  • Complete logging and execution summary

This serves as a reference for expected output format and quality.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📝 Changelog

See CHANGELOG.md for a detailed history of changes and features.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

The Biotech News Pipeline transforms raw RSS feeds into professional, engagement-ready content with complete automation and organization! 🎉

About

code for BioTech Weekly newsletter
