A comprehensive automation system that transforms RSS feeds into professional content ready for distribution. This pipeline orchestrates the entire workflow from data collection to social media posting, with organized output storage and detailed logging.
- RSS Feed Parsing: Extract articles from biotech RSS feeds
- Date Range Filtering: Focus on specific time periods
- Podcast Generation: Create professional podcast scripts
- LinkedIn Post Creation: Generate social media content
output/
└── run_20250127_143022/
    ├── raw/                       # Original RSS parsed articles
    │   └── articles_summary.txt
    ├── processed/                 # Date-filtered articles
    │   └── filtered_articles.txt
    ├── final/                     # Ready-to-use content
    │   ├── podcast_script.txt
    │   ├── linkedin_post.txt
    │   └── linkedin_post_compact.txt
    ├── pipeline_log.txt           # Detailed execution log
    └── pipeline_summary.txt       # Run summary report
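A timestamped run directory with this layout can be created in a few lines of standard-library Python. This is a minimal sketch; the helper name `create_run_dirs` is illustrative, not part of the pipeline:

```python
from datetime import datetime
from pathlib import Path

def create_run_dirs(base: str = "output") -> Path:
    """Create a timestamped run directory with raw/, processed/, and final/ subfolders."""
    run_id = datetime.now().strftime("run_%Y%m%d_%H%M%S")
    run_dir = Path(base) / run_id
    for sub in ("raw", "processed", "final"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    return run_dir
```

Each invocation yields a unique `run_YYYYMMDD_HHMMSS` directory, which is what makes repeated runs under the same output root safe.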
| Script | Purpose | Input | Output |
|---|---|---|---|
| `pipeline.py` | Main orchestrator | RSS feeds, date range | Complete pipeline output |
| `rss_parser.py` | Parse RSS feeds | `sources.txt` | Articles with metadata |
| `query_articles.py` | Filter by date range | Articles summary | Filtered articles |
| `podcast_generator.py` | Generate podcast script | Filtered articles | Professional podcast script |
| `linkedin_extractor.py` | Create LinkedIn posts | Podcast script | Social media content |
- `sources.txt` - RSS feed URLs
- `PIPELINE_README.md` - Detailed pipeline documentation
- `PODCAST_README.md` - Podcast generator documentation
- `LINKEDIN_README.md` - LinkedIn extractor documentation
# Install uv if you haven't already
# macOS/Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or via pip:
pip install uv
# Install dependencies and create virtual environment
uv sync
# Activate the virtual environment
source .venv/bin/activate # On macOS/Linux
# or
.venv\Scripts\activate # On Windows
# Set up environment variables
# Copy .env.example to .env and add your OpenAI API key
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY
# Ensure sources.txt contains your RSS feed URLs

# Install Python dependencies
pip install -r requirements.txt
# Set up environment variables
# Copy .env.example to .env and add your OpenAI API key
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY
# Ensure sources.txt contains your RSS feed URLs

# Default: Last 7 days
python pipeline.py
# Custom date range
python pipeline.py --start-date 2025-08-18 --end-date 2025-08-24
# Custom output directory
python pipeline.py --output my_reports --days 14
# Single day analysis
python pipeline.py --output daily_reports --days 1

# RSS parsing only
python rss_parser.py
# Date filtering only
python query_articles.py 2025-08-18 2025-08-24
# Podcast generation only
python podcast_generator.py --input filtered.txt
# LinkedIn post only
python linkedin_extractor.py --podcast podcast.txt --articles filtered.txt

- Multi-source aggregation: Combines articles from multiple RSS feeds
- Duplicate detection: Identifies and handles duplicate articles
- Metadata extraction: Captures title, URL, publication date, content preview
- Occurrence tracking: Counts article appearances across sources
- Error handling: Graceful handling of network issues and malformed feeds
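Duplicate detection and occurrence tracking can be sketched roughly as follows. `deduplicate_articles` is a hypothetical helper that keys articles on their URL, not the pipeline's actual implementation:

```python
def deduplicate_articles(articles):
    """Merge duplicate articles by URL, counting how many feeds carried each one."""
    seen = {}
    for art in articles:
        key = art["url"]
        if key in seen:
            seen[key]["occurrences"] += 1  # same story seen in another feed
        else:
            seen[key] = {**art, "occurrences": 1}
    return list(seen.values())
```

The occurrence count doubles as a crude popularity signal: a story carried by several feeds is usually more newsworthy.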
- Flexible date ranges: Custom start/end dates or relative periods
- Source analysis: Provides statistics on article sources
- Content statistics: Analyzes article length and distribution
- Enhanced metadata: Adds content length and source diversity metrics
- Filtered output: Clean, organized article list with metadata
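The date filtering step amounts to an inclusive range check on each article's publication date. A minimal sketch, assuming articles are dicts with a `published` date field (an assumption for illustration, not the script's actual schema):

```python
from datetime import date

def filter_by_date(articles, start: date, end: date):
    """Keep articles whose publication date falls within [start, end], inclusive."""
    return [a for a in articles if start <= a["published"] <= end]
```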
- Impact scoring: Ranks articles by relevance and importance
- Topic diversity: Ensures coverage across biotech categories
- Narrative flow: Structures content with opening, main stories, quick hits
- Professional formatting: Creates broadcast-ready scripts
- Source attribution: Lists all sources at the end
- Duration control: Configurable target podcast length
- Clickable titles: Article titles as markdown links
- Multiple formats: Standard and compact versions
- HTML cleaning: Removes HTML tags from titles
- Professional hashtags: Industry-relevant tags included
- Title-URL mapping: Automatic linking of titles to source URLs
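Title cleaning and linking can be approximated with a regex-based tag strip. This is a hypothetical sketch (the real extractor may rely on beautifulsoup4 instead):

```python
import re

def title_to_markdown_link(raw_title: str, url: str) -> str:
    """Strip HTML tags from a title and render it as a clickable markdown link."""
    clean = re.sub(r"<[^>]+>", "", raw_title).strip()
    return f"[{clean}]({url})"
```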
- Impact scoring based on keywords (clinical trial, FDA, breakthrough, etc.)
- Topic classification into 10 biotech categories:
- Therapeutics, Diagnostics, Research, Industry, Technology
- Genetics, Microbiome, Cancer, Rare Disease, Infectious Disease
- Hybrid selection balancing main stories and quick hits
- Time-based structuring for optimal flow
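Keyword-based impact scoring might look like the sketch below; the keyword set and weights here are illustrative, not the pipeline's actual values:

```python
# Illustrative weights; the real generator's keyword list and weights may differ.
IMPACT_KEYWORDS = {"clinical trial": 3, "fda": 3, "breakthrough": 2, "approval": 2}

def impact_score(text: str) -> int:
    """Score an article by summing the weights of impact keywords found in its text."""
    lower = text.lower()
    return sum(w for kw, w in IMPACT_KEYWORDS.items() if kw in lower)
```

Articles are then ranked by score, with the highest becoming main stories and the remainder candidates for quick hits.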
BIOTECH NEWS PODCAST
[Date Range]
=== OPENING ===
Welcome to this week's biotech news roundup...
=== MAIN STORIES ===
Story 1: [Title]
[Detailed summary with context and implications]
Story 2: [Title]
[Detailed summary with context and implications]
=== QUICK HITS ===
• [Brief summary]
• [Brief summary]
=== TRENDS & INSIGHTS ===
[Analysis of common themes and patterns]
=== CLOSING ===
[Wrap-up and forward-looking statements]
=== SOURCES SUMMARY ===
[Complete list of sources used]
This Week's Top Biotech News
Here are the key developments in biotechnology this week:
1. [Article Title](URL)
2. [Article Title](URL)
#Biotech #Biotechnology #Science #Innovation #Healthcare #Research
This Week's Top Biotech News
Key developments in biotechnology:
1. [Article Title](URL)
2. [Article Title](URL)
#Biotech #Biotechnology #Science #Innovation #Healthcare #Research
https://phys.org/rss-feed/biology-news/
https://www.sciencedaily.com/rss/health_medicine/biotechnology.xml
https://www.labiotech.eu/feed/
https://www.genengnews.com/feed/
https://endpoints.news/feed/
https://www.biopharmadive.com/rss/
https://www.fiercebiotech.com/rss/xml
https://bio.news/feed/
https://www.biotech.ca/news/feed/
https://www.technologyreview.com/topic/biotechnology/feed/
https://biotechexpressmag.com/feed/
https://o2h.com/feed/
https://bioengineer.org/feed/
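Loading these URLs from sources.txt comes down to skipping blank lines and comments. A minimal sketch; the function name `load_sources` is assumed, not taken from the pipeline:

```python
def load_sources(path: str = "sources.txt"):
    """Read RSS feed URLs from a file, skipping blank lines and # comment lines."""
    urls = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                urls.append(line)
    return urls
```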
# Pipeline options
--output, -o Output directory (default: output)
--start-date, -s Start date for filtering (YYYY-MM-DD)
--end-date, -e End date for filtering (YYYY-MM-DD)
--days, -d Number of days to look back (default: 7)
# Podcast options
--input, -i Input articles file
--output, -o Output podcast file
--duration, -d Target duration in minutes (default: 10)
# LinkedIn options
--podcast, -p Podcast script file
--articles, -a Articles file with URLs
--output, -o Output LinkedIn post file
--compact, -c Generate compact format

# Generate daily report
python pipeline.py --output daily_reports --days 1
# Schedule with cron (daily at 9 AM)
0 9 * * * cd /path/to/news-podcast && python pipeline.py --output daily_reports --days 1

# Generate weekly summary
python pipeline.py --output weekly_reports --days 7
# Schedule with cron (every Monday at 8 AM)
0 8 * * 1 cd /path/to/news-podcast && python pipeline.py --output weekly_reports --days 7

# Monthly analysis
python pipeline.py --output monthly_reports --start-date 2025-08-01 --end-date 2025-08-31
# Event-specific coverage
python pipeline.py --output conference_coverage --start-date 2025-08-15 --end-date 2025-08-20
# Recent developments (last 3 days)
python pipeline.py --output recent_news --days 3

# Parse RSS feeds only
python rss_parser.py
# Filter articles for specific date range
python query_articles.py 2025-08-18 2025-08-24 --output filtered_articles.txt
# Generate podcast from filtered articles
python podcast_generator.py --input filtered_articles.txt --output podcast_script.txt
# Create LinkedIn post from podcast
python linkedin_extractor.py --podcast podcast_script.txt --articles filtered_articles.txt --output linkedin_post.txt

- Python 3.7+
- `feedparser` - RSS feed parsing
- `requests` - HTTP requests
- `beautifulsoup4` - HTML parsing
- `datetime` - Date handling
- `pathlib` - File path management
# Clone repository
git clone <repository-url>
cd news-podcast
# Install dependencies
# Using uv (Recommended):
uv sync
source .venv/bin/activate # macOS/Linux
# Or using pip:
pip install feedparser requests beautifulsoup4
# Set up RSS sources
# Edit sources.txt with your preferred RSS feeds

news-podcast/
├── pipeline.py              # Main pipeline orchestrator
├── rss_parser.py            # RSS feed parser
├── query_articles.py        # Date range filter
├── podcast_generator.py     # Podcast script generator
├── linkedin_extractor.py    # LinkedIn post creator
├── sources.txt              # RSS feed URLs
├── requirements.txt         # Python dependencies
├── README.md                # This file
├── PIPELINE_README.md       # Detailed pipeline docs
├── PODCAST_README.md        # Podcast generator docs
├── LINKEDIN_README.md       # LinkedIn extractor docs
├── CLEANUP_SUMMARY.md       # Cleanup documentation
└── example_output/          # Example pipeline run
    └── run_20250827_121133/
        ├── raw/
        ├── processed/
        ├── final/
        ├── pipeline_log.txt
        └── pipeline_summary.txt
- Complete Automation: End-to-end workflow from RSS to social media
- Professional Quality: Production-ready content generation
- Organized Output: Structured file organization with clear naming
- Comprehensive Logging: Detailed execution tracking and error reporting
- Flexible Configuration: Customizable date ranges and output locations
- Easy Integration: Works with existing scripts and workflows
- Scalable: Handles multiple runs with unique identifiers
- Reproducible: Consistent results with detailed logging
# Daily reports at 9 AM
0 9 * * * cd /path/to/news-podcast && python pipeline.py --output daily_reports --days 1
# Weekly summaries every Monday at 8 AM
0 8 * * 1 cd /path/to/news-podcast && python pipeline.py --output weekly_reports --days 7
# Monthly analysis on the 1st at 7 AM
0 7 1 * * cd /path/to/news-podcast && python pipeline.py --output monthly_reports --start-date $(date -d '1 month ago' +%Y-%m-01) --end-date $(date +%Y-%m-%d)

# Process multiple date ranges
for days in 1 3 7 14 30; do
python pipeline.py --output batch_reports --days $days
done

- Schedule runs: Use cron jobs for automated daily/weekly reports
- Monitor logs: Check pipeline_log.txt for any issues
- Archive results: Keep historical runs for comparison
- Customize output: Use meaningful output directory names
- Consistent timing: Run at regular intervals for audience expectations
- Quality review: Always review generated content before posting
- Engagement tracking: Monitor LinkedIn post performance
- Iterative improvement: Use feedback to refine the pipeline
- Update sources: Regularly review and update RSS feed sources
- Monitor dependencies: Keep Python packages updated
- Backup data: Archive important pipeline runs
- Performance monitoring: Track execution times and resource usage
1. Install Dependencies:

   Using uv (recommended):

   uv sync
   source .venv/bin/activate  # macOS/Linux

   Using pip:

   pip install feedparser requests beautifulsoup4

2. Configure RSS Sources:
   - Edit `sources.txt` with your preferred RSS feed URLs
   - Ensure URLs are accessible and valid

3. Test the Pipeline:

   python pipeline.py --output test_run --days 1

4. Review Output:
   - Check the `test_run/` directory for generated content
   - Review `pipeline_log.txt` for execution details
   - Examine `pipeline_summary.txt` for a run overview

5. Customize Settings:
   - Adjust date ranges as needed
   - Modify output directory names
   - Configure podcast duration preferences

6. Set Up Automation:
   - Create cron jobs for regular execution
   - Monitor logs for any issues
   - Archive important runs
# Run pipeline for last 3 days
python pipeline.py --output first_run --days 3
# Check results
ls first_run/
cat first_run/*/pipeline_summary.txt
cat first_run/*/final/linkedin_post.txt

The example_output/ directory contains a complete pipeline run demonstrating:
- 98 articles processed from 1 day
- Professional podcast script (13.3 minutes)
- LinkedIn posts with clickable titles (standard & compact)
- Complete logging and execution summary
This serves as a reference for expected output format and quality.
This project is licensed under the MIT License - see the LICENSE file for details.
See CHANGELOG.md for a detailed history of changes and features.
Contributions are welcome! Please feel free to submit a Pull Request.
The Biotech News Pipeline transforms raw RSS feeds into professional, engagement-ready content with complete automation and organization!