Purpose: Automated bulk transcript extraction tool built to support technical certification study and knowledge base management.
A Python-based automation solution that demonstrates practical systems thinking, API integration, and process optimization skills applicable to IT operations and infrastructure roles.
Developed as part of CCNA certification preparation to automate the conversion of video-based technical content into searchable text documentation. Solves the operational challenge of managing large-scale educational content by implementing a resilient, optionally proxy-enabled scraping workflow.
Real-World Application: Transformed 100+ hours of network engineering video content into a searchable knowledge base, reducing study material lookup time from minutes to seconds.
- Python 3 - Core automation scripting
- API Integration - YouTube Transcript API implementation with error handling
- Optional Proxy Management - Webshare proxy configuration for rate-limit mitigation (when needed)
- Secure Configuration - Environment-based credential management (dotenv)
- Error Handling - Multi-level exception handling with graceful degradation and retry logic
- Virtual Environment Management - Isolated dependency management following Python best practices
- File System Operations - Dynamic directory management and filename sanitization
- Bulk Processing: Iterates through playlists of any size with progress tracking
- Idempotent Operations: Automatically skips existing files to enable resume capability
- Fallback Logic: Attempts multiple transcript sources (manual → auto-generated → generic)
- Configurable Rate Limiting: Implements polite delays to respect API constraints
- Optional Proxy Support: Can run with or without proxy service based on playlist size
- Clean Output: Formats transcripts as plain text with sanitized filenames
- Professional Logging: Creates detailed log files for troubleshooting and audit trails
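The skip-existing, fallback, and polite-delay behaviors listed above can be sketched roughly as follows. This is a minimal illustration, not the code in main.py: the names `fetch_with_fallback` and `process_playlist`, and the idea of passing transcript sources in as plain callables, are assumptions made so the example runs without any YouTube library installed.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, Optional, Sequence, Tuple

# Hypothetical shape: a fetcher takes a video ID and returns transcript text,
# raising an exception when that source has nothing to offer.
Fetcher = Callable[[str], str]


def fetch_with_fallback(
    video_id: str, sources: Sequence[Tuple[str, Fetcher]]
) -> Optional[Tuple[str, str]]:
    """Try each (name, fetcher) pair in order; return the first success or None."""
    for name, fetch in sources:
        try:
            return name, fetch(video_id)
        except Exception:
            continue  # this source failed, fall through to the next one
    return None


def process_playlist(
    videos: Iterable[Tuple[str, str]],
    sources: Sequence[Tuple[str, Fetcher]],
    out_dir: Path,
    delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Save one transcript per (video_id, filename) pair, skipping existing files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = {"saved": 0, "skipped": 0, "failed": 0}
    for video_id, filename in videos:
        target = out_dir / filename
        if target.exists():
            counts["skipped"] += 1  # idempotent: a re-run resumes where it stopped
            continue
        result = fetch_with_fallback(video_id, sources)
        if result is None:
            counts["failed"] += 1
        else:
            _, text = result
            target.write_text(text, encoding="utf-8")
            counts["saved"] += 1
        sleep(delay)  # polite pause after every network attempt
    return counts
```

Because existing files are checked before any request is made, re-running the sketch after an interruption costs no extra network calls for videos that were already saved.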
Built and tested in a home lab environment as part of hands-on skill development for:
- CCNA Certification (currently pursuing)
- Systems administration fundamentals
- Automation and scripting best practices
- Understanding API interactions and network constraints
This project reflects a systematic approach to filling knowledge gaps through self-directed learning and practical tool-building, core competencies for technical support and infrastructure roles.
- Python 3.7+ (tested with Python 3.13)
- Optional: Webshare proxy account (only needed for very large playlists or if you hit rate limits)
git clone https://github.com/idle5/youtube-transcript-automation.git
cd youtube-transcript-automation
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows (PowerShell):
venv\Scripts\Activate.ps1
# On Windows (CMD):
venv\Scripts\activate.bat
# Upgrade pip (optional but recommended)
pip install --upgrade pip
pip install -r requirements.txt
Create a .env file in the project root with your settings:
Best for playlists under 50-100 videos
# Basic configuration - no proxy needed
USE_PROXY=false
PLAYLIST_URL=https://www.youtube.com/playlist?list=YOUR_PLAYLIST_ID
OUTPUT_FOLDER=transcripts
Note: Without a proxy, you may hit YouTube's rate limits on very large playlists (100+ videos). If this happens, you'll see errors and can then enable proxy support.
Use this if you're processing 100+ videos or encountering rate limit errors
# Proxy configuration for high-volume processing
USE_PROXY=true
PROXY_USER=your_webshare_username
PROXY_PASS=your_webshare_password
PLAYLIST_URL=https://www.youtube.com/playlist?list=YOUR_PLAYLIST_ID
OUTPUT_FOLDER=transcripts
Getting Webshare Proxy (if needed):
- Sign up at webshare.io
- Free tier includes 10 proxies (sufficient for most use cases)
- Get credentials from your dashboard
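As a rough sketch of how the .env settings above might be read and validated, the function below parses the five documented variables from a mapping. The `Settings` dataclass, the `load_settings` name, the accepted truthy strings, and the `transcripts` default for `OUTPUT_FOLDER` are assumptions for illustration. In the real project, python-dotenv's `load_dotenv()` would presumably be called first so the values appear in `os.environ`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    playlist_url: str
    output_folder: str
    use_proxy: bool
    proxy_user: Optional[str] = None
    proxy_pass: Optional[str] = None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment-style key/value pairs."""
    # Assumption: "true", "1" and "yes" all enable the proxy (case-insensitive).
    use_proxy = env.get("USE_PROXY", "false").strip().lower() in {"1", "true", "yes"}

    playlist_url = env.get("PLAYLIST_URL", "").strip()
    if not playlist_url:
        raise ValueError("PLAYLIST_URL is required")

    user = env.get("PROXY_USER") or None
    password = env.get("PROXY_PASS") or None
    if use_proxy and not (user and password):
        # Fail early instead of discovering missing credentials mid-run.
        raise ValueError("USE_PROXY=true requires PROXY_USER and PROXY_PASS")

    return Settings(
        playlist_url=playlist_url,
        # Assumption: fall back to "transcripts" when OUTPUT_FOLDER is unset.
        output_folder=env.get("OUTPUT_FOLDER", "transcripts").strip() or "transcripts",
        use_proxy=use_proxy,
        proxy_user=user,
        proxy_pass=password,
    )
```

Validating at startup means a typo in .env surfaces before the first request rather than partway through a long playlist.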
# 1. Ensure virtual environment is activated
source venv/bin/activate # macOS/Linux
# or: venv\Scripts\activate # Windows
# 2. Update PLAYLIST_URL in .env file with your target playlist
# 3. Run the script
python main.py
Transcripts are saved to the transcripts/ directory with format: [video_id] - title.txt
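One way the filename format above could be produced is sketched below. The helper name `transcript_filename`, the set of stripped characters, and the length cap are assumptions, as is reading the square brackets around the video ID as literal. The actual sanitization rules in main.py may differ.

```python
import re


def transcript_filename(video_id: str, title: str, max_title_len: int = 150) -> str:
    """Return '[video_id] - title.txt' with a filesystem-safe title."""
    # Drop characters that are invalid on Windows or awkward on other systems.
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "", title)
    # Collapse runs of whitespace, then trim spaces and dots from both ends.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    # Cap the length (assumed limit) and fall back to a placeholder if empty.
    cleaned = cleaned[:max_title_len].rstrip(" .") or "untitled"
    return f"[{video_id}] - {cleaned}.txt"


print(transcript_filename("abc123", "OSPF: Part 1/2?"))
```

Keeping the video ID in the filename means two videos with identical titles still map to different files.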
You can customize behavior through additional .env variables:
# Custom output directory
OUTPUT_FOLDER=my_transcripts
# Adjust retry attempts (default: 3)
RETRY_ATTEMPTS=5
# Adjust rate limit delay in seconds (default: 0.5)
RATE_LIMIT_DELAY=1.0
Cause: Video doesn't have transcripts enabled, or you're hitting rate limits
Solutions:
- Check if the video actually has captions on YouTube
- If processing 100+ videos without proxy, enable proxy support
- Increase RATE_LIMIT_DELAY in .env to slow down requests
Cause: Too many requests without proxy
Solution: Enable proxy in .env:
USE_PROXY=true
PROXY_USER=your_credentials
PROXY_PASS=your_credentials
macOS/Linux:
source venv/bin/activate
Windows (PowerShell):
venv\Scripts\Activate.ps1
Windows (CMD):
venv\Scripts\activate.bat
Solution: Upgrade pip first:
pip install --upgrade pip
pip install -r requirements.txt
- Processes 100+ videos in ~15 minutes (with rate limiting and proxy)
- 95%+ success rate on publicly available transcripts
- Handles playlist updates incrementally (skips existing files)
- Idempotent operations allow resume after interruption
- CLI argument support for playlist URL and output directory
- Configurable retry logic with exponential backoff (partially implemented)
- Structured logging with rotation
- Multi-threading for improved throughput
- Database integration for metadata tracking
- Support for additional video platforms
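Since retry with exponential backoff is listed above as only partially implemented, here is a minimal sketch of what a finished version might look like. The function name `retry_with_backoff` and its parameters are hypothetical. The defaults mirror the documented `RETRY_ATTEMPTS=3` and `RATE_LIMIT_DELAY=0.5`, but wiring them together this way is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Waits of base_delay, 2x, 4x, ... between consecutive attempts.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")  # loop always returns or raises
```

A production version would likely catch only the specific rate-limit or network exceptions rather than every `Exception`, and might add random jitter to the delay.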
- Extraction: Bulk download transcripts from comprehensive video courses (Jeremy's IT Lab, NetworkChuck, etc.)
- Indexing: Build searchable text corpus of 100+ hours of content
- Analysis: Use LLM tools to generate summaries and extract key concepts
- Validation: Cross-reference generated content with personal flashcard deck
- Result: Reduced time to locate specific concepts (OSPF, STP, subnetting) from 10+ minutes to <30 seconds
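The lookup step in the workflow above can be approximated with a plain case-insensitive scan over the saved .txt files. This is only a sketch under that assumption: the `search_transcripts` helper is hypothetical, and the original workflow may rely on an editor, grep, or LLM tooling instead.

```python
from pathlib import Path
from typing import Iterator, Tuple


def search_transcripts(folder: Path, term: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (filename, line number, line text) for every line containing term."""
    needle = term.casefold()
    for path in sorted(folder.glob("*.txt")):
        # errors="replace" keeps the scan going if a file has stray bytes.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if needle in line.casefold():
                    yield path.name, lineno, line.strip()


if __name__ == "__main__":
    # Example: list every mention of OSPF in the default output folder.
    for name, lineno, text in search_transcripts(Path("transcripts"), "ospf"):
        print(f"{name}:{lineno}: {text}")
```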
This workflow demonstrates a practical problem-solving and process-optimization mindset applicable to IT operations.
This tool is intended for personal educational use. Users are responsible for ensuring compliance with YouTube's Terms of Service and applicable copyright laws. Only download transcripts from content you have the right to access for personal study purposes.
This is a personal learning project built as part of CCNA certification preparation. Feel free to fork or use as reference for your own projects.
Author: Andriy Karp
Contact: LinkedIn | GitHub
Context: Built during career transition from fitness/personal training to IT infrastructure roles