A comprehensive tool that extracts emails from multiple Gmail accounts based on domain, sender, and recipient filters, converts them to PDFs organized by email threads, and uploads them to Google Drive with attachments properly organized.
- Multi-Account Support: Extract emails from multiple Gmail accounts simultaneously
- Advanced Filtering: Filter emails by domains, specific senders, and recipients
- Thread Organization: Groups related emails into threads and exports as single PDFs
- Attachment Handling: Extracts and organizes email attachments with date-prepended filenames
- Google Drive Integration: Automatically uploads PDFs and attachments to organized folders
- Smart Naming: Generates filenames with ISO dates, domain names, sender/recipient initials, and subject keywords
- Comprehensive CLI: Easy-to-use command-line interface with setup guidance
- Clone the repository:
git clone https://github.com/fef5002/gmail-extractor.git
cd gmail-extractor- Install dependencies:
pip install -r requirements.txt- Install the package:
pip install -e .- Initialize configuration:
gmail-extractor init-config-
Setup Google APIs (see detailed setup guide below):
- Enable Gmail API and Google Drive API in Google Cloud Console
- Download OAuth2 credentials for each account
- Place credential files in the project directory
-
Edit configuration: Edit
config.jsonwith your account details and filtering preferences -
Validate setup:
gmail-extractor validate- Run extraction:
gmail-extractor runThe tool uses a JSON configuration file (config.json) with the following structure:
{
"accounts": [
{
"email": "account1@gmail.com",
"credentials_file": "credentials_account1.json",
"token_file": "token_account1.json"
},
{
"email": "account2@gmail.com",
"credentials_file": "credentials_account2.json",
"token_file": "token_account2.json"
}
],
"drive": {
"credentials_file": "drive_credentials.json",
"token_file": "drive_token.json",
"root_folder_name": "Gmail Exports"
},
"filters": {
"domains": ["example.com", "company.com"],
"senders": ["important@example.com"],
"recipients": ["me@example.com"],
"include_attachments": true
},
"max_filename_length": 100,
"date_format": "ISO"
}- accounts: List of Gmail accounts to extract from
- drive: Google Drive configuration for upload destination
- filters: Email filtering criteria
- max_filename_length: Maximum length for generated filenames
- date_format: Date format for filenames (ISO recommended)
- Go to Google Cloud Console
- Create a new project or select an existing one
- Enable the Gmail API and Google Drive API
- Go to the Credentials section
- Create OAuth2 Client ID for Desktop Application
- Download the JSON credential file
- Place Gmail API credentials for each account in your project directory
- Place Google Drive API credentials in your project directory
- Update the configuration file with the correct file paths
# Initialize configuration
gmail-extractor init-config
# Validate setup and test authentication
gmail-extractor validate
# Check if all credential files exist
gmail-extractor check-credentials
# Preview what would be extracted (without processing)
gmail-extractor preview --dry-run
# Run the full extraction process
gmail-extractor run
# Show detailed setup guide
gmail-extractor setup-guideThe tool organizes exported files in Google Drive as follows:
Gmail Exports/
├── domain1_com/
│ ├── 2024-01-15_domain1-com_JS_MR_meeting-agenda.pdf
│ ├── 2024-01-16_domain1-com_AB_CD_project-update.pdf
│ └── Attachments/
│ ├── 2024-01-15-meeting-agenda.pdf
│ └── 2024-01-16-project-proposal.docx
└── domain2_com/
├── 2024-01-17_domain2-com_XY_PQ_invoice-payment.pdf
└── Attachments/
└── 2024-01-17-invoice.pdf
PDFs: YYYY-MM-DD_domain-name_SenderInitials_RecipientInitials_subject-keywords.pdf
Attachments: YYYY-MM-DD-original-filename.ext
- Date is in ISO format (YYYY-MM-DD)
- Domain names have dots replaced with hyphens
- Initials are extracted from sender/recipient names (max 3 characters)
- Subject keywords are the first 3 meaningful words from the subject line
- Hyphens are used as separators throughout
- Supports multiple Gmail accounts simultaneously
- Independent authentication for each account
- Consolidated results organized by domain
- Domain filtering: Extract emails from/to specific domains
- Sender filtering: Target emails from specific senders
- Recipient filtering: Target emails to specific recipients
- Combination filtering: All filters work together with AND logic
- Groups related emails into conversation threads
- Generates one PDF per thread containing all emails
- Maintains chronological order within threads
- Includes email metadata (from, to, date, subject)
- Converts HTML content to readable text
- Extracts all email attachments
- Prepends ISO date to attachment filenames
- Maintains original file extensions
- Organizes attachments in separate folders per domain
- Handles filename conflicts automatically
- Creates organized folder structure automatically
- Separates PDFs and attachments
- Handles authentication and token refresh
- Provides upload progress and error reporting
The tool includes comprehensive error handling:
- Authentication failures are clearly reported
- Network issues are handled gracefully
- File conflicts are resolved automatically
- Detailed error messages help with troubleshooting
- Python 3.8+
- Google account with Gmail and Drive access
- Google Cloud Project with enabled APIs
- Valid OAuth2 credentials
See requirements.txt for the complete list of dependencies:
- google-api-python-client
- google-auth-httplib2
- google-auth-oauthlib
- reportlab
- python-dateutil
- click
- beautifulsoup4
- lxml
- Authentication Errors: Ensure credential files are valid and in the correct location
- API Quota Exceeded: Check Google Cloud Console for API usage limits
- Permission Denied: Verify OAuth2 scopes include necessary permissions
- File Upload Errors: Check Google Drive storage space and permissions
- Run
gmail-extractor setup-guidefor detailed setup instructions - Use
gmail-extractor validateto test your configuration - Check the console output for detailed error messages
- Ensure all credential files are properly configured
This project is licensed under the MIT License.
Contributions are welcome! Please feel free to submit a Pull Request.