Skip to content

KalashThakare/google-mail-automation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

17 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Gmail to Google Sheets Automation

A backend automation script that authenticates with Gmail using OAuth 2.0, fetches unread emails, extracts relevant information, and appends the data to a Google Sheet with idempotent execution and persistent state management.


πŸ“ Project Structure

MailParser/
β”œβ”€β”€ credentials/
β”‚   └── credentials.json          # OAuth client credentials (not in repo)
β”œβ”€β”€ logs/
β”‚   └── mailparser_2026-01-14.log # Application logs (auto-generated)
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ store/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   β”œβ”€β”€ email_store.py        # Email deduplication logic
β”‚   β”‚   └── token_store.py        # OAuth token persistence
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ email_parser.py           # Email content extraction
β”‚   β”œβ”€β”€ gmail_service.py          # Gmail API interactions
β”‚   β”œβ”€β”€ logger.py                 # Logging configuration
β”‚   β”œβ”€β”€ main.py                   # Entry point & orchestration
β”‚   β”œβ”€β”€ oauth.py                  # OAuth 2.0 flow handler
β”‚   β”œβ”€β”€ sheets_service.py         # Google Sheets API wrapper
β”‚   β”œβ”€β”€ test.py                   # Unit tests
β”‚   └── utils.py                  # Utility functions
β”œβ”€β”€ venv1/                        # Virtual environment
β”œβ”€β”€ .env                          # Environment variables (not in repo)
β”œβ”€β”€ .gitignore
β”œβ”€β”€ config.py                     # Configuration loader
β”œβ”€β”€ README.md
β”œβ”€β”€ requirements.txt              # Python dependencies
└── state.db                      # SQLite database (generated at runtime)

Flow:

  1. main.py orchestrates the entire process
  2. oauth.py handles authentication and stores tokens via token_store.py
  3. gmail_service.py fetches unread emails from Gmail API and marks them readed after processing
  4. email_parser.py extracts sender, subject, body, and timestamp
  5. email_store.py checks if email was already processed (by message ID)
  6. sheets_service.py appends new emails to Google Sheet
  7. All state persists in state.db SQLite database
  8. logger.py manages application logging to logs/ directory

Step-by-Step Setup Instructions

Prerequisites

  • Google account
  • Google Cloud Platform project

Step 1: Clone Repository & Install Dependencies

git clone <repository-url>
cd MailParser
python -m venv venv1
source venv1/bin/activate  # On Windows: venv1\Scripts\activate
pip install -r requirements.txt

Step 2: Configure Google Cloud Project

  1. Create a Google Cloud Project:

  2. Enable Required APIs:

    • Navigate to "APIs & Services" > "Library"
    • Enable Gmail API
    • Enable Google Sheets API
  3. Configure OAuth Consent Screen:

    • Go to "APIs & Services" > "OAuth consent screen"
    • User type: External
    • Add your email as a Test User
    • Add scopes:
      • https://www.googleapis.com/auth/gmail.modify
      • https://www.googleapis.com/auth/spreadsheets
  4. Create OAuth Client ID:

    • Go to "APIs & Services" > "Credentials"
    • Click "Create Credentials" > "OAuth client ID"
    • Application type: Desktop app
    • Download the JSON file
  5. Save Credentials:

    • Rename the downloaded file to credentials.json
    • Place it in the credentials/ folder

Step 3: Create Google Sheet

  1. Create a new Google Sheet
  2. Add headers in the first row:
   Sender | Subject | Body | Timestamp
  1. Copy the Spreadsheet ID from the URL:
   https://docs.google.com/spreadsheets/d/{SPREADSHEET_ID}/edit

Step 4: Configure Environment Variables

Create a .env file in the project root:

SPREADSHEET_ID=your_spreadsheet_id_here

Step 5: Run the Script

python -m src.main

On first run:

  • A browser window will open for OAuth consent
  • Grant the requested permissions
  • Tokens will be stored in state.db for future runs

OAuth Flow Used

This project uses OAuth 2.0 Desktop Application Flow (formerly called "Installed Application Flow").

How It Works:

  1. Initial Authentication:

    • User runs the script for the first time
    • Script redirects to Google's OAuth consent page in browser
    • User grants permissions
    • Authorization code is exchanged for access token + refresh token
  2. Token Storage:

    • Both tokens are stored in SQLite database (state.db)
    • Tokens are encrypted using the database's built-in mechanisms
  3. Token Refresh:

    • Access tokens expire after 1 hour
    • When expired, the refresh token automatically requests a new access token
    • No user interaction required after initial setup
  4. Security Considerations:

    • Tokens are stored locally (acceptable for single-user automation)
    • credentials.json and .env are excluded from version control
    • For production use, consider encrypting the database file

Why this approach?

  • Recommended by Google for CLI/backend tools
  • No web server required
  • Seamless token refresh without re-authentication

Duplicate Prevention Logic

Problem:

Running the script multiple times should not create duplicate entries in the Google Sheet.

Solution:

Each Gmail message has a unique, immutable message ID assigned by Google.

Implementation:

  1. Database Table:
   CREATE TABLE processed_emails (
       message_id TEXT PRIMARY KEY,
       processed_at TIMESTAMP
   );
  1. Check Before Processing:
   def is_processed(message_id: str) -> bool:
       # Query database for message_id
       # Return True if exists, False otherwise
  1. Store After Processing:
   def mark_as_processed(message_id: str):
       # Insert message_id with current timestamp
  1. Workflow:
    • Fetch all unread emails
    • For each email, check if message_id exists in database
    • If exists β†’ skip
    • If not β†’ process and store message_id

Benefits:

  • Idempotent execution: Running the script 10 times processes each email only once
  • Reliable: Message IDs never change
  • Efficient: Database lookups are fast even with thousands of entries

State Persistence Method

Technology: SQLite

The project uses SQLite (state.db) for all persistent storage.

Database Schema:

-- OAuth tokens table
CREATE TABLE oauth_tokens (
    id INTEGER PRIMARY KEY,
    token_data TEXT NOT NULL,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Processed emails table
CREATE TABLE processed_emails (
    message_id TEXT PRIMARY KEY,
    processed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Why SQLite?

Feature Benefit
Zero Configuration No server setup required
ACID Compliance Guarantees data integrity
Atomic Operations Prevents corruption during concurrent access
File-Based Single file, easy to backup
Fast Lookups Indexed queries for duplicate checking
Better than JSON No race conditions or partial writes

Alternative Considered:

  • JSON files were considered but rejected due to:
    • Risk of file corruption during concurrent writes
    • No built-in indexing for fast lookups
    • Manual handling of race conditions

Dependencies

See requirements.txt for complete list. Key dependencies:

  • google-auth-oauthlib - OAuth 2.0 authentication
  • google-api-python-client - Gmail & Sheets API clients
  • python-dotenv - Environment variable management

πŸ§ͺ Testing

Run unit tests:

python -m src.test

πŸ“„ License

This project is created as part of a technical assignment.


About

A backend automation script that authenticates with Gmail using OAuth 2.0, fetches unread emails, extracts relevant information, and appends the data to a Google Sheet with idempotent execution and persistent state management.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages