Twitter Archive Cleanup Script

This script cleans up a Twitter/X archive export by extracting only the public content and converting it to plain JSON.

What It Does

Keeps (Converted to JSON):

  • tweets.json - Your 14,863 public tweets
  • likes.json - 25,553 tweets you liked
  • followers.json - 1,690 followers
  • following.json - 1,176 accounts you follow
  • profile.json - Profile information
  • community-tweets.json - Tweets posted in Communities
  • note-tweets.json - Note tweets (long-form posts)
  • Media folders - 780MB of images/videos from your tweets

Removes (Saves ~9.89 GB):

  • All direct messages (9.6GB of media + 220MB of text)
  • Ad tracking data (ad-engagements, ad-impressions)
  • Grok AI chat history
  • Deleted tweets
  • IP audit logs
  • Device tokens and personalization data
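
One way to picture the keep/remove split is as an allowlist check on file names. The following is only a sketch under assumptions: `KEEP_STEMS` and `should_keep` are hypothetical names, the stems are copied from the output file names in the "Keeps" list, and the archive's own file names may differ, so the actual script may match files in another way.

```python
from pathlib import PurePath

# Assumed allowlist, copied from the output names in the "Keeps" list.
# The file names inside a real archive may differ; adjust as needed.
KEEP_STEMS = {
    "tweets",
    "likes",
    "followers",
    "following",
    "profile",
    "community-tweets",
    "note-tweets",
}


def should_keep(filename: str) -> bool:
    """Return True if the file's base name (without extension) is allowlisted."""
    return PurePath(filename).stem in KEEP_STEMS
```

With an allowlist, anything not explicitly named (direct messages, ad data, and so on) is dropped by default, which is the safer failure mode when the goal is to keep only public content.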

Usage

Basic Usage (Current Directory)

python3 cleanup_twitter_archive.py

Specify Archive Path

python3 cleanup_twitter_archive.py /path/to/twitter-archive

Specify Output Directory

python3 cleanup_twitter_archive.py /path/to/archive output_folder_name

Output Structure

twitter_archive_clean/
├── README.txt              # Human-readable summary
├── cleanup_report.json     # Detailed JSON report
├── data/                   # Clean JSON files
│   ├── tweets.json
│   ├── likes.json
│   ├── followers.json
│   ├── following.json
│   ├── profile.json
│   ├── community-tweets.json
│   ├── note-tweets.json
│   ├── tweet-headers.json
│   └── account.json
└── media/                  # Media files
    ├── tweets_media/
    ├── profile_media/
    └── community_tweet_media/

Results from This Archive

  • Original size: ~11GB (with DMs and tracking data)
  • Cleaned size: 819MB
  • Space saved: 9.89GB
  • Tweets extracted: 14,863
  • Likes extracted: 25,553
  • Time period: not recorded here; see the created_at values in tweets.json for the date range

JSON Format

The script converts from Twitter's JavaScript format:

window.YTD.tweets.part0 = [{...}]

To clean JSON arrays:

[
  {
    "tweet": {
      "full_text": "...",
      "created_at": "...",
      ...
    }
  }
]
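
The conversion above amounts to dropping the JavaScript assignment and parsing what remains as JSON. Here is a minimal sketch of that idea, assuming each source file contains a single `window.YTD.<name>.partN = ` assignment followed by a valid JSON value; `parse_ytd_js` is a hypothetical helper name and not necessarily how the script implements it.

```python
import json


def parse_ytd_js(source: str):
    """Parse the text of a `window.YTD.* = [...]` file into Python objects."""
    # Everything before the first "=" is the JavaScript assignment target.
    _, sep, payload = source.partition("=")
    if not sep:
        raise ValueError("no assignment found; is this a window.YTD.* file?")
    # Tolerate surrounding whitespace and an optional trailing semicolon.
    return json.loads(payload.strip().rstrip(";"))


records = parse_ytd_js('window.YTD.tweets.part0 = [{"tweet": {"full_text": "hi"}}]')
print(json.dumps(records, indent=2))
```

Splitting on the first "=" only is deliberate: tweet text further along in the payload may itself contain "=" characters.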

Working with the Data

Python Example

import json

# Load tweets
with open('twitter_archive_clean/data/tweets.json', encoding='utf-8') as f:
    tweets = json.load(f)

# Iterate through tweets
for item in tweets:
    tweet = item['tweet']
    print(f"{tweet['created_at']}: {tweet['full_text']}")

Count Tweets by Year

from collections import Counter
from datetime import datetime

with open('twitter_archive_clean/data/tweets.json', encoding='utf-8') as f:
    tweets = json.load(f)

years = Counter()
for item in tweets:
    date_str = item['tweet']['created_at']
    # Parse: "Fri Jun 20 18:43:40 +0000 2025"
    date = datetime.strptime(date_str, "%a %b %d %H:%M:%S %z %Y")
    years[date.year] += 1

for year, count in sorted(years.items()):
    print(f"{year}: {count} tweets")

Extract All Tweet Text

with open('twitter_archive_clean/data/tweets.json', encoding='utf-8') as f:
    tweets = json.load(f)

for item in tweets:
    print(item['tweet']['full_text'])

Reusability

This script should work with any Twitter/X archive export in the standard format. To use it:

  1. Download your archive from Twitter/X
  2. Extract it
  3. Run this script on the extracted folder

Notes

  • The script does not modify your original archive
  • All original files remain intact
  • Safe to run multiple times (overwrites previous clean archive)
  • Validates JSON structure to ensure data integrity
  • Works with the standard Twitter archive format as of 2025
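
If you want to run a similar structure check on the cleaned files yourself, something like the following could work. It is a sketch under assumptions: `find_malformed` and `check_json_text` are hypothetical helpers, and the check shown (a top-level array whose entries each wrap an object under one key, as in the tweets.json example above) may be stricter or looser than what the script actually validates.

```python
import json


def find_malformed(records, wrapper_key):
    """Return indexes of entries that are not {wrapper_key: {...}} objects."""
    bad = []
    for index, item in enumerate(records):
        if not (isinstance(item, dict) and isinstance(item.get(wrapper_key), dict)):
            bad.append(index)
    return bad


def check_json_text(text, wrapper_key):
    """Parse JSON text and report entries that lack the expected wrapper."""
    records = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    return find_malformed(records, wrapper_key)
```

For tweets.json the wrapper key would be "tweet"; other files may use a different key or a different shape altogether.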