jpulakka/organize_photos

organize_photos

Sorts a messy photo/video collection into tidy YYYY/ directories (or YYYY-MM/ with --by-month), with fast parallel hashing and two-stage duplicate detection. A persistent SQLite cache makes repeat runs much faster.

before/                          after/
  IMG_20231014_120000.jpg    →   2023/IMG_20231014_120000.jpg
  DSC_0042.JPG               →   2021/DSC_0042.JPG
  copy of DSC_0042.JPG       →   (duplicate — skipped, logged)
  random_name.jpg            →   2019/random_name.jpg  (from mtime)

Features

  • Date resolution — tries EXIF DateTimeOriginal first, then common filename patterns (IMG_20231014_…, 2023-10-14_…, 20231014_…, etc.), falls back to file modification time
  • Exact dedup — SHA-256; bit-for-bit identical files are collapsed to one
  • Perceptual dedup — pHash via imagehash; catches re-saves, slight crops, and social-media re-uploads
  • Parallel hashing — SHA-256 in threads (I/O-bound), pHash in processes (CPU-bound)
  • EXIF date writing — JPEGs without EXIF dates get DateTimeOriginal written losslessly from the resolved date (filename/mtime), so sorted files always carry their date in metadata
  • Safe moves — same-device moves use an atomic os.rename; cross-device moves copy, verify SHA-256, then delete the source
  • Dry-run by default — nothing changes until you pass --copy or --move
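
The filename and mtime steps of the date resolution described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the two regular expressions, the function names, and the returned source labels are assumptions, and the EXIF step is omitted to keep the example free of third-party imports.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Optional

# Hypothetical patterns; the tool's real pattern list is not shown in this README.
_FILENAME_PATTERNS = [
    re.compile(r"(?<!\d)(\d{4})-(\d{2})-(\d{2})(?!\d)"),  # 2023-10-14_...
    re.compile(r"(?<!\d)(\d{4})(\d{2})(\d{2})(?!\d)"),    # IMG_20231014_..., 20231014_...
]


def date_from_filename(name: str) -> Optional[datetime]:
    """Return the first valid calendar date embedded in a filename, if any."""
    for pattern in _FILENAME_PATTERNS:
        match = pattern.search(name)
        if match is None:
            continue
        year, month, day = (int(group) for group in match.groups())
        try:
            return datetime(year, month, day)
        except ValueError:
            continue  # digits that are not a real date, e.g. month 13
    return None


def resolve_date(path: Path) -> tuple[datetime, str]:
    """Return (date, source); source is "filename" or "mtime" in this sketch."""
    found = date_from_filename(path.name)
    if found is not None:
        return found, "filename"
    # Last resort: the file's modification time.
    return datetime.fromtimestamp(path.stat().st_mtime), "mtime"
```

Returning the source label alongside the date is what later lets duplicate detection prefer a confidently dated copy over an mtime-dated one.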

Requirements

pip install Pillow        # EXIF reading + pHash image decoding
pip install imagehash     # perceptual hashing (skip with --exact-only)
pip install tqdm          # optional — nicer progress bars

Python 3.10+ required.

Quick start

# 1. Preview what would happen (nothing is changed)
python organize_photos.py --src /path/to/messy --dst /path/to/sorted

# 2. Copy files (originals stay in source — recommended for a first run)
python organize_photos.py --src /path/to/messy --dst /path/to/sorted --copy

# 3. Move files once you are happy with the result
python organize_photos.py --src /path/to/messy --dst /path/to/sorted --move

# Use YYYY-MM/ directories instead of plain YYYY/
python organize_photos.py --src /path/to/messy --dst /path/to/sorted --copy --by-month

Output layout

Default (YYYY/):

/path/to/sorted/
  2023/
    IMG_20231014_120000.jpg
    IMG_20231014_130000.jpg
  2024/
    DSC_0099.JPG
  duplicates.log      ← written after --copy / --move when duplicates exist
  copied.log          ← log of all operations (or moved.log with --move)

With --by-month (YYYY-MM/):

/path/to/sorted/
  2023-10/
    IMG_20231014_120000.jpg
    IMG_20231014_130000.jpg
  2024-06/
    DSC_0099.JPG
  duplicates.log
  moved.log

Options

| Option | Default | Description |
| --- | --- | --- |
| --src DIR | (required) | Source directory (scanned recursively) |
| --dst DIR | (required) | Destination root directory |
| --copy | | Copy files; originals are untouched |
| --move | | Move files |
| --by-month | off | Organise into YYYY-MM/ instead of YYYY/ |
| --only PERIOD ... | all | Process only specific periods: YYYY, YYYY-MM, YYYY-YYYY range, YYYY-MM:YYYY-MM range. E.g. --only 2005-2015 2024-06 |
| --exact-only | off | SHA-256 dedup only; skip perceptual hashing |
| --phash-threshold N | 8 | Hamming distance for near-duplicate detection (0 = identical hashes only, 64 = maximum) |
| --no-exif-write | off | Don't write resolved date into EXIF of destination JPEGs (see below) |
| -v, --verbose | off | Show per-file output during copy/move (default: progress bar only) |
| --sha-workers N | 4 | Threads for SHA-256 hashing |
| --phash-workers N | cpu count | Processes for pHash computation |
| --cache PATH | ~/.cache/organize_photos.db | SQLite cache location |
| --clear-cache | off | Wipe the cache before running |
| --extensions EXT … | all photo/video | Restrict to specific extensions, e.g. .jpg .heic |
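
To make the four --only PERIOD forms concrete, here is one way they could be parsed into inclusive month ranges. It is a sketch under assumptions: the names parse_period and in_periods are hypothetical, and the tool's real handling of edge cases (such as a range that ends before it starts) is not documented here.

```python
import re

Month = tuple[int, int]  # (year, month)


def _month(year: str, month: str) -> Month:
    value = int(month)
    if not 1 <= value <= 12:
        raise ValueError(f"invalid month: {year}-{month}")
    return int(year), value


def parse_period(text: str) -> tuple[Month, Month]:
    """Turn one PERIOD token into an inclusive (first_month, last_month) pair."""
    if m := re.fullmatch(r"(\d{4})", text):                        # YYYY
        lo, hi = (int(m[1]), 1), (int(m[1]), 12)
    elif m := re.fullmatch(r"(\d{4})-(\d{2})", text):              # YYYY-MM
        lo = hi = _month(m[1], m[2])
    elif m := re.fullmatch(r"(\d{4})-(\d{4})", text):              # YYYY-YYYY
        lo, hi = (int(m[1]), 1), (int(m[2]), 12)
    elif m := re.fullmatch(r"(\d{4})-(\d{2}):(\d{4})-(\d{2})", text):  # YYYY-MM:YYYY-MM
        lo, hi = _month(m[1], m[2]), _month(m[3], m[4])
    else:
        raise ValueError(f"unrecognised period: {text!r}")
    if lo > hi:
        raise ValueError(f"period ends before it starts: {text!r}")
    return lo, hi


def in_periods(year: int, month: int, periods: list[tuple[Month, Month]]) -> bool:
    """True if (year, month) falls inside any of the parsed ranges."""
    return any(lo <= (year, month) <= hi for lo, hi in periods)
```

Representing every form as a pair of (year, month) tuples means a single tuple comparison covers single years, single months, and both kinds of range.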

Duplicate detection explained

Stage 1: exact (always on). Files with identical SHA-256 digests are duplicates. Since the bytes are identical, every copy is equally valid as the keeper, but their paths may resolve to different dates. The copy whose path yields the best date source wins: EXIF-dated beats filename-dated beats mtime-dated. (For example, two byte-identical JPEGs at IMG_20231014_120000.jpg and copy.jpg will pick the first, since its filename gives a confident date.) The other paths are recorded as duplicates.
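
The keeper rule for exact duplicates can be sketched like this. The function names, the SOURCE_RANK table, and the use of the path string as a final tie-breaker are assumptions made for illustration; only the EXIF > filename > mtime ordering comes from the description above.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Lower rank wins: EXIF-dated beats filename-dated beats mtime-dated.
SOURCE_RANK = {"exif": 0, "filename": 1, "mtime": 2}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large videos do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def pick_keepers(files: dict[Path, str]) -> tuple[list[Path], list[Path]]:
    """files maps each path to its date source; returns (keepers, duplicates)."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in files:
        groups[sha256_of(path)].append(path)

    keepers, duplicates = [], []
    for paths in groups.values():
        # Assumption: ties on date source are broken by path name for determinism.
        ranked = sorted(paths, key=lambda p: (SOURCE_RANK[files[p]], str(p)))
        keepers.append(ranked[0])
        duplicates.extend(ranked[1:])
    return keepers, duplicates
```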

Stage 2: perceptual (skipped with --exact-only). Images are perceptually hashed (pHash). Pairs within --phash-threshold Hamming distance are near-duplicates: visually similar but not byte-identical (re-saves, slight crops, social-media re-uploads). A BK-tree index keeps the search close to O(n log n) for small thresholds, instead of the O(n²) cost of comparing every pair. The keeper is picked the same way as in stage 1, with one extra tie-breaker: when two near-duplicates share the same date source, the larger file wins, so thumbnails and previews lose to the full-resolution original.
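
For readers unfamiliar with BK-trees, the following is a minimal sketch of the idea, assuming each perceptual hash has already been converted to a 64-bit integer. The class and function names are hypothetical and the tool's own implementation may differ.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


class BKTree:
    """Metric tree: each child edge is labelled with its distance to the parent."""

    def __init__(self) -> None:
        self.root = None  # node = [value, {distance: child_node}]

    def add(self, value: int) -> None:
        if self.root is None:
            self.root = [value, {}]
            return
        node = self.root
        while True:
            distance = hamming(value, node[0])
            child = node[1].get(distance)
            if child is None:
                node[1][distance] = [value, {}]
                return
            node = child

    def query(self, value: int, threshold: int) -> list[int]:
        """Return every stored hash within `threshold` bits of `value`."""
        if self.root is None:
            return []
        found, stack = [], [self.root]
        while stack:
            node_value, children = stack.pop()
            distance = hamming(value, node_value)
            if distance <= threshold:
                found.append(node_value)
            # Triangle inequality: only subtrees whose edge label lies in
            # [distance - threshold, distance + threshold] can hold matches.
            for edge, child in children.items():
                if distance - threshold <= edge <= distance + threshold:
                    stack.append(child)
        return found
```

The pruning step is what avoids an all-pairs comparison: with a small threshold, most subtrees are skipped without computing a single distance inside them.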

With --move, duplicate files are left behind in the source directory and listed in duplicates.log in the destination. Exact duplicates (SHA-256 match) are safe to delete. Near-duplicates should be reviewed first — they may be meaningfully different photos.

EXIF date writing

When a JPEG file's date was resolved from a filename pattern or file modification time (not from EXIF), the tool writes that date into the destination file's EXIF DateTimeOriginal, DateTimeDigitized, and DateTime tags. This is done losslessly — the EXIF metadata segment is replaced in the raw JPEG binary without re-encoding the image data.

This means sorted files always carry their date in EXIF, making them work correctly with any photo viewer or library that reads EXIF dates.

Pass --no-exif-write to disable this and copy files byte-for-byte.

Supported formats

Photos: .jpg .jpeg .png .heic .heif .tiff .tif .bmp .webp .raw .cr2 .cr3 .nef .arw .dng .orf .rw2

Video: .mp4 .mov .avi .mkv .m4v .3gp .mts .m2ts

Use --extensions to restrict or extend this list.

Safety guarantees

  • Overlap check — aborts if --src and --dst overlap in either direction; re-scanning already-moved files is always a mistake
  • No silent overwrites — destination name conflicts get a numeric suffix (photo_1.jpg, photo_2.jpg, …); existing files are never overwritten
  • Verified cross-device moves — when source and destination are on different filesystems, the tool copies first, re-hashes the copy, and only deletes the source if the hash matches
  • Interrupted copy cleanup — if a copy is interrupted (disk full, Ctrl+C), any partial destination file is removed so future runs start clean
  • Atomic EXIF writes — date injection writes to a temp file in the same directory, then does an atomic os.replace; the destination file is never partially overwritten (important in --move mode where the source is already deleted)
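
Several of the guarantees above (numeric suffixes, verified cross-device moves, cleanup of partial copies) fit in a short sketch. The function names are hypothetical, the device check via st_dev is an assumption about how "same device" might be detected, and the existence check is not race-free if another process writes to the destination at the same time.

```python
import hashlib
import os
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unique_destination(dst: Path) -> Path:
    """Return dst, or dst with _1, _2, ... added to the stem if it is taken."""
    candidate, n = dst, 0
    while candidate.exists():
        n += 1
        candidate = dst.with_name(f"{dst.stem}_{n}{dst.suffix}")
    return candidate


def copy_verify_delete(src: Path, dst: Path) -> None:
    """Copy, re-hash the copy, and delete the source only if the hashes match."""
    expected = sha256_of(src)
    try:
        shutil.copy2(src, dst)
        if sha256_of(dst) != expected:
            raise OSError(f"hash mismatch after copying {src} to {dst}")
    except BaseException:
        # Covers disk-full errors and Ctrl+C: never leave a partial file behind.
        dst.unlink(missing_ok=True)
        raise
    src.unlink()


def safe_move(src: Path, dst: Path) -> Path:
    """Move src to dst without overwriting anything; return the final path."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst = unique_destination(dst)
    if src.stat().st_dev == dst.parent.stat().st_dev:
        os.rename(src, dst)  # same filesystem: a single rename
    else:
        copy_verify_delete(src, dst)
    return dst
```

Catching BaseException rather than Exception is deliberate here, since KeyboardInterrupt is not an Exception subclass and an interrupted copy should still be cleaned up.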

Performance

The SQLite cache stores resolved dates, SHA-256 hashes, perceptual hashes, and near-duplicate query results. On repeat runs with an unchanged collection, all expensive steps (EXIF reading, hashing, and duplicate detection) are served from cache. The remaining cost is file discovery (rglob) and stat calls, which take seconds to a few minutes depending on the filesystem and collection size.

First run  (50 000 photos): one hour (dominated by hashing + dedup) plus copying/moving
Repeat run (nothing changed): a couple of minutes (file discovery + stat) plus copying/moving
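
The reason a repeat run needs only discovery and stat calls can be illustrated with a small hash cache. This is a sketch under assumptions: the table layout, the function names, and the choice of (path, size, mtime) as the validity check are illustrative and not taken from the tool's actual schema.

```python
import hashlib
import sqlite3
from pathlib import Path

# Hypothetical schema; the real cache also stores dates and perceptual hashes.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_hashes (
    path     TEXT PRIMARY KEY,
    size     INTEGER NOT NULL,
    mtime_ns INTEGER NOT NULL,
    sha256   TEXT NOT NULL
)
"""


def open_cache(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def cached_sha256(conn: sqlite3.Connection, path: Path) -> str:
    """Return the file's SHA-256, re-hashing only if size or mtime changed."""
    st = path.stat()
    row = conn.execute(
        "SELECT sha256 FROM file_hashes WHERE path = ? AND size = ? AND mtime_ns = ?",
        (str(path), st.st_size, st.st_mtime_ns),
    ).fetchone()
    if row is not None:
        return row[0]  # cache hit: one stat call, no file read

    # Cache miss. For brevity the whole file is read at once; chunked reads
    # would be preferable for large videos.
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO file_hashes (path, size, mtime_ns, sha256) "
        "VALUES (?, ?, ?, ?)",
        (str(path), st.st_size, st.st_mtime_ns, digest),
    )
    conn.commit()
    return digest
```

Keying validity on size and modification time trades a small risk (a file edited in place with both values preserved would go unnoticed) for not reading any file content on a cache hit.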

To move the cache (e.g., to a fast local SSD when the source is a NAS):

python organize_photos.py --src /mnt/nas/photos --dst /sorted \
    --cache /tmp/photos_cache.db --copy

License

MIT — see LICENSE.
