Sorts a messy photo/video collection into tidy YYYY/ directories (or
YYYY-MM/ with --by-month), with fast parallel hashing and
two-stage duplicate detection. A persistent SQLite cache makes repeat
runs much faster.
```
before/                       after/
IMG_20231014_120000.jpg  →   2023/IMG_20231014_120000.jpg
DSC_0042.JPG             →   2021/DSC_0042.JPG
copy of DSC_0042.JPG     →   (duplicate — skipped, logged)
random_name.jpg          →   2019/random_name.jpg   (from mtime)
```
- Date resolution — tries EXIF `DateTimeOriginal` first, then common filename patterns (`IMG_20231014_…`, `2023-10-14_…`, `20231014_…`, etc.), and falls back to file modification time
- Exact dedup — SHA-256; bit-for-bit identical files are collapsed to one
- Perceptual dedup — pHash via imagehash; catches re-saves, slight crops, and social-media re-uploads
- Parallel hashing — SHA-256 in threads (I/O-bound), pHash in processes (CPU-bound)
- EXIF date writing — JPEGs without EXIF dates get `DateTimeOriginal` written losslessly from the resolved date (filename/mtime), so sorted files always carry their date in metadata
- Safe moves — same-device moves use an atomic `os.rename`; cross-device moves copy, verify SHA-256, then delete the source
- Dry-run by default — nothing changes until you pass `--copy` or `--move`
```
pip install Pillow      # EXIF reading + pHash image decoding
pip install imagehash   # perceptual hashing (skip with --exact-only)
pip install tqdm        # optional — nicer progress bars
```

Python 3.10+ required.
```
# 1. Preview what would happen (nothing is changed)
python organize_photos.py --src /path/to/messy --dst /path/to/sorted

# 2. Copy files (originals stay in source — recommended for a first run)
python organize_photos.py --src /path/to/messy --dst /path/to/sorted --copy

# 3. Move files once you are happy with the result
python organize_photos.py --src /path/to/messy --dst /path/to/sorted --move

# Use YYYY-MM/ directories instead of plain YYYY/
python organize_photos.py --src /path/to/messy --dst /path/to/sorted --copy --by-month
```

Default (YYYY/):
```
/path/to/sorted/
    2023/
        IMG_20231014_120000.jpg
        IMG_20231014_130000.jpg
    2024/
        DSC_0099.JPG
    duplicates.log   ← written after --copy / --move when duplicates exist
    copied.log       ← log of all operations (or moved.log with --move)
```
With `--by-month` (YYYY-MM/):

```
/path/to/sorted/
    2023-10/
        IMG_20231014_120000.jpg
        IMG_20231014_130000.jpg
    2024-06/
        DSC_0099.JPG
    duplicates.log
    moved.log
```
| Option | Default | Description |
|---|---|---|
| `--src DIR` | (required) | Source directory (scanned recursively) |
| `--dst DIR` | (required) | Destination root directory |
| `--copy` | — | Copy files; originals are untouched |
| `--move` | — | Move files |
| `--by-month` | off | Organise into YYYY-MM/ instead of YYYY/ |
| `--only PERIOD ...` | all | Process only specific periods: YYYY, YYYY-MM, YYYY-YYYY range, YYYY-MM:YYYY-MM range. E.g. `--only 2005-2015 2024-06` |
| `--exact-only` | off | SHA-256 dedup only; skip perceptual hashing |
| `--phash-threshold N` | 8 | Hamming distance for near-duplicate detection (0 = identical hashes only, 64 = maximum) |
| `--no-exif-write` | off | Don't write resolved date into EXIF of destination JPEGs (see below) |
| `-v, --verbose` | off | Show per-file output during copy/move (default: progress bar only) |
| `--sha-workers N` | 4 | Threads for SHA-256 hashing |
| `--phash-workers N` | cpu count | Processes for pHash computation |
| `--cache PATH` | `~/.cache/organize_photos.db` | SQLite cache location |
| `--clear-cache` | off | Wipe the cache before running |
| `--extensions EXT …` | all photo/video | Restrict to specific extensions, e.g. `.jpg .heic` |
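The thread/process split behind `--sha-workers` and `--phash-workers` can be sketched like this (a minimal illustration; function names are assumptions). SHA-256 hashing is I/O-bound and `hashlib` releases the GIL while digesting large buffers, so threads are enough there, while pHash is pure CPU work and needs processes.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def sha256_file(path: str) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def hash_all(paths, sha_workers=4):
    """Hash many files concurrently; returns {path: hex digest}."""
    with ThreadPoolExecutor(max_workers=sha_workers) as pool:
        return dict(zip(paths, pool.map(sha256_file, paths)))
```

The pHash side would use `ProcessPoolExecutor` with the same `map` pattern.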
Stage 1 — exact (always on)
Files with identical SHA-256 digests are duplicates. Since the bytes are
identical, every copy is equally valid as the keeper — but their paths
may resolve to different dates. The copy whose path yields the best date
source wins: EXIF-dated beats filename-dated beats mtime-dated. (For
example, two byte-identical JPEGs at `IMG_20231014_120000.jpg` and
`copy.jpg` will pick the first, since its filename gives a confident
date.) The other paths are recorded as duplicates.
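The grouping and keeper selection above can be sketched as follows (an illustrative version; names and the rank table are assumptions, not the script's own code):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Lower rank = more trustworthy date source.
DATE_SOURCE_RANK = {"exif": 0, "filename": 1, "mtime": 2}

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def exact_dedup(files):
    """files: list of (path, date_source). Returns (keepers, duplicates)."""
    groups = defaultdict(list)
    for path, source in files:
        groups[sha256_of(path)].append((path, source))
    keepers, dupes = [], []
    for copies in groups.values():
        # Best date source wins; the rest are duplicates.
        copies.sort(key=lambda c: DATE_SOURCE_RANK[c[1]])
        keepers.append(copies[0][0])
        dupes.extend(p for p, _ in copies[1:])
    return keepers, dupes
```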
Stage 2 — perceptual (skipped with `--exact-only`)
Images are perceptually hashed (pHash). Pairs within `--phash-threshold`
Hamming distance are near-duplicates — visually similar but not
byte-identical (re-saves, slight crops, social-media re-uploads). A
BK-tree makes this O(n log n). The keeper is picked the same way, with
one extra tie-breaker: when two near-duplicates share the same date
source, the larger file wins, so thumbnails and previews lose to the
full-resolution original.
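A BK-tree over Hamming distance can be sketched in a few lines. This is a toy version for illustration; the script's actual data structure may differ.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two 64-bit perceptual hashes."""
    return bin(a ^ b).count("1")

class BKTree:
    def __init__(self):
        self.root = None  # node = (value, {distance: child_node})

    def add(self, value: int) -> None:
        if self.root is None:
            self.root = (value, {})
            return
        node = self.root
        while True:
            d = hamming(value, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (value, {})
                return

    def query(self, value: int, threshold: int) -> list[int]:
        """All stored hashes within `threshold` of `value`."""
        results, stack = [], [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = hamming(value, node[0])
            if d <= threshold:
                results.append(node[0])
            # Triangle inequality: only subtrees whose edge distance lies
            # in [d - threshold, d + threshold] can contain matches.
            for dist, child in node[1].items():
                if d - threshold <= dist <= d + threshold:
                    stack.append(child)
        return results
```

The pruning step is what avoids comparing every pair, giving roughly O(n log n) total work for n images.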
With `--move`, duplicate files are left behind in the source directory and
listed in `duplicates.log` in the destination. Exact duplicates (SHA-256
match) are safe to delete. Near-duplicates should be reviewed first — they
may be meaningfully different photos.
When a JPEG file's date was resolved from a filename pattern or file
modification time (not from EXIF), the tool writes that date into the
destination file's EXIF `DateTimeOriginal`, `DateTimeDigitized`, and
`DateTime` tags. This is done losslessly — the EXIF metadata segment
is replaced in the raw JPEG binary without re-encoding the image data.
This means sorted files always carry their date in EXIF, making them work correctly with any photo viewer or library that reads EXIF dates.
Pass `--no-exif-write` to disable this and copy files byte-for-byte.
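One way to inject a date into a JPEG's EXIF without re-encoding pixel data is the piexif library, sketched below. This is an assumption for illustration: piexif is not in the install list above, and the script's actual mechanism may differ.

```python
from datetime import datetime

def exif_timestamp(when: datetime) -> bytes:
    # EXIF stores timestamps as "YYYY:MM:DD HH:MM:SS"
    return when.strftime("%Y:%m:%d %H:%M:%S").encode("ascii")

def write_exif_date(jpeg_path: str, when: datetime) -> None:
    import piexif  # pip install piexif (illustrative dependency)
    stamp = exif_timestamp(when)
    exif = piexif.load(jpeg_path)
    exif["Exif"][piexif.ExifIFD.DateTimeOriginal] = stamp
    exif["Exif"][piexif.ExifIFD.DateTimeDigitized] = stamp
    exif["0th"][piexif.ImageIFD.DateTime] = stamp
    # piexif.insert replaces only the APP1 (EXIF) segment; the compressed
    # image data is copied verbatim, so there is no generation loss.
    piexif.insert(piexif.dump(exif), jpeg_path)
```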
Photos: `.jpg .jpeg .png .heic .heif .tiff .tif .bmp .webp .raw .cr2 .cr3 .nef .arw .dng .orf .rw2`
Video: `.mp4 .mov .avi .mkv .m4v .3gp .mts .m2ts`
Use --extensions to restrict or extend this list.
- Overlap check — aborts if `--src` and `--dst` overlap in either direction; re-scanning already-moved files is always a mistake
- No silent overwrites — destination name conflicts get a numeric suffix (`photo_1.jpg`, `photo_2.jpg`, …); existing files are never overwritten
- Verified cross-device moves — when source and destination are on different filesystems, the tool copies first, re-hashes the copy, and only deletes the source if the hash matches
- Interrupted copy cleanup — if a copy is interrupted (disk full, Ctrl+C), any partial destination file is removed so future runs start clean
- Atomic EXIF writes — date injection writes to a temp file in the same directory, then does an atomic `os.replace`; the destination file is never partially overwritten (important in `--move` mode, where the source is already deleted)
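The verified-move and cleanup behaviour can be sketched as follows (helper names are illustrative, not the script's own):

```python
import hashlib
import os
import shutil

def _sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def safe_move(src: str, dst: str) -> None:
    try:
        os.rename(src, dst)        # atomic on the same filesystem
    except OSError:                # EXDEV: cross-device, so copy + verify
        try:
            shutil.copy2(src, dst)
            if _sha256(src) != _sha256(dst):
                raise IOError(f"verification failed for {dst}")
        except BaseException:      # includes KeyboardInterrupt (Ctrl+C)
            if os.path.exists(dst):
                os.remove(dst)     # clean up the partial destination file
            raise
        os.remove(src)             # delete source only after verification
```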
The SQLite cache stores resolved dates, SHA-256 hashes, perceptual
hashes, and near-duplicate query results. On repeat runs with an
unchanged collection, all expensive steps — EXIF reading, hashing,
and duplicate detection — are served from the cache. The remaining cost
is file discovery (`rglob`) and `stat` calls, which takes seconds to
a few minutes depending on the filesystem and collection size.
- First run (50 000 photos): about an hour (dominated by hashing + dedup), plus copying/moving
- Repeat run (nothing changed): a couple of minutes (file discovery + stat), plus copying/moving
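A plausible shape for such a cache, keyed on path plus size and mtime so that edited files invalidate their stale rows, might look like this. The schema and function names are assumptions, not the script's actual layout.

```python
import sqlite3

def open_cache(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS files (
            path   TEXT PRIMARY KEY,
            size   INTEGER NOT NULL,
            mtime  REAL NOT NULL,
            sha256 TEXT,
            phash  INTEGER,
            date   TEXT
        )
    """)
    return conn

def cached_sha256(conn, path, size, mtime):
    """Return the cached digest, or None if the file changed or is new."""
    row = conn.execute(
        "SELECT sha256 FROM files WHERE path=? AND size=? AND mtime=?",
        (path, size, mtime)).fetchone()
    return row[0] if row else None
```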
To move the cache (e.g., to a fast local SSD when the source is a NAS):
```
python organize_photos.py --src /mnt/nas/photos --dst /sorted \
    --cache /tmp/photos_cache.db --copy
```

MIT — see LICENSE.