
Deduplication is the killer feature, especially if it can handle edited files, even partially.
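
For anyone wondering why byte-level checksums aren't enough here: a rough sketch of a perceptual (difference) hash, which still matches after a re-encode or resize. The 64-bit hash size, the Hamming threshold, and the filenames are all made-up illustration values, not anyone's production settings.

    # Difference hash: compare adjacent pixels of a tiny grayscale thumbnail.
    # An edited or re-encoded copy usually differs in only a few bits.
    from PIL import Image

    def dhash(path: str, hash_size: int = 8) -> int:
        img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
        pixels = list(img.getdata())
        bits = 0
        for row in range(hash_size):
            for col in range(hash_size):
                left = pixels[row * (hash_size + 1) + col]
                right = pixels[row * (hash_size + 1) + col + 1]
                bits = (bits << 1) | (1 if left > right else 0)
        return bits

    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    # "original.jpg" / "edited.jpg" are hypothetical filenames.
    if hamming(dhash("original.jpg"), dhash("edited.jpg")) <= 6:
        print("likely duplicates, despite different bytes")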

EXIF tag management is a nightmare: Dublin Core on steroids. Handling dates and times when the capture time is only approximately known kills many systems.

A tool has to make choices about what importing a photo implies for the file path, for the file's atime and mtime, for the multiple EXIF times, and for private tags.
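
As one concrete illustration of that choice (not how any particular product does it): pick a priority order over exiftool's date tags and fall back to filesystem mtime. The tag order, date format, and "photo.jpg" path are assumptions for the sketch.

    # Read candidate timestamps via exiftool's JSON output and pick one.
    import json
    import os
    import subprocess
    from datetime import datetime

    # Most-trusted first: camera capture time, then creation, then modification.
    CANDIDATE_TAGS = ["DateTimeOriginal", "CreateDate", "ModifyDate"]

    def captured_at(path: str) -> datetime:
        out = subprocess.run(
            ["exiftool", "-j", "-d", "%Y-%m-%dT%H:%M:%S", path],
            capture_output=True, text=True, check=True,
        )
        tags = json.loads(out.stdout)[0]
        for tag in CANDIDATE_TAGS:
            value = tags.get(tag)
            if value:
                try:
                    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
                except ValueError:
                    pass  # malformed or zeroed-out dates are common in the wild
        # Last resort: filesystem mtime, which imports and copies often clobber.
        return datetime.fromtimestamp(os.path.getmtime(path))

    print(captured_at("photo.jpg"))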

Google honours a ridiculously small set of tags, and never rereads them. Google uses sidecar files so that file changes don't break hash values.

All decisions have consequences. The PhotoPrism and exiftool forums abound with special cases. A million of them.



Deduplication is a hairy problem, and it was my first priority when I started writing PhotoStructure to get my own mess of photos together.

I'm on the fifth major iteration of image hashing at this point: an L*a*b mean hash, plus a k-means-gathered set of dominant colors, plus dynamic thresholds that take into account differing MIME types, fuzzy captured-at times, and monochromatic images.
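
Roughly the shape of those ingredients, as a sketch rather than PhotoStructure's actual code (assumes Pillow, scikit-image, scikit-learn, and numpy; the hash size, k, and thresholds are placeholders):

    import numpy as np
    from PIL import Image
    from skimage.color import rgb2lab
    from sklearn.cluster import KMeans

    def lab_mean_hash(path: str, size: int = 8) -> np.ndarray:
        # 1 bit per cell: is this cell's L (lightness) above the image mean?
        img = Image.open(path).convert("RGB").resize((size, size))
        lab = rgb2lab(np.asarray(img) / 255.0)
        lightness = lab[:, :, 0]
        return (lightness > lightness.mean()).flatten()

    def dominant_colors(path: str, k: int = 5) -> np.ndarray:
        # k-means cluster centers in Lab space, ordered by cluster size.
        img = Image.open(path).convert("RGB").resize((64, 64))
        pixels = rgb2lab(np.asarray(img) / 255.0).reshape(-1, 3)
        km = KMeans(n_clusters=k, n_init=10).fit(pixels)
        order = np.argsort(-np.bincount(km.labels_))
        return km.cluster_centers_[order]

    def probably_same(a: str, b: str, max_bit_diff: int = 6) -> bool:
        # Combine both signals; a real system would also vary these thresholds
        # by file type, capture-time fuzziness, and monochromatic content.
        bit_diff = int(np.sum(lab_mean_hash(a) != lab_mean_hash(b)))
        color_dist = float(np.linalg.norm(dominant_colors(a) - dominant_colors(b)))
        return bit_diff <= max_bit_diff and color_dist < 30.0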

This explains a bunch of the issues and tradeoffs I made while assembling the heuristics in PhotoStructure: https://photostructure.com/faq/what-do-you-mean-by-deduplica...


Mylio has pretty good dedup in my experience. It's also extra careful, and lets you verify each match individually or just do it all at once: https://community.mylio.com/posts/video-introducing-deduplic...



