
Here's a broad and perhaps a bit naive question on this:

Reddit, Imgur, and any other site that takes significant amounts of image uploads from significant numbers of users... do they attempt to do this? That is, de-dupe images and serve virtual links instead?

At face value it'd seem to promise a crazy amount of physical disk space savings, but maybe the processing overhead is too expensive?



They wouldn't do deduplication like this, since this library matches merely similar images; but they probably do (and should) dedupe by exact file hash.
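A minimal sketch of what exact-hash dedup on upload might look like (the storage layout and function names here are hypothetical; a real system would also need locking, reference counting for deletes, and so on):

    import hashlib
    import os
    import shutil

    STORE_DIR = "blobstore"  # hypothetical content-addressed storage directory

    def dedupe_upload(upload_path: str) -> str:
        """Store an uploaded file once, keyed by its SHA-256 digest.

        Identical bytes always map to the same canonical path, so
        duplicate uploads cost no extra disk space.
        """
        sha256 = hashlib.sha256()
        with open(upload_path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                sha256.update(chunk)
        digest = sha256.hexdigest()

        canonical = os.path.join(STORE_DIR, digest)
        if not os.path.exists(canonical):  # first time we've seen these bytes
            os.makedirs(STORE_DIR, exist_ok=True)
            shutil.copy(upload_path, canonical)
        return canonical  # serve this path (or a URL derived from the digest)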


I once built an image comparison feature into some webpage that had uploads. What I did was scale down all images (for comparison only) to something like 100x100, and I think I made them black and white, but I'm not sure about that last detail. I'd then XOR one thumbnail with another to measure their similarity. I didn't come up with this myself; I put together a few pieces of information from around the web, as with about 100% of the things I build ;).

Not perfect, but it worked pretty well for images that were exactly the same. Of course it isn't as advanced as Imagededup.
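The approach described above might look roughly like this with Pillow (a sketch; the original may have thresholded to pure black and white rather than grayscale, and the 100x100 size is taken from the comment):

    from PIL import Image  # Pillow

    SIZE = (100, 100)  # tiny comparison thumbnails, as described above

    def fingerprint(path: str) -> bytes:
        """Downscale to a small grayscale thumbnail and return its raw bytes."""
        img = Image.open(path).convert("L").resize(SIZE)
        return img.tobytes()

    def difference(path_a: str, path_b: str) -> int:
        """XOR the two thumbnails byte by byte and count differing bits.

        0 means the thumbnails are identical; larger values mean less
        similar images. A small threshold catches re-encodes of the
        same picture.
        """
        a, b = fingerprint(path_a), fingerprint(path_b)
        return sum(bin(x ^ y).count("1") for x, y in zip(a, b))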


Tumblr did something similar, but only for exact matches. You can tell if it's a legacy image or not by looking for a hash in the image URL path.

Legacy style: https://66.media.tumblr.com/tumblr_m61cvzNYF81qg0jdoo1_640.g...

New style: https://66.media.tumblr.com/76451d8fee12cd3c5971e20bb8e236e3...


People do deduplicate files to save space, but it's usually based on an exact byte match using MD5 or SHA-256. Some don't, due to privacy issues: https://news.ycombinator.com/item?id=2438181 (e.g., the MPAA could upload all their torrented movies and see which ones upload instantly, proving that your system stores their copyrighted files).

There's no way to make the UX work out for images that are only similar. It would be pretty wild to upload a picture of myself only to see a picture of my twin used instead.

But I do wonder if it's possible to deduplicate different resolutions of an image that differ only in the upscaling/downscaling algorithm and compression level used (thereby solving the JPEG erosion problem: https://xkcd.com/1683/).
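A perceptual hash is one way to do that: since every copy gets shrunk to the same tiny grayscale grid before hashing, rescaled or re-compressed versions of an image usually collapse to the same (or a nearby) 64-bit value. A hand-rolled average-hash sketch with Pillow; the 8x8 size and distance threshold are the common convention, not anything from this thread:

    from PIL import Image  # Pillow

    def average_hash(path: str) -> int:
        """64-bit perceptual hash: 8x8 grayscale, 1 bit per pixel vs. the mean."""
        img = Image.open(path).convert("L").resize((8, 8))
        pixels = list(img.getdata())
        mean = sum(pixels) / len(pixels)
        bits = 0
        for p in pixels:
            bits = (bits << 1) | (1 if p >= mean else 0)
        return bits

    def hamming(h1: int, h2: int) -> int:
        """Count differing bits; a handful or fewer usually means 'same image'."""
        return bin(h1 ^ h2).count("1")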


The CNN methods in the package are particularly robust against resolution differences. In fact, if a simple up/downscale is all that differentiates two images, even the hashing algorithms could be expected to do a good job.
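For reference, using the package looks roughly like this (a sketch based on my recollection of imagededup's documented interface; check the project's README for the current method and parameter names):

    from imagededup.methods import CNN

    # Encode every image in a directory with the CNN feature extractor,
    # then group near-duplicates by similarity of the embeddings.
    cnn = CNN()
    encodings = cnn.encode_images(image_dir="path/to/images")
    duplicates = cnn.find_duplicates(
        encoding_map=encodings,
        min_similarity_threshold=0.9,  # tune for your tolerance
    )
    print(duplicates)  # {filename: [near-duplicate filenames], ...}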


The CPU cost would far outweigh the storage cost.


Only if you go by similarity. If you dedupe exact matches using hashes (which they almost certainly do), the CPU cost is trivial.



