Here's a broad and perhaps a bit naive question on this:
Reddit, Imgur, and any other site that handles image uploads from a significant number of users... do they attempt to do this, i.e. de-dupe images and serve virtual links instead?
At face value it seems like it would save a crazy amount of physical disk space, but maybe the processing overhead is too expensive?
I once built an image comparison feature into a webpage that accepted uploads. What I did was scale all images down (for comparison only) to something like 100x100, and I think I also converted them to black and white, but I'm not sure about that last detail. I'd then XOR one thumbnail with another to compare their level of similarity. I didn't come up with this myself; I pieced it together from a few bits of information around the web... as with about 100% of the things I build ;).
Not perfect, but it worked pretty well for images that were exactly the same. Of course it isn't as advanced as Imagededup.
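Reconstructed from memory, it was roughly this (a minimal Pillow/NumPy sketch; the 100x100 size and any "same image" threshold are guesses, not what I actually shipped):

    import numpy as np
    from PIL import Image

    THUMB_SIZE = (100, 100)  # comparison-only thumbnail size

    def signature(path):
        """Grayscale + downscale; the thumbnail is only used for comparison."""
        img = Image.open(path).convert("L").resize(THUMB_SIZE)
        return np.asarray(img)

    def fraction_matching(path_a, path_b):
        """XOR the two thumbnails; identical pixels XOR to zero."""
        xored = np.bitwise_xor(signature(path_a), signature(path_b))
        return np.count_nonzero(xored == 0) / xored.size

    # e.g. treat > 0.95 matching pixels as "probably the same image"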
People do deduplicate files to save space, but it's usually based on an exact byte match using MD5 or SHA-256. Some services don't, due to privacy issues: https://news.ycombinator.com/item?id=2438181 (e.g., the MPAA could upload all of their torrented movies and see which ones upload instantly, proving that your system already stores their copyrighted files).
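The exact-byte version is simple enough to sketch (Python/hashlib; the storage side, i.e. actually replacing a duplicate with a reference to the original, is left out):

    import hashlib
    from pathlib import Path

    def sha256_of(path, chunk_size=1 << 20):
        """Hash the file in chunks so large uploads don't need to fit in memory."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def find_exact_duplicates(upload_dir):
        """Map each content hash to the first file seen with it; later exact
        copies would just get a 'virtual link' to that original."""
        seen = {}
        duplicates = []
        for path in Path(upload_dir).rglob("*"):
            if not path.is_file():
                continue
            digest = sha256_of(path)
            if digest in seen:
                duplicates.append((path, seen[digest]))
            else:
                seen[digest] = path
        return duplicates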
There's no way to make the UX work out for images that are only similar. It would be pretty wild to upload a picture of myself only to see a picture of my twin served instead.
But I do wonder if it's possible to deduplicate different resolutions of an image that differ only in the upscaling/downscaling algorithm and compression level used (thereby solving the JPEG erosion problem: https://xkcd.com/1683/).
The CNN method in the package is particularly robust against resolution differences. In fact, if a simple up/downscale is the only thing that differentiates two images, even the hashing algorithms can be expected to do a good job.
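For reference, usage looks roughly like this (the 'uploads/' directory is a placeholder; check the imagededup docs for the exact parameters and thresholds):

    from imagededup.methods import PHash, CNN

    # Hashing-based: cheap, and usually enough when the only difference is a rescale.
    phasher = PHash()
    hash_encodings = phasher.encode_images(image_dir='uploads/')
    hash_duplicates = phasher.find_duplicates(encoding_map=hash_encodings)

    # CNN-based: heavier, but more robust to resolution and compression differences.
    cnn = CNN()
    cnn_encodings = cnn.encode_images(image_dir='uploads/')
    cnn_duplicates = cnn.find_duplicates(encoding_map=cnn_encodings)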