Hacker News

This looks really interesting. Can you give us an idea of the performance? E.g. roughly how long would it take to process 1 million 1920×1080 JPEGs, without GPU and with GPU?

What is the scaling like? E.g. what if it was 10 million?



Scale is unfortunately not the focus of the current implementation, though we plan to address it in future releases. On speed and memory, the current state is:

1. Hashing methods: generating hashes is quick (a couple of seconds for about 10k images). The tricky part is retrieving duplicates, which on the same 10k dataset takes a few minutes. (I'd refrain from giving exact numbers since this was measured on a local system, not a good environment for benchmarking.)

2. CNN: here the time-consuming part is encoding generation, which without a GPU takes much longer (a couple of minutes for 10k images). Retrieval is quick, but requires memory.

So at this point, using this package locally on more than a couple of thousand images is not a good idea. We do plan to address the scale aspect of the problem in future releases. Thanks for your question.
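As a toy illustration of why hash generation is the cheap step: an average hash reduces an image to 64 bits with one shrink, one mean, and one threshold. This sketch uses plain NumPy on a grayscale array and is not the package's actual implementation:

```python
import numpy as np

# Toy average-hash (aHash): shrink to an 8x8 grid of block means,
# threshold against the global mean, pack the 64 bits into an int.
def average_hash(img: np.ndarray) -> int:
    h, w = img.shape
    img = img[:h - h % 8, :w - w % 8]     # crop to multiples of 8
    small = img.reshape(8, img.shape[0] // 8,
                        8, img.shape[1] // 8).mean(axis=(1, 3))
    bits = (small > small.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)
```

A single pass over the pixels per image is why hashing 10k images takes seconds; the pairwise duplicate retrieval afterwards is where the minutes go.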


Doesn't sound too bad. But can you elaborate on the last part, about retrieval requiring memory and not scaling to more than a couple of thousand images?

How much memory would you need for ~2000 images, how slow does it get, etc.

Thx


Retrieval using CNNs requires computing a cosine similarity matrix, so for n images, an n x n matrix has to be held in memory. As you can see, the storage requirement blows up quadratically as n grows. We have already made some optimizations to reduce the memory footprint, but there are clear upper limits (we haven't experimented to find them). As for concrete numbers: the cifar10 example in the repo was run in a Google Colab notebook; the dataset has 60k images and consumed about 6 GB of RAM using the CNN method. However, since no principled benchmarking has been done on the package yet, I would treat these numbers as only marginally indicative of its capabilities. A better way to figure out whether it works for you is to try it on your own dataset, starting at a small scale and increasing the dataset size gradually.
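To make the quadratic blow-up concrete, here is a back-of-the-envelope sketch; the sizes and the 512-dimensional encoding length are illustrative, not the package's actual values:

```python
import numpy as np

n, d = 2_000, 512                 # illustrative: n images, d-dim encodings
rng = np.random.default_rng(0)
enc = rng.standard_normal((n, d)).astype(np.float32)

# Cosine similarity = dot product of L2-normalized rows
enc /= np.linalg.norm(enc, axis=1, keepdims=True)
sim = enc @ enc.T                 # n x n matrix held in memory

print(sim.shape)                  # (2000, 2000)
print(sim.nbytes / 1e6)           # 16.0 (MB at float32)
# At n = 60_000 the float32 matrix alone is 60_000**2 * 4 bytes ~ 14.4 GB,
# which is why optimizations are needed to fit in the ~6 GB observed.
```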


Use approximate nearest neighbor. The FAISS library is good.


Thanks for the pointer, will check it out.


Couldn't you add images above the threshold to a dict/map as you iterate, rather than building the complete matrix and then iterating through it?
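A chunked version of that idea, which never materializes the full n x n matrix. The function and variable names here are hypothetical, not from the package:

```python
import numpy as np

def find_duplicates(enc: np.ndarray, threshold: float = 0.9, chunk: int = 256):
    """Keep only above-threshold pairs in a dict instead of an n x n matrix."""
    enc = enc / np.linalg.norm(enc, axis=1, keepdims=True)
    matches = {}
    for start in range(0, len(enc), chunk):
        block = enc[start:start + chunk] @ enc.T   # chunk x n: bounded memory
        for i, row in enumerate(block):
            idx = start + i
            hits = np.flatnonzero(row >= threshold)
            hits = hits[hits != idx]               # drop the self-match
            if hits.size:
                matches[idx] = hits.tolist()
    return matches
```

This caps peak memory at chunk x n while still doing most of the work in a fast matrix multiply per chunk.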


We tried that approach, but it was way too slow.


"The tricky part is the retrieval of duplicates, which on the same 10K dataset takes a few minutes"

"Doesn't sound too bad."

It doesn't? As a word of caution: if a 10k dataset takes a few minutes, I would be careful about how long 10MM pictures would take. I predict it does not take 10MM/10k times a few minutes.
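The arithmetic behind that caution: if duplicate retrieval compares all pairs, the work grows with n squared, so going from 10k to 10MM images costs a factor of a million, not a thousand:

```python
n_small, n_big = 10_000, 10_000_000

naive_factor = n_big / n_small            # 1000x: what linear scaling suggests
pairwise_factor = (n_big / n_small) ** 2  # 1,000,000x: what O(n^2) implies

# "A few minutes" times 10^6 is on the order of years, not days
print(naive_factor, pairwise_factor)  # 1000.0 1000000.0
```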


How is the hash represented? Have you looked at NN libraries like faiss [0] and NGT [1]? They can quite easily handle a nearest-neighbor search over 10 million vectors and, from my understanding, they turn their vectors into some kind of hash that is then searched.

[0] - https://github.com/facebookresearch/faiss

[1] - https://github.com/yahoojapan/NGT


The hashes are 16-character hexadecimal strings. Had a quick look at the faiss package and it looks promising; we would consider it for the next versions.
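For reference, a 16-character hex string is just a 64-bit integer, so converting to the integer form that a popcount-based Hamming distance needs is a single `int(..., 16)` call. The hash values below are made up for illustration:

```python
# Hypothetical hash values; a 16-hex-char hash encodes 64 bits.
a = int("d879f8f89b1bbf9e", 16)
b = int("d879f8f89b1bbf9c", 16)

dist = bin(a ^ b).count("1")   # Hamming distance via popcount
print(dist)  # 1
```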


If you're interested in collaboration I'd be happy to help with a prod-focused version. My work has a need for a shardable daemon for dedup tasks. My personal email is in my description and I'm also available via josh@xix.ai.

We also have an image heavy production use case that would be able to yield some nice metrics from this tool.


I use Cython (CPU) to brute force 400,000 pHashes of images. It takes somewhere between 100 and 200 ms to search.


Only realized now that the 100-200 ms time you refer to is for a single search, not for 400,000 searches. The package already achieves this brute-force speed. In fact, it also implements a BK-tree, which, depending on the distance threshold passed, can drastically reduce the search time. Moreover, the search through the BK-tree is parallelized in the package (each image's hash is searched through the tree independently after the tree is constructed). On one of the example datasets containing 10k images, with a distance threshold of 10 (for 64-bit hashes), the retrieval time per image was < 50 ms.
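For readers unfamiliar with BK-trees: each node stores a hash, children are keyed by their distance to the parent, and the triangle inequality lets a search skip any subtree whose edge distance falls outside [d - r, d + r]. A minimal sketch, not the package's actual implementation:

```python
# Minimal BK-tree over Hamming distance (illustrative sketch).
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

class BKTree:
    def __init__(self):
        self.root = None                      # (value, {distance: child})

    def add(self, value: int) -> None:
        if self.root is None:
            self.root = (value, {})
            return
        node = self.root
        while True:
            d = hamming(value, node[0])
            if d in node[1]:
                node = node[1][d]             # descend along existing edge
            else:
                node[1][d] = (value, {})
                return

    def search(self, query: int, radius: int) -> list:
        results, stack = [], [self.root]
        while stack:
            value, children = stack.pop()
            d = hamming(query, value)
            if d <= radius:
                results.append(value)
            for dist, child in children.items():
                if d - radius <= dist <= d + radius:   # triangle inequality
                    stack.append(child)
        return results
```

With a small radius, most subtrees fail the pruning check and are never visited, which is where the speedup over brute force comes from; a large radius degrades back toward scanning everything.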


The 100-200ms time I referred to was indeed a single search. The difference is, it's on a single core. Cython definitely makes the hamming distance function faster.


That sounds good! Would be great if you can share the code, or even better, make a PR to the repo.


The implementation is trivial; for speed we use:

    cdef extern int __builtin_popcountll(unsigned long long) nogil

    dist = __builtin_popcountll(key ^ phash)

It would only take a couple of minutes to fill out the rest.
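A rough NumPy equivalent of that brute force, for anyone without a Cython toolchain. The 400k size mirrors the parent comment; the hash values are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
hashes = rng.integers(0, 2**63, size=400_000, dtype=np.uint64)
query = hashes[123]

# XOR against every stored hash, then count set bits (vectorized popcount)
xored = (hashes ^ query).view(np.uint8).reshape(len(hashes), 8)
dists = np.unpackbits(xored, axis=1).sum(axis=1)

print(dists[123])  # 0 -- distance of the query hash to itself
```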


How long did the generating take?


We store pHashes in a database, but quickly checking on my laptop, it takes between 1 and 2 ms to generate a single image pHash. IO could be significant for many images.



