How to find similar pictures

Web 2.0 is great. Self-service websites: users fill them with their own hands (“post content”, as they say now). They post it themselves, they laugh at it themselves. And the site owner only pays for hosting and clips the advertising coupons. Convenient.

But our life is so strangely arranged that there are no pluses without minuses, and shortcomings are often just the flip side of virtues. Self-filled sites have a problem: “bayans”, as Russian forums call stale reposts. That is, duplicates.

Many visitors dislike duplicates, especially the old-timers who remember by heart the memes that appeared back in the days of Preved Medved and the Albanian yazyk. Every reappearance is greeted with snorts and threats to unsubscribe immediately.

What to do? Call on the iron machine for help, of course: let it hunt for bayans on its own.

With text, the procedure is more or less a solved problem. A simple full-text search engine can be found in almost any DBMS. If you need something fancier, say, one that accounts for the morphology of the language, Sphinx, Solr and the like are at your service. And if your ambitions go so far that you want to find not just duplicates but reworkings of bearded jokes, where Vasily Ivanovich and Petka are replaced, in the spirit of the times, by Darth Vader and Harry Potter, there are tf-idf, the cosine of the angle between vectors, and other plagiarism-detection methods.

Text is sorted, then. But what about pictures? Surely it’s possible somehow? Google finds similar images, even if poorly. Yandex does too, and better than Google. Never mind Yandex – even entertainment sites like Pikabu have their own de-bayanization tools.

Yes, you can. And, oddly enough, it is not very difficult, because smart people have already done the hardest part of the work for us by inventing and testing beautiful, witty algorithms.

How to find duplicate images? Very simple.

Step 1. For each picture in the database, compute its digital fingerprint: a bit sequence that reflects its content.

Step 2. Compute the fingerprint of the image being checked and compare it with those already in our database.

Voilà! And the devil, as usual, lives in the details.

Do One: Create a Fingerprint

Common sense dictates that the fingerprint should not be too long – the shorter the better. After all, it has to be stored in the database and then compared with all the others. If it is the size of the image itself, both become difficult.

So we need some kind of hash function that will collapse the contents of the picture into a short string of bits.

Traditional hash functions like the sha family are out of the question, of course. They are too sensitive: even if the binary content of two pictures differs by a single bit, the calculated hashes will have nothing in common at all. (Oddly enough, it is sha1 that the Wikipedia engine uses. They could have come up with something better.)

Cryptographic hashes in general are even further out of the picture. They react just as nervously to an extra bit, and on top of that they are more expensive to compute.

We need something that gives similar hashes for similar pictures, where “similar” means similar from a person’s point of view, not a machine’s. Not that the bitmaps resemble each other, but that the pictures themselves do. Here is a chair – and there is a chair; here is the sun – and there it is.

And such hashes have indeed been invented. One could, for example, use machine learning to recognize what is drawn – say, a dog – then recognize its breed, color and pose, and encode these features in a hash.

But we definitely don’t need such frills; we’ll get by with something simpler. After all, we don’t have to classify what is drawn, we just have to find duplicates.

Therefore, our method will be as follows: the image is bleached, that is, converted to shades of gray (not fifty, as fashion demands, but sixty-four), and then greatly reduced. The first operation makes pictures that differ only in color look alike; the second drastically reduces the amount of information to process and strips away minor details. The hash is then calculated from the resulting picture.

You can, for example, calculate the average brightness over the entire image and assign each pixel a 0 or 1 depending on whether it is darker or lighter than the average. It’s quick and easy, but the results are pretty mediocre.
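For illustration, here is what such an average hash might look like in nodejs. A minimal sketch, assuming the sharp library for the grayscale-and-shrink step (my choice for the example; the author used ImageMagick):

// Average hash sketch: shrink to 8x8 grayscale, then one bit per pixel,
// depending on whether it is lighter than the mean brightness.
// Assumes: npm install sharp
const sharp = require('sharp');

async function ahash(file) {
  // 64 grayscale pixels, ignoring the original aspect ratio
  const pixels = await sharp(file)
    .grayscale()
    .resize(8, 8, { fit: 'fill' })
    .raw()
    .toBuffer();

  const mean = pixels.reduce((sum, p) => sum + p, 0) / pixels.length;

  let hash = 0n;
  for (const p of pixels) {
    hash = (hash << 1n) | (p > mean ? 1n : 0n); // 1 = lighter than average
  }
  return hash; // 64-bit BigInt
}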

It is much better to use the discrete cosine transform, which yields the so-called perceptual hash (phash). You don’t even have to write the algorithm yourself: phash is already built into the most popular image-processing library, ImageMagick.
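You can reach it without leaving nodejs. A sketch, assuming ImageMagick 6’s compare binary is on the PATH (in ImageMagick 7 it is invoked as magick compare); note that compare prints the metric to stderr, not stdout:

// Compare two images with ImageMagick's built-in perceptual hash metric.
// Smaller values mean more similar images.
const { execFile } = require('child_process');

function phashDistance(fileA, fileB) {
  return new Promise((resolve, reject) => {
    execFile('compare', ['-metric', 'PHASH', fileA, fileB, 'null:'],
      (err, stdout, stderr) => {
        // compare exits non-zero when images differ, so check stderr first
        const value = parseFloat(stderr);
        if (Number.isNaN(value)) return reject(err || new Error(stderr));
        resolve(value);
      });
  });
}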

Another option is the difference hash (dhash). Its results are slightly worse than the perceptual hash’s, but still quite acceptable, and it is considered simpler and faster. The idea is the same, except that the bits encode the differences between neighboring pixels – whether each is positive or negative.
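A difference hash is easy enough to roll by hand. A minimal sketch in nodejs, again assuming sharp for the resize (9×8 pixels give exactly 64 horizontal neighbor pairs):

// Difference hash (dhash) sketch: 9x8 grayscale, each bit says whether
// a pixel is brighter than its right-hand neighbor.
const sharp = require('sharp');

async function dhash(file) {
  const pixels = await sharp(file)
    .grayscale()
    .resize(9, 8, { fit: 'fill' }) // 9 columns -> 8 differences per row
    .raw()
    .toBuffer();

  let hash = 0n;
  for (let row = 0; row < 8; row++) {
    for (let col = 0; col < 8; col++) {
      const left = pixels[row * 9 + col];
      const right = pixels[row * 9 + col + 1];
      hash = (hash << 1n) | (left > right ? 1n : 0n);
    }
  }
  return hash; // 64-bit BigInt, ready for BIT_COUNT-style comparison
}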

You can, of course, invent something else in the same spirit, but is it necessary?

Do Two: Compare Hashes

It’s quite easy. In clever terms, we use the Hamming distance; in simple terms, we count how many bits differ between the two hashes.

If, for example, you are using MySQL or its little sister MariaDB, this can be done with a built-in function:

SELECT * FROM images WHERE BIT_COUNT(newMem ^ hash) <= 6

Here newMem is the dhash (or phash) of the new picture whose duplicates we want to find, and hash is, accordingly, the hash of a picture already in the database. The caret stands for bitwise XOR, and the number 6 is empirical.
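The same check on the application side is a few lines; a sketch for the 64-bit BigInt hashes produced above:

// Hamming distance between two 64-bit hashes: XOR, then count set bits.
function hamming(a, b) {
  let x = a ^ b; // both BigInt
  let count = 0;
  while (x) {
    x &= x - 1n; // clears the lowest set bit
    count++;
  }
  return count;
}

// hamming(hashA, hashB) <= 6  ->  probable duplicate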

Here is an example of the original image for clarity.

Idyllic scene on Dzhemete beach

And here are its various distortions and perversions, with the Hamming distance calculated for each (the distances are for the 3000×2000 originals; I shrank them for posting on the site):

Decreasing the brightness – 1
Removing all colors – 1
Making the colors acid – 0
Cutting off the right edge a little – 6
Cropping lightly on all sides – 1
Adding a caption – 2
Cranking up the sharpness – 1
Mirroring – 37

As you can see, the hash (dhash in this case) copes fine with small changes to the original image – the kind that come from light editing, of course, not from deliberate attempts to fool the algorithm. The only thing it failed on was the mirror flip, but nothing stops us from taking a suspect picture, mirroring it, and running it through the database again.
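With sharp, say, the mirrored check costs one extra line. A sketch reusing the hypothetical dhash function from above:

// Hash a suspect image both ways: as-is and mirrored horizontally.
const sharp = require('sharp');

async function bothHashes(file) {
  const straight = await dhash(file);
  // flop() mirrors about the vertical axis
  const mirrored = await dhash(await sharp(file).flop().toBuffer());
  return [straight, mirrored]; // query the database with both
}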

If your site gathers users who post their lovingly collected photos of old towed trams, Russian stoves, lifeboats on the beach, home-grown lemon trees and other such exciting stuff – that is, serious, respectable people, not teenagers itching to break the system out of sick vanity – then you don’t need anything more than dhash.

Do Two: Really Compare Hashes

Obviously, the SQL query shown above is purely for show. For a hundred hashes it is fine, but at a thousand it will already start to wheeze. The reason is clear: it compares the candidate hash with every stored one in turn and, because of the function call, cannot use the database’s indexes.

The situation is aggravated by the fact that hash computation is slightly non-deterministic: even for identical pictures it can give results that differ by a bit or two. That alone is enough that even a search for exact duplicates cannot be done with a simple equality query.

Indexes are not so clear-cut either: we would need one ordered by Hamming distance from… from what, exactly? Exactly. Obviously, something cleverer is needed here.

Such indexes, or rather trees, do exist in nature. There is, for example, such an exotic thing as the vp-tree (vantage point tree), which might be artistically translated as “a tree with points convenient for observing other points”. Its principle is not hard to grasp: it is an ordinary binary tree, except that the points in it are grouped by distance from one another. Building the tree and searching it is less cheerful – the algorithm is not easy to get into on the first try – but that is not even the biggest problem. Much worse is that such a tree is built once, and when new points are added or old ones removed, it has to be rebuilt almost from scratch. That makes it a poor fit for a site where new material is constantly added and old material is often deleted.

How does Google deal with this? Well, Google, or rather Moses Charikar, who worked there, came up with a very elegant thing called simhash, which is much simpler than tangled spatial trees. Look:

Let’s say, for concreteness, that our hash consists of 64 bits. Let’s break them into 8 bytes.

We want to find hashes that are no more than 6 bits away from the one being checked. But 6 differing bits can spoil at most 6 of the 8 bytes, so however you slice it, at least two bytes will match completely in both hashes.

Here they are – two matching bytes

And this means that we can pre-select from the database those hashes that match the checked one in some two bytes, and calculate the distance only for them. This cuts the amount of work decently – by a factor of 200-300.

How to do it?

According to the original idea, you create additional tables for the hashes, as many as there are ways to choose two bytes out of eight – in our case C(8,2) = 28. Each table stores the same hashes but with the bytes swapped around: the main table keeps the first and second bytes of the hash in its first two positions, the first additional table the first and third, the second the first and fourth, and so on through all the combinations. Here they are, in case you are too lazy to enumerate them yourself:

{1, 2}, {1, 3}, {1, 4}, {1, 5}, {1, 6}, {1, 7}, {1, 8}, {2, 3}, {2, 4}, {2, 5}, {2, 6}, {2, 7}, {2, 8}, {3, 4}, {3, 5}, {3, 6}, {3, 7}, {3, 8}, {4, 5}, {4, 6}, {4, 7}, {4, 8}, {5, 6}, {5, 7}, {5, 8}, {6, 7}, {6, 8}, {7, 8}

Now we sort these tables by their first two bytes, and binary search can pick out candidates for the subsequent full comparison!

However, at this point we may deviate from the advice of the distinguished author. We are not Google but a simpler site; we store far fewer pictures, so there is no need to invent our own turbocharged bicycles. We already have a database where we keep the links and attributes of our photos. Let’s use it and store the same hash, split into bytes, right next to the hash itself.

CREATE TABLE images (
  name VARCHAR(100) NOT NULL,
  hash BIGINT UNSIGNED NOT NULL,
  b1 TINYINT UNSIGNED NOT NULL,
  ...
  b8 TINYINT UNSIGNED NOT NULL
)

Naturally, we add a separate index on each byte column, and then look for our hash with a query that runs through all possible combinations of two bytes out of eight:

SELECT name FROM images WHERE (
  (b1=newMem_b1 AND b2=newMem_b2) OR
  (b1=newMem_b1 AND b3=newMem_b3) OR
  ...
  (b7=newMem_b7 AND b8=newMem_b8)
) 
AND BIT_COUNT(newMem ^ hash) <= 6

The query is long for a human, but the database sees it differently and has no trouble optimizing it using the indexes on the fields b1…b8.
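Nobody writes those 28 conditions by hand, of course; they are easy to generate. A sketch that builds the query for a 64-bit BigInt hash (the column names match the schema above; the inlined values are numbers we computed ourselves, so no injection risk in this particular case, though parameterized queries would be tidier):

// Build the 28-way OR condition plus the final BIT_COUNT check.
// Splits a 64-bit BigInt hash into 8 bytes, most significant first.
function buildQuery(hash) {
  const bytes = [];
  for (let i = 7; i >= 0; i--) {
    bytes.push(Number((hash >> BigInt(i * 8)) & 0xFFn));
  }

  const pairs = []; // all C(8,2) = 28 two-byte combinations
  for (let i = 0; i < 8; i++) {
    for (let j = i + 1; j < 8; j++) {
      pairs.push(`(b${i + 1}=${bytes[i]} AND b${j + 1}=${bytes[j]})`);
    }
  }

  return `SELECT name FROM images WHERE (${pairs.join(' OR ')})` +
         ` AND BIT_COUNT(${hash} ^ hash) <= 6`;
}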

The two-out-of-eight split is not carved in stone, of course; it is just one option, albeit a practically convenient one. Depending on the expected number of stored images and the Hamming distance we want to search within, the hash can be broken into a different number of parts, and not necessarily parts of equal length.

Some practical numbers. On my old ten-year-old machine, computing a dhash in nodejs takes about a second and a half, of which a whole second goes to reducing the image with ImageMagick.

Searching for one hash in a database of 30,000 photos (I couldn’t find any more cats and curvy girls) takes about two seconds, of which, again, a second is spent reducing the source image. This is without indexes – I was too lazy to create them for the prototype – so there is headroom left for scaling the search. Whether that will be enough for a million or two photographs is an open question. If you have that many different photos, it would be interesting to hear your impressions.

And finally, one more question. What if we want to find photographs that are more distant in the Hamming sense – differing not by 6 bits but by 20 or 30? I answer: that is a completely different task – in English terms, not near-duplicate search but similarity detection – and a completely different story, for which the methods described here are not suitable and something else has to be invented.
