Using index hashes to speed up the search for frames in the database
Search speed problem
Before moving on to the main topic, it makes sense to step back and look at the scale of the problem.
How many frames does an average video film contain?
How many movies must be in the database for users to start using this service?
Let’s try to answer these questions.
An average movie contains about 150,000 frames.
A modern database needs about 1,000,000 videos to be in demand.
That gives us roughly 150 billion frames in the database. That’s a lot! How can we search such a huge table in a split second?
Search algorithm “VideoColor”
The idea behind the “VideoColor” search technology is that every frame of a video is treated as a separate image that can be searched for. Each indexed image, and later the query image, is divided into a grid of areas, and for every cell the averaged values of the red, green, and blue components are stored. These averages can then be compared to find the desired frame.
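A minimal sketch of this grid-averaging step in Python (the grid size, pixel representation, and function name are assumptions for illustration, not the actual VideoColor code):

```python
def grid_signature(frame, rows=2, cols=2):
    """frame: an H x W image as nested lists of (r, g, b) pixels.
    Split the frame into a rows x cols grid and return the averaged
    (R, G, B) values of each cell."""
    h, w = len(frame), len(frame[0])
    sig = []
    for gr in range(rows):
        row = []
        for gc in range(cols):
            # Collect every pixel that falls inside this grid cell
            pixels = [frame[y][x]
                      for y in range(gr * h // rows, (gr + 1) * h // rows)
                      for x in range(gc * w // cols, (gc + 1) * w // cols)]
            n = len(pixels)
            row.append(tuple(sum(p[i] for p in pixels) / n for i in range(3)))
        sig.append(row)
    return sig

# Example: a 4x4 frame, left half pure red, right half pure blue
frame = [[(255, 0, 0)] * 2 + [(0, 0, 255)] * 2 for _ in range(4)]
print(grid_signature(frame))
```

Two frames can then be compared cell by cell on these averages, which is far cheaper than comparing raw pixels.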
Using hashes to speed up searches
There is a solution to the problem above, and it is to use hashes. A hash is a number obtained by applying a hash function to the original data. Hashes give us a way to speed up the search for our frame in a huge array of information.
As the input data, we can use the averaged RGB values of certain areas of the image, and the hash itself can be a 32-bit unsigned integer.
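One possible packing, as a minimal sketch: quantize each averaged channel to its top two bits and pack the bits into a 32-bit unsigned integer (the bit widths, cell count, and function name are assumptions; the article does not specify the actual hash function):

```python
def frame_hash(cells):
    """cells: iterable of (r, g, b) channel averages in 0..255.
    Quantize each channel to its top 2 bits and pack them,
    masking the result to a 32-bit unsigned integer."""
    h = 0
    for r, g, b in cells:
        for v in (r, g, b):
            h = ((h << 2) | (int(v) >> 6)) & 0xFFFFFFFF
    return h

# Four cell averages -> 4 cells x 3 channels x 2 bits = 24 bits used
cells = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]
print(hex(frame_hash(cells)))  # → 0xc0c0ff
```

With more cells the packed bits would exceed 32, which is why the mask (or a real hash function mixing all inputs) is needed.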
Hashes ambiguity problem
When calculating a hash, we have to round the averaged values to integers, and this is where we can run into trouble.
When a video is transcoded to a different resolution or bitrate, the decoded frames differ slightly from the originals, so the following can happen: the hash of the original image differs from the hash of the re-encoded one. This is because the averaged values in borderline regions can shift up or down (even by a tiny amount), so instead of a value a we get a + 1…
In addition, capturing or exporting a frame from video to JPEG or another lossy format changes the original frame even further.
As a result, different versions of frames of the same video produce different hashes. This is a serious problem!
Using an array of hashes
The following approach can be used to solve the hash ambiguity problem. If you analyze the input values, you can see that each of them lies either near the center of its quantization interval or near a boundary. In the borderline case, we can assume that the value has forked: one hash value turns into two, and so on. At each such borderline case we double the number of hashes and assign each one a probability.
Yes, this complicates the search algorithm and multiplies the search time. A slippery path, but let’s try to follow it further.
If all the values in the array of hashes had equal probabilities, we would have to try every hash during the search. Since the hashes have different probabilities, we can try the more probable ones first. And if a frame is found with minimal deviations at the nodal points, we have the right to assume that the search succeeded and simply ignore the remaining hashes. This significantly speeds up the search process.
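This forking can be sketched as follows (the quantization step, the “borderline” margin, and the distance-based probability model are assumptions for illustration, not the article’s exact scheme):

```python
from itertools import product

def bin_candidates(v, step=64, margin=4):
    """Quantize one averaged channel value (0..255) into a 2-bit bin.
    Values within `margin` of a bin boundary fork into two bins,
    with probabilities based on the distance to the boundary."""
    b = min(int(v) // step, 3)
    r = v - b * step
    if r < margin and b > 0:         # just above a boundary
        return [(b, 0.5 + r / (2 * margin)), (b - 1, 0.5 - r / (2 * margin))]
    if step - r < margin and b < 3:  # just below a boundary
        d = step - r
        return [(b, 0.5 + d / (2 * margin)), (b + 1, 0.5 - d / (2 * margin))]
    return [(b, 1.0)]

def hash_candidates(values):
    """Enumerate (hash, probability) pairs over all forks,
    most probable first."""
    forks = [bin_candidates(v) for v in values]
    out = []
    for combo in product(*forks):
        h, p = 0, 1.0
        for b, pb in combo:
            h, p = ((h << 2) | b) & 0xFFFFFFFF, p * pb
        out.append((h, p))
    return sorted(out, key=lambda hp: -hp[1])

# 63 and 128 sit right at a bin boundary and fork; 200 does not,
# so three input values yield 2 * 2 * 1 = 4 candidate hashes
for h, p in hash_candidates([63, 128, 200]):
    print(f"{h:06b}  p={p:.4f}")
```

Searching the candidates in this order lets us stop at the first convincing match instead of checking every fork.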
The problem of uneven distribution of hashes over the address space
How does an index hash work?
If we added a new field, a hash, to each record in the table, we could use it as an index for searching. Unfortunately, this would require additional indexing costs, so we will do it differently. Let’s create an index file, each cell of which contains a link to a record in the data table. The data itself is ordered so that records with the same index hash sit within one group, and the groups are ordered by hash value.
Now, when we need to find data with a given hash (index) value, we simply read the cell with that index from the index file (and also the next cell, to find out the number of records with this index value). Then, knowing the position of the first record of the group in the record table, we read the necessary records from it.
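An in-memory sketch of this scheme (a tiny 4-bit hash space and toy records stand in for the real 32-bit index file; all names are illustrative):

```python
import bisect

# Toy record table, grouped (sorted) by its index hash
records = sorted([(5, "frame-a"), (5, "frame-b"), (9, "frame-c"), (2, "frame-d")])
hashes = [h for h, _ in records]

HASH_SPACE = 16  # 4-bit demo; the article's hashes are 32-bit
# index[h] = position of the first record of group h;
# index[h + 1] - index[h] = number of records in the group
index = [bisect.bisect_left(hashes, h) for h in range(HASH_SPACE + 1)]

def lookup(h):
    """Read the group of records whose index hash equals h."""
    return [rec for _, rec in records[index[h]:index[h + 1]]]

print(lookup(5))  # → ['frame-a', 'frame-b']
print(lookup(3))  # → []
```

On disk, `index` would be the index file and `records` the data table; one seek into the index file yields both the group offset and its length.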
Frequency distribution of index hash
Ideally, if the distribution of records is more or less even, then for 150 billion records each cell of the index (a 32-bit hash gives about 4 billion possible values) will hold approximately
150,000,000,000 / 4,000,000,000 ≈ 38 entries.
However, in reality, things are not so rosy. If you build a distribution histogram, it will look something like this.
The peaks correspond to the most frequent values. One such value, for example, is (R = 0, G = 0, B = 0), which can account for 3-5% of the total number of frames. For such groups, other mechanisms are needed to speed up the search within the group, for example, perceptual hashes.
Index hashes are great for speeding up the search for frames in the database: with their help, the search process can be sped up many times over. Although the hash ambiguity problem forces us to search not by one hash but by a whole group of hashes, the game is still worth the candle.