Computer scientists have invented a new, efficient way to count unique elements.

Using randomness, a team of computer scientists has created a simple algorithm for counting large numbers of distinct objects in a data stream.

Imagine being sent into a pristine rainforest to conduct a wildlife census. Every time you see an animal, you take a photo. Your digital camera records the total number of pictures, but you are interested only in the number of unique animals – the ones you have not yet counted. What is the best way to get that number? “The obvious solution is to remember all the animals you've already seen and compare each new animal to that list,” says Lance Fortnow, a computer scientist at the Illinois Institute of Technology. But there are smarter ways, he adds, because once the list runs to thousands of entries, the obvious approach is no longer so simple.

Now make the problem harder. What if you are Facebook, and you want to count the number of distinct users who access your site every day, even though some of them connect from multiple devices and at different times? Now each new login must be compared against a list that can run into the billions.

In a recent paper, scientists described a new way to approximate the number of distinct entries in a long list – one that requires remembering only a small number of them. The algorithm works for any list whose elements arrive one at a time: words in a speech, goods on a conveyor belt, or cars on a highway.

The CVM algorithm, named for its creators – Sourav Chakraborty of the Indian Statistical Institute, Vinodchandran Variyam of the University of Nebraska-Lincoln, and Kuldeep Meel of the University of Toronto – is a significant step toward solving the so-called distinct elements problem, which computer scientists have wrestled with for more than 40 years. The problem asks for a way to efficiently monitor a stream of elements, whose total number may exceed the available memory, and then estimate how many of them are unique.

“The new algorithm is amazingly simple and easy to implement,” says Andrew McGregor of the University of Massachusetts, Amherst. “I wouldn't be surprised if this became the standard approach to the [distinct elements] problem in practice.”

To illustrate the problem and how the CVM algorithm solves it, imagine you are listening to an audiobook of Hamlet. The play contains 30,557 words. How many of them are unique? To find out, you could listen to the play (using the pause button often), write each new word in alphabetical order in a notepad, and skip words that are already on your list. When you reach the end, you simply count the number of words in the list. This approach works, but it requires an amount of memory roughly equal to the number of unique words.
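Here is what that obvious approach looks like as code – a minimal Python sketch (the function name and the use of a hash set rather than an alphabetized notepad are illustrative choices, not anything from the paper):

```python
def count_distinct_exact(stream):
    """Exact distinct count: remember every unique item seen so far."""
    seen = set()        # memory grows with the number of unique items
    for item in stream:
        seen.add(item)  # duplicates are absorbed automatically
    return len(seen)
```

The answer is exact, but the set must hold every distinct word at once – precisely the memory cost the CVM algorithm is designed to avoid.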

In typical data streaming situations, there may be millions of items that need to be tracked. “You may not want to store everything,” says Variyam. And this is where the CVM algorithm can offer an easier path. The trick, he says, is to rely on randomization.

Back to Hamlet, but this time your whiteboard working memory has room for only 100 words. Once the play begins, you write down the first 100 distinct words you hear, again skipping any repetitions. When the space is full, press pause and flip a coin for each word: heads, the word stays on the list; tails, you delete it. After this preliminary round, you will be left with about 50 distinct words.

Now you move on to what the team calls Round 1. Keep going through Hamlet, adding new words as you go. If you come across a word that is already on your list, flip the coin again: if it comes up heads, the word stays; if it comes up tails, you cross it out. Continue in this manner until the board again holds 100 words. Then randomly remove about half of them, based on the results of 100 coin tosses. This completes Round 1.

Next, move on to Round 2. Proceed as in Round 1, only now the test is harder to pass. When you come to a repeated word, flip the coin again. If it lands tails, you delete the word, as before. But if it lands heads, you flip the coin a second time, and the word stays on the list only if the second flip also comes up heads. Once you fill the board, the round again ends with a purge of about half the words, based on 100 coin flips.

In the third round, you will need three heads in a row to keep a word; in the fourth round, four heads in a row; and so on.

Eventually, in some round k, you will reach the end of Hamlet. The point of the exercise is that, thanks to the random selection, every distinct word now has the same probability of being on your list: 1/2^k. If, for example, you end up with 61 words on your list at the end of Hamlet and the process took six rounds, you can divide 61 by the probability 1/2^6 to estimate the number of distinct words – in this case, 61 × 64 = 3,904. (It is easy to see why this procedure works: suppose you have 100 coins and you flip each one, keeping only those that land heads. You will end up with about 50 coins, and dividing that number by the probability 1/2 recovers the original total of roughly 100.)
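The whole procedure fits in a few lines of code. Below is a minimal Python sketch of the idea; it follows the standard statement of the algorithm, in which every occurrence of a word is re-admitted with the current round's probability (flipping k coins and demanding all heads, as above, is just one way to realize a probability of 1/2^k). Names such as `cvm_estimate` are illustrative, not from the paper:

```python
import random

def cvm_estimate(stream, memory_size=100):
    """Estimate the number of distinct items in a stream while never
    storing more than about memory_size items at once."""
    p = 1.0          # current keep-probability: 1/2^k in round k
    board = set()    # the "whiteboard"

    for item in stream:
        board.discard(item)       # forget any earlier decision on this item
        if random.random() < p:   # re-admit it with the current probability
            board.add(item)
        if len(board) >= memory_size:
            # The board is full: keep each word with probability 1/2,
            # then halve the probability for the next round.
            board = {w for w in board if random.random() < 0.5}
            p /= 2

    # Each distinct word ends up on the board with probability p, so
    # dividing the board size by p estimates the distinct count.
    return len(board) / p
```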

Variyam and his colleagues proved mathematically that the accuracy of this technique scales with the amount of memory. Hamlet contains exactly 3,967 unique words. In experiments using a memory of 100 words, the average estimate after five runs was 3,955 words; with a memory of 1,000 words, the average improved to 3,964. “Of course,” says Variyam, “if the memory is so large that it holds all the words, then we can achieve 100 percent accuracy.”
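You can reproduce the flavor of that experiment with the sketch above. The snippet below is hypothetical: it stands in for Hamlet with a synthetic stream whose distinct count is close to 3,967, and it averages five runs at each memory size:

```python
import random

# Synthetic stand-in for Hamlet: 30,557 draws from ~3,967 possible "words".
words = [f"word{random.randrange(3967)}" for _ in range(30557)]

for memory in (100, 1000):
    runs = [cvm_estimate(words, memory_size=memory) for _ in range(5)]
    print(memory, round(sum(runs) / len(runs)))  # larger memory, tighter estimate
```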

“This is a great example of how, even for basic and well-studied problems, there are sometimes very simple but far-from-obvious solutions still waiting to be found,” said William Kuszmaul of Harvard University.
