Solving the Distinct Elements Problem with the CVM Algorithm

Category Science

Tuesday - May 21 2024, 13:29 UTC - 1 year ago

tldr #

25 seconds

The distinct elements problem, which has been studied in computer science for over 40 years, asks for an efficient way to estimate the number of unique entries in a long list. The CVM algorithm, developed by researchers from multiple universities, uses randomization to more efficiently solve this problem with significantly less memory. It has the potential to become the default method for solving the distinct elements problem in practical situations.

content #

3 minutes, 29 seconds

Imagine that you’re sent to a pristine rainforest to carry out a wildlife census. Every time you see an animal, you snap a photo. Your digital camera will track the total number of shots, but you’re only interested in the number of unique animals — all the ones that you haven’t counted already. What’s the best way to get that number? "The obvious solution requires remembering every animal you’ve seen so far and comparing each new animal to the list," said Lance Fortnow, a computer scientist at the Illinois Institute of Technology. But there are cleverer ways to proceed, he added, because if you have thousands of entries, the obvious approach is far from easy.

The distinct elements problem has been studied in computer science for over 40 years

It gets worse. What if you’re Facebook, and you want to count the number of distinct users who log in each day, even if some of them log in from multiple devices and at multiple times? Now we’re comparing each new login to a list that could run to the billions.

In a recent paper, computer scientists have described a new way to approximate the number of distinct entries in a long list, a method that requires remembering only a small number of entries. The algorithm will work for any list where the items come in one at a time — think words in a speech, goods on a conveyor belt or cars on the interstate.

The CVM algorithm was developed by Sourav Chakraborty, Vinodchandran Variyam, and Kuldeep Meel

The CVM algorithm, named for its creators — Sourav Chakraborty of the Indian Statistical Institute, Vinodchandran Variyam of the University of Nebraska, Lincoln, and Kuldeep Meel of the University of Toronto — is a significant step toward solving what’s called the distinct elements problem, which computer scientists have grappled with for more than 40 years. It asks for a way to efficiently monitor a stream of elements — the total number of which may exceed available memory — and then estimate the number of unique elements.

Before the CVM algorithm, a lot of memory was needed to solve the distinct elements problem

"The new algorithm is astonishingly simple and easy to implement," said Andrew McGregor of the University of Massachusetts, Amherst. "I wouldn’t be surprised if this became the default way the [distinct elements] problem is approached in practice." .

To illustrate both the problem and how the CVM algorithm solves it, imagine that you’re listening to the audiobook of Hamlet. There are 30,557 words in the play. How many are distinct? To find out, you could listen to the play (making frequent use of the pause button), write down each word alphabetically in a notebook, and skip over words already on your list. When you reach the end, you’ll just count the number of words on the list. This approach works, but it requires an amount of memory roughly equal to the number of unique words.

The CVM algorithm uses randomization to more efficiently estimate the number of distinct elements in a long list

In typical data-streaming situations, there could be millions of items to keep track of. "You might not want to store everything," Variyam said. And that’s where the CVM algorithm can offer an easier way. The trick, he said, is to rely on randomization.

Let’s return to Hamlet, but this time your working memory — consisting of a whiteboard — has room for just 100 words. Once the play starts, you write down the first 100 words you hear, again skipping any repeats. When the space is full, press pause and flip a coin for each word. Heads, and the word stays on the whiteboard; tails, and the word gets erased, meaning that when Hamlet says "the," it won’t necessarily make the cut. Now press play to resume the audiobook.

This algorithm has potential to be the default method for solving the distinct elements problem in practical situations

You’ll reach the end relatively quickly, and you’ll count up the number of words on the remaining whiteboard. At that point, Fortnow said, "some magic math" converts that number to a guess of the distinct count, and with substantially better accuracy than the simpler implementation using a notebook.

hashtags #

distinctelements cvmalgorithm computerscience bigdata randomization

worddensity #

number (11, 1.76%)
words (8, 1.28%)
list (6, 0.96%)
way (5, 0.8%)
algorithm (5, 0.8%)