The Revolutionary Lossless Compression Technique That Changed Technology

Category Science

Friday - June 16 2023, 17:43 UTC - 2 years ago

tldr #

Robert Fano challenged his students at Massachusetts Institute of Technology to improve a data compression algorithm but nothing came of it until David Huffman stepped in with a revolutionary new idea: a system of codes with variable code lengths that worked without pauses. The technology is now known as Huffman coding and is still used today, across different types of data compression algorithms.

content #

With more than 9 billion gigabytes of information traveling the internet every day, researchers are constantly looking for new ways to compress data into smaller packages. Cutting-edge techniques focus on lossy approaches, which achieve compression by intentionally "losing" information from a transmission. Google, for instance, recently unveiled a lossy strategy where the sending computer drops details from an image and the receiving computer uses artificial intelligence to guess the missing parts. Even Netflix uses a lossy approach, downgrading video quality whenever the company detects that a user is watching on a low-resolution device.

The technology, known as Huffman coding, is named after its inventor, the MIT graduate student David Huffman.

Very little research, by contrast, is currently being pursued on lossless strategies, where transmissions are made smaller, but no substance is sacrificed. The reason? Lossless approaches are already remarkably efficient. They power everything from the PNG image standard to the ubiquitous software utility PKZip. And it’s all because of a graduate student who was simply looking for a way out of a tough final exam.

Today Huffman coding is one of the most-used data compression techniques in modern computer science and information theory.

Seventy years ago, a Massachusetts Institute of Technology professor named Robert Fano offered the students in his information theory class a choice: Take a traditional final exam, or improve a leading algorithm for data compression. Fano may or may not have informed his students that he was an author of that existing algorithm, or that he’d been hunting for an improvement for years. What we do know is that Fano offered his students the following challenge.

Copies of the code David Huffman wrote for this project still exist at MIT.

Consider a message made up of letters, numbers and punctuation. A straightforward way to encode such a message would be to assign each character a unique binary number. For instance, a computer might represent the letter A as 01000001 and an exclamation point as 00100001. This results in codes that are easy to parse — every eight digits, or bits, correspond to one unique character — but horribly inefficient, because the same number of binary digits is used for both common and uncommon entries. A better approach would be something like Morse code, where the frequent letter E is represented by just a single dot, whereas the less common Q requires the longer and more laborious dash-dash-dot-dash.

Huffman coding is most commonly used to compress text-based data.

Yet Morse code is inefficient, too. Sure, some codes are short and others are long. But because code lengths vary, messages in Morse code cannot be understood unless they include brief periods of silence between each character transmission. Indeed, without those costly pauses, recipients would have no way to distinguish the Morse message dash dot-dash-dot dot-dot dash dot ("trite") from dash dot-dash-dot dot-dot-dash dot ("true").

Other types of data compression, such as JPEG, use Huffman coding as part of their algorithms.

Fano had solved this part of the problem. He realized that he could use codes of varying lengths without needing costly spaces, as long as he never used the same pattern of digits as both a complete code and the start of another code. For instance, if the letter S was so common in a particular message that Fano assigned it the extremely short code 01, then no other letter in that message would be encoded with anything that started 01; codes like 010, 011 or 0101 would all be forbidden. As a result, any recipient interested in unpacking the message would know that the number 01 represented an S, and the longer 011 or 0101 could only stand for other characters.

Huffman coding has been an essential part of coding languages since the 1950s.

Along came David Huffman, a first-year graduate student with a knack for problem-solving. After studying Fano’s challenge, he saw how — with some technical know-how — he could still achieve nearly perfect efficiency without the need for pauses. Huffman proposed the next great solution in data compression: a system of codes, with some characters represented by just a single digit, and others explained in five, ten, however many digits were needed to differentiate them. Fano’s students were so impressed with Huffman’s idea that they quickly adopted it for their final project.

Huffman’s technique was revolutionary. By assigning a variable number of digits to each character, it erected a kind of wall in which there was no danger of confusing one code for another. What’s more, Huffman’s system elegantly accounted for the unequal frequency of characters in a message. Arguably the most common character in English is the letter E; if you give it the briefest of codes — say, a single zero — then your entire message stands to benefit from the tiny amount of space that is saved.

The technology, known as Huffman coding, is named after its inventor, the MIT graduate student David Huffman. Today Huffman coding is one of the most-used data compression techniques in modern computer science and information theory. Copies of the code David Huffman wrote for this project still exist at MIT. Huffman coding is most commonly used to compress text-based data. Other types of data compression, such as JPEG, use Huffman coding as part of their algorithms. Huffman coding has been an essential part of coding languages since the 1950s.

hashtags #

datacompression huffmancoding informationtheory code ml ai dataencoding

worddensity #

code (9, 1.13%)
huffman (9, 1.13%)
message (8, 1%)
data (6, 0.75%)
coding (6, 0.75%)