How Synthetic Data is Solving AI's Data Problem

Saturday - June 17 2023, 16:41 UTC

tldr #

In 1987, at Carnegie Mellon University, the self-taught autonomous van Navlab showed the power of synthetic data to train AI systems. Today, synthetic data is proving useful in addressing privacy and bias concerns in facial recognition: Microsoft researchers have released a collection of 100,000 synthetic faces for training AI models, generated from scans of 500 people who gave their permission.

content #

On a sunny day in late 1987, a Chevy van drove down a curvy wooded path on the campus of Carnegie Mellon University in Pittsburgh. The hulking vehicle, named Navlab, wasn’t notable for its beauty or speed, but for its brain: It was an experimental version of an autonomous vehicle, guided by four powerful computers (for their time) in the cargo area. At first, the engineers behind Navlab tried to control the vehicle with a navigation algorithm, but like many previous researchers they found it difficult to account for the huge range of driving conditions with a single set of instructions.

Synthetic data can be used to supplement, or even replace, natural data for training neural networks.

So they tried again, this time using an approach to artificial intelligence called machine learning: The van would teach itself how to drive. A graduate student named Dean Pomerleau constructed an artificial neural network, made from small logic-processing units meant to work like brain cells, and set out to train it with photographs of roads under different conditions. But taking enough photographs to cover the huge range of potential driving situations was too difficult for the small team, so Pomerleau generated 1,200 synthetic road images on a computer and used those to train the system.

Microsoft’s synthetic data set spans a wide range of ethnicities, ages and styles.

The self-taught machine drove as well as anything else the researchers came up with. Navlab didn’t directly lead to any major breakthroughs in autonomous driving, but the project did show the power of synthetic data to train AI systems. As machine learning leapt forward in subsequent decades, it developed an insatiable appetite for training data. But data is hard to get: It can be expensive, private or in short supply.

Synthetic data can help address concerns about the privacy of individuals whose images appear in AI training datasets.

As a result, researchers are increasingly turning to synthetic data to supplement or even replace natural data for training neural networks. "Machine learning has long been struggling with the data problem," said Sergey Nikolenko, the head of AI at Synthesis AI, a company that generates synthetic data to help customers make better AI models. "Synthetic data is one of the most promising ways to solve that problem."

Navlab, a self-taught AI model, was a Carnegie Mellon University project exploring autonomous cars.

Fortunately, as machine learning has grown more sophisticated, so have the tools for generating useful synthetic data. One area where synthetic data is proving useful is in addressing concerns about facial recognition. Many facial recognition systems are trained with huge libraries of images of real faces, which raises issues about the privacy of the people in the images. Bias is also a problem, since various populations are over- and underrepresented in those libraries.

The Navlab project showed the power of synthetic data to train AI systems.

Researchers at Microsoft’s Mixed Reality & AI Lab have tackled these concerns, releasing a collection of 100,000 synthetic faces for training AI systems. These faces are generated from a set of 500 people who gave permission for their faces to be scanned. Microsoft’s system takes elements of faces from the initial set to make new and unique combinations, then adds visual flair with details like makeup and hair.

Bias in AI models caused by unrepresentative training data is a major concern; synthetic data can help overcome it.

The researchers say their data set spans a wide range of ethnicities, ages and styles. "There’s always a long tail of human diversity. We think and hope we’re capturing a lot of it," said Tadas Baltrušaitis, a Microsoft researcher who helped build the system.

hashtags #

syntheticdata aimodeltraining facialrecognition navlab carnegiemellonuniversity
