The Dangerous Developments of Language Models: A Case Study of OpenAI's GPT-4o
Category: Artificial Intelligence - Friday, May 24, 2024, 00:39 UTC

OpenAI's recent release of GPT-4o, a multimodal model that can interact with humans through voice, text, and video, has drawn backlash after observers found its Chinese token vocabulary riddled with pornography and spam phrases. The episode highlights a growing concern in AI development: the scarcity of high-quality, unbiased training data. The heavy reliance on spam and state media content also points to a narrow range of language usage in the underlying data, and OpenAI's lack of transparency about its training data sources raises questions about its commitment to ethical AI development.
Last week, OpenAI released GPT-4o, its newest AI model, which can interact with humans through voice, text, or video. The launch was supposed to be a groundbreaking moment, but just days later the company is facing criticism from all sides, from resignations on its own safety team to actress Scarlett Johansson's accusation that it replicated her voice without permission.
However, the biggest issue with GPT-4o may lie in its Chinese training data. As several researchers have pointed out, anyone fluent in Chinese can spot the problem simply by reading the longest Chinese entries in the token vocabulary that ships with the model: they are overwhelmingly phrases lifted from pornography and gambling spam. Some adult content slipping into a large training set is not unusual, but because a tokenizer's vocabulary is learned from the frequency statistics of its training corpus, long tokens devoted to spam phrases are a direct signal that such content was pervasive in the Chinese data OpenAI used; reportedly it accounts for roughly 90 percent of the longest Chinese tokens.
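This kind of inspection is straightforward to reproduce. Below is a minimal sketch, assuming GPT-4o's vocabulary is the publicly available o200k_base encoding exposed by OpenAI's tiktoken library; the "mostly CJK characters" heuristic and the top-20 cutoff are illustrative choices of mine, not part of the original analysis.

```python
# Sketch of the vocabulary inspection described above, assuming GPT-4o's
# tokenizer is the o200k_base encoding in OpenAI's tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def is_mostly_chinese(text: str) -> bool:
    """Treat a token as Chinese if over half its characters fall in the basic CJK block."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return bool(text) and cjk / len(text) > 0.5

chinese_tokens = []
for token_id in range(enc.n_vocab):
    try:
        text = enc.decode_single_token_bytes(token_id).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        continue  # skip gaps in the ID space and tokens that split a UTF-8 sequence
    if is_mostly_chinese(text):
        chinese_tokens.append(text)

# The longest entries are the strings critics flagged as porn and gambling spam.
for token in sorted(chinese_tokens, key=len, reverse=True)[:20]:
    print(len(token), repr(token))
```

Running a script along these lines simply lists the longest Chinese token strings in the vocabulary, which is essentially the evidence critics have been pointing to.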
Zhengyang Geng, a PhD student in computer science, shared his thoughts on the issue: "It's an embarrassing thing to see as a Chinese person. Is it just a reflection of the quality of the Chinese data? Or is the language itself so ridden with inappropriate content?"
Further investigation revealed that GPT-4o's Chinese training data leans heavily on state media and spam content. That is a serious problem for a large language model, because it means the data reflects a narrow and often biased slice of how the language is actually used.
The scarcity of high-quality training data is a growing concern across the AI industry. In GPT-4o's case, the failure to filter out porn and spam may be only the tip of the iceberg; the deeper issue is the lack of effort put into identifying and curating appropriate data sets, especially for languages and regions with little accessible online content. That shortfall not only degrades the models themselves but can also cause real harm to the communities that use them.
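To make the missing step concrete, here is a minimal sketch of a pre-tokenization cleaning pass, assuming a plain-text corpus with one document per line; the file name and the tiny blocklist of spam phrases are hypothetical placeholders, and production pipelines rely on trained quality classifiers and large-scale deduplication rather than anything this simple.

```python
# Minimal pre-tokenization cleaning pass, assuming one document per line.
# The blocklist and input file name are hypothetical placeholders.
SPAM_PHRASES = ["免费视频", "彩票平台", "赌场"]  # "free videos", "lottery platform", "casino"

def clean_corpus(lines):
    """Yield lines that are non-empty, not exact duplicates, and free of blocklisted phrases."""
    seen = set()
    for line in lines:
        text = line.strip()
        if not text or text in seen:
            continue
        if any(phrase in text for phrase in SPAM_PHRASES):
            continue
        seen.add(text)
        yield text

if __name__ == "__main__":
    with open("raw_corpus_zh.txt", encoding="utf-8") as f:  # hypothetical input file
        for doc in clean_corpus(f):
            print(doc)
```

Even a crude pass like this, run before the tokenizer is trained, keeps the most obvious spam phrases from appearing often enough to be merged into long dedicated tokens.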
OpenAI has remained tight-lipped about the sources of its Chinese training data and seems unlikely to disclose them. That silence raises questions about the company's commitment to responsible and ethical AI development.
In conclusion, GPT-4o's release has sparked a much-needed discussion about the dangers of relying on biased and inappropriate training data. As AI continues to advance, it is more important than ever to prioritize the collection and curation of high-quality, unbiased training data.