MLCommons and Hugging Face Release Massive Speech Dataset for AI Research
In a groundbreaking collaboration, MLCommons, a nonprofit AI safety working group, has partnered with Hugging Face, a leading AI development platform, to release one of the world’s largest collections of public domain voice recordings. The dataset, aptly named Unsupervised People’s Speech, contains over a million hours of audio spanning at least 89 different languages. This release aims to support research and development in “various areas of speech technology,” according to MLCommons.
The initiative is driven by a desire to democratize access to speech technology research. “Supporting broader natural language processing research for languages other than English helps bring communication technologies to more people globally,” MLCommons stated in a recent blog post. The association anticipates that the dataset will fuel advancements in low-resource language speech models, improve speech recognition across diverse accents and dialects, and inspire novel applications in speech synthesis.
However, the release of such a massive dataset is not without its challenges. One significant concern is biased data. The recordings in Unsupervised People’s Speech were sourced from Archive.org, a nonprofit best known for its Wayback Machine web archival tool. Because many of Archive.org’s contributors are English-speaking Americans, the dataset is predominantly composed of American-accented English. As a result, AI systems trained on the dataset could struggle to understand non-native English speakers or to handle languages other than English.
Another issue is the potential inclusion of recordings from individuals unaware their voices are being used for AI research. While MLCommons asserts that all recordings are either public domain or available under Creative Commons licenses, the possibility of errors remains. A recent MIT analysis highlighted that hundreds of publicly available AI training datasets lack proper licensing information and contain inaccuracies.
Ed Newton-Rex, CEO of the AI ethics-focused nonprofit Fairly Trained, has been vocal about the ethical implications of such datasets. “Many creators (e.g. Squarespace users) have no meaningful way of opting out,” Newton-Rex wrote in a post on X last June. He argued that placing the burden of opting out on creators is unfair, especially when generative AI uses their work to compete with them.
Despite these concerns, MLCommons remains committed to refining the dataset. The organization has pledged to update, maintain, and improve the quality of Unsupervised People’s Speech. However, developers are urged to exercise caution when using the dataset to mitigate potential risks.
Key Highlights of Unsupervised People’s Speech
| Feature | Details |
|----------------------------|-----------------------------------------------------------------------------|
| Size | Over 1 million hours of audio |
| Languages | At least 89 languages |
| Source | Archive.org |
| Primary Use Case | Speech technology research and development |
| Licensing | Public domain or Creative Commons licenses |
| Potential Risks | Biased data (predominantly American-accented English), lack of opt-out mechanisms for creators, possible licensing errors |
This release marks a significant milestone in AI research, offering unprecedented opportunities for innovation while raising crucial ethical questions. As the AI community continues to explore the potential of Unsupervised People’s Speech, the balance between progress and responsibility remains a critical consideration.