Investigations expose inadequate attempts to purge datasets of fascist, pirated, and harmful content.
Several investigations have revealed the fascist, pirated, and malicious origins of some of the data used to train the largest and most powerful artificial intelligence models, renewing concerns about how these systems are built.
Google’s Colossal Clean Crawled Corpus (C4), built from more than 15 million websites, is used to train the company’s LaMDA AI as well as Meta’s rival LLaMA model. The dataset is publicly available, but its sheer size makes its contents difficult to scrutinize. It is a sanitized version of the larger Common Crawl dataset, with offensive language, racist slurs, and other “noisy” content filtered out.
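This kind of cleaning is generally done with automated, rule-based filters rather than human review. As a rough, hypothetical sketch of how a blocklist-style filter can work in principle (the word list, thresholds, and rules below are invented for illustration and are not Google’s actual pipeline):

```python
from typing import Optional

# Illustrative sketch only: a simplified blocklist-style filter in the spirit
# of C4-like cleaning. The word list, thresholds, and rules are hypothetical
# placeholders, not Google's actual pipeline.

BLOCKLIST = {"badword1", "badword2"}   # stand-in for a real "bad words" list
MIN_WORDS_PER_LINE = 5                 # drop short, "noisy" lines such as menus

def clean_page(text: str) -> Optional[str]:
    """Return the cleaned page text, or None if the page should be discarded."""
    kept_lines = []
    for line in text.splitlines():
        words = line.split()
        if len(words) < MIN_WORDS_PER_LINE:
            continue                   # skip boilerplate fragments
        if any(w.lower().strip(".,!?\"'") in BLOCKLIST for w in words):
            return None                # discard the whole page on a blocklist hit
        kept_lines.append(line)
    return "\n".join(kept_lines) or None
```

Filters like this are crude by design: they scale to billions of pages, but they also explain why offensive or low-quality material can slip through, and why benign pages can be dropped by mistake.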
Some organizations skip Google’s data “cleaning” step entirely, giving their systems a broader range of data to learn from. On Wednesday, London-based Stability AI launched its new StableLM language model, trained on an 850GB dataset called the Pile. The Pile contains the unfiltered Common Crawl database, 2 million pirated ebooks from the BitTorrent site Bibliotik, 100GB of data scraped from GitHub, and more obscure sources such as every internal email sent by the defunct energy company Enron and the complete proceedings of the European Parliament.
The Pile is publicly hosted by a group of anonymous “data enthusiasts” known as the Eye, whose copyright policy links to a video of a choir of clothed women simulating masturbation while singing. Stability AI trains on a private version of the dataset that it says is three times larger. The company has not disclosed what the additional content is, saying only that it gives StableLM “surprisingly high performance” in conversational and coding tasks.