Mundo AI - High Quality Multilingual Training Data for AI Models

TL;DR

AI models are great at English but struggle with almost every other language. So we are building the world’s largest and highest-quality multilingual data library to help AI labs build better non-English models.

The Story

When Jason was working on AI research abroad, he found that it was incredibly difficult to find training data in non-English languages. Because of this, his peers were all working on English models rather than ones in their native language.

The Problem

After speaking with researchers and entrepreneurs around the world, we saw clearly that AI usability is dramatically behind in non-English languages - even for major languages like Hindi and Arabic. The cause is a severe shortage of high-quality training data in those languages. That leaves the 75% of the world that does not speak English out of the AI revolution.

Data has been a major bottleneck for researchers and AI labs building multilingual AI models, and the demand for better and larger datasets is only increasing.

Current workarounds such as synthetic data and machine translation simply don’t achieve the desired results, and open-source efforts fail to produce datasets in the quantity and quality required.

How we are solving this

We work directly with native speakers to build novel, high-quality datasets. We do this by setting up end-to-end operations in the countries where a language's native speakers live, and by using our proprietary software platform to streamline data collection, generation, annotation, and quality assurance.

Demo Video

https://www.youtube.com/watch?v=zZiilPrhDJs

The Team

Jason Liao helped build a record-breaking fraud detection AI model at Tsinghua University. Before that, he led a quant research team at a $60B quant hedge fund.

Kenneth Wu was a quant at Canada’s largest quant fund. He previously worked as a software engineer at Amazon Web Services and as an analyst at the Ontario Teachers’ Pension Plan.

Naijide Anwaer was the youngest Platform PM at Binance US. He speaks 4 languages.

Garreth Lee was an ML engineer and the first Indonesian at Hugging Face, where he helped build the world’s best open pre-training dataset. He was previously a member of technical staff at Cohere.

Also, a shoutout to our founding PM, @Ahnaf Muqset Haque.

Our Ask

Do you know any researchers or data partnership managers at any AI labs? We’d love to get in touch! We’re trying to learn as much as we can about the data bottlenecks that are preventing researchers from making progress.

You can reach us at contact@mundoai.world

Mundo AI

Jason Liao, CEO and Founder

Naijide Anwaer, Founder

Garreth Lee, Founder

Kenneth Wu, Founder