The digital platform became Te Hiku’s first major step to establishing data sovereignty—a strategy in which communities seek control over their own data in an effort to ensure control over their future. For Māori, the desire for such autonomy is rooted in history, says Tahu Kukutai, a cofounder of the Māori data sovereignty network. During the earliest colonial censuses, after a series of devastating wars in which they killed thousands of Māori and confiscated their land, the British collected data on tribal numbers to track the success of the government’s assimilation policies.
Data sovereignty is thus the latest example of Indigenous resistance—against colonizers, against the nation-state, and now against big tech companies. “The nomenclature might be new, the context might be new, but it builds on a very old history,” Kukutai says.
In 2016, Jones embarked on a new project: to interview native te reo speakers in their 90s before their language and knowledge was lost to future generations. He wanted to create a tool that would display a transcription alongside each interview. Te reo learners would then be able to hover on words and expressions to see their definitions.
But few people had enough mastery of the language to manually transcribe the audio. Inspired by voice assistants like Siri, Mahelona began looking into natural-language processing. “Teaching the computer to speak Māori became absolutely necessary,” Jones says.
But Te Hiku faced a chicken-and-egg problem. To build a te reo speech recognition model, it needed an abundance of transcribed audio. To transcribe the audio, it needed the advanced speakers whose small numbers it was trying to compensate for in the first place. There were, however, plenty of beginning and intermediate speakers who could read te reo words aloud better than they could recognize them in a recording.
So Jones and Mahelona, along with Te Hiku COO Suzanne Duncan, devised a clever solution: rather than transcribe existing audio, they would ask people to record themselves reading a series of sentences designed to capture the full range of sounds in the language. To an algorithm, the resulting data set would serve the same function. From those thousands of pairs of spoken and written sentences, it would learn to recognize te reo syllables in audio.
The team announced a competition. Jones, Mahelona, and Duncan contacted every Māori community group they could find, including traditional kapa haka dance troupes and waka ama canoe-racing teams, and revealed that whichever one submitted the most recordings would win a $5,000 grand prize.
The entire community mobilized. Competition got heated. One Māori community member, Te Mihinga Komene, an educator and advocate of using digital technologies to revitalize te reo, recorded 4,000 phrases alone.
Money wasn’t the only motivator. People bought into Te Hiku’s vision and trusted it to safeguard their data. “Te Hiku Media said, ‘What you give us, we’re here as kaitiaki [guardians]. We look after it, but you still own your audio,’” says Te Mihinga. “That’s important. Those values define who we are as Māori.”
Within 10 days, Te Hiku amassed 310 hours of speech-text pairs from some 200,000 recordings made by roughly 2,500 people, an unheard-of level of engagement among researchers in the AI community. “No one could’ve done it except for a Māori organization,” says Caleb Moses, a Māori data scientist who joined the project after learning about it on social media.