
Build African language datasets, the right way.

An open playbook and annotation platform for grassroots NLP data collection — designed with communities, for communities, across the continent.

Read the Playbook in: Hausa · Amharic · Swahili · Français · Português

Masakhane Playbook

A practical guide to dataset creation, written with the communities who use it — from task formulation and label schema design to consent forms, inter-annotator agreement, and sustainability. Every chapter is built around real low-resource language scenarios.

  • Step-by-step guidelines, video demos, and quality checklists
  • Voice, text, speech–text alignment, and translation chapters
  • Templates for consent, licensing, and governance toolkits
  • Translated into 5 African languages with community review

Masakhane Tool

An open, mobile-first Progressive Web App for grassroots data collection, built for the realities of African contexts: patchy connectivity, multiple scripts, and community-led annotation workflows.

  • Offline-first capture with background synchronization
  • Speech, text, ranking, and multimodal annotation support
  • Inter-annotator agreement (Fleiss' κ, Krippendorff's α) dashboards
  • African-language localization and virtual keyboards
  • Apache 2.0 licensed with a clear contributor agreement
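
For readers unfamiliar with the agreement statistics named in the feature list, here is a minimal sketch of Fleiss' κ in plain Python. It is an illustration only and makes no claim about how the Masakhane Tool computes its dashboards. The function name `fleiss_kappa`, the input layout (one list of labels per item, same number of annotators for every item), and the example labels are all assumptions made for this sketch.

```python
from collections import Counter
from typing import Hashable, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss' kappa for items that each carry the same number of labels.

    `ratings[i]` holds the labels that the annotators gave to item i.
    (Hypothetical helper for illustration; not part of any Masakhane API.)
    """
    if not ratings:
        raise ValueError("need at least one item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of labels")

    n_items = len(ratings)
    category_totals: Counter = Counter()
    per_item_agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of annotator pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        per_item_agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = per_item_agreement_sum / n_items
    # Chance agreement from the overall proportion of each category.
    total_labels = n_items * n_raters
    expected = sum((t / total_labels) ** 2 for t in category_totals.values())

    if expected == 1.0:
        # Only one category was ever used, so kappa is undefined;
        # returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Three hypothetical annotators labelling four items.
    example = [
        ["pos", "pos", "pos"],
        ["neg", "neg", "pos"],
        ["neg", "neg", "neg"],
        ["pos", "neg", "pos"],
    ]
    print(f"Fleiss' kappa: {fleiss_kappa(example):.3f}")
```

Values near 1 indicate agreement well above chance, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement. Krippendorff's α, also named in the list, additionally handles missing labels and is not covered by this sketch.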

From the Community

The Playbook is exactly the practical, reproducible guide that African NLP has needed — a real reference, not a brochure.

Prof. Vukosi Marivate
Co-founder, Masakhane · University of Pretoria

Pairing the Playbook with the Tool turns annotation theory into reproducible practice — that combination is what makes it useful in the field.

Dr. David Adelani
NLP Researcher, Masakhane · University College London

Documenting low-resource language work has long been ad-hoc — a shared playbook gives our teams a common vocabulary and saves a lot of guesswork.

Lilian Wanzare
NLP Researcher · Maseno University

Open guidance like this lowers the barrier for builders across the continent to ship language-first AI products responsibly.

Pelonomi Moiloa
Co-founder & CEO · Lelapa AI

Open infrastructure for African languages is finally catching up with the rest of the field. This is a major milestone for the community.

Grema
AI Researcher · Microsoft

A community-built standard for low-resource annotation. Long overdue and well executed — the kind of resource teams will reach for daily.

Aishwarya
Research, Language Technologies · Google

For multilingual annotation across Bantu languages, this is the resource I wish we had had years ago — clear, applicable, and honest about trade-offs.

Peter Nabende
NLP Researcher · Makerere University

The combination of methodological rigor and African-context grounding makes this stand out from generic NLP guides — a long-overdue reference.

Prof. Muhammad Abdul-Mageed
NLP Lab Lead · University of British Columbia

Thanks to our Contributors

The Playbook is built by a growing community of researchers, students, and language experts. If you've contributed code, content, or review — thank you.

SUPPORTED BY

Masakhane African Languages Hub · Bayero University, Kano · Bahir Dar University · HausaNLP · EthioNLP