1. Introduction

A comprehensive guide to dataset design, annotation, and task formulation for building reliable and responsible language AI systems.

Welcome to the dataset design and annotation playbook!

This playbook will help you plan and develop training and evaluation datasets, define annotation schemas, and design AI tasks across different languages, domains, and modalities. It provides guidance on dataset structuring, labeling strategies, and ethical considerations for language technologies.

Who is this playbook for?

This playbook is designed for:

  • Researchers working on NLP dataset creation and evaluation
  • Annotation teams developing labeled datasets
  • Project managers overseeing data collection and annotation workflows
  • AI practitioners designing and evaluating language models
  • Students and academics studying dataset design and annotation
  • Multilingual communities contributing to language resources

What will you learn?

By the end of this playbook, you will understand:

  • How to define the purpose and scope of a dataset
  • The differences between training and evaluation datasets
  • Trade-offs between scale and quality
  • How to design label schemas and ontologies
  • Approaches to single-label, multi-label, and structured outputs
  • How to handle ambiguity, edge cases, and annotation boundaries
  • Best practices for multilingual and cross-lingual dataset design
  • Ethical considerations, risks, and limitations in dataset creation

How to use this playbook

Each section of this playbook contains:

  • Clear explanations of dataset design principles
  • Structured guidance for task and schema definition
  • Examples and edge cases to support annotation decisions
  • Practical recommendations for dataset creation workflows
  • Ethical considerations to guide responsible use

Getting Started

Ready to begin? Start with our foundational sections:

  1. Purpose of this Playbook – Understand target users, scope, and intended use
  2. How to Use This Playbook – Learn how to navigate chapters and contribute
  3. Dataset Types and Design Goals – Explore dataset categories and trade-offs
  4. Task and Schema Definition – Define tasks, labels, and annotation structures
  5. Glossary and Terminology – Learn key concepts and definitions

Purpose of this Playbook

  • Target users and communities
  • Languages, domains, and modalities covered
  • Intended use and risks

Dataset Types and Design Goals

  • Training vs evaluation datasets
  • General-purpose vs domain-specific datasets
  • Scale vs quality trade-offs
  • Monolingual, multilingual, and cross-lingual setups

Task and Schema Definition

  • Task formulation (classification, generation, alignment, retrieval)
  • Label schema and ontology design
  • Multi-label vs single-label vs structured outputs
  • Ambiguity, edge cases, and annotation boundaries

Glossary and Terminology

A reference section providing clear definitions of key terms used throughout the playbook.
