Large-Scale Sign Language Datasets: Resources, Benchmarks, and Annotation Standards

This ACL 2026 survey systematically reviews over 120 publicly available sign‑language datasets covering 35 languages, analyzes their modalities, annotation inconsistencies, and benchmark limitations, and proposes a 24‑field datasheet to promote reproducible and comparable AI research in sign language recognition, translation, and generation.

Data Party THU

1 Introduction

Sign languages are visual‑gestural languages used by more than 70 million deaf and hard‑of‑hearing (DHH) people. They combine a manual channel (handshape, location, movement, orientation) with a non‑manual channel (facial expression, mouth shape, gaze, body posture). The two channels are not strictly synchronized, which makes it difficult to model signing as a single sequence.

Automatic sign‑language technology focuses on three tasks: sign‑language recognition (SLR; video → gloss or class label), sign‑language translation (SLT; video → spoken‑language text), and sign‑language production (SLP; text or gloss → video, pose, or 3D mesh), also referred to as generation.

2 Background

Because most sign languages lack a standard written form, research relies on glosses as an intermediate representation. Glosses label signs with spoken‑language words but cannot capture spatial grammar, non‑manual signals, or fine‑grained semantics; more detailed notation systems such as HamNoSys exist but are more costly to annotate.

Task taxonomy:

SLR: isolated word/letter recognition and continuous gloss‑sequence recognition.

SLT: gloss‑based and gloss‑free translation from video to text.

SLP: synthesis of video, skeleton/key‑point pose, or 3D mesh from text or gloss.

Research progressed from fingerspelling and isolated‑word recognition to continuous recognition, sentence‑level translation, and video generation, yet remains concentrated on a few high‑resource languages and benchmarks.

3 Dataset Compendium

Datasets are grouped into three categories:

Fingerspelling datasets (static images or short clips of letters/numbers).

Isolated sign‑language datasets (single signs or short segments).

Continuous sign‑language datasets (longer sentences or natural discourse), which are essential for continuous sign‑language recognition (CSLR), SLT, and SLP.

Key continuous corpora compared include PHOENIX14T, CSL‑Daily, How2Sign, YouTube‑ASL, and OpenASL. Differences span scale, language, duration, vocabulary size, number of signers, domain, accessibility, annotation granularity, and file format.

Scale does not guarantee suitability: YouTube‑ASL and OpenASL are large and open‑domain but often lack pose/depth data or synchronized annotations; PHOENIX14T provides clean, reproducible annotations but is narrow‑domain with few training samples; CSL‑Daily is more everyday‑oriented yet suffers from access and recording constraints. These trade‑offs limit what models trained on each corpus can learn.

4 Benchmarks & Leaderboards

Five widely used benchmarks are aggregated: PHOENIX14T, CSL‑Daily, How2Sign, YouTube‑ASL, and OpenASL. Representative results are reported for SLR, SLT, and SLP.

Continuous sign‑language recognition: the best reported word error rate (WER) on PHOENIX14T is 17.9 %, while the lowest on CSL‑Daily is ≈24.1 %. The gap reflects dataset characteristics: PHOENIX14T is narrow‑domain with consistent annotations, whereas CSL‑Daily is more diverse and therefore a harder test of generalisation.
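For readers unfamiliar with the metric, WER is the word-level edit distance between the predicted and reference gloss sequences, divided by the reference length. The following is a minimal sketch, assuming whitespace-separated glosses; the function name and the example glosses are illustrative and not taken from the survey or from any benchmark's official scoring script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missed)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical gloss sequences: one substitution and one insertion
# against a four-token reference.
print(word_error_rate("MORGEN REGEN NORD WIND",
                      "MORGEN REGEN SUED WIND STARK"))  # 0.5
```

Benchmark scores are normally computed over a whole test set by summing the errors across sentences and dividing by the total number of reference tokens, rather than by averaging per-sentence values.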

Translation: gloss‑based methods achieve higher BLEU scores on PHOENIX14T because the intermediate gloss provides structured supervision; however, gloss annotation is costly and inconsistent across corpora. Gloss‑free methods reduce annotation dependence and are more extensible to low‑resource languages, but performance remains limited by data scale, video quality, and semantic alignment.

Generation: both Gloss‑to‑Pose and Text‑to‑Pose models are surveyed. Evaluation should combine BLEU with pose‑accuracy and motion‑quality measures such as MPJPE (mean per‑joint position error), Hand‑MJE, timing F1, and video‑quality scores, alongside human understandability studies involving the DHH community.
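Of the metrics listed, MPJPE is the simplest to state: the mean Euclidean distance between each predicted joint and its ground-truth position. A minimal sketch follows, assuming both sequences have the same number of frames and joints and share a coordinate system. The data layout (a list of frames, each a list of (x, y, z) tuples) is an assumption made for illustration, and temporal alignment is out of scope.

```python
import math


def mpjpe(predicted, target):
    """Mean per-joint position error, averaged over all frames and joints.

    Both arguments are lists of frames; each frame is a list of
    (x, y, z) joint coordinates in the same units.
    """
    if len(predicted) != len(target):
        raise ValueError("sequences must have the same number of frames")

    total, count = 0.0, 0
    for pred_frame, true_frame in zip(predicted, target):
        if len(pred_frame) != len(true_frame):
            raise ValueError("frames must have the same number of joints")
        for pred_joint, true_joint in zip(pred_frame, true_frame):
            total += math.dist(pred_joint, true_joint)  # Euclidean distance
            count += 1

    if count == 0:
        raise ValueError("no joints to compare")
    return total / count


# One frame, two joints: the first matches exactly, the second is off by 5.
target = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
predicted = [[(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]]
print(mpjpe(predicted, target))  # 2.5
```

Restricting the joint list to hand key points would give a hand-only error in the same spirit as Hand‑MJE, although the exact definition used in the survey is not given here.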

5 Dataset Challenges

Access and sustainability: Many of the >100 datasets have broken links, require NDAs, or depend on external platforms, jeopardising long‑term reproducibility.

Language and geographic imbalance: High‑resource sign languages (ASL, DGS, CSL, BSL) dominate; many African, indigenous, and village sign languages lack representation. Signer attributes (age, gender, region, hand dominance) are often missing.

Modality and annotation inconsistency: Datasets provide varying combinations of RGB, depth, pose, flow, skeleton, gloss, or 3D mesh. Gloss fields differ in naming, granularity, and alignment, increasing preprocessing cost and hindering cross‑dataset training.

Metadata completeness: Hand dominance, a crucial factor for bias analysis, is reported in only 10 of 108 surveyed datasets (≈9 %).
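A coverage figure such as the one for hand dominance can be recomputed from any machine-readable catalog of dataset records. The sketch below assumes each dataset is a dictionary and that a missing, empty, or "unknown" value means the attribute is unreported; the field name `hand_dominance` and the placeholder catalog are hypothetical and only reproduce the counts quoted above.

```python
def field_coverage(records, field):
    """Return (reported, total, fraction) for one metadata field."""
    total = len(records)
    if total == 0:
        raise ValueError("catalog is empty")
    # Treat absent, empty, and explicit "unknown" values as unreported.
    reported = sum(
        1 for record in records
        if record.get(field) not in (None, "", "unknown")
    )
    return reported, total, reported / total


# Placeholder catalog: 108 entries, of which 10 report hand dominance.
catalog = [
    {"name": f"dataset_{i}",
     "hand_dominance": "reported" if i < 10 else None}
    for i in range(108)
]

reported, total, fraction = field_coverage(catalog, "hand_dominance")
print(f"{reported}/{total} = {fraction:.1%}")  # 10/108 = 9.3%
```

Running the same function over other signer attributes (age, gender, region) would show which metadata gaps are most severe.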

6 Future Dataset Curation

Recommendations aim at realism, comparability, and reproducibility:

Collect videos covering greetings, medical, education, emergency, daily life, and news scenarios. Open‑platform sources (e.g., YouTube) increase topic diversity but require filtering of low‑quality or noisy clips because fine hand and facial details are vital.

Balance signer attributes (age, gender, region, dialect, hand dominance) and consider hand dominance when splitting evaluation sets to avoid right‑hand bias.

Adopt a modular annotation hierarchy: start with a unique ID and cleaned sentence‑level translation, then optionally add gloss, temporal boundaries, pose/skeleton, non‑manual signals, and Facial Action Units.

Use annotation tools suited for long‑term readability and interoperability: ELAN for hierarchical multimodal annotation, SignStream for fine‑grained linguistic transcription, and SLAN‑tool for AI‑assisted segmentation.

Introduce a 24‑field Sign‑Language Datasheet to record language, acquisition method, modality, signer information, annotation level, task suitability, licensing, accessibility, and evaluation setup. The datasheet is intended as an evolving framework rather than a final standard.
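The recommendation to consider hand dominance when splitting evaluation sets can be illustrated with a signer-disjoint split that is stratified by dominance. This is only one possible reading of that recommendation, not the survey's procedure: the function name, the `eval_fraction` parameter, and the rule of reserving at least one evaluation signer per group are all assumptions.

```python
import random
from collections import defaultdict


def split_signers(signers, eval_fraction=0.2, seed=0):
    """Split signer IDs into disjoint train and evaluation lists.

    signers maps signer_id -> hand-dominance label (a string such as
    "left" or "right"). Every dominance group with at least two signers
    contributes at least one signer to the evaluation set.
    """
    if not 0 < eval_fraction < 1:
        raise ValueError("eval_fraction must be between 0 and 1")

    groups = defaultdict(list)
    for signer_id, dominance in sorted(signers.items()):
        groups[dominance].append(signer_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, evaluation = [], []
    for dominance in sorted(groups):
        ids = groups[dominance]
        rng.shuffle(ids)
        if len(ids) < 2:
            n_eval = 0  # a lone signer stays in training
        else:
            # At least one evaluation signer, but never the whole group.
            n_eval = min(len(ids) - 1,
                         max(1, round(len(ids) * eval_fraction)))
        evaluation.extend(ids[:n_eval])
        train.extend(ids[n_eval:])
    return sorted(train), sorted(evaluation)


# Hypothetical signer pool: eight right-handed and two left-handed signers.
signers = {f"r{i}": "right" for i in range(8)}
signers.update({"l0": "left", "l1": "left"})
train_ids, eval_ids = split_signers(signers)
print(len(train_ids), len(eval_ids))  # 7 3
```

Splitting by signer rather than by clip keeps the same person out of both sets, and stratifying by dominance ensures that left-handed signing is actually evaluated instead of being absorbed into the training data.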

7 Conclusion

The survey re‑examines sign‑language AI from a dataset perspective, showing that insufficient coverage, inconsistent annotation, missing metadata, and narrow benchmarks limit real‑world generalisation even when leaderboard scores improve.

Four contributions are highlighted:

Compilation of 120 datasets spanning 35 sign languages.

Systematic analysis of modality imbalance, signer bias, annotation inconsistency, and benchmark fragmentation.

Aggregation of representative leaderboards for SLR, SLT, and SLP.

Proposal of a 24‑field datasheet and release of a public GitHub repository (https://github.com/Ginqwerty/Open-Sign-Language) to foster standardized documentation and reproducible evaluation.

8 Limitations

Public corpora remain concentrated on ASL, DGS, CSL, and BSL; metadata extracted from original papers and repositories may be incomplete; quantitative leaderboards cover only five flagship datasets; UMAP visualisations rely on a single random seed; and the 24‑field datasheet has not yet undergone extensive validation by the DHH community.

9 Broader Impact & Ethical Considerations

Sign‑language recordings contain identifiable faces and body movements, requiring careful handling of signer privacy, licensing terms, and access control. Benchmarks dominated by white, Western, or high‑resource signers risk amplifying existing inequities for minority groups. The authors advocate community‑led data collection involving DHH participants in design, review, and feedback.

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
