Thought-Based Gloss-Free Sign Language Translation Model for the Deaf (ACL 2026)

The paper introduces SignThought, a gloss‑free sign language translation framework built on a latent chain‑of‑thought reasoning layer and a plan‑then‑ground decoder. It reports the best gloss‑free BLEU‑4 on all five evaluated benchmarks and the best ROUGE on four of them, and releases LC‑HKSLT, a large new Hong Kong sign language dataset.

Machine Heart

May 4, 2026

Research Background

Deaf and hard‑of‑hearing communities face high barriers in accessing information because mainstream communication relies on speech and text; sign language translation (SLT) is therefore crucial for improving social inclusion. However, SLT is more than a simple video‑to‑text mapping: meaning emerges from motion trajectories, spatial positions, body orientation, and contextual relations, making direct segment‑to‑word alignment insufficient.

Problem Statement

Existing gloss‑free approaches implicitly couple two tasks: deciding what semantic content to express, and locating the supporting evidence in long videos. This coupling leads to unstable semantic planning, scattered attention, and translations that may be fluent but are not faithfully grounded in the video.

Core Method (SignThought)

SignThought consists of three modules:

Sign Encoder: encodes the input sign video into dense temporal evidence features.

Latent Chain‑of‑Thought Thinking Module: compresses the sequential evidence into an ordered set of learnable thought slots, forming a latent thought chain that serves as an explicit intermediate semantic interface.

Dual‑Stream Decoder: first uses the thought chain to plan the target sentence (semantic planning), then grounds each planned token by retrieving the corresponding video evidence (evidence retrieval), and finally generates the translation.
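To make the idea of compressing frame‑level evidence into a short, ordered thought chain more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the use of scaled dot‑product attention, the way each slot is conditioned on the previous thought, and the names `compress_to_thoughts`, `frame_feats`, and `slot_queries` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compress_to_thoughts(frame_feats, slot_queries):
    """Pool T frame vectors into K thought vectors, one slot at a time.

    frame_feats:  list of T vectors (stand-in for the encoder output).
    slot_queries: list of K vectors (stand-in for learnable thought slots).
    Each slot's query is shifted by the previous thought, a crude
    illustration of an ordered (causal) update. Hypothetical design.
    """
    dim = len(frame_feats[0])
    scale = math.sqrt(dim)
    thoughts = []
    previous = [0.0] * dim
    for query in slot_queries:
        conditioned = [q + p for q, p in zip(query, previous)]
        weights = softmax([dot(conditioned, f) / scale for f in frame_feats])
        thought = [
            sum(w * f[d] for w, f in zip(weights, frame_feats))
            for d in range(dim)
        ]
        thoughts.append(thought)
        previous = thought
    return thoughts


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # T = 3 toy frames
    slots = [[2.0, 0.0], [0.0, 2.0]]                # K = 2 toy slots
    for i, thought in enumerate(compress_to_thoughts(frames, slots)):
        print(i, [round(v, 3) for v in thought])
```

Whatever the video length T, the output always has K vectors, which is what lets a downstream decoder plan over a fixed, short sequence instead of the raw frames.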

The design follows a plan‑then‑ground decoding strategy that separates semantic decision from evidence search, reducing interference between the two processes.
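The separation described here can be sketched as a single decoding step with two attention passes: one over the thought chain (plan), and a second over the frame features that is driven by the plan rather than by the raw decoder state (ground). This is a toy illustration under stated assumptions, not the paper's decoder; `attend`, `plan_then_ground_step`, and the additive fusion are hypothetical choices.

```python
import math


def attend(query, memory):
    """Scaled dot-product attention of one query over a list of vectors.

    Returns the weighted sum of `memory` and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, vec)) / scale for vec in memory]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memory[0])
    context = [sum(w * vec[d] for w, vec in zip(weights, memory)) for d in range(dim)]
    return context, weights


def plan_then_ground_step(state, thoughts, frame_feats):
    """One decoding step: plan over thoughts, then ground in the video."""
    # Plan: decide what to say by consulting only the short thought chain.
    plan, plan_weights = attend(state, thoughts)
    # Ground: use the plan (not the raw state) to locate supporting frames.
    evidence, frame_weights = attend(plan, frame_feats)
    # Fuse both streams; a real model would project this onto a vocabulary.
    fused = [p + e for p, e in zip(plan, evidence)]
    return fused, plan_weights, frame_weights


if __name__ == "__main__":
    thoughts = [[1.0, 0.0], [0.0, 1.0]]
    frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    fused, plan_w, frame_w = plan_then_ground_step([3.0, 0.0], thoughts, frames)
    print("plan weights:", [round(w, 3) for w in plan_w])
    print("frame weights:", [round(w, 3) for w in frame_w])
```

Because the frame query is derived from the plan, the search over the long video is conditioned on a semantic decision that has already been made, which is the interference‑reducing property the article attributes to the design.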

Dataset Construction

The authors also release LC‑HKSLT, a large‑scale Hong Kong sign language dataset collected from broadcast‑style videos. It contains 1,311 hours of video, 432K clips, 14 signers, and a vocabulary of 125,833 tokens. A curated 30‑hour subset is provided for fair comparison with existing Chinese SLT benchmarks.
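As a quick sanity check on scale, the figures above imply an average clip length and a subset share that can be worked out directly. The only assumption is that "432K" is treated as exactly 432,000 clips, so the results are approximate.

```python
hours_total = 1311
clips_total = 432_000   # "432K" in the article, so this is a rounded count
subset_hours = 30

avg_clip_seconds = hours_total * 3600 / clips_total
subset_share = subset_hours / hours_total

print(f"average clip length: {avg_clip_seconds:.1f} s")   # 10.9 s
print(f"curated subset share: {subset_share:.1%}")        # 2.3%
```

In other words, the clips average roughly 11 seconds each, and the curated comparison subset covers only about 2% of the full collection.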

Experimental Results

SignThought was evaluated on five benchmarks: PHOENIX14T, CSL‑Daily, How2Sign, OpenASL, and the newly introduced LC‑HKSLT. It achieved the highest gloss‑free BLEU‑4 scores across all datasets and the best ROUGE scores on PHOENIX14T, How2Sign, OpenASL, and LC‑HKSLT. Representative results include:

PHOENIX14T: 27.22 BLEU‑4 / 54.50 ROUGE

CSL‑Daily: 23.92 BLEU‑4 / 50.99 ROUGE

How2Sign: 13.39 BLEU‑4 (up from 9.37)

OpenASL: 19.55 BLEU‑4 (up from 13.21)

LC‑HKSLT (30‑hour subset): 21.15 BLEU‑4 / 47.87 ROUGE, improving to 30.22 BLEU‑4 / 60.01 ROUGE after pre‑training on the full set and fine‑tuning.
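The before/after BLEU‑4 pairs quoted above can be turned into absolute and relative gains with a few lines of Python. Note that the How2Sign and OpenASL pairs compare SignThought with the earlier figure the article cites, while the LC‑HKSLT pair compares the same model with and without full‑set pre‑training; `relative_gain` is a helper name introduced here for illustration.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


bleu4_pairs = {
    "How2Sign": (9.37, 13.39),
    "OpenASL": (13.21, 19.55),
    "LC-HKSLT (subset only vs. full-set pre-training)": (21.15, 30.22),
}

for name, (before, after) in bleu4_pairs.items():
    gain = relative_gain(before, after)
    print(f"{name}: +{after - before:.2f} BLEU-4 ({gain:.1%} relative)")
```

This works out to relative gains of roughly 43% on How2Sign, 48% on OpenASL, and 43% from pre‑training on the full LC‑HKSLT set.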

Ablation Study

Removing the latent thinking module caused the largest performance drop. Omitting causal thought updates, structured routing, the dual‑stream decoder, or thought‑guided prior injection each led to measurable degradation, confirming that the improvement stems from the combined effect of the intermediate reasoning chain, routing mechanism, and grounding process.

Conclusion and Outlook

SignThought reframes sign language translation as a cross‑modal reasoning problem rather than a direct video‑to‑text mapping. By introducing explicit latent thoughts and a plan‑then‑ground decoder, the model demonstrates strong, stable performance on large‑scale real‑world data. Future work may make the latent reasoning chain more explicit and controllable, potentially enabling explanations of why a particular translation was produced and advancing multimodal understanding and generation.
