SignThought: A New Gloss‑Free Sign Language Translation Framework for the Deaf Community
The paper introduces SignThought, a gloss‑free sign language translation model that inserts an ordered latent‑thought chain between video encoding and text generation and decodes with a plan‑then‑ground strategy. It is evaluated on five benchmarks, including the newly built 1,311‑hour LC‑HKSLT dataset, and reports the highest gloss‑free BLEU‑4 on all five and the best ROUGE on four of them.
Research Background
Sign language translation (SLT) is crucial for reducing communication barriers faced by deaf and hard‑of‑hearing communities, yet most existing methods assume a direct alignment between video fragments and lexical glosses, which fails in real‑world scenarios where meaning depends on motion trajectories, spatial relations, and context.
Core Method: SignThought
SignThought consists of three modules:
Sign Encoder: encodes raw sign videos into dense temporal evidence features.
Latent Chain‑of‑Thought Thinking Module: compresses the evidence into an ordered sequence of learnable thought slots, each representing a progressively refined semantic concept.
Dual‑Stream Decoder: first plans the semantic output using the thought chain, then grounds each planned token by retrieving the corresponding video evidence, implementing a plan‑then‑ground decoding scheme.
This design creates an explicit intermediate reasoning interface, separating semantic decision from evidence retrieval and allowing the model to align generated text with specific video segments.
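The three modules above can be sketched as a toy data flow. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the encoder output is replaced by random features, the thought slots are filled by plain scaled dot‑product attention, and all sizes and names (evidence, slot_queries, token_state) are hypothetical. The paper's routing and prior‑injection mechanisms are not modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Scaled dot-product attention; returns (output, attention weights)."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
T, K, D = 40, 6, 16   # frames, thought slots, feature width (illustrative sizes)

# 1) Sign encoder: random stand-in for dense temporal evidence features.
evidence = rng.normal(size=(T, D))

# 2) Latent thinking: fill the slots in order. Each slot reads the evidence
#    plus the slots filled before it, so later thoughts can refine earlier ones.
slot_queries = rng.normal(size=(K, D))   # assumption: these would be learned
filled = []
for k in range(K):
    memory = np.vstack([evidence] + filled)
    slot, _ = attend(slot_queries[k:k + 1], memory, memory)
    filled.append(slot)
thoughts = np.vstack(filled)             # shape (K, D)

# 3) Plan-then-ground for a single decoding step.
token_state = rng.normal(size=(1, D))    # stand-in for a decoder hidden state
plan, plan_weights = attend(token_state, thoughts, thoughts)   # plan over thoughts
grounded, ground_weights = attend(plan, evidence, evidence)    # ground in frames

print("thought slot with most weight:", int(plan_weights.argmax()))
print("frame with most weight:", int(ground_weights.argmax()))
```

In this sketch the grounding weights are what would tie a generated token back to specific frames: the planned vector, not the raw decoder state, is used as the query into the video evidence.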
Dataset Construction
The authors also release LC‑HKSLT, a large‑scale Hong Kong sign language dataset collected from broadcast‑style videos. It contains 1,311 hours of video, 432K clips, 14 signers, and a vocabulary of 125,833 sentence‑level captions, without any gloss annotations. A curated 30‑hour subset is provided for fair comparison with existing Chinese SLT benchmarks.
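As a rough sanity check on scale, the reported figures imply an average clip length and the share of the data covered by the curated subset. The sketch below uses only the numbers quoted above; the clip count is a rounded figure in the source, so the results are approximate.

```python
hours_total = 1311
clips_total = 432_000      # the source reports a rounded clip count
hours_subset = 30

mean_clip_seconds = hours_total * 3600 / clips_total
subset_fraction = hours_subset / hours_total

print(f"{mean_clip_seconds:.1f} s per clip on average")    # 10.9 s
print(f"{subset_fraction:.1%} of the full set by duration")  # 2.3%
```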
Experimental Results
SignThought was evaluated on five SLT benchmarks (PHOENIX14T, CSL‑Daily, How2Sign, OpenASL, and LC‑HKSLT). It achieved the highest gloss‑free BLEU‑4 scores on all datasets and the best ROUGE scores on PHOENIX14T, How2Sign, OpenASL, and LC‑HKSLT. Representative results include:
PHOENIX14T: 27.22 BLEU‑4 / 54.50 ROUGE
CSL‑Daily: 23.92 BLEU‑4 / 50.99 ROUGE
How2Sign: BLEU‑4 improved from 9.37 to 13.39
OpenASL: BLEU‑4 improved from 13.21 to 19.55
LC‑HKSLT (30‑hour subset): 30.22 BLEU‑4 / 60.01 ROUGE after pre‑training on the full set.
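For the two datasets reported as before/after pairs, the absolute and relative gains follow directly from the quoted scores. The short sketch below works them out; the "before" values are whatever baseline the source compares against, which this summary does not name.

```python
# (before, after) BLEU-4 pairs exactly as quoted in the list above.
results = {
    "How2Sign": (9.37, 13.39),
    "OpenASL": (13.21, 19.55),
}

def gain(before, after):
    """Return (absolute gain in BLEU-4 points, gain relative to the baseline)."""
    absolute = after - before
    return absolute, absolute / before

for name, (before, after) in results.items():
    absolute, relative = gain(before, after)
    print(f"{name}: +{absolute:.2f} BLEU-4 ({relative:.1%} relative)")
# How2Sign: +4.02 BLEU-4 (42.9% relative)
# OpenASL: +6.34 BLEU-4 (48.0% relative)
```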
Ablation studies show that removing the latent thinking module causes the largest performance drop, while disabling causal thought updates, structured routing, the dual‑stream decoder, or thought‑guided prior injection each leads to measurable degradation, confirming that the combined mechanisms are responsible for the gains.
Conclusion and Outlook
The work reframes sign language translation as a cross‑modal reasoning problem rather than a simple video‑to‑text mapping. By introducing latent thoughts and a plan‑then‑ground pipeline, SignThought demonstrates that explicit intermediate semantic planning improves fidelity and grounding. Future directions include making the latent planning more interpretable and integrating explicit semantic structures or controllable reasoning to further enhance accuracy and explainability.
Machine Learning Algorithms & Natural Language Processing
