
End-to-End Speech Relation Extraction

This paper presents an end‑to‑end approach for extracting relational triples directly from speech signals, bypassing intermediate transcription, and demonstrates its effectiveness on synthesized speech versions of the CoNLL04 and TACRED datasets, highlighting challenges such as length constraints and cross‑modal alignment.


With the rapid growth of big data and multimodal content, much of the information on the Internet is semi-structured or unstructured, making low-cost information extraction increasingly important. Relation extraction, a core task of information extraction, aims to identify entity pairs and their semantic relations in natural language text, producing structured (head entity, relation, tail entity) triples.

While most prior work focuses on textual data, speech also contains rich relational cues (e.g., interviews, news, conversations). Traditional pipelines first transcribe speech to text and then apply text‑based relation extraction, which introduces additional errors. This study proposes the first end‑to‑end method that directly extracts relations from speech, reducing error propagation and improving performance.

Related work on relation extraction includes pipeline methods that separate named‑entity recognition and relation classification, as well as joint extraction models such as Nguyen et al. (2019) that mitigate error accumulation. Speech recognition has evolved from HMM/GMM models to deep learning‑based encoders like wav2vec 2.0, which greatly improve accuracy.

To build a speech relation-extraction dataset, the authors synthesize audio from existing text relation corpora using a two-step TTS pipeline (text-to-spectrogram, then spectrogram-to-waveform). Five pretrained models are evaluated across the two stages (Glow-TTS, Speedy-Speech-WN, and Tacotron2-DCA for the acoustic stage; MultiBand-MelGAN and WaveGrad for the vocoder stage), and the Tacotron2-DCA + MultiBand-MelGAN combination is selected for its naturalness.
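The two-step synthesis above can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: the classes are hypothetical stand-ins for pretrained checkpoints such as Tacotron2-DCA and MultiBand-MelGAN, and the frame/hop constants are assumed values chosen only to make the data flow concrete.

```python
# Sketch of the two-stage TTS pipeline: text -> mel spectrogram -> waveform.
# All model behavior here is dummy; real toolkits expose analogous stages.

class AcousticModel:
    """Stand-in for a text-to-spectrogram model (e.g. Tacotron2-DCA)."""
    FRAMES_PER_CHAR = 8   # assumed duration: 8 mel frames per input character
    N_MELS = 80           # a common 80-bin mel spectrogram

    def synthesize(self, text: str) -> list[list[float]]:
        n_frames = len(text) * self.FRAMES_PER_CHAR
        # Dummy spectrogram: one row of N_MELS values per frame.
        return [[0.0] * self.N_MELS for _ in range(n_frames)]


class Vocoder:
    """Stand-in for a spectrogram-to-waveform model (e.g. MultiBand-MelGAN)."""
    HOP_LENGTH = 256      # assumed audio samples per spectrogram frame

    def synthesize(self, spectrogram: list[list[float]]) -> list[float]:
        return [0.0] * (len(spectrogram) * self.HOP_LENGTH)


def text_to_speech(text: str) -> list[float]:
    spec = AcousticModel().synthesize(text)   # stage 1: text -> spectrogram
    return Vocoder().synthesize(spec)         # stage 2: spectrogram -> waveform


wave = text_to_speech("John works for Acme.")  # 20 chars -> 20*8*256 samples
```

Separating the acoustic model from the vocoder is what lets the authors mix and match the five pretrained checkpoints and pick the most natural-sounding pairing.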

The proposed end-to-end model, SpeechRE, pairs a wav2vec 2.0 speech encoder with a BART decoder, linked by a length adapter that bridges the mismatch between the long speech-embedding sequence and the much shorter target text. Two baselines are compared: (1) a pipeline that runs wav2vec 2.0 ASR and then applies the SpERT text-based relation extractor to the transcript, and (2) SpERT applied directly to the original text.
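The length adapter's role can be illustrated with a toy example. wav2vec 2.0 emits roughly one embedding every 20 ms, so a 10-second utterance yields about 500 frames, far longer than the decoder's target text. The sketch below is an assumption-laden simplification: real adapters typically stack strided convolutions, whereas this version uses plain strided mean pooling just to show the sequence-length reduction.

```python
# Toy length adapter: compress a long speech-frame sequence by averaging
# every `stride` consecutive frames into one (a ragged tail is dropped).

def length_adapter(frames: list[list[float]], stride: int = 4) -> list[list[float]]:
    dim = len(frames[0])
    pooled = []
    for i in range(0, len(frames) - stride + 1, stride):
        window = frames[i:i + stride]
        pooled.append([sum(f[d] for f in window) / stride for d in range(dim)])
    return pooled


# 500 speech frames of dimension 8 shrink to 125 adapter outputs,
# a sequence length the decoder's cross-attention can handle more easily.
speech = [[float(t)] * 8 for t in range(500)]
adapted = length_adapter(speech)
```

In a learned adapter the pooling weights would be trainable, but the effect on sequence length is the same: each stride-4 layer quarters the number of positions the decoder must attend over.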

Experiments are conducted on synthesized speech versions of CoNLL04 and TACRED. Results show that SpeechRE surpasses the pipeline method on CoNLL04, while still lagging behind on TACRED due to data imbalance and the difficulty of recognizing named entities from speech. Detailed error analysis reveals issues such as length constraints, cross‑modal alignment, and model “memory” effects that generate triples not present in the audio.
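For readers unfamiliar with how such results are scored, the snippet below shows a common evaluation convention for relational triples. It is an illustrative sketch, not the paper's exact evaluation script: a predicted (head, relation, tail) triple counts as correct only on exact match with a gold triple, which is precisely why misrecognized entity names from speech hurt so much.

```python
# Micro precision / recall / F1 over sets of (head, relation, tail) triples.

def triple_f1(gold: set[tuple[str, str, str]],
              pred: set[tuple[str, str, str]]) -> tuple[float, float, float]:
    correct = len(gold & pred)                          # exact-match triples
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = {("John Wilkes Booth", "Kill", "Abraham Lincoln"),
        ("Abraham Lincoln", "Live_In", "Washington")}
# One entity name recognized wrongly from the audio ruins an otherwise
# correct triple, halving both precision and recall.
pred = {("John Wilkes Booth", "Kill", "Abraham Lincoln"),
        ("Booth", "Live_In", "Washington")}
p, r, f = triple_f1(gold, pred)  # each 0.5
```

Under this exact-match regime, the "memory" errors noted above (triples generated that never occur in the audio) directly lower precision, while entities the model fails to recognize from speech lower recall.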

The authors outline future directions: remote supervision using knowledge‑base links, increasing speech diversity with real or varied synthetic voices, multimodal encoders that jointly process text and speech, and extending the approach to event extraction, slot filling, and video‑based relation extraction.

References include seminal works on relation extraction, speech recognition, TTS, and end‑to‑end models, providing a comprehensive bibliography for further study.

Tags: Natural Language Processing, multimodal, end-to-end, Relation Extraction, speech processing
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
