How Structure-Aware Sparse Attention Boosts Long-Code Transformers

The SASA model, a structure‑aware sparse‑attention Transformer developed by Alibaba Cloud PAI and Prof. Gao Ming’s team, improves long‑code sequence processing by sparsifying self‑attention using top‑k frequency and AST pattern matrices, achieving higher performance and lower memory/computation costs on CodeXGLUE benchmarks.

Alibaba Cloud Big Data AI Platform

Jul 11, 2022

Alibaba Cloud PAI, in collaboration with Prof. Gao Ming’s team at East China Normal University, introduced the Structure‑Aware Sparse Attention (SASA) Transformer for long code sequences, aiming to improve both effectiveness and efficiency.

Model Framework

The overall SASA architecture consists of two stages: a preprocessing stage that generates two interaction matrices (a top‑k frequency matrix and an AST pattern matrix) and a Sparse Transformer training stage that replaces full self‑attention with structure‑aware sparse self‑attention.

The preprocessing stage produces:

Top‑k frequency matrix: attention interaction frequencies between tokens, collected by running a pretrained code language model over the CodeSearchNet corpus.

AST pattern matrix: interaction information derived from the abstract syntax tree (AST) of the code.
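
As a rough illustration of the first matrix, the sketch below counts how often each vocabulary pair lands among a query token's k highest attention scores. It is a minimal sketch under stated assumptions, not the authors' preprocessing code: the function name `topk_frequency_matrix`, the `(token_ids, attn)` input format, and the vocabulary-by-vocabulary layout of the result are all hypothetical.

```python
import numpy as np

def topk_frequency_matrix(samples, vocab_size, k):
    """Count how often vocabulary pair (query, key) appears in a query's top-k.

    samples: iterable of (token_ids, attn), where token_ids has length n and
             attn is an (n, n) attention matrix from some pretrained model
             (hypothetical input format, assumed for illustration).
    """
    freq = np.zeros((vocab_size, vocab_size), dtype=np.int64)
    for token_ids, attn in samples:
        token_ids = np.asarray(token_ids)
        # Positions of the k largest attention scores in every query row.
        top = np.argsort(-np.asarray(attn), axis=1)[:, :k]
        for q, keys in enumerate(top):
            # np.add.at accumulates correctly even when a pair repeats.
            np.add.at(freq, (token_ids[q], token_ids[keys]), 1)
    return freq
```

Accumulating over a whole corpus this way yields one matrix that can be looked up for any new sequence, which is why it can be computed once in preprocessing.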

During training, the full self‑attention in a Transformer encoder is replaced by a structure‑aware sparse self‑attention that computes attention only for token pairs matching specific patterns, reducing both computational complexity and memory usage.
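
The effect of restricting attention to matching token pairs can be sketched with a boolean mask. This is only a dense emulation of the semantics, assuming a precomputed `mask` and a hypothetical helper named `masked_attention`; because it still builds the full score matrix, it does not deliver the memory or compute savings that a real sparse kernel would.

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Single-head attention where only pairs with mask[i, j] == True interact.

    q, k, v: (n, d) arrays; mask: (n, n) boolean array.
    Every row of mask must contain at least one True entry.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed pairs get -inf so they receive zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```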

Sparse Attention Modules

Sliding window attention: computes self‑attention within a moving local window, with complexity O(n·w), where n is the sequence length and w is the window size.

Global attention: a small set of global tokens attends to all tokens, with complexity O(n·g), where g is the number of global tokens.

Top‑k sparse attention: each token attends only to its k highest‑attention tokens, with complexity O(n·k).

AST‑aware structure attention: leverages the AST to define attention scopes based on code structure.
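
The patterns listed above can be combined into one token-level mask by taking their union. The sketch below is an illustration under assumptions: `build_sparse_mask` and its parameters are hypothetical names, the symmetric treatment of global tokens (every token can also attend to them) is borrowed from Longformer-style global attention rather than stated in this article, and the AST pattern is assumed to arrive as a precomputed boolean matrix.

```python
import numpy as np

def build_sparse_mask(n, window, global_idx, topk_idx, ast_mask=None):
    """Union of sliding-window, global, top-k and (optional) AST patterns.

    n:          sequence length
    window:     total window width; each token sees window // 2 neighbours per side
    global_idx: positions of global tokens
    topk_idx:   (n, k) integer array, the k key positions chosen for each query
    ast_mask:   optional (n, n) boolean matrix derived from the AST (assumed given)
    """
    idx = np.arange(n)
    # Sliding window: pairs whose positions differ by at most window // 2.
    mask = np.abs(idx[:, None] - idx[None, :]) <= window // 2
    # Global tokens attend to everything; the column is an assumption (see lead-in).
    mask[global_idx, :] = True
    mask[:, global_idx] = True
    # Top-k: each query row keeps its k selected key positions.
    rows = np.repeat(idx, topk_idx.shape[1])
    mask[rows, topk_idx.ravel()] = True
    if ast_mask is not None:
        mask |= ast_mask
    return mask
```

A mask built this way marks which token pairs take part in attention; everything left False is skipped.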

To exploit modern parallel hardware, the sequence is divided into blocks of size b. Each query block interacts with w sliding‑window blocks, g global blocks, and k top‑k/AST blocks. With n/b query blocks, each computing b × (w+g+k)·b attention scores, the overall complexity is O(n·(w+g+k)·b), which grows linearly in n rather than quadratically.
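
The block-level cost can be checked with simple arithmetic. The helper below is a hypothetical illustration (`attention_score_counts` is not from the paper), and the example numbers in the usage note are chosen for illustration only, not taken from the experiments.

```python
def attention_score_counts(n, b, w, g, k):
    """Return (sparse, full) counts of attention scores for a length-n sequence.

    n: sequence length, b: block size,
    w, g, k: number of sliding-window, global, and top-k/AST key blocks
             that each query block attends to.
    """
    if n % b != 0:
        raise ValueError("sequence length must be a multiple of the block size")
    num_query_blocks = n // b
    # Each query block holds b queries and sees (w + g + k) key blocks of b tokens.
    scores_per_query_block = b * (w + g + k) * b
    sparse = num_query_blocks * scores_per_query_block  # equals n * (w + g + k) * b
    full = n * n
    return sparse, full
```

For instance, with n = 4096, b = 64, w = 3, g = 1, and k = 2, the sparse pattern computes 1,572,864 scores against 16,777,216 for full attention, roughly 9% of the dense cost.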

Experimental Results

Evaluation was performed on four CodeXGLUE tasks (code clone detection, defect detection, code search, and code summarization), using only samples whose sequence length exceeds 512. SASA consistently outperformed the baselines: RoBERTa‑base, CodeBERT, and GraphCodeBERT, which truncate long inputs, as well as Longformer and BigBird, which ignore code structure.

Ablation studies on BigCloneBench and Defect Detection examined the individual contributions of the top‑k sparse attention and AST‑aware sparse attention modules.

SASA also reduces GPU memory consumption, enabling larger batch sizes without out‑of‑memory errors.

The SASA module can be integrated into other Transformer‑based pretrained models for long‑sequence natural language tasks and will be contributed to the open‑source EasyNLP framework.

Paper Information

Paper: Tingting Liu, Chengyu Wang, Cen Chen, Ming Gao, and Aoying Zhou. "Understanding Long Programming Languages with Structure‑Aware Sparse Attention." SIGIR 2022. https://arxiv.org/abs/2205.13730

Code: https://github.com/alibaba/EasyNLP
