
Graph Convolutional Networks for Intelligent Document Processing: Principles, Feature Engineering, and Applications

This article presents a comprehensive overview of using graph convolutional networks in intelligent document processing, covering basic GCN theory, adjacency matrix construction, feature engineering—including text, image, and handcrafted features—model architecture, self-supervised training, and real-world applications such as semantic entity recognition and relation extraction.

Laiye Technology Team

1. Introduction

We previously introduced the core scenarios of intelligent document processing and representative deep learning models. Laiye Technology's IDP product achieved the highest level in the China Academy of Information and Communications Technology's Trusted AI evaluation.

To address Semantic Entity Recognition (SER) and Relation Extraction (RE) in documents, we evaluated various solutions and selected a graph convolutional model based on interpretability, feature injection, inference speed, and pre‑trained models.

2. Basic Principles of Graph Convolution

The basic idea is that each node’s feature vector is aggregated from its neighbors; in the simplest case, neighbor features are summed. For a graph with n nodes and feature dimension d, we define the adjacency matrix A ∈ ℝ^{n×n}, the feature matrix X ∈ ℝ^{n×d}, and the diagonal degree matrix D (illustrated in the accompanying figures).

Mathematically, node features are updated iteratively as X^{(k+1)} = D^{-1/2} (A + I) D^{-1/2} X^{(k)} W^{(k)}, where I is the identity matrix (adding self-loops so each node retains its own features), D is the degree matrix of A + I, and the symmetric normalization prevents feature magnitudes from exploding as layers stack.

After applying a non‑linear activation (e.g., tanh) the final propagation rule becomes X^{(k+1)} = σ( D^{-1/2} (A+I) D^{-1/2} X^{(k)} W^{(k)} ).
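As a minimal illustration of this propagation rule, the following NumPy sketch applies one normalized graph-convolution step to a tiny path graph; the function and variable names are illustrative, not part of the production system:

```python
import numpy as np

def gcn_layer(A, X, W, activation=np.tanh):
    """One GCN step: activation( D^{-1/2} (A + I) D^{-1/2} X W )."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                     # add self-loops
    d = A_hat.sum(axis=1)                     # degrees of A + I
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return activation(A_norm @ X @ W)

# Tiny example: a 3-node path graph 0-1-2 with one-hot features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.eye(3)   # one-hot node features
W = np.eye(3)   # identity weights, so the normalization is easy to inspect
out = gcn_layer(A, X, W)
```

With identity features and weights, the output is simply tanh of the normalized adjacency, so each row shows exactly how much each node borrows from its neighbors and itself.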

3. Adjacency Matrix Construction

Three strategies are used:

1) Rule-based matrix: nodes correspond to text regions; edges connect regions whose spatial projections overlap, scanning from top-left to bottom-right, optionally weighted by inverse distance.

2) Learned positional matrix: features derived from bounding-box coordinates (x, y, w, h) are fed through dense layers to produce an adjacency matrix.

3) GAT-based matrix: attention scores computed from node features (including positional information) generate a lightweight, dynamic adjacency matrix that is re-normalized at each layer.
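A minimal sketch of the first, rule-based strategy, assuming boxes are given as (x, y, w, h) tuples and connecting regions whose horizontal or vertical projections overlap; the exact overlap rule in the production system may differ:

```python
import numpy as np

def rule_based_adjacency(boxes, eps=1e-6):
    """Build an inverse-distance-weighted adjacency matrix from text boxes.

    boxes: (n, 4) array of (x, y, w, h). Two regions are connected when
    their x-ranges or y-ranges overlap (roughly: same column or same row),
    and the edge weight is the inverse distance between box centers.
    """
    boxes = np.asarray(boxes, dtype=float)
    n = len(boxes)
    cx = boxes[:, 0] + boxes[:, 2] / 2.0   # center x
    cy = boxes[:, 1] + boxes[:, 3] / 2.0   # center y
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # horizontal (x-range) and vertical (y-range) projection overlap
            h_overlap = (boxes[i, 0] < boxes[j, 0] + boxes[j, 2] and
                         boxes[j, 0] < boxes[i, 0] + boxes[i, 2])
            v_overlap = (boxes[i, 1] < boxes[j, 1] + boxes[j, 3] and
                         boxes[j, 1] < boxes[i, 1] + boxes[i, 3])
            if h_overlap or v_overlap:
                dist = np.hypot(cx[i] - cx[j], cy[i] - cy[j])
                A[i, j] = 1.0 / (dist + eps)
    return A

# Two boxes on the same text line are connected; a distant box is not.
boxes = [[0, 0, 10, 10], [20, 0, 10, 10], [100, 100, 10, 10]]
A = rule_based_adjacency(boxes)
```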

4. Node Features

We combine multiple modalities:

• Text features: RoBERTa embeddings (truncated to 48 tokens) with digits replaced by a special token.

• Image features: UNet extracts pixel‑level semantics for each text region.

• Hand‑crafted features: ratios of digits, letters, punctuation, Chinese characters, and flags for special entities (person, amount, email, date, URL).

• Index features: embedding of the node order after sorting by top‑left coordinate.

• Positional features: normalized coordinates and size relative to the whole page.

All modalities are fused via an attention‑weighted combination rather than simple concatenation.
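One way to realize such attention-weighted fusion is sketched below: each modality is projected to a common dimension, scored, and combined via a softmax over modalities. The layer name and sizes are illustrative assumptions, not the production implementation:

```python
import tensorflow as tf

class ModalityAttentionFusion(tf.keras.layers.Layer):
    """Fuse per-node features from several modalities with learned attention.

    Each modality (text, image, hand-crafted, index, positional) is first
    projected to a shared dimension; a scalar score per modality is then
    softmax-normalized and used to weight the sum, instead of concatenating.
    """

    def __init__(self, dim=128, num_modalities=5, **kwargs):
        super().__init__(**kwargs)
        self.projs = [tf.keras.layers.Dense(dim) for _ in range(num_modalities)]
        self.score = tf.keras.layers.Dense(1, use_bias=False)

    def call(self, modality_features):
        # modality_features: list of [batch, nodes, d_m] tensors
        projected = [p(f) for p, f in zip(self.projs, modality_features)]
        stacked = tf.stack(projected, axis=-2)      # [b, n, M, dim]
        scores = self.score(stacked)                # [b, n, M, 1]
        weights = tf.nn.softmax(scores, axis=-2)    # attention over modalities
        return tf.reduce_sum(weights * stacked, axis=-2)  # [b, n, dim]

# Example: fuse three modalities with different input dims into dim 8.
fusion = ModalityAttentionFusion(dim=8, num_modalities=3)
fused = fusion([tf.random.normal([2, 4, d]) for d in (16, 32, 8)])
```

A design advantage over concatenation is that the fused dimension stays fixed as modalities are added or removed, and the learned weights expose which modality each node relies on.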

5. Application Scenarios

In Laiye’s IDP system the GCN is applied to:

• SER – extracting structured entities (e.g., name, gender) from IDs and licenses.

• RE – discovering key‑value pairs in custom forms by reconstructing a directed adjacency matrix and selecting edges with confidence > 0.5.

• Multi‑task settings – simultaneously handling SER and RE on receipts, identifying items, prices, quantities, and their relationships.
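The RE edge-selection step described above, keeping directed key→value edges whose predicted confidence exceeds 0.5, can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def extract_key_value_pairs(edge_probs, threshold=0.5):
    """Select directed key->value edges from a predicted adjacency matrix.

    edge_probs[i, j] is the model's confidence that node i (a key) points
    to node j (its value); pairs above the threshold are kept.
    """
    edge_probs = np.asarray(edge_probs, dtype=float)
    keys, values = np.where(edge_probs > threshold)
    return list(zip(keys.tolist(), values.tolist()))

# Node 0 points to node 1 with confidence 0.9; all other edges are weak.
pairs = extract_key_value_pairs([[0.1, 0.9], [0.2, 0.3]])
```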

6. Self‑Supervised Learning

Due to limited labeled data, we adopt a pre‑train‑then‑fine‑tune paradigm using a graph contrastive method (NNCLR) that selects the nearest positive example, mitigating over‑fitting compared with SimCLR or MoCo.

Three graph-level pooling strategies over node features (max, average, and attention) are evaluated, with attention pooling yielding the best global representation.
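The attention-pooling variant can be sketched as a small layer that scores each node and takes a softmax-weighted sum; this is a minimal illustration, not the production layer:

```python
import tensorflow as tf

class AttentionPooling(tf.keras.layers.Layer):
    """Pool node features into one graph-level vector with learned attention."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.score = tf.keras.layers.Dense(1, use_bias=False)

    def call(self, node_features):
        # node_features: [batch, nodes, dim]
        weights = tf.nn.softmax(self.score(node_features), axis=1)  # [b, n, 1]
        return tf.reduce_sum(weights * node_features, axis=1)       # [b, dim]

# Example: pool 5 nodes with 16-dim features into one 16-dim graph vector.
pool = AttentionPooling()
graph_vec = pool(tf.random.normal([2, 5, 16]))
```

Unlike max or average pooling, the learned scores let informative nodes (e.g., key fields on a form) dominate the graph representation.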

Self‑supervised pre‑training dramatically improves downstream F1 scores (e.g., from 18 % to 90 % on a 26‑image PO‑form SER task).

7. Additional Optimizations

To alleviate over‑smoothing and computational cost we employ:

1) Highway connections that add residual features across layers.

2) Sparse adjacency matrices retaining only the top 30 % of values.

3) Drop‑edge regularization that randomly removes edges during forward passes.
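Two of these optimizations, top-value sparsification and drop-edge, can be sketched as follows; the 30 % retention is applied per row here, which is one plausible reading of the scheme:

```python
import tensorflow as tf

def sparsify_top_k(A, keep_ratio=0.3):
    """Keep only each row's largest keep_ratio fraction of edge weights."""
    n = tf.shape(A)[-1]
    k = tf.maximum(1, tf.cast(tf.cast(n, tf.float32) * keep_ratio, tf.int32))
    top_k = tf.math.top_k(A, k=k)
    threshold = top_k.values[..., -1:]     # per-row k-th largest value
    return tf.where(A >= threshold, A, tf.zeros_like(A))

def drop_edge(A, drop_rate=0.1, training=True):
    """Randomly zero a fraction of edges during training (drop-edge)."""
    if not training:
        return A
    mask = tf.cast(tf.random.uniform(tf.shape(A)) >= drop_rate, A.dtype)
    return A * mask

# Example: on a 10x10 matrix, keeping 30 % leaves 3 edges per row.
A = tf.reshape(tf.range(100, dtype=tf.float32), [10, 10])
A_sparse = sparsify_top_k(A, keep_ratio=0.3)
```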

8. Reference Implementation

The following TensorFlow 2 layer implements a learnable adjacency matrix for graph learning:

import tensorflow as tf

class GraphAdjLearningLayer(tf.keras.layers.Layer):
    """Learns an adjacency matrix from pairwise node features.

    Expects inputs of shape [batch, n*n, d], i.e. one feature vector per
    ordered node pair, and returns a [batch, n, n] adjacency matrix with
    entries in (0, 1).
    """

    def __init__(self, name="graph_learning", **kwargs):
        super().__init__(name=name, **kwargs)
        self.dense1 = tf.keras.layers.Dense(32, activation=tf.keras.layers.LeakyReLU(0.18))
        self.dense2 = tf.keras.layers.Dense(16, activation=tf.keras.layers.LeakyReLU(0.18))
        self.dense3 = tf.keras.layers.Dense(1, use_bias=False)
        self.act = tf.keras.layers.Activation("sigmoid")

    def call(self, inputs, training=False):
        x = tf.nn.l2_normalize(inputs, axis=-1)  # stabilize feature scale
        x = self.dense1(x)
        x = self.dense2(x)
        x = self.dense3(x)                       # one scalar score per node pair
        x = self.act(x)                          # squash scores to (0, 1)
        # Recover n from the n*n pair dimension; tf.shape works for dynamic
        # shapes, where unpacking inputs.shape would yield None.
        pair_num = tf.shape(x)[1]
        node_num = tf.cast(tf.sqrt(tf.cast(pair_num, tf.float32)), tf.int32)
        return tf.reshape(x, [-1, node_num, node_num])

