How AI Powers Automatic Security Tagging in Large‑Scale Data Governance

This article details how a Chinese e‑commerce platform leverages large language models, the open‑source Dify platform, and engineered workflows to automate security tagging of massive data assets, covering data‑governance fundamentals, the advantages of AI‑driven tagging, technical architecture, prompt engineering, optimization cases, and the future roadmap.

Zhuanzhuan Tech

Background

Data governance covers the full lifecycle of business, information, and data flows. It consists of two layers: (1) business‑data governance that creates a true data image, and (2) analysis‑system governance that designs rational analytical structures. Security tagging is a sub‑step of the analysis layer, required to meet regulatory, permission‑isolation, and risk‑control requirements by assigning a security level to every table and field.

Figure 1: Core concepts and framework of data governance

AI‑Driven Security Tagging

The AI solution tags tables and fields, captures incremental DDL changes, and replaces manual labeling with a fast, accurate, and low‑cost process.

Timeliness: Micro‑batch processing provides near‑real‑time response to metadata changes.

Granularity: Tags can be applied at the individual field level.

Accuracy: Optimized prompts and rule‑based metadata reduce human subjectivity.

Cost: One‑time development yields scalable automation.

Incremental handling: Automatic detection of DDL changes triggers re‑tagging.

Figure 3: Five advantages of metadata tagging

AI Technology Evolution

From RNN/LSTM (pre‑2017) to the Transformer architecture (self‑attention, 2017) and the pre‑training + fine‑tuning paradigm (2018–2022), AI has entered the MaaS (Model‑as‑a‑Service) era, where the focus is on delivering business value rather than merely scaling model size.

Figure 4: Evolution of AI technology

ZZ‑Dify Platform Role

Dify is an open‑source MaaS platform that provides visual development, one‑stop deployment, and both workflow and agent execution modes. It serves as the foundation for rapid development of the AI‑tagging application.

Figure 5: Role of the Dify platform

Limitations of Large Models

Vulnerability under high load: Massive bursts of schema changes can overwhelm inference and cause it to break down.

Weak interference resistance: Irrelevant or malformed metadata may pollute predictions.

Hallucination risk: When rule metadata is missing, the model may fabricate plausible but incorrect tags.

System Architecture

The AI tagging platform consists of the following components (see Figure 6):

REST API – external entry point for tagging requests.

LockManager & MySQL Distributed Lock – guarantees atomicity of concurrent tagging jobs.

AI Module – assembles metadata, wraps calls to OneService/星河, and drives the tagging loop.

ZZSchedule (XXL‑JOB) – high‑performance distributed scheduler for micro‑batch tagging.

Notify Module – pushes tagging results to enterprise WeChat and other channels.

Log Storage – end‑to‑end traceability for debugging and secondary model training.

Rate Limiter – token‑bucket / leaky‑bucket limiter to cap concurrent AI requests.

Schedule Module – records task metadata and manages state machines.

Local Caffeine Cache – caches metadata for high‑performance consistency.

OneService / 星河 – unified data query gateway for Hive, SQL, etc.

MySQL MetaData – central repository for all tagging metadata.
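Of the components above, the rate limiter is the most self-contained. As an illustration only (the article does not show the platform's actual implementation), a minimal token-bucket limiter that caps concurrent AI requests might look like this:

```python
import threading
import time

class TokenBucketLimiter:
    """Cap AI request throughput: tokens refill at a fixed rate up to a burst size."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)   # bucket starts full
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Return True if a request may proceed, refilling tokens lazily."""
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

limiter = TokenBucketLimiter(rate_per_sec=5, burst=2)
```

A leaky-bucket variant differs only in that it smooths the drain rate instead of allowing bursts; the lazy-refill trick above avoids a background timer thread.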

Figure 6: AI tagging platform architecture

Schema Evolution Process

Schema evolution monitors DDL events, parses change types, and atomically re‑invokes the AI module for affected fields. The process includes:

Listening – capture upstream DDL events.

Parsing – extract change type (add/drop/modify) and impact range.

Sync – re‑tag affected objects atomically and rewrite metadata.

Notification – alert via enterprise WeChat on discrepancies.
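The listen/parse/sync/notify loop above can be sketched as a small event handler. This is an illustrative outline, not the platform's code: `retag` stands in for the AI module call and `notify` for the enterprise WeChat push, both injected so the handler stays testable.

```python
from dataclasses import dataclass

@dataclass
class DdlEvent:
    table: str
    change_type: str        # "add" | "drop" | "modify"
    affected_fields: list

def handle_ddl_event(event: DdlEvent, retag, notify) -> dict:
    """Parse the change type, re-tag affected fields, and push a notification."""
    if event.change_type == "drop":
        # Dropped fields only need their metadata removed; no AI call required.
        notify(f"{event.table}: dropped {event.affected_fields}")
        return {}
    # Re-tag the affected fields as one unit so metadata is rewritten atomically.
    new_tags = retag(event.table, event.affected_fields)
    notify(f"{event.table}: re-tagged {sorted(new_tags)}")
    return new_tags
```

For example, an `add` event for a new `phone` column would trigger one `retag` call scoped to that single field, rather than re-tagging the whole table.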

Unified Metadata Structure

The metadata model (see Figure 7) comprises four parts:

Database metadata – tables, types, fields.

Actual field data – sample values or statistics.

Rule metadata – description and security level.

Auxiliary extensions – compression type, format, etc.
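The four-part model above maps naturally onto a pair of record types. The field and class names below are illustrative, not the platform's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class FieldMeta:
    name: str                                    # database metadata
    data_type: str
    samples: list = field(default_factory=list)  # actual field data: sample values
    rule_desc: str = ""                          # rule metadata: description
    security_level: str = ""                     # rule metadata: assigned level

@dataclass
class TableMeta:
    table: str
    fields: list                                      # list of FieldMeta
    extensions: dict = field(default_factory=dict)    # auxiliary: compression, format, ...

meta = TableMeta(
    table="user_profile",
    fields=[FieldMeta("phone", "string",
                      samples=["138****0001"],
                      rule_desc="phone number",
                      security_level="L4")],
    extensions={"compression": "snappy", "format": "orc"},
)
```

Keeping rule metadata alongside each field is what lets the prompt builder decide, per field, whether the model has a rule to lean on or must fall back to the special-case behavior described later.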

Figure 7: Metadata structure

Task Flow

Figure 8 illustrates the sequence of interactions among:

Base Data Platform – metadata management and rule storage.

OneService/星河 – data query gateway.

Dify – LLM service and workflow engine.

DeepSeek/Doubao – commercial LLM back‑ends.

Figure 8: Task sequence diagram

Enterprise Workflow on Dify

The initial workflow split metadata into three parallel branches (instance data, schema‑only, full metadata + instance) and merged the results before feeding the LLM. The final workflow consolidates everything into a single main branch and adds a Java post‑processing step to improve accuracy and reduce token usage.

Figure 9: Initial workflow
Figure 10: Final workflow

Metadata Assembly Script (Python)

The script performs four steps:

Receive JSON metadata from the upstream system.

Submit a SQL task to OneService/星河 and wait for completion.

Compress the result according to the configured algorithm.

Generate a Markdown structure that is friendly to LLMs.
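The four steps can be sketched as a single function. This is a simplified stand-in for the actual script: `run_sql` represents the OneService/星河 gateway, and the compression step here simply zlib-packs the sampled rows for downstream storage.

```python
import base64
import json
import zlib

def assemble_markdown(raw_json: str, run_sql) -> str:
    """Parse JSON metadata, query instance data, compress it, emit LLM-friendly Markdown."""
    meta = json.loads(raw_json)                                   # 1. receive JSON metadata
    samples = run_sql(f"SELECT * FROM {meta['table']} LIMIT 5")   # 2. query via gateway
    packed = base64.b64encode(                                    # 3. compress the result
        zlib.compress(json.dumps(samples).encode())).decode()
    lines = [f"## Table: {meta['table']}", "",
             "| field | type |", "| --- | --- |"]
    lines += [f"| {f['name']} | {f['type']} |" for f in meta["fields"]]
    lines += ["", f"Samples (compressed): {packed}"]
    return "\n".join(lines)                                       # 4. Markdown for the LLM
```

The Markdown table keeps the schema compact and unambiguous for the model, while the compressed sample payload keeps token usage predictable regardless of row width.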

Figure 11: Python parsing process

Prompt Engineering

Markdown is used as the LLM input format. The optimized prompt template contains four sections:

Role definition – tells the model it is an expert annotator.

Tagging metadata – wrapped with custom tags such as <table_info>…</table_info> and <rule_info>…</rule_info>.

Rule description – provides the security‑level rules.

Output format – requires a JSON object inside a json{} block.

Example instruction (translated from the Chinese original):

"During tagging, first analyze in detail, inside a <思考> (thinking) tag, the rationale and reasoning behind each field's tag; then output the tagging result in the following JSON format, returning only the result with no extra text or explanation: \"json{}\""
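Assembling the four sections is mechanical string work. A minimal sketch (the wording and helper name are illustrative, not the production template):

```python
def build_prompt(table_md: str, rules_md: str) -> str:
    """Assemble the four prompt sections: role, tagged metadata, rules, output format."""
    return "\n".join([
        "You are an expert data-security annotator.",     # 1. role definition
        "<table_info>", table_md, "</table_info>",        # 2. tagging metadata
        "<rule_info>", rules_md, "</rule_info>",          # 3. rule description
        "Think first inside a <思考> tag, then return only "
        "a JSON object inside a json{} block, with no extra text.",  # 4. output format
    ])
```

The custom `<table_info>`/`<rule_info>` tags give the model unambiguous boundaries between schema and rules, which matters when both sections contain Markdown tables.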
Figure 12: Prompt optimization template

Optimization Cases

Prompt Optimization (Accuracy Boost)

Clear role definition: Define the AI as a domain‑expert annotator with deterministic conditions.

Structured prompt design: Use explicit tags and hierarchical Markdown headings to reduce ambiguity.

Output format restriction: Enforce JSON output to simplify downstream parsing.

A/B testing: Compare accuracy, completeness, latency, stability, and token usage to select the best prompt.

Special‑case constraints: Define fallback behavior when rule metadata is missing.

Post‑Tagging Re‑calculation

Two‑stage formulas refine the initial security‑level calculation.

Figure 14: First‑stage calculation formula
Figure 15: Second‑stage calculation formula

Stage 1 uses L4/L3 ratios and table type (raw/ads) to derive a base security level.

Stage 2 incorporates extreme‑value probability, custom sensitive‑word lists, and sub‑business‑domain ratios to adjust the level.
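The actual formulas live in Figures 14 and 15 and are not reproduced here; the sketch below only illustrates the general two-stage shape (ratio-based base level, then sensitive-word adjustment). All thresholds and the bump rule are placeholders, not the platform's values.

```python
def stage1_base_level(l4_ratio: float, l3_ratio: float, table_type: str) -> str:
    """Illustrative stage 1: derive a base level from L4/L3 field ratios and table type.

    Placeholder thresholds; the real formula is the one in Figure 14.
    """
    if l4_ratio > 0.0 or (table_type == "raw" and l3_ratio > 0.5):
        return "L4"
    return "L3" if l3_ratio > 0.0 else "L2"

def stage2_adjust(level: str, sensitive_hits: int) -> str:
    """Illustrative stage 2: bump the level one step when custom sensitive words match."""
    order = ["L1", "L2", "L3", "L4"]
    if sensitive_hits > 0 and level != "L4":
        return order[order.index(level) + 1]
    return level
```

The point of the two-stage split is that stage 1 is cheap and purely structural, while stage 2 folds in signals (sensitive-word lists, sub-domain ratios) that require extra lookups.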

Batch Tagging (Accuracy Boost)

The original approach sent all fields to the model at once, causing token overflow, reduced accuracy, and long runtimes.

The optimized approach adds a field‑count threshold parameter to Dify, processes fields in batches, and aggregates the results, yielding higher accuracy and more stable performance.
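The batch-and-aggregate step reduces to a few lines. A minimal sketch, with `tag_batch` standing in for one model invocation through the workflow:

```python
def tag_in_batches(fields, threshold, tag_batch):
    """Split fields into batches no larger than `threshold`, call the model
    once per batch, and merge the per-field results into one dict."""
    merged = {}
    for start in range(0, len(fields), threshold):
        batch = fields[start:start + threshold]
        merged.update(tag_batch(batch))   # one model call per batch
    return merged

# Usage with a stub model: a 5-field table and threshold 2 yields 3 calls.
calls = []
def fake_model(batch):
    calls.append(len(batch))
    return {f: "L2" for f in batch}

result = tag_in_batches(["a", "b", "c", "d", "e"], 2, fake_model)
```

Keeping each batch under the threshold bounds the prompt size, which is what restores both accuracy and stable latency on very wide tables.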

Future Plans

Extend AI tagging to cover all business lines and unify metadata across Hive, Doris, ClickHouse, Redis, HBase, and MySQL.

Improve response speed with lighter models, upgraded inference engines, and higher concurrency.

Continuously track LLM advances (e.g., GLM‑4.6V, OpenAI, Gemini) and evolve from rule‑centric to semantic‑plus‑rule tagging.

Support multi‑source schema evolution and dynamic concurrency control.

Conclusion

The case demonstrates that large‑model AI can be applied to large‑scale data governance to achieve fast, accurate, and adaptable security tagging. Key takeaways are:

Prompt engineering and task decomposition matter more than raw model size.

Robust engineering—stable architecture, distributed scheduling, monitoring, and logging—is essential for production reliability.

Continuous iteration on prompts, workflows, and rule sets is required to maintain performance as models and business requirements evolve.

Tags: AI, prompt engineering, large language models, workflow automation, data governance, security tagging
Written by

Zhuanzhuan Tech

A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.
