
Getting Started with Tree-sitter: High‑Performance Code Parsing and Multi‑Language SQL Extraction

Tree-sitter is a high‑performance incremental parsing library with grammars for over 50 languages. This article explains its core features, covers typical use cases such as editor syntax highlighting and static analysis, and walks through a concrete Python implementation that extracts SQL from Java, Python, and XML sources.

Network Intelligence Research Center (NIRC)

What is Tree-sitter?

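Tree-sitter is a parser generator and incremental parsing library, originally developed at GitHub, that builds a concrete syntax tree for a source file and keeps it up to date as the file is edited. Its main features are:

High performance: the runtime is written in C and is fast enough to re‑parse on every keystroke in an editor.

Incremental parsing: after an edit, it reuses the unchanged parts of the previous syntax tree and re‑parses only the affected regions.

Error recovery: it still produces a usable tree when the source contains syntax errors, marking the problem areas with ERROR or MISSING nodes.

Multi‑language support: grammars are available for more than 50 languages.

Rich bindings: bindings exist for Python, Node.js, Rust, and other languages.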
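
To make the incremental‑parsing point concrete: before re‑parsing, Tree-sitter has to be told which byte range of the source changed. The following is a minimal sketch, not part of Tree-sitter itself; `compute_edit` and `byte_to_point` are hypothetical helper names. It compares the old and new source bytes and derives the six values that the Python binding's `Tree.edit` takes as keyword arguments.

```python
def byte_to_point(source, offset):
    """Convert a byte offset into a (row, column) pair; the column is counted in bytes."""
    row = source.count(b"\n", 0, offset)
    # rfind returns -1 when no newline precedes the offset, so line_start becomes 0.
    line_start = source.rfind(b"\n", 0, offset) + 1
    return (row, offset - line_start)


def compute_edit(old, new):
    """Describe the smallest single edit that turns `old` into `new` (both bytes)."""
    # Length of the common prefix.
    limit = min(len(old), len(new))
    start = 0
    while start < limit and old[start] == new[start]:
        start += 1

    # Shrink both ends while the suffixes match, without crossing the prefix.
    old_end, new_end = len(old), len(new)
    while old_end > start and new_end > start and old[old_end - 1] == new[new_end - 1]:
        old_end -= 1
        new_end -= 1

    return {
        "start_byte": start,
        "old_end_byte": old_end,
        "new_end_byte": new_end,
        "start_point": byte_to_point(old, start),
        "old_end_point": byte_to_point(old, old_end),
        "new_end_point": byte_to_point(new, new_end),
    }


old_source = b"x = 1\ny = 2\n"
new_source = b"x = 1\ny = 42\n"
edit = compute_edit(old_source, new_source)
```

With py-tree-sitter, the result would typically be applied as `old_tree.edit(**edit)`, followed by `parser.parse(new_source, old_tree)` so that unchanged subtrees can be reused.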

Core application scenarios

Syntax highlighting and code folding in editors such as Neovim, Helix, Zed, and Atom.

Static code analysis tools.

Code formatting utilities.

Language Server Protocol (LSP) implementations.

Case study: Multi‑language SQL extraction

Extract SQL statements from Java, Python, and XML files.

1. Building parsers for each language

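from tree_sitter import Language, Parser

# Compile the grammar repositories into one shared library.
# Note: Language.build_library belongs to the older py-tree-sitter API;
# it was removed from newer releases (0.22 and later), which load grammars
# from per-language packages instead.
Language.build_library(
    './languages.so',
    [
        './tree-sitter-java',
        './tree-sitter-python',
        './tree-sitter-xml'
    ]
)

# Load each language from the compiled library
JAVA_LANGUAGE = Language('./languages.so', 'java')
PYTHON_LANGUAGE = Language('./languages.so', 'python')
XML_LANGUAGE = Language('./languages.so', 'xml')

# Create a parser and bind it to a language (repeat for the other grammars)
java_parser = Parser()
java_parser.set_language(JAVA_LANGUAGE)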
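
Once the grammars are loaded, each input file has to be routed to the matching one. A minimal sketch of that routing step, assuming the file suffix is enough to decide; `SUFFIX_TO_LANGUAGE`, `language_for_path`, and `group_by_language` are hypothetical names, not part of Tree-sitter:

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the grammar names used above;
# extend it if other file types should be scanned.
SUFFIX_TO_LANGUAGE = {
    ".java": "java",
    ".py": "python",
    ".xml": "xml",
}


def language_for_path(path):
    """Return the grammar name for a file, or None if the file type is not handled."""
    return SUFFIX_TO_LANGUAGE.get(Path(path).suffix.lower())


def group_by_language(paths):
    """Group file paths by grammar name, skipping files with unknown suffixes."""
    groups = {}
    for path in paths:
        language = language_for_path(path)
        if language is not None:
            groups.setdefault(language, []).append(str(path))
    return groups
```

Grouping first means each parser only needs to be configured once per language rather than once per file.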

2. Java SQL extraction

SQL may appear in MyBatis annotations, JPA/Hibernate native queries, or string concatenations. The following function walks the Java AST, detects annotation nodes and method invocations, and extracts the SQL text.

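# Illustrative name sets; extend them to match the frameworks in the code base
MYBATIS_ANNOTATIONS = {'Select', 'Insert', 'Update', 'Delete'}
HIBERNATE_METHODS = {'createQuery', 'createNativeQuery', 'createSQLQuery'}

def extract_java_sql(node, results):
    """Walk the Java syntax tree and collect SQL strings into results."""
    # Handle annotation SQL, e.g. @Select("...")
    if node.type == 'annotation':
        name_node = node.child_by_field_name('name')
        if name_node and name_node.text.decode('utf-8') in MYBATIS_ANNOTATIONS:
            # Extract SQL from annotation arguments (helper defined elsewhere)
            extract_annotation_arguments(node, results)

    # Handle method invocation SQL, e.g. session.createNativeQuery("...")
    elif node.type == 'method_invocation':
        method_name = node.child_by_field_name('name')
        if method_name and method_name.text.decode('utf-8') in HIBERNATE_METHODS:
            # Extract SQL from method arguments (helper defined elsewhere)
            extract_method_arguments(node, results)

    # Recursively process children
    for child in node.children:
        extract_java_sql(child, results)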

3. Python SQL extraction strategy

SQL in Python usually appears as strings, possibly spanning multiple lines, concatenated, or added via conditional appends. The extractor checks string nodes, identifies SQL content, and also recognises patterns like conditions.append("AND name = ?").

def extract_python_sql(node, code_bytes, results):
    # Handle string literals (single-line or triple-quoted)
    if node.type == 'string':
        text = code_bytes[node.start_byte:node.end_byte].decode('utf-8')
        if is_sql_content(text):
            results.append(clean_sql(text))

    # Handle the conditions.append(...) pattern
    elif node.type == 'call':
        function_node = node.child_by_field_name('function')
        # node.text is bytes, so it must be compared against a bytes literal
        if function_node and function_node.text.endswith(b'conditions.append'):
            extract_append_conditions(node, results)

    # Recursively process children
    for child in node.children:
        extract_python_sql(child, code_bytes, results)
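
The helpers `is_sql_content` and `clean_sql` are referenced above but not shown. The following is one heuristic way they might look, assuming they receive the raw text of a string node, including its quotes and any prefix such as `f` or `r`. The keyword pattern and the helper `strip_python_quotes` are assumptions made for illustration, and a keyword check like this will produce some false positives and negatives.

```python
import re

# Optional string prefix (r, b, u, f and two-letter combinations) before the opening quote.
_STRING_PREFIX = re.compile(r"^[rRbBuUfF]{1,2}(?=['\"])")

# Statement openers treated as SQL; deliberately strict to limit false positives.
_SQL_START = re.compile(
    r"^\s*(SELECT\s|INSERT\s+INTO\s|UPDATE\s+\S+\s+SET\s|DELETE\s+FROM\s|WITH\s)",
    re.IGNORECASE,
)


def strip_python_quotes(literal):
    """Remove the prefix and the surrounding quotes from a Python string literal."""
    body = _STRING_PREFIX.sub("", literal)
    for quote in ('"""', "'''", '"', "'"):
        if len(body) >= 2 * len(quote) and body.startswith(quote) and body.endswith(quote):
            return body[len(quote):-len(quote)]
    return body


def is_sql_content(literal):
    """Return True if the string body starts like a SQL statement."""
    return bool(_SQL_START.match(strip_python_quotes(literal)))


def clean_sql(literal):
    """Return the string body with all whitespace runs collapsed to single spaces."""
    return " ".join(strip_python_quotes(literal).split())
```

Fragments such as "AND name = ?" do not start with a statement keyword, which is why the extractor handles the conditions.append pattern in a separate branch.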

4. XML MyBatis mapper processing

MyBatis XML mapper files contain SQL inside select, insert, update, and delete tags, often mixed with dynamic tags. The function walks the XML AST, extracts those tags, simplifies the SQL, and stores the result.

SQL_TAGS = {'select', 'insert', 'update', 'delete'}

def extract_xml_sql(node, results, sql_definitions):
    # Extract SQL operation tags (select, insert, update, delete)
    if node.type == 'element':
        # Assumes the grammar exposes the tag name through a 'name' field;
        # adapt this lookup to the node types of the XML grammar in use.
        name_node = node.child_by_field_name('name')
        if name_node:
            tag_name = name_node.text.decode('utf-8')
            if tag_name.lower() in SQL_TAGS:
                content = extract_element_content(node)
                simplified_sql = simplify_mybatis_sql(content, sql_definitions)
                results.append(simplified_sql)

    # Recursively process children
    for child in node.children:
        extract_xml_sql(child, results, sql_definitions)
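
How `simplify_mybatis_sql` works is not shown in the article. The sketch below is one possible regex‑based approximation, under these assumptions: `content` is the raw inner text of a statement tag, `sql_definitions` maps the id of each reusable `<sql>` fragment to its text, `#{...}` parameters become `?`, and the bodies of dynamic tags are all kept, so the output is the statement with every optional condition included. It only approximates the `<where>` tag by dropping a leading AND or OR.

```python
import html
import re

_CDATA = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)
_INCLUDE = re.compile(r'<include\s+refid\s*=\s*"([^"]+)"\s*/>')
_WHERE_OPEN = re.compile(r"<where\s*>", re.IGNORECASE)
# A tag with optional quoted attributes; quoting is honoured so that a ">"
# inside an attribute value does not end the tag early.
_TAG = re.compile(r"""</?[A-Za-z][\w:.-]*(?:\s+[\w:.-]+\s*=\s*(?:"[^"]*"|'[^']*'))*\s*/?>""")
_PARAM = re.compile(r"#\{[^}]*\}")
_LEADING_CONNECTIVE = re.compile(r"\bWHERE\s+(?:AND|OR)\s+", re.IGNORECASE)


def simplify_mybatis_sql(content, sql_definitions):
    """Reduce MyBatis mapper markup to a flat, single-line SQL string."""
    sql = _CDATA.sub(lambda m: m.group(1), content)          # unwrap CDATA sections
    sql = _INCLUDE.sub(lambda m: sql_definitions.get(m.group(1), ""), sql)  # inline fragments
    sql = _WHERE_OPEN.sub(" WHERE ", sql)                    # <where> becomes WHERE
    sql = _TAG.sub(" ", sql)                                 # drop remaining tags, keep bodies
    sql = _PARAM.sub("?", sql)                               # #{name} becomes ?
    sql = html.unescape(sql)                                 # &gt; and &lt; back to > and <
    sql = " ".join(sql.split())                              # collapse whitespace
    return _LEADING_CONNECTIVE.sub("WHERE ", sql)            # "WHERE AND x" becomes "WHERE x"


mapper_body = """
  SELECT <include refid="userColumns"/>
  FROM users
  <where>
    <if test="name != null">AND name = #{name}</if>
    <if test="minAge != null">AND age &gt;= #{minAge}</if>
  </where>
"""
simplified = simplify_mybatis_sql(mapper_body, {"userColumns": "id, name, age"})
```

`${...}` substitutions are left as they are, because they stand for literal SQL text such as table names rather than bound parameters.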

Conclusion

Tree-sitter is a powerful parsing tool suitable for scenarios that require deep code‑structure analysis. Using it, developers can build efficient, accurate multi‑language SQL extraction utilities, addressing common pain points in real projects. Whether building editors, static analysis tools, or custom code‑processing pipelines, Tree-sitter is worth considering.


Written by

Network Intelligence Research Center (NIRC)

NIRC is based on the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.
