Getting Started with Tree-sitter: High‑Performance Code Parsing and Multi‑Language SQL Extraction
Tree-sitter is a high-performance incremental parsing library with grammars available for more than 50 languages. This article explains its core features, covers typical use cases such as editor syntax highlighting and static analysis, and walks through a concrete Python implementation that extracts SQL from Java, Python, and XML sources.
What is Tree-sitter?
Tree-sitter is a parser generator and high-performance incremental parsing library, originally developed at GitHub, that builds syntax trees for source code. Its main features are:
High performance: the runtime is written in C and parses very quickly.
Incremental parsing: after an edit, only the affected parts of the tree are re-parsed, which keeps updates cheap.
Error recovery: it still produces a usable, partial tree when the source contains syntax errors.
Multi-language support: grammars are available for more than 50 languages.
Rich bindings: bindings exist for Python, Node.js, Rust, and other languages.
Core application scenarios
Syntax highlighting and code folding in editors such as Atom, Neovim, and Helix.
Static code analysis tools.
Code formatting utilities.
Language Server Protocol (LSP) implementations.
Case study: Multi‑language SQL extraction
Goal: extract the SQL statements embedded in Java, Python, and XML (MyBatis mapper) files.
1. Building parsers for each language
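from tree_sitter import Language, Parser

# Build the language library.
# Note: Language.build_library and the two-argument Language(path, name)
# constructor are not available in py-tree-sitter 0.22 and later.
# Each path must point at a directory that contains the grammar's src/ folder.
Language.build_library(
    './languages.so',
    [
        './tree-sitter-java',
        './tree-sitter-python',
        './tree-sitter-xml'
    ]
)

# Load each language
JAVA_LANGUAGE = Language('./languages.so', 'java')
PYTHON_LANGUAGE = Language('./languages.so', 'python')
XML_LANGUAGE = Language('./languages.so', 'xml')
2. Java SQL extraction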
SQL may appear in MyBatis annotations, JPA/Hibernate native queries, or string concatenations. The following function walks the Java AST, detects annotation nodes and method invocations, and extracts the SQL text.
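The walker that follows relies on two lookup sets, MYBATIS_ANNOTATIONS and HIBERNATE_METHODS, that the article does not define. The sketch below shows what they might contain, assuming MyBatis's @Select/@Insert/@Update/@Delete annotations and the createQuery, createNativeQuery, and createSQLQuery methods. The helper join_java_string_literals is hypothetical (it is not from the article): it shows one way an argument extractor could turn the source text of an annotation or call argument into a single SQL string.

```python
import re

# Assumed contents; the original article does not list them.
MYBATIS_ANNOTATIONS = {"Select", "Insert", "Update", "Delete"}
HIBERNATE_METHODS = {"createQuery", "createNativeQuery", "createSQLQuery"}

# Matches one double-quoted Java string literal, allowing backslash escapes.
_JAVA_STRING = re.compile(r'"((?:[^"\\]|\\.)*)"')


def join_java_string_literals(argument_source: str) -> str:
    """Concatenate every string literal found in a Java argument list.

    Hypothetical helper: it covers "a" + "b" style concatenation but skips
    any operand that is not a literal (variables, method calls).
    """
    parts = [m.group(1) for m in _JAVA_STRING.finditer(argument_source)]
    sql = "".join(parts).replace('\\"', '"')
    # Collapse whitespace left over from multi-line concatenation.
    return re.sub(r"\s+", " ", sql).strip()


# Usage: the text of an argument list as it appears in the Java source.
print(join_java_string_literals('("SELECT * FROM t " + "WHERE id = ?")'))
# prints: SELECT * FROM t WHERE id = ?
```

Because non-literal operands are skipped, a query assembled partly from variables comes back incomplete; resolving those would need data-flow analysis on top of the syntax tree.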
def extract_java_sql(node, results):
    # Handle annotation SQL
    if node.type == 'annotation':
        name_node = node.child_by_field_name('name')
        if name_node and name_node.text.decode('utf-8') in MYBATIS_ANNOTATIONS:
            # Extract SQL from annotation arguments
            extract_annotation_arguments(node, results)
    # Handle method invocation SQL
    elif node.type == 'method_invocation':
        method_name = node.child_by_field_name('name')
        if method_name and method_name.text.decode('utf-8') in HIBERNATE_METHODS:
            # Extract SQL from method arguments
            extract_method_arguments(node, results)
    # Recursively process children
    for child in node.children:
        extract_java_sql(child, results)
3. Python SQL extraction strategy
SQL in Python usually appears as strings, possibly spanning multiple lines, concatenated, or added via conditional appends. The extractor checks string nodes, identifies SQL content, and also recognises patterns like conditions.append("AND name = ?").
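The extractor that follows calls is_sql_content and clean_sql without defining them. A minimal sketch of both, assuming a keyword-based heuristic is good enough; the regular expressions and the extra helper strip_python_quotes are illustrative assumptions, not the article's implementation. The text of a string node includes its prefix and quotes, so those are removed first.

```python
import re

# Statement keywords that, at the start of a string, suggest it holds SQL.
_SQL_START = re.compile(
    r"^\s*(SELECT|INSERT\s+INTO|UPDATE|DELETE\s+FROM|WITH|CREATE|ALTER|DROP)\b",
    re.IGNORECASE,
)
# Optional prefix letters (f, r, b, u and pairs of them), then the opening quotes.
_QUOTES = re.compile(r"^[A-Za-z]{0,2}(\"\"\"|'''|\"|')")


def strip_python_quotes(literal: str) -> str:
    """Remove the prefix and surrounding quotes from a string literal's source text."""
    match = _QUOTES.match(literal)
    if not match:
        return literal
    quote = match.group(1)
    body = literal[match.end():]
    if body.endswith(quote):
        body = body[: -len(quote)]
    return body


def is_sql_content(literal: str) -> bool:
    """Heuristic: does the literal start with a SQL statement keyword?"""
    return bool(_SQL_START.match(strip_python_quotes(literal)))


def clean_sql(literal: str) -> str:
    """Drop the quotes and collapse whitespace so multi-line SQL fits on one line."""
    return re.sub(r"\s+", " ", strip_python_quotes(literal)).strip()
```

A keyword heuristic of this kind produces false positives (for example, an English sentence that starts with "Update"), and it deliberately ignores fragments such as "AND name = ?", which is why those are handled through the separate conditions.append branch.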
def extract_python_sql(node, code_bytes, results):
    # Handle string literals that contain SQL
    if node.type == 'string':
        text = code_bytes[node.start_byte:node.end_byte].decode('utf-8')
        if is_sql_content(text):
            results.append(clean_sql(text))
    # Handle the conditions.append(...) pattern
    elif node.type == 'call':
        function_node = node.child_by_field_name('function')
        # node.text is bytes, so the suffix must be a bytes literal
        if function_node and function_node.text.endswith(b'conditions.append'):
            extract_append_conditions(node, results)
    # Recursively process children
    for child in node.children:
        extract_python_sql(child, code_bytes, results)
4. XML MyBatis mapper processing
MyBatis XML mapper files contain SQL inside select, insert, update, and delete tags, often mixed with dynamic tags. The function walks the XML AST, extracts those tags, simplifies the SQL, and stores the result.
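The function that follows depends on SQL_TAGS and simplify_mybatis_sql, neither of which the article defines. One possible sketch, under the assumption that "simplifying" means inlining <include refid="..."/> fragments from sql_definitions, dropping dynamic tags such as <if> and <foreach> while keeping their bodies, and decoding XML entities. The regular expressions are illustrative and do not cover every mapper construct.

```python
import html
import re

# Assumed contents; the original article does not list them.
SQL_TAGS = {"select", "insert", "update", "delete"}

_INCLUDE = re.compile(r'<include\s+refid="([^"]+)"\s*/>')
_ANY_TAG = re.compile(r"</?[A-Za-z][^>]*>")
_CDATA = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def simplify_mybatis_sql(content: str, sql_definitions: dict) -> str:
    """Reduce the body of a mapper statement to plain, single-line SQL text."""
    # Inline reusable <sql id="..."> fragments referenced through <include>.
    content = _INCLUDE.sub(lambda m: sql_definitions.get(m.group(1), ""), content)
    # Drop dynamic tags (<if>, <foreach>, ...) but keep the text inside them.
    content = _ANY_TAG.sub(" ", content)
    # Unwrap CDATA sections, keeping their text.
    content = _CDATA.sub(lambda m: m.group(1), content)
    # Decode entities such as &gt; and collapse whitespace.
    return re.sub(r"\s+", " ", html.unescape(content)).strip()


body = """
    SELECT <include refid="cols"/> FROM users WHERE status = 1
    <if test="name != null">AND name = #{name}</if>
    <if test="age != null">AND age &gt; #{age}</if>
"""
print(simplify_mybatis_sql(body, {"cols": "id, name"}))
# prints: SELECT id, name FROM users WHERE status = 1 AND name = #{name} AND age > #{age}
```

Keeping every conditional branch gives the widest form of the statement, which suits inventory or audit use; it does not reproduce what MyBatis would execute for a given set of parameters, and tags such as <where> or <set> that generate keywords are simply dropped here.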
def extract_xml_sql(node, results, sql_definitions):
    # Extract SQL operation tags (select, insert, update, delete).
    # Assumes the XML grammar exposes the tag name through a 'name' field on
    # 'element' nodes; adjust the lookup to the node types of the grammar in use.
    if node.type == 'element' and node.child_by_field_name('name'):
        tag_name = node.child_by_field_name('name').text.decode('utf-8')
        if tag_name.lower() in SQL_TAGS:
            content = extract_element_content(node)
            simplified_sql = simplify_mybatis_sql(content, sql_definitions)
            results.append(simplified_sql)
    # Recursively process children
    for child in node.children:
        extract_xml_sql(child, results, sql_definitions)
Conclusion
Tree-sitter is a powerful parsing tool suitable for scenarios that require deep code‑structure analysis. Using it, developers can build efficient, accurate multi‑language SQL extraction utilities, addressing common pain points in real projects. Whether building editors, static analysis tools, or custom code‑processing pipelines, Tree-sitter is worth considering.
Network Intelligence Research Center (NIRC)
NIRC is based on the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.
