
How SPL’s High‑Performance Mode Supercharges Log Queries in the Cloud

Log data's immutable, random, multi-source nature makes traditional search inefficient. Alibaba Cloud's SLS addresses this with SPL, a pipeline language that combines Unix-style piping with SQL-like functions and leverages computation push-down, vectorized processing, and optimized I/O to deliver high-performance log queries at scale.

Alibaba Cloud Native

Background and Motivation

Observability relies heavily on log data, which is immutable, randomly generated, and originates from many diverse sources. Traditional keyword‑based search and plain SQL queries struggle to meet the growing demand for flexible, real‑time analysis of massive, semi‑structured logs.

Log Data Characteristics

Immutable: Once created, logs are never modified, preserving the original event.

Random: Events such as errors or user actions appear unpredictably.

Multiple Sources: Different services produce logs with heterogeneous schemas.

Complex Business Logic: Varying interpretations make it hard to anticipate future analysis needs.

Because a unified schema is rarely feasible, logs are stored in a Schema‑on‑Read ("Sushi Principle") fashion, keeping raw data for downstream processing.

SPL Overview

SPL (SLS Processing Language) provides a pipeline syntax similar to Unix pipes, in which a data source is followed by a series of <spl-expr> commands:

<data-source> | <spl-expr> | <spl-expr> | ...

Each <spl-expr> can perform regex extraction, field splitting, projection, numeric calculations, and more. The pipeline can be extended indefinitely, giving an interactive, exploratory experience.
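As an illustrative sketch, a pipeline might narrow a result set step by step, starting from an index filter and refining it with SPL stages (the field names Uri, Method, and Latency are hypothetical):

Status:200 and Method:GET | project Uri, Latency | extend latencyMs=cast(Latency as bigint) | where latencyMs > 500

Each stage only sees the output of the previous one, which is what makes the exploratory, Unix-pipe style of iteration possible.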

What SPL Can Do

Field Projection: project keeps only selected fields; project-away removes unwanted ones.

Real‑time Computation: extend creates new fields, e.g.

Status:200 | extend urlParam=split_part(Uri, '/', 3)

or with type casting:

Status:200 | extend timeRange=cast(BeginTime as bigint) - cast(EndTime as bigint)

Semi-structured Data Expansion: parse-json and parse-csv turn JSON/CSV strings into independent columns.
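For example, assuming a field Payload that holds a JSON string (the field name and its keys are hypothetical), parse-json can expand it before projection:

Status:200 | parse-json Payload | project userId, action

After expansion, keys of the JSON object become ordinary columns that later stages can filter or compute on.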

Limitations of Scan‑Mode SPL

Scan mode reads raw rows directly, which makes I/O inefficient and imposes a hard limit of 100,000 rows per scan. It also lacks advanced pagination, offers no histogram of the final filtered results, and incurs extra cost when scanning large volumes.

High‑Performance SPL Optimizations

The new high‑performance mode addresses these issues through:

Computation Push‑Down: WHERE predicates are evaluated on storage shards, reducing data transfer.

Vectorized Engine: A C++ SIMD engine on each shard filters rows in‑place, only passing matching logs to the next pipeline stage.

Index‑Based I/O: Field indexes (with statistics) enable fast data reads, similar to indexed SQL queries.

Early Termination: If the required result count is reached, remaining processing stops.

These layers form a “multi‑stage rocket” that accelerates filtering, indexing, vectorized computation, and final result generation.

Performance Benchmarks

Testing on a single shard with 100 million rows (10 shards total, 1 billion rows in the time range) shows:

High hit-rate (1%): query times range from 52 ms to 89 ms across three scenarios.

Very low hit-rate (0.0001%): times increase to several seconds, e.g., 2,826 ms for scenario 1.

Overall, high‑performance SPL can process billions of log entries within seconds, especially when hit‑rates are moderate to high.

Console and API Improvements

The console now displays a histogram of the filtered result distribution, and pagination uses a unified offset that refers to the filtered result set, simplifying API calls.

Example: From 10 million raw logs, Status:200 | where Category like '%xx%' yields 10 000 matching logs; the histogram shows the time distribution of those 10 000 entries.

Enabling High‑Performance SPL

No explicit mode switch is needed. If every column used in WHERE clauses has an index with statistics, SPL automatically runs in high‑performance mode; otherwise it falls back to scan mode.
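For example, assuming Category has a field index with statistics while RawBody does not (both field names hypothetical), the first query below would run in high-performance mode, while the second would fall back to scan mode:

Status:200 | where Category = 'payment'

Status:200 | where RawBody like '%error%'

Checking that every column referenced in where clauses is indexed with statistics is therefore the practical lever for staying in high-performance mode.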

Cost Considerations

High‑performance SPL itself incurs no extra fees, but if the query falls back to scan mode, the scanned raw data volume is billed.

Best Practices

Apply keyword index filters as the first pipeline stage.

Prefer SPL over SQL for fuzzy, phrase, regex, or JSON extraction scenarios to retain raw log view and histogram support.
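Applying the first practice, an index-backed keyword filter heads the pipeline so that later stages only see pre-filtered rows (the Level and Message fields are hypothetical):

Status:200 and Level:ERROR | where Message like '%timeout%' | project Time, Message

Putting the keyword filter first lets the storage shards discard non-matching rows via the index before any SPL computation runs.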

Future Roadmap

Upcoming SPL enhancements will add sorting and aggregation (output in table mode) to further extend its analytical capabilities.

Conclusion

By combining a Unix‑style pipeline with SQL‑like functions and leveraging push‑down, vectorized computation, and index‑based I/O, SPL’s high‑performance mode delivers fast, scalable log analysis on Alibaba Cloud SLS, supporting complex filters, real‑time histograms, and seamless pagination.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: cloud native, observability, high performance, SPL, vectorized computing, log query
Written by

Alibaba Cloud Native

We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
