How SLS’s Schema‑on‑Read Scanning Boosts Log Analytics Flexibility and Cuts Costs
This article explains the motivation, design, and implementation of Alibaba Cloud's SLS Schema‑on‑Read scanning mode, showing how it enables SQL analysis on raw log data without pre‑built indexes, improves flexibility for evolving schemas, and reduces storage and index costs in various log‑analysis scenarios.
Background
With digital transformation and cloud‑native observability accelerating, enterprise log volumes are growing rapidly. The data spans application logs, Prometheus metrics, syslog, network logs, mobile telemetry, database binlogs, business events, billing records, IoT reports, and more. These logs are valuable for anomaly detection, fault diagnosis, security risk control, behavior audit, operational reporting, and user profiling.
Big Data Evolution and the Return of SQL
The rise of big‑data technologies (MapReduce, GFS, BigTable) showed that traditional relational databases could not handle massive data. However, critics such as Michael Stonebraker argued that MapReduce ignored three decades of DB research: schemas, schema‑application separation, and high‑level query languages. Modern big‑data engines (Hive, Spark SQL, Presto) have therefore re‑adopted SQL as a universal interface, while databases evolve toward distributed and HTAP architectures.
Constraints of the Relational Model vs. Schema‑on‑Read
Relational databases require a predefined schema (tables, columns, types) and enforce it at write time, whereas document models are schema‑flexible. Log data is inherently weakly structured: sources are heterogeneous, events arrive unpredictably, and business requirements evolve, so a fixed schema becomes a bottleneck.
Two processing models exist:
Schema‑on‑Write: data is validated against a predefined schema before ingestion, and indexes and columnar stores are built at write time, yielding high query performance.
Schema‑on‑Read: raw data is stored without validation, and a schema is applied dynamically at read time, offering flexibility at the cost of some performance.
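The difference between the two models can be illustrated with a minimal Python sketch. It is not SLS code: the SCHEMA mapping, the field names, and the function names are hypothetical, and JSON lines stand in for raw logs.

```python
import json

# Hypothetical schema used only for this illustration.
SCHEMA = {"status": int, "latency_ms": float}

def ingest_schema_on_write(line, table):
    """Validate and convert at write time; reject rows that do not fit."""
    record = json.loads(line)
    row = {}
    for name, caster in SCHEMA.items():
        if name not in record:
            raise ValueError(f"missing field: {name}")
        row[name] = caster(record[name])
    table.append(row)

def ingest_schema_on_read(line, raw_store):
    """Store the raw line untouched; nothing is validated here."""
    raw_store.append(line)

def read_with_schema(raw_store, schema):
    """Apply a schema only at read time; absent or unparsable fields become None."""
    for line in raw_store:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            record = {}
        row = {}
        for name, caster in schema.items():
            try:
                row[name] = caster(record[name])
            except (KeyError, TypeError, ValueError):
                row[name] = None
        yield row
```

The write path of the first model pays the validation cost once and can reject bad data early. The second model accepts anything and defers both the cost and the error handling to every read.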
SLS Schema‑on‑Read Design and Implementation
Problem 1 – Executing SQL without a schema: The engine must know column names and types. SLS solves this by automatically inferring the required schema from the SQL statement itself (e.g., parsing the SELECT list and assigning varchar as the default type).
Problem 2 – Reading columns from raw row‑store data: SLS scans the original logs, hashes the target column names, and extracts matching fields while filling missing fields with NULL. Optimizations include hash‑based matching, early termination, LRU caching, and affinity scheduling.
Together, these two mechanisms provide a standard, Schema‑on‑Write‑like experience on top of raw logs.
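The two mechanisms described above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the SLS implementation: the regex handles only a plain "SELECT a, b FROM ..." column list, a log row is modeled as a list of (key, value) pairs, and the function names are hypothetical. LRU caching and affinity scheduling are not shown.

```python
import re

def infer_schema(sql):
    """Infer column names from a simple SELECT list; every column defaults to varchar.

    Only bare identifiers are supported; expressions and aliases are rejected.
    """
    match = re.search(r"select\s+(.*?)\s+from\b", sql, re.IGNORECASE | re.DOTALL)
    if not match:
        raise ValueError("unsupported statement")
    names = [part.strip() for part in match.group(1).split(",")]
    for name in names:
        if not re.fullmatch(r"[A-Za-z_]\w*", name):
            raise ValueError(f"unsupported select item: {name!r}")
    return {name: "varchar" for name in names}

def extract_columns(row, schema):
    """Pull the wanted columns out of one raw row.

    row is a list of (key, value) pairs. Membership is tested against a hash
    set, the scan stops as soon as every wanted column has been found, and
    columns that never appear stay None (the NULL fill).
    """
    wanted = set(schema)            # hash-based matching
    result = dict.fromkeys(schema)  # every column starts as None
    remaining = len(wanted)
    for key, value in row:
        if key in wanted and result[key] is None:
            result[key] = str(value)  # everything is treated as varchar
            remaining -= 1
            if remaining == 0:        # early termination
                break
    return result
```

For example, infer_schema("SELECT status, host FROM log") yields two varchar columns, and a row that lacks host comes back with host set to None instead of raising an error.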
Scanning Mode Overview
Scanning mode (also called “Scan mode”) removes the need for pre‑built indexes or columnar stores. Users toggle a switch in the console or add set session mode=scan to the SQL statement. The standard SQL syntax is unchanged, but every field is treated as varchar and extracted on the fly.
Key UI indicators:
"Analysis mode: Scan" – confirms the current mode.
"Scanned data volume" – shows the amount of raw data processed and billed.
Typical Use Cases
Scenario 1 – Uncertain schema: Multiple services write heterogeneous logs to the same Logstore, or logs are JSON/CSV with dynamic fields. Scanning mode allows ad‑hoc analysis without creating new indexes.
Scenario 2 – Write‑heavy, read‑light cost reduction: Only a small fraction of fields are frequently queried. By indexing only hot fields and using scan mode for the long tail, storage and index traffic are dramatically reduced.
Scenario 3 – Index‑less historical data: Data older than 30 days cannot be re‑indexed; scan mode can analyze such data directly.
Scenario 4 – Fields exceeding index length limits: Scan mode bypasses column‑store length restrictions, enabling analysis of very long text fields.
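The hybrid strategy in Scenario 2 implies a simple decision for each query: use the index when every referenced field is indexed, and fall back to scan mode otherwise. The sketch below shows that decision as a client‑side helper. It is an assumption for illustration only: in SLS the mode is chosen by the user through the console switch or the session setting, and the function and field names here are hypothetical.

```python
def choose_mode(query_fields, indexed_fields):
    """Pick 'index' when all referenced fields are indexed, otherwise 'scan'.

    Returns the mode together with the sorted list of fields that forced
    the fallback, so callers can see which fields are candidates for indexing.
    """
    missing = set(query_fields) - set(indexed_fields)
    if missing:
        return "scan", sorted(missing)
    return "index", []


# Hypothetical hot fields that have an index.
hot_fields = {"status", "latency"}
print(choose_mode(["status"], hot_fields))
print(choose_mode(["status", "trace_id"], hot_fields))
```

Tracking which fields repeatedly force the fallback is one way to decide when a long‑tail field has become hot enough to deserve an index.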
Cost Example
A hypothetical workload with 1 TB of daily logs shows that a strategy combining a 20% index with scan mode can cut daily costs from ¥486 to ¥183 (roughly 62% lower), while still supporting occasional deep‑dive queries.
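The saving implied by these two figures can be checked with a few lines of Python. Only the two daily totals come from the example above. The unit prices behind them are not given here, so the sketch does not attempt to reproduce them, and the 30‑day projection simply assumes the same workload every day.

```python
full_index_cost = 486.0  # yuan per day, full indexing (from the example)
hybrid_cost = 183.0      # yuan per day, 20% index plus scan (from the example)

daily_saving = full_index_cost - hybrid_cost
saving_ratio = daily_saving / full_index_cost
saving_30_days = daily_saving * 30  # assumes an unchanged daily workload

print(f"daily saving: {daily_saving:.0f} yuan")
print(f"relative saving: {saving_ratio:.1%}")
print(f"30-day saving: {saving_30_days:.0f} yuan")
```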
Conclusion and Outlook
SLS’s Schema‑on‑Read scanning mode provides flexible, cost‑effective log analytics for scenarios where schemas are volatile or query volume is low. It complements, rather than replaces, the traditional index‑based mode, which remains preferable for high‑performance, high‑frequency analytics. Future work will focus on improving scan performance and further integrating storage‑compute innovations.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
