
How SLS’s Schema‑on‑Read Scanning Boosts Log Analytics Flexibility and Cuts Costs

This article explains the motivation, design, and implementation of Alibaba Cloud's SLS Schema‑on‑Read scanning mode, showing how it enables SQL analysis on raw log data without pre‑built indexes, improves flexibility for evolving schemas, and reduces storage and index costs in various log‑analysis scenarios.

Alibaba Cloud Developer

Mar 16, 2023


Background

With the rapid acceleration of digital transformation and cloud‑native observability, enterprise log data volumes are exploding, covering application logs, Prometheus metrics, syslog, network logs, mobile telemetry, database binlogs, business events, billing records, IoT reports, and more. These logs are valuable for anomaly detection, fault diagnosis, security risk control, behavior audit, operational reporting, and user profiling.

Big Data Evolution and the Return of SQL

The rise of big‑data technologies (MapReduce, GFS, BigTable) reflected the limits of traditional relational databases in handling massive data volumes. However, critics such as Michael Stonebraker argued that MapReduce ignored three decades of database research: schemas, the separation of schema from application, and high‑level query languages. Modern big‑data engines (Hive, Spark SQL, Presto) have since re‑adopted SQL as a universal interface, while databases have evolved toward distributed and HTAP architectures.

Constraints of the Relational Model vs. Schema‑on‑Read

Relational databases require a predefined schema (tables, columns, types) and enforce it at write time, whereas document models are schema‑flexible. In log analytics, data is inherently weakly structured: sources are heterogeneous, events arrive unpredictably, and business requirements evolve, so a fixed schema becomes a bottleneck.

Two processing models exist:

Schema‑on‑Write: data is validated against a predefined schema before ingestion, and indexes and columnar stores are built, yielding high query performance.

Schema‑on‑Read: raw data is stored without validation; a schema is applied dynamically at read time, offering flexibility at the cost of some performance.
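
The contrast between the two models can be illustrated with a minimal Python sketch. It is not SLS code: the SCHEMA dictionary, the function names, and the sample JSON lines are all hypothetical, chosen only to show where validation happens in each model.

```python
import json

# Hypothetical schema for illustration: field name -> Python type.
SCHEMA = {"status": int, "path": str}

def ingest_schema_on_write(raw_line, table):
    """Validate and convert at write time; reject records that do not fit."""
    record = json.loads(raw_line)
    row = {}
    for name, typ in SCHEMA.items():
        if name not in record:
            raise ValueError(f"missing field: {name}")
        row[name] = typ(record[name])
    table.append(row)

def ingest_schema_on_read(raw_line, store):
    """Store the raw text untouched; no validation happens here."""
    store.append(raw_line)

def read_with_schema(store, schema):
    """Apply a schema only when the data is read; absent fields become None."""
    for raw_line in store:
        record = json.loads(raw_line)
        yield {name: (typ(record[name]) if name in record else None)
               for name, typ in schema.items()}

lines = ['{"status": "200", "path": "/a"}',
         '{"status": "500", "latency": "12"}']

# Schema-on-write: the second line lacks "path" and is rejected at ingestion.
table, rejected = [], []
for line in lines:
    try:
        ingest_schema_on_write(line, table)
    except ValueError:
        rejected.append(line)

# Schema-on-read: both lines are stored as-is.
store = []
for line in lines:
    ingest_schema_on_read(line, store)

# The reader chooses a schema later, including a field no writer declared.
late_rows = list(read_with_schema(store, {"status": int, "latency": int}))
```

In this sketch the write‑time path loses the second record because it does not fit the declared schema, while the read‑time path keeps everything and pays the parsing cost on every query.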

SLS Schema‑on‑Read Design and Implementation

Problem 1 – Executing SQL without a schema: The engine must know column names and types to execute a query. SLS solves this by automatically inferring the required schema from the SQL statement itself (e.g., parsing the SELECT list and assigning varchar as the default type).

Problem 2 – Reading columns from raw row‑store data: SLS scans the original logs, hashes the target column names, and extracts matching fields while filling missing fields with NULL. Optimizations include hash‑based matching, early termination, LRU caching, and affinity scheduling.

Together, these two mechanisms provide a standard SQL experience on raw logs comparable to that of Schema‑on‑Write.
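
The two mechanisms can be sketched roughly as follows. This is an illustrative approximation in Python, not the SLS implementation: infer_schema and scan_rows are hypothetical names, the regular expression only handles a plain comma‑separated SELECT list, the table name log is made up, and a Python set stands in for hash‑based matching. LRU caching and affinity scheduling are not modeled.

```python
import re

def infer_schema(sql):
    """Infer column names from a simple SELECT list; all default to varchar.

    Assumption: plain comma-separated identifiers only. Real SQL needs a parser.
    """
    match = re.search(r"select\s+(.*?)\s+from\b", sql,
                      flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("unsupported statement")
    names = [part.strip() for part in match.group(1).split(",")]
    return {name: "varchar" for name in names}

def scan_rows(raw_rows, schema):
    """Extract the requested fields from row-store records.

    Each raw row is a list of (key, value) pairs. Fields that are absent
    come back as None, which plays the role of SQL NULL here.
    """
    for row in raw_rows:
        out = dict.fromkeys(schema)   # every requested column starts as None
        pending = set(schema)         # hashed name lookup, O(1) per key
        for key, value in row:
            if key in pending:
                out[key] = value
                pending.discard(key)
                if not pending:       # early termination: all targets found
                    break
        yield out

schema = infer_schema("SELECT status, latency FROM log")
raw_rows = [
    [("status", "200"), ("path", "/a"), ("latency", "12")],
    [("path", "/b"), ("status", "500")],
]
result = list(scan_rows(raw_rows, schema))
```

Every extracted value stays a string, matching the varchar default, and the second row yields None for latency because that field never appears in it.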

Scanning Mode Overview

Scanning mode (also called “Scan mode”) removes the need for pre‑built indexes or columnar stores. Users either toggle a switch in the console or add set session mode=scan to the SQL statement. The standard SQL syntax is unchanged, but fields are treated as varchar and extracted on the fly. Key UI indicators:

"Analysis mode: Scan" – confirms the current mode.

"Scanned data volume" – shows the amount of raw data processed and billed.

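Because scanned fields are treated as varchar, numeric comparisons and aggregations presumably need an explicit conversion first. The following Python analogy (not SLS SQL; the latency field and its values are made up) shows how text ordering differs from numeric ordering when the conversion is skipped.

```python
# Hypothetical rows as a scan would return them: every value is a string.
rows = [{"latency": "9"}, {"latency": "10"}, {"latency": "250"}]

# Comparing raw strings is lexicographic, so "9" ranks above "250".
max_as_text = max(r["latency"] for r in rows)

# Converting first gives the numeric answer.
max_as_number = max(int(r["latency"]) for r in rows)

# The same pitfall applies to filters.
slow_as_text = [r["latency"] for r in rows if r["latency"] > "100"]
slow_as_number = [r["latency"] for r in rows if int(r["latency"]) > 100]
```

Here the text comparison reports "9" as the maximum and lets it through the filter, while the converted comparison keeps only "250".
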
Typical Use Cases

Scenario 1 – Uncertain schema: Multiple services write heterogeneous logs to the same Logstore, or logs are JSON/CSV with dynamic fields. Scanning mode allows ad‑hoc analysis without creating new indexes.

Scenario 2 – Write‑heavy, read‑light cost reduction: Only a small fraction of fields are frequently queried. By indexing only hot fields and using scan mode for the long tail, storage and index traffic are dramatically reduced.

Scenario 3 – Index‑less historical data: Data older than 30 days cannot be re‑indexed; scan mode can analyze such data directly.

Scenario 4 – Fields exceeding index length limits: Scan mode bypasses column‑store length restrictions, enabling analysis of very long text fields.

Cost Example

A hypothetical workload with 1 TB of daily logs shows that a strategy combining a 20% index with scan mode can cut daily costs from ¥486 to ¥183, while still supporting occasional deep‑dive queries.
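
The size of that saving can be worked out directly from the two daily figures. The sketch below uses only the ¥486 and ¥183 values quoted above; the 30‑day projection is an added assumption for illustration and does not come from the article.

```python
# Daily costs in yuan, taken from the example in the text.
full_index_cost = 486.0
hybrid_cost = 183.0

daily_saving = full_index_cost - hybrid_cost     # 303.0 yuan per day
saving_ratio = daily_saving / full_index_cost    # about 0.62, i.e. roughly 62%

# Assumption: a 30-day month with a constant workload.
DAYS_PER_MONTH = 30
monthly_saving = daily_saving * DAYS_PER_MONTH   # 9090.0 yuan

print(f"daily saving: {daily_saving:.0f} yuan ({saving_ratio:.1%})")
print(f"projected 30-day saving: {monthly_saving:.0f} yuan")
```

Under these figures the hybrid strategy saves ¥303 per day, a reduction of roughly 62%.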

Conclusion and Outlook

SLS’s Schema‑on‑Read scanning mode provides flexible, cost‑effective log analytics for scenarios where schemas are volatile or query volume is low. It complements, rather than replaces, the traditional index‑based mode, which remains preferable for high‑performance, high‑frequency analytics. Future work will focus on improving scan performance and further integrating storage‑compute innovations.




