Overview of Data Lakes and the Open SPL Compute Engine
This article explains the concept and challenges of data lakes, describes the “impossible triangle” of storage, compute, and cost, and introduces the open‑source SPL engine that provides multi‑source, file‑based, high‑performance computing to overcome those limitations.
Data lakes were introduced to store massive raw data for analysis, addressing the limitations of traditional data warehouses that require predefined business questions and extensive ETL processes before data can be used.
Because raw data must be ingested quickly and kept in its original form, data lakes face a storage challenge: they must handle structured, semi‑structured, and unstructured data at scale, which has become feasible thanks to advances in hardware and cloud storage services.
The core difficulty is the “impossible triangle”: preserving full raw information, providing convenient computation, and keeping construction costs low are three goals that current technologies cannot satisfy at the same time.
The open‑source SPL (Structured Processing Language) engine is presented as a solution, offering an open and powerful compute capability that can operate directly on raw data from multiple sources, including files, without requiring prior ETL into a warehouse.
Key capabilities of SPL include:
Multi‑data‑source mixed computation (RDB, NoSQL, JSON/XML, CSV, Webservice, etc.).
Strong file‑based computation, giving files near‑database performance.
Complete compute power comparable to SQL, with a more agile syntax.
Direct access to source data, allowing on‑the‑fly calculations even when data is not yet synchronized to the lake.
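To make the idea of multi-source mixed computation concrete, here is a minimal sketch in plain Python (not SPL). It assumes two hypothetical sources, a CSV of orders and a JSON list of clients, and joins them in memory without loading either into a warehouse first. The sample data, field names, and the function total_by_client are illustrative assumptions, not part of SPL.

```python
import csv
import io
import json

# Hypothetical sample data standing in for two different sources.
ORDERS_CSV = """order_id,client_id,amount
1,c1,1200.0
2,c2,300.0
3,c1,2500.0
"""
CLIENTS_JSON = '[{"client_id": "c1", "name": "Acme"}, {"client_id": "c2", "name": "Globex"}]'


def total_by_client(orders_csv: str, clients_json: str) -> dict[str, float]:
    """Join CSV orders with JSON client records and sum amounts per client name."""
    # Index the JSON source by its key so each CSV row can be matched directly.
    names = {c["client_id"]: c["name"] for c in json.loads(clients_json)}
    totals: dict[str, float] = {}
    for row in csv.DictReader(io.StringIO(orders_csv)):
        name = names.get(row["client_id"], "unknown")
        totals[name] = totals.get(name, 0.0) + float(row["amount"])
    return totals


totals = total_by_client(ORDERS_CSV, CLIENTS_JSON)
print(totals)
```

The point of the sketch is only that the computation runs where the raw data already lives; a real engine would add cursors, type handling, and pushdown to each source.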
SPL expressions work directly on hierarchical data such as JSON. In the examples below, A1 and A2 refer to the results of earlier cells:
=json(file("/data/EO.json").read()) loads and parses a JSON file.
=A1.conj(Orders) concatenates the nested Orders records of every entry into one sequence.
=A2.select(Amount>1000 && Amount<=3000 && like@c(Client,"*s*")) filters on conditions.
=A2.groups(year(OrderDate);sum(Amount)) groups by year and aggregates.
=A1.new(Name,Gender,Dept,Orders.OrderID,Orders.Client,Orders.SellerId,Orders.Amount,Orders.OrderDate) builds a new dataset that combines the parent-level fields (Name, Gender, Dept) with order fields.
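For readers who do not know SPL, the flatten, filter, and group steps above can be approximated in plain Python. This is a rough analogue, not SPL itself. The sample records are hypothetical, and the filter assumes that like@c(Client,"*s*") means a case-insensitive "contains s" match.

```python
import json
from collections import defaultdict
from datetime import date

# Hypothetical document shaped like the nested JSON the expressions assume.
RAW = json.dumps([
    {"Name": "Ann", "Orders": [
        {"Client": "Sunrise", "Amount": 1500, "OrderDate": "2021-03-01"},
        {"Client": "Nova", "Amount": 2000, "OrderDate": "2021-07-15"},
    ]},
    {"Name": "Bob", "Orders": [
        {"Client": "Atlas", "Amount": 3000, "OrderDate": "2022-01-10"},
        {"Client": "Oasis", "Amount": 5000, "OrderDate": "2022-02-20"},
    ]},
])

employees = json.loads(RAW)

# Flatten: gather every entry's nested orders into one list.
orders = [order for emp in employees for order in emp["Orders"]]

# Filter: 1000 < Amount <= 3000 and Client contains "s", ignoring case.
selected = [
    o for o in orders
    if 1000 < o["Amount"] <= 3000 and "s" in o["Client"].lower()
]

# Group and aggregate: total Amount per order year.
totals_by_year = defaultdict(int)
for o in orders:
    totals_by_year[date.fromisoformat(o["OrderDate"]).year] += o["Amount"]
```

Each Python step takes several lines where the SPL version is a single expression, which is the agility the article is pointing at.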
For high‑performance file storage, SPL offers two formats: bin files, which support compression and segment‑based parallelism, and composite tables, which add columnar storage and min‑max indexes on top of segment‑based parallelism.
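The benefit of a min-max index can be sketched in a few lines of plain Python. The idea, under the assumption that each storage segment records the smallest and largest value it holds, is that a range query can skip any segment whose range cannot overlap the query. The Segment class and both functions are hypothetical illustrations and say nothing about SPL's actual file layout.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One block of a column, with the min and max recorded when it was written."""
    values: list[int]
    lo: int
    hi: int


def build_segments(column: list[int], size: int) -> list[Segment]:
    """Split a column into fixed-size segments and record each segment's range."""
    segments = []
    for start in range(0, len(column), size):
        chunk = column[start:start + size]
        segments.append(Segment(chunk, min(chunk), max(chunk)))
    return segments


def scan_range(segments: list[Segment], low: int, high: int) -> tuple[list[int], int]:
    """Return values in [low, high] plus the number of segments actually read."""
    hits, segments_read = [], 0
    for seg in segments:
        # Skip the segment when its range cannot overlap the query range.
        if seg.hi < low or seg.lo > high:
            continue
        segments_read += 1
        hits.extend(v for v in seg.values if low <= v <= high)
    return hits, segments_read


amounts = list(range(0, 1000))           # sorted data keeps segment ranges disjoint
segments = build_segments(amounts, 100)  # 10 segments of 100 values each
hits, segments_read = scan_range(segments, 250, 260)
```

Here only one of the ten segments is read. The saving depends on the data being ordered, or at least clustered, on the filtered column. On randomly ordered data most segments would overlap the query range.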
Parallel execution is simple: adding an @m option enables automatic multi‑CPU processing, and SPL includes parallel‑aware functions for reading, filtering, and sorting.
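The pattern behind segment-based parallelism is to split the data, run the same work on each segment, and merge the partial results. The plain-Python sketch below shows that structure. It is not SPL's implementation, and the function names are hypothetical. It uses a thread pool to stay short and portable. In CPython, threads do not speed up CPU-bound work, so a process pool would be the realistic choice for real multi-CPU gains.

```python
from concurrent.futures import ThreadPoolExecutor


def split_segments(rows: list[int], parts: int) -> list[list[int]]:
    """Cut rows into `parts` contiguous segments of near-equal size."""
    size, extra = divmod(len(rows), parts)
    segments, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        segments.append(rows[start:end])
        start = end
    return segments


def filtered_sum(segment: list[int]) -> int:
    """Per-segment work: keep amounts above 500 and add them up."""
    return sum(v for v in segment if v > 500)


def parallel_filtered_sum(rows: list[int], workers: int = 4) -> int:
    """Run the per-segment work concurrently, then merge the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(filtered_sum, split_segments(rows, workers)))
    return sum(partials)


amounts = list(range(1, 1001))
total = parallel_filtered_sum(amounts)
```

The merge step is trivial here because sums combine by addition. Operations such as sorting need a more careful merge.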
Advanced algorithms such as TopN are implemented as aggregations, so the data does not need a full sort: =file("data.ctx").create().cursor() creates a cursor, =A1.groups(;top(10,amount)) returns the top 10 orders by amount, and =A1.groups(area;top(10,amount)) returns the top 10 per area. According to the article, this delivers performance far beyond traditional warehouses.
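Why treating TopN as an aggregation helps can be shown with a small plain-Python sketch. It is an illustration of the general technique, not SPL's internal code. Instead of sorting everything and taking the first N rows, it keeps a small heap of the N best values seen so far in a single pass, and the same running state can be kept per group. The function names and sample rows are hypothetical.

```python
import heapq
from collections import defaultdict


def top_n(amounts, n):
    """Keep only the n largest values seen so far, in one pass (assumes n >= 1)."""
    heap = []  # min-heap holding the current top n
    for amount in amounts:
        if len(heap) < n:
            heapq.heappush(heap, amount)
        elif amount > heap[0]:
            heapq.heapreplace(heap, amount)
    return sorted(heap, reverse=True)


def top_n_by_group(rows, n):
    """Apply the same running top-n independently to each group key."""
    heaps = defaultdict(list)
    for area, amount in rows:
        heap = heaps[area]
        if len(heap) < n:
            heapq.heappush(heap, amount)
        elif amount > heap[0]:
            heapq.heapreplace(heap, amount)
    return {area: sorted(heap, reverse=True) for area, heap in heaps.items()}


rows = [("east", 50), ("west", 70), ("east", 90), ("west", 10), ("east", 30), ("west", 40)]
overall = top_n((amount for _, amount in rows), 2)
per_area = top_n_by_group(rows, 2)
```

Memory use stays proportional to N per group, and the input can be streamed from a cursor. That is what lets this approach work on data larger than memory.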
Overall, SPL allows data lakes to bypass the costly, serial workflow of import‑transform‑load‑model, enabling simultaneous data organization and usage, and making the previously impossible trade‑offs achievable.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.