Big Data 13 min read

Large-Scale Recommendation System Feature Engineering and Optimization with Spark and FESQL

This article explains how large-scale recommendation systems rely on efficient feature engineering, describes the three-layer architecture (offline, stream, online), and details how Spark SQL and the LLVM‑optimized FESQL engine improve performance and ensure offline‑online feature consistency.

DataFunTalk
DataFunTalk
DataFunTalk
Large-Scale Recommendation System Feature Engineering and Optimization with Spark and FESQL

Feature engineering plays a crucial role in recommendation systems, and the efficiency of large‑scale feature processing directly impacts online performance. The talk introduces the three‑layer architecture of modern recommendation systems—offline batch processing, near‑real‑time streaming, and online serving—detailing the responsibilities of each layer.

The offline layer handles massive data preprocessing, feature extraction, and model training using Hadoop HDFS for storage and distributed engines such as Spark or MapReduce, with models built in TensorFlow, PyTorch, or MXNet. The streaming layer improves timeliness by ingesting near‑real‑time data via message queues and processing it with Flink, storing results in NoSQL or MySQL for online use. The online layer serves predictions by retrieving streaming features from storage and applying offline‑trained models.

For large‑scale recommendation feature extraction, two main tasks are highlighted: ETL (extract‑transform‑load) for data cleaning and format conversion, and feature extraction (e.g., discretization, embedding). Common tools include SQL/Python for moderate data sizes and Hadoop/Spark/Flink for massive datasets.

The article then focuses on Spark SQL and the fourth‑paradigm’s self‑developed FESQL engine. Spark is described as a fast, general‑purpose engine for big data, offering SparkSQL, PySpark, and MLlib, with built‑in Catalyst and Tungsten optimizations. However, Spark’s batch‑oriented driver‑executor model and JVM‑based execution limit its suitability for low‑latency online services and consistent offline‑online feature handling.

FESQL addresses these limitations by providing a unified SQL execution engine that guarantees feature consistency across offline and online environments. It replaces the JVM layer with an LLVM‑based JIT compiler that generates native binary code, achieving higher performance than Spark 3.0. The engine also includes node‑level optimizations (e.g., SimpleProject merging, code‑generated window operators) and extensive expression optimizations such as loop unrolling, constant folding, and vectorization.

Performance comparisons show that FESQL outperforms traditional Spark and even Databricks’ Photon in both read/write throughput and complex query execution, delivering up to six‑fold speedups in multi‑window scenarios. Future work includes an LLVM‑enabled Spark distribution, Docker/Notebook/Jar/Whl packaging, a Python‑like DSL for UDFs, and GPU acceleration via PTX.

big datafeature engineeringRecommendation systemsSQL OptimizationLLVMSparkFESQL
DataFunTalk
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.