Weiflow: A Scalable Machine Learning Workflow Framework for Sina Weibo
The article introduces Weiflow, a dual‑layer DAG‑based machine‑learning workflow framework designed for Sina Weibo, and explains how its modular XML configuration, Scala implementation, and integration with Spark, TensorFlow, Hive, Storm, and Flink improve development efficiency, scalability, and execution performance across the entire ML pipeline.
Weiflow was created to address the low efficiency of end‑to‑end ML development at Sina Weibo, where data preparation, feature engineering, and model evaluation consume about 80% of the total pipeline time.
Weiflow abstracts the complex ML pipeline into a configurable XML‑based directed acyclic graph (DAG) composed of reusable modules such as Node, Input, Process, and Output. The outer DAG selects the most suitable execution engine (Spark, TensorFlow, Hive, Storm, Flink) for each node, while the inner DAG implements engine‑specific optimizations.
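The outer layer described above can be pictured as a small graph of engine-tagged nodes that is run in dependency order. The following is a minimal sketch under stated assumptions, not Weiflow's actual data model: the names `Node`, `Engine`, `executionOrder`, and the sample `workflow` (including which engine each step uses) are all hypothetical.

```scala
// Hypothetical sketch of an outer DAG; names are illustrative, not Weiflow's API.
object OuterDagSketch {
  // Execution engines named in the article.
  sealed trait Engine
  case object Spark extends Engine
  case object TensorFlow extends Engine
  case object Hive extends Engine
  case object Storm extends Engine
  case object Flink extends Engine

  // One node of the outer DAG: an id, the engine chosen for it,
  // and the ids of the nodes it depends on.
  final case class Node(id: String, engine: Engine, dependsOn: List[String])

  // Depth-first topological sort: every node appears after its dependencies.
  // Throws IllegalArgumentException on a cycle or an unknown dependency.
  def executionOrder(nodes: List[Node]): List[Node] = {
    val byId = nodes.map(n => n.id -> n).toMap
    val done = scala.collection.mutable.LinkedHashSet.empty[String]
    val inProgress = scala.collection.mutable.Set.empty[String]

    def visit(id: String): Unit = {
      if (!done.contains(id)) {
        require(!inProgress.contains(id), s"cycle detected at node '$id'")
        val node = byId.getOrElse(
          id, throw new IllegalArgumentException(s"unknown node '$id'"))
        inProgress += id
        node.dependsOn.foreach(visit)
        inProgress -= id
        done += id
      }
    }

    nodes.foreach(n => visit(n.id))
    done.toList.map(byId)
  }

  // Illustrative workflow, deliberately declared out of order.
  val workflow: List[Node] = List(
    Node("train", TensorFlow, List("features")),
    Node("features", Spark, List("extract")),
    Node("extract", Hive, Nil)
  )
}
```

Calling `OuterDagSketch.executionOrder(OuterDagSketch.workflow)` yields the nodes in the order extract, features, train, so each step runs only after the data it depends on exists. The inner, engine-specific DAG is out of scope for this sketch.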
Each node operates independently: Input modules handle data ingestion (Parquet, ORC, JSON, CSV, Libsvm, etc.), Process modules implement custom business logic (statistics, cleaning, sampling, feature transformation), and Output modules write results (model files, evaluation metrics, AUC) back to storage, forming a closed loop.
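One way to read the Input, Process, and Output split is as three small interfaces that a node chains together. The sketch below assumes hypothetical trait names and a simplified record type (`Map[String, String]`); the article does not show Weiflow's real interfaces, and the in-memory classes stand in for real readers and writers.

```scala
object NodeModulesSketch {
  // Hypothetical module interfaces; a record is simplified to Map[String, String].
  type Record = Map[String, String]

  trait Input   { def read(): Seq[Record] }
  trait Process { def apply(rows: Seq[Record]): Seq[Record] }
  trait Output  { def write(rows: Seq[Record]): Unit }

  // A node chains one Input, zero or more Process steps, and one Output.
  final case class Node(input: Input, steps: List[Process], output: Output) {
    def run(): Unit = {
      val rows = steps.foldLeft(input.read())((acc, step) => step(acc))
      output.write(rows)
    }
  }

  // In-memory stand-in for a real reader (Parquet, CSV, ...).
  final class InMemoryInput(rows: Seq[Record]) extends Input {
    def read(): Seq[Record] = rows
  }

  // Cleaning step: drop rows whose "label" field is missing or empty.
  object DropUnlabeled extends Process {
    def apply(rows: Seq[Record]): Seq[Record] =
      rows.filter(_.get("label").exists(_.nonEmpty))
  }

  // In-memory stand-in for a real writer; keeps the last batch it received.
  final class CollectingOutput extends Output {
    var written: Seq[Record] = Seq.empty
    def write(rows: Seq[Record]): Unit = {
      written = rows
    }
  }
}
```

Because each module depends only on the record type, a cleaning or sampling step written once can be reused in any node, which is the reuse the article attributes to the modular design.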
Developers define the entire workflow in XML, specifying module dependencies; Weiflow parses the XML, builds the dual‑layer DAG, and at runtime uses Scala’s lazy evaluation, Call‑by‑Name, and reflection to instantiate the required classes and execute the pipeline.
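The three Scala mechanisms named above can be illustrated in isolation. This is a sketch, not Weiflow's code: `instantiate`, `runIfEnabled`, and `buffer` are hypothetical names, and `java.util.ArrayList` is only a stand-in for a module class whose name would, in a real workflow, come from the XML configuration.

```scala
object LazyReflectionSketch {
  // Reflection: create an instance from a class name known only at runtime.
  // Assumes the class has a public no-argument constructor; the cast is unchecked.
  def instantiate[A](className: String): A =
    Class.forName(className)
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[A]

  // Call-by-name: `compute` is not evaluated at the call site; it runs only
  // if `enabled` is true, so skipped work costs nothing.
  def runIfEnabled[A](enabled: Boolean)(compute: => A): Option[A] =
    if (enabled) Some(compute) else None

  // Counts how many times the lazy initializer below has actually run.
  var instantiations = 0

  // Lazy evaluation: the object is created on first access and then cached.
  // java.util.ArrayList is a stand-in for a configured module class.
  lazy val buffer: java.util.List[String] = {
    instantiations += 1
    instantiate[java.util.List[String]]("java.util.ArrayList")
  }
}
```

For example, `LazyReflectionSketch.runIfEnabled(enabled = false) { expensiveStep() }` returns `None` without ever calling the hypothetical `expensiveStep()`, and `buffer` triggers reflection only once no matter how often it is accessed.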
The framework leverages Scala’s functional features—currying, partial functions, case classes, pattern matching—to reduce overhead in high‑frequency operations such as the pickcat function, which maps string lists to indices. Data structures were refined from immutable arrays to HashMaps and from dense to sparse matrices, dramatically improving lookup and memory efficiency.
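The article does not give the signature of pickcat, so the following is only a sketch of the kind of change described: replacing a per-value array scan with a hash index built once, and emitting a sparse rather than dense encoding. The function names, the `SparseVector` case class, and the assumption that category names are distinct are all hypothetical.

```scala
object PickcatSketch {
  // Baseline: linear scan of the category array for every value, O(n) per lookup.
  // Values that are not known categories are dropped.
  def pickcatScan(categories: Array[String])(values: Seq[String]): Seq[Int] =
    values.map(v => categories.indexOf(v)).filter(_ >= 0)

  // Optimized: build a hash-based index once; each lookup is then an
  // expected O(1) probe. Returning a function lets the caller fix
  // `categories` once and reuse the lookup across many rows.
  // Assumes category names are distinct.
  def pickcat(categories: Array[String]): Seq[String] => Seq[Int] = {
    val index: Map[String, Int] = categories.zipWithIndex.toMap
    values => values.flatMap(v => index.get(v))
  }

  // Sparse encoding: store only the active positions instead of a
  // dense array of length `size` that is mostly zeros.
  final case class SparseVector(size: Int, indices: Seq[Int])

  def oneHotSparse(categories: Array[String]): Seq[String] => SparseVector = {
    val lookup = pickcat(categories)
    values => SparseVector(categories.length, lookup(values).distinct.sorted)
  }
}
```

With categories `Array("sports", "music", "tech")`, both lookup variants map `Seq("tech", "unknown", "sports")` to `Seq(2, 0)`; the difference is that the indexed version does not rescan the category array for every value, which matters when the function is called once per record.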
Performance benchmarks (Table 1) show that after optimization, Weiflow achieves more than a six‑fold increase in execution speed while also improving development ease‑of‑use and extensibility.
Overall, Weiflow enables business engineers to focus on model logic and feature selection rather than low‑level data processing, thereby accelerating model iteration and meeting real‑time online prediction requirements.
High Availability Architecture
Official account for High Availability Architecture.
