
Designing a Platformized Jupyter Service Integrated with Spark for Meituan

Meituan Homestay created a platform‑wide Jupyter service built on JupyterHub and Kubernetes that integrates Spark, scheduling, documentation and storage, providing seamless, reproducible notebooks with custom extensions, magics and container isolation to unify data analysis, model training and production workflows.

Meituan Technology Team

Nov 21, 2019


Meituan Homestay built an internal “Kaggle Kernels” platform – a customized, platform‑wide Jupyter environment that integrates big‑data and distributed‑computing clusters to support business data analysis and algorithm development.

Background and problems: The existing internal systems (the Magic Number platform for SQL queries, the Collaborative platform for ETL, the Hosted platform for Spark jobs, and the Scheduling platform) work well for predefined tasks but offer no smooth, unified tooling for exploratory and analytical work. A typical workflow chains disjointed steps such as SQL → Excel download → visualization, which fragments tooling, makes large datasets hard to visualize, and leaves analyses non‑reproducible.

The team identified four key requirements for a new Jupyter‑based tool: seamless experience, consistent tooling, out‑of‑the‑box usability, and reproducible results that can be linked to scheduled jobs.

The desired Jupyter features were Spark integration, hooks into the scheduling system, links to the internal documentation system (Xuecheng), pre‑configured environments, and per‑user container isolation.

Architecture: The solution is built on JupyterHub on Kubernetes. Core components are JupyterLab (frontend), Jupyter Server (backend), Commuter (notebook browser), K8s (container orchestration), Cantor (Meituan’s scheduling system, similar to Airflow), the internal Hosted platform for Spark‑Submit, Xuecheng documentation, MSS object storage, and NB‑Runner (a notebook executor based on nbconvert). The architecture enables three main flows: sharing/reproducing notebooks, interactive execution, and scheduled execution.
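To make the JupyterHub-on-Kubernetes setup more concrete, here is a minimal sketch of how a hub could be told to start each user's server in its own pod. The source does not show Meituan's configuration; the use of KubeSpawner, the image name, the resource limits, and the helper name `configure_hub` are all assumptions for illustration.

```python
def configure_hub(c):
    """Apply illustrative hub settings to a JupyterHub config object `c`."""
    # Assumption: the kubespawner package is installed, so each user's
    # notebook server runs as a separate Kubernetes pod (user isolation).
    c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

    # Hypothetical image with Spark and common libraries pre-installed,
    # standing in for the "pre-configured environment" requirement.
    c.KubeSpawner.image = "registry.example.com/jupyter-spark:latest"

    # Illustrative per-user resource limits; real values are not in the source.
    c.KubeSpawner.cpu_limit = 2
    c.KubeSpawner.mem_limit = "4G"

    # Open JupyterLab rather than the classic notebook UI.
    c.Spawner.default_url = "/lab"
```

In a `jupyterhub_config.py`, where JupyterHub provides the config object as `c`, this would be applied with `configure_hub(c)`.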

Jupyter extensions:

JupyterLab extensions (TypeScript) add UI features such as share buttons and schedule triggers.

Notebook Server extensions (Python) provide custom HTTP handlers for backend operations.

Custom Kernels and IPython Magics: %sql for executing MySQL/Hive queries, %spark for launching Spark sessions, and other magics for convenience.

Key code snippets:

# Create a SparkSession, or reuse the one that already exists
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()

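# NB‑Runner example: execute a notebook with nbconvert and save the result
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

notebook_filename = 'analysis.ipynb'  # hypothetical input path

with open(notebook_filename, encoding='utf-8') as f:
    nb = nbformat.read(f, as_version=4)

# Run every cell with the named kernel, allowing up to 600 seconds per cell
ep = ExecutePreprocessor(timeout=600, kernel_name='python')
# Execute with notebooks/ as the working directory
ep.preprocess(nb, {'metadata': {'path': 'notebooks/'}})

with open('executed_notebook.ipynb', 'w', encoding='utf-8') as f:
    nbformat.write(nb, f)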
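A scheduled run is only useful if failures are noticed. The source does not show how NB‑Runner reports them, so the following is only a sketch under assumptions: it scans an executed notebook in nbformat v4 JSON for code-cell outputs whose `output_type` is `error`. The helper name `find_cell_errors` and the hand-written sample notebook are illustrative.

```python
import json


def find_cell_errors(nb_json):
    """Return (cell_index, ename, evalue) for every error output in a v4 notebook."""
    nb = json.loads(nb_json)
    errors = []
    for index, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells carry no outputs
        for output in cell.get("outputs", []):
            if output.get("output_type") == "error":
                errors.append((index, output.get("ename"), output.get("evalue")))
    return errors


# Hand-written sample standing in for an executed notebook file.
sample_notebook = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Report"},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "source": "1 / 0",
         "outputs": [{"output_type": "error", "ename": "ZeroDivisionError",
                      "evalue": "division by zero", "traceback": []}]},
    ],
})

print(find_cell_errors(sample_notebook))
```

Such a check matters mostly when execution runs with `allow_errors=True`; by default `ExecutePreprocessor` raises `CellExecutionError` at the first failing cell, and the caller can catch that instead.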

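# Abridged excerpt from PySpark's java_gateway.py: how PySpark finds or starts its JVM gateway
import os
from pyspark.find_spark_home import _find_spark_home

def launch_gateway(conf=None):
    """launch jvm gateway"""
    if "PYSPARK_GATEWAY_PORT" in os.environ:
        # A gateway is already running; connect to the port it advertises
        gateway_port = int(os.environ["PYSPARK_GATEWAY_PORT"])
    else:
        SPARK_HOME = _find_spark_home()
        # launch the Py4j gateway using Spark's run command ...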

%sql <var> [--preview] [--cache] [--quiet]
SELECT field1, field2
FROM table1
WHERE field3 == field4
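The first line of the `%sql` usage above names a result variable followed by optional flags. As a rough sketch of how such a line could be parsed (the platform's real implementation is not shown in the source), the snippet below uses only `argparse` and `shlex`. The function names are hypothetical, and because the source does not say what each flag does, the sketch only records whether it was set.

```python
import argparse
import shlex


def build_sql_magic_parser():
    """Build a parser for `<var> [--preview] [--cache] [--quiet]`."""
    parser = argparse.ArgumentParser(prog="%sql", add_help=False)
    # Name of the notebook variable that should receive the query result.
    parser.add_argument("var")
    # Flag semantics are not specified in the source; only presence is recorded.
    parser.add_argument("--preview", action="store_true")
    parser.add_argument("--cache", action="store_true")
    parser.add_argument("--quiet", action="store_true")
    return parser


def parse_sql_magic(line):
    """Parse the argument line that follows the magic name."""
    return build_sql_magic_parser().parse_args(shlex.split(line))


args = parse_sql_magic("daily_orders --preview --quiet")
print(args.var, args.preview, args.cache, args.quiet)
# prints: daily_orders True False True
```

Inside a real IPython magic, the parsed `var` would typically be used to store the query result in the user's namespace; that wiring is omitted here.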

%spark
[--conf <property-name>=<property-value>]
...
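The `--conf <property-name>=<property-value>` option shown for `%spark` can be repeated. The sketch below shows one way to collect such pairs into a dictionary; it is not the platform's code, the helper name `parse_spark_conf` is hypothetical, and the two properties in the example are ordinary Spark settings chosen for illustration.

```python
import argparse
import shlex


def parse_spark_conf(line):
    """Parse repeated `--conf key=value` options into a dict."""
    parser = argparse.ArgumentParser(prog="%spark", add_help=False)
    # action="append" collects every occurrence of --conf into a list.
    parser.add_argument("--conf", action="append", default=[], metavar="KEY=VALUE")
    args = parser.parse_args(shlex.split(line))

    conf = {}
    for item in args.conf:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(
                f"expected <property-name>=<property-value>, got {item!r}"
            )
        conf[key] = value
    return conf


conf = parse_spark_conf(
    "--conf spark.executor.memory=4g --conf spark.executor.instances=10"
)
print(conf)
# prints: {'spark.executor.memory': '4g', 'spark.executor.instances': '10'}
```

Each resulting pair could then be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`.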

Use cases demonstrated include:

Data analysis and visualization using %sql magics, with results stored as Pandas or Spark DataFrames and visualized via standard Python libraries.

One‑click notebook sharing to the internal Xuecheng documentation system.

Model training on large data using PySpark, XGBoost‑on‑Spark, LightGBM‑on‑Spark via %spark magics.

Online ranking strategy debugging with custom ipywidgets.

Conclusion and outlook: The platform unifies data analysis, data production, and model training in a single environment, integrates internal services, and supports the full workflow from exploration to production. Future work aims to evolve the platform into a cloud‑native integrated development environment for data science.

Authors: Wen Long and Ying Yi, engineers in Meituan Homestay R&D.
