
How Dataverse’s Notebook Supercharges Data+AI Development at Xiaohongshu

The article details Xiaohongshu’s Dataverse platform evolution into a Data+AI system, highlighting inefficiencies in algorithm and data‑science workflows, the introduction of an interactive notebook, comprehensive data lineage, AI‑coding assistance, and future DataAgent plans to automate data engineering tasks.

Xiaohongshu Tech REDtech

Sep 22, 2025


Dataverse Overview

Dataverse is Xiaohongshu’s data development and management platform. It serves thousands of weekly active users, and algorithm engineers and data scientists (DS) account for nearly half of its usage. Early on, their development work was slowed by three gaps: no interactive notebook, fragmented online/nearline/offline pipelines, and no end-to-end lineage, which made fault attribution difficult.

2025 Upgrade to Data+AI Platform

In 2025, Dataverse is evolving into a Data+AI platform. The upgrade launches a notebook, builds full-chain lineage, and adds a Copilot code assistant, with the goal of substantially improving iteration efficiency for algorithm engineers and DS.

Current Platform Capabilities

After two years of development, Dataverse supports cross-cloud, multi-cluster management and integrates with internal systems, offering a complete DataOps pipeline. It hosts 93,000 tasks in total, runs 400,000 scheduled instances per day, and has an internal NPS of 69%.

1.1 Inefficiencies in Algorithm and DS R&D

Python/Scala tasks cannot use production data for development and debugging. Developers must code locally, package JARs, upload, and re‑run on the platform, leading to low efficiency.

Lack of a notebook-style interactive environment. Model loading is time-consuming; an interactive notebook would allow cell-by-cell execution while keeping loaded state in memory between runs.

Unstable Python development environment. Private Jupyter services suffer from resource contention and long queues.

Jupyter supports Python but not SQL. Mixed Python‑SQL development is missing.

Insufficient cross‑platform connectivity. SQL results cannot be directly visualized without exporting to CSV or external tools.

1.2 Stability Risks Shifting to Effect‑Based Failures

In 2025, more than 50% of incidents that went through root-cause analysis (RCA) were effect-based, and data-lineage issues accounted for 27% of failures. Root causes include incomplete lineage across the online, nearline, and offline layers, a lack of pre-change data quality controls, inadequate monitoring, and insufficient long-term governance.

1.3 AI‑Coding Efficiency Gains

Large‑model code generation can improve developer productivity, but challenges remain: corporate code‑security policies forbid external assistants, limited SQL training data reduces recommendation quality, and existing assistants are editor plugins rather than web‑integrated tools.

2.1 Notebook Accelerates Algorithm and DS Iteration (100% Efficiency Gain)

What is Dataverse Notebook? An interactive computing environment supporting mixed code, text, and visualizations, enabling “explore‑analyze‑record‑share” workflows for data science and machine learning.

Differences from native Jupyter

Seamless integration with internal systems and Dataverse, allowing notebooks to be published as periodic tasks.

SQL cells automatically save results as temporary tables and create RedBI datasets for downstream analysis.

Data+AI Collaborative Development

SQL → Python: SQL cells produce DataFrames that Python cells can directly manipulate.

Python → SQL: Python output can be uploaded as temporary tables for subsequent SQL queries.

Python dynamically generates SQL for execution in SQL cells.
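The three hand-offs above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Dataverse implementation: an in-memory SQLite database stands in for the warehouse, a list of dicts stands in for a DataFrame, and the helper names run_sql_cell and upload_temp_table, along with the note_views table, are hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database plays the role of the warehouse.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE note_views (note_id INTEGER, category TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO note_views VALUES (?, ?, ?)",
    [(1, "food", 120), (2, "travel", 300), (3, "food", 80), (4, "beauty", 50)],
)


def run_sql_cell(sql, params=()):
    """SQL -> Python: return the result set as a list of dicts (DataFrame stand-in)."""
    return [dict(row) for row in conn.execute(sql, params)]


def upload_temp_table(name, rows):
    """Python -> SQL: write Python rows to a temporary table for later SQL cells."""
    if not name.isidentifier():
        raise ValueError(f"unsafe table name: {name!r}")
    columns = list(rows[0])
    conn.execute(f"CREATE TEMP TABLE {name} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO {name} VALUES ({placeholders})",
        [tuple(row[c] for c in columns) for row in rows],
    )


# 1. SQL -> Python: the query result becomes an ordinary Python object.
totals = run_sql_cell(
    "SELECT category, SUM(views) AS total_views FROM note_views GROUP BY category"
)

# 2. Python -> SQL: filter in Python, then expose the result to SQL again.
popular = [row for row in totals if row["total_views"] >= 100]
upload_temp_table("popular_categories", popular)

# 3. Python generates SQL: the metric column is chosen in Python.
metric = "views"
generated_sql = (
    f"SELECT v.note_id, v.{metric} FROM note_views AS v "
    "JOIN popular_categories AS p ON p.category = v.category "
    "ORDER BY v.note_id"
)
result = run_sql_cell(generated_sql)
print(result)
```

In a real notebook the temporary table would live in the warehouse rather than in the Python process, so the upload step would be a platform call instead of a local INSERT.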

SaaS Productization

Dedicated personal containers for each user ensure isolated, reproducible environments.

Personal cloud‑disk storage enables easy file upload and data comparison.

Customizable image environments allow developers to package required Python packages and Docker settings.

Notebook Impact

Monthly active users: 650, with 99% Spark user penetration.

4388 notebook tasks created, 264 of which are scheduled jobs.

Medium‑complexity Python tasks reduced from one week to 2‑3 days.

NPS improved from 61 to 67 for algorithm roles and from 37 to 63 for DS roles.

2.2 Data+AI Data Lineage Construction

The goal is end-to-end lineage across data ingestion, lake processing, real-time and offline pipelines, and online services. The lineage graph covers log-collection agents, CDC services, batch sync, Spark/Flink jobs, Redis/Redkv, Kafka, indexes, features, models, and experiments.

A Unified Catalog (similar to Gravitino) serves as the metadata hub, managing tables, columns, datasets, and model schemas. On top of it, a metadata center provides data dictionaries, lineage, and tags, and feeds a metadata warehouse used for governance.

Applications of lineage:

Visibility of task importance by propagating downstream service criticality upstream.

Pre‑change data quality testing and mandatory SQL scans for high‑priority tasks.

Real‑time monitoring, automated RCA via Radar, and health dashboards enforcing governance.
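The first application, propagating downstream service criticality upstream, amounts to a graph traversal over lineage edges. The sketch below is one possible way to do it, not the platform's actual algorithm; the node names, the priority scale (0 is most critical), and the function propagate_criticality are assumptions made for illustration.

```python
from collections import defaultdict, deque


def propagate_criticality(edges, service_levels):
    """Push each service's priority to everything upstream of it.

    edges: iterable of (upstream, downstream) lineage pairs.
    service_levels: node -> priority, where a lower number is more critical.
    Returns node -> most critical priority of any downstream consumer.
    """
    upstream_of = defaultdict(list)
    for up, down in edges:
        upstream_of[down].append(up)

    levels = dict(service_levels)
    queue = deque(service_levels)
    while queue:
        node = queue.popleft()
        for up in upstream_of[node]:
            # Update only on strict improvement, so cycles terminate.
            if up not in levels or levels[node] < levels[up]:
                levels[up] = levels[node]
                queue.append(up)
    return levels


# Hypothetical lineage: one table feeds both a P0 model and a P2 report.
edges = [
    ("ods_logs", "dwd_events"),
    ("dwd_events", "user_features"),
    ("user_features", "ranking_model"),
    ("dwd_events", "weekly_report"),
    ("ods_orders", "weekly_report"),
]
levels = propagate_criticality(edges, {"ranking_model": 0, "weekly_report": 2})
print(levels)
```

Here dwd_events inherits priority 0 because the ranking model depends on it, even though it also feeds a lower-priority report, while ods_orders stays at priority 2.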

3.1 AI for Data

Features include code completion and continuation for both SQL and Python, as well as code error correction with diff visualization.

Engineering pipeline includes parsing incomplete SQL, table recall, similar code retrieval via vector databases, and model evaluation to improve recommendation adoption rates.

Model training uses Qwen‑Coder 7B with LoRA fine‑tuning, leveraging annotated real‑world cases and synthetic data augmentation to boost code generation quality.
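For readers unfamiliar with LoRA, the idea is to freeze a weight matrix W and train only a low-rank update, so the layer computes h = Wx + (alpha / r) * B(Ax). The pure-Python sketch below illustrates that formula and the resulting parameter savings; the tiny matrices and the 4096 x 4096, r = 8 shape are illustrative assumptions, not the configuration used by the team.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """h = W x + (alpha / r) * B (A x).

    W (d x k) is frozen; A (r x k) and B (d x r) are the trainable factors.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def trainable_params(d, k, r):
    """Compare full fine-tuning of a d x k matrix with a rank-r LoRA update."""
    return {"full": d * k, "lora": r * (d + k)}


W = [[1, 0], [0, 1]]   # frozen base weights (d = 2, k = 2)
A = [[1, 1]]           # r = 1, k = 2
B = [[0.5], [0.0]]     # d = 2, r = 1
x = [2, 3]

h = lora_forward(W, A, B, x, alpha=2)
print(h)  # [7.0, 3.0]

# With B at its usual zero initialization, the adapted layer matches the base.
h_init = lora_forward(W, A, [[0.0], [0.0]], x, alpha=2)

counts = trainable_params(4096, 4096, 8)
print(counts)  # {'full': 16777216, 'lora': 65536}
```

For the illustrative shape, the low-rank factors hold 65,536 trainable values against 16,777,216 for the full matrix, which is why LoRA makes fine-tuning a 7B model on in-house SQL and Python cases comparatively cheap.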

Results: SQL Copilot adoption 27%, Python Copilot adoption 25.53%, and code‑error correction adoption 60%.

4.1 DataAgent Vision

Over the next 2‑3 years, DataAgent aims to automate low‑complexity data‑engineering tasks through three layers: analysis assistant, data‑retrieval assistant, and code generation assistant, ultimately delivering autonomous agents that turn data insights directly into business value.


