R for Fine‑Grained Data Operations: Engineering Practices and Performance at Meituan

Meituan’s in‑store dining team demonstrates how R’s open‑source packages, powerful data manipulation, rich visualization libraries, and reproducible reporting can be engineered into scalable, parallelized workflows that turn secondary data processing into fast, interactive dashboards and analytics, proving R’s enterprise‑grade performance and adoption.

Meituan Technology Team

Aug 2, 2018

In recent years, distributed data‑processing technologies such as Hive, Spark, Kylin, Impala and Presto have made large‑scale data computation and storage a reality. Data‑warehouse and business‑analysis units have become standard in many enterprises, and the ability to extract value from data through refined data‑operation practices is now a key success factor for data teams.

Data presentation is the final and critical step of the data‑to‑insight pipeline. Compared with bare tables of numbers, well‑organized charts convey information more quickly and intuitively, and so provide better decision support. Producing them often involves extensive secondary data processing.

The R language has unique advantages in these scenarios. This article, based on the fine‑grained data‑operation practice of Meituan’s in‑store dining technology department, introduces R’s engineering capabilities for data analysis and visualization, aiming to inspire the community and invite further suggestions.

R’s advantages for data operations

Free, open‑source, extensible: as of 2018‑08‑02 the CRAN repository hosts over 12,800 packages covering Bayesian analysis, operations research, finance, genomics, genetics, etc.

Programmable: R is an interpreted language, so every analysis step can be scripted, and it interoperates with Python and Java through packages such as rPython and rJava.

Powerful data manipulation: access to MySQL, Spark, and Elasticsearch via RMySQL, SparkR, and elastic; secondary processing with sqldf, tidyr, dplyr, and reshape2; visualization with ggplot2, plotly, and dygraphs; statistical analysis (linear regression, ANOVA, PCA, etc.).

Service framework: web applications via shiny, service‑oriented architecture via Rserve.

While both Python and R are viable choices for data‑centric applications, R excels in statistical research and data analysis, whereas Python is more oriented toward engineering environments.

R’s data‑processing, visualization, and reproducible analysis capabilities

For analysts or developers with programming skills, R can satisfy “write once, use forever” requirements while offering flexible adjustments and rich graphics. The following sections detail these capabilities.

Data processing

Enterprise data systems often use Hive, Spark, Kylin, etc., for initial cleaning and integration. R typically works on the resulting data sets, but the data still needs secondary processing at the query layer. Packages such as RMySQL and elastic enable direct access to MySQL and Elasticsearch. When newer technologies (e.g., Kylin) lack R connectors, developers can wrap their APIs in Python or Java and call them from R via rPython or rJava. Within R, packages such as sqldf, reshape2, and stringr, together with base functions, provide comprehensive data‑manipulation capabilities.
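As a minimal sketch of this kind of secondary processing, the example below aggregates a small result set and pivots it into the wide layout a chart or report table might consume. The orders data frame and its columns are invented for illustration, and dplyr and tidyr are only one possible choice; sqldf or reshape2 could do the same job.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for a result set returned by the query layer
orders <- data.frame(
  city    = c("Beijing", "Beijing", "Shanghai", "Shanghai", "Shanghai"),
  channel = c("app", "web", "app", "app", "web"),
  amount  = c(120, 80, 200, 50, 70)
)

# Sum the amount per city and channel, then spread channels into columns
summary_wide <- orders %>%
  group_by(city, channel) %>%
  summarise(total = sum(amount), .groups = "drop") %>%
  pivot_wider(names_from = channel, values_from = total)

print(summary_wide)
```

The result has one row per city and one column per channel, which is usually the shape a plotting or reporting function expects.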

Data visualization

R’s greatest strength lies in its graphics system. Three main visualization stacks are supported:

Built‑in system: base, grid, lattice for simple plots.

ggplot2: a grammar‑of‑graphics implementation that enables highly customizable layered graphics. As of 2018‑08‑02, CRAN hosts 40 ggplot2 extensions.

htmlwidgets for R: a bridge between front‑end JavaScript visualizations and data engineers, with over 100 packages on CRAN by 2018‑08‑02.
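To make the second and third stacks concrete, here is a small sketch that builds a layered ggplot2 chart and then converts it into an interactive htmlwidget with plotly's ggplotly(). The built‑in iris data set and the choice of layers are only for illustration.

```r
library(ggplot2)
library(plotly)

# Layered ggplot2 chart: points plus a per-species linear trend line
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Sepal length", y = "Petal length")

# Convert the static chart into an interactive htmlwidget
widget <- ggplotly(p)
```

Printing p draws the static chart, while printing widget opens the interactive version in the RStudio viewer or a browser.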

Meituan’s data team has built a library of reusable visualization components. For example, the four‑quadrant matrix chart can be generated with a single line of code:

vis_4quadrant(iris, 'Sepal.Length', 'Petal.Length', label = 'Species', tooltip = 'tooltip', title = '', xtitle = 'Sepal length', ytitle = 'Petal length', pointSize = 1, annotationSize = 1)

The underlying function is declared as follows:

vis_4quadrant <- function(df, x, y,
  label = '', tooltip = '', title = '', xtitle = '', ytitle = '',
  showLegend = TRUE, jitter = TRUE, centerType = 'mean',
  pointShape = 19, pointSize = 5, pointColors = collocatcolors2,
  lineSize = 0.4, lineType = 'dashed', lineColor = 'black',
  annotationFace = 'sans serif', annotationSize = 5, annotationColor = 'black', annotationDeviationRatio = 15,
  gridAnnotationFace = 'sans serif', gridAnnotationSize = 6, gridAnnotationColor = 'black', gridAnnotationAlpha = 0.6,
  titleFace = 'sans serif', titleSize = 12, titleColor = 'black',
  xyTitleFace = 'sans serif', xyTitleSize = 8, xyTitleColor = 'black',
  gridDesc = c('Zone A', 'Zone B', 'Zone C', 'Zone D'), dataMissingInfo = 'Data incomplete', renderType = 'widget') {
  # drawing code omitted
}
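Since the drawing code is omitted, the following is only a rough sketch of how a four‑quadrant chart could be drawn with ggplot2. quadrant_sketch() is a hypothetical, heavily simplified stand‑in that splits the plane at the mean of each axis; it is not Meituan's implementation and ignores most of the styling parameters declared above.

```r
library(ggplot2)

# Hypothetical, simplified stand-in for vis_4quadrant(); not the original code
quadrant_sketch <- function(df, x, y, label, xtitle = x, ytitle = y) {
  # Split the plane at the mean of each axis (compare centerType = 'mean')
  x_center <- mean(df[[x]], na.rm = TRUE)
  y_center <- mean(df[[y]], na.rm = TRUE)

  ggplot(df, aes(x = .data[[x]], y = .data[[y]], colour = .data[[label]])) +
    geom_point(size = 1) +
    geom_vline(xintercept = x_center, linetype = "dashed") +
    geom_hline(yintercept = y_center, linetype = "dashed") +
    labs(x = xtitle, y = ytitle, colour = label)
}

p <- quadrant_sketch(iris, "Sepal.Length", "Petal.Length", "Species",
                     xtitle = "Sepal length", ytitle = "Petal length")
```

Wrapping the chart in a function with column names passed as strings is what allows a one‑line call per data set, at the cost of a long parameter list once styling options are exposed.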

Reproducible analysis

RStudio, together with rmarkdown and knitr, provides a literate‑programming workflow that can render reports to HTML, PDF or Word. Combined with flexdashboard, it lets developers produce highly customized, interactive dashboards using HTML, CSS and JavaScript, greatly reducing manual effort and accelerating delivery.
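A minimal sketch of that workflow, driven from a script rather than the RStudio Knit button, might look like the following. The report title, file name, and chunk content are placeholders; the R Markdown source is assembled as a character vector only so the example stays self‑contained.

```r
library(rmarkdown)

# Build a tiny R Markdown source; the chunk fence is generated with strrep()
# so this listing does not contain a literal run of backticks
fence <- strrep("`", 3)
rmd_lines <- c(
  "---",
  "title: \"Weekly report (hypothetical)\"",
  "output: html_document",
  "---",
  "",
  "Average sepal length by species:",
  "",
  paste0(fence, "{r, echo=FALSE}"),
  "aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)",
  fence
)

rmd_path <- file.path(tempdir(), "report_sketch.Rmd")
writeLines(rmd_lines, rmd_path)

# Rendering needs pandoc (bundled with RStudio); skip if it is not installed
if (pandoc_available()) {
  html_path <- render(rmd_path, quiet = TRUE)
}
```

Because the report is plain text plus code, rerunning render() on fresh data regenerates the whole document. Setting the output field to flexdashboard::flex_dashboard (with flexdashboard installed) would lay out the same chunks as a dashboard instead.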

R service transformation

R can run on Linux servers and be embedded in enterprise reporting or data‑mining systems. A typical service architecture places R as the analysis engine, while Java or other system languages handle caching, security, and permission control.

Parallel processing with foreach + doParallel

Because R is single‑threaded, heavy computations are off‑loaded to distributed engines (Hive, Kylin) or parallelized on multi‑core machines using the doParallel + foreach pattern:

library(doParallel)
library(foreach)
registerDoParallel(cores = detectCores())

# Placeholder steps; the bodies are omitted
vis_process1 <- function() {
  # visualization step 1 ...
}
vis_process2 <- function() {
  # visualization step 2 ...
}

data_process1 <- function() {
  # data processing step 1 ...
}
data_process2 <- function() {
  # data processing step 2 ...
}

# Run the four steps in parallel; foreach returns results in input order
processes <- c('vis_process1', 'vis_process2', 'data_process1', 'data_process2')
process_res <- foreach(i = seq_along(processes), .packages = c('magrittr')) %dopar% {
  do.call(processes[i], list())
}

vis_process1_res <- process_res[[1]]
vis_process2_res <- process_res[[2]]
data_process1_res <- process_res[[3]]
data_process2_res <- process_res[[4]]
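For readers who want to try the pattern end to end, here is a small self‑contained variant. It is a sketch under assumptions: the two steps are trivial stand‑ins, the worker count is fixed at 2, and an explicit cluster is created and shut down, which the snippet above does not do.

```r
library(doParallel)  # also attaches foreach and parallel

# Two workers keep the sketch light; tune this to the machine
cl <- makeCluster(2)
registerDoParallel(cl)

# Hypothetical stand-ins for the real processing steps
steps <- list(
  row_count  = function() nrow(iris),
  mean_sepal = function() mean(iris$Sepal.Length)
)

# Each worker receives one function and calls it; order matches the input
results <- foreach(step = steps) %dopar% step()
names(results) <- names(steps)

# Release the worker processes once the results are collected
stopCluster(cl)
```

Naming the result list lets later code refer to results$row_count instead of a positional index, which is less fragile when steps are added or reordered.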

Rendering performance

Performance tests on a 4‑core, 8 GB Linux machine (2.20 GHz) show an average single‑application rendering time of about 0.74 s, with most cases completing within a second. Parallelism improves throughput only up to the number of CPU cores.

Practical adoption at Meituan

Since 2015, Meituan’s in‑store dining data team has increasingly adopted R for internal dashboards, management reports, data‑warehouse governance tools, and client‑facing analytics products. The team maintains a library of reusable visualization and analysis components and has built an ETL‑dependency visualizer using R.

Conclusion

R serves as a powerful technical lever in enterprise data‑operation practice. Although historically driven by statisticians, recent industry support (e.g., Microsoft’s acquisition of Revolution Analytics, integration of R into SQL Server and Visual Studio) indicates growing recognition of R’s value in large‑scale data analysis.

