How Small Companies Can Break Into Big Data Projects and Master High‑Concurrency Architecture
This article explores why small and medium enterprises struggle with big‑data adoption, proposes partnership‑based strategies to gain access to large datasets, and offers concrete technical roadmaps—including distributed storage, streaming pipelines, and query stacks—to help engineers practice high‑concurrency big‑data systems.
Breaking the Big Data Barrier for SMEs
SMEs typically lack both the large datasets and the resources needed to build independent big‑data platforms. To acquire big‑data capabilities, they can partner with data‑rich organizations, contribute services such as data cleaning, and gradually build internal expertise. Successful cases include a software firm that joined a provincial government big‑data project and, after two years, was able to win contracts independently, as well as domain‑specific examples (a medical‑device maker, a traffic‑video integrator) that repositioned themselves as data‑oriented service providers.
Technical Path for SME Engineers
High‑concurrency and big‑data workloads are best addressed with a distributed storage plus real‑time stream‑processing architecture. Incoming requests are buffered in a message queue and processed by stream nodes, which write to distributed stores before persisting to a scalable relational database.
Typical component stack
Message queue: Kafka or RocketMQ for request buffering
Stream processing: processors consume from the queue and write to distributed stores such as HBase, MongoDB, or Kudu
Final persistence: a horizontally scalable database such as MySQL Cluster or the NewSQL database TiDB
Read path: Nginx + Redis (second‑level cache) with read‑write splitting for MySQL, or TiDB; optional Elasticsearch for search workloads
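To make the write path concrete, below is a minimal sketch (Python, using the kafka-python client) of an application-side handler that buffers incoming article writes into a Kafka topic instead of hitting the database directly. The broker address, topic name, and record shape are assumptions for illustration, not part of the original design.

```python
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

# Assumed local broker and topic name; adjust for your environment.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",    # wait for replica acknowledgement, trading a little latency for durability
    linger_ms=5,   # small batching window to smooth high-concurrency bursts
)

def handle_article_write(article_id: str, title: str, body: str) -> None:
    """Buffer a write request in Kafka; downstream stream processors persist it."""
    event = {
        "article_id": article_id,
        "title": title,
        "body": body,
        "ts": time.time(),
    }
    # Keying by article_id keeps all edits of one article in a single partition,
    # so the stream processor sees them in order.
    producer.send("article-writes", key=article_id.encode("utf-8"), value=event)

if __name__ == "__main__":
    handle_article_write("42", "Hello", "First draft")
    producer.flush()  # block until buffered records are delivered
```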
Example: a blog‑article platform processes billions of daily article records. Writes are routed to a key‑value store for fast edits, while a review pipeline queues submissions via Kafka for AI‑driven filtering (sensitive words, duplication). After filtering, data is persisted to the relational store, reducing write pressure and achieving sub‑second response times.
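A matching review-pipeline consumer could look like the sketch below: it reads queued submissions from Kafka, applies a trivial sensitive-word check as a stand-in for the AI-driven filtering described above, and persists accepted records to a MySQL-compatible store such as TiDB. The topic name, table schema, and word list are illustrative assumptions.

```python
import json

import pymysql                   # pip install pymysql; TiDB speaks the MySQL protocol
from kafka import KafkaConsumer  # pip install kafka-python

SENSITIVE_WORDS = {"spamword", "bannedterm"}  # stand-in for a real AI/NLP filter

consumer = KafkaConsumer(
    "article-submissions",                     # assumed topic name
    bootstrap_servers="localhost:9092",
    group_id="review-pipeline",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

conn = pymysql.connect(host="127.0.0.1", port=4000,   # TiDB's default MySQL port
                       user="root", password="", database="blog")

def passes_review(article: dict) -> bool:
    text = (article.get("title", "") + " " + article.get("body", "")).lower()
    return not any(word in text for word in SENSITIVE_WORDS)

for msg in consumer:
    article = msg.value
    if not passes_review(article):
        continue  # rejected submissions never reach the relational store
    with conn.cursor() as cur:
        cur.execute(
            "REPLACE INTO articles (id, title, body) VALUES (%s, %s, %s)",
            (article["article_id"], article["title"], article["body"]),
        )
    conn.commit()
```

Because the relational store only sees traffic that has already been queued and filtered, its write pressure stays bounded even when the front-end request rate spikes.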
Hands‑on practice workflow
Identify a realistic business scenario within your organization.
Obtain a public dataset (e.g., from Kaggle) or generate synthetic data that mimics the scenario.
Perform ETL and data cleaning (a minimal synthetic-data and cleaning sketch follows this list).
Build a stack that includes at least: Kafka, Redis, a distributed store (HBase or Kudu), a horizontally scalable SQL database (TiDB or MySQL Cluster), and optionally ELK, MongoDB, or Hadoop.
Validate that the system provides:
ACID‑compliant transactions on the dataset.
Distributed storage across multiple nodes.
Sub‑second query and processing latency (a simple latency-check sketch also follows this list).
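For the dataset and ETL steps above, a minimal sketch of synthetic data generation plus basic cleaning (dropping malformed rows and deduplicating on article_id) might look like this; the record shape, file names, and row counts are arbitrary choices for illustration.

```python
import csv
import random
import string

def random_text(n: int) -> str:
    return "".join(random.choices(string.ascii_lowercase + " ", k=n))

def generate(path: str, rows: int = 100_000) -> None:
    """Generate synthetic article records that mimic the blog scenario."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["article_id", "title", "body"])
        for i in range(rows):
            # Reusing ids and blanking some titles gives the cleaning step work to do.
            article_id = random.randint(1, rows // 2)
            title = "" if i % 97 == 0 else random_text(30)
            writer.writerow([article_id, title, random_text(200)])

def clean(src: str, dst: str) -> None:
    """Drop malformed rows and deduplicate on article_id."""
    seen = set()
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=["article_id", "title", "body"])
        writer.writeheader()
        for row in reader:
            if not row["title"] or row["article_id"] in seen:
                continue
            seen.add(row["article_id"])
            writer.writerow(row)

if __name__ == "__main__":
    generate("articles_raw.csv")
    clean("articles_raw.csv", "articles_clean.csv")
```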
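As a simple check on the sub-second latency requirement, the harness below issues point queries through a Redis cache with a TiDB/MySQL fallback and reports the observed p99; host names, ports, table names, and the query mix are assumptions to adapt to your own stack.

```python
import random
import time

import pymysql  # pip install pymysql
import redis    # pip install redis

cache = redis.Redis(host="127.0.0.1", port=6379)
db = pymysql.connect(host="127.0.0.1", port=4000, user="root",
                     password="", database="blog")

def get_article(article_id: int) -> str | None:
    """Cache-aside read: try Redis first, fall back to the SQL store on a miss."""
    key = f"article:{article_id}"
    hit = cache.get(key)
    if hit is not None:
        return hit.decode("utf-8")
    with db.cursor() as cur:
        cur.execute("SELECT body FROM articles WHERE id = %s", (article_id,))
        row = cur.fetchone()
    if row:
        cache.setex(key, 300, row[0])  # keep the cached copy for five minutes
        return row[0]
    return None

def p99_latency(samples: int = 1000) -> float:
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        get_article(random.randint(1, 50_000))
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return latencies[int(samples * 0.99) - 1]

if __name__ == "__main__":
    print(f"p99 point-query latency: {p99_latency() * 1000:.1f} ms")
```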
Completing this exercise gives SME engineers practical experience with both OLTP and OLAP patterns in high‑concurrency environments, narrowing the skill gap with engineers from larger enterprises.