
Turning Data into Oil: Building Scalable Big Data Pipelines and Secure Insights

The talk outlines how to treat raw data like crude oil: a robust big-data platform, entity-resolution techniques, secure data governance, and visualisation tools refine disparate sources into reusable assets that drive rapid business insights and operational efficiency.

dbaplus Community

Jan 21, 2022


Data Pipeline Architecture

A custom ingestion framework collects data from heterogeneous sources (Oracle, PostgreSQL, MySQL, SQL Server, Solace, Kafka, flat files, cloud storage, etc.). All raw files are landed on HDFS in columnar Parquet format and simultaneously registered in Hive for ad‑hoc querying. Each ingestion run performs two validation steps: (1) a checksum to detect transmission corruption and (2) a reconciliation check that compares row counts and key fields against the source system to guarantee completeness.
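The two validation steps can be sketched as follows. This is a minimal illustration, not the platform's actual code: the function names, the choice of SHA-256, and the in-memory row representation are all assumptions made for the example.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte payload."""
    return hashlib.sha256(data).hexdigest()


def validate_ingestion(payload, expected_checksum, source_rows, landed_rows, key_field):
    """Run both checks and return a list of problems (empty list means pass).

    All parameter names are hypothetical; rows are plain dicts here.
    """
    problems = []

    # Step 1: checksum comparison detects corruption in transit.
    if sha256_of(payload) != expected_checksum:
        problems.append("checksum mismatch")

    # Step 2: reconciliation compares row counts and key values with the source.
    if len(source_rows) != len(landed_rows):
        problems.append(
            f"row count mismatch: source={len(source_rows)} landed={len(landed_rows)}"
        )
    source_keys = {row[key_field] for row in source_rows}
    landed_keys = {row[key_field] for row in landed_rows}
    missing = source_keys - landed_keys
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    return problems
```

Returning a list of problems rather than raising on the first failure lets a single run report a corrupted transfer and an incomplete load together.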

Entity Resolution

When source systems lack a global identifier, the platform applies Entity Resolution to deduplicate and link records. The workflow consists of:

Feature extraction (e.g., name, address, registration number).

Blocking to reduce pairwise comparisons.

Clustering using algorithms such as k‑means or hierarchical agglomerative clustering.

Graph construction and traversal with Apache Spark GraphX, where each record is a vertex and similarity edges connect likely matches.

Assignment of a synthetic, globally unique entity ID to every connected component.
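The workflow above can be illustrated on a single machine with the standard library. This sketch swaps in simpler techniques than the talk describes: a string-similarity threshold stands in for the clustering step, and a union-find structure stands in for GraphX connected components. The record fields, the blocking key (first character of the normalized name), and the threshold are assumptions chosen for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Feature extraction: lowercase, drop punctuation, collapse whitespace."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())


def resolve_entities(records, threshold=0.85):
    """Map each record id to a synthetic entity id (hypothetical 'E<n>' format)."""
    features = {r["id"]: normalize(r["name"]) for r in records}

    # Blocking: only records sharing a block key are compared pairwise.
    blocks = defaultdict(list)
    for rid, name in features.items():
        blocks[name[:1]].append(rid)

    # Union-find over record ids; each similarity edge merges two components.
    parent = {rid: rid for rid in features}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            if SequenceMatcher(None, features[a], features[b]).ratio() >= threshold:
                parent[find(a)] = find(b)

    # Assign one synthetic id per connected component.
    entity_of_root = {}
    result = {}
    for rid in features:
        root = find(rid)
        entity_of_root.setdefault(root, f"E{len(entity_of_root) + 1}")
        result[rid] = entity_of_root[root]
    return result
```

For example, records named "Acme Trading Ltd." and "ACME Trading Ltd" normalize to the same string and receive the same entity ID, while "Borealis Shipping" lands in a different block and gets its own.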

Typical use cases include linking subsidiaries of a conglomerate (e.g., Alibaba Group → Cainiao → downstream entities) and establishing Connected Party relationships such as “person A is a director of company B” or “company C transferred $1 M to company D”. The unified view enables downstream applications like customer due‑diligence to retrieve a 360° profile with a single query.

Data Security – Data Guardian

Data Guardian enforces fine‑grained, role‑based access control by tagging every data element with metadata at the source (e.g., confidentiality=high, owner=branch‑X). At query time the engine evaluates the requester’s role and filters out any columns or rows for which the requester lacks the required clearance. For example, a branch manager can see client identifiers and contact information but cannot view transaction amounts, whereas a risk analyst with a higher privilege can access the full financial view.
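A rough sketch of tag-based filtering at query time is shown below. The clearance levels, column tags, role definitions, and function name are all hypothetical; the talk does not describe Data Guardian's internal policy format.

```python
# Hypothetical clearance ordering and per-column confidentiality tags.
CLEARANCE = {"low": 0, "medium": 1, "high": 2}
COLUMN_TAGS = {
    "client_id": "low",
    "contact": "low",
    "branch": "low",
    "transaction_amount": "high",
}
# Hypothetical roles: a clearance level plus whether rows are limited to one branch.
ROLES = {
    "branch_manager": {"clearance": "medium", "branch_scoped": True},
    "risk_analyst": {"clearance": "high", "branch_scoped": False},
}


def filter_rows(rows, role, user_branch=None):
    """Return only the rows and columns the given role is cleared to see."""
    policy = ROLES[role]
    level = CLEARANCE[policy["clearance"]]

    # Column filter: keep columns whose tag is at or below the role's clearance.
    visible = [col for col, tag in COLUMN_TAGS.items() if CLEARANCE[tag] <= level]

    result = []
    for row in rows:
        # Row filter: branch-scoped roles see only their own branch's rows.
        if policy["branch_scoped"] and row["branch"] != user_branch:
            continue
        result.append({col: row[col] for col in visible})
    return result
```

With these assumed tags, a branch manager for branch X gets only branch-X rows without `transaction_amount`, while a risk analyst gets every row and column.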

Data Exchange Service

Data Exchange is a micro‑service‑based data‑consumption layer that publishes processed assets via RESTful APIs and SFTP endpoints. An F5 load balancer distributes incoming requests across stateless service instances, which read the final assets from the data lake (HDFS/Parquet) or from Hive tables. This architecture removes traditional firewall‑bound silos while preserving security policies enforced by Data Guardian.

Rapid‑V Visualization Platform

Rapid‑V provides a drag‑and‑drop UI for configuring heterogeneous data sources (API, CSV, Oracle, PostgreSQL, Elasticsearch, etc.). The platform leverages Trino as a data‑virtualization layer, allowing users to write standard SQL to join across sources, perform aggregations (e.g., SUM(), COUNT()), and build derived datasets. Visual components (charts, tables, heatmaps) can be assembled on a canvas, saved as a report, and published with configurable visibility (private, team‑wide, or public). Reports can also be version‑controlled and exported for downstream consumption.
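The kind of cross-source SQL this enables can be illustrated with a stand-in. The sketch below does not use Trino: it uses Python's built-in `sqlite3` module, with two attached in-memory databases playing the role of two separate source catalogs. The catalog, table, and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each attached database stands in for one independent data source.
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS trades")

conn.execute("CREATE TABLE crm.clients (client_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE trades.orders (client_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO crm.clients VALUES (?, ?)", [(1, "Acme"), (2, "Borealis")]
)
conn.executemany(
    "INSERT INTO trades.orders VALUES (?, ?)", [(1, 100.0), (1, 50.0), (2, 75.0)]
)

# One SQL statement joins across both sources and aggregates per client.
QUERY = """
SELECT c.name, COUNT(*) AS order_count, SUM(o.amount) AS total_amount
FROM crm.clients AS c
JOIN trades.orders AS o ON o.client_id = c.client_id
GROUP BY c.name
ORDER BY c.name
"""
summary = conn.execute(QUERY).fetchall()
```

The qualified names (`crm.clients`, `trades.orders`) mirror how a virtualization layer addresses tables in different sources from one query; a derived dataset like `summary` is what would then feed a chart or table component.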

Technical Q&A Highlights

Ingestion & validation: The pipeline writes raw data to HDFS as Parquet, registers it in Hive, then runs checksum verification followed by reconciliation against the source system to ensure no loss.

Motivation for cross‑department data sharing: Consolidating product, trade, and client data creates a unified customer view, enabling 360° analytics, risk assessment, and faster onboarding.

Governance model: A Chief Data Office (CDO) oversees data‑management policies, cross‑border data sharing, and cloud migration, ensuring regulatory compliance while reducing approval bottlenecks.

Entity Resolution tooling: Spark GraphX handles graph‑based linkage, and k‑means‑style clustering groups similar records; the result is a single entity graph that connects previously disparate records.

Access control example: A branch manager sees only the ten customers they manage; financial columns are masked by Data Guardian based on role tags.

