
From Traditional Data Warehouses to Big Data: Practical Techniques and Migration Insights

The talk shares hands‑on experiences and best‑practice methods for traditional data‑warehouse processing, public and behavioral data handling in big‑data environments, and practical guidance for migrating legacy warehouses to modern Hadoop‑based platforms, emphasizing data governance, security, and performance optimization.

ITPUB

Jul 19, 2016

Traditional Data Warehouse Techniques

A conventional data‑warehouse project follows a layered lifecycle:

Concept definition – capture enterprise‑wide and client‑specific data domains, and produce a Statement of Work (SOW) that lists governance functions and business requirements.

Portal & permission management – define access controls for data‑catalog and reporting portals.

Integration layer – select ETL tools (e.g., Informatica) to move data across domains, and verify network connectivity (IP, ports, firewalls) for each source system.

Metadata layer – maintain source/target system descriptions, data‑type mappings, and change‑log history.

Data‑rule definition adopts a three‑layer dimensional model:

Metadata layer – raw source attributes are recorded unchanged.

Data‑warehouse layer – load dimension tables and fact tables, and generate derived metrics.

Management layer – aggregate dimensions for reporting (daily, weekly, monthly) and provide a unified view for downstream analysis.
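As a rough illustration of the management layer, the sketch below rolls fact rows up to daily, weekly, and monthly grains. The function names, the (date, amount) row shape, and the ISO-week key format are assumptions made for this example, not details from the talk.

```python
from collections import defaultdict
from datetime import date


def period_key(day: date, grain: str) -> str:
    """Map a calendar date to the reporting period it belongs to."""
    if grain == "daily":
        return day.isoformat()
    if grain == "weekly":
        # ISO weeks start on Monday; the ISO year can differ from day.year.
        iso_year, iso_week, _ = day.isocalendar()
        return f"{iso_year}-W{iso_week:02d}"
    if grain == "monthly":
        return f"{day.year}-{day.month:02d}"
    raise ValueError(f"unknown grain: {grain}")


def aggregate(facts, grain: str) -> dict:
    """Sum the amount of each (date, amount) fact row per reporting period."""
    totals = defaultdict(float)
    for day, amount in facts:
        totals[period_key(day, grain)] += amount
    return dict(totals)


# Hypothetical fact rows: (transaction date, amount).
facts = [
    (date(2016, 7, 18), 10.0),
    (date(2016, 7, 19), 5.0),
    (date(2016, 8, 1), 2.0),
]
monthly = aggregate(facts, "monthly")
weekly = aggregate(facts, "weekly")
```

In a real warehouse the same rollup would normally be a GROUP BY over a date dimension; the Python version only shows the grain logic.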

Design documents specify source system details, target schema, and ETL extraction logic, enabling developers to implement the blueprint consistently.
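One way to keep such a design document machine-readable is to express the source-to-target mapping as data and derive the extraction query from it. This is a minimal sketch: the class names, the example tables and columns, and the CAST-based query shape are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMapping:
    source_column: str
    target_column: str
    source_type: str  # type as declared in the source system
    target_type: str  # type expected by the target schema


@dataclass(frozen=True)
class TableMapping:
    source_system: str
    source_table: str
    target_table: str
    columns: tuple  # tuple of ColumnMapping

    def select_statement(self) -> str:
        """Render the extraction query implied by the column mappings."""
        cols = ", ".join(
            f"CAST({c.source_column} AS {c.target_type}) AS {c.target_column}"
            for c in self.columns
        )
        return f"SELECT {cols} FROM {self.source_table}"


# Hypothetical mapping for a customer dimension.
mapping = TableMapping(
    source_system="crm",
    source_table="customers",
    target_table="dim_customer",
    columns=(
        ColumnMapping("cust_id", "customer_id", "NUMBER", "BIGINT"),
        ColumnMapping("cust_nm", "customer_name", "VARCHAR2", "STRING"),
    ),
)
query = mapping.select_statement()
```

Keeping the mapping as data means the same record can feed both the ETL job and the metadata layer's data-type mapping tables.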

Public and Behavioral Data Processing in a Big‑Data Environment

Data sources are classified into three categories:

Public data – collected with web crawlers, which can be written in any language; Python‑based frameworks (e.g., Scrapy) are a common choice.

Embedded tracking data – obtained via third‑party SDKs such as TalkingData, Umeng, or custom‑built collectors.

User/transaction data – stored in relational databases and ingested with Sqoop (or similar bulk‑load utilities).

After landing in HDFS, the processing pipeline consists of three logical layers:

Raw landing layer – immutable copies of source files.

Integration layer – Hive‑based ETL builds a unified data model (dimensional tables, fact tables).

Analysis layer – analytical queries, data‑mining jobs, and ad‑hoc reporting.
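The three layers are often mirrored in the HDFS directory layout so that permissions and retention can be set per layer. The sketch below assumes a date-partitioned layout; only the /data/level0 root comes from this article's Sqoop target directory, while the other roots, the dt= partition naming, and the function name are hypothetical.

```python
import posixpath
from datetime import date

# Layer roots. "/data/level0" matches the raw landing directory used in the
# Sqoop example in this article; the other two roots are assumptions.
LAYER_ROOTS = {
    "raw": "/data/level0",
    "integration": "/data/integration",
    "analysis": "/data/analysis",
}


def partition_path(layer: str, dataset: str, day: date) -> str:
    """Build the HDFS directory for one dataset partition in a given layer."""
    if layer not in LAYER_ROOTS:
        raise ValueError(f"unknown layer: {layer}")
    return posixpath.join(LAYER_ROOTS[layer], dataset, f"dt={day.isoformat()}")


raw_dir = partition_path("raw", "orders", date(2016, 7, 19))
```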

Hive jobs are run on Spark instead of MapReduce to shorten execution time; the talk reports a typical speed‑up of around tenfold, although the actual gain depends on the workload. Spark’s MLlib and SparkR are used for machine learning and visual analytics. Final result sets are persisted in HBase or an RDBMS to satisfy low‑latency query requirements.

Key operational practices:

Include interface version and change log in metadata tables.

Record file size, ingestion timestamp, and a success flag for every batch.

Attach traceability metadata (source system, batch ID) to all downstream tables.

Ensure tracking events follow the business data flow to maintain completeness.

Standardize identifier keys (UID, DID) across all sources and align time windows.

Capture source‑level metadata (schema, encoding) at the point of ingestion.
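Two of the practices above, unified identifiers and aligned time windows, can be sketched as small normalization helpers applied at ingestion. The normalization rule (trim and lowercase) and the UTC hourly window are assumptions for illustration; the talk does not prescribe specific rules.

```python
from datetime import datetime, timedelta, timezone


def normalize_id(raw: str) -> str:
    """Normalize a UID or DID so that the same entity matches across sources."""
    return raw.strip().lower()


def window_start(ts: datetime, minutes: int = 60) -> datetime:
    """Floor a timezone-aware timestamp to the start of its UTC window.

    `minutes` should divide 1440 so that windows line up across days.
    """
    ts_utc = ts.astimezone(timezone.utc)
    minute_of_day = ts_utc.hour * 60 + ts_utc.minute
    floored = (minute_of_day // minutes) * minutes
    return ts_utc.replace(
        hour=floored // 60, minute=floored % 60, second=0, microsecond=0
    )


# The same instant reported by two sources in different time zones
# falls into the same window once both are converted to UTC.
utc_event = datetime(2016, 7, 19, 10, 47, 30, tzinfo=timezone.utc)
local_event = datetime(2016, 7, 19, 18, 47, 30, tzinfo=timezone(timedelta(hours=8)))
same_window = window_start(utc_event) == window_start(local_event)
```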

Migration from Traditional to Big‑Data Warehouses

The migration focuses on two core tasks: data synchronization and data masking.

Synchronization – replicate all Level 0 tables (legacy RDBMS tables, public data files, tracking logs) to the big‑data platform. Ingestion tools are chosen by source type:

RDBMS → HDFS:

sqoop import --connect jdbc:... --table ... --target-dir /data/level0/

Real‑time streams → HDFS: events are published to Apache Kafka or read by Storm spouts, and a downstream consumer or bolt writes them into HDFS partitions.

Masking – apply column‑level or row‑level anonymization (e.g., tokenization, hashing) before data is stored in shared zones.
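Column-level masking by hashing can be sketched with a keyed hash (HMAC-SHA256), which keeps masked values joinable across tables while making them hard to reverse without the key. The choice of HMAC, the function names, and the sample row are assumptions of this sketch; the talk only names tokenization and hashing in general.

```python
import hashlib
import hmac


def mask_value(value: str, secret_key: bytes) -> str:
    """Return a deterministic keyed hash of a sensitive value."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_row(row: dict, sensitive_columns: set, secret_key: bytes) -> dict:
    """Return a copy of the row with the sensitive columns replaced by hashes."""
    masked = {}
    for column, value in row.items():
        if column in sensitive_columns and value is not None:
            masked[column] = mask_value(str(value), secret_key)
        else:
            masked[column] = value
    return masked


# Hypothetical row and key; in practice the key would come from a secret store.
key = b"example-key-not-for-production"
row = {"uid": "u1", "phone": "13800000000", "city": "Beijing"}
masked = mask_row(row, {"phone"}, key)
```

Because the hash is deterministic for a given key, the same phone number masks to the same value in every table, so joins still work in the shared zone. Rotating the key breaks that linkage, which is a trade-off to decide up front.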

Best‑practice recommendations:

Define a complete “data‑gene” for each source file: size, checksum, ingestion time, and success flag.

Design clear data‑lineage metadata so that every downstream artifact can be traced back to its origin.

Implement fine‑grained security controls for each storage layer (raw, integration, analysis) to enforce least‑privilege access.

Automate monitoring of key quality metrics: consistency ratios, data skew thresholds, validation pass rates.

Visualize core metrics and metadata dashboards to enable proactive issue detection.
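The "data‑gene" record and one of the quality metrics can be sketched as follows. The field names, the use of SHA-256 as the checksum, and the definition of skew as the largest key count divided by the mean key count are assumptions for illustration only.

```python
import hashlib
from collections import Counter
from datetime import datetime, timezone


def data_gene(payload: bytes, expected_sha256: str = None) -> dict:
    """Describe one ingested file: size, checksum, ingestion time, success flag."""
    checksum = hashlib.sha256(payload).hexdigest()
    return {
        "size_bytes": len(payload),
        "sha256": checksum,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        # Success means the checksum matches the one supplied by the source,
        # or that no expected checksum was provided.
        "success": expected_sha256 is None or checksum == expected_sha256,
    }


def skew_ratio(keys) -> float:
    """Largest per-key row count divided by the mean per-key row count."""
    counts = Counter(keys)
    if not counts:
        return 0.0
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean


gene = data_gene(b"abc")
ratio = skew_ratio(["a", "a", "a", "b"])
```

A monitoring job could compare the skew ratio against a threshold agreed per table and raise an alert when it is exceeded; a value near 1.0 indicates evenly distributed keys.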

Technical Takeaways

• Traditional data‑warehouse processes (concept definition, metadata management, dimensional modeling) remain valid foundations for big‑data platforms.

• A three‑layer architecture (raw → integration → analysis) simplifies governance and scaling.

• Leveraging Spark as the execution engine for Hive dramatically improves throughput and enables advanced analytics.

• Consistent metadata, versioned interfaces, and unified identifiers are essential for reliable data pipelines.

• Automated lineage, security, and monitoring close the loop between legacy RDBMS environments and modern Hadoop ecosystems.
