
Getting Started with DataHub: A One‑Stop Guide to Metadata Governance

This article walks you through the fundamentals of data governance, explains metadata management concepts, compares traditional tools with DataHub, and provides a step‑by‑step tutorial for installing Docker, Python, and DataHub 0.8.20 on CentOS 7, ingesting MySQL metadata, and exploring the UI.

ShiZhen AI

Data Governance and Metadata Management

Metadata management organizes data assets to support search, discovery, access control, lineage, compliance, quality, and AI reproducibility.

Key Functions

Search and discovery: tables, fields, tags, usage information.

Access control: groups, users, policies.

Data lineage: pipeline execution, query tracing.

Compliance: privacy and regulatory annotations.

Data management: source configuration, ingestion, retention, cleanup.

AI explainability & reproducibility: feature, model, training run definitions.

Data operations: pipeline execution, partition handling, statistics.

Data quality: rule definitions, execution results, statistics.

Metadata Architecture Generations

First generation: monolithic front‑end (e.g., Flask) with a relational store (MySQL/Postgres) and a search index (Elasticsearch). When recursive queries exceed relational limits, a graph index such as Neo4j is added.

Second generation: the monolith is split into services that sit in front of the metadata store and expose an API for push‑based ingestion.

Third generation: event‑driven architecture offering low‑latency lookup, full‑text ranking, graph queries, and analytical scans. DataHub follows this third‑generation design.
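
To make the third-generation idea concrete, here is a toy sketch of an event-driven metadata store: every change is appended to a log, and independent subscribers derive a search index and a lineage graph from the same events. The class names (MetadataChangeEvent, MetadataLog, SearchIndex, LineageGraph) and the dataset names are invented for illustration; this is not DataHub's actual data model or API.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class MetadataChangeEvent:
    """Hypothetical event: one dataset changed, with its upstream inputs."""
    name: str
    description: str = ""
    upstreams: List[str] = field(default_factory=list)


class MetadataLog:
    """Append-only log; every subscriber receives every event in order."""

    def __init__(self) -> None:
        self.events: List[MetadataChangeEvent] = []
        self.subscribers: List[Callable[[MetadataChangeEvent], None]] = []

    def subscribe(self, handler: Callable[[MetadataChangeEvent], None]) -> None:
        self.subscribers.append(handler)

    def publish(self, event: MetadataChangeEvent) -> None:
        self.events.append(event)
        for handler in self.subscribers:
            handler(event)


class SearchIndex:
    """Very small inverted index over dataset names and descriptions."""

    def __init__(self) -> None:
        self.tokens: Dict[str, Set[str]] = defaultdict(set)

    def apply(self, event: MetadataChangeEvent) -> None:
        text = f"{event.name} {event.description}".lower().replace(".", " ")
        for token in text.split():
            self.tokens[token].add(event.name)

    def search(self, word: str) -> List[str]:
        return sorted(self.tokens.get(word.lower(), set()))


class LineageGraph:
    """Stores upstream -> downstream edges and answers impact queries."""

    def __init__(self) -> None:
        self.downstream: Dict[str, Set[str]] = defaultdict(set)

    def apply(self, event: MetadataChangeEvent) -> None:
        for upstream in event.upstreams:
            self.downstream[upstream].add(event.name)

    def impacted_by(self, name: str) -> List[str]:
        # Breadth-first walk: the kind of recursive query that is
        # awkward in a purely relational store.
        seen: Set[str] = set()
        queue = deque([name])
        while queue:
            for child in self.downstream.get(queue.popleft(), ()):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return sorted(seen)


if __name__ == "__main__":
    log, search, graph = MetadataLog(), SearchIndex(), LineageGraph()
    log.subscribe(search.apply)
    log.subscribe(graph.apply)
    log.publish(MetadataChangeEvent("raw.orders", "order events"))
    log.publish(MetadataChangeEvent("staging.orders", "cleaned orders", ["raw.orders"]))
    log.publish(MetadataChangeEvent("mart.revenue", "daily revenue", ["staging.orders"]))
    print(search.search("orders"))
    print(graph.impacted_by("raw.orders"))
```

Because both indexes are derived from the log, a new kind of index can be added later by replaying the stored events, which is the main appeal of the event-driven design.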

DataHub Overview

DataHub is an open‑source metadata catalog that originated at LinkedIn (it is unrelated to Alibaba Cloud DataHub). Repository: https://github.com/linkedin/datahub. Comparable projects: Apache Atlas (https://github.com/apache/atlas) and Lyft Amundsen (https://github.com/lyft/amundsen). At the time of writing, DataHub had over 4.3k GitHub stars.

DataHub Architecture

Frontend: React‑based UI for browsing metadata.

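Backend (serving): Java services (the metadata service, GMS) that persist metadata in a relational database such as MySQL and index it in Elasticsearch for search and Neo4j for graph queries.

Ingestion: a Python framework whose source plugins extract metadata from external systems and whose sinks push it to DataHub over the REST API or Kafka.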

Quick Installation on CentOS 7

Prerequisites: Docker, Docker‑Compose, jq, and Python 3.6+.
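
Before installing anything, it can help to see which prerequisites are already present. The following is a minimal sketch that uses only the Python standard library; the helper names missing_tools and python_is_supported are made up for this example, and the tool list simply mirrors the prerequisites above.

```python
import shutil
import sys

# Command names as listed in the prerequisites above.
REQUIRED_TOOLS = ["docker", "docker-compose", "jq"]
MIN_PYTHON = (3, 6)


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_is_supported(version=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum


if __name__ == "__main__":
    missing = missing_tools()
    print("Missing tools:", ", ".join(missing) if missing else "none")
    print("Python version OK:", python_is_supported())
```

Run it with whichever Python 3 interpreter is available; anything reported as missing is covered by the installation steps that follow.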

Install Docker and Docker‑Compose

yum -y install docker
docker -v
# Docker version 1.13.1, build 7d71120/1.13.1
systemctl start docker
# To stop the daemon later: systemctl stop docker
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose
docker-compose --version
# docker-compose version 1.29.2, build 5becea4c

Install jq

yum -y install epel-release
yum list jq
yum -y install jq

Install Python 3.8

yum -y groupinstall "Development tools"
yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel libpcap-devel xz-devel libffi-devel
wget https://www.python.org/ftp/python/3.8.3/Python-3.8.3.tgz
tar -zxvf Python-3.8.3.tgz
mkdir /usr/local/python3
cd Python-3.8.3
./configure --prefix=/usr/local/python3
make && make install
# Leave /usr/bin/python (Python 2) untouched: yum on CentOS 7 depends on it.
ln -sf /usr/local/python3/bin/python3 /usr/bin/python3
ln -sf /usr/local/python3/bin/pip3 /usr/bin/pip3
python3 -V

Install and Start DataHub

python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
python3 -m datahub version
# DataHub CLI version: 0.8.20.0
python3 -m datahub docker quickstart

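After the images have been downloaded and the containers have started, the UI is reachable at http://IP:9002, where IP is the address of the host. Log in with the default account: username “datahub”, password “datahub”.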
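
The quickstart containers can take a while to come up. As a rough sketch, the script below polls the UI port until it answers; the function names, the retry count, and the delay are arbitrary choices for illustration, and the localhost URL is a placeholder for your host's address.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as exc:
        # The server answered, just not with 2xx (for example a 401).
        return exc.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: not up yet.
        return False


def wait_until_up(url, attempts=30, delay=10.0, probe=http_probe, sleep=time.sleep):
    """Poll the URL; return the attempt number that succeeded, or None."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None


if __name__ == "__main__":
    # Placeholder address: replace localhost with the host running DataHub.
    result = wait_until_up("http://localhost:9002")
    print("UI is up" if result else "UI did not respond in time")
```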

Metadata Ingestion

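DataHub uses a plugin architecture: each source and sink is installed as an optional extra of the acryl-datahub package. Install the MySQL source plugin and the REST sink plugin:

python3 -m pip install 'acryl-datahub[mysql,datahub-rest]'

Verify the installed plugins:

python3 -m datahub check plugins

Create a YAML recipe, saved here as mysql_to_datahub_rest.yml, that pulls metadata from MySQL and pushes it to DataHub through the REST API: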

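source:
  type: mysql
  config:
    # Adjust to the host and port of your MySQL server.
    host_port: "localhost:3306"
    username: root
    password: "123456"
    database: cnarea20200630

# Optional: transformers modify metadata before it reaches the sink.
# The values below are placeholders, so keep the block commented out
# unless you have a real transformer to plug in.
# transformers:
#   - type: "fully-qualified-class-name-of-transformer"
#     config:
#       some_property: "some.value"

sink:
  type: "datahub-rest"
  config:
    # Replace ip with the address of the host running DataHub.
    server: "http://ip:8080"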
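
A recipe can also be built and run from Python rather than from a YAML file. The sketch below assumes the acryl-datahub package with the MySQL and REST plugins is installed and uses its Pipeline class (Pipeline.create, run, raise_from_status); the helper names build_recipe and run_recipe are invented for this example, and the optional transformers block is left out.

```python
def build_recipe(username, password, database, gms_server,
                 host_port="localhost:3306"):
    """Return a recipe dictionary with the same shape as the YAML recipe."""
    return {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": host_port,
                "username": username,
                "password": password,
                "database": database,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": gms_server},
        },
    }


def run_recipe(recipe):
    """Run the ingestion; requires acryl-datahub and reachable services."""
    # Imported lazily so that build_recipe works without the package.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    # Raises an exception if the source or sink reported failures.
    pipeline.raise_from_status()


if __name__ == "__main__":
    # Placeholder server address: point it at your DataHub host.
    recipe = build_recipe("root", "123456", "cnarea20200630",
                          "http://localhost:8080")
    run_recipe(recipe)
```

Building the recipe in code is mainly useful when credentials come from environment variables or a secrets manager instead of being written into a file.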

Run the ingestion:

python3 -m datahub ingest -c mysql_to_datahub_rest.yml

The CLI prints a sink report. In this run, 356 records were written with no warnings or failures:

{'records_written': 356,
 'warnings': [],
 'failures': [],
 'downstream_start_time': datetime.datetime(2021, 12, 28, 21, 8, 37, 402989),
 'downstream_end_time': datetime.datetime(2021, 12, 28, 21, 13, 10, 757687),
 'downstream_total_latency_in_seconds': 273.354698}
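
The report is a plain Python dictionary, so it is easy to check. As a small worked example, the snippet below copies the values shown above, recomputes the latency from the two timestamps, and derives the write rate; the summarize helper is a name invented for this illustration.

```python
import datetime

# Values copied from the sink report above.
report = {
    "records_written": 356,
    "warnings": [],
    "failures": [],
    "downstream_start_time": datetime.datetime(2021, 12, 28, 21, 8, 37, 402989),
    "downstream_end_time": datetime.datetime(2021, 12, 28, 21, 13, 10, 757687),
    "downstream_total_latency_in_seconds": 273.354698,
}


def summarize(report):
    """Return (succeeded, elapsed_seconds, records_per_second)."""
    elapsed = (report["downstream_end_time"]
               - report["downstream_start_time"]).total_seconds()
    succeeded = not report["failures"]
    rate = report["records_written"] / elapsed
    return succeeded, elapsed, rate


if __name__ == "__main__":
    succeeded, elapsed, rate = summarize(report)
    print(f"succeeded={succeeded} elapsed={elapsed:.1f}s rate={rate:.2f} records/s")
```

The recomputed elapsed time matches the reported downstream_total_latency_in_seconds, which works out to a little over one record per second for this run.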

Refreshing the DataHub UI shows the imported MySQL tables, columns, and lineage visualizations.
