Getting Started with DataHub: A One‑Stop Guide to Metadata Governance
This article walks you through the fundamentals of data governance, explains metadata management concepts, compares traditional tools with DataHub, and provides a step‑by‑step tutorial for installing Docker, Python, and DataHub 0.8.20 on CentOS 7, ingesting MySQL metadata, and exploring the UI.
Data Governance and Metadata Management
Metadata management organizes data assets to support search, discovery, access control, lineage, compliance, quality, and AI reproducibility.
Key Functions
Search and discovery: tables, fields, tags, usage information.
Access control: groups, users, policies.
Data lineage: pipeline execution, query tracing.
Compliance: privacy and regulatory annotations.
Data management: source configuration, ingestion, retention, cleanup.
AI explainability & reproducibility: feature, model, training run definitions.
Data operations: pipeline execution, partition handling, statistics.
Data quality: rule definitions, execution results, statistics.
Metadata Architecture Generations
First generation: monolithic front‑end (e.g., Flask) with a relational store (MySQL/Postgres) and a search index (Elasticsearch). When recursive queries exceed relational limits, a graph index such as Neo4j is added.
Second generation: the monolith is split into services that sit in front of the metadata store and expose an API for push‑based ingestion.
Third generation: event‑driven architecture offering low‑latency lookup, full‑text ranking, graph queries, and analytical scans. DataHub follows this third‑generation design.
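To make the third-generation idea concrete, here is a toy sketch of an event-driven catalog: every metadata change is appended to a log, and independent indexes (one for search, one for lineage) subscribe to that log. All class and field names below (MetadataEvent, SearchIndex, LineageGraph, EventLog) are hypothetical and chosen for illustration; this is not DataHub's actual data model or API.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataEvent:
    """One change to the catalog (hypothetical shape, for illustration only)."""
    urn: str               # unique id of the asset
    description: str       # free text used for search
    upstreams: tuple = ()  # ids of the assets this one is derived from


class SearchIndex:
    """Toy full-text index: maps each lowercase word to the assets mentioning it."""

    def __init__(self):
        self.words = defaultdict(set)

    def apply(self, event):
        for word in event.description.lower().split():
            self.words[word].add(event.urn)

    def search(self, word):
        return sorted(self.words.get(word.lower(), set()))


class LineageGraph:
    """Toy graph index: stores upstream edges and answers recursive queries."""

    def __init__(self):
        self.upstreams = defaultdict(set)

    def apply(self, event):
        self.upstreams[event.urn].update(event.upstreams)

    def all_upstreams(self, urn):
        # Breadth-first traversal; 'seen' guards against cycles.
        seen, queue = set(), deque([urn])
        while queue:
            for parent in self.upstreams.get(queue.popleft(), ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return sorted(seen)


class EventLog:
    """Append-only log that fans every event out to its subscribers."""

    def __init__(self):
        self.events = []
        self.subscribers = []

    def subscribe(self, index):
        # Replay history so a late subscriber catches up with the log.
        for event in self.events:
            index.apply(event)
        self.subscribers.append(index)

    def publish(self, event):
        self.events.append(event)
        for index in self.subscribers:
            index.apply(event)


log = EventLog()
search, lineage = SearchIndex(), LineageGraph()
log.subscribe(search)
log.subscribe(lineage)

log.publish(MetadataEvent("raw.orders", "raw order events"))
log.publish(MetadataEvent("clean.orders", "deduplicated order table", ("raw.orders",)))
log.publish(MetadataEvent("report.revenue", "daily revenue report", ("clean.orders",)))

print(search.search("order"))
print(lineage.all_upstreams("report.revenue"))
```

The point of the design is that the log is the source of truth: a new kind of index can be added later and rebuilt by replaying the events, without touching the producers.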
DataHub Overview
DataHub is an open-source metadata catalog that originated at LinkedIn (it is unrelated to Alibaba Cloud DataHub). Repository: https://github.com/linkedin/datahub. Comparable projects: Apache Atlas (https://github.com/apache/atlas) and Lyft Amundsen (https://github.com/lyft/amundsen). At the time of writing, DataHub had over 4.3k GitHub stars.
DataHub Architecture
Frontend: React‑based UI for browsing metadata.
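Backend (serving): a Java-based metadata service that persists metadata in a relational store (MySQL in the quickstart) and maintains Elasticsearch as the search index and Neo4j or Elasticsearch as the graph index.
Ingestion: a Python framework whose plugins pull metadata from source systems and push it to the backend through the REST API or Kafka.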
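The plugin idea behind the ingestion layer can be sketched as a pair of registries: a recipe names a source type and a sink type, and the pipeline looks both up and streams records from one to the other. Everything here (the SOURCES and SINKS registries, the "static-tables" and "in-memory" plugin names, run_pipeline) is a hypothetical illustration of the pattern, not DataHub's real plugin interface.

```python
from typing import Callable, Dict, Iterable

# Hypothetical registries: plugin type name -> factory function.
SOURCES: Dict[str, Callable[[dict], Iterable[dict]]] = {}
SINKS: Dict[str, Callable[[dict], Callable[[dict], None]]] = {}


def register(registry, name):
    """Decorator that stores a plugin factory under the given type name."""
    def decorator(factory):
        registry[name] = factory
        return factory
    return decorator


@register(SOURCES, "static-tables")
def static_tables_source(config):
    """Pretend source: yields one metadata record per configured table."""
    for table in config["tables"]:
        yield {"platform": config["platform"], "table": table}


@register(SINKS, "in-memory")
def in_memory_sink(config):
    """Pretend sink: returns a writer that appends to the list in the config."""
    return config["target"].append


def run_pipeline(recipe):
    """Resolve plugins by their 'type' key, then stream records source -> sink."""
    source = SOURCES[recipe["source"]["type"]](recipe["source"]["config"])
    write = SINKS[recipe["sink"]["type"]](recipe["sink"]["config"])
    written = 0
    for record in source:
        write(record)
        written += 1
    return {"records_written": written}


collected = []
report = run_pipeline({
    "source": {
        "type": "static-tables",
        "config": {"platform": "mysql", "tables": ["users", "orders"]},
    },
    "sink": {"type": "in-memory", "config": {"target": collected}},
})
print(report)
print(collected)
```

Because the pipeline only depends on the type names, supporting a new system means registering one more factory rather than changing the pipeline itself.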
Quick Installation on CentOS 7
Prerequisites: Docker, Docker‑Compose, jq, and Python 3.6+.
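Before starting, it can help to confirm that the prerequisites are actually on the PATH. The following is a small sketch using only the Python standard library; the helper names missing_tools and python_ok are made up for this example, and the script only checks that the executables exist, not which versions they are.

```python
import shutil
import sys

REQUIRED_TOOLS = ("docker", "docker-compose", "jq")
MIN_PYTHON = (3, 6)


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version=sys.version_info, minimum=MIN_PYTHON):
    """True when the (major, minor) part of the version meets the minimum."""
    return tuple(version[:2]) >= minimum


missing = missing_tools()
print("missing tools:", ", ".join(missing) if missing else "none")
print("python version ok:", python_ok())
```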
Install Docker and Docker‑Compose
yum -y install docker
docker -v
# Docker version 1.13.1, build 7d71120/1.13.1
systemctl start docker
# To stop the daemon later: systemctl stop docker (Docker must stay running for the steps below)
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose
docker-compose --version
# docker-compose version 1.29.2, build 5becea4c
Install jq
yum install epel-release
yum list jq
yum install jq
Install Python 3.8
yum -y groupinstall "Development tools"
yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel libpcap-devel xz-devel libffi-devel
wget https://www.python.org/ftp/python/3.8.3/Python-3.8.3.tgz
tar -zxvf Python-3.8.3.tgz
mkdir /usr/local/python3
cd Python-3.8.3
./configure --prefix=/usr/local/python3
make && make install
# Caution: on CentOS 7, yum itself runs on the system Python 2 through /usr/bin/python.
# Replacing that link breaks yum unless its scripts are pointed at python2 explicitly.
rm -rf /usr/bin/python
ln -s /usr/local/python3/bin/python3 /usr/bin/python
rm -rf /usr/bin/pip
ln -s /usr/local/python3/bin/pip3 /usr/bin/pip
python -V
Install and Start DataHub
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
python3 -m datahub version
# DataHub CLI version: 0.8.20.0
python3 -m datahub docker quickstart
After the download completes, the UI is reachable at http://IP:9002 (replace IP with the address of the host). Log in with user "datahub"; the quickstart's default password is also "datahub".
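The quickstart containers can take a while to come up, so a small polling helper is handy for checking when the UI on port 9002 starts answering. This is a minimal sketch built on the standard library; the function name wait_for_http and its defaults are assumptions made for this example, and it only checks that something answers over HTTP, not that DataHub is healthy.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, attempts=30, delay_seconds=2.0, timeout_seconds=3.0):
    """Return True as soon as `url` produces any HTTP response, else False."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout_seconds):
                return True
        except urllib.error.HTTPError:
            # The server answered, just not with a 2xx status (for example 401).
            return True
        except (urllib.error.URLError, OSError):
            # Nothing is listening yet (or the request timed out): retry.
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    return False
```

Calling wait_for_http("http://IP:9002") with the host's address in place of IP returns True once the front end responds, and False if it never does within the allotted attempts.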
Metadata Ingestion
DataHub uses a plugin architecture. Install the MySQL source plugin and the REST sink plugin:
pip install 'acryl-datahub[mysql]'
Verify installed plugins:
python3 -m datahub check plugins
Create a YAML recipe to pull metadata from MySQL and push it via the REST API:
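source:
  type: mysql
  config:
    username: root
    password: "123456"
    database: cnarea20200630
# Optional. The type below is a template placeholder: remove this section
# or replace it with the class name of a real transformer.
transformers:
  - type: "fully-qualified-class-name-of-transformer"
    config:
      some_property: "some.value"
sink:
  type: "datahub-rest"
  config:
    server: "http://ip:8080"
Run the ingestion (with the recipe saved as mysql_to_datahub_rest.yml):
datahub ingest -c mysql_to_datahub_rest.yml
The CLI reports successful ingestion: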
{'records_written': 356,
'warnings': [],
'failures': [],
'downstream_start_time': datetime.datetime(2021, 12, 28, 21, 8, 37, 402989),
'downstream_end_time': datetime.datetime(2021, 12, 28, 21, 13, 10, 757687),
'downstream_total_latency_in_seconds': 273.354698}
Refreshing the DataHub UI shows the imported MySQL tables, columns, and lineage visualizations.
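The sink report printed by the CLI is a plain Python dictionary, so it is easy to sanity-check. The sketch below copies the values from the run above and recomputes the elapsed time and a rough throughput; the summarize helper is a name invented for this example and is not part of the DataHub CLI.

```python
import datetime

# Sink report copied from the ingestion run shown above.
report = {
    "records_written": 356,
    "warnings": [],
    "failures": [],
    "downstream_start_time": datetime.datetime(2021, 12, 28, 21, 8, 37, 402989),
    "downstream_end_time": datetime.datetime(2021, 12, 28, 21, 13, 10, 757687),
    "downstream_total_latency_in_seconds": 273.354698,
}


def summarize(report):
    """Derive pass/fail, elapsed seconds, and records per second from a report."""
    elapsed = (
        report["downstream_end_time"] - report["downstream_start_time"]
    ).total_seconds()
    return {
        "ok": not report["failures"],
        "elapsed_seconds": elapsed,
        "records_per_second": report["records_written"] / elapsed if elapsed > 0 else None,
    }


summary = summarize(report)
print(summary)
```

The recomputed elapsed time matches the reported downstream_total_latency_in_seconds, which works out to a little over one record per second for this run.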
ShiZhen AI