Comprehensive Guide to Installing and Configuring Apache Atlas with Hive and Sqoop Hooks
This article provides a step‑by‑step tutorial on using Apache Atlas for data lineage, including SQL execution, custom data maps, tagging, field search, detailed installation procedures, runtime commands, and the configuration of Hive and Sqoop hooks for a complete big‑data governance solution.
1. Application
1.1 Execute SQL
-- Create a temporary table (keep only the latest visit record)
DROP TABLE IF EXISTS tmp.tmp_ng;
CREATE TABLE tmp.tmp_ng STORED AS parquet AS
SELECT userid,
pageid,
pagetype,
os,
terminal_type,
pagetime
FROM (
SELECT uid userid,
CAST(pageid AS INT) pageid,
pagetype,
os,
terminal_type,
pagetime,
row_number() OVER(
PARTITION BY uid,
pageid
ORDER BY pagetime DESC
) rk
FROM dw.dw_nginxlog
WHERE dt = '2020-11-02'
) t1
WHERE rk = 1;
-- Aggregate visits and devices for the three pages
INSERT OVERWRITE TABLE dwd.dwd_hive_fact_nginx_visit_ext_dt PARTITION(dt = '2020-11-03')
SELECT zbid,
userid,
os,
terminal_type
FROM (
SELECT zbid,
userid,
os,
terminal_type,
row_number() OVER(
PARTITION BY zbid,
userid
ORDER BY pagetime DESC
) rk
FROM (
SELECT tp.zbid,
t1.userid,
t1.os,
t1.terminal_type,
t1.pagetime
FROM (
SELECT userid,
pageid,
os,
terminal_type,
pagetime
FROM tmp.tmp_ng
WHERE pagetype = 1 -- topic type
) t1
LEFT JOIN (
SELECT zbid,
topicid
FROM dwd.dwd_minisns_zbtopics_ds
WHERE dt = '2020-11-02'
) tp -- snapshot table
ON t1.pageid = tp.topicid
UNION ALL
SELECT tp.zbid,
t2.userid,
t2.os,
t2.terminal_type,
t2.pagetime
FROM (
SELECT userid,
pageid,
os,
terminal_type,
pagetime
FROM tmp.tmp_ng
WHERE pagetype = 2 -- channel type
) t2
LEFT JOIN dw.dw_zbchannel tp ON t2.pageid = tp.channelid
) t3
) t4
WHERE rk = 1;
1.2 Hand‑written Data Map
1.3 Atlas Lineage Analysis
Explanation: Atlas parses all SQL scripts to display upstream and downstream relationships, enabling domain‑level organization and easier iterative maintenance.
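The same upstream and downstream relationships can also be read over Atlas's v2 REST lineage endpoint. The following is a minimal sketch, assuming the server from section 3.2 (http://localhost:21000, admin/admin) and an entity GUID you have already looked up; the helper names are illustrative, not part of Atlas.

```python
import base64
import json
import urllib.request


def lineage_url(base_url, guid, direction="BOTH", depth=3):
    """Build the Atlas v2 lineage URL for one entity GUID."""
    return (f"{base_url.rstrip('/')}/api/atlas/v2/lineage/{guid}"
            f"?direction={direction}&depth={depth}")


def lineage_edges(lineage):
    """Turn a lineage response into (from, to) pairs of qualified names.

    Falls back to the raw GUID when an entity carries no qualifiedName.
    """
    entities = lineage.get("guidEntityMap") or {}

    def label(guid):
        attrs = entities.get(guid, {}).get("attributes") or {}
        return attrs.get("qualifiedName", guid)

    return [(label(rel["fromEntityId"]), label(rel["toEntityId"]))
            for rel in lineage.get("relations") or []]


def fetch_lineage(base_url, guid, user="admin", password="admin"):
    """GET the lineage graph using HTTP basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        lineage_url(base_url, guid),
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling lineage_edges(fetch_lineage("http://localhost:21000", guid)) would list each table-to-process and process-to-table hop, which is the same graph the lineage tab draws.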
1.4 Tagging
1.4.1 Classification
Explanation: Different dimensions can be defined according to project requirements.
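Classifications can be created in the UI or through the REST API. As a rough sketch, assuming the default admin/admin account and a hypothetical classification name chosen by you, the request bodies for defining a classification and attaching it to an entity might be built like this:

```python
import base64
import json
import urllib.request


def classification_typedef(name, description=""):
    """Body for POST /api/atlas/v2/types/typedefs defining one classification."""
    return {
        "classificationDefs": [{
            "name": name,
            "description": description,
            "superTypes": [],
            "attributeDefs": [],
        }],
        "entityDefs": [],
        "enumDefs": [],
        "structDefs": [],
    }


def classification_assignment(name):
    """Body for POST /api/atlas/v2/entity/guid/{guid}/classifications."""
    return [{"typeName": name}]


def post_json(url, body, user="admin", password="admin"):
    """POST a JSON body with basic auth; returns parsed JSON or None if empty."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Authorization": f"Basic {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        raw = response.read()
    return json.loads(raw) if raw else None
```

For example, post_json(base + "/api/atlas/v2/types/typedefs", classification_typedef("device_info")) would register a classification named device_info, where both base and the name are placeholders.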
1.4.2 Glossary
Explanation: A data‑warehouse project often contains many domains and layers; a glossary helps organize them.
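One way to seed such a glossary is to create one term per warehouse layer. The sketch below only builds the JSON bodies; it assumes the layers are the database names used in section 1.1 (dw, dwd, tmp), and the function names are illustrative.

```python
def glossary_body(name, short_description=""):
    """Body for POST /api/atlas/v2/glossary."""
    return {"name": name, "shortDescription": short_description}


def term_body(name, glossary_guid):
    """Body for POST /api/atlas/v2/glossary/term.

    The 'anchor' field ties the term to the glossary that owns it.
    """
    return {"name": name, "anchor": {"glossaryGuid": glossary_guid}}


def layer_terms(glossary_guid, layers=("dw", "dwd", "tmp")):
    """One term per warehouse layer; the layer names are an assumption."""
    return [term_body(layer, glossary_guid) for layer in layers]
```

The glossary is created first; the GUID returned in its response is then passed as glossary_guid when each term is posted.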
1.5 Field Search
1.5.1 View Table Fields
1.5.2 Trace Field Relationships
Explanation: Powerful lineage lets you view the complete data chain of a field.
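Field search is also available through the basic search endpoint (POST /api/atlas/v2/search/basic). The sketch below builds a search body for hive_column entities and groups the hits by table. It assumes the Hive hook's usual db.table.column@cluster layout for qualifiedName; the cluster name in any sample data is a placeholder.

```python
def column_search_body(column_name, limit=25):
    """Body for POST /api/atlas/v2/search/basic, restricted to Hive columns."""
    return {
        "typeName": "hive_column",
        "query": column_name,
        "excludeDeletedEntities": True,
        "limit": limit,
        "offset": 0,
    }


def columns_by_table(search_result):
    """Group matching columns by 'db.table'.

    Assumes qualifiedName looks like 'db.table.column@cluster'.
    """
    grouped = {}
    for entity in search_result.get("entities") or []:
        qualified = (entity.get("attributes") or {}).get("qualifiedName", "")
        path = qualified.split("@", 1)[0]
        table, _, column = path.rpartition(".")
        if table:
            grouped.setdefault(table, []).append(column)
    return grouped
```

Each returned entity also carries a guid, which can be used to open that column's lineage and follow the field across tables.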
2. Installation
2.1 Compile and Install
Download the source, adjust the component versions in pom.xml, fix the resulting compatibility errors, then compile, package, copy, and extract the build.
git clone --branch branch-2.0 https://gitee.com/mirrors/apache-atlas.git apache-atlas_branch-2.0
<hadoop.version>2.8.5</hadoop.version>
<hbase.version>1.4.9</hbase.version>
<kafka.version>2.0.0</kafka.version>
<hive.version>2.3.5</hive.version>
<zookeeper.version>3.4.6</zookeeper.version>
<falcon.version>0.8</falcon.version>
<sqoop.version>1.4.6.2.3.99.0-195</sqoop.version>
<storm.version>1.2.0</storm.version>
<elasticsearch.version>7.1.0</elasticsearch.version>
Remove the hbase‑zookeeper dependency from pom.xml and replace HBase 2.x classes with their HBase 1.x equivalents (e.g., ColumnFamilyDescriptor → HColumnDescriptor) so the code compiles against HBase 1.4.9.
export MAVEN_OPTS="-Xms2g -Xmx2g"
mvn clean package -Pdist -Dmaven.test.skip=true -Dmaven.javadoc.skip=true -Dmaven.source.skip=true
Copy the generated tarballs to /opt/package/atlas/ and extract them under /opt/service/atlas/.
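Because a single mistyped version can cost a full rebuild, it may help to compare pom.xml against the intended versions before running Maven. This is a small sketch using only the standard library; the EXPECTED table repeats the values listed in this section, and the function names are illustrative.

```python
import xml.etree.ElementTree as ET

# Versions taken from this section; adjust them to match your cluster.
EXPECTED = {
    "hadoop.version": "2.8.5",
    "hbase.version": "1.4.9",
    "kafka.version": "2.0.0",
    "hive.version": "2.3.5",
    "zookeeper.version": "3.4.6",
    "falcon.version": "0.8",
    "sqoop.version": "1.4.6.2.3.99.0-195",
    "storm.version": "1.2.0",
    "elasticsearch.version": "7.1.0",
}


def local_name(tag):
    """Strip an XML namespace such as '{http://maven.apache.org/POM/4.0.0}'."""
    return tag.split("}", 1)[-1]


def pom_properties(pom_xml):
    """Return the top-level <properties> of a POM as a dict."""
    root = ET.fromstring(pom_xml)
    result = {}
    for child in root:
        if local_name(child.tag) == "properties":
            for prop in child:
                result[local_name(prop.tag)] = (prop.text or "").strip()
    return result


def version_mismatches(pom_xml, expected=EXPECTED):
    """Map each differing key to (expected, actual); actual is None if absent."""
    actual = pom_properties(pom_xml)
    return {key: (want, actual.get(key))
            for key, want in expected.items()
            if actual.get(key) != want}
```

An empty dict from version_mismatches(open("pom.xml", encoding="utf-8").read()) means every listed property already has the intended value.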
2.2 Deploy
Modify the configuration files (atlas-env.sh, atlas-application.properties, atlas-log4j.xml) and set the environment variables.
export JAVA_HOME=/usr/local/jdk
export HBASE_CONF_DIR=/usr/local/service/hbase/conf
export ATLAS_SERVER_HEAP="-Xms2048m -Xmx15360m -XX:MaxNewSize=5120m -XX:MetaspaceSize=100M -XX:MaxMetaspaceSize=512m"
# hbase configuration
atlas.graph.storage.hbase.table=apache_atlas_janus_test
atlas.graph.storage.hostname=10.0.11.6
# elasticsearch index backend
atlas.graph.index.search.backend=elasticsearch
atlas.graph.index.search.index-name=test_janusgraph
# kafka settings
atlas.kafka.bootstrap.servers=10.0.12.95:9092
atlas.kafka.zookeeper.connect=10.0.11.6:2181
Set ATLAS_HOME and update PATH in /etc/profile, then source the file.
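Before the first start, a quick check that atlas-application.properties actually carries the keys shown above can save a long failed initialization. The sketch below is deliberately simple: it handles only key=value lines and comments, not line continuations or ":" separators, and the required-key list is just the set of keys from this section.

```python
REQUIRED_KEYS = (
    "atlas.graph.storage.hbase.table",
    "atlas.graph.storage.hostname",
    "atlas.graph.index.search.backend",
    "atlas.graph.index.search.index-name",
    "atlas.kafka.bootstrap.servers",
    "atlas.kafka.zookeeper.connect",
)


def parse_properties(text):
    """Parse simple 'key=value' lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, separator, value = line.partition("=")
        if separator:
            props[key.strip()] = value.strip()
    return props


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or have an empty value."""
    props = parse_properties(text)
    return [key for key in required if not props.get(key)]
```

Running missing_keys on the file's contents should return an empty list once every setting from this section is in place.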
3. Run
3.1 Start
atlas_start.py
The first start may take about an hour to finish initialization.
3.2 Access
Open http://localhost:21000/ with username admin and password admin.
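Since the first start can take a long time, polling the server is more comfortable than refreshing the browser. The sketch below assumes the default URL and admin/admin credentials given above and uses the /api/atlas/admin/version endpoint; the retry counts are arbitrary defaults, not values from the Atlas documentation.

```python
import base64
import json
import time
import urllib.request


def basic_auth_header(user, password):
    """Value of the HTTP Authorization header for basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def atlas_version(base_url="http://localhost:21000", user="admin",
                  password="admin", timeout=10):
    """Fetch the server's version document; raises OSError while Atlas is down."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/atlas/admin/version",
        headers={"Authorization": basic_auth_header(user, password)},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def wait_for_atlas(check=atlas_version, attempts=60, delay_seconds=60,
                   sleep=time.sleep):
    """Call check() until it succeeds, sleeping between failed attempts."""
    for _ in range(attempts):
        try:
            return check()
        except OSError:  # urllib.error.URLError is a subclass of OSError
            sleep(delay_seconds)
    raise TimeoutError("Atlas did not respond in time")
```

With the defaults, wait_for_atlas() retries once a minute for up to an hour, which roughly matches the initialization time mentioned in section 3.1.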
4. Hook Configuration
4.1 Hive Hook
Configure each Hive node:
mkdir -p /opt/service/atlas/apache-atlas-2.2.0-SNAPSHOT/conf
# copy atlas-application.properties, atlas-env.sh, and atlas-log4j.xml to the conf directory
export ATLAS_HOME=/opt/service/atlas/apache-atlas-2.2.0-SNAPSHOT
export JAVA_TOOL_OPTIONS="-Datlas.conf=$ATLAS_HOME/conf"
source /etc/profile
Upload the Hive hook package via nc, unzip it, and add third‑party dependencies (e.g., elasticsearch-hadoop-hive-7.1.0.jar, hive-kudu-handler-1.10.0.jar).
<property>
<name>hive.exec.post.hooks</name>
<value>org.apache.atlas.hive.hook.HiveHook</value>
</property>
<property>
<name>hive.metastore.event.listeners</name>
<value>org.apache.atlas.hive.hook.HiveMetastoreHook</value>
</property>
Restart Hive after configuration.
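On a cluster with several Hive nodes it is easy to miss one. A small check like the following, run against each node's Hive configuration file, can confirm that both hook properties are present. It is a sketch under the assumption that the file uses the standard Hadoop configuration/property/name/value layout and that multiple hooks are comma-separated; the function names are illustrative.

```python
import xml.etree.ElementTree as ET

HIVE_HOOKS = {
    "hive.exec.post.hooks": "org.apache.atlas.hive.hook.HiveHook",
    "hive.metastore.event.listeners": "org.apache.atlas.hive.hook.HiveMetastoreHook",
}


def hadoop_properties(xml_text):
    """Read <property><name>/<value> pairs from a Hadoop-style config file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def hook_registered(xml_text, key, hook_class):
    """True if hook_class is one of the comma-separated values of key."""
    value = hadoop_properties(xml_text).get(key, "")
    return hook_class in [item.strip() for item in value.split(",")]


def missing_hive_hooks(xml_text, hooks=HIVE_HOOKS):
    """Return the property names whose Atlas hook class is not configured."""
    return [key for key, cls in hooks.items()
            if not hook_registered(xml_text, key, cls)]
```

An empty list from missing_hive_hooks means both hook classes are registered on that node.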
4.2 Sqoop Hook
The steps mirror the Hive hook: create the configuration directory, set the environment variables, upload and unzip the Sqoop hook package, and create symbolic links to its JARs.
mkdir -p /opt/service/atlas/apache-atlas-2.2.0-SNAPSHOT/conf
# copy configuration files as for Hive
export ATLAS_HOME=/opt/service/atlas/apache-atlas-2.2.0-SNAPSHOT
export JAVA_TOOL_OPTIONS="-Datlas.conf=$ATLAS_HOME/conf"
source /etc/profile
<property>
<name>sqoop.job.data.publish.class</name>
<value>org.apache.atlas.sqoop.hook.SqoopHook</value>
</property>
After linking the JARs, restart Sqoop to activate the hook.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
