Comprehensive Guide to Installing and Configuring Apache Atlas with Hive and Sqoop Hooks

This article provides a step‑by‑step tutorial on using Apache Atlas for data lineage, including SQL execution, custom data maps, tagging, field search, detailed installation procedures, runtime commands, and the configuration of Hive and Sqoop hooks for a complete big‑data governance solution.

Big Data Technology & Architecture

1. Application

1.1 Execute SQL

-- Create a temporary table (keep the latest visit record)
DROP TABLE IF EXISTS tmp.tmp_ng;
CREATE TABLE tmp.tmp_ng STORED AS parquet AS
SELECT userid,
    pageid,
    pagetype,
    os,
    terminal_type,
    pagetime
FROM (
    SELECT uid userid,
        CAST(pageid AS INT) pageid,
        pagetype,
        os,
        terminal_type,
        pagetime,
        row_number() OVER(
            PARTITION BY uid,
            pageid
            ORDER BY pagetime DESC
        ) rk
    FROM dw.dw_nginxlog
    WHERE dt = '2020-11-02'
) t1
WHERE rk = 1;
-- Collect visits and device information for the three pages
INSERT OVERWRITE TABLE dwd.dwd_hive_fact_nginx_visit_ext_dt PARTITION(dt = '2020-11-03')
SELECT zbid,
    userid,
    os,
    terminal_type
FROM (
    SELECT zbid,
        userid,
        os,
        terminal_type,
        row_number() OVER(
            PARTITION BY zbid,
            userid
            ORDER BY pagetime DESC
        ) rk
    FROM (
        SELECT tp.zbid,
            t1.userid,
            t1.os,
            t1.terminal_type,
            t1.pagetime
        FROM (
            SELECT userid,
                pageid,
                os,
                terminal_type,
                pagetime
            FROM tmp.tmp_ng
            WHERE pagetype = 1 -- topic type
            ) t1
            LEFT JOIN (
                SELECT zbid,
                    topicid
                FROM dwd.dwd_minisns_zbtopics_ds
                WHERE dt = '2020-11-02'
                ) tp -- snapshot table
            ON t1.pageid = tp.topicid
        UNION ALL
        SELECT tp.zbid,
            t2.userid,
            t2.os,
            t2.terminal_type,
            t2.pagetime
        FROM (
            SELECT userid,
                pageid,
                os,
                terminal_type,
                pagetime
            FROM tmp.tmp_ng
            WHERE pagetype = 2 -- channel type
            ) t2
            LEFT JOIN dw.dw_zbchannel tp ON t2.pageid = tp.channelid
        ) t3
    ) t4
WHERE rk = 1;

1.2 Hand‑written Data Map

1.3 Atlas Lineage Analysis

Explanation: Atlas captures the SQL executed through Hive (via the hooks configured in section 4) and displays the upstream and downstream relationships between tables, which makes it easier to organize data by domain and to maintain it across iterations.
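The relationships shown in the lineage view can also be read through the Atlas REST API. The following is a minimal Python sketch, assuming the server address and default credentials from section 3.2 and the v2 lineage endpoint. The helper names (lineage_url, fetch_json, lineage_edges) are illustrative and not part of Atlas.

```python
import base64
import json
import urllib.parse
import urllib.request

ATLAS_URL = "http://localhost:21000"  # address from section 3.2


def lineage_url(guid, direction="BOTH", depth=3, base=ATLAS_URL):
    """Build the v2 lineage URL for one entity GUID."""
    query = urllib.parse.urlencode({"direction": direction, "depth": depth})
    return f"{base}/api/atlas/v2/lineage/{urllib.parse.quote(guid)}?{query}"


def fetch_json(url, user="admin", password="admin"):
    """GET a URL with HTTP basic auth and decode the JSON response."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def lineage_edges(lineage):
    """Turn a lineage response into (from, to) pairs of display names."""
    names = {
        guid: header.get("displayText") or guid
        for guid, header in lineage.get("guidEntityMap", {}).items()
    }
    return [
        (names.get(r["fromEntityId"], r["fromEntityId"]),
         names.get(r["toEntityId"], r["toEntityId"]))
        for r in lineage.get("relations", [])
    ]
```

Given the GUID of a table such as tmp.tmp_ng, lineage_edges(fetch_json(lineage_url(guid))) lists its edges. In a Hive lineage graph the edges run through process entities, so a table normally points to a process, which in turn points to the downstream table.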

1.4 Tagging

1.4.1 Classification

Explanation: Different dimensions can be defined according to project requirements.
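Classifications can be created in the UI or through the REST API. The sketch below prepares the two requests involved, assuming the v2 typedefs and entity classification endpoints and the default admin credentials. The classification name used in the usage note is a hypothetical example, and the helper names are illustrative.

```python
import base64
import json
import urllib.request

ATLAS_URL = "http://localhost:21000"  # address from section 3.2


def classification_typedefs(names, description=""):
    """Build a typedefs payload that declares one classification per name."""
    return {
        "classificationDefs": [
            {"name": n, "description": description,
             "superTypes": [], "attributeDefs": []}
            for n in names
        ]
    }


def post_request(path, payload, user="admin", password="admin", base=ATLAS_URL):
    """Prepare (but do not send) an authenticated JSON POST request."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        base + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Basic {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def create_classifications(names):
    """Request that registers the given classification names."""
    return post_request("/api/atlas/v2/types/typedefs",
                        classification_typedefs(names))


def tag_entity(guid, name):
    """Request that attaches an existing classification to one entity."""
    return post_request(f"/api/atlas/v2/entity/guid/{guid}/classifications",
                        [{"typeName": name}])
```

Passing either request to urllib.request.urlopen sends it, for example urllib.request.urlopen(create_classifications(["sensitive"])). Keeping request construction separate from sending makes the payload easy to inspect before it reaches the server.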

1.4.2 Glossary

Explanation: A data‑warehouse project often contains many domains and layers; a glossary helps organize them.
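As a rough illustration of organizing layers in a glossary, the sketch below builds the JSON bodies for one glossary and one term per warehouse layer. It assumes the v2 glossary endpoints (POST /api/atlas/v2/glossary for the glossary, POST /api/atlas/v2/glossary/terms for a list of terms). The glossary name and GUID in the test values are hypothetical; the layer names come from the databases used in section 1.1.

```python
import json


def glossary_payload(name, short_description=""):
    """Body for creating a glossary."""
    return {"name": name, "shortDescription": short_description}


def term_payload(term_name, glossary_guid):
    """Body for one term, anchored to the glossary that owns it."""
    return {"name": term_name, "anchor": {"glossaryGuid": glossary_guid}}


def layer_terms(layers, glossary_guid):
    """One term per warehouse layer, ready to post as a JSON list."""
    return [term_payload(layer, glossary_guid) for layer in layers]


def to_json(payload):
    """Serialize a payload the way it would be sent in the request body."""
    return json.dumps(payload, ensure_ascii=False, sort_keys=True)
```

The glossary is created first, and the GUID returned by the server is then passed to layer_terms, for example layer_terms(["tmp", "dw", "dwd"], guid).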

1.5 Field Search

1.5.1 View Table Fields

1.5.2 Trace Field Relationships

Explanation: Field-level lineage lets you trace the complete data chain of a field, from its source columns to every downstream column derived from it.
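Field lookups can be scripted as well. This sketch assumes the v2 basic search endpoint and the hive_column entity type registered by the Hive hook. The cluster suffix "primary" is an assumption about the configured cluster name, and the helper names are illustrative.

```python
import urllib.parse

ATLAS_URL = "http://localhost:21000"  # address from section 3.2


def column_search_url(column, limit=25, base=ATLAS_URL):
    """Build a basic-search URL restricted to Hive columns."""
    query = urllib.parse.urlencode(
        {"typeName": "hive_column", "query": column, "limit": limit}
    )
    return f"{base}/api/atlas/v2/search/basic?{query}"


def column_qualified_name(db, table, column, cluster="primary"):
    """Qualified name of a Hive column in the form db.table.column@cluster."""
    return f"{db}.{table}.{column}@{cluster}"


def matching_columns(search_result):
    """Pick a readable identifier for each entity in a search response."""
    found = []
    for entity in search_result.get("entities") or []:
        attrs = entity.get("attributes") or {}
        found.append(attrs.get("qualifiedName")
                     or entity.get("displayText")
                     or entity.get("guid"))
    return found
```

The GUID of a matching column can then be passed to the lineage endpoint to follow the field upstream and downstream.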

2. Installation

2.1 Compile and Install

Download source, modify component versions, fix compatibility errors, compile, package, copy and unzip.

git clone --branch branch-2.0 https://gitee.com/mirrors/apache-atlas.git apache-atlas_branch-2.0

Set the component versions in the properties section of pom.xml to match the cluster:

<hadoop.version>2.8.5</hadoop.version>
<hbase.version>1.4.9</hbase.version>
<kafka.version>2.0.0</kafka.version>
<hive.version>2.3.5</hive.version>
<zookeeper.version>3.4.6</zookeeper.version>
<falcon.version>0.8</falcon.version>
<sqoop.version>1.4.6.2.3.99.0-195</sqoop.version>
<storm.version>1.2.0</storm.version>
<elasticsearch.version>7.1.0</elasticsearch.version>

Because the build targets HBase 1.4.9, remove the hbase-zookeeper dependency from pom.xml and replace classes that exist only in the HBase 2.x API with their 1.x equivalents (e.g., ColumnFamilyDescriptor → HColumnDescriptor).

export MAVEN_OPTS="-Xms2g -Xmx2g"
mvn clean package -Pdist -Dmaven.test.skip=true -Dmaven.javadoc.skip=true -Dmaven.source.skip=true

Copy the generated tarballs (written to distro/target/ by the build) to /opt/package/atlas/ and extract them under /opt/service/atlas/.
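The copy-and-extract step can be scripted. The sketch below is one possible version, assuming the build output directory contains .tar.gz files; the function name is illustrative and the directories are passed in as parameters rather than hard-coded.

```python
import shutil
from pathlib import Path


def stage_and_extract(build_dir, package_dir, service_dir):
    """Copy every .tar.gz from build_dir to package_dir, then extract each
    copy under service_dir. Returns the names of the staged tarballs."""
    package_dir = Path(package_dir)
    service_dir = Path(service_dir)
    package_dir.mkdir(parents=True, exist_ok=True)
    service_dir.mkdir(parents=True, exist_ok=True)

    staged = []
    for tarball in sorted(Path(build_dir).glob("*.tar.gz")):
        copy = package_dir / tarball.name
        shutil.copy2(tarball, copy)                        # keep an archive copy
        shutil.unpack_archive(str(copy), str(service_dir))  # extract in place
        staged.append(copy.name)
    return staged
```

With the paths from this guide the call would be stage_and_extract("distro/target", "/opt/package/atlas", "/opt/service/atlas"), run from the source checkout.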

2.2 Deploy

Modify the configuration files (atlas-env.sh, atlas-application.properties, atlas-log4j.xml) and set the environment variables.

# atlas-env.sh
export JAVA_HOME=/usr/local/jdk
export HBASE_CONF_DIR=/usr/local/service/hbase/conf
export ATLAS_SERVER_HEAP="-Xms2048m -Xmx15360m -XX:MaxNewSize=5120m -XX:MetaspaceSize=100M -XX:MaxMetaspaceSize=512m"

# atlas-application.properties
# hbase configuration
atlas.graph.storage.hbase.table=apache_atlas_janus_test
atlas.graph.storage.hostname=10.0.11.6
# elasticsearch index backend
atlas.graph.index.search.backend=elasticsearch
atlas.graph.index.search.index-name=test_janusgraph
# kafka settings
atlas.kafka.bootstrap.servers=10.0.12.95:9092
atlas.kafka.zookeeper.connect=10.0.11.6:2181

Set ATLAS_HOME and update PATH in /etc/profile, then source the file.
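Before starting the server it can help to confirm that the properties file actually contains the settings listed above. The following is a small sketch that parses simple key=value lines and reports missing keys. The parser is deliberately simplified (no line continuations or escapes) and the function names are illustrative.

```python
# Keys taken from the configuration shown in this section.
REQUIRED_KEYS = (
    "atlas.graph.storage.hbase.table",
    "atlas.graph.storage.hostname",
    "atlas.graph.index.search.backend",
    "atlas.graph.index.search.index-name",
    "atlas.kafka.bootstrap.servers",
    "atlas.kafka.zookeeper.connect",
)


def parse_properties(text):
    """Parse key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not props.get(key)]
```

Reading the file and calling missing_keys(parse_properties(text)) returns an empty list when every setting is present.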

3. Run

3.1 Start

atlas_start.py

The first start can take about an hour while Atlas initializes.

3.2 Access

Open http://localhost:21000/ and sign in with the default credentials (username admin, password admin).
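Since the first start is slow, a script can poll the server until it is ready instead of reloading the page by hand. This sketch assumes the admin status endpoint (/api/atlas/admin/status), which reports a Status field, and the default credentials. The defaults of 120 attempts at 30-second intervals are chosen to cover roughly the one hour mentioned above; the function names are illustrative.

```python
import base64
import json
import time
import urllib.request

ATLAS_URL = "http://localhost:21000"  # address from this section


def atlas_status(base=ATLAS_URL, user="admin", password="admin", timeout=10):
    """Return the server state reported by the admin status endpoint."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        base + "/api/atlas/admin/status",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp).get("Status")


def wait_until_active(probe=atlas_status, attempts=120, delay=30.0,
                      sleep=time.sleep):
    """Poll until the probe reports ACTIVE; give up after `attempts` tries."""
    for _ in range(attempts):
        try:
            if probe() == "ACTIVE":
                return True
        except (OSError, ValueError):
            # Connection refused, HTTP errors, or a non-JSON body:
            # the server is still starting, so try again.
            pass
        sleep(delay)
    return False
```

The probe and sleep parameters are injectable so the loop can be exercised without a running server.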

4. Hook Configuration

4.1 Hive Hook

Configure each Hive node:

mkdir -p /opt/service/atlas/apache-atlas-2.2.0-SNAPSHOT/conf
# copy atlas-application.properties, atlas-env.sh, atlas-log4j.xml to the conf directory
export ATLAS_HOME=/opt/service/atlas/apache-atlas-2.2.0-SNAPSHOT
export JAVA_TOOL_OPTIONS="-Datlas.conf=$ATLAS_HOME/conf"
source /etc/profile

Upload the Hive hook package via nc, extract it, and add third-party dependencies (e.g., elasticsearch-hadoop-hive-7.1.0.jar, hive-kudu-handler-1.10.0.jar). Then register the hooks in hive-site.xml:

<property>
    <name>hive.exec.post.hooks</name>
    <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>
<property>
    <name>hive.metastore.event.listeners</name>
    <value>org.apache.atlas.hive.hook.HiveMetastoreHook</value>
</property>

Restart Hive after configuration.
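A quick way to confirm that every Hive node carries the two hook properties is to parse its hive-site.xml. The sketch below does this with the standard library, treating each value as a comma-separated list so that hooks registered alongside others are still detected. The function names are illustrative.

```python
import xml.etree.ElementTree as ET

# Property names and classes from the configuration shown above.
EXPECTED_HOOKS = {
    "hive.exec.post.hooks": "org.apache.atlas.hive.hook.HiveHook",
    "hive.metastore.event.listeners": "org.apache.atlas.hive.hook.HiveMetastoreHook",
}


def read_site_properties(xml_content):
    """Map property names to values from a Hadoop-style *-site.xml document."""
    root = ET.fromstring(xml_content)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name:
            props[name] = (prop.findtext("value") or "").strip()
    return props


def missing_hooks(props, expected=EXPECTED_HOOKS):
    """Return property names whose value does not list the expected class."""
    missing = []
    for key, hook_class in expected.items():
        configured = [v.strip() for v in props.get(key, "").split(",")]
        if hook_class not in configured:
            missing.append(key)
    return missing
```

Reading the file as bytes and calling missing_hooks(read_site_properties(data)) returns an empty list when both hooks are registered.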

4.2 Sqoop Hook

The steps mirror the Hive hook: create the configuration directory, set the environment variables, upload and extract the Sqoop hook package, and create symbolic links to its JARs.

mkdir -p /opt/service/atlas/apache-atlas-2.2.0-SNAPSHOT/conf
# copy configuration files as for Hive
export ATLAS_HOME=/opt/service/atlas/apache-atlas-2.2.0-SNAPSHOT
export JAVA_TOOL_OPTIONS="-Datlas.conf=$ATLAS_HOME/conf"
source /etc/profile

Register the hook in sqoop-site.xml:

<property>
    <name>sqoop.job.data.publish.class</name>
    <value>org.apache.atlas.sqoop.hook.SqoopHook</value>
</property>

After linking the JARs, restart any long-running Sqoop service so the hook is activated; jobs launched from the command line pick it up on their next run.
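The linking step can be sketched as follows. The source and target directories are assumptions, since the article does not name them: typically the hook JARs sit in a hook/sqoop directory of the Atlas package and are linked into the Sqoop lib directory. The function name is illustrative.

```python
import os
from pathlib import Path


def link_hook_jars(hook_dir, lib_dir):
    """Symlink every .jar in hook_dir into lib_dir.

    Existing files or links in lib_dir are left untouched.
    Returns the names of the JARs that were newly linked.
    """
    lib = Path(lib_dir)
    lib.mkdir(parents=True, exist_ok=True)

    linked = []
    for jar in sorted(Path(hook_dir).glob("*.jar")):
        target = lib / jar.name
        if target.is_symlink() or target.exists():
            continue  # do not overwrite anything already in place
        os.symlink(jar.resolve(), target)
        linked.append(jar.name)
    return linked
```

Because existing entries are skipped, the function can be rerun safely after upgrading only part of the hook package.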

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
