Configuring Hadoop to Support LZO Compression

This guide explains how to enable LZO compression in Hadoop: install the Twitter-provided hadoop-lzo library, update core-site.xml, synchronize the files across nodes, run a WordCount MapReduce job with LZO-compressed output, and create indexes for the resulting LZO files.

Big Data Technology & Architecture

Sep 1, 2020

Hadoop does not support LZO compression out of the box, so the Twitter-provided hadoop-lzo component must first be compiled against Hadoop and the LZO library, and then installed on the cluster.

1. Place the compiled hadoop-lzo-0.4.20.jar in /opt/module/hadoop-2.7.2/share/hadoop/common/ on the master node (hadoop102):

[atguigu@hadoop102 common]$ pwd
/opt/module/hadoop-2.7.2/share/hadoop/common
[atguigu@hadoop102 common]$ ls
hadoop-lzo-0.4.20.jar

2. Synchronize the JAR to the other Hadoop nodes (hadoop103, hadoop104) using the xsync command:

[atguigu@hadoop102 common]$ xsync hadoop-lzo-0.4.20.jar
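xsync is not a standard Hadoop or Linux command; it is presumably a site-local wrapper script. As a rough sketch only, assuming it copies a file to the same absolute path on each peer with rsync, the hypothetical helper below prints the commands such a wrapper might run. The function name xsync_plan and the rsync flags are illustrative assumptions, not the real script.

```shell
# Dry-run sketch of an xsync-style helper (hypothetical; the real script may differ).
# Usage: xsync_plan ABSOLUTE_PATH HOST...
# Prints one rsync command per host, copying the file into the same directory remotely.
xsync_plan() (
  path="$1"
  shift
  for host in "$@"; do
    printf 'rsync -av %s %s:%s/\n' "$path" "$host" "$(dirname "$path")"
  done
)

# Print the plan for the two worker nodes named in this guide (nothing is copied).
xsync_plan /opt/module/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20.jar hadoop103 hadoop104
```

Printing the commands instead of running them makes it easy to review the target paths before anything is copied.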

3. Add LZO support to core-site.xml by inserting the following properties:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
  </property>
  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
</configuration>
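A long comma-separated codec list is easy to mistype, so a quick check can help. The sketch below defines a hypothetical helper, has_lzo_codecs, that succeeds only when both hadoop-lzo classes appear in the list. If the hdfs client is on the PATH, it also checks the value that `hdfs getconf -confKey` reports on the local node. It assumes the value contains no spaces around the commas.

```shell
# Succeed only if both hadoop-lzo codec classes appear in a comma-separated
# io.compression.codecs value (hypothetical helper, exact-match on class names).
has_lzo_codecs() (
  codecs=",$1,"
  for class in com.hadoop.compression.lzo.LzoCodec com.hadoop.compression.lzo.LzopCodec; do
    case "$codecs" in
      *",$class,"*) ;;   # class is listed, keep checking
      *) exit 1 ;;       # class is missing
    esac
  done
)

# On a node with the Hadoop client installed, check the effective local setting.
if command -v hdfs >/dev/null 2>&1; then
  if has_lzo_codecs "$(hdfs getconf -confKey io.compression.codecs)"; then
    echo "LZO codecs are registered"
  else
    echo "LZO codecs are missing from io.compression.codecs" >&2
  fi
fi
```

Running the same check on each node after the configuration has been synchronized would confirm that every node sees the updated file.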

4. Synchronize the updated core-site.xml to the other nodes:

[atguigu@hadoop102 hadoop]$ xsync core-site.xml

5. Start the Hadoop cluster:

[atguigu@hadoop102 hadoop-2.7.2]$ sbin/start-dfs.sh
[atguigu@hadoop103 hadoop-2.7.2]$ sbin/start-yarn.sh

6. LZO files are splittable only when an index file exists, so once a job has written LZO output (such as the WordCount example in step 7), create an index for each output file with DistributedLzoIndexer:

[atguigu@hadoop102 bin]$ hadoop jar /opt/module/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /output/part-r-00000.lzo
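A job with several reducers writes several part files, and each one needs its own index. As a sketch, the hypothetical helper indexer_cmds below reads `hdfs dfs -ls` output and prints one indexer command per file ending in .lzo. It assumes the JAR location used in this guide and paths without spaces.

```shell
# Location of the hadoop-lzo JAR used in this guide.
JAR=/opt/module/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20.jar

# Read `hdfs dfs -ls` output on stdin and print one DistributedLzoIndexer
# command for every entry whose path (last field) ends in .lzo.
indexer_cmds() {
  awk -v jar="$JAR" '/\.lzo$/ {
    print "hadoop jar " jar " com.hadoop.compression.lzo.DistributedLzoIndexer " $NF
  }'
}

# On a node with the Hadoop client installed, show the commands for /output.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -ls /output | indexer_cmds
fi
```

The commands are only printed; after reviewing them, piping the output to sh would run them. Existing .index files and the _SUCCESS marker are skipped because their names do not end in .lzo.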

7. Example: run the built‑in WordCount job with LZO compression.

a) Create a small word.txt file and upload it to HDFS:

[atguigu@hadoop102 bin]$ vim word.txt
[atguigu@hadoop102 bin]$ hdfs dfs -put word.txt /input

b) Execute the WordCount job with LZO output compression:

[atguigu@hadoop102 bin]$ hadoop jar /opt/module/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount -D mapreduce.output.fileoutputformat.compress=true -D mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec /input /output

The job writes LZO-compressed part files such as /output/part-r-00000.lzo. After indexing, each part file has a companion .index file (for example, part-r-00000.lzo.index), which lets Hadoop split the file for parallel processing.
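To confirm the result, the output directory can be listed, the compressed file decoded, and the index file checked for. The sketch below assumes the /output path from the example. The helper index_path is hypothetical and simply appends the .index suffix that hadoop-lzo uses for index files.

```shell
# Derive the index file path for an LZO file (hypothetical helper).
index_path() {
  printf '%s.index\n' "$1"
}

# On a node with the Hadoop client installed, inspect the WordCount output.
if command -v hdfs >/dev/null 2>&1; then
  part=/output/part-r-00000.lzo
  hdfs dfs -ls /output                    # list part files and any .index files
  hdfs dfs -text "$part" | head -n 5      # decode through the configured codecs
  if hdfs dfs -test -e "$(index_path "$part")"; then
    echo "index present for $part"
  else
    echo "no index yet for $part" >&2
  fi
fi
```

If `hdfs dfs -text` prints readable word counts, the LZO codecs from core-site.xml are being picked up on that node.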

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
