Understanding Hadoop DistributedCache: Concepts, API Usage, and Example

This article explains Hadoop's DistributedCache mechanism, its APIs for adding cache files and archives, common use cases, important considerations, the basic workflow, and provides a complete Java Map-side join example demonstrating how to distribute and access cached data in MapReduce jobs.

Big Data Technology & Architecture

Apr 10, 2019

1. Introduction

DistributedCache is a mechanism provided by the Hadoop framework. Before a job runs, it copies the files the job specifies to the machines that will execute its tasks, and it manages those cached files afterwards.

DistributedCache is well suited to distributing large, read‑only files that a specific application needs. It is a feature of the Map/Reduce framework and can cache any file the application requires (text files, archives, jar files, and so on).

Before any task of a job runs on a slave node, the Map/Reduce framework copies the necessary files to that node. This is efficient because the files are copied only once per job, and only to slave nodes that do not already have them.

DistributedCache tracks the modification timestamps of the cached files. While the job is running, neither the application itself nor any external program may modify them.
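The timestamp check can be illustrated with a small, self-contained sketch. It is not Hadoop's code: the class `CacheFreshness` and its methods are hypothetical, and a local temporary file stands in for a file on HDFS. The sketch records a file's modification time when the file is "cached" and later reports whether the source has changed since then.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of timestamp-based cache validation (not Hadoop's implementation).
class CacheFreshness {
    // Modification time of each source file, recorded at caching time.
    private final Map<Path, FileTime> recorded = new HashMap<>();

    // Remember the source file's current modification time.
    void register(Path source) throws IOException {
        recorded.put(source, Files.getLastModifiedTime(source));
    }

    // True only if the file was registered and its modification time is unchanged.
    boolean isFresh(Path source) throws IOException {
        FileTime cachedAt = recorded.get(source);
        return cachedAt != null && cachedAt.equals(Files.getLastModifiedTime(source));
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("dict", ".txt");
        CacheFreshness cache = new CacheFreshness();
        cache.register(file);
        System.out.println("fresh after caching: " + cache.isFresh(file));

        // Simulate an external modification by moving the timestamp forward one minute.
        long bumped = Files.getLastModifiedTime(file).toMillis() + 60_000L;
        Files.setLastModifiedTime(file, FileTime.fromMillis(bumped));
        System.out.println("fresh after modification: " + cache.isFresh(file));

        Files.delete(file);
    }
}
```

A changed timestamp is treated as "the cached copy can no longer be trusted", which is why the source files have to stay untouched for the lifetime of the job.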

DistributedCache can distribute simple read‑only data or text files, as well as complex files such as archives and jar files. Archive files (zip, tar, tgz, tar.gz) are un‑archived on the slave nodes and can be given execution permissions.

Users can set mapred.cache.files or mapred.cache.archives to distribute files. Multiple files can be separated by commas.

DistributedCache can be used as a basic software distribution mechanism in map/reduce tasks, for distributing jar packages and native libraries.

The APIs DistributedCache.addArchiveToClassPath(Path, Configuration) and DistributedCache.addFileToClassPath(Path, Configuration) can cache files and jar packages and add them to the child JVM's classpath. The same effect can be achieved by setting mapred.job.classpath.files or mapred.job.classpath.archives in the configuration.

Adding cache files:

DistributedCache.addCacheFile(URI, conf)
DistributedCache.addCacheArchive(URI, conf)
DistributedCache.setCacheFiles(URIs, conf)
DistributedCache.setCacheArchives(URIs, conf)

Here the URI is an HDFS path of the form hdfs://namenode:port/path/to/file.

Creating symbolic links:

Calling DistributedCache.createSymlink(Configuration) makes the framework create symbolic links to the cached files in each task's current working directory. Setting the configuration property mapred.create.symlink to yes has the same effect. DistributedCache uses the fragment after ‘#’ in the URI as the link name.

For example, the URI hdfs://namenode:port/lib.so.1#lib.so creates a link named lib.so in the task’s working directory that points to the cached lib.so.1.
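How the link name is derived from the fragment can be sketched with nothing but `java.net.URI`. The class `CacheLinkNames` is hypothetical and only mimics the naming rule described above; the port 9000 and the second path are made-up values for illustration. The sketch also splits a comma-separated value in the style of mapred.cache.files.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper that mimics how a link name is taken from a cache URI.
class CacheLinkNames {
    // The link name is the fragment after '#'; empty if the URI has no fragment.
    static Optional<String> linkName(URI uri) {
        String fragment = uri.getFragment();
        return (fragment == null || fragment.isEmpty())
                ? Optional.<String>empty()
                : Optional.of(fragment);
    }

    // Split a comma-separated property value into individual URIs.
    static List<URI> parse(String propertyValue) {
        List<URI> uris = new ArrayList<>();
        for (String entry : propertyValue.split(",")) {
            uris.add(URI.create(entry.trim()));
        }
        return uris;
    }

    public static void main(String[] args) {
        // Assumed example value; host, port, and paths are illustrative only.
        String value = "hdfs://namenode:9000/lib.so.1#lib.so,hdfs://namenode:9000/dict/words.txt";
        for (URI uri : parse(value)) {
            System.out.println(uri.getPath() + " -> " + linkName(uri).orElse("(no link name given)"));
        }
    }
}
```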

2. Common Application Scenarios

Distribute third‑party libraries (jar, .so, etc.)

Distribute dictionary files required by algorithms

Distribute configuration files needed at runtime

Distribute small tables for join operations

3. Precautions

DistributedCache should only be used in distributed environments (pseudo‑distributed or fully distributed). Some APIs may have portability issues.

Files to be distributed must be uploaded to HDFS beforehand; a path given without a scheme is resolved with the hdfs:// prefix, not file://.

The files to be distributed should remain read‑only during execution.

Avoid distributing very large files (e.g., large compressed archives) as they may slow task startup.

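When creating the Job instance in the driver, be sure to pass in the Configuration object on which the cache files were registered, and register them before the Job is created, because Job works on its own copy of the Configuration. Otherwise calls such as DistributedCache.getLocalCacheFiles(conf) in the Mapper/Reducer will return null.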
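The precaution about the default hdfs:// prefix can be pictured with plain `java.net.URI` resolution. This is a simplified sketch, not Hadoop's path resolver: the class `CachePathResolution` is hypothetical, and the default file system URI hdfs://mylinux:9000/ is borrowed from the example in section 5.

```java
import java.net.URI;

// Hypothetical sketch: qualify a cache path against a default file system URI.
class CachePathResolution {
    // A path that already carries a scheme is kept as-is; a scheme-less path
    // inherits scheme and authority from the default file system.
    static URI qualify(URI defaultFs, String path) {
        URI uri = URI.create(path);
        return uri.getScheme() != null ? uri : defaultFs.resolve(uri);
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://mylinux:9000/"); // assumed cluster setting
        System.out.println(qualify(defaultFs, "/data/exam/movie/movies.dat#movies.dat"));
        System.out.println(qualify(defaultFs, "file:///tmp/movies.dat"));
    }
}
```

A file that exists only on the local disk of the submitting machine is therefore not found unless it is first copied to HDFS.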

4. Basic Process

Each TaskTracker starts a TrackerDistributedCacheManager to manage cache files for all tasks on that node.

When a job is submitted, JobClient validates the files to be cached (existence, size, modification time, and permissions).

When a task initializes the job on a TaskTracker, a TaskDistributedCacheManager is created to manage that task’s cache files.

The task’s TaskDistributedCacheManager fetches and extracts the relevant cache files to local directories; if the files already exist locally and are up‑to‑date, they are reused.

When a task finishes, the cache reference count is decremented. TrackerDistributedCacheManager runs a cleanup thread every minute to remove unused cache directories that exceed size or file‑count thresholds.
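The reference counting and cleanup steps above can be modelled in a few lines. This is a deliberately simplified, hypothetical sketch (`RefCountedCache` is not a Hadoop class): it tracks only a total-size threshold, ignores the file-count threshold, and leaves out the timer thread that would call `cleanup()` periodically.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified, hypothetical model of reference-counted cache cleanup.
class RefCountedCache {
    private static final class CachedFile {
        final long sizeBytes;
        int refCount;
        CachedFile(long sizeBytes) { this.sizeBytes = sizeBytes; }
    }

    private final Map<String, CachedFile> files = new LinkedHashMap<>();
    private final long maxBytes; // assumed size threshold for the whole cache

    RefCountedCache(long maxBytes) { this.maxBytes = maxBytes; }

    // A task starts using a file: reuse the existing entry or add one, then count the reference.
    void acquire(String name, long sizeBytes) {
        files.computeIfAbsent(name, n -> new CachedFile(sizeBytes)).refCount++;
    }

    // A task finished: decrement the count but keep the file for possible reuse.
    void release(String name) {
        CachedFile f = files.get(name);
        if (f != null && f.refCount > 0) {
            f.refCount--;
        }
    }

    // While the cache is over the threshold, evict files that no task references.
    void cleanup() {
        long total = 0;
        for (CachedFile f : files.values()) {
            total += f.sizeBytes;
        }
        Iterator<Map.Entry<String, CachedFile>> it = files.entrySet().iterator();
        while (total > maxBytes && it.hasNext()) {
            CachedFile f = it.next().getValue();
            if (f.refCount == 0) {
                total -= f.sizeBytes;
                it.remove();
            }
        }
    }

    boolean contains(String name) { return files.containsKey(name); }

    public static void main(String[] args) {
        RefCountedCache cache = new RefCountedCache(100);
        cache.acquire("a", 80);
        cache.acquire("b", 80);
        cache.release("a"); // the task using "a" is done
        cache.cleanup();    // over the threshold, so unreferenced "a" is evicted
        System.out.println("a cached: " + cache.contains("a") + ", b cached: " + cache.contains("b"));
    }
}
```

Files that are still referenced are never evicted in this model, even when the cache is over its threshold; only unreferenced entries are candidates.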

5. Application Example

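import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

// Map-side join: the small movies table is shipped through DistributedCache
// and joined in the mapper against the large input.
public class MapJoinByCache {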
    public static class MapJoiner extends Mapper<LongWritable, Text, Text, Text> {
        // Lookup table movieId -> movieName, loaded once per map task.
        private final Map<String, String> movies = new HashMap<String, String>();

        @Override
        public void setup(Context context) throws IOException {
            // "movies.dat" is the symlink in the task's working directory,
            // named by the "#movies.dat" fragment of the cache URI.
            BufferedReader br = new BufferedReader(new FileReader("movies.dat"));
            try {
                String line;
                while ((line = br.readLine()) != null) {
                    String[] splits = line.split("::");
                    if (splits.length < 2) {
                        continue; // skip malformed lines
                    }
                    movies.put(splits[0], splits[1]);
                }
            } finally {
                br.close(); // also closes the underlying FileReader
            }
            // An IOException now fails the task instead of leaving an empty table.
        }
        private Text outKey = new Text();
        private Text outVal = new Text();
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            if (value == null) {
                return;
            }
            String line = value.toString();
            String[] splits = line.split("::");
            if (splits.length < 2) {
                return; // skip malformed records
            }
            String movieId = splits[1];
            String movieName = movies.get(movieId);
            if (movieName == null) {
                return; // no matching movie: drop the record (inner join)
            }
            outKey.set(movieId);
            outVal.set(movieName + "::" + line);
            context.write(outKey, outVal);
        }
    }
    public static class DirectReducer extends Reducer<Text, Text, NullWritable, Text> {
        NullWritable outKey = NullWritable.get();
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text value : values) {
                context.write(outKey, value);
            }
        }
    }
    public static void main(String[] args) throws URISyntaxException, IOException, InterruptedException, ClassNotFoundException {
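        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        // Register the cache file BEFORE creating the Job: the Job copies the
        // Configuration, so later changes to conf are not visible to the job.
        DistributedCache.createSymlink(conf);
        // The "#movies.dat" fragment names the symlink that the mapper opens in setup().
        DistributedCache.addCacheFile(new URI("hdfs://mylinux:9000/data/exam/movie/movies.dat#movies.dat"), conf);
        Job job = new Job(conf);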
        job.setJobName("Join on Map Side");
        job.setJarByClass(MapJoinByCache.class);
        job.setMapperClass(MapJoiner.class);
        job.setReducerClass(DirectReducer.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
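The join logic of the mapper can also be tried without a cluster. The following Hadoop-free sketch is an illustration under stated assumptions: the class `MovieJoin` and its methods are hypothetical, and the sample lines are made up, assuming "::"-separated records whose second field is the movie ID, as in the mapper above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Hadoop-free version of the mapper's join logic.
class MovieJoin {
    // Build movieId -> movieName from lines such as "1::Toy Story (1995)::Animation".
    static Map<String, String> loadMovies(List<String> lines) {
        Map<String, String> movies = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("::");
            if (parts.length >= 2) {
                movies.put(parts[0], parts[1]);
            }
        }
        return movies;
    }

    // Join one record on its second field; returns null if malformed or unmatched.
    static String join(Map<String, String> movies, String record) {
        String[] parts = record.split("::");
        if (parts.length < 2) {
            return null;
        }
        String name = movies.get(parts[1]);
        return name == null ? null : name + "::" + record;
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration.
        Map<String, String> movies = loadMovies(Arrays.asList("1::Toy Story (1995)::Animation"));
        System.out.println(join(movies, "7::1::5::978300760"));
    }
}
```

Keeping the parsing and lookup in plain static methods like these makes the join easy to unit test; the Mapper then only has to call them from setup() and map().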

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
