
From Java Streams to Spark: Basic Big Data Operations Explained

This article demonstrates how developers familiar with the Java Stream API can quickly grasp fundamental Spark operations, including map, flatMap, groupBy, and reduce, by translating stream examples into Spark code, with complete snippets, notes on transformations versus actions, and a practical tip for handling exceptions in distributed processing.


Preparation

The test data file (test.txt) contains one employee per line: name, age, department, and position.

张三,20,研发部,普通员工
李四,31,研发部,普通员工
李丽,36,财务部,普通员工
张伟,38,研发部,经理
杜航,25,人事部,普通员工
周歌,28,研发部,普通员工

Create an Employee class:

// Lombok generates the getters/setters and constructors used throughout;
// Encoders.bean additionally requires a public JavaBean with a no-arg constructor
@Getter
@Setter
@AllArgsConstructor
@NoArgsConstructor
@ToString
public static class Employee implements Serializable {
    private String name;
    private Integer age;
    private String department;
    private String level; // the position/title column from the file
}
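
Every Spark snippet below passes an encoder so Spark can convert Employee objects to and from its internal row format. It can also be created once and reused (a minimal sketch; the variable name is my own):

// Encoders.bean reflects over the bean's getters/setters, which is why
// Employee must be a public JavaBean with a no-arg constructor
Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);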

Map Operations

Java Stream map

// FileUtils is from commons-io; read every line, then map each CSV line to an Employee
List<String> list = FileUtils.readLines(new File("f:/test.txt"), "utf-8");
List<Employee> employeeList = list.stream().map(line -> {
    List<String> words = Arrays.stream(line.split(",")).collect(Collectors.toList());
    return new Employee(words.get(0), Integer.parseInt(words.get(1)),
        words.get(2), words.get(3));
}).collect(Collectors.toList());
employeeList.forEach(System.out::println);

Spark map

SparkSession session = SparkSession.builder().master("local[*]").getOrCreate();
// read the text file as a Dataset<Row> with one line per row
Dataset<Row> reader = session.read().text("F:/test.txt");
Dataset<Employee> employeeDataset = reader.map(new MapFunction<Row, Employee>() {
    @Override
    public Employee call(Row row) throws Exception {
        List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
        return new Employee(list.get(0), Integer.parseInt(list.get(1)),
            list.get(2), list.get(3));
    }
}, Encoders.bean(Employee.class));
employeeDataset.show();
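
A note on transformations versus actions: map, flatMap, and filter are lazy transformations that only build an execution plan, while show(), collect(), and reduce() are actions that actually trigger the job. A minimal sketch (the adults variable is my own):

// transformation: nothing executes yet, Spark only records the plan
Dataset<Employee> adults = employeeDataset.filter("age > 30");
// action: triggers reading the file and running map + filter
adults.show();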

FlatMap Operations

Java Stream flatMap

List<Employee> employeeList2 = list.stream().flatMap(line -> {
    List<String> words = Arrays.stream(line.split(",")).collect(Collectors.toList());
    List<Employee> lists = new ArrayList<>();
    // emit two employees per input line to demonstrate the 1-to-N mapping
    Employee e1 = new Employee(words.get(0), Integer.parseInt(words.get(1)),
        words.get(2), words.get(3));
    lists.add(e1);
    Employee e2 = new Employee(words.get(0) + "_2", Integer.parseInt(words.get(1)),
        words.get(2), words.get(3));
    lists.add(e2);
    return lists.stream();
}).collect(Collectors.toList());
employeeList2.forEach(System.out::println);
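
The temporary list is not strictly needed; a more compact variant of the same operation (my own refactoring, using java.util.stream.Stream) returns a stream directly:

List<Employee> employeeListConcise = list.stream()
    .flatMap(line -> {
        String[] w = line.split(",");
        // emit both employees without building an intermediate list
        return Stream.of(
            new Employee(w[0], Integer.parseInt(w[1]), w[2], w[3]),
            new Employee(w[0] + "_2", Integer.parseInt(w[1]), w[2], w[3]));
    })
    .collect(Collectors.toList());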

Spark flatMap

Dataset<Employee> employeeDatasetFlatmap = reader.flatMap(new FlatMapFunction<Row, Employee>() {
    @Override
    public Iterator<Employee> call(Row row) throws Exception {
        List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
        List<Employee> employeeList = new ArrayList<>();
        Employee e1 = new Employee(list.get(0), Integer.parseInt(list.get(1)),
            list.get(2), list.get(3));
        employeeList.add(e1);
        Employee e2 = new Employee(list.get(0) + "_2", Integer.parseInt(list.get(1)),
            list.get(2), list.get(3));
        employeeList.add(e2);
        // unlike Stream.flatMap, Spark's flatMap returns an Iterator
        return employeeList.iterator();
    }
}, Encoders.bean(Employee.class));
employeeDatasetFlatmap.show();
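
This is also a good place for the exception-handling tip promised above: the call methods declare throws Exception, and in a distributed job an uncaught exception (for example Integer.parseInt on a malformed line) fails the task and, after retries, the whole job. Because flatMap may emit zero records, it can simply skip bad input. A sketch (safeDataset is my own name; the cast selects the FlatMapFunction overload):

Dataset<Employee> safeDataset = reader.flatMap((FlatMapFunction<Row, Employee>) row -> {
    List<Employee> out = new ArrayList<>();
    try {
        String[] w = row.mkString().split(",");
        out.add(new Employee(w[0], Integer.parseInt(w[1]), w[2], w[3]));
    } catch (Exception ex) {
        // skip the malformed line so one bad record doesn't kill the job
    }
    return out.iterator();
}, Encoders.bean(Employee.class));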

GroupBy Operations

Java Stream groupBy

Map<String, Long> countByDept = employeeList.stream()
    .collect(Collectors.groupingBy(Employee::getDepartment, Collectors.counting()));
System.out.println(countByDept);
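
For parity with the Spark avg example in the next section, the stream version of a per-department average can use Collectors.averagingInt (a sketch):

// average age per department, mirroring group.avg("age") below
Map<String, Double> avgAgeByDept = employeeList.stream()
    .collect(Collectors.groupingBy(Employee::getDepartment,
        Collectors.averagingInt(Employee::getAge)));
System.out.println(avgAgeByDept);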

Spark groupBy

RelationalGroupedDataset group = employeeDataset.groupBy("department");
group.count().show();
group.avg("age").withColumnRenamed("avg(age)", "avgAge").show();
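
Several aggregates can also be computed in a single pass with agg, using alias instead of withColumnRenamed (a sketch, assuming a static import of org.apache.spark.sql.functions.*):

// one pass per department: row count plus average age
group.agg(count("department").alias("cnt"),
        avg("age").alias("avgAge"))
    .show();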

Reduce Operations

Java Stream reduce

int totalAge = employeeList.stream().mapToInt(Employee::getAge).sum();
System.out.println(totalAge);
// reduce with an empty Employee as the identity: its age is null, so the
// first combination keeps the current element's age as-is; note that
// mutating cur only works safely on a sequential stream
Employee reduced = employeeList.stream().reduce(new Employee(),
    (pre, cur) -> {
        if (pre.getAge() != null) {
            cur.setAge(pre.getAge() + cur.getAge());
        }
        return cur;
    });
System.out.println(reduced);

Spark reduce

Employee datasetReduce = employeeDataset.reduce(new ReduceFunction<Employee>() {
    @Override
    public Employee call(Employee t1, Employee t2) throws Exception {
        // accumulate ages pairwise; reduce is an action, so this runs the job
        // and returns a single Employee carrying the total age
        t2.setAge(t1.getAge() + t2.getAge());
        return t2;
    }
});
System.out.println(datasetReduce);
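
Summing one column is arguably cleaner as an aggregate, which avoids mutating Employee instances inside the reduce function (a sketch, assuming a static import of org.apache.spark.sql.functions.sum):

// total age without a custom ReduceFunction
employeeDataset.agg(sum("age")).show();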

Other Common Operations

// filter keeps rows with age > 30, limit keeps 3 of them, sort orders those 3 by age
Employee e = employeeDataset.filter("age > 30").limit(3).sort("age").first();
System.out.println(e);
// createOrReplaceTempView replaces the long-deprecated registerTempTable;
// the view is named "employee" to avoid the SQL keyword "table"
employeeDataset.createOrReplaceTempView("employee");
session.sql("select * from employee where age > 30 order by age desc limit 3").show();
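
The same query can also stay in the typed API: the lambda needs a FilterFunction cast to pick the typed overload, and col("age").desc() expresses the descending sort (a sketch, assuming a static import of org.apache.spark.sql.functions.col; top3 is my own name):

Dataset<Employee> top3 = employeeDataset
    .filter((FilterFunction<Employee>) emp -> emp.getAge() > 30)
    .sort(col("age").desc())
    .limit(3);
top3.show();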

By leveraging the similarity between the Java Stream and Spark APIs, beginners can start big-data development quickly without setting up a full cluster: adding the Spark SQL Maven dependency (the org.apache.spark spark-sql artifact) to a project is enough to run all of these examples locally with the local[*] master.

Tags: Big Data, map, flatMap, reduce, Spark, groupBy, Java Stream
Written by

Code Ape Tech Column

Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
