
From Java Streams to Spark: Basic Big Data Operations Explained

This article demonstrates how developers familiar with the Java Stream API can quickly grasp fundamental Spark operations, including map, flatMap, groupBy, and reduce, by translating stream examples into Spark code, with complete snippets, notes on transformations versus actions, and a practical tip for handling exceptions in distributed processing.


Preparation

The test data file (test.txt) contains one employee per line: name, age, department, and position.

张三,20,研发部,普通员工
李四,31,研发部,普通员工
李丽,36,财务部,普通员工
张伟,38,研发部,经理
杜航,25,人事部,普通员工
周歌,28,研发部,普通员工

Create an Employee class:

// Lombok generates the getters/setters and constructors used throughout;
// Encoders.bean additionally requires a public JavaBean with a no-arg constructor
@Getter
@Setter
@AllArgsConstructor
@NoArgsConstructor
@ToString
public static class Employee implements Serializable {
    private String name;
    private Integer age;
    private String department;
    private String level; // the position/title column from the file
}
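
Every Spark snippet below passes an encoder so Spark can convert Employee objects to and from its internal row format. It can also be created once and reused (a minimal sketch; the variable name is my own):

// Encoders.bean reflects over the bean's getters/setters, which is why
// Employee must be a public JavaBean with a no-arg constructor
Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);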

Map Operations

Java Stream map

// FileUtils is from commons-io; read every line, then map each CSV line to an Employee
List<String> list = FileUtils.readLines(new File("f:/test.txt"), "utf-8");
List<Employee> employeeList = list.stream().map(line -> {
    List<String> words = Arrays.stream(line.split(",")).collect(Collectors.toList());
    return new Employee(words.get(0), Integer.parseInt(words.get(1)),
        words.get(2), words.get(3));
}).collect(Collectors.toList());
employeeList.forEach(System.out::println);

Spark map

SparkSession session = SparkSession.builder().master("local[*]").getOrCreate();
// read the text file as a Dataset<Row> with one line per row
Dataset<Row> reader = session.read().text("F:/test.txt");
Dataset<Employee> employeeDataset = reader.map(new MapFunction<Row, Employee>() {
    @Override
    public Employee call(Row row) throws Exception {
        List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
        return new Employee(list.get(0), Integer.parseInt(list.get(1)),
            list.get(2), list.get(3));
    }
}, Encoders.bean(Employee.class));
employeeDataset.show();
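
A note on transformations versus actions: map, flatMap, and filter are lazy transformations that only build an execution plan, while show(), collect(), and reduce() are actions that actually trigger the job. A minimal sketch (the adults variable is my own):

// transformation: nothing executes yet, Spark only records the plan
Dataset<Employee> adults = employeeDataset.filter("age > 30");
// action: triggers reading the file and running map + filter
adults.show();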

FlatMap Operations

Java Stream flatMap

List<Employee> employeeList2 = list.stream().flatMap(line -> {
    List<String> words = Arrays.stream(line.split(",")).collect(Collectors.toList());
    List<Employee> lists = new ArrayList<>();
    // emit two employees per input line to demonstrate the 1-to-N mapping
    Employee e1 = new Employee(words.get(0), Integer.parseInt(words.get(1)),
        words.get(2), words.get(3));
    lists.add(e1);
    Employee e2 = new Employee(words.get(0) + "_2", Integer.parseInt(words.get(1)),
        words.get(2), words.get(3));
    lists.add(e2);
    return lists.stream();
}).collect(Collectors.toList());
employeeList2.forEach(System.out::println);
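
The temporary list is not strictly needed; a more compact variant of the same operation (my own refactoring, using java.util.stream.Stream) returns a stream directly:

List<Employee> employeeListConcise = list.stream()
    .flatMap(line -> {
        String[] w = line.split(",");
        // emit both employees without building an intermediate list
        return Stream.of(
            new Employee(w[0], Integer.parseInt(w[1]), w[2], w[3]),
            new Employee(w[0] + "_2", Integer.parseInt(w[1]), w[2], w[3]));
    })
    .collect(Collectors.toList());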

Spark flatMap

Dataset<Employee> employeeDatasetFlatmap = reader.flatMap(new FlatMapFunction<Row, Employee>() {
    @Override
    public Iterator<Employee> call(Row row) throws Exception {
        List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
        List<Employee> employeeList = new ArrayList<>();
        Employee e1 = new Employee(list.get(0), Integer.parseInt(list.get(1)),
            list.get(2), list.get(3));
        employeeList.add(e1);
        Employee e2 = new Employee(list.get(0) + "_2", Integer.parseInt(list.get(1)),
            list.get(2), list.get(3));
        employeeList.add(e2);
        // unlike Stream.flatMap, Spark's flatMap returns an Iterator
        return employeeList.iterator();
    }
}, Encoders.bean(Employee.class));
employeeDatasetFlatmap.show();
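
This is also a good place for the exception-handling tip promised above: the call methods declare throws Exception, and in a distributed job an uncaught exception (for example Integer.parseInt on a malformed line) fails the task and, after retries, the whole job. Because flatMap may emit zero records, it can simply skip bad input. A sketch (safeDataset is my own name; the cast selects the FlatMapFunction overload):

Dataset<Employee> safeDataset = reader.flatMap((FlatMapFunction<Row, Employee>) row -> {
    List<Employee> out = new ArrayList<>();
    try {
        String[] w = row.mkString().split(",");
        out.add(new Employee(w[0], Integer.parseInt(w[1]), w[2], w[3]));
    } catch (Exception ex) {
        // skip the malformed line so one bad record doesn't kill the job
    }
    return out.iterator();
}, Encoders.bean(Employee.class));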

GroupBy Operations

Java Stream groupBy

Map<String, Long> countByDept = employeeList.stream()
    .collect(Collectors.groupingBy(Employee::getDepartment, Collectors.counting()));
System.out.println(countByDept);
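
For parity with the Spark avg example in the next section, the stream version of a per-department average can use Collectors.averagingInt (a sketch):

// average age per department, mirroring group.avg("age") below
Map<String, Double> avgAgeByDept = employeeList.stream()
    .collect(Collectors.groupingBy(Employee::getDepartment,
        Collectors.averagingInt(Employee::getAge)));
System.out.println(avgAgeByDept);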

Spark groupBy

RelationalGroupedDataset group = employeeDataset.groupBy("department");
group.count().show();
group.avg("age").withColumnRenamed("avg(age)", "avgAge").show();
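
Several aggregates can also be computed in a single pass with agg, using alias instead of withColumnRenamed (a sketch, assuming a static import of org.apache.spark.sql.functions.*):

// one pass per department: row count plus average age
group.agg(count("department").alias("cnt"),
        avg("age").alias("avgAge"))
    .show();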

Reduce Operations

Java Stream reduce

int totalAge = employeeList.stream().mapToInt(Employee::getAge).sum();
System.out.println(totalAge);
// reduce with an empty Employee as the identity: its age is null, so the
// first combination keeps the current element's age as-is; note that
// mutating cur only works safely on a sequential stream
Employee reduced = employeeList.stream().reduce(new Employee(),
    (pre, cur) -> {
        if (pre.getAge() != null) {
            cur.setAge(pre.getAge() + cur.getAge());
        }
        return cur;
    });
System.out.println(reduced);

Spark reduce

Employee datasetReduce = employeeDataset.reduce(new ReduceFunction<Employee>() {
    @Override
    public Employee call(Employee t1, Employee t2) throws Exception {
        // accumulate ages pairwise; reduce is an action, so this runs the job
        // and returns a single Employee carrying the total age
        t2.setAge(t1.getAge() + t2.getAge());
        return t2;
    }
});
System.out.println(datasetReduce);
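
Summing one column is arguably cleaner as an aggregate, which avoids mutating Employee instances inside the reduce function (a sketch, assuming a static import of org.apache.spark.sql.functions.sum):

// total age without a custom ReduceFunction
employeeDataset.agg(sum("age")).show();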

Other Common Operations

// filter keeps rows with age > 30, limit keeps 3 of them, sort orders those 3 by age
Employee e = employeeDataset.filter("age > 30").limit(3).sort("age").first();
System.out.println(e);
// createOrReplaceTempView replaces the long-deprecated registerTempTable;
// the view is named "employee" to avoid the SQL keyword "table"
employeeDataset.createOrReplaceTempView("employee");
session.sql("select * from employee where age > 30 order by age desc limit 3").show();
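
The same query can also stay in the typed API: the lambda needs a FilterFunction cast to pick the typed overload, and col("age").desc() expresses the descending sort (a sketch, assuming a static import of org.apache.spark.sql.functions.col; top3 is my own name):

Dataset<Employee> top3 = employeeDataset
    .filter((FilterFunction<Employee>) emp -> emp.getAge() > 30)
    .sort(col("age").desc())
    .limit(3);
top3.show();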

By leveraging the similarity between the Java Stream and Spark APIs, beginners can start big-data development quickly without setting up a full cluster: adding the Spark SQL Maven dependency (the org.apache.spark spark-sql artifact) to a project is enough to run all of these examples locally with the local[*] master.

Tags: Big Data, map, flatMap, reduce, Spark, groupBy, Java Stream
Written by

Code Ape Tech Column

Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
