流Stream的Map-Reduce操作

写在前面

Stream 的 Map-Reduce 操作是Java 函数式编程的精华所在，同时也是最为复杂的部分。但一旦你啃下了这块硬骨头，那你就真正熟悉Java的函数式编程了。

如果你有大数据的编程经验，你会对术语 Map-Reduce 十分熟悉亲切。如果你不熟悉大数据编程，也无所谓，通过本文的学习，相信你会对 Map-Reduce 会有一定的理解。下面我们将开始一次有趣的历程。

如有疑问，欢迎加群讨论。

本文的示例代码可从gitee上获取：https://gitee.com/cnmemset/javafp

Stream的map操作

map操作又称为映射操作，是处理Stream的重要操作。它的作用是将当前Stream中的每个元素都映射转换为另一个元素，从而得到一个新的Stream。转换前后的元素类型也可以不同。

下面介绍 Stream 中常用的 Map 方法。

map()

map的方法签名是：

1	<R> Stream<R> map(Function<? super T, ? extends R> mapper);

map方法是一个中间操作，作用是将当前Stream中的每个元素通过参数 mapper 转换为另一个元素，转换前的元素类型为T，转换后的元素类型为 R。

一个简单例子是字符串转换为字符串的长度：

public static void mapStream() {
    List<String> words = Arrays.asList("hello", "world", "I", "love", "you");
    words.stream()
            .map(String::length)
            .forEach(System.out::println);
}

上述代码输出每个单词的长度：

mapToInt()、mapToLong()和mapToDouble()

它们的方法签名分别是：

IntStream mapToInt(ToIntFunction<? super T> mapper);

LongStream mapToLong(ToLongFunction<? super T> mapper);

DoubleStream mapToDouble(ToDoubleFunction<? super T> mapper);

它们和map()方法大同小异，分别是针对基础类型 int 、long 和 double 的特殊处理，省去了装拆箱的消耗。

flatMap()

flatMap的方法签名是：

1 2	<R> Stream<R> flatMap( Function<? super T, ? extends Stream<? extends R>> mapper);

flatMap是一个中间操作，作用是将当前Stream的每个元素通过参数 mapper 转换成一个类型为 Stream 的元素，然后将这些 Stream 合并为一个新的 Stream。顾名思义，flat的含义就是将当前Stream中的元素“摊平”，从一个单独的元素，转换为多个元素组成的Stream。

文字表述总是苍白无力，我们先用一个实例来辅助说明：

public static void flatMapStream() {
    Stream<List<Integer>> stream = Stream.of(Arrays.asList(1,2), Arrays.asList(3, 4, 5));
    stream.flatMap(list -> list.stream())
            .forEach(System.out::println);
}

上述代码的输出为：

stream的元素类型是一个 List，总共有两个元素 —— [1, 2] 和 [3, 4, 5]。

在 flatMap 方法中，首先将2个 List 转换为2个 Stream，然后再将这2个Stream合并为一个新的Stream并返回。图解如下：

Stream的reduce操作

reduce操作（reduction operation），翻译为规约操作，是Stream中最复杂的操作。

规约操作，是通过重复执行指定的合并操作（combining operation），将Stream中的所有元素合并得到一个汇总结果的过程。例如，求和（sum）、求最大或最小值（max / min）、求平均数（average）、求元素总个数（count）、将所有元素汇总到一个列表（collect），这些都属于规约操作。

规约操作都属于终止操作（terminal operations）。

Stream类库有两个通用的规约操作 reduce() 和collect()。下面我们着重介绍相关的方法。

reduce()

reduce方法有3种重写形式：

Optional<T> reduce(BinaryOperator<T> accumulator);

T reduce(T identity, BinaryOperator<T> accumulator);

<U> U reduce(U identity,
             BiFunction<U, ? super T, U> accumulator,
             BinaryOperator<U> combiner);

虽然参数和返回值不同，但它们的语义是相似的。下面逐一介绍。

reduce(BinaryOperator)

先看第一个reduce方法：

1	Optional<T> reduce(BinaryOperator<T> accumulator);

其中 T 是 Stream 的泛型类型。

参数 accumulator 是指定的合并操作（combining operation）。

在串行执行时，整个方法等价于下面的伪代码：

boolean foundAny = false;
T result = null;
for (T element : this stream) {
    if (!foundAny) {
        foundAny = true;
        result = element;
    }
    else
        result = accumulator.apply(result, element);
}
return foundAny ? Optional.of(result) : Optional.empty();

要注意的是，参数 accumulator 定义的函数必须满足结合律（associative），否则在一些顺序不确定的或并行的场景中会导致不正确的结果。譬如数据源是一个HashSet的话，其中的元素顺序是不确定的。

结合律（associative）就是我们在小学时候学的结合律（加法结合律，乘法结合律）。对于一个函数或操作 op ，给定三个操作数 a、b、c，当 op 满足结合律时，即：
(a op b) op c == a op (b op c)

以上述的 accumulator 为例，accumulator 满足结合律，即：

accumulator(accumulator(a, b), c) == accumulator(a, accumulator(b, c))

示例代码：

public static void reduceStream() {
    Stream<Integer> stream = Stream.of(1, 3, 5, 7, 9);
    Integer sum = stream.reduce((x, y) -> x + y).get();
    System.out.println(sum);
}

上述代码输出为：

reduce(T, BinaryOperator)

第二个reduce的方法签名是：

1	T reduce(T identity, BinaryOperator<T> accumulator);

其中 T 是 Stream 的泛型类型。

与第一个reduce方法比较，多了一个参数 identity 。

参数 identity 是reduce操作的初始值。

参数accumulator 要求满足结合律（associative）。

在串行的场景中，整个方法等价于下面的伪代码：

T result = identity;
for (T element : this stream)
    result = accumulator.apply(result, element)
return result;

和第一个reduce方法一样，参数 accumulator 定义的函数必须满足结合律（associative），否则在一些顺序不确定的或并行的场景中会导致不正确的结果。

此外，如果涉及到并行操作（parallel operations），对参数 identity 还有一个要求：

对任意值 t，要满足 accumulator(identity, t) == t 。否则，会导致错误的结果。

还是求和的场景，示例代码如下：

public static void reduceStream2() {
    List<Integer> list = Arrays.asList(1, 3, 5, 7, 9);

    // 串行执行，对参数 identity 并没有实际的约束。
    Integer sum = list.stream().reduce(0, (x, y) -> x + y);
    System.out.println(sum);  // sum = 0+1+3+5+7+9 = 25

    // 串行执行，对参数 identity 并没有实际的约束。
    sum = list.stream().reduce(5, (x, y) -> x + y);
    System.out.println(sum);  // sum = 5+1+3+5+7+9 = 30

    // 并行执行。这是正确的范例：因为数字 0 是累加操作的 identity 。
    sum = list.parallelStream().reduce(0, (x, y) -> x + y);
    System.out.println(sum);  // sum = 0+1+3+5+7+9 = 25

    // 并行执行。这是错误的范例：因为数字 5 并不是累加操作的 identity 。
    sum = list.parallelStream().reduce(5, (x, y) -> x + y);
    System.out.println(sum);  // 会输出一个大于 30 的数字
}

上述代码输出类似：

可以看到，在最后一个范例中，得出了一个错误的结果（正确结果应该是30）。

reduce(U, BiFunction, BinaryOperator)

第三个reduce方法的签名是：

1
2
3

<U> U reduce(U identity,
             BiFunction<U, ? super T, U> accumulator,
             BinaryOperator<U> combiner);

其中 U 是返回值的类型，T 是 Stream 的泛型类型。

参数 identity 是规约操作的初始值。

参数accumulator 是与Stream中单个元素的合并操作，等同于函数 U apply(U u, T t)。

参数 combiner 是将并行执行得到的多个中间结果进行合并的操作，等同于函数 U apply(U u1, U u2)。

图解如下：

在串行的场景中，整个方法等价于下面的伪代码：

U result = identity;
for (T element : this stream)
    result = accumulator.apply(result, element)
return result;

从伪代码中可以看到，串行时不涉及到参数 combiner ，串行时甚至可以将其设置为任一个非null值即可，不影响执行。

但在并行编程中，对3个参数都有一些特殊要求：

参数 combiner 必须满足结合律
参数 identity，对于任意值 u，必须满足 combiner(identity, u) == u
参数 accumulator 和 combiner 两者必须兼容，即对于任意值 u 和 t，必须满足：

combiner(u, accumulator(identity, t)) == accumulator(u, t)

假设一个场景，我们要求一篇文章中字母的总长度，示例代码：

public static void reduceStream3() {
    List<String> article = Arrays.asList("hello", "world", "I", "love", "you");
    Integer letterCount = article.stream().reduce(
            0, // identity 初始值　
            (count, str) -> count + str.length(), // accumulator 累加器，也起到了 map 的作用
            (a, b) -> a + b // combiner 拼接器，并行执行时才会用到
    );
 
    // 输出 18
    // 5(hello) + 5(world) + 1(I) + 4(love) + 3(you) = 18
    System.out.println(letterCount);
}

在上述示例中，

combiner 是求和函数，满足结合律；
identity 是0，也满足 0 + u == u;
对于任意的整数 count 和字符串 str，也满足 count + (0 + str.length()) == count + str.length()

因此，上述的示例是可以通过并行的方式执行的：

public static void reduceStream4() {
    List<String> article = Arrays.asList("hello", "world", "I", "love", "you");
 
    // parallelStream()：以并行的方式执行
    Integer letterCount = article.parallelStream().reduce(
            0, // identity 初始值　
            (count, str) -> count + str.length(), // accumulator 累加器，也起到了 map 的作用
            (a, b) -> a + b // combiner 拼接器，并行执行时才会用到
    );
 
    // 输出 18
    // 5(hello) + 5(world) + 1(I) + 4(love) + 3(you) = 18
    System.out.println(letterCount);
}

对于第三个reduce方法，参数 accumulator 同时也是一个mapper（映射器），在进行合并操作的同时，也做了map操作。因此，我们是可以通过 “map方法 + 第二个reduce方法”来实现第三个reduce方法的。但在某些场景中，将mapper和accumulator 混合起来，可以避免一些不必要的计算操作，使得程序更有效率。

用“map方法 + 第二个reduce方法”实现同样的功能，示例代码：

public static void reduceStream5() {
    List<String> article = Arrays.asList("hello", "world", "I", "love", "you");
 
    // parallelStream()：以并行的方式执行
    // 分开的 map + accumulator
    Integer letterCount = article.parallelStream()
            .map(s->s.length())
            .reduce(0, (a, b) -> a + b);
 
    // 输出 18
    // 5(hello) + 5(world) + 1(I) + 4(love) + 3(you) = 18
    System.out.println(letterCount);
}

collect()

collect方法，顾名思义，它的作用是将Stream中的元素“收集”起来。它是Stream类库中最灵活、最通用的方法之一。一个常见的应用场景就是通过collect方法将Stream中的汇总到一个List中。

先给一个简单的例子直观感受一下：

public static void collectToList() {
    Stream<String> stream = Stream.of("hello", "world", "I", "love", "you");
    List<String> list = stream.collect(Collectors.toList());
    System.out.println(list);
}

上述代码是collect方法最简单的应用：将一个Stream转换为一个List。

collect方法有2种重写形式：

<R> R collect(Supplier<R> supplier,
              BiConsumer<R, ? super T> accumulator,
              BiConsumer<R, R> combiner);
 
<R, A> R collect(Collector<? super T, A, R> collector);

这2种重写形式的语义是一致的，虽然细节上有差异，但仍然可以认为第二个collect方法的参数 collector 就是对第一个collect方法中三个参数supplier、accumulator和combiner的封装。

collect(Supplier, BiConsumer, BiConsumer)

第一个collect方法的签名是：

1
2
3

<R> R collect(Supplier<R> supplier,
              BiConsumer<R, ? super T> accumulator,
              BiConsumer<R, R> combiner);

其中 R 是返回值的类型，通常是一个容器类（例如 Collection 或 Map）。T 是Stream中的元素类型。

在解释3个参数的作用之前，我们先思考一个问题：如果要把Stream中的元素“收集”到一个容器中，需要哪些信息呢？很显然：

首先我们要知道 1) 是哪个容器（supplier）；

其次我们要知道 2) 如何将单个元素加入到该容器中（accumulator）；

最后我们要知道 3) 在并行执行的时候，如何将多个中间结果的容器合并为一个（combiner）。

对应参数的含义也自然而然出来了：

参数 supplier 是用来创建一个容器实例的函数。

参数 accumulator 是将Stream中的一个元素合并到容器中的函数。

参数 combiner 是将两个容器归并为一个容器的函数，只在并行执行的时候用到。

在串行执行的场景下，整个方法等价于以下的伪代码：

R result = supplier.get();
for (T element : this stream)
    accumulator.accept(result, element);
return result;

而在并行执行的场景下，我们有一些额外的要求：

combiner函数满足结合律
要求combiner 和 accumulator 是兼容的（compatible），即对于任意的r和t，满足 combiner(r, accumulator(supplier.get(), t)) == accumulator(r, t)

以一个简单的例子加以说明，假设我们要将Stream中的字符串“collect”到一个ArrayList中，示例代码如下：

public static void collectToList1() {
    Stream<String> stream = Stream.of("hello", "world", "I", "love", "you");
    List<String> list = stream.collect(
        ArrayList::new, // supplier 创建一个 ArrayList 实例
        ArrayList::add, // accumulator将一个 String 加入到 ArrayList 中
        ArrayList::addAll // combiner 将两个 ArrayList 合并成一个
    );
 
    System.out.println(list);
}

上述代码也是符合并行执行的要求的：ArrayList的addAll方法满足结合律；addAll方法是与add方法兼容的（compatible）。因此，在上述的collect过程中，我们允许以并行的方式来执行 —— 即使 ArrayList 不是线程安全的，我们也无需考虑这个问题，这是Stream并行编程的优势之一。

collect(Collector)

第二个collect方法的签名是：

1	<R, A> R collect(Collector<? super T, A, R> collector);

其中，T是Stream元素的类型；R是返回值的类型；A是一个中间结果的类型，最后需要将结果从A转换到R。

类Collector（收集器）可以看做是对前一个collect方法中的三个参数supplier、accumulator和combiner的封装，但Collector更加灵活和通用。

类Collector的原理和源码相对比较复杂，限于篇幅，本文就不做详细阐述，如果读者感兴趣，可以加群讨论。

Collector是如此的灵活，我们决定从一个现实场景出发，逐步向大家展示Collector的强大功能。

场景描述

假设一个场景：我们接到了一个公司的需求，需要对公司的信息进行一些分析，包括性别、部门、薪酬等维度。为简单起见，我们不考虑员工重名的情形。

首先，我们定义一个Employee 的类：

public class Employee {
    /** 姓名 */
    private String name;
    /** 性别：0 女 1 男 */
    private int gender;
    /** 部门 */
    private String department;
    /** 薪酬 */
    private int salary;

    // getter and setter
    ...
}

然后，我们假定员工的信息可以通过一个工具类获取：

public class Utils {
    public static List<Employee> makeEmployees() {
        Employee e1 = new Employee("张三", 1, "DEV", 17200);
        Employee e2 = new Employee("李四", 1, "OPS", 12000);
        Employee e3 = new Employee("韩梅梅", 0, "HR", 9000);
        Employee e4 = new Employee("李雷", 1, "DEV", 20000);
        Employee e5 = new Employee("杜芳芳", 0, "OPS", 16500);
        return new ArrayList<>(Arrays.asList(e1, e2, e3, e4, e5));
    }
}

需求1：要将所有员工的姓名转换为一个List

实现这个需求的代码很简单：

public static void collectEmployeeNamesToList() {
    List<Employee> employees = Utils.makeEmployees();
    List<String> names = employees.stream()
            .map(Employee::getName)
            .collect(Collectors.toList());
    // 如果要指定返回的List具体类型，譬如指定为 ArrayList
    ArrayList<String> arrayNames = employees.stream()
            .map(Employee::getName)
            .collect(Collectors.toCollection(ArrayList::new));
            
    System.out.println(arrayNames);
}

上述代码输出为：

1	[张三, 李四, 韩梅梅, 李雷, 杜芳芳]

Collectors工具类提供了一系列内置的Collector，包括:

a. Collectors.toList(): 转换为List

b. Collectors.toSet()：转换为Set

c. Collectors.toCollection(Supplier)：转换为指定的Collection类

一个有趣的问题：为什么没有toQueue()？先不给答案了，有兴趣的同学可以加群讨论。

需求2：将员工列表转换成<姓名，薪酬>组成的Map

Collector除了可以将Stream转换为Collection之外，还可以转换为Map。

示例代码如下：

public static void collectEmployeeNamesToMap() {
    List<Employee> employees = Utils.makeEmployees();
    Map<String, Integer> nameScoreMap = employees.stream()
            .collect(Collectors.toMap(Employee::getName, Employee::getSalary));
    System.out.println(nameScoreMap);
}

上述代码输出为：

1	{李四=12000, 张三=17200, 李雷=20000, 杜芳芳=16500, 韩梅梅=9000}

示例代码中，Employee::getName用来生成Map的key ，而Employee::getSalary则用来生成Map中key对应的value。

Collectors.toMap方法还有两个重写形式，主要用来处理key重复时的情形以及指定Map的具体类型。

需求3：将员工按男女分成两组

对于这个需求，使用方法 Collectors.partitioningBy 。示例代码：

public static void partitionEmployeesToMap() {
    List<Employee> employees = Utils.makeEmployees();
    Map<Boolean, List<Employee>> map = employees.stream()
            .collect(Collectors.partitioningBy(e -> e.getGender() == 1));
    System.out.println(map);
}

上述代码输出为：

1	{false=[Employee{name='韩梅梅}, Employee{name='杜芳芳}], true=[Employee{name='张三}, Employee{name='李四}, Employee{name='李雷}]}

Collectors.partitioningBy 可以用更通用的 Collectors.groupingBy 来实现。下面接着介绍 Collectors.groupingBy 。

需求4：将员工按照部门分组

使用简化版 Collectors.groupingBy(Function) 方法来实现，示例代码：

public static void groupEmployeesToMap() {
    List<Employee> employees = Utils.makeEmployees();
    Map<String, List<Employee>> map = employees.stream()
            .collect(Collectors.groupingBy(Employee::getDepartment));
    System.out.println(map);
}

上述代码输出为：

1	{OPS=[Employee{name='李四}, Employee{name='杜芳芳}], DEV=[Employee{name='张三}, Employee{name='李雷}], HR=[Employee{name='韩梅梅}]}

需求5：将员工按照部门分组后，计算每个部门的员工薪酬总数

使用通用版 Collectors.groupingBy(Function, Collector) 方法来实现，示例代码：

public static void groupEmployeesToMap1() {
    List<Employee> employees = Utils.makeEmployees();
 
    // 使用通用版的 groupingBy
    Map<String, Integer> map = employees.stream()
            .collect(
                    // 上游收集器
                    Collectors.groupingBy(
                            Employee::getDepartment,
                            // 下游收集器 downstream collector
                            Collectors.summingInt(Employee::getSalary)
                    )
            );
 
    System.out.println(map);
}

上述代码输出为：

1	{OPS=28500, DEV=37200, HR=9000}

通用版 Collectors.groupingBy 方法签名为：

public static <T, K, A, D>
Collector<T, ?, Map<K, D>> groupingBy(
           Function<? super T, ? extends K> classifier,
           Collector<? super T, A, D> downstream);

首先通过参数 classifier 定义的函数对Stream的元素分组，然后使用下游收集器（downstream collector），对分组后的元素进行再处理（甚至可以再次分组）。

阅读源码可以发现，需求4中的简化版 groupingBy ，实际上是通用版groupingBy 的简写：

groupingBy(classifier) == groupingBy(classifier, toList())

其中，toList() 是 groupingBy 的下游收集器。

自定义Collector

除了可以使用Collectors工具类已经封装好的收集器，我们还可以自定义收集器，收集任何形式你想要的信息。

但是，不夸张的说，Collectors工具类中内置的Collector，基本能满足我们所有的需求。在你决定要自定义一个Collector之前，请务必确认内置的Collector无法实现你的需求。

具体如何自定义Collector，限于篇幅，在本文不做详细描述，有兴趣的同学可以加群讨论。

结语

本文介绍了 Stream 的 Map-Reduce 操作。

如果你从头到尾认真阅读了本文，那么恭喜你，你的Java函数式编程已经正式入门了。