1. Introduction

One of the finer points of the functional programming community has been its propensity for recognizing abstract patterns in code that have well-defined, lawful representations in mathematics. These abstractions are frequently described in terms of algebraic structures and algebraic laws.

In this tutorial, we’ll define these terms in the context of functional programming. Next, we’ll proceed to build our intuition for algebraic structures — in particular, monoids and semigroups. However, it’s worthwhile to start with a concise definition in plain English for monoids first.

A monoid is a function that takes two arguments and follows two laws: associativity and identity.

2. Algebraic Structure

An algebraic structure refers to some operations and the set they operate over. In functional languages such as Scala, the sets are the data types, whereas the operations are the type classes. Monoids and semigroups are two examples of such algebraic structures.

The general notion here is to separate behavior from the data. The algebraic structure provides the context for describing the desired behavior through operations applied to the data.

Before analyzing the algebraic structures monoids and semigroups, it’s worthwhile to do a quick review of types and type classes.

2.1. Types

Types are collections of values. Int, Double, and String are examples of types. Types are static and resolve at compile time in Scala. A type system defines the associations between different parts of a program and checks that all the parts fit together in a logically consistent and provably correct way.

In functional programming, the fundamental building blocks of data types are products and coproducts. These are collectively referred to as Algebraic Data Types (ADTs) — a functional programming concept with a fancy name but a very simple meaning. They’re an idiomatic way of representing data using “ands” and “ors”. For example:

  • a shape is a rectangle or a circle
  • a rectangle has a width and a height
  • a circle has a radius

In ADT terminology, “and” types such as rectangle and circle are products, whereas the “or” types such as shape are coproducts. In Scala, we typically represent products using case classes and coproducts using sealed traits:

sealed trait Shape
final case class Rectangle(width: Double, height: Double) extends Shape
final case class Circle(radius: Double) extends Shape

val rect: Shape = Rectangle(3.0, 4.0)
val circ: Shape = Circle(1.0)

Next, we’ll do a quick review of type classes and their implementations in Scala.

2.2. Type Class

A type class is a type system construct that supports certain overloaded operations called methods. They are defined by specifying a set of functions or constant names, together with their respective types. For instance, monoids and semigroups are among such type classes. There are three important components to the type class:

  1. the type class itself
  2. the type class instances for particular types
  3. methods that use type classes

Let’s define our monoid and semigroup type classes:

trait Semigroup[A] {
  def op(x: A, y: => A): A
}
trait Monoid[A] extends Semigroup[A] {
  def zero: A
}

Thus far, we’ve managed to build our intuition for types, type class, and type class instance. Next, we’ll use this intuition to study algebraic data structures — specifically, monoids and semigroups.

3. Monoids

The algebraic structure semigroup is a set and an associated binary operator that combines two elements from the set. A monoid is a semigroup with the added identity element. To build our intuition about monoids, let’s encode these definitions in Scala.

We’ll start by Scala encoding of the binary operation first. A binary operation is a function with arity-2 that takes two arguments and returns the result of applying that function to the arguments:

def[A] |+| (left: A, right: A): A

Associativity of the monoid binary operation means the order of operations is irrelevant — in other words, we can drop the parentheses. In Scala terms, the following expressions are equivalent:

(x |+| y) |+|  z = x |+| (y |+| z)

Finally, the monoid algebraic structure has an identity element, the “do-nothing” element with respect to the binary operation:

e |+|  x = x |+|  e = x

And that’s all there is to monoids! The nice thing about them is that they are all over the place. The best way to understand them is to look at a few examples:

  • Integers form a monoid under summation and multiplication
  • List concentration

Basically, anything that we can smash together while observing the monoidal laws is qualified to be a monoid.

Now, armed with this new abstraction, we’ll do some word-counting in a MapReduce kind of way. Counting word frequencies in a collection of documents is the “Hello World” of tools like Hadoop, Spark, and Flink, and with good reason. It’s a not-too-contrived task whose underlying structure is a natural fit for distributed computation.

Let’s count some words by implementing a non-trivial monoid instance.

4. A Monoid for Implementing MapReduce

In this section, we develop a MapReduce function using monoids. We start by defining our function signatures — in other words, we codify the Input/Output of our functions in Scala. This approach will also reveal the types our monoid must support.

def wordCount(string: String): Map[String, Int] = ???
def mapMonoid[V:Numeric](map1: Map[K, V], map2[K, V]) : Map[K, V] = ???
def frequency(wordCounts: Map[String, Int]*)(implicit monoid: Monoid[Map[String, Int]]): Map[String, Int]=???

The first function, wordCount, is a word frequency counter. This function will be distributed among many compute nodes. The function is inherently thread-safe by the virtue of using immutable Maps

The second function, mapMonoid, is a binary operation, arity-2, that combines it’s arguments and returns the result. This is our monoid type class instance for the data type Map[K, V]. The function signature’s type-annotation, [V: Numeric], is a type constraint. It’s a mechanism to force our clients to use a Numerics as the value portion of the Map. This constraint is required so that the monoid instance can safely combine its arguments.

Finally, the variadic function frequency is a wrapper that exploits the monoid instance. This function is the reduce phase of MapReduce. It takes the computed word-counts from all nodes, in any order, and reduces them to the final answer. The function type signature requires an implicit parameter of the type Monoid[Map[String, Int]].

One key observation here is that the reduction and aggregation order is irrelevant to the final outcome. That guarantee comes from the fact that our monoid instance is associative. Such orderless and stateless computations are ideal targets for concurrent programming.

Next, we’ll start implementing each function.

4.1. Counting Words

Let’s implement the wordCount function based on our definition above:

object WordCount {
  case class WordCount(word: String, count: Int)
  def wordCount(s: String): Map[String, Int] =
    s.split("\\s+")
      .map(x => WordCount(x, 1))
      .groupBy(w => w.word)
      .map(x => (x._1 -> x._2.foldLeft(0)((a, c) => c.count + a)))
      .toMap

4.2. A Monoid Implementation

Next, we’ll create a monoid type class instance, mapMonoid, for the data type Map[K, V], with the constraint that the parameter V is of type Numeric:

object MapMonoidInstance {
  implicit def mapMonoid[K, V: Numeric]: Monoid[Map[K, V]] = 
    new Monoid[Map[K, V]] {
      override def zero: Map[K, V] = Map()
      override def op(l: Map[K, V], r: => Map[K, V]): Map[K, V] = 
        l.keySet.union(r.keySet)
          .map(k =>(k -> implicitly[Numeric[V]].plus(
                      l.getOrElse(k, implicitly[Numeric[V]].zero),
                      r.getOrElse(k, implicitly[Numeric[V]].zero)))).toMap
    }
}

4.3. Using Monoids

Finally, the reduce operation combines the elements into a single report using a monoid homomorphism. The function requires a monoid instance as an implicit parameter. It exploits the monoid functionality to fold through the structure, combining all elements to the reduced result of Map[String, Int]:

def frequency(wordCounts: Map[String, Int] *)(implicit monoid: Monoid[Map[String, Int]]): Map[String, Int] =
   wordCounts.foldLeft(monoid.zero)(monoid.op(_, _))

5. Conclusion

In this article, we looked at monoid and semigroup algebraic structures.

We first reviewed type class and type class instances. Then, we developed our own type class for both monoids and semigroups. Finally, we developed a thread-safe MapReduce program using monoids.

As always, the code can be found over on GitHub.

guest
0 Comments
Inline Feedbacks
View all comments