In this tutorial, we’ll discuss computer clusters, their types, use cases, and applications. Initially, systems were designed to run on a single, high-priced computer. The cost of such a computer was so high that only governments and big corporations could afford it. Even so, as soon as we created computer networks, people started to connect multiple computer systems. Their motivation was to overcome issues that still haunt us all: the need for faster results and better resilience. The fact is, our computing needs grow at least as fast as the available processing capacity, and our reliance on computer systems has never been greater. That’s why computer clusters are so important.
2. What Is a Computer Cluster?
In simple terms, a computer cluster is a set of computers (nodes) that work together as a single system. We can use clusters to enhance the processing power or increase resilience. In order to work correctly, a cluster needs management nodes that will:
- coordinate the load sharing
- detect node failure and schedule its replacement
Usually, this implies the need for high compatibility between the nodes, in both hardware and software. The nodes keep pinging each other’s services to check whether they are up, a technique called heartbeat. Besides that, they rely heavily on the data network connecting them. For that reason, in most cases, we’ll use redundant network paths between the nodes. That way, the cluster can distinguish a node failure from a network outage.
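As a rough illustration of the heartbeat idea, the following Python sketch records when each node was last heard from on each of two redundant network paths. The class name, the path names, and the three status labels are hypothetical and chosen only for this example; a real cluster manager would use its own protocol and thresholds.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper: last heartbeat time per (node, network path)."""
    timeout: float                 # seconds of silence tolerated on a path
    paths: tuple[str, ...]         # names of the redundant network paths
    last_seen: dict[tuple[str, str], float] = field(default_factory=dict)

    def record(self, node: str, path: str, now: float) -> None:
        # Called whenever a heartbeat from `node` arrives over `path`.
        self.last_seen[(node, path)] = now

    def status(self, node: str, now: float) -> str:
        # A path counts as alive if a heartbeat arrived within the timeout.
        alive = [
            p for p in self.paths
            if now - self.last_seen.get((node, p), float("-inf")) <= self.timeout
        ]
        if len(alive) == len(self.paths):
            return "up"
        if alive:
            # The node still answers on some path, so the problem is the network.
            return "network-path-down"
        # Silent on every path: treat it as a node failure.
        return "node-down"


monitor = HeartbeatMonitor(timeout=3.0, paths=("net-a", "net-b"))
monitor.record("node-1", "net-a", now=4.0)
monitor.record("node-1", "net-b", now=4.0)
monitor.record("node-2", "net-a", now=4.0)
monitor.record("node-2", "net-b", now=0.0)   # stale on one path only
for name in ("node-1", "node-2", "node-3"):
    print(name, monitor.status(name, now=5.0))
```

With a single network path, the last two cases would be indistinguishable, which is why the redundant path matters.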
2.1. Computing Cluster Reference Architecture
- Computing Nodes: servers that process the user load; they range from simple desktop-class computers to massive high-end servers
- Managing Nodes: servers that monitor the cluster hardware and software and reconfigure it in response to events. The managing node software can run on the computing nodes themselves to reduce the resources needed
- Private Network(s): where the communication between nodes takes place. It carries the ‘are you there?’ messages that the nodes use to verify which servers are up, as well as the command messages needed to reconfigure and synchronize the cluster
- Shared Redundant Storage: where the data is available to all computing nodes. For a computing node to take over from a failed one, it needs access to the common data
- Public-Access Layer: virtualizes access to the cluster so that it looks like a single system. It can operate by creating virtual IPs that host the service entry points, and it distributes the incoming requests to the currently active hosts
3. Types of Clustering
There are several types of computer clusters, each favoring specific non-functional requirements. In fact, these types mostly help us understand the multiple ways we can configure computer clusters. In practice, the various kinds can coexist. For instance, a load-balancing computing cluster can have a fail-over configuration for its management cluster.
3.1. Fail-over or High Availability Cluster
In the fail-over configuration, services run on one computing node while another waits to take over during outages. It is mainly used to add failure resiliency. If any service on the main node fails, the manager node moves the virtual IP to the backup node. Also, the failed node loses access to the shared data, which passes to its standby. That helps avoid the risk of conflicting writes to the same files. So, when the backup node takes over, it takes the necessary steps to re-establish the services, for instance, checking the data integrity, reapplying uncommitted journal entries, and so on. The transactions that were in progress during the outage will fail, and we’ll see some downtime while the cluster is reconfigured. Therefore, the design of such a cluster must assess the maximum acceptable downtime. This configuration is appropriate when the software does not support running concurrent service instances consistently. The main benefit of fail-over clusters is that they don’t require modifications to existing software. For Linux, the best-known open-source implementation is Linux-HA. However, a handful of commercial implementations from Legato, Veritas, Oracle, IBM, and others also exist.
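The fail-over sequence described above can be sketched in a few lines of Python. This is only a model of the steps, not how Linux-HA or any commercial product implements them: the class, the node names, the example address, and the ordering of the first two steps are assumptions made for illustration, and each step is merely logged rather than executed.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverPair:
    """Hypothetical active/standby pair sharing one virtual IP."""
    active: str
    standby: str
    virtual_ip: str
    log: list[str] = field(default_factory=list)

    def fail_over(self) -> None:
        failed, takeover = self.active, self.standby
        # 1. Cut the failed node off from shared storage to avoid conflicting writes.
        self.log.append(f"fence {failed}: revoke shared storage access")
        # 2. Point the service entry point (virtual IP) at the backup node.
        self.log.append(f"move {self.virtual_ip} to {takeover}")
        # 3. Recover the data before serving requests again.
        self.log.append(f"check data integrity and replay journal on {takeover}")
        self.log.append(f"start services on {takeover}")
        # The roles are now swapped; the failed node can rejoin as standby once repaired.
        self.active, self.standby = takeover, failed


pair = FailoverPair(active="node-a", standby="node-b", virtual_ip="192.0.2.10")
pair.fail_over()
for step in pair.log:
    print(step)
```

The time spent between the first and last logged step is the downtime that the cluster design has to keep within the acceptable limit.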
3.2. Load Balancing
In the load-balancing cluster, the load is distributed among the available computing nodes. The techniques to distribute the load vary; round-robin distribution of user requests or connections is the most common. The more independent the transactions are of each other, the better. That means we want the computing nodes to be as independent as possible from each other. A transaction on one node should not need to wait for another transaction on another host. That is the concept of parallelism; we have a good tutorial on parallel processing that shows how it works. Load balancing is more cost-effective than fail-over. As the computing nodes share the load, the overall transaction throughput improves. In node failure events, the load balancer redistributes the requests to the remaining online nodes. This may create some service-level degradation. So, the design of a load-balancing cluster must consider the maximum acceptable performance loss during an outage. The best part is that we can scale up the overall performance by adding processing nodes. On the other hand, concurrent data sharing requires very specific tuning measures to prevent inconsistency.
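To make the round-robin idea concrete, here is a minimal Python sketch of a balancer that cycles through the nodes and skips the ones marked offline. The class and node names are hypothetical; production load balancers add health checks, weights, and connection tracking that are left out here.

```python
class RoundRobinBalancer:
    """Hypothetical round-robin balancer that skips offline nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.online = set(self.nodes)
        self._next = 0  # index of the next node to try

    def mark_down(self, node):
        self.online.discard(node)

    def mark_up(self, node):
        if node in self.nodes:
            self.online.add(node)

    def pick(self):
        if not self.online:
            raise RuntimeError("no online nodes")
        # Try each node at most once, starting where the previous pick stopped.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node in self.online:
                return node
        raise RuntimeError("no online nodes")


balancer = RoundRobinBalancer(["n1", "n2", "n3"])
print([balancer.pick() for _ in range(4)])  # ['n1', 'n2', 'n3', 'n1']
balancer.mark_down("n2")
print([balancer.pick() for _ in range(4)])  # ['n3', 'n1', 'n3', 'n1']
```

After `n2` goes down, the two remaining nodes each receive half of the requests instead of a third, which is the service-level degradation the design has to budget for.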
3.3. High-Performance Computing – HPC
For truly compute-intensive workloads, there are High-Performance Computing clusters. In this cluster type, we want all the available computing resources running. Their main aspects are:
- CPU-bound processing, i.e., CPU-intensive computations
- Massive data transfers
- Low-latency communication between nodes to simulate shared memory among nodes
- Data flow can run sequentially through multiple computing nodes
These characteristics are common to a multitude of technical and scientific applications. Their uses include weather forecasting, fluid dynamics, drug design, and rendering farms, to name a few. In the past, these workloads required dedicated proprietary supercomputers. Nowadays, supercomputers can be built using hundreds (or thousands) of computing nodes equipped with commodity CPUs. Some have millions of CPU cores distributed across thousands of servers. The TOP500 supercomputers list ranks the most powerful non-distributed systems in the world. The first design concerns in this type of cluster are raw processing power and low network latency. Many systems add GPU processing power to their CPU cores. Also, the design considers how to schedule the load among nodes for higher efficiency.
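As one simple illustration of scheduling load among nodes, the Python sketch below uses the classic longest-processing-time-first greedy heuristic: jobs are sorted by estimated cost and each one goes to the node with the least work so far. This is a generic textbook heuristic picked for the example, not the algorithm of any particular HPC scheduler, and the job names and costs are made up.

```python
import heapq


def schedule_lpt(job_costs: dict[str, float], nodes: list[str]) -> dict[str, list[str]]:
    """Assign each job to the currently least-loaded node, largest jobs first."""
    if not nodes:
        raise ValueError("at least one node is required")
    # Min-heap of (accumulated load, node name); the least-loaded node is on top.
    heap = [(0.0, node) for node in nodes]
    heapq.heapify(heap)
    plan: dict[str, list[str]] = {node: [] for node in nodes}
    for job, cost in sorted(job_costs.items(), key=lambda kv: kv[1], reverse=True):
        load, node = heapq.heappop(heap)
        plan[node].append(job)
        heapq.heappush(heap, (load + cost, node))
    return plan


jobs = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}  # hypothetical cost estimates
print(schedule_lpt(jobs, ["n1", "n2"]))
# {'n1': ['a', 'd', 'e'], 'n2': ['b', 'c']}
```

The result here (loads of 17 and 13) is not the optimal split, which shows that the heuristic trades accuracy for speed; real schedulers also account for data locality, network topology, and job priorities.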
4. Other Computing Arrangements
The previous computing clusters assume computing needs that are not geographically distributed. Their services run in a single data center or in data centers very near each other. For highly distributed, cloud, or serverless computing, we have architectures that remove the need for tightly coupled cluster nodes. That makes sense for use cases in which the transactions are quite independent of each other. In this case, we can distribute the processing across a mixed environment, using resources with different hardware and software configurations.
4.1. Grid Computing
A grid computing system is a loosely connected set of heterogeneous devices contributing to the same goal. In this configuration, computer nodes are sparsely distributed. Indeed, they do not share a network or direct disk connections. Grid computing has two main sides:
- The Provider-side: a management center controlling the grid. It allows client-side resources to join the grid, sends them data, and evaluates their responses
- The Client-side: each computing resource that subscribes to the grid. It gets and processes the data sent by the provider and reports back its results
This is used in volunteer-computing initiatives. A famous example is Folding@home, a project aimed at drug development.
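The provider and client roles can be sketched as follows in Python. The text does not say how a provider evaluates the responses it receives, so this example assumes one common approach: the same work unit is handed out more than once and a result is accepted only when enough reports agree. All names, the quorum value, and the toy computation are hypothetical.

```python
from collections import Counter


class GridProvider:
    """Hypothetical provider: hands out work units and validates results by quorum."""

    def __init__(self, work_units, quorum=2):
        self.pending = dict(work_units)   # unit id -> input data
        self.quorum = quorum              # matching reports needed to accept a result
        self.reports = {}                 # unit id -> results received so far
        self.accepted = {}                # unit id -> validated result

    def next_unit(self):
        # Hand out the first unit that still lacks an accepted result, if any.
        return next(iter(self.pending.items()), None)

    def report(self, unit_id, result):
        if unit_id not in self.pending:
            return  # late or duplicate report for an already accepted unit
        self.reports.setdefault(unit_id, []).append(result)
        value, votes = Counter(self.reports[unit_id]).most_common(1)[0]
        if votes >= self.quorum:
            self.accepted[unit_id] = value
            del self.pending[unit_id]


def client_compute(data):
    # Stand-in for the real computation a volunteer machine would run.
    return sum(x * x for x in data)


provider = GridProvider({"u1": [1, 2, 3], "u2": [4, 5]})
while (unit := provider.next_unit()) is not None:
    unit_id, data = unit
    provider.report(unit_id, client_compute(data))
print(provider.accepted)  # {'u1': 14, 'u2': 41}
```

Because the clients are untrusted and heterogeneous, repeating each unit costs extra computation but lets the provider discard a faulty or dishonest result.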
4.2. Cloud, Kubernetes, and Serverless Computing
What about the Cloud, Kubernetes, and Serverless? How do they fit? To get started, we can use some of the concepts derived from regular non-distributed clusters to achieve similar results. This means we have to virtualize the cluster components. There are many network virtualization options for Kubernetes. For instance, a software-defined network behaves as a single network, even if it spreads across multiple data centers. Virtual load balancers then adapt the cluster to any new node configuration, ensuring that the requests go only to online nodes. The same goes for the data stores: software-defined storage distributes the data across dispersed data centers. It also provides redundancy according to any desired rule. We may decide the number of data replicas, and the computing nodes can fetch data from the closest replicas. As the managing nodes assess service demands, they create or destroy virtual nodes as needed. So the services scale up and down, placing nodes as near as possible to the users. The computing node life cycle can even ‘follow the sun’. However, to enjoy these perks, we must first design the system so it can work in a load-balancing cluster.
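The scale-up and scale-down decision can be reduced to a small calculation. The Python sketch below is a simplified model, not the rule used by Kubernetes or any cloud provider: it assumes that demand is measured in requests per second, that every virtual node has the same capacity, and that the operator sets lower and upper bounds. The function name and all the numbers are hypothetical.

```python
import math


def desired_nodes(requests_per_s: float, capacity_per_node: float,
                  min_nodes: int = 1, max_nodes: int = 10) -> int:
    """Number of virtual nodes needed for the current demand, within bounds."""
    if capacity_per_node <= 0:
        raise ValueError("capacity_per_node must be positive")
    # Round up so that the last partially used node is still provisioned.
    needed = math.ceil(requests_per_s / capacity_per_node)
    return max(min_nodes, min(max_nodes, needed))


print(desired_nodes(950, 200))    # 5
print(desired_nodes(0, 200))      # 1 (never below the minimum)
print(desired_nodes(5000, 200))   # 10 (capped at the maximum)
```

A managing node would run a calculation like this periodically and create or destroy virtual nodes until the running count matches the result.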
In this tutorial, we’ve studied computer clusters, their types, uses, and applications. As we saw, we can use a cluster when single servers can’t fulfill our performance or availability needs. Given that clustered systems can scale to enormous demands, designing cluster-friendly software is important to reach a broader customer base.