Introduction to Storage for Data Centers | Baeldung on Computer Science

1. Overview

In this tutorial, we’ll discuss the main types of data storage used in professional computing. In any computing system, we can identify a few basic resources: processing units (CPUs) handle the data processing, volatile memory (RAM) holds the data being processed, and persistent storage keeps data across power cycles (alongside other input/output devices). Over time, many storage technologies and approaches have been used. We’re going to name the most common ones, how they differ, and their best use cases.

2. Direct Attached Storage – DAS

The first mass-storage approach that comes to mind is direct-attached storage (DAS). In this approach, the disks, whatever technology is used, are connected to I/O boards installed on the servers. The more advanced boards have features like disk hot swapping, non-volatile caching, integrated mirroring, and error correction. The non-volatile caching minimizes the risk of data loss during power failures. For mirroring and error correction, these boards use a technique known as Redundant Array of Inexpensive Disks (RAID), which can be hardware-assisted or software-based. RAID arrangements are categorized into levels. The most common are:

0: multiple disks are combined into a single larger disk. To increase performance, the data is split and distributed among all disks using block-aligned chunks (striping). This improves both read and write performance, but provides no redundancy
1: two disks are combined as a single one holding the same data (mirroring). Each write operation goes to both disks simultaneously, and read operations can go to the less busy disk, improving read performance
1 + 0: a combination of the above arrangements, creating a larger disk with full redundancy and the performance gains of splitting data among multiple disks
5 or 6: three or more disks (four or more for RAID 6) form a structure with data redundancy using parity information. That way, if one (RAID 5) or two (RAID 6) of the disks fail, the data may be rebuilt from the redundancy information on the remaining disks. The write performance is not as good due to the need for parity calculation
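The rebuild described for RAID 5 can be illustrated with a minimal sketch. It assumes the redundancy information is a simple XOR parity block per stripe, which is the usual textbook description of RAID 5. Real controllers also rotate the parity across disks, and RAID 6 adds a second, differently computed redundancy block. The block contents and the `xor_blocks` helper are hypothetical, chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks belonging to one stripe (hypothetical contents).
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# Simulate losing the disk that held data[1]: XOR-ing the surviving
# data blocks with the parity block yields the missing block again.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

assert rebuilt == data[1]
```

Every write to a data block also requires updating the parity block, which is where the write penalty mentioned above comes from.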

RAID levels 2, 3, and 4 are rarely or never used. RAID arrangements make the most sense as a performance improvement with conventional magnetic hard drives. In those, the mechanical movement of the magnetic heads adds a lot of latency to random access operations. However, even modern solid-state drives (SSDs) can see improvements with RAID arrangements.
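To see why striping helps, it is useful to look at how a logical block number could map to a physical disk. The following is a minimal sketch of a RAID 0 style layout, assuming chunks are assigned to disks in round-robin order. The function name `locate_block` and its parameters are hypothetical, and real implementations may use different layouts.

```python
def locate_block(logical_block, num_disks, chunk_blocks):
    """Map a logical block number to (disk index, block number on that disk).

    Assumes a plain round-robin striping layout: chunk 0 goes to disk 0,
    chunk 1 to disk 1, and so on, wrapping around after the last disk.
    """
    chunk_index, offset_in_chunk = divmod(logical_block, chunk_blocks)
    stripe, disk = divmod(chunk_index, num_disks)
    return disk, stripe * chunk_blocks + offset_in_chunk


# A hypothetical array of 4 disks with chunks of 2 blocks each.
layout = [locate_block(block, num_disks=4, chunk_blocks=2) for block in range(10)]

# Disks touched by a sequential read of the first 8 logical blocks.
disks_used = {disk for disk, _ in layout[:8]}
assert disks_used == {0, 1, 2, 3}
```

Because consecutive chunks land on different disks, a large sequential transfer keeps all disks busy at once, and independent random requests tend to be spread over several sets of heads instead of queuing on one.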

3. Storage Area Network – SAN

In larger environments, we can have the disks in external chassis holding hundreds or even thousands of individual disks. In such cases, it’s hard for a single server to use and manage so many disk devices efficiently. Also, in such scenarios, it’s useful to have multiple servers seeing the same disks. This is needed for several distributed computing architectures, such as clusters.

Even though the same disk enclosure can be attached directly to multiple servers, that approach doesn’t scale well. For instance, if a server is not transacting at a given moment, its I/O channel to the disks sits idle. Consequently, we lose part of the performance gains of using massive RAID arrangements. Instead, we can use storage network switches that multiplex access to the disks among a large number of servers. That is exactly what we call a Storage Area Network.

A SAN has a lot of similarities with standard local area networks but operates with storage-specific interfaces and protocols. It can also add some storage virtualization functionality, hiding from the servers the distribution and specifics of each storage chassis used. Another advantage is that those switches can connect various types of storage (flash-only storage, conventional hard drives, and even backup tape drives) to multiple types of servers, such as x86 or RISC servers, and mainframes.

This technology is useful whenever multiple servers need high-throughput, low-latency access to a large number of disks. Also, the servers see the RAID arrangements as disks, similar to what they would see with direct-attached storage. That means that the servers manage the data on them using their native filesystems.
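The storage virtualization idea mentioned in this section, where the server sees one disk while the blocks actually live on several chassis, can be sketched as a simple address translation. This is only a toy model under stated assumptions: the `VirtualVolume` class, the chassis names, and the plain concatenation of extents are all hypothetical and do not correspond to any particular SAN product.

```python
from bisect import bisect_right
from itertools import accumulate


class VirtualVolume:
    """A toy virtual disk built by concatenating extents from several backends."""

    def __init__(self, extents):
        # extents: list of (backend name, size in blocks), in address order
        self.extents = extents
        # Running totals mark the logical block where each extent ends.
        self.ends = list(accumulate(size for _, size in extents))

    @property
    def size(self):
        return self.ends[-1]

    def resolve(self, logical_block):
        """Translate a logical block into (backend name, block on that backend)."""
        if not 0 <= logical_block < self.size:
            raise IndexError(logical_block)
        index = bisect_right(self.ends, logical_block)
        start = self.ends[index - 1] if index else 0
        return self.extents[index][0], logical_block - start


# One volume of 150 blocks backed by two hypothetical chassis.
volume = VirtualVolume([("chassis-a", 100), ("chassis-b", 50)])
first_on_b = volume.resolve(100)
```

The server only ever addresses blocks 0 to 149. Which chassis serves each block is decided by the translation layer, so backends can be added or replaced without the server noticing.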

4. Network Attached Storage – NAS

While a SAN gives servers access that is as close as possible to directly attached storage, sometimes it is best to share files among servers. That is exactly why we have been using distributed file systems such as Microsoft SMB/CIFS or NFSv3/NFSv4 (common on POSIX systems) for a long time now. They are examples of what we call Network Attached Storage. In this scenario, one or more servers share files that they manage (using their own storage, directly attached or through a storage area network) with other servers or client equipment. The main advantage is that the file systems are managed on fewer servers, simplifying access and management. The performance, however, is lower than with the previous alternatives, since the file-sharing layer adds some overhead. The distributed filesystem has to deal with multiple clients trying to access the same files, for instance.
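One common way for clients to coordinate access to the same shared file is advisory locking. The sketch below uses POSIX record locks through Python’s `fcntl.lockf`, which is available on Unix-like systems only. It runs against a local temporary directory standing in for a mounted share; whether and how locks propagate to other clients depends on the protocol and on server configuration, so this is an assumption-laden illustration, not a description of how any particular NAS behaves. The `append_record` helper and the file name are hypothetical.

```python
import fcntl
import os
import tempfile


def append_record(path, record):
    """Append one line while holding an exclusive advisory lock on the file."""
    with open(path, "a") as handle:
        # Blocks until no other cooperating process holds the lock.
        fcntl.lockf(handle, fcntl.LOCK_EX)
        try:
            handle.write(record + "\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the data out before unlocking
        finally:
            fcntl.lockf(handle, fcntl.LOCK_UN)


# A temporary directory stands in for a mounted network share.
shared_dir = tempfile.mkdtemp()
shared_file = os.path.join(shared_dir, "journal.log")

for client in ("client-a", "client-b"):
    append_record(shared_file, f"{client}: updated")

with open(shared_file) as handle:
    lines = handle.read().splitlines()
```

Advisory locks only protect against other programs that also take the lock, and each lock request is an extra round trip on a network share, which is part of the overhead discussed above.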

5. Software-Defined Storage – SDS

A newer trend that we must know about is Software-Defined Storage. This architecture’s premise is to achieve an even higher level of abstraction from the storage hardware used, taking storage virtualization one step further. It uses inexpensive direct-attached disks without any special-purpose controller (such as RAID boards) and lets the software-defined storage layer deal with redundancy, data distribution, and access.

It can add even more advanced virtualization features at lower cost. For instance, it can offer data deduplication, compression, additional data replicas, and easier provisioning and decommissioning of storage nodes. It can even spread the storage cluster across globally distributed data centers, giving location-aware access to data. That is, clients will access data on the storage nodes closest to them.

Of course, this level of abstraction also imposes some overhead. However, adding more nodes and optimizing data distribution can mitigate this. Most software-defined storage systems optimize data distribution with minimal user intervention.

Finally, the virtual storage can offer multiple access methods, such as block-oriented, key/value, or file-oriented storage. That way, various types of clients can use the same storage cluster. This architecture is progressively becoming the standard for large distributed containerized deployments. Storage nodes may even be containers themselves.
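To make the ideas of software-managed redundancy, data distribution, and deduplication more concrete, here is a toy in-memory model. It assumes content-addressed chunks (the SHA-256 digest of the data is its identifier) and picks replica nodes by ranking a hash of the node name and chunk identifier. The `ToyStorageCluster` class, node names, and placement rule are hypothetical simplifications and do not reflect the design of any specific software-defined storage product.

```python
import hashlib


class ToyStorageCluster:
    """In-memory model of replicated, deduplicated chunk storage."""

    def __init__(self, node_names, replicas=2):
        self.nodes = {name: {} for name in node_names}
        self.replicas = replicas

    def _placement(self, chunk_id):
        # Rank nodes by a hash of (node name + chunk id) and keep the first few.
        ranked = sorted(
            self.nodes,
            key=lambda name: hashlib.sha256((name + chunk_id).encode()).hexdigest(),
        )
        return ranked[: self.replicas]

    def put(self, data):
        # Identical content always gets the same id, so it is stored only once.
        chunk_id = hashlib.sha256(data).hexdigest()
        for name in self._placement(chunk_id):
            self.nodes[name][chunk_id] = data
        return chunk_id

    def get(self, chunk_id, failed=()):
        # Try each replica in turn, skipping nodes reported as failed.
        for name in self._placement(chunk_id):
            if name not in failed and chunk_id in self.nodes[name]:
                return self.nodes[name][chunk_id]
        raise KeyError(chunk_id)

    def stored_copies(self):
        return sum(len(chunks) for chunks in self.nodes.values())


cluster = ToyStorageCluster(["node-1", "node-2", "node-3"], replicas=2)
first = cluster.put(b"same payload")
second = cluster.put(b"same payload")  # duplicate write, no extra space used
```

Writing the same payload twice leaves only the two replicas in the cluster, and a read still succeeds when one of the two replica nodes is passed in `failed`. No RAID controller is involved: redundancy and placement are decided entirely in software.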

6. Conclusion

In this tutorial, we introduced the main storage architectures available for data centers today. There is no silver bullet: depending on what we want or need, we may have to use a different architecture. In fact, it is common to find all of them running side by side in current data centers. Choosing the right one for each system deployment can greatly help in meeting the required redundancy, performance, and functionality.
