Blog · 2018-08-29 · Wojciech

Chunkserver Hardware Requirements

Chunkservers spend their lives shuttling data between disk and network: get the sizing right and your cluster flies; get it wrong and you’ll hit bottlenecks fast. This guide covers RAM calculation, JBOD vs RAID, SSD–HDD mixing, pre-fetch and read-ahead, and network interface planning.

Server hardware for MooseFS chunkservers

The Chunkserver’s main duty is to transfer data: it reads from disk and sends to the network, and it receives from the network and writes to disk. It doesn’t perform many calculations or memory operations. In erasure coding configurations, Chunkservers are also responsible for calculating erasure codes. However, the parity calculation algorithm is extremely fast (up to 5 GB/s on average hardware) and should not noticeably affect Chunkserver resources.

Chunkservers store data on local disks in the form of “chunks”. Each chunk is a file holding between 64 KiB and 64 MiB of data plus an 8 KiB chunk header. This is why a file smaller than 64 KiB occupies at least 64 + 8 = 72 KiB of disk space. Files larger than 64 MiB are divided into multiple chunks and spread across many Chunkservers.
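The chunk layout described above can be turned into a rough disk usage estimate for one copy of a file. This is only a sketch: the 64 KiB rounding of the last chunk is an assumption inferred from the 72 KiB minimum, and the function name chunk_disk_usage is hypothetical, not part of MooseFS.

```python
import math

KIB = 1024
MIB = 1024 * KIB
MAX_CHUNK = 64 * MIB   # maximum data held by one chunk
BLOCK = 64 * KIB       # assumed allocation step inside a chunk
HEADER = 8 * KIB       # per-chunk header


def chunk_disk_usage(file_size):
    """Return (number_of_chunks, bytes_on_disk) for one copy of a file."""
    if file_size <= 0:
        return 0, 0  # assumption: an empty file allocates no chunk
    full_chunks, tail = divmod(file_size, MAX_CHUNK)
    chunks = full_chunks
    usage = full_chunks * (MAX_CHUNK + HEADER)
    if tail:
        # The last, partial chunk is rounded up to whole 64 KiB blocks.
        usage += math.ceil(tail / BLOCK) * BLOCK + HEADER
        chunks += 1
    return chunks, usage


print(chunk_disk_usage(10 * KIB))    # a tiny file still occupies 72 KiB
print(chunk_disk_usage(100 * MIB))   # two chunks: 64 MiB + 36 MiB, plus two headers
```

The estimate covers a single copy; the replication goal or erasure coding overhead would multiply it accordingly.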

A new Chunkserver can be connected to the system at any time, and its capacity is available for use immediately.

Memory

During normal operations, the Chunkserver keeps information about its chunks in memory. Each chunk takes approximately 150–200 bytes. Additionally, approximately 350 MiB (Virtual) / 200 MiB (Resident) of memory should be available for the processes.

The required memory size depends on the desired maximum number of chunks on a server. The maximum number of chunks on a server (in terms of how many chunks the server disks may store) may be calculated roughly from the Chunkserver’s total available (for MooseFS) disk space divided by average chunk size. Average chunk size depends on the average size of files. When storing files much larger than 64 MiB it should be about 64 MiB and for smaller files, average chunk size will be between 72 KiB and 64 MiB.

Example for memory requirement calculation:

The Chunkserver has 20 × 6 TiB disks available for storing data.
Users store files of approximately 100 MiB each.
Each file is kept in 2 chunks on average, which gives an average chunk size of ~50 MiB.
Memory required: (20 × 6 TiB / 50 MiB) × 170 B + ~350 MiB = ~408 MiB + ~350 MiB = ~758 MiB
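The same calculation can be written as a short script, which makes it easy to try other disk counts or file sizes. It is a sketch of the arithmetic above, not a MooseFS tool: the function names are hypothetical, and the defaults of 170 B per chunk and 350 MiB for the process are simply the figures used in this example.

```python
import math

MIB = 1024 ** 2
TIB = 1024 ** 4
MAX_CHUNK_MIB = 64  # maximum chunk data size


def average_chunk_size_mib(avg_file_mib):
    """Average chunk size when every file is split into 64 MiB chunks."""
    chunks_per_file = math.ceil(avg_file_mib / MAX_CHUNK_MIB)
    return avg_file_mib / chunks_per_file


def chunkserver_memory_mib(disks, disk_tib, avg_chunk_mib,
                           bytes_per_chunk=170, process_mib=350):
    """Rough memory requirement in MiB for one Chunkserver."""
    total_bytes = disks * disk_tib * TIB
    max_chunks = total_bytes / (avg_chunk_mib * MIB)
    return max_chunks * bytes_per_chunk / MIB + process_mib


avg_chunk = average_chunk_size_mib(100)            # 100 MiB files -> 50 MiB chunks
needed = chunkserver_memory_mib(20, 6, avg_chunk)  # 20 x 6 TiB disks
print(f"average chunk size: {avg_chunk:.0f} MiB")
print(f"memory required:    {needed:.0f} MiB")
```

For a more conservative figure, pass bytes_per_chunk=200, the upper end of the per-chunk range given above.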

The Chunkserver may use the remaining available memory for caching data, which boosts its performance. In our experience, a typical memory size for a Chunkserver is 8–12 GiB.

CPU

Since the Chunkserver daemon is a multi-threaded process, multi-core processors are recommended. A Chunkserver usually utilizes the equivalent of ~1 CPU core for cluster-related operations.

CPU utilization may increase due to erasure code calculations (in EC cluster configurations). The size of the increase depends on the EC configuration, mainly on the amount of data written with erasure codes. EC calculations are not performed during normal read operations; they are performed only during write operations and while repairing corrupted chunks.

Disks

It is recommended to connect disks as JBOD. Each disk should be formatted with a POSIX-compliant file system and mounted separately in the operating system (e.g. as /mnt/chunk01, /mnt/chunk02, …). The recommended file system for local disks is XFS.

Using RAID controllers in Chunkservers is not recommended, as it is MooseFS’s role to keep data redundant and safe. There are at least two reasons for not using underlying RAID controllers:

MooseFS has a mechanism for checking whether a hard disk is in good condition. It can discover broken disks, replicate their data and mark such disks as damaged. The situation is different with RAID: MooseFS algorithms do not check the state of RAID arrays, so a corrupted array may be falsely reported as healthy.
The other aspect is replication time. Let’s assume the replication goal is set to 2 for the whole MooseFS instance. If one 2 TiB drive breaks, replication (from the other copy) will take about 20–60 minutes. However, if one big RAID array (e.g. 36 TiB) becomes corrupted, replication can take as long as 12–18 hours. Until the replication process is finished, some of the data is in danger: there is only one valid copy, and if another disk or RAID array fails during that time, some of the data may be irrevocably lost. Thus, a longer replication period puts data in greater danger.
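The effect of failed capacity on the exposure window can be estimated with simple arithmetic. The sketch below assumes a constant effective replication rate of about 582.5 MiB/s, a value derived only from the upper-bound figures quoted above (2 TiB in 60 minutes, 36 TiB in 18 hours); it is not a MooseFS specification, and the real rate depends on cluster load, network and replication limits. The function name replication_hours is hypothetical.

```python
MIB_PER_TIB = 1024 * 1024


def replication_hours(failed_tib, rate_mib_s):
    """Hours needed to re-replicate failed_tib of data at a constant rate."""
    seconds = failed_tib * MIB_PER_TIB / rate_mib_s
    return seconds / 3600


ASSUMED_RATE = 582.5  # MiB/s, inferred from the figures in this article

for label, size_tib in (("single 2 TiB disk", 2), ("36 TiB RAID array", 36)):
    hours = replication_hours(size_tib, ASSUMED_RATE)
    print(f"{label}: about {hours:.1f} h with only one valid copy")
```

At a fixed rate the exposure window grows linearly with the amount of data lost at once, which is the argument for many small failure domains (individual disks) over one large RAID array.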

How many disks per Chunkserver?

Theoretically, there is no limit on the number of disks in a Chunkserver apart from hardware limits: chassis size, cooling or disk controller limits. The appropriate number of disks in a Chunkserver should be determined by taking into consideration how the storage is to be used and the hardware parameters. For better overall cluster performance, it is recommended to have more Chunkservers with a smaller number of disks:

each Chunkserver has its own limits for serving client I/O requests, so more Chunkservers can serve more requests from many clients in parallel,
either the network bandwidth limit or the controller bandwidth limit is usually reached when there are too many disks (e.g. fast SSDs) in one server.

One may consider putting more disks in a Chunkserver either when a cluster has to keep “colder” data (more files, but used infrequently) or when just a few clients are using the cluster. If just a few clients (or just one) are accessing the cluster, increasing the number of Chunkservers will not increase the speed beyond a certain level, as it is limited by the client’s network bandwidth and machine performance.

Other factors affecting the ratio of Chunkservers to disks should also be taken into account. For example, Erasure Coding requires at least 8+n Chunkservers in a cluster, so for a desired cluster size the minimum number of disks per Chunkserver will be: cluster_size / ((8+n) * disk_size).
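The formula above translates directly into code. In this sketch the result is rounded up to a whole disk, cluster_size is treated as raw capacity, and n is taken to be the number of parity parts; these three points are assumptions, and the function and parameter names are hypothetical.

```python
import math


def min_disks_per_chunkserver(cluster_size_tib, disk_size_tib,
                              parity_parts, data_parts=8):
    """cluster_size / ((8+n) * disk_size), rounded up to whole disks."""
    chunkservers = data_parts + parity_parts  # minimum Chunkservers for 8+n
    return math.ceil(cluster_size_tib / (chunkservers * disk_size_tib))


# Hypothetical cluster: 1000 TiB of raw capacity on 10 TiB disks, n = 2.
print(min_disks_per_chunkserver(1000, 10, parity_parts=2))
```

This is the disk count when the cluster runs on the minimum 8+n Chunkservers; adding more Chunkservers lowers the number of disks each one needs.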

SSDs vs. HDDs

SSDs are getting more and more attention these days, as they are faster than spinning drives (in both latency and throughput) and are getting cheaper each year. There is a trend towards filling storage clusters with SSDs. This is possible with MooseFS: one may build an SSD-only cluster or mix SSDs with HDDs, thereby building a kind of tiered storage solution.

Building such a mixed solution is easy with the so-called storage classes of MooseFS. If each Chunkserver contains disks of a single type, storage classes may be used for grouping Chunkservers and assigning policies for storing data on certain groups. For example, files not used for a long period may be automatically moved from SSD-only to HDD-only Chunkservers. SSD-only Chunkservers can be expected to be faster but more expensive, and HDD-only Chunkservers slower but cheaper.

It is important to note that too many SSDs in a single server may not speed up a Chunkserver as expected, because of either the disk controller bandwidth limit or the network bandwidth limit. Utilizing the full transfer speed of the SSDs is possible only when the controller speed and the network speed are each greater than the transfer speed of a single disk multiplied by the number of disks. One should also refer to the controller specification to check its limits on operating many disks at once.
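The condition above amounts to finding the smallest of three limits. The sketch below does exactly that; all throughput figures in the example are hypothetical placeholders rather than measurements, and the function name throughput_bottleneck is made up for illustration.

```python
def throughput_bottleneck(disk_count, disk_mb_s, controller_mb_s, network_mb_s):
    """Return (limiting_component, achievable_MB_per_s) for one server."""
    limits = {
        "disks": disk_count * disk_mb_s,  # aggregate speed of all disks
        "controller": controller_mb_s,
        "network": network_mb_s,
    }
    name = min(limits, key=limits.get)    # the smallest limit wins
    return name, limits[name]


# Hypothetical server: 12 SSDs at 500 MB/s each, a 4800 MB/s controller
# and a 25 Gbit/s network link (roughly 3125 MB/s).
component, speed = throughput_bottleneck(12, 500, 4800, 3125)
print(f"limited by {component} at {speed} MB/s")
```

If the result is anything other than "disks", adding more SSDs to that server will not raise its throughput.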

Mixing SSDs and HDDs

Mixing SSDs and HDDs within the same Chunkserver process is not recommended. A Chunkserver doesn’t differentiate SSDs from HDDs: storage classes don’t support assignment of single disks, and Chunkserver algorithms don’t use particular SSDs for data caching. If, for any reason, both disk types have to be present in one machine, it is possible to run separate Chunkserver processes on it. Each process should be assigned a separate set of disks: SSDs for one and HDDs for the other. Each process is then visible to the cluster as a separate Chunkserver, and storage classes may be applied.

Due to their low latency, SSDs may boost a cluster when many random I/O operations are expected, e.g. when storing SQL database files, home directories, web assets etc.

Pre-fetch & Read-ahead

MooseFS’s read-ahead and pre-fetch algorithms make HDD-only Chunkservers very effective. When a request regarding a chunk appears, the Chunkserver pre-fetches that chunk’s data, so the data required by a client is available in the Chunkserver’s memory before the I/O operation requesting it. When a data stream is read sequentially, the MooseFS client may even read the following chunks in advance. These algorithms alleviate the problems of slower random access and slower data transfer on HDDs. Data caching and the read-ahead and pre-fetch algorithms use only RAM for storing cached data; once again, they do not use SSDs.

Pre-fetch and read-ahead algorithms improve the performance of HDD clusters, especially for stream-like access patterns, e.g. storing and serving video files, logging etc.

Networking

A Chunkserver sends and receives a lot of data, so its network interface is essential for robust operation. Always keep in mind that at any given point in time, each Chunkserver serves many clients simultaneously, and each client connects to many Chunkservers even when dealing with one (large enough) file. In addition, each Chunkserver exchanges data with other Chunkservers for balancing operations and automatic repairs.

A Chunkserver may use two separate network interfaces: the first for serving clients and the second for server-to-server communication. Note that this is not the same as LACP.

If you want to know more, download the MooseFS Hardware Guide.