10 MooseFS Best Practices to maximize performance!

August 27th, 2018 | Karol Majek

Here are 10 MooseFS Best Practices! Many people are asking us about the technical aspects of setting up MooseFS instances.
In order to answer these questions, we are publishing a list of best practices and hardware recommendations. Follow these to achieve the best reliability of your MooseFS installation.

1. The minimum goal set to 2

The first of the MooseFS Best Practices is to set the minimum goal to 2. To keep your data safe, we recommend a minimum goal of 2 for the whole MooseFS instance. A goal is the number of copies of each file's chunks distributed among Chunkservers, and it is one of the most crucial aspects of keeping data safe. With the goal set to 2, if a drive or a Chunkserver fails, the missing chunk copy is replicated from the surviving copy to another Chunkserver to fulfill the goal, and your data stays safe. With the goal set to 1, the chunks that existed on the broken disk are gone after such a failure, and so are the files those chunks belonged to. Keeping the goal at 1 will eventually lead to data loss.

To set the goal to 2 for the whole instance, run the following command on the server that MooseFS is mounted on (e.g. in /mnt/mfs):

mfssetgoal -r 2 /mnt/mfs

You should also prevent the users from setting a goal lower than 2. To do so, edit your /etc/mfs/mfsexports.cfg file on every Master Server and set mingoal appropriately in each export:

* / rw,alldirs,mingoal=2,maproot=0:0

After modifying /etc/mfs/mfsexports.cfg you need to reload your Master Server(s):

mfsmaster reload

or

service moosefs-master reload
service moosefs-pro-master reload

or

kill -HUP `pidof mfsmaster`
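To confirm that no export was left without a minimum goal, the exports file can be scanned for entries that lack a mingoal option. The following is a minimal sketch, not an official MooseFS tool: the function name exports_missing_mingoal is hypothetical, and it assumes that comment lines in the file start with #.

```shell
#!/bin/sh
# Print every export entry that does not set "mingoal=".
# Usage: exports_missing_mingoal /etc/mfs/mfsexports.cfg
exports_missing_mingoal() {
    # Drop comment lines and blank lines, then keep entries without mingoal.
    grep -v -E '^[[:space:]]*(#|$)' "$1" | grep -v 'mingoal=' || true
}
```

An empty result means every export enforces a minimum goal; any printed line is an export where users could still set a lower goal.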

For big instances (1 PiB or above) we recommend a minimum goal of 3, because the probability of a disk failure in such a big instance is higher.

2. Enough space for metadata dumps

We have had a number of support cases related to metadata loss. Most of them were caused by a lack of free space for the /var/lib/mfs directory on Master Servers.

The free space needed for metadata in /var/lib/mfs can be calculated by the following formula:

SPACE = RAM * (BACK_META_KEEP_PREVIOUS + 2) + 1 * (BACK_LOGS + 1)  [GiB]

where:

  • RAM is the amount of RAM on the Master Server, in GiB
  • BACK_LOGS is the number of metadata changelog files (default is 50, set in /etc/mfs/mfsmaster.cfg)
  • BACK_META_KEEP_PREVIOUS is the number of previous metadata files to be kept (default is 1, also set in /etc/mfs/mfsmaster.cfg)

(If default values from /etc/mfs/mfsmaster.cfg are used, it is RAM * 3 + 51 [GiB])

The factor 1 (multiplied by BACK_LOGS + 1) is an estimate of the size of one changelog.[number].mfs file. In a highly loaded instance, such a file takes a bit less than 1 GB.

Example:

If you have 128 GiB of RAM on your Master Server, using the given formula, you should reserve for /var/lib/mfs on Master Server(s):

128*3 + 51 = 384 + 51 = 435 GiB minimum.
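The same arithmetic can be scripted for other configurations. This is a small sketch of the formula above; the function name mfs_metadata_space_gib is hypothetical, and the defaults (50 and 1) are the ones quoted in this article.

```shell
#!/bin/sh
# Estimate the free space (in GiB) to reserve for /var/lib/mfs.
# Arguments: RAM in GiB, BACK_LOGS (default 50), BACK_META_KEEP_PREVIOUS (default 1).
mfs_metadata_space_gib() {
    ram_gib=$1
    back_logs=${2:-50}
    keep_previous=${3:-1}
    # The factor 1 is the article's estimate (in GiB) for a single changelog file.
    echo $(( ram_gib * (keep_previous + 2) + 1 * (back_logs + 1) ))
}

mfs_metadata_space_gib 128   # prints 435
```

The result can be compared with the free space reported by df -h /var/lib/mfs on each Master Server.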

3. RAID 1 or RAID 1+0 for storing metadata

We recommend setting up a dedicated RAID 1 or RAID 1+0 array for storing metadata dumps and changelogs. Such an array should be mounted on the /var/lib/mfs directory and should not be smaller than the value calculated in the previous point.

We do not recommend storing metadata over the network (e.g. on a SAN or an NFS share).

4. Virtual Machines and MooseFS

For high-performance computing systems, we do not recommend running MooseFS components (especially Master Server(s)) on Virtual Machines.

5. JBOD and XFS for Chunkservers

We recommend connecting JBODs to Chunkservers. Just format each drive as XFS, mount it on e.g. /mnt/chunk01, /mnt/chunk02, … and put these paths into /etc/mfs/mfshdd.cfg. That’s all.
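As a rough sketch of these steps, the helper below prints the mount-point paths in the /mnt/chunkNN pattern used above, one per line, ready to be appended to /etc/mfs/mfshdd.cfg. The function name mfshdd_paths and the device name /dev/sdX are placeholders, not part of MooseFS, and the formatting commands are left as comments because they destroy existing data.

```shell
#!/bin/sh
# Print one mount-point path per data drive, one per line.
# Argument: number of drives in the JBOD.
mfshdd_paths() {
    i=1
    while [ "$i" -le "$1" ]; do
        printf '/mnt/chunk%02d\n' "$i"
        i=$((i + 1))
    done
}

# Per-drive preparation (run as root; /dev/sdX is a placeholder):
#   mkfs.xfs /dev/sdX
#   mkdir -p /mnt/chunk01
#   mount /dev/sdX /mnt/chunk01
# The mounted directory must be writable by the user the Chunkserver runs as.

mfshdd_paths 3
```

Once the drives are mounted, the output can be appended with mfshdd_paths 3 >> /etc/mfs/mfshdd.cfg.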

We recommend such a configuration mainly for two reasons:

  • MooseFS has a mechanism for checking whether a hard disk is in good condition. It can discover broken disks, replicate the data, and mark such disks as damaged. The situation is different with RAID: MooseFS algorithms do not work with RAID arrays, so a corrupted array may be falsely reported as healthy.
  • The other aspect is the time of replication. Let’s assume you have the goal set to 2 for the whole MooseFS instance. If one 2 TiB drive breaks, the replication (from another copy) will take about 40-60 minutes. If one big RAID array (e.g. 36 TiB) becomes corrupted, replication can take as long as 12-18 hours. Until the replication process is finished, some of your data is in danger, because you have only one valid copy. If another disk or RAID array fails during that time, some of your data may be irrevocably lost. So a longer replication period puts your data in greater danger.
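The two figures quoted above are consistent with replication time growing roughly in proportion to the amount of data to re-replicate. The sketch below scales the 2 TiB figure to 36 TiB under that linear assumption; the function name scaled_replication_minutes is hypothetical, and real replication times depend on network, disks, and load.

```shell
#!/bin/sh
# Scale a known replication time linearly with the amount of data.
# Arguments: reference size (TiB), reference time (minutes), target size (TiB).
scaled_replication_minutes() {
    echo $(( $2 * $3 / $1 ))
}

scaled_replication_minutes 2 40 36   # prints 720  (12 hours)
scaled_replication_minutes 2 60 36   # prints 1080 (18 hours)
```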

6. Network Bandwidth

We recommend having at least a 1 Gbps network. Of course, MooseFS will perform better in a 10 Gbps network (in our tests we saturated the 10 Gbps network).

We recommend setting up LACP between two switches and connecting each machine to both of them, so that your network connection is redundant.

7. overcommit_memory on Master Servers (Linux only)

If you have an entry similar to the following one in /var/log/syslog or /var/log/messages:

fork error (store data in the foreground - it will block master for a while)

you may encounter (or are already encountering) problems with your Master Server, such as timeouts and dropped connections from clients. This happens because your system does not allow the MooseFS Master process to fork and store its metadata in the background.

Linux uses several different algorithms for estimating how much memory a process needs when it is created. One of them assumes that a forked process will need exactly as much memory as its parent. Under this algorithm, with a process taking 24 GB of memory and 40 GB in total (32 GB physical plus 8 GB virtual), forking would always fail, because parent and child together would need 48 GB.

In reality, the fork system call does not copy the entire memory; only the modified fragments are copied as needed (copy-on-write). Since the child process in the MFS master only reads this memory and dumps it into a file, it is safe to assume that not much of the memory content will change.

Therefore such a “careful” estimating algorithm is not needed. The solution is to switch the estimating algorithm the system uses. This can be done for the running system with a single command as root:

echo 1 > /proc/sys/vm/overcommit_memory

To switch it permanently, so it stays this way even after the system is restarted, you need to put the following line into your /etc/sysctl.conf file:

vm.overcommit_memory=1
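To check which policy is currently active, the value can be read back from /proc. The sketch below maps the three values defined by the Linux kernel to short descriptions; the function name describe_overcommit is hypothetical.

```shell
#!/bin/sh
# Translate a vm.overcommit_memory value into a short description of the policy.
describe_overcommit() {
    case "$1" in
        0) echo "heuristic overcommit (kernel default)" ;;
        1) echo "always overcommit" ;;
        2) echo "strict accounting, never overcommit" ;;
        *) echo "unknown value: $1" ;;
    esac
}

# Report the current setting when the file is available (Linux only).
if [ -r /proc/sys/vm/overcommit_memory ]; then
    describe_overcommit "$(cat /proc/sys/vm/overcommit_memory)"
fi
```

On a Master Server configured as described here, the script should report "always overcommit". After editing /etc/sysctl.conf, running sysctl -p as root applies the file without a restart.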

8. Disabled updateDB feature (Linux only)

Updatedb is part of mlocate, an indexing system that keeps a database listing all the files on your server. This database is used by the locate command to do searches.

Updatedb is not recommended for network distributed filesystems.

To disable the Updatedb feature for MooseFS, add fuse.mfs to the PRUNEFS variable in /etc/updatedb.conf (it should look similar to this):

PRUNEFS="NFS nfs nfs4 rpc_pipefs afs binfmt_misc proc smbfs autofs iso9660 ncpfs coda devpts ftpfs devfs mfs shfs sysfs cifs lustre tmpfs usbfs udf fuse.glusterfs fuse.sshfs fuse.mfs curlftpfs ecryptfs fusesmb devtmpfs"
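A quick way to verify the change is to look for fuse.mfs on the PRUNEFS line. This is a minimal sketch with a hypothetical function name, prunefs_has_mfs; it assumes PRUNEFS is defined on a single line, as in the example above.

```shell
#!/bin/sh
# Succeed (exit status 0) if the PRUNEFS line of the given file lists fuse.mfs.
# Usage: prunefs_has_mfs /etc/updatedb.conf
prunefs_has_mfs() {
    grep -E '^[[:space:]]*PRUNEFS[[:space:]]*=' "$1" \
        | grep -q -E '(^|[" ])fuse\.mfs([" ]|$)'
}
```

For example, prunefs_has_mfs /etc/updatedb.conf && echo "MooseFS mounts are excluded from updatedb" prints the message only when the entry is present.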

9. Up-to-date operating system

We recommend using an up-to-date operating system. It doesn’t matter whether your OS is Linux, FreeBSD or MacOS X: it needs to be up to date. For example, some features added in MooseFS 3.0 will not work with an old FUSE version (such as the one present on Debian 5).

See the instructions on how to install MooseFS on various operating systems.

10. Hardware recommendation

Finally, the last of the MooseFS Best Practices. Since the MooseFS Master Server is a single-threaded process, we recommend using modern processors with a high clock speed and a low number of cores for Master Servers, e.g.:

  • Intel(R) Xeon(R) CPU E3-1280 v6 @ 3.90GHz
  • Intel(R) Xeon(R) CPU E3-1281 v3 @ 3.70GHz
  • Intel(R) Xeon(R) CPU E3-1276 v3 @ 3.60GHz
  • Intel(R) Xeon(R) CPU E5-1630 v3 @ 3.70GHz
  • Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz

A good starting point when choosing a CPU for the Master Server is the single-thread performance rating on CPU Benchmark.

We also recommend disabling the hyper-threading CPU feature on Master Servers.

For Chunkservers we recommend modern multi-core processors.

The minimum recommended and supported HA configuration for MooseFS Pro is 2 Master Servers and 3 Chunkservers. If you have 3 Chunkservers and one of them goes down, your data is still accessible and is being replicated, and the system still works. If you have only 2 Chunkservers and one of them goes down, MooseFS waits for it and is not able to perform any operations.

The minimum number of Chunkservers required to run MooseFS Pro properly is 3.