Documentation

Best practices

(last update: January 13, 2016)

Many people ask us about the technical aspects of setting up MooseFS instances.
To answer these questions, we are publishing here a list of best practices and hardware recommendations. Follow them to achieve the best reliability of your MooseFS installation.

  1. Minimum goal set to 2
  2. Enough space for metadata dumps
  3. RAID 1 or RAID 1+0 for storing metadata
  4. Virtual Machines and MooseFS
  5. JBOD and XFS for Chunkservers
  6. Network
  7. overcommit_memory on Master Servers (Linux only)
  8. Disabled updateDB feature (Linux only)
  9. Up-to-date operating system
  10. Hardware recommendation

  1. Minimum goal set to 2

    In order to keep your data safe, we recommend setting the minimum goal to 2 for the whole MooseFS instance.

    The goal is the number of copies of a file's chunks distributed among Chunkservers. It is one of the most crucial settings for keeping data safe.

    If you have the goal set to 2 and a drive or a Chunkserver fails, each missing chunk copy is replicated from another copy to another Chunkserver to fulfill the goal, and your data stays safe.

    If you have the goal set to 1, in case of such a failure the chunks that existed on the broken disk are missing, and consequently the files these chunks belonged to are missing as well. Having the goal set to 1 will eventually lead to data loss.

    To set the goal to 2 for the whole instance, run the following command on the server that MooseFS is mounted on (e.g. in /mnt/mfs):

    # mfssetgoal -r 2 /mnt/mfs

    You should also prevent users from setting a goal lower than 2. To do so, edit your /etc/mfs/mfsexports.cfg file on every Master Server and set mingoal appropriately in each export:

    *    /    rw,alldirs,mingoal=2,maproot=0:0

    After modifying /etc/mfs/mfsexports.cfg you need to reload your Master Server(s):

    # mfsmaster reload
    or
    # service moosefs-master reload
    # service moosefs-pro-master reload
    or
    # kill -HUP `pidof mfsmaster`

    For big instances (around 1 PiB or above) we recommend a minimum goal of 3, because the probability of a disk failure in such a big instance is higher.
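    To double-check the result afterwards, the mfsgetgoal tool can be run recursively over the mount point; with -r it prints a summary of the goals set in the directory tree (assuming MooseFS is mounted at /mnt/mfs, as above):

```shell
# Summarize goals of all objects under the mount point; after the
# change above, no files should remain with goal 1.
mfsgetgoal -r /mnt/mfs
```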

  2. Enough space for metadata dumps

    We have had a number of support cases related to metadata loss. Most of them were caused by a lack of free space in the /var/lib/mfs directory on Master Servers.

    The free space needed for metadata in /var/lib/mfs can be calculated by the following formula:

    • RAM is the amount of RAM
    • BACK_LOGS is the number of metadata change log files (default is 50 - from /etc/mfs/mfsmaster.cfg)
    • BACK_META_KEEP_PREVIOUS is the number of previous metadata files to be kept (default is 1 - also from /etc/mfs/mfsmaster.cfg)

    SPACE = RAM * (BACK_META_KEEP_PREVIOUS + 2) + 1 * (BACK_LOGS + 1) [GiB]

    (If default values from /etc/mfs/mfsmaster.cfg are used, it is RAM * 3 + 51 [GiB])

    The value 1 (before multiplying by BACK_LOGS + 1) is an estimate of the size of a single changelog.[number].mfs file. In a highly loaded instance it takes a bit less than 1 GiB.

    Example:
    If you have 128 GiB of RAM on your Master Server, then, using the formula above, you should reserve for /var/lib/mfs on your Master Server(s):

    128*3 + 51 = 384 + 51 = 435 GiB   minimum.
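    The formula can be checked with a few lines of shell arithmetic (a minimal sketch; the values below are the defaults from /etc/mfs/mfsmaster.cfg plus the 128 GiB of RAM from the example):

```shell
# Minimum /var/lib/mfs space, per the formula above (all values in GiB).
RAM=128                      # amount of RAM on the Master Server
BACK_LOGS=50                 # default from /etc/mfs/mfsmaster.cfg
BACK_META_KEEP_PREVIOUS=1    # default from /etc/mfs/mfsmaster.cfg
SPACE=$(( RAM * (BACK_META_KEEP_PREVIOUS + 2) + 1 * (BACK_LOGS + 1) ))
echo "Reserve at least ${SPACE} GiB for /var/lib/mfs"
```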

  3. RAID 1 or RAID 1+0 for storing metadata

    We recommend setting up a dedicated RAID 1 or RAID 1+0 array for storing metadata dumps and changelogs. Such an array should be mounted at /var/lib/mfs and should not be smaller than the value calculated in the previous point.

    We do not recommend storing metadata over the network (e.g. on SANs or NFS shares).
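    As a rough sketch of building such an array on Linux with mdadm (the partition names /dev/sda2 and /dev/sdb2 are placeholders for this example; adapt them to your hardware):

```shell
# Assemble a two-disk RAID 1 array, create a filesystem on it,
# and mount it where the Master Server keeps its metadata.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
mkfs.xfs /dev/md0
mount /dev/md0 /var/lib/mfs
```

    Remember to add a matching entry to /etc/fstab so the array is mounted again after a reboot.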

  4. Virtual Machines and MooseFS

    For high-performance computing systems, we do not recommend running MooseFS components (especially Master Servers) on virtual machines.

  5. JBOD and XFS for Chunkservers

    We recommend connecting JBOD drives to Chunkservers. Just format each drive as XFS, mount it at e.g. /mnt/chunk01, /mnt/chunk02, ..., and put these paths into /etc/mfs/mfshdd.cfg. That's all.
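    For a single drive, the steps above could look like this (/dev/sdb is a placeholder device name; repeat for each JBOD drive):

```shell
# Format the drive as XFS, mount it, and register the mount point
# with the Chunkserver.
mkfs.xfs /dev/sdb
mkdir -p /mnt/chunk01
mount /dev/sdb /mnt/chunk01
echo '/mnt/chunk01' >> /etc/mfs/mfshdd.cfg
```

    As before, an /etc/fstab entry keeps the mount persistent across reboots.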

    We recommend such a configuration mainly for two reasons:

    • MooseFS has a mechanism for checking whether a hard disk is in good condition. MooseFS can discover broken disks, replicate the data and mark such disks as damaged. The situation is different with RAID: MooseFS algorithms do not work with RAIDs, so a corrupted RAID array may be falsely reported as healthy/ok.

    • The other aspect is replication time. Let's assume you have the goal set to 2 for the whole MooseFS instance. If one 2 TiB drive breaks, replication (from the remaining copies) takes about 40-60 minutes. If one big RAID array (e.g. 36 TiB) becomes corrupted, replication can take even 12-18 hours. Until the replication process is finished, some of your data is in danger, because you have only one valid copy; if another disk or RAID fails during that time, some of your data may be irrevocably lost. So the longer the replication period, the greater the danger to your data.

  6. Network

    We recommend at least a 1 Gbps network. Of course, MooseFS performs better on a 10 Gbps network (in our tests we were able to saturate a 10 Gbps network).

    We recommend setting up LACP between two switches and connecting each machine to both of them to make your network connection redundant.
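    On a Debian-style system with the ifenslave package, such a bonded (LACP, 802.3ad) link could be sketched in /etc/network/interfaces like this. The interface names eth0/eth1 and the address are assumptions for this example, and the two switches must support LACP across them (e.g. stacked or MLAG switches):

```
auto bond0
iface bond0 inet static
    address 192.168.1.10
    netmask 255.255.255.0
    bond-mode 802.3ad
    bond-slaves eth0 eth1
    bond-miimon 100
```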

  7. overcommit_memory on Master Servers (Linux only)

    If you have an entry similar to the following one in /var/log/syslog or /var/log/messages:

    fork error (store data in foreground - it will block master for a while)

    you may encounter (or are already encountering) problems with your Master Server, such as timeouts and dropped client connections. This happens because your system does not allow the MFS Master process to fork and store its metadata in the background.

    Linux systems use several different algorithms to estimate how much memory a single process will need when it is created. One of these algorithms assumes that a forked process needs exactly as much memory as its parent. With a process taking 24 GB of memory, a total of 40 GB available (32 GB physical plus 8 GB virtual) and this algorithm in use, forking would always fail.

    But in reality, the fork command does not copy the entire memory; only the modified fragments are copied, as needed. Since the child process of the MFS master only reads this memory and dumps it into a file, it is safe to assume that not much of the memory content will change.

    Therefore such a "careful" estimation algorithm is not needed here. The solution is to switch the estimation algorithm the system uses, which can be done one-time with a root command:

    # echo 1 > /proc/sys/vm/overcommit_memory

    To switch it permanently, so that it stays this way even after the system is restarted, put the following line into your /etc/sysctl.conf file:

    vm.overcommit_memory=1
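    The sysctl.conf entry can then be applied without a reboot, and the active value inspected, by running as root:

```shell
# Re-read /etc/sysctl.conf and print the currently active value
# (it should report vm.overcommit_memory = 1).
sysctl -p
sysctl vm.overcommit_memory
```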

  8. Disabled updateDB feature (Linux only)

    updatedb is part of mlocate, a simple indexing system that maintains a database listing all the files on your server. This database is used by the locate command to perform searches.

    Running updatedb on distributed network filesystems is not recommended.

    To disable the updatedb feature for MooseFS, add fuse.mfs to the PRUNEFS variable in /etc/updatedb.conf (it should look similar to this):

    PRUNEFS="NFS nfs nfs4 rpc_pipefs afs binfmt_misc proc smbfs autofs iso9660 ncpfs coda devpts ftpfs devfs mfs shfs sysfs cifs lustre tmpfs usbfs udf fuse.glusterfs fuse.sshfs fuse.mfs curlftpfs ecryptfs fusesmb devtmpfs"

  9. Up-to-date operating system

    We recommend using an up-to-date operating system, regardless of whether it is Linux, FreeBSD or Mac OS X. For example, some features added in MooseFS 3.0 do not work with old FUSE versions (such as the one present in Debian 5).

  10. Hardware recommendation

    Since the MooseFS Master Server is a single-threaded process, we recommend modern processors with a high clock rate and a low number of cores for Master Servers, e.g.:

    • Intel(R) Xeon(R) CPU E5-1630 v3 @ 3.70GHz
    • Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz

    We also recommend disabling the hyper-threading CPU feature on Master Servers.

    The minimum recommended and supported HA configuration for MooseFS Pro is 2 Master Servers and 3 Chunkservers. With 3 Chunkservers, if one of them goes down, your data is still accessible, replication continues, and the system still works. With only 2 Chunkservers, if one of them goes down, MooseFS waits for it and is not able to perform any operations.

    The minimum number of Chunkservers required to run MooseFS Pro properly is 3.
