Documentation for MooseFS 3.0:
Detailed documentation for MooseFS 3.0 can be found in:
MooseFS Support Status
|Major version||Released||Supported version||Released||Supported until|
|MooseFS Pro 4.x||May 2019||4.49.0||Aug 2023||TBA|
|MooseFS Pro 3.x||Jun 2016||3.0.117||Feb 2023||TBA|
|MooseFS Pro 2.x||Jul 2014||–||–||Dec 31st, 2017|
|MooseFS 3.x||Jun 2016||3.0.117||Feb 2023||TBA|
|MooseFS 2.x||Jul 2014||–||–||Dec 31st, 2017|
|MooseFS 1.6.x||Dec 2009||–||–||Dec 31st, 2015|
|MooseFS 1.5.x||May 2008||–||–||Dec 31st, 2010|
– means End of life (EOL)
Many people ask us about the technical aspects of setting up MooseFS instances.
To answer these questions, we are publishing a list of best practices and hardware recommendations. Follow them to achieve the best reliability of your MooseFS installation.
1. Minimum goal set to 2
In order to keep your data safe, we recommend setting the minimum goal to 2 for the whole MooseFS instance.
The goal is the number of copies of each file's chunks distributed among Chunkservers. It is one of the most crucial aspects of keeping data safe.
If the goal is set to 2 and a drive or a Chunkserver fails, the missing chunk copy is replicated from the surviving copy to another Chunkserver to fulfill the goal, and your data remains safe.
If the goal is set to 1, the chunks that existed on the broken disk are missing after such a failure and, consequently, the files these chunks belonged to are missing as well. Having the goal set to 1 will eventually lead to data loss.
To set the goal to 2 for the whole instance, run the following command on a server where MooseFS is mounted (e.g. in /mnt/mfs):
# mfssetgoal -r 2 /mnt/mfs
You should also prevent users from setting a goal lower than 2. To do so, edit the /etc/mfs/mfsexports.cfg file on every Master Server and set mingoal appropriately in each export:
* / rw,alldirs,mingoal=2,maproot=0:0
After modifying /etc/mfs/mfsexports.cfg you need to reload your Master Server(s) using one of the following commands:
# mfsmaster reload
# service moosefs-master reload
# service moosefs-pro-master reload
# kill -HUP `pidof mfsmaster`
For big instances (1 PiB or above) we recommend setting the minimum goal to 3, because the probability of a disk failure in such a big instance is higher.
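The mingoal recommendation above can be checked automatically. The following is a minimal sketch, not an official MooseFS tool. It assumes each export entry consists of three whitespace-separated fields (address, directory, options), as in the example line above, and that `#` starts a comment. The function names `parse_mingoal` and `exports_below_minimum` are hypothetical.

```python
def parse_mingoal(export_line):
    """Return the mingoal value of one export entry, or None if it has none.

    Assumption: an entry looks like "ADDRESS DIRECTORY OPTIONS", with the
    options given as a comma-separated list (as in the example above).
    """
    line = export_line.split("#", 1)[0].strip()  # drop trailing comments
    fields = line.split()
    if len(fields) < 3:
        return None  # empty line, comment, or entry without options
    for option in fields[2].split(","):
        key, _, value = option.partition("=")
        if key == "mingoal" and value.isdigit():
            return int(value)
    return None


def exports_below_minimum(config_text, required=2):
    """List export entries whose mingoal is missing or lower than `required`."""
    weak = []
    for raw in config_text.splitlines():
        entry = raw.split("#", 1)[0].strip()
        if not entry:
            continue  # skip blank lines and pure comments
        mingoal = parse_mingoal(entry)
        if mingoal is None or mingoal < required:
            weak.append(entry)
    return weak


if __name__ == "__main__":
    sample = "* / rw,alldirs,mingoal=2,maproot=0:0\n"
    print(exports_below_minimum(sample))              # nothing to fix for goal 2
    print(exports_below_minimum(sample, required=3))  # flagged for big instances
```

Pass `required=3` to check a big instance against the stricter recommendation. Entries that legitimately have no mingoal option are also reported, so the result should be reviewed by hand.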
2. Enough space for metadata dumps
We have had a number of support cases related to metadata loss. Most of them were caused by a lack of free space in the /var/lib/mfs directory on Master Servers.
The free space needed for metadata in /var/lib/mfs can be calculated with the following formula:
SPACE = RAM * (BACK_META_KEEP_PREVIOUS + 2) + 1 * (BACK_LOGS + 1) [GiB]
where:
- RAM is the amount of RAM on the Master Server, in GiB
- BACK_LOGS is the number of metadata changelog files (default is 50, from /etc/mfs/mfsmaster.cfg)
- BACK_META_KEEP_PREVIOUS is the number of previous metadata files to be kept (default is 1, also from /etc/mfs/mfsmaster.cfg)
(If the default values from /etc/mfs/mfsmaster.cfg are used, this is RAM * 3 + 51 [GiB].)
The value 1 (the factor multiplied by BACK_LOGS + 1) is an estimate of the size of one changelog.[number].mfs file. In a highly loaded instance such a file takes a bit less than 1 GB.
If you have 128 GiB of RAM on your Master Server, using the given formula, you should reserve for /var/lib/mfs on Master Server(s):
128*3 + 51 = 384 + 51 = 435 GiB minimum.
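The same calculation can be written as a small helper. This is only a sketch of the formula given above. The function name `metadata_space_gib` and the parameter `changelog_gib` (the estimated 1 GiB per changelog file) are hypothetical names introduced for illustration.

```python
def metadata_space_gib(ram_gib, back_logs=50, back_meta_keep_previous=1,
                       changelog_gib=1):
    """Estimate the space (in GiB) to reserve for /var/lib/mfs.

    Implements SPACE = RAM * (BACK_META_KEEP_PREVIOUS + 2)
                       + changelog_gib * (BACK_LOGS + 1)
    The defaults mirror the defaults quoted from /etc/mfs/mfsmaster.cfg.
    """
    metadata_part = ram_gib * (back_meta_keep_previous + 2)
    changelog_part = changelog_gib * (back_logs + 1)
    return metadata_part + changelog_part


if __name__ == "__main__":
    # Master Server with 128 GiB of RAM and default settings.
    print(metadata_space_gib(128))  # 435
```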
3. RAID 1 or RAID 1+0 for storing metadata
We recommend setting up a dedicated RAID 1 or RAID 1+0 array for storing metadata dumps and changelogs. Such an array should be mounted on the /var/lib/mfs directory and should not be smaller than the value calculated in the previous point.
We do not recommend storing metadata over the network (e.g. on SAN or NFS).
4. Virtual Machines and MooseFS
For high-performance computing systems, we do not recommend running MooseFS components (especially Master Server(s)) on Virtual Machines.
5. JBOD and XFS for Chunkservers
We recommend connecting JBODs to the Chunkserver(s). Just format each drive as XFS, mount it on e.g. /mnt/chunk01, /mnt/chunk02, ... and put these paths into /etc/mfs/mfshdd.cfg. That's all.
We recommend such a configuration mainly for two reasons:
MooseFS has a mechanism for checking whether a hard disk is in good condition. MooseFS can discover broken disks, replicate the data and mark such disks as damaged. The situation is different with RAID: MooseFS algorithms do not work with RAID arrays, so a corrupted array may be falsely reported as healthy.
The other aspect is the time of replication. Let's assume you have the goal set to 2 for the whole MooseFS instance. If one 2 TiB drive breaks, the replication (from another copy) will take about 40-60 minutes. If one big RAID array (e.g. 36 TiB) becomes corrupted, replication can take as long as 12-18 hours. Until the replication process is finished, some of your data is in danger, because you have only one valid copy. If another disk or RAID array fails during that time, some of your data may be irrevocably lost. So the longer the replication period, the greater the danger to your data.
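The figures above are consistent with a replication rate of roughly 2-3 TiB per hour. The sketch below makes that arithmetic explicit. It assumes that replication time scales linearly with the amount of data to re-replicate, which is a simplification (real rates depend on network, load and replication limits). The function name `replication_hours` is hypothetical.

```python
def replication_hours(size_tib, rate_tib_per_hour):
    """Estimate replication time, assuming a constant replication rate."""
    return size_tib / rate_tib_per_hour


# Rates implied by the 2 TiB drive example (2 TiB in 60 and in 40 minutes).
SLOW_RATE = 2 * 60 / 60  # 2.0 TiB per hour
FAST_RATE = 2 * 60 / 40  # 3.0 TiB per hour

if __name__ == "__main__":
    for size in (2, 36):
        best = replication_hours(size, FAST_RATE)
        worst = replication_hours(size, SLOW_RATE)
        print(f"{size} TiB: between {best:.1f} and {worst:.1f} hours with one copy left")
```

For a 36 TiB array this gives 12 to 18 hours, matching the range quoted above. That whole period is spent with a single valid copy of the affected chunks.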
6. Network
We recommend having at least a 1 Gbps network. Of course, MooseFS will perform better in a 10 Gbps network (in our tests we saturated a 10 Gbps network).
We recommend setting up LACP between two switches and connecting each machine to both of them to make your network connection redundant.
7. overcommit_memory on Master Servers (Linux only)
If you find an entry similar to the following one in /var/log/syslog or /var/log/messages:
fork error (store data in foreground - it will block master for a while)
you may encounter (or are already encountering) problems with your Master Server, such as timeouts and dropped connections from clients. This happens because your system does not allow the MFS Master process to fork and store its metadata in the background.
Linux systems use several different algorithms for estimating how much memory a single process needs when it is created. One of these algorithms assumes that a forked process will need exactly the same amount of memory as its parent. With a process taking 24 GB of memory, a total of 40 GB available (32 GB physical plus 8 GB virtual) and this algorithm, forking would always fail.
In reality, however, fork does not copy the entire memory; only the modified fragments are copied as needed. Since the child process in the MFS Master only reads this memory and dumps it into a file, it is safe to assume that not much of the memory content will change.
Therefore such a "careful" estimating algorithm is not needed. The solution is to switch the estimating algorithm the system uses. This can be done for the running system by executing the following command as root:
# echo 1 > /proc/sys/vm/overcommit_memory
To switch it permanently, so that the setting survives a system restart, put the following line into your /etc/sysctl.conf file:
vm.overcommit_memory=1
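To verify the current setting, the value in /proc/sys/vm/overcommit_memory can be read back. The following is a minimal sketch with hypothetical helper names. The meaning of modes 0 and 2 is taken from the Linux kernel documentation, not from this guide.

```python
from pathlib import Path

# Mode meanings as described in the Linux kernel documentation.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def describe_overcommit(raw_value):
    """Parse the content of overcommit_memory into (mode, description)."""
    mode = int(raw_value.strip())
    return mode, OVERCOMMIT_MODES.get(mode, "unknown mode")


def matches_recommendation(raw_value):
    """Return True if the value is 1, the setting recommended above."""
    mode, _ = describe_overcommit(raw_value)
    return mode == 1


if __name__ == "__main__":
    path = Path("/proc/sys/vm/overcommit_memory")
    if path.exists():  # only present on Linux
        raw = path.read_text()
        mode, description = describe_overcommit(raw)
        print(f"vm.overcommit_memory = {mode} ({description})")
        if not matches_recommendation(raw):
            print("Consider setting vm.overcommit_memory=1 on Master Servers.")
```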
8. Disabled updateDB feature (Linux only)
Updatedb is part of mlocate, a simple indexing system that keeps a database listing all the files on your server. This database is used by the locate command to perform searches.
Updatedb is not recommended for network distributed filesystems.
To disable the Updatedb feature for MooseFS, add fuse.mfs to the PRUNEFS variable in /etc/updatedb.conf (it should look similar to this):
PRUNEFS="NFS nfs nfs4 rpc_pipefs afs binfmt_misc proc smbfs autofs iso9660 ncpfs coda devpts ftpfs devfs mfs shfs sysfs cifs lustre tmpfs usbfs udf fuse.glusterfs fuse.sshfs fuse.mfs curlftpfs ecryptfs fusesmb devtmpfs"
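If you manage many machines, the PRUNEFS change can be scripted. The sketch below works on the text of updatedb.conf and is only an illustration. It assumes PRUNEFS is defined on a single line with a double-quoted, space-separated value, as in the example above. The function name `add_prunefs_entry` is hypothetical.

```python
import re

# Matches an uncommented line of the form: PRUNEFS="a b c"
_PRUNEFS_LINE = re.compile(r'^([ \t]*PRUNEFS[ \t]*=[ \t]*")([^"]*)(")', re.MULTILINE)


def add_prunefs_entry(conf_text, fs_type="fuse.mfs"):
    """Return conf_text with fs_type present in the PRUNEFS list.

    The entry is appended only if it is missing, so the function can be
    applied repeatedly. Whitespace between entries is normalized to one space.
    """
    def _add(match):
        entries = match.group(2).split()
        if fs_type not in entries:
            entries.append(fs_type)
        return match.group(1) + " ".join(entries) + match.group(3)

    new_text, count = _PRUNEFS_LINE.subn(_add, conf_text, count=1)
    if count == 0:
        raise ValueError("no PRUNEFS line found")
    return new_text


if __name__ == "__main__":
    sample = 'PRUNE_BIND_MOUNTS="yes"\nPRUNEFS="NFS nfs tmpfs"\n'
    print(add_prunefs_entry(sample))
```

The function only transforms text. Reading and writing /etc/updatedb.conf itself (and keeping a backup) is left to the caller.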
9. Up-to-date operating system
We recommend using an up-to-date operating system. It doesn't matter whether your OS is Linux, FreeBSD or MacOS X; it needs to be up-to-date. For example, some features added in MooseFS 3.0 will not work with an old FUSE version (such as the one present on Debian 5).
10. Hardware recommendation
Since the MooseFS Master Server is a single-threaded process, we recommend using modern processors with a high clock speed and a low number of cores for Master Servers, e.g.:
- Intel(R) Xeon(R) CPU E3-1280 v6 @ 3.90GHz
- Intel(R) Xeon(R) CPU E3-1281 v3 @ 3.70GHz
- Intel(R) Xeon(R) CPU E3-1276 v3 @ 3.60GHz
- Intel(R) Xeon(R) CPU E5-1630 v3 @ 3.70GHz
- Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz
A good starting point when choosing a CPU for the Master Server is the single-thread performance rating on CPU Benchmark.
We also recommend disabling the hyper-threading CPU feature on Master Servers.
For Chunkservers we recommend modern multi-core processors.
The minimal recommended and supported HA configuration for MooseFS Pro is 2 Master Servers and 3 Chunkservers. If you have 3 Chunkservers and one of them goes down, your data is still accessible, it is being replicated, and the system still works. If you have only 2 Chunkservers and one of them goes down, MooseFS waits for it and is not able to perform any operations.
The minimum number of Chunkservers required to run MooseFS Pro properly is 3.