(last update: January 13, 2016)
Lots of people are asking us about technical aspects of setting up MooseFS instances.
In order to answer these questions, we are publishing here a list of best practices and hardware recommendations. Follow these to achieve best reliability of your MooseFS installation.
In order to keep your data safe, we recommend to set the minimum goal to 2 for the whole MooseFS instance.
The goal is a number of copies of files' chunks distributed among Chunkservers. It is one of the most crucial aspects of keeping data safe.
If you have set the goal to 2, in case of a drive or a Chunkserver failure, the missing chunk copy is replicated from another copy to another chunkserver to fulfill the goal, and your data is safe.
If you have the goal set to 1, in case of such failure, the chunks that existed on a broken disk, are missing, and consequently, files that these chunks belonged to, are also missing. Having goal set to 1 will eventually lead to data loss.
To set the goal to 2 for the whole instance, run the following command on the server that MooseFS is mounted on (e.g. in /mnt/mfs):
You should also prevent the users from setting goal lower than 2. To do so, edit your /etc/mfs/mfsexports.cfg file on every Master Server and set mingoal appropriately in each export:
After modifying /etc/mfs/mfsexports.cfg you need to reload your Master Server(s):
For big instances (like 1 PiB or above) we recommend to use minimum goal set to 3, because probability of disk failure in such a big instance is higher.
We had a number of support cases raised connected to the metadata loss. Most of them were caused by a lack of free space for /var/lib/mfs directory on Master Servers.
The free space needed for metadata in /var/lib/mfs can be calculated by the following formula:
(If default values from /etc/mfs/mfsmaster.cfg are used, it is RAM * 3 + 51 [GiB])
The value 1 (before multiplying by BACK_LOGS + 1) is an estimation of size used by one changelog.[number].mfs file. In highly loaded instance it uses a bit less than 1 GB.
If you have 128 GiB of RAM on your Master Server, using the given formula, you should reserve for /var/lib/mfs on Master Server(s):
128*3 + 51 = 384 + 51 = 435 GiB minimum.
We recommend to set up a dedicated RAID 1 or RAID 1+0 array for storing metadata dumps and changelogs. Such array should be mounted on /var/lib/mfs directory and should not be smaller than the value calculated in the previous point.
We do not recommend to store metadata over the network (e.g. SANs, NFSes, etc.).
For high-performance computing systems, we do not recommend running MooseFS components (especially Master Server(s)) on Virtual Machines.
We recommend to connect to Chunkserver(s) JBODs. Just format the drive as XFS and mount on e.g. /mnt/chunk01, /mnt/chunk02, ... and put these paths into /etc/mfs/mfshdd.cfg. That's all.
We recommend such configuration mainly because of two reasons:
MooseFS has a mechanism of checking if the hard disk is in a good condition or not. MooseFS can discover broken disks, replicate the data and mark such disks as damaged. The situation is different with RAID: MooseFS algorithms do not work with RAIDs, therefore corrupted RAID arrays may be falsely reported as healthy/ok.
The other aspect is time of replication. Let's assume you have goal set to 2 for the whole MooseFS instance. If one 2 TiB drive breaks, the replication (from another copy) will last about 40-60 minutes. If one big RAID (e.g. 36 TiB) becomes corrupted, replication can last even for 12-18 hours. Until the replication process is finished, some of your data is in danger, because you have only one valid copy. If another disk or RAID fails during that time, some of your data may be irrevocably lost. So the longer replication period puts your data in greater danger.
We recommend to have at least 1 Gbps network. Of course, MooseFS will perform better in 10 Gbps network (in our tests we saturated the 10 Gbps network).
We recommend to set LACP between two switches and connect each machine to both of them to enable redundancy of your network connection.
If you have an entry similar to the following one in /var/log/syslog or /var/log/messages:
Linux systems use several different algorithms of estimating how much memory a single process needs when it is created. One of these algorithms assumes that if we fork a process, it will need exactly the same amount of memory as its parent. With a process taking 24 GB of memory and total amount of 40 GB (32 GB physical plus 8 GB virtual) and this algorithm, the forking would always be unsuccessful.
But in reality, the fork commant does not copy the entire memory, only the modified fragments are copied as needed. Since the child process in MFS master only reads this memory and dumps it into a file, it is safe to assume not much of the memory content will change.
Therefore such "careful" estimating algorithm is not needed. The solution is to switch the estimating algorithm the system uses. It can be done one-time by a root command:
To switch it permanently, so it stays this way even after the system is restarted, you need to put the following line into your /etc/sysctl.conf file:
Updatedb is part of mlocate which is simply an indexing system, that keeps a database listing all the files on your server. This database is used by the locate command to do searches.
Updatedb is not recommended for network distributed filesystems.
To disable Updatedb feature for MooseFS, add fuse.mfs to variable PRUNEFS in /etc/updatedb.conf (it should look similar to this):
We recommend to use up-to-date operating system. It doesn't matter if your OS is Linux, FreeBSD or MacOS X. It needs to be up-to-date. For example, some features added in MooseFS 3.0 will not work with old FUSE version (which is e.g. present on Debian 5).
Since MooseFS Master Server is a single-threaded process, we recommend to use modern processors with high clock and low number of cores for Master Servers, e.g.:
We also recommend to disable hyper-threading CPU feature for Master Servers.
Minimum recommended and supported HA configuration for MooseFS Pro is 2 Master Servers and 3 Chunkservers. If you have 3 Chunkservers, and one of them goes down, your data is still accessible and is being replicated and system still works. If you have only 2 Chunkservers and one of them goes down, MooseFS waits for it and is not able to perform any operations.
Minumum number of Chunkservers required to run MooseFS Pro properly is 3.