mfsmaster.cfg

NAME

mfsmaster.cfg - main configuration file for mfsmaster

DESCRIPTION

The file mfsmaster.cfg contains configuration of MooseFS master process.

SYNTAX

Syntax is:

OPTION = VALUE

Lines starting with # character are ignored as comments.

STARTUP OPTIONS

Changes in this section require process restart.

WORKING_USER

user to run daemon as

WORKING_GROUP

group to run daemon as; optional value - if empty, then user's default group will be used

SYSLOG_IDENT

name of process to place in syslog messages; default is mfsmaster

LOCK_MEMORY

whether to perform mlockall() to avoid swapping out mfsmaster process; default is 0, i.e. no

LIMIT_GLIBC_MALLOC_ARENAS

limit malloc arenas to given value - prevents server from using huge amount of virtual memory (Linux only, default is 4)

DISABLE_OOM_KILLER

disable out of memory killer (Linux only, default is 1)

NICE_LEVEL

nice level to run daemon with; default is -19; note: process must be started as root to increase priority, if setting of priority fails, process retains the nice level it started with

FILE_UMASK

set default umask for group and others (user always has 0); default is 027 - block write for group and block all for others

DATA_PATH

where to store metadata files and lock file

RUNTIME OPTIONS

Changes in this section require only process reload.

LICENCE_FILENAME

(pro only) alternate location/name of mfslicence.bin file

EXPORTS_FILENAME

alternate location/name of mfsexports.cfg file

TOPOLOGY_FILENAME

alternate location/name of mfstopology.cfg file

INSTANCE_NAME

optional name of MooseFS instance (cluster) that will be displayed in CGI interface (default is empty string)

BACK_LOGS

number of metadata change log files (default is 50)

METADATA_SAVE_OFFSET

offset used for options: METADATA_SAVE_FREQ, (pro only) METADATA_DOWNLOAD_FREQ, (pro only) METADATA_CRCCHECK_FREQ (minutes - default is 0); format X:Y is also accepted and means X hours and Y minutes, suffix "L" can be added to mean local(server) time zone instead of UTC; if hours are provided, modulo value of METADATA_SAVE_FREQ, METADATA_DOWNLOAD_FREQ, METADATA_CRCCHECK_FREQ will be calculated and used, see NOTES

METADATA_SAVE_FREQ

how often (in hours) master (single or follower) will store metadata (default is 1)

METADATA_DOWNLOAD_FREQ

(pro only) how often (in hours) leader will download metadata from followers (default is 0 - which means never)

METADATA_CRCCHECK_FREQ

(pro only) how often (in hours) leader will check metadata consistency with followers (also saves metadata ; default is 24)

METADATA_DOWNLOAD_LIMIT

(pro only) metadata downloading limit in mebibytes per second (0 means no limit, default is 50 - 50MiB/s = ~52.4MB/s)

DEPUTY_ELECTION_TIME

(pro only) how many seconds until master-elect turns into master-deputy (a.k.a master-readonly) if there are not enough chunkservers connected to become master-leader (default is 0 - which means never)

BACK_META_KEEP_PREVIOUS

number of previous metadata files to be kept (default is 1)

CHANGELOG_PRESERVE_SECONDS

how many seconds of change logs have to be preserved in memory (default is 5000; this sets the minimum, actual number may be a bit bigger due to logs being kept in 5k blocks; zero disables extra logs storage)

CHANGELOG_PRESERVE_MB

how much RAM can be used to preserve change logs in memory in megabytes (deafult is 500; this sets the minimum, actual number may be a bit bigger due to logs being kept in 5k blocks; zero disables extra logs storage); master will always preserve the lower number of 5k blocks defined by CHANGELOG_PRESERVE_SECONDS and CHANGELOG_PRESERVE_MB, so be sure to set both appropriately

CHANGELOG_SAVE_MODE

Changelog save mode (default is 0)
0 - write in background by different process (less safe, but doesn't make master stop in case of heavy hdd load)
1 - write in foreground without syncing data (master waits for every changelog to be saved to hdd, but without syncing - a little more safe than the background option, but may cause master to stop and wait for flushing hdd buffers)
2 - write in foreground with fsync after each write (very safe, but may make your master very slow unless you have very sophisticated hardware)

MISSING_LOG_CAPACITY

how many missing chunks will be stored in master (up to 100*MISSING_LOG_CAPACITY bytes of memory will be allocated ; default value is 100000)

CONFIGURATION_TEST_PERIOD

(pro only) how often (in hours) master leader should run a test, that checks if configuration options are consistent betewen modules (default is 24, max possible value is 168, 0 means no tests are performed)

SYSLOG_MIN_LEVEL

minimum level of messages that will be reported by master; levels in order of importance: ERROR, WARNING, NOTICE, INFO, DEBUG (default is INFO)

SYSLOG_ELEVATE_TO

reported messages of level lower than set by this option will be elevated to this level (i.e. if SYSLOG_MIN_LEVEL is set to DEBUG and SYSLOG_ELEVATE_TO is set to NOTICE, all INFO and DEBUG messages will be sent to syslog as NOTICE ; default is NOTICE)

COMMAND CONNECTION OPTIONS

Changes in this section require only process reload.

MATOML_LISTEN_HOST

IP address to listen on for metalogger, (pro only) masters and supervisors connections (* means any)

MATOML_LISTEN_PORT

port to listen on for metalogger, (pro only) masters and supervisors connections

MATOML_TIMEOUT

default timeout in seconds for master-metalogger (or master follower, pro only) connection (default is 10)

MATOML_FORCE_TIMEOUT

forced timeout in seconds for master-metalogger (or master follower, pro only) connection (default is 0 - do not force timeouts)

MASTER-LEADER CONNECTION OPTIONS

Changes in this section require only process reload.

MASTER_HOST

(pro only) MooseFS master host (default is mfsmaster)

MASTER_RECONNECTION_DELAY

(pro only) delay in seconds before next try to reconnect to master-leader if not connected (default is 5)

MASTER_TIMEOUT

(pro only) timeout in seconds for master-leader connections (default is 10)

BIND_HOST

(pro only) local address to use for connecting with master-leader (default is *, i.e. default local address)

CHUNKSERVER CONNECTION OPTIONS

Changes in this section require only process reload.

MATOCS_LISTEN_HOST

IP address to listen on for chunkserver connections (* means any)

MATOCS_LISTEN_PORT

port to listen on for chunkserver connections

MATOCS_TIMEOUT

default timeout in seconds for master-chunkserver connection (default is 10)

MATOCS_FORCE_TIMEOUT

forced timeout in seconds for master-chunkserver connection (default is 0 - do not force timeouts)

AUTH_CODE

Optional authentication string. When defined - then only chunkservers with the same AUTH_CODE are allowed to connect to this master. When not defined (default) - then all chunkservers are allowed. If you want to switch on chunkserver authentication, then first define AUTH_CODE in all your chunkservers (and reload/restart them), then define this option in master and reload/restart it. Remember, that after reload currently connected chunkservers are NOT disconnected. New AUTH_CODE will be used only when chunkservers will make a new connection.

REMAP_BITS, REMAP_SOURCE_IP_CLASS, REMAP_DESTINATION_IP_CLASS

Optional IP class remapping. Remap chunkserver IP addresses with first REMAP_BITS of IP equal to first REMAP_BITS of REMAP_SOURCE_IP_CLASS. During remapping system will set first REMAP_BITS of IP address to first REMAP_BITS of REMAP_DESTINATION_IP_CLASS. All three option need to be defined for the remapping to work.

MULTILAN_BITS, MULTILAN_CLASSES

Optional LAN IP class remapping. Remap chunkservers IP addresses with first MULTILAN_BITS of client's IP for all client IPs matching one of the MULTILAN_CLASSES (classes must be separated by comma). All masters and chunkservers must have valid IPs from each of the MULTILAN_CLASSES classes. Both options need to be defined for the remapping to work.

MULTILAN_IPMAP_FILENAME

alternate location/name of mfsipmap.cfg file. This file defines optional custom IP mapping.

CHUNKSERVER WORKING OPTIONS

Changes in this section require only process reload.

REPLICATIONS_DELAY_INIT

initial delay in seconds before starting replications (default is 60)

REPLICATIONS_RESPECT_TOPOLOGY

whether to make undergoal replications respect topology (default is 0)
0 - do not respect topology
1 - pick a destination server at random, but then choose the best source server
2 - try to find a destination server in the same rack as one of the existing copies and then replicate the chunk locally (in the same rack)

CREATIONS_RESPECT_TOPOLOGY

whether new chunks should be recorded with respect to topology (default is 0)
0 - do not respect topology
N (N>0) - first try to create new chunks on servers with topological distance LOWER than N from the client; if not possible, for example because of storage class, chunk servers being busy or lacking space, then try servers with distance greater or equal to N

CHUNKS_UNIQUE_MODE

avoid using same ip/rack for different chunk copies (default is 0)
0 - ignore ip and rackid (standard behaviour)
1 - avoid storing more than one copy on chunkservers using same IP number
2 - avoid storing more than one copy on chunkservers using IP number from the same rack id
NOTICE! This parameter is available for backward compatibility purposes and should not be set to a value other than 0. Instead, distinguish feature from storage classes should be used. However, if it is set to a value other than 0, it will override any distinguish definintion in storage classes. For more informations about storage classes and distinguish feature refer to mfsscadmin (1) manual.

CHUNKS_LOOP_MAX_CPS

Chunks loop shouldn't check more chunks per seconds than given number (default is 100000)

CHUNKS_LOOP_MIN_TIME

Chunks loop shouldn't be done in less seconds than given number (default is 300)

CHUNKS_SOFT_DEL_LIMIT

Soft maximum number of chunks to delete on one chunkserver (default is 10)

CHUNKS_HARD_DEL_LIMIT

Hard maximum number of chunks to delete on one chunkserver (default is 25)

CHUNKS_WRITE_REP_LIMIT

Maximum number of chunks to replicate to one chunkserver (default is 2,1,1,4,4 - see NOTES)

CHUNKS_READ_REP_LIMIT

Maximum number of chunks to replicate from one chunkserver (default is 10,5,2,5,10 - see NOTES)

CS_HEAVY_LOAD_THRESHOLD

Threshold for chunkserver load. (default is 150 - see NOTES)

CS_HEAVY_LOAD_RATIO_THRESHOLD

Threshold ratio for chunkserver load (default is 3.0 - see NOTES)

CS_HEAVY_LOAD_GRACE_PERIOD

Defines how long chunkservers will remain in 'grace' mode (default is 900 - see NOTES)

ACCEPTABLE_PERCENTAGE_DIFFERENCE

Maximum percentage difference between space usage of chunkservers (default is 1 = 1%)

PRIORITY_QUEUES_LENGTH

Length of priority queues (for endangered, undergoal etc. chunks - chunks that should be processed first - default is 1000000)

CS_MAINTENANCE_MODE_TIMEOUT

Maximum time server can be in maintenance mode (default value is 0 - which means 'forever'); for value formatting see TIME

CS_TEMP_MAINTENANCE_MODE_TIMEOUT

Maximum time server can be in "temporary" maintenance mode (server is switched to this mode whenever it is stopped gracefully, after reconnection server is switched back to normal mode automatically ; default value: 30m); for value formatting see TIME

CS_DAYS_TO_REMOVE_UNUSED

How many days unused (disconnected) chunkserver should be kept in master data structures (valid values: 0 - 365 ; 0 means indefinitely ; default value: 7)

CLIENTS CONNECTION OPTIONS

Changes in this section require only process reload.

MATOCL_LISTEN_HOST

IP address to listen on for client (mount) connections (* means any)

MATOCL_LISTEN_PORT

port to listen on for client (mount) connections

MATOCL_TIMEOUT

default timeout in seconds for master-client connection (default is 10)

MATOCL_FORCE_TIMEOUT

forced timeout in seconds for master-client connection (default is 0 - do not force timeouts)

RESTRICT_INCOMPATIBLE_CLIENT_VERSIONS

Whether MooseFS should prevent connections from clients that are unable to read all data (especially erasure encoded data). Default is 1 - prevent connections. If this option is set to 0, clients that try to read data in a format they do not understand will return read errors. Use with caution.

CLIENTS WORKING OPTIONS

Changes in this section require only process reload.

SESSION_SUSTAIN_TIME

How long to sustain a disconnected client session (default is 1 day); for value formatting see TIME

FILE SYSTEM OPTIONS

Changes in this section require only process reload.

QUOTA_DEFAULT_GRACE_PERIOD

Default grace for soft quota (default is 7 days); for value formatting see TIME

ATIME_MODE

Set atime modification mode (default is 2 : similar to 'relatime' - see NOTES) .TP KEEP_EMPTY_FILES_IN_TRASH Move empty files to trash after unlink? (default is 0 - delete empty files immediately regardless of trash retention settings)

RESERVE_SPACE

Set amount of space reserved for superuser (default is 0 = do not reserve space for superuser - see NOTES)

MAX_ALLOWED_HARD_LINKS

Define limit for number of hardlinks allowed for one object (default is 32767; possible values are from 8 to 65000)

INODE_REUSE_DELAY

Delay time after which inodes of deleted objects will be reused. BE AWARE if you change this value below 1 day you MUST ensure that this value is higher than any of the following timeouts in all clients: mfsattrcacheto, mfsxattrcacheto, mfsentrycacheto, mfsdirentrycacheto, mfsnegentrycacheto, mfssymlinkcacheto. (default is 1d; possible values are from 300 to 3000000 seconds); for value formatting see TIME TP DEFAULT_EC_DATA_PARTS How many data parts should the system use when old style EC definition is in use (@n instead of @8+n or @4+n; default is 8; possible values are 4 or 8)

TIME

For config variables that define time without requiring a single, specific unit, time can be defined as a number of seconds (integer) or a time period in one of two possible formats:

first format: #.#T where T is one of: s-seconds, m-minutes, h-hours, d-days or w-weeks; fractions of seconds will be rounded to full seconds

second format: #w#d#h#m#s, any number of definitions can be ommited, but the remaining definitions must be in order (so #d#m is still a valid definition, but #m#d is not); ranges: s,m: 0 to 59, h: 0 to 23, d: 0 t o 6, w is unlimited and the first definition is also always unlimited (i.e. for #d#h#m d will be unlimited)

Examples:

1.5h is the same as 1h30m, is the same as 90m, is the same as 5400s, is the same as 5400

2.5d is the same as 2d12h, is the same as 60h; 1d36h is not a valid time period (h is not the first definition, so it is bound by range 0 to 23)

1.03m is the same as 62s (61.8 seconds will be rounded up to 62)

NOTES

Chunks in master are tested in a loop. Speed (or frequency) is regulated by two options: CHUNKS_LOOP_MIN_TIME and CHUNKS_LOOP_MAX_CPS. First defines minimal time between iterations of the loop and second defines maximal number of chunk tests per second. Typically at the beginning, when the number of chunks is small, time is constant, regulated by CHUNK_LOOP_MIN_TIME, but the when number of chunks becomes bigger, then time of loop can increase according to CHUNKS_LOOP_MAX_CPS.

Example: CHUNKS_LOOP_MIN_TIME is set to 300, CHUNKS_LOOP_MAX_CPS is set to 100000 and there is 1000000 (one million) chunks in the system. 1000000/100000 = 10, which is less than 300, so one loop iteration will take 300 seconds. With 1000000000 (one billion) chunks the system needs 10000 seconds for one iteration of the loop.

Deletion limits are defined as 'soft' and 'hard' limit. When number of chunks to delete increases from loop to loop, current limit can be temporarily increased above soft limit, but never above hard limit.

Replication limits are divided into five cases:

  • first limit is for endangered chunks (chunks with only one copy)

  • second limit is for undergoal chunks (chunks with number of copies lower than specified goal)

  • third limit is for rebalance between servers with space usage close to arithmetic mean

  • fourth limit is for rebalance between other servers (very low or very high space usage)

  • fifth limit is for recovery replications caused by I/O operations (read/write)

Usually first number should be greater than or equal to second, second greater than or equal to third, and fourth greater than or equal to third ( 1st >= 2nd >= 3rd <= 4th ). Fifth limit should be equal or greater than any of the other limits. If one number is given, then all limits are set to this number (for backward compatibility). If only four numbers are given, the fifth will be set as maximum of the four (also for backward compatibility).

Whenever chunkserver load is higher than CS_HEAVY_LOAD_THRESHOLD and CS_HEAVY_LOAD_RATIO_THRESHOLD times higher than average load, then chunkserver is switched into 'grace' mode. Chunkserver stays in grace mode for CS_HEAVY_LOAD_GRACE_PERIOD seconds.

There are five possible values for ATIME_MODE (all other values are treated as 0):

  • 0 = Always modify atime for files, folders and symlinks.

  • 1 = Always modify atime but only in case of files (do not modify atime in case of folders and symlinks).

  • 2 = Modify atime only when it is lower than ctime or mtime and when current time is higher than ctime or mtime respectively, also modify atime when current atime is older than 24h. Do it for all objects during access (like "relatime" option in Linux).

  • 3 = Same as above but only in case of files. In case of folders and symlinks do not modify atime.

  • 4 = Never modify atime during access (like "noatime" option).

You can reserve space for superuser using RESERVE_SPACE option. You can define it as number of bytes, percent of total space, capacity of biggest chunkserver, etc.

  • # or #B = number of bytes reserved for superuser. Standard metric prefixes can be used - SI and IEC (k,K,M,Mi,G,Gi etc.)

  • #% or #.#% = percent of total capacity of MooseFS instance

  • #U or #.#U = multiplies of "U" value; U is defined as maximum number of bytes currently used by a single chunkserver

  • #C or #.#C = multiplies of "C" value; C is defined as maximum total capacity of a single chunkserver

When your network has two (or more) IP classes you may want to use one network for standard communication between MFS modules and separate network only for I/O. It can be done by setting REMAP_BITS, REMAP_SOURCE_IP_CLASS and REMAP_DESTINATION_IP_CLASS. When you set these options then master will change internally IP addresses of chunkservers and will send them as chunk locations, so clients will make connections with chunkservers using new (destination) IP for all I/O, but still communicate with master using original (source) IP. Also chunkservers will use original IP to communicate with master, but they will use new IP's to communicate between themselves during replication. Beware that all clients and chunkservers must have access to both networks, but masters, metaloggers etc. will need only access to the source network.

When your clients are separated into two or more LAN or VLAN networks, you may want them to connect to masters and chunkservers using IPs from their network. It can be done by setting MULTILAN_BITS and MULTILAN_CLASSES. Each time a client connects, the master will check whether the connection came from one of the defined LAN classes and if yes, it will remap the first MULTILAN_BITS of any chunkserver IP before it sends the chunkserver's IP to the client. Connections coming from other IP addresses will be treated as usual (i.e. original chunkserver IPs will be sent in response). All masters and chunkservers need to have one IP from each of the defined MULTILAN_CLASSES and one module (chunkserver or master) needs to have the same IP suffix in each class. Proper DNS configuration is also required: clients in each LAN must either get different IP when querying DNS about master host or must use different master host names that are resolved to IPs in their class. See mfsmount (8), mfsbdev (8) for more info about master host name.

The two above sets of options can be used together. One purpose would be to create a separate network for replication of data between chunkservers while also maintaning several separate networks (LANs) for separate sets of clients. For example, with the following configuration:

REMAP_BITS = 24
REMAP_SOURCE_IP_CLASS = 10.0.0.0
REMAP_DESTINATION_IP_CLASS = 10.0.1.0
MULTILAN_BITS = 24
MULTILAN_CLASSES = 192.168.1.0, 192.168.2.0, 192.168.3.0, 10.0.1.0

all network traffic from clients matching one of the LAN classes will be handled on that network (so a client with IP 192.168.1.17 will connect to the master and chunkservers using IPs with prefix 192.168.1 and a client with IP 192.168.2.13 will connect to the master and chunkservers using IPs with prefix 192.168.2), all metadata traffic between master and chunkservers will be handled on network with prefix 10.0.0 and all direct communication between chunkservers (i.e. chunk replications) will be handled on network with prefix 10.0.1. Be aware that any client from outside of the defined LAN classes will connect to the chunkservers via IPs defined by REMAP_DESTINATION_IP_CLASS. This also assumes proper DNS configuration, that is, if master server uses IP suffix 1, a client with IP 192.168.1.17 should resolve master host name as 192.168.1.1 and a client with IP 192.168.2.13 should resolve master host name as 192.168.2.1. A client from outside of the defined LAN classes may use any of master server IPs, although preferably 10.0.0.1

Masters save metadata to a file on a local disk. The exact times of these operations are regulated by four variables: METADATA_SAVE_OFFSET, METADATA_SAVE_FREQ, (pro only) METADATA_DOWNLOAD_FREQ, (pro only) METADATA_CRCCHECK_FREQ.

A single master will save metadata every METADATA_SAVE_FREQ hours. First save of the day happens at midnight, every other one is after METADATA_SAVE_FREQ from previous one. This can be changed with METADATA_SAVE_OFFSET. METADATA_SAVE_OFFSET set to a single value will mean minutes. So settings like this:

METADATA_SAVE_FREQ = 2
METADATA_SAVE_OFFSET = 15

will mean saving at 00:15, 02:15, 04:15 etc., up to 22:15. If hour is provided, this is also taken into account, but only as modulo, so:

METADATA_SAVE_FREQ = 2
METADATA_SAVE_OFFSET = 1:15

will behave exactly the same as:

METADATA_SAVE_FREQ = 2
METADATA_SAVE_OFFSET = 5:15

and will save metadata at 01:15, 03:15, 05:15 etc.

All times mentioned above are calculated in UTC, unless "L" suffix is used. So, if your servers are in CET zone (UTC-1), this setting:

METADATA_SAVE_FREQ = 2
METADATA_SAVE_OFFSET = 1:15

will mean the master will save metadata at 01:15, 3:15, etc. UTC, which means 0:15, 2:15, etc. local (CET) time. To save at 01:15 (and every 2 hours from that) local time, you need to write:

METADATA_SAVE_FREQ = 2
METADATA_SAVE_OFFSET = 1:15L

(pro only) With multiple masters the leader will save metadata according to the setting from METADATA_CRCCHECK_FREQ, but also taking into account the offset. So setting like:

METADATA_CRCCHECK_FREQ = 12
METADATA_SAVE_FREQ = 2
METADATA_SAVE_OFFSET = 1:15

will mean that the leader saves (and compares against followers) metadata at 01:15 and 13:15, while followers save at 01:15, 03:15, 05:15 etc.

METADATA_DOWNLOAD_FREQ means the leader will download metadata from a follower and that will also take into account the METADATA_SAVE_OFFSET in the manner identical that METADATA_CRCCHECK_FREQ does.

Times mentioned above are calculated in UTC. So, if your servers are in PST zone (UTC-8), this setting:

METADATA_CRCCHECK_FREQ = 12
METADATA_SAVE_FREQ = 2
METADATA_SAVE_OFFSET = 1:15

will mean the leader will save metadata at 01:15 and 13:15 UTC, which means 17:15 and 5:15 (5:15 PM and 5:15 AM) local (PST) time. To save at 01:15 and 13:15 (1:15 AM and 1:15 PM) local time, you need to write:

METADATA_CRCCHECK_FREQ = 12
METADATA_SAVE_FREQ = 2
METADATA_SAVE_OFFSET = 1:15L

Setting:

METADATA_SAVE_OFFSET = 0L

can be used to adjust saving times to local timezone without changing the default metadata saving schedule.

Copyright (C) 2024 Jakub Kruszona-Zawadzki, Saglabs SA

This file is part of MooseFS.

MooseFS is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 2 (only).

MooseFS is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with MooseFS; if not, write to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02111-1301, USA or visit http://www.gnu.org/licenses/gpl-2.0.html

SEE ALSO

mfsmaster(8), mfsexports.cfg(5) mfstopology.cfg(5) mfsipmap.cfg(5)