When Replication Isn’t Atomic: Lessons from MooseFS Chunk Migration

September 24th, 2025 | MooseFS Team

A real-world MooseFS migration story recently shed light on the dangers of certain configuration decisions in distributed file systems. What seemed like a replication bug turned out to be a reminder of why best practices matter, and why relying on single-copy storage is a recipe for trouble.

The Problem: Missing Chunks During Replication

During a migration of millions of chunks to a more reliable Chunkserver, a user noticed something worrying: after rebooting the destination server, several chunks went missing.

Key observations:

  • The replication goal was reduced from two replicas to one before the migration was fully complete.
  • Three chunks were lost, appearing as “Invalid copies” in mfsfileinfo.
  • The errors originated from read issues on the ageing source HDDs.

At first glance, this looked like a flaw in MooseFS replication – as if redundant copies were deleted before the system ensured safe placement on the new server.

Why the Chunks Went Missing

In reality, replication in MooseFS is safe: a Chunkserver only reports success once an entire chunk is written and synced (HDD_FSYNC_BEFORE_CLOSE = 1 ensures data is flushed to disk). Only after that confirmation does the Master instruct deletion of the old copy.
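That flush behavior is controlled on the Chunkserver side. A minimal fragment of mfschunkserver.cfg with the option enabled (the comment wording is ours):

```
# mfschunkserver.cfg
# call fsync() on a chunk before closing it, so a replication is only
# reported as successful once the data is physically on disk
HDD_FSYNC_BEFORE_CLOSE = 1
```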

So what went wrong?

The source copy itself was already deteriorating. With the replica goal reduced to one, MooseFS had no safety net. When it tried to replicate the bad copy, the corruption was discovered, leaving the file invalid.

Priorities in Chunk Management

MooseFS doesn’t try to evaluate which replica is “better” before deleting. Instead, it resolves chunk “issues” by priority:

  1. Endangered
  2. Undergoal
  3. Overgoal (too many copies)
  4. Wrong label (stored on the wrong class of server)

In this case, overgoal took precedence over migrating data to the correct label. That meant the system dropped extra copies before ensuring new ones were in place.

This priority order wasn’t chosen at random. Historically, “wrong label” had higher priority, but that caused disk space problems: deletions were blocked or delayed while MooseFS tried to replicate first. Changing the order solved capacity issues for most users, but it also means that reducing a dataset to a single replica is fragile – especially if the only remaining copy is on unreliable storage.
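The resolution order can be pictured as a simple ordered lookup. The sketch below is an illustration of the concept only, not MooseFS source code; the issue names and their ordering are taken from the discussion above:

```go
package main

import "fmt"

// issuePriority lists the chunk issue classes highest priority first,
// matching the ordering described above (illustrative names only).
var issuePriority = []string{"endangered", "undergoal", "overgoal", "wrong label"}

// highestPriorityIssue returns which of a chunk's current issues
// would be acted on first.
func highestPriorityIssue(issues map[string]bool) string {
	for _, name := range issuePriority {
		if issues[name] {
			return name
		}
	}
	return ""
}

func main() {
	// A chunk with an extra copy that also sits on the wrong server class:
	chunk := map[string]bool{"overgoal": true, "wrong label": true}
	fmt.Println(highestPriorityIssue(chunk)) // prints "overgoal"
}
```

The extra copy is deleted before any replication to the correct label is attempted, which is exactly the window the migration fell into.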

The Real Issue

The missing chunks were the result of:

  • Reducing replica count to one copy for data that wasn’t disposable.
  • Migrating from unreliable source storage where corruption had already set in.
  • Expecting MooseFS to make “smart” decisions about which replica to keep, even though the system doesn’t evaluate replica quality.

In other words, this was a configuration problem, not a software flaw.

Best Practices to Avoid Data Loss

  1. Always keep at least two replicas for any data you care about. One copy is never safe.
  2. Treat single-copy goals as disposable only – use them for caches or temporary calculations, never for production data.
  3. Monitor disk health – don’t wait for replication to reveal corruption.
  4. Understand MooseFS priorities – overgoal deletions are handled before wrong-label replication, so plan migrations and replica reductions accordingly.
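For reference, replica goals are managed with the standard MooseFS client tools. A hedged sketch – the mount point and paths here are hypothetical:

```
# keep at least two replicas of everything under this tree
mfssetgoal -r 2 /mnt/mfs/production-data

# check how many valid copies a file currently has
mfscheckfile /mnt/mfs/production-data/some-file
```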

Final Thought

Distributed file systems like MooseFS make trade-offs to balance efficiency, capacity, and safety. But those trade-offs assume sane configuration. Relying on a single copy of data – especially on unreliable hardware – is a gamble that MooseFS cannot protect you from.


Same Goal, Different Paths: Why Method Matters

August 28th, 2025 | MooseFS Team

In the world of system administration, you quickly learn that there’s rarely only one way to get something done. But what’s easy to forget is that two methods that look like they do the same thing can have wildly different performance, resource usage, and side effects.

We recently ran a set of stress tests in our MooseFS lab that made this point crystal clear. The goal was simple: delete a huge number of files – about 34 million per directory – but the way we did it changed the outcome dramatically.

The Test Setup

We built a test directory containing 34 million empty files split into two subdirectories. Using MooseFS’s snapshot mechanics, we replicated that directory into multiple copies so we could run tests in parallel.
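The copies were made with mfsmakesnapshot, which duplicates only metadata at first, so replicating 34 million entries is fast. The directory names below are hypothetical:

```
mfsmakesnapshot /mnt/mfs/testdir /mnt/mfs/testdir-copy1
mfsmakesnapshot /mnt/mfs/testdir /mnt/mfs/testdir-copy2
```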

We then tried two different deletion methods:

  1. The traditional POSIX way: rm -rf
  2. A custom Go script that removes files using Go’s standard-library os.RemoveAll() function

Test 1: rm -rf

For this test, we mounted the filesystem on 8 different machines and deleted 8 separate copies of the directory – each containing 34 million files.

Results:

  • Master CPU: ~99.9% usage, but still responsive to other commands.
  • Network load: Around 15 Mbit/s in / 0.5 Gbit/s out at two brief peaks; negligible otherwise.
  • Completion time: Just over 4 hours to delete 272 million files.

rm -rf works by listing the directory once, then performing a lookup and unlink for each file. It’s CPU-heavy, but relatively network-efficient.

Test 2: The Go Script

We modified the provided Go script so it would accept a path and directly remove files. The key difference: it deletes files in batches of 1024, then re-lists the entire directory before deleting the next batch. This behavior comes from Go’s standard-library os.RemoveAll() function.

Results:

  • Master CPU: ~55% usage
  • Network load: A steady 2.5 Gbit/s outbound
  • Estimated completion time: ~20 days (we stopped early)

Why so slow?
Because after each batch, the Go script:

  1. Closes the directory (releasedir)
  2. Opens it again (opendir)
  3. Re-reads millions of filenames (over the network!)

This repeated readdir cycle is hugely expensive in network terms and makes deletion impractically slow for large directories.

Why Virtualization Might Make It Worse

Our lab runs on bare metal, and even under load, the master node stayed responsive. In virtualized environments, especially if network performance is impacted by hypervisor overhead, these differences might be magnified. That could explain the extreme CPU usage and unresponsiveness some people see when using the Go script.

A Third Way: Snapshot Trick

MooseFS snapshots can offer a much faster way to delete massive directories – if you understand the trade-offs.

Snapshots in MooseFS are atomic lazy copies: only metadata is copied immediately, and data chunks are only duplicated when modified. If you set the snapshot flag on a directory and then remove it as a snapshot, the whole directory disappears almost instantly.

Example:

mfsseteattr -f snapshot -r my_directory_to_delete
mfsrmsnapshot my_directory_to_delete

On our test directory with 34 million files:

  • Setting the snapshot flag: 0.7 seconds
  • Removing the snapshot: 12 seconds (empty files)
  • Estimated with real data: ~24 seconds

This method does block the master briefly, but it’s predictable and far shorter than hours-long traditional deletion.

Takeaways

  1. The “same” operation can behave very differently depending on how it’s done.
    rm -rf and the Go script both “delete files,” but their performance, CPU load, and network footprint couldn’t be more different.
  2. Understand your filesystem’s mechanics.
    In MooseFS, readdir operations over huge directories are expensive, especially when repeated unnecessarily.
  3. Think about your environment.
    Virtualization can amplify inefficiencies – especially network-heavy ones.
  4. Sometimes unconventional is best.
    If you can handle a brief master-blocking event, the snapshot method can be a lifesaver.

Bottom line

Before you hit “Enter” on a seemingly routine command, remember: the how matters just as much as the what. The wrong method can turn a minutes-long task into a multi-day ordeal.

Looking Forward

After investigating this phenomenon, we made changes in MooseFS to improve how readdir works. It is now more resistant to the kind of repeated directory scans triggered by code like Go’s os.RemoveAll(), and the execution time of such scripts has improved noticeably.

That said, it’s important to remember that in some filesystems, deleting a file only marks it as removed. In those cases, repeatedly re-reading large directories can still lead to serious performance issues.

This improvement to readdir will be available in the next release of MooseFS.


How to install MooseFS

October 16th, 2018 | Wojciech Kostański

After preparation, you are ready for the installation process. This article walks you through installing the Master Server, Chunkserver, MooseFS CGI, MooseFS CLI, Metalogger, and the MooseFS client.


Things to do before installing MooseFS

September 21st, 2018 | Wojciech Kostański

Before you install MooseFS, you should set up DNS and add the MooseFS repository. Read this article to learn more and successfully complete the installation process!


10 MooseFS Best Practices to maximize performance!

August 27th, 2018 | Karol Majek

Here are 10 MooseFS Best Practices! Many people ask us about the technical aspects of setting up MooseFS instances.
To answer these questions, we are publishing a list of best practices and hardware recommendations. Follow them to achieve the best reliability of your MooseFS installation.
