
Does Linux Need a Defragmenter?

In computing terminology, file system fragmentation, sometimes also called file system aging, is the inability of a file system to lay out related pieces of data sequentially, or contiguously, on the disk storage medium. It is an inherent problem in storage-backed file systems that allow in-place or live modification of their contents, and it is a special case of data fragmentation.

File system fragmentation increases the disk head movement, or seeks, required to fetch a whole file, which hurts the throughput and performance of disk reads and writes. The remedial action is to reorganize files and free space back into contiguous areas or blocks, so that all the pieces of a file's data are kept together in one place, a process called defragmentation.

OK, give me an example, please?

Let's attempt to explain, in a simple, non-technical manner, why some file systems suffer more from fragmentation than others. Rather than stumble through lots of dry technical explanation, let's work through it visually.

This is a representation of a small hard drive, as yet completely empty. The a-z labels along the top and the left side of the grid are used to locate each individual byte of data, in the form of a 2D matrix. It is much the same as an address field (the y-axis) and a list of location offsets (the x-axis) within that address space.

Let's begin with a simple filesystem that most users are familiar with, and one that will need defragmenting occasionally: FAT. FAT remains important to both Windows and Linux users, if only because it is still widely used on USB flash drives; unfortunately, it suffers badly from fragmentation.

Let's add a file to our filesystem, and our hard drive now looks something like this:

What you see can be read as follows:

The first four rows of the disk are given over to a "Table of contents", or TOC. This TOC stores the location of every file present on the filesystem. In the above example, the TOC lists one file, named "hello.txt", and says that the contents of this file are to be found between locations ae and le. We look at those locations and see that the file contents are "Hello, world".
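
To make that concrete, here is a minimal Python sketch of the same idea: a toy "disk" modelled as a flat array of byte slots, plus a TOC that maps each file name to its start offset and length. This is purely illustrative and is not how FAT (or any real filesystem) is actually implemented.

    # A toy "disk": a fixed number of byte slots, plus a table of contents (TOC)
    # mapping file name -> (start offset, length). Purely illustrative.
    DISK_SIZE = 26 * 26              # one slot per cell of the a-z by a-z grid
    disk = [None] * DISK_SIZE        # None means the slot is free
    toc = {}                         # e.g. {"hello.txt": (0, 12)}

    def write_file(name, data, start):
        """Store data contiguously from 'start' and record it in the TOC."""
        for i, byte in enumerate(data):
            disk[start + i] = byte
        toc[name] = (start, len(data))

    def read_file(name):
        """Look the file up in the TOC, then read its slots back in order."""
        start, length = toc[name]
        return "".join(disk[start:start + length])

    write_file("hello.txt", "Hello, world", start=0)
    print(read_file("hello.txt"))    # -> Hello, world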

OK, that was simple. Now let's try to add another file, "bye.txt", to the filesystem.

As you can see, the contents of the second file have been added immediately after the contents of the first file. The logic here is that if all your files are kept together, accessing them will be quicker and easier, which means faster seek operations. One of the slowest parts of a hard drive is its read/write head, the disk's "stylus": the less it has to move, the quicker your read/write times will be. This all looks fine and makes sense, but the problem it causes can be seen when we decide to edit our first file.

So what's the problem then?

Let’s say we want to add some more content to our first file, hello.txt. If you look closely, you will realize that there is no room left on our filesystem immediately after the original contents of hello.txt, since the contents of bye.txt are in the way.

We now have only two options to solve this, and neither is ideal:

  1. We move the hello.txt file. We delete it from its original position and write the new, bigger file somewhere else, either right after the end of the second file or in some other part of the filesystem that has a big enough space. The downside of this approach is that a lot of reading, writing and seeking is involved, and the bigger the file, the higher the overhead.

  2. We fragment (split) the file, so that it exists in more than one place, while making sure there are no empty spaces left in between (i.e. no wasted bytes). This seems better than the first approach: it requires fewer reads and writes, so it is quick to do, but it will slow down every subsequent sequential access to hello.txt.

To illustrate, here is approach 1:


and here is approach 2:

FAT-based allocation follows approach 2. At first glance, compared to approach 1, approach 2 makes more sense and looks better. But over time, as the filesystem fills up, approach 2 performs worse and worse, and adopting it is why such filesystems need regular defragmentation. All files are placed right next to each other, consistent with the faster-disk-seeks ideology, so any time a file grows it fragments, and any time a file shrinks it leaves a gap. Soon the hard drive becomes a mass of fragments and gaps, and disk access performance suffers.
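
As a rough sketch of approach 2 (hypothetical code, not FAT's actual on-disk format, which chains clusters through its file allocation table), here is the toy model again, this time packing every file immediately behind the previous one and recording each file as a list of extents. Growing a file once the space behind it is taken simply adds a second extent:

    # Approach 2 in miniature: files are packed together at the start of the
    # disk, and a file that grows gets a second extent at the next free slot.
    # A file is "fragmented" once it has more than one extent.
    disk = [None] * 64
    toc = {}                              # name -> list of (start, length) extents

    def next_free():
        return disk.index(None)           # first free slot from the start

    def append(name, data):
        start = next_free()
        for i, byte in enumerate(data):
            disk[start + i] = byte
        toc.setdefault(name, []).append((start, len(data)))

    append("hello.txt", "Hello, world")   # occupies slots 0-11
    append("bye.txt", "Goodbye, world")   # packed right behind it, slots 12-25
    append("hello.txt", " Hello, again")  # no room after slot 11 -> new extent at 26

    print(toc["hello.txt"])               # [(0, 12), (26, 13)] -> two extents: fragmented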

Hmm... so what's the alternative?

Let’s see what happens when we use a different philosophy. The first type of filesystem is ideal if you have a single user accessing files in more or less the order they were created, one after the other, with very few edits (think MS-DOS). Linux, however, was always intended as a multi-user system, so it was guaranteed that more than one user would be trying to access more than one file at the same time. That was the basic philosophy, and it calls for a different approach to storing files. When we create "hello.txt" on a more Linux-focused filesystem, it looks like this:

and then when another file "bye.txt" is added:

The cleverness of this approach is that the disk’s stylus can sit in the middle, and most files, on average (in practice the vast majority of the time), will be fairly nearby. That is how working with averages pays off, after all: make the most probable position your starting point, and you have a higher probability of getting a hit close by.

Now when we add more content to our files in this filesystem, let's observe how much trouble it causes:

That’s right: Absolutely none.

The first filesystem tries to put all files as close to the start of the hard drive as it can, so it constantly fragments files when they grow and there is no adjacent free space available. The second scatters files all over the disk, so there is plenty of free space around each file if its size changes. It can also rearrange files on the fly, since it has plenty of empty space to shuffle things around in. Defragging the first type of filesystem is a far more intensive process and not really practical to run during normal system use.
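
By way of contrast with the earlier sketch, here is a hypothetical scatter-style placement policy in the same toy model: each new file is dropped into the middle of the largest free region, leaving headroom on both sides so it can usually grow in place. This is a gross simplification of what extent-based Linux filesystems such as ext4 actually do, but it captures the idea:

    # A scatter-style policy: put each new file in the middle of the largest
    # free region, so there is headroom on both sides and the file can grow
    # in place without fragmenting. For illustration only.
    disk = [None] * 64
    toc = {}                                        # name -> (start, length)

    def largest_free_region():
        best_start, best_len, run_start = 0, 0, None
        for i, slot in enumerate(disk + [True]):    # sentinel closes the last run
            if slot is None and run_start is None:
                run_start = i
            elif slot is not None and run_start is not None:
                if i - run_start > best_len:
                    best_start, best_len = run_start, i - run_start
                run_start = None
        return best_start, best_len

    def create(name, data):
        start, length = largest_free_region()
        pos = start + (length - len(data)) // 2     # drop it in the middle of the gap
        for i, byte in enumerate(data):
            disk[pos + i] = byte
        toc[name] = (pos, len(data))

    def grow(name, extra):
        start, length = toc[name]
        # In this toy model the slots right after the file are almost always free.
        assert all(disk[start + length + i] is None for i in range(len(extra)))
        for i, byte in enumerate(extra):
            disk[start + length + i] = byte
        toc[name] = (start, length + len(extra))

    create("hello.txt", "Hello, world")
    create("bye.txt", "Goodbye, world")
    grow("hello.txt", " Hello, again")              # still a single contiguous run
    print(toc)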

The second type of filesystem rarely needs defragmentation. Fragmentation only becomes an issue on this latter type of system when a disk is so full that there just aren’t any gaps a large file can be put into without splitting it up. As long as the disk is less than about 80% to 85% full (a rule of thumb, really), this is very unlikely to happen.
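
That rule of thumb boils down to one question: is there still a single free gap at least as large as the data you want to write? A quick, purely illustrative check in the same toy model:

    # Fragmentation on a gap-style filesystem only becomes unavoidable when no
    # single free gap is big enough for the data being written.
    def free_gaps(disk):
        """Return the lengths of all contiguous runs of free (None) slots."""
        gaps, run = [], 0
        for slot in disk:
            if slot is None:
                run += 1
            elif run:
                gaps.append(run)
                run = 0
        if run:
            gaps.append(run)
        return gaps

    def can_store_unfragmented(disk, size):
        return any(gap >= size for gap in free_gaps(disk))

    # A nearly full disk: 100 slots, 85 in use, free space chopped into 15
    # scattered single-slot holes.
    disk = ["x"] * 100
    for i in range(0, 100, 7):
        disk[i] = None

    print(can_store_unfragmented(disk, 10))   # False: no gap of 10 slots exists,
                                              # so a 10-slot file must fragment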

It is also worth knowing that even when an OS says a drive is completely defragmented, fragmentation may still be present because of the way hard drives are physically built. A typical physical hard drive actually has multiple stacked disks, known as platters, inside it.

Seeking on a multi-platter disk is more involved, but the fundamental concepts stay the same. Let’s say that our example hard drive actually spans two platters, with aa to zm being the first and an to zz the second:

The following file would be considered non-fragmented, because it runs from row m straight into row n, but this ignores the fact that the stylus will have to move from the very end of one platter to the very beginning of the other in order to read this file.
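
In the toy grid this is easy to spot with a little address arithmetic. Treating each two-letter location as column plus row, with rows a to m on the first platter and rows n to z on the second, a hypothetical sketch like this shows that a "contiguous" file running from the end of row m into row n in fact touches both platters:

    # Each address is column-letter + row-letter; rows a-m sit on platter 1,
    # rows n-z on platter 2. A file can be logically contiguous on paper and
    # still straddle both platters.
    def platter_of(address):
        row = address[1]
        return 1 if row <= "m" else 2

    file_slots = ["xm", "ym", "zm", "an", "bn", "cn"]   # end of row m into row n
    print({platter_of(a) for a in file_slots})          # {1, 2}: spans both platters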

In Conclusion

I hope this article has helped you understand why some filesystems can suffer badly from fragmentation, whilst others barely suffer at all. It should also help explain why Windows systems running FAT filesystems turn sluggish as they age, and why no defragging software came with your Linux installation.



Questions? Thoughts? Feedback? Corrections? Please do let us know. Thank you.
