Update on libradosfs

By Joaquim Rocha

This is a draft. It might be incomplete or have errors.

Two months ago I introduced here the project that I have been working on at CERN — libradosfs.

During these past couple of months the project has gained many bug fixes and new features, among which I highlight the Python bindings (thanks to my colleague Michał) and the Quota functionality (now that the numops CLS has been merged into Ceph, yay!).
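
To give an idea of what the bindings enable, here is a rough sketch of creating and writing a file from Python. The module name and every call below are assumptions made purely for illustration (they are not the actual binding API), so refer to the project's own examples for the real names.

    # Purely illustrative sketch: the module, class and method names here are
    # guesses, NOT the real libradosfs Python binding API. The point is only to
    # show the kind of mount-less file workflow the bindings make possible.
    import radosfs  # hypothetical module name

    fs = radosfs.Filesystem()                        # hypothetical filesystem object
    fs.init("client.admin", "/etc/ceph/ceph.conf")   # hypothetical init(user, conf) call

    f = radosfs.File(fs, "/demo/hello.txt")          # hypothetical file handle
    f.create()                                       # create the file's metadata entry
    f.write(b"hello from Python", 0)                 # hypothetical write(data, offset)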

However, this update is really an excuse to revisit the numbers I gave last time. I spent some time benchmarking and looking at how I could improve them, basically trying to reduce the round trips to the server.

In the process I found out something weird: the faster cluster (using SSD disks) was actually giving me worse numbers than the supposedly not so fast cluster (HDD disks); specifically, the SSD cluster was twice as slow. After digging for a while, I realized that it was the reading, not the writing, that was slower and, more importantly, that this was true not only for libradosfs but also for Ceph's own rados benchmark!
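
For anyone who wants to reproduce that read vs. write comparison on their own cluster, the sketch below simply drives Ceph's rados bench tool from Python: it writes objects into a pool for a while (keeping them around with --no-cleanup), reads them back sequentially, and then removes them. The pool name is a placeholder; use a throwaway pool.

    # Sketch: compare write and sequential read throughput with rados bench.
    # Assumes the "rados" CLI is installed and can reach the cluster;
    # "benchpool" is a placeholder pool name.
    import subprocess

    POOL = "benchpool"
    SECONDS = "60"

    # Write phase; --no-cleanup keeps the objects so the read phase has data.
    subprocess.run(["rados", "bench", "-p", POOL, SECONDS, "write", "--no-cleanup"],
                   check=True)

    # Sequential read phase over the objects written above.
    subprocess.run(["rados", "bench", "-p", POOL, SECONDS, "seq"], check=True)

    # Remove the benchmark objects afterwards.
    subprocess.run(["rados", "-p", POOL, "cleanup"], check=True)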

In the end, the cluster admin investigated it and found that the main culprit was TCMalloc running with its default cache size (16 MB). After increasing this value to 256 MB, the numbers make sense again: the SSD cluster is now faster than the HDD one (as it should be), and the libradosfs benchmark operations that mainly use this cluster are around 4x faster than in my last post.
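
The usual way to change this is gperftools' TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES environment variable (assuming that is indeed the setting the admin tuned). As a small sanity check, the sketch below compares the current value against the 256 MB that fixed things here; it assumes the OSDs inherit the variable from the environment the script runs in, and that an unset variable means the 16 MB default mentioned above.

    # Sanity-check sketch for the TCMalloc thread cache size.
    # Assumption: the OSD daemons pick up TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES
    # from the same environment this script runs in; if unset, assume the
    # 16 MB default mentioned in the post.
    import os

    MB = 1024 * 1024
    TARGET = 256 * MB           # value that fixed the SSD cluster's read speed here
    ASSUMED_DEFAULT = 16 * MB   # default cache size mentioned above

    current = int(os.environ.get("TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES",
                                 ASSUMED_DEFAULT))
    if current < TARGET:
        print("thread cache is %d MB; consider raising it to 256 MB" % (current // MB))
    else:
        print("thread cache is %d MB; looks fine" % (current // MB))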

Here are the updated measurements:

File size                 Avg files/sec
0 (touching a file)       170.12
1KB (inline file)         80.43
1MB (inline + inode)      10.94
500MB (inline + inode)    0.22
1GB (inline + inode)      0.11

Just like last time, the tests were run for 5 minutes from my development machine (an i3 3.3 GHz dual core machine with 8 GB RAM running Fedora 22), using a cluster of 8 machines with 128 OSDs. As you can notice, the bigger the file size, the closer the values are to the previous ones; this is because nothing was changed in the HDD cluster, where the data pools are stored.

Sharing is caring!