Efficient File I/O, Part 2: Why is fwrite so slow?
My last post talked about some of the things I discovered when looking into how to optimize my current research project, ExoLabel. Since then, I’ve made some big improvements in speed, and I thought it would be worth it to break them down. Building efficient external memory algorithms is a really cool process; every potential inefficiency matters a lot more than with RAM-based algorithms, so you start to understand how the computer works at a deeper level.
The previous version of ExoLabel successfully clustered a network with around two billion nodes and 20 billion edges in roughly 24 hours. ExoLabel takes as input a tsv file containing edges specified as A B W, where A and B are node names and W is the weight of the edge that connects them. The core algorithm has four main steps:
- Read over the edgelist file once to reindex node names. Character strings are hard to work with, so we convert node names to 0-indexed integer values and also set up some files for later.
- Read over the edgelist file again to add the edges to a CSR-compressed format. This step is covered in my original post on ExoLabel.
- Cluster the network.
- Write the results to a file.
On that network of 2/20 billion nodes/edges, the timing breakdown looked like this:
- Step 1 took 6 hours
- Step 2 took 18 hours
- Step 3 took…30 minutes!
Not exactly what I was expecting to see, if I’m being honest. My obvious next question was: “why is step 2 taking so long??”. Let’s find out.
Where is the slowdown?
I’ve broken down the inner processes of this step before, but I’ll briefly review it here. The purpose of this step is to transform an edgelist file into something that’s easier to work with. The format we transform it into is called compressed sparse-row (CSR) format. This looks like the following:
Edgelist (no weights):
0 1
1 2
1 3
2 3
CSR:
0 1 3 4
1 2 3 3
The first n+1 values store offsets for each node, and the following numbers store the edges. For node i, we first look at values i and i+1 in the CSR file. If we were looking at node 1, we’d find the values 1 and 3. This means that the edges from node 1 are edges 1-2 in the file (0-indexed, and not including the final index). Skipping past the offsets, we see that the two edges from node 1 are 2,3. This is better for later processing because the neighbors of each node are stored in a contiguous block that we can easily access.
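To make that lookup concrete, here’s a minimal C sketch (not ExoLabel’s actual code) that assumes the offsets and edges have already been loaded into in-memory arrays:

#include <stdio.h>
#include <stdint.h>

// Print the neighbors of `node`, given CSR offsets and edges already in memory.
static void print_neighbors(const uint64_t *offsets, const uint64_t *edges, uint64_t node){
    uint64_t start = offsets[node];    // value i
    uint64_t end = offsets[node + 1];  // value i+1
    printf("neighbors of node %llu:", (unsigned long long)node);
    for(uint64_t j = start; j < end; j++)
        printf(" %llu", (unsigned long long)edges[j]);
    printf("\n");
}

int main(void){
    // the small example from above: offsets 0 1 3 4, edges 1 2 3 3
    uint64_t offsets[] = {0, 1, 3, 4};
    uint64_t edges[] = {1, 2, 3, 3};
    print_neighbors(offsets, edges, 1);  // prints: neighbors of node 1: 2 3
    return 0;
}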
The slowest part of this algorithm is building this file. We set up the scaffold of the CSR file in Step 1 by recording the node degree for each node (which gives us the offsets), and then in Step 2 we populate the edges themselves (the values after the offsets). The pseudocode basically looks like this:
f = edgelist_file
csr = csr_file
n = num_nodes
for edge in f:
    edge_start = edge[0]
    edge_end = edge[1]
    // get the offset for this edge's source node
    fseek(csr, edge_start, SEEK_SET)
    fread(offset, ..., csr)
    // write the edge into that node's block
    fseek(csr, n + offset + 1, SEEK_SET)
    fwrite(edge_end, ..., csr)
There’s some additional logic that goes into determining where in the block of values allocated for each node to write to, but it doesn’t really matter for this use-case. This code is pretty simple, and I wasn’t really sure how to improve it further.
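To see what that pattern actually costs, here’s roughly what handling a single edge looks like in real C (a sketch, not ExoLabel’s actual code; the 8-byte values, the layout of one offset per node followed by the edge block, and the use of plain fseek are simplifying assumptions):

#include <stdio.h>
#include <stdint.h>

// Sketch: place one edge (src -> dst) into the CSR file via random access.
// Assumes csr was opened with fopen(..., "r+b") and stores n+1 8-byte offsets
// followed by the edge block; real code on files this large would also need
// 64-bit seeking (e.g. fseeko) and error checking.
void place_edge(FILE *csr, uint64_t n, uint64_t src, uint64_t dst){
    uint64_t offset;
    // read the current offset for this edge's source node
    fseek(csr, (long)(src * sizeof(uint64_t)), SEEK_SET);
    fread(&offset, sizeof(uint64_t), 1, csr);
    // jump past the n+1 offsets plus `offset` edge slots, then write the destination
    fseek(csr, (long)((n + 1 + offset) * sizeof(uint64_t)), SEEK_SET);
    fwrite(&dst, sizeof(uint64_t), 1, csr);
}

Every edge costs two seeks, a read, and a write at an essentially random position in the file.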
After talking with my advisor, we decided the best route was to benchmark how long it takes to just copy the entire file without doing any of the extra work. In other words, if we had to write the same raw amount of data but without doing any other processing, how long would it take? If my implementation were close in speed to this version, we’d know that it’s pretty close to optimal (and that the slowdown really is just due to inherent limitations of data I/O).
I used a file with 74 million edges so I wouldn’t have to wait an entire day for each run. My code took 630 seconds to build the CSR list. Just writing the file without any other processing took…25 seconds. The bad news is that that’s a huge difference, so my code probably isn’t optimal. The good news is that there’s a lot of room to improve!
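For reference, the baseline was essentially just a buffered sequential copy, something like this sketch (my reconstruction of the idea, not the exact benchmark code; the file names and 1 MiB buffer size are placeholders):

#include <stdio.h>
#include <stdlib.h>

// Sequential baseline: stream the input to the output in large chunks, never seeking.
int main(void){
    FILE *in = fopen("edgelist.tsv", "rb");
    FILE *out = fopen("copy.bin", "wb");
    if(!in || !out) return 1;

    size_t bufsize = 1 << 20;  // 1 MiB chunks (arbitrary)
    char *buf = malloc(bufsize);
    size_t nread;
    while((nread = fread(buf, 1, bufsize, in)) > 0)
        fwrite(buf, 1, nread, out);

    free(buf);
    fclose(in);
    fclose(out);
    return 0;
}

Same amount of data, but the drive only ever sees sequential reads and writes.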
So, what’s causing this, and why didn’t it show up in my initial timing benchmarks? In short, I was so focused on speeding up fwrite specifically that I forgot to check how fwrite interacts with fread and fseek. The conceptual reason for the slowdown is that storage devices really do not enjoy random access, even SSDs. Yes, an SSD will have much faster random access than an HDD, but all these devices prefetch data past the read point to improve future lookups. If you’re reading/writing in sequential order, the hardware can operate at its maximum efficiency (since that’s its most common use-case, and thus what it’s designed to do). If you’re reading/writing in random order, that extra data fetching won’t give you any benefit, essentially putting a huge performance penalty on all your data accesses.
This means that my observed slowdown isn’t because of a particular fwrite or fseek call, but rather the combination of an fseek followed by an fwrite. This slowdown even occurs when the fseek calls traverse the file linearly rather than randomly: presorting the output buffer so that each fseek call strictly moves the file pointer forward only reduced the runtime from 630 seconds to 610 seconds. The solution, then, is to find a way to build the CSR-compressed file while only reading/writing sequentially.
Conclusion and Next Steps
The broad takeaway here is that you get massive performance gains by sticking to sequential reads/writes, even on an SSD. Hard drives are like really terrible caches: just as you’d improve cache locality in a RAM-based solution, you get better performance by reducing the number of drive-level flushes and refills. The catch is that every inefficiency is magnified when working with external storage because of how slow I/O is compared to internal memory. Optimizing your fseek calls can definitely improve speed, but the optimal solution will likely be one that never calls fseek at all.
As for this project, there is a way to improve it. Note that the edges in the CSR format are just the second vertices sorted in order of their first vertex. In R, that would look something like:
# sort edges by their first vertex
ord <- order(edgelist[,1])
# offsets: leading 0, then cumulative node degrees
offsets <- c(0, cumsum(table(edgelist[,1])))
csr_verts <- edgelist[ord, 2]
csr_weights <- edgelist[ord, 3]
Sadly, it’s not possible to load entire edgelists into an R session because of RAM requirements. However, what we can do is build an intermediate file of edge records and then sort it with external memory. The new algorithm will look something like this (C pseudocode):
typedef struct edge_holder {
    int v1;
    int v2;
    int weight;
} edge_t;

// comparison function: order edges by their first vertex
int cmp(const void *a, const void *b){
    const edge_t *ea = (const edge_t *)a;
    const edge_t *eb = (const edge_t *)b;
    return (ea->v1 > eb->v1) - (ea->v1 < eb->v1);
}

edge_t e;
// stream the edgelist sequentially, writing one fixed-size record per edge
for (edge in file){
    e.v1 = edge[0];
    e.v2 = edge[1];
    e.weight = edge[2];
    fwrite(&e, sizeof(edge_t), 1, output_file);
}

// some way to sort this file using disk space
external_sort(output_file, cmp);
This method guarantees we only use sequential access: reads from file are sequential, and writes to output_file are sequential. The above method isn’t exactly what I implemented; the real code includes things like buffering to reduce the number of fwrite calls, as well as some compression methods to reduce the size of the edge_t type.
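As a rough idea of what that buffering looks like (a sketch with arbitrary sizes, not the real implementation), edges are collected in a large in-memory array and flushed to disk with a single fwrite whenever it fills up:

#include <stdio.h>

#define EDGE_BUF_SIZE 65536  // number of edges to hold before flushing (arbitrary)

static edge_t edge_buffer[EDGE_BUF_SIZE];
static size_t n_buffered = 0;

// Queue one edge; when the buffer is full, write it out with one sequential fwrite.
void buffered_write(edge_t e, FILE *output_file){
    edge_buffer[n_buffered++] = e;
    if(n_buffered == EDGE_BUF_SIZE){
        fwrite(edge_buffer, sizeof(edge_t), n_buffered, output_file);
        n_buffered = 0;
    }
}
// (any edges still in the buffer need one final fwrite after the input is exhausted)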
The elephant in the room is the last step: sorting with external memory. I’m going to leave the details on that to the next blog post…but if you can’t wait, it uses a k-way mergesort with a tournament tree. And it’s really, really fast.
Thanks for reading – check out the next post in this series here!