Efficient File I/O, Part 4: In-Place External Sorting

If you’ve been following all the posts in this series, you’ll know that by now I have a pretty good way to read in edges for ExoLabel, thanks primarily to faster I/O and an optimized external sorting function (if you haven’t, check out the first post here!). I left off in my last post by mentioning that I wanted to change my external sort to work in-place, but I didn’t really explain why.

Ideally, I’d like to minimize the total footprint of my algorithm. My first approach to reading in graph edges (allocating the exact amount of space they require and then recording them as the edgelist file is read) was horrendously slow, but only consumed 12 bytes per edge of the graph. The approach with loser trees and external merge sorts first requires that we record all the edges. This would normally require 20 bytes per edge (two unsigned long int values and one float), but I compressed this further by packing the float and second unsigned long into a single value. If we assume we have fewer than $2^k$ indices, then we can pack a float and uint64_t into one value with the following approach:

#include <stdint.h>
#include <math.h>

uint64_t pack_ind_and_weight(uint64_t index, float weight, int k){
  int bits_to_shift = 64-k;
  // largest value that fits in the lower (64-k) bits; 1ULL avoids 32-bit overflow
  uint64_t max_value = (1ULL << bits_to_shift) - 1;

  // quantize the weight (assumed here to lie in [0,1]);
  // minimize loss of precision by using doubles
  uint64_t quantized_weight = (uint64_t) floor(((double)weight) * max_value);

  // index occupies the top k bits, quantized weight the bottom (64-k)
  index <<= bits_to_shift;
  return (index | quantized_weight);
}
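
As a quick illustration (the values and the choice of k = 40 below are purely hypothetical, and it assumes the pack_ind_and_weight definition above is in scope), the index ends up in the top k bits and the quantized weight in the low bits:

#include <stdint.h>
#include <stdio.h>

// assumes the pack_ind_and_weight() definition above is available
int main(void){
  // k = 40 leaves 24 bits for the weight, so max_value = 2^24 - 1
  uint64_t packed = pack_ind_and_weight(7, 0.5f, 40);
  printf("index: %llu\n", (unsigned long long)(packed >> 24));            // prints 7
  printf("weight bits: %llu\n", (unsigned long long)(packed & 0xFFFFFF)); // floor(0.5 * (2^24 - 1))
  return 0;
}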

With this approach, it only takes 16 bytes per edge, which is only a little more than the initial slow approach. However, the problem is in the external sorting algorithm. As we’re mergesorting the file, we copy the sorted values into a new file. The values swap between these two files until everything is completely sorted. There isn’t a way to shrink a file from the top while you’re working in it, so the only way to free up the space consumed by the file being sorted is to delete it at the end of the pass. As a result, right before the file is deleted, every edge is stored twice (once in each file)…which makes our actual peak disk consumption 32 bytes per edge.

Sometimes I get trapped in a cycle of overoptimizing when I don’t really need to. Before going any further, it’s worth considering what the actual impact of this is. Does it really matter if we have an extra 20 bytes of overhead per edge of the graph to process? Here’s my rationale for why the answer is “probably yes”.

First, this sorting operation has the largest disk footprint of any part of ExoLabel. If a later step consumed more disk space, I could safely use this approach, since I would’ve needed that space anyway. Unfortunately, though, no later step does.

Second, 20 bytes per edge is a lot of data when you’re working with really large graphs. Imagine we have an undirected network with a billion nodes, an average node degree of 100, vertex labels of around 10 characters, and weights with three digits of precision (e.g., 0.912). Each line of the edgelist will consume an average of 28 bytes ($2 \times 10$ bytes for the two labels, two tab separators, five bytes for the weight, and one byte for the newline). A billion nodes with an average degree of 100 gives $10^{11}$ edge endpoints, or 50 billion undirected edges, so our input file is going to be around $28 \times 50 \times 10^9=1.4\text{TB}$. Since the network is undirected, we have to record each edge twice in the CSR representation. The original slow approach would consume $12 \times 2 \times 50 \times 10^9 = 1.2\text{TB}$, whereas the external sort would consume $32 \times 2 \times 50 \times 10^9=3.2\text{TB}$. That’s more than double the size of the original file, which is not ideal.

So what can we do? The most straightforward solution is to make the external sort in-place rather than copying between two files. That would halve the peak disk consumption, meaning that we’re looking at $1.6\text{TB}$ of space, or roughly the same size as the initial file. That seems more reasonable to me.

An In-Place k-Way Merge Sort

So I’ve identified that I want to do this merge in-place. How do we actually accomplish that? Let’s first review the flow of the merging algorithm. At each stage, we have some number of sorted blocks. We take the first k blocks, load the first n values of each block into k input buffers, and then run a loser tree over those buffers to repeatedly output the smallest remaining value. Each output value is written to an output buffer, which dumps its contents to a new file when it gets full. When one of the input buffers empties, we reload it with the next n values from its block (or however many values remain, whichever is smaller).
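
To make that flow concrete, here’s a stripped-down, in-memory sketch of a single merge pass. It swaps the loser tree for a plain linear min-scan and merges arrays rather than file blocks, and every name in it (merge_blocks, inbuf, and so on) is mine rather than ExoLabel’s; it’s only meant to illustrate the buffering pattern described above:

#include <stdint.h>
#include <stddef.h>

#define K   4   // number of blocks merged at once
#define BUF 2   // values loaded from each block at a time
#define OUT 4   // output buffer capacity

// dest must have room for the total number of values across all blocks
void merge_blocks(uint64_t *blocks[K], const int lens[K], uint64_t *dest){
  uint64_t inbuf[K][BUF], outbuf[OUT];
  int pos[K] = {0}, head[K] = {0}, fill[K] = {0};
  int nout = 0;
  size_t written = 0;

  // initial load: fill every input buffer from its block
  for(int i = 0; i < K; i++)
    while(fill[i] < BUF && pos[i] < lens[i])
      inbuf[i][fill[i]++] = blocks[i][pos[i]++];

  for(;;){
    // the loser tree's job: find the smallest value across nonempty input buffers
    int best = -1;
    for(int i = 0; i < K; i++)
      if(head[i] < fill[i] &&
         (best < 0 || inbuf[i][head[i]] < inbuf[best][head[best]]))
        best = i;
    if(best < 0) break;                      // every block is exhausted

    outbuf[nout++] = inbuf[best][head[best]++];

    // refill a drained input buffer with the next values from its block
    if(head[best] == fill[best]){
      head[best] = fill[best] = 0;
      while(fill[best] < BUF && pos[best] < lens[best])
        inbuf[best][fill[best]++] = blocks[best][pos[best]++];
    }

    // dump the output buffer when it fills (the real code writes it to a file)
    if(nout == OUT){
      for(int j = 0; j < nout; j++) dest[written++] = outbuf[j];
      nout = 0;
    }
  }
  for(int j = 0; j < nout; j++) dest[written++] = outbuf[j];  // final partial dump
}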

What I’d like to do is to instead dump that output buffer into the same file that we’re currently reading from. The trick is making sure we’re not overwriting any values we might need to read later.

I’m going to set up a really simple example. Suppose our file has 16 values, comprised of four sorted blocks of length four. Let’s call these blocks A,B,C,D, and their values a1, a2, ... d3, d4:

FILE START
============
a1 a2 a3 a4
b1 b2 b3 b4
c1 c2 c3 c4
d1 d2 d3 d4
============
    FILE END

The loser tree will load up to two values at a time from each block. Once a value is loaded into the loser tree, I don’t need it in the file anymore, so I’ll mark it as xx. I’ll also give it an output buffer of size two. After the initial file load, the file looks like this:

FILE START
============
xx xx a3 a4
xx xx b3 b4
xx xx c3 c4
xx xx d3 d4
============
    FILE END

Input Buffers:
[a1, a2] [b1, b2] [c1, c2] [d1, d2]

Output Buffer:
[-, -]

There are some deceptively simple cases that might arise. For example, if the file is already in sorted order, then the tree pops [a1, a2], I write them to the top of the file, refill the now-empty A bucket, and get to:

FILE START
============
a1 a2 xx xx
xx xx b3 b4
xx xx c3 c4
xx xx d3 d4
============
    FILE END

Input Buffers:
[a3, a4] [b1, b2] [c1, c2] [d1, d2]

Previous Output Buffer:
[a1, a2]

The next output buffer would go into the next empty space, then the next, and so on down the file. Unfortunately, this doesn’t work in general, for a variety of reasons. Rather than cover every possible scenario, I’ll just skip to the worst case: when all of the smallest entries come from the final block. This would mean that we essentially need to move the entire last block to the front of the file and shift everything else down. A final output resulting from such a scenario could look like:

SORTED FILE START
=================
d1 d2 d3 d4
a1 a2 a3 a4
b1 b2 b3 b4
c1 c2 c3 c4
=================
  SORTED FILE END

Maybe there’s something to be learned from this case, though. If we had known from the start that the last block was going to end up at the top, we could have shifted all the other elements down before writing it. In fact, this is the beginning of an algorithm that will work. If you want to see if you can come up with a solution on your own, here are the facts that I considered when I was looking at this problem:

  • The loser tree will always load values into its output buffer from its input buffers.
  • The loser tree only dumps its output when either the output buffer is full, or all its input buffers are empty.
  • Once a value is loaded into an input buffer, it can safely be overwritten in the file.

This may not seem like much, but it leads to a critical insight into the problem. Again, any value sitting in an input buffer no longer needs to be kept in the file, and the output buffer only gets its values from the input buffers. This means that whenever the output buffer wants to dump its contents to the file, we are guaranteed to have at least as many xx slots within the k blocks we’re currently processing as there are values in the output buffer.

Let’s go back to the example, but this time suppose that the output buffer has length four. Imagine that it’s already been filled, but we don’t know what the values are. I’ll just label them y1, y2, y3, y4:

FILE START
============
xx xx a3 a4
xx xx b3 b4
xx xx c3 c4
xx xx d3 d4
============
    FILE END

Output Buffer:
[y1, y2, y3, y4]

We want to write that buffer to the beginning of this block, but doing so would overwrite a3 and a4, which we haven’t processed yet. However, we know for certain that there are enough xx slots somewhere in these blocks to hold the entire output buffer, so there must be some way to shuffle the gaps around to make space.

I came up with a solution that I think is pretty straightforward and reasonably efficient. There are many published papers on this type of algorithm with more optimal solutions, but I think mine is good enough for this problem. I’ll leave further optimization to future work if it ever becomes a bottleneck.

Anyway, I know that I want to write to the top of the file, and I know that the region I’m about to overwrite still contains elements that haven’t been processed. To fix this, I essentially slide the unprocessed blocks to the end of the file. Starting with the last block, I align each block’s end with the next block’s beginning (or with the end of the file, for the final block). Note that if the k blocks being processed don’t end at the absolute end of the file, we can treat the end of block k as the end of the file. This pushes the blocks downward:

FILE START
============
xx xx xx xx
xx xx xx xx
a3 a4 b3 b4
c3 c4 d3 d4
============
    FILE END

Output Buffer:
[y1, y2, y3, y4]

All the empty spaces are now at the top of the file, and since I know that there are guaranteed to be at least enough spaces to write the contents of the output buffer, I can safely write the output buffer to the file:

FILE START
============
y1 y2 y3 y4
xx xx xx xx
a3 a4 b3 b4
c3 c4 d3 d4
============
    FILE END

Output Buffer:
[-, -, -, -]

You can see that there’s actually room for another full output buffer, so the next time my buffer fills, I won’t have to do any moving at all. This block-shuffling operation is computationally expensive because it bounces around the file (see my previous posts for more detail on this). Because of that, I only move blocks when there is at least one unprocessed element within the region I’m about to write to. Moving all the blocks to the end of the segment pushes all of the gaps to the beginning of the segment, so in theory we won’t need to shuffle very often. If you’re interested in optimal algorithms for this problem (in terms of minimizing block shuffle operations), you could check out this paper as a starting point. Fair warning, they get pretty dense.
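
To make the sliding step itself concrete, here’s a toy in-memory version of it (all of the names are hypothetical). The real reorganize_blocks() in the code section below does the same thing on the sort file, using fseek/fread/fwrite in fixed-size chunks instead of memmove:

#include <stdint.h>
#include <string.h>

// Align the unprocessed tail of each block with the end of the segment,
// working backwards from the last block, so every free slot ends up at the
// top of the segment. offsets[] holds each block's current start position
// and remaining[] its number of unprocessed elements (both in element units).
void slide_blocks_to_end(uint64_t *segment, size_t segment_len,
                         size_t *offsets, const size_t *remaining, int nblocks){
  size_t write_end = segment_len;
  for(int i = nblocks - 1; i >= 0; i--){
    if(!remaining[i]) continue;              // nothing left in this block
    size_t write_start = write_end - remaining[i];
    // memmove tolerates overlapping ranges; blocks only shift toward the end
    memmove(segment + write_start, segment + offsets[i],
            remaining[i] * sizeof(uint64_t));
    offsets[i] = write_start;                // block i now starts here
    write_end = write_start;
  }
}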

How Does It Perform?

If you remember my last post, I improved the runtime over the naive method from 640 seconds to under a minute. It’s not exactly clear to me how the choice of k impacts runtime – you might expect that more bins means a lower runtime, since we need fewer passes through the file. However, more bins also means more comparisons and potentially more long-distance jumps throughout the file, which may cause slowdowns due to cache inefficiency. Benchmarking this kind of thing is also really challenging, since the OS may have lots of unrelated work going on in the background that can limit performance.

Anyway, my implementation using a non-in-place 64-way merge took around 62 seconds in my last test on a file with 74 million undirected edges. Using an in-place merge, that went up to 84 seconds. A huge chunk of both of those times is reading in the edges in the first place (i.e., not the sorting), but the in-place merge does contribute an additional 22 seconds of runtime over the non-in-place version. I think the tradeoff is worth it, though, given the enormous reduction in disk consumption we get in return. I also added an argument use_fast_sort to ExoLabel that lets users employ the faster sort when disk consumption isn’t an issue.

One Last Thing…

There’s one more issue I had to solve here – remember when I mentioned that I’m compressing edges by packing an index and a float into one uint64_t? Each edge I write has two uint64_t values (one of which is packed), but my subsequent functions need this to be split into a single long and a single float per edge. If we just decompress the values and write them to a new file, we’ll end up incurring an extra 12 bytes per edge (28 total), which basically undoes all the work I did here.

I mentioned that it isn’t possible to shrink a file from the beginning while you’re in it, but it is possible to truncate it from the end. Of the two uint64_t values in the edge structure, the first isn’t needed after we’re done sorting. Thus, I can start by overwriting the whole file with just the second (packed) uint64_t from each edge. This doesn’t shrink the file, though – it only replaces the first half of the file with other values. However, we can call platform-specific functions (truncate on POSIX systems and _chsize_s on Windows) to tell the OS to shrink the file by truncating it to a specified size. The function to truncate a file looks like this (fname is the file, size is the size to truncate to):

void truncate_file(const char* fname, size_t size){
  int retval = 0;
#ifdef HAVE_UNISTD_H
  // unistd.h is imported earlier in a similar ifdef
  retval = truncate(fname, size);
#else
  #ifdef WIN32
    int filehandler = _open(fname, _O_BINARY | _O_RDWR);
    retval = _chsize_s(filehandler, size);
    _close(filehandler);
  #else
    return;
  #endif
#endif
  if(retval != 0) error("Failed to truncate file!");
  return;
}

Once that’s done, we have a file that encodes all the edges in 8 bytes per edge. It’s then fairly simple to read each edge, decompress the packed uint64_t, and write out the index and weight. The index overwrites the packed value it was decoded from, and the weight goes into a new file. In total, the sequence looks like:

  • Overwrite first half of file with second number (16b / edge)
  • Truncate file by 50% (8b / edge)
  • Decompress edges (12b / edge)

This keeps our maximum disk usage at 16 bytes per edge.
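
For completeness, the decompression step is just the inverse of pack_ind_and_weight from earlier. Here’s a minimal sketch of what it might look like; the function name is mine, and it assumes the weights were normalized to [0,1] before packing, so the real code may differ in the details:

#include <stdint.h>

// inverse of pack_ind_and_weight(): recover the index from the top k bits
// and the weight from the quantized value in the lower (64-k) bits
void unpack_ind_and_weight(uint64_t packed, int k, uint64_t *index, float *weight){
  int bits_to_shift = 64 - k;
  uint64_t max_value = (1ULL << bits_to_shift) - 1;

  *index = packed >> bits_to_shift;
  // undo the quantization; weights are assumed to have been in [0,1]
  *weight = (float)((double)(packed & max_value) / (double)max_value);
}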

Conclusion and Code

As with the last post, I’ll include the code I wrote below. There are definitely more improvements that could be made to my in-place external sort, but I think it’s in a good place for now. The clustering step is now a much larger part of the runtime for medium-sized graphs, so further optimization would likely be directed there before revisiting this step. You can always check out the most up-to-date version of this on my GitHub. Thanks for reading!

Note: Code below only includes elements that were added for in-place merging; see my previous post for the rest of the code if necessary.

In-Place k-Way Merge Code
void reorganize_blocks(LoserTree *tree, size_t block_end, FILE *f,
                        long int *remaining, long int **offsets){
  /*
   * Reorganize blocks by shuffling all unprocessed blocks to the end
   */
  size_t size = tree->e_size;
  int output_size = tree->cur_output_i;
  long int *offs = *offsets;
  // use as much space as possible, minimize r/w calls
  void *scratch_buf = malloc(size*tree->output_size);
  int nbins = tree->nbins;
  long int write_start, write_end, read_start, read_end, to_read;

  int last_bin = nbins-1;
  while(!remaining[last_bin]) last_bin--;
  write_end = block_end;
  for(int i=last_bin; i>=0; i--){
    // last bin is always in the right place
    // have to move all the rest of the bins down
    if(remaining[i]){
      read_start = offs[i];
      read_end = offs[i] + remaining[i];
      while(read_end != offs[i]){
        R_CheckUserInterrupt();
        to_read = output_size;
        if(read_end < offs[i] + to_read) to_read = read_end - offs[i];
        read_start = read_end - to_read;
        write_start = write_end - to_read;
        fseek(f, read_start*size, SEEK_SET);
        fread(scratch_buf, size, to_read, f);
        fseek(f, write_start*size, SEEK_SET);
        fwrite(scratch_buf, size, to_read, f);
        read_end = read_start;
        write_end = write_start;
      }
      offs[i] = write_start;
    }
  }

  free(scratch_buf);
  return;
}

size_t LT_fdumpOutputInplace(LoserTree *tree, size_t block_end,
                            FILE *f, long int *remaining, long int **offsets){
  /*
   * Function to dump output in-place
   * Requires a bunch of extra values so we can keep track of stuff
   * Input Variables:
   *  -      tree: LoserTree structure
   *  - block_end: position just past the last element of the current set of blocks
   *  -         f: file pointer used for both reading and writing values
   *  - remaining: pointer to long int* containing # of elements remaining per block
   *  -   offsets: pointer to long int* with start position of each block
   */

  size_t size = tree->e_size;
  size_t start = tree->nwritten;
  int output_size = tree->cur_output_i;
  int nbins = tree->nbins;
  long int *offs = *offsets;

  if(!output_size) return start;
  int first_bin = 0;
  while(first_bin < nbins && !remaining[first_bin]) first_bin++;

  if(first_bin < nbins && offs[first_bin] < (start+output_size))
    reorganize_blocks(tree, block_end, f, remaining, offsets);

  // reset the writing pointer
  // note tree->nwritten gives us the position of the last sorted value written
  // (which is where we need to write next)
  fseek(f, (tree->nwritten)*size, SEEK_SET);
  LT_fdumpOutput(tree, f);

  return 0;
}

int LT_runInplaceFileGame(LoserTree *tree, size_t block_end,
                          FILE *f, long int *remaining, long int **offsets){
  /*
   * Same as LT_runFileGame(), but does it in-place
   * see LT_fdumpOutputInplace for argument description
   */
  int retval = -1;
  while(tree->full_bins){
    LT_popOutput(tree);
    if(tree->cur_output_i == tree->output_size)
      LT_fdumpOutputInplace(tree, block_end, f, remaining, offsets);
    if(tree->empty_bin != -1){
      retval = tree->empty_bin;
      tree->empty_bin = -1;
      return retval;
    }
    LT_updateTree(tree);
  }
  return -1;
}