Efficient File I/O, Part 3: Loser Trees and External Sorting

25 minute read

Published: December 09, 2024

In my last post, I talked about ways to improve the efficiency of fwrite calls. Essentially, it boils down to prioritizing sequential read/writes. At the end of that post, I mentioned that I need a way to sort a file on disk. These are called external sorting algorithms. While they’re not used very often today, they were incredibly important in the past. Back in the days of tape drives, computers rarely had enough RAM to load a dataset into memory for sorting, and so external sorts were used instead.

The good news is that these algorithms are very well studied, so there’s a lot of information already available on them. Wikipedia notes that k-way merges are commonly used, and even mentions that the best way to implement a k-way merge is using a data structure called a Loser Tree (I had never heard of this structure before reading about external sorting algorithms). The bad news is that the information available from places like Wikipedia are pretty sparse, and sometimes wrong. I found a terrific talk by Bryan Boreham entitled “Blazing fast merge with loser trees”, which did an amazing job of describing the structure (and gave some great performance benchmarks!), but since his code is written in Go, I had to adapt it for C.

Why do this?

If you’re like me, the first question you may ask is “why do this at all?”. The algorithm I’m working on benefits greatly from sorting the values to work with prior to working with them, so ideally I want a sorting algorithm that runs fast enough to not undo the gains we get later from having sorted values. The largest constraint on this problem is that all the values are stored in a file, and there are too many values to read into RAM. This means that I’ll have to use an external sorting algorithm.

Merge sort is a pretty standard way to sort files, since the process is very compatible with continuously streaming in data from one or more files. However, it says online that a “k-way” merge is often a better strategy for merging files. A k-way merge is a generalized version of merge sort, where instead of merging two blocks at a time, we merge k. If k=2 then we get a standard merge sort. You don’t often hear about this in the context of sorting algorithms because it doesn’t provide a huge benefit–in terms of asymptotic runtime, the sum of the number of comparisons and the number of data read/writes can be shown to be a fixed value (if all algorithms are optimal) regardless of the value of k. When we’re working in RAM, this means that there isn’t usually much value to doing more than standard merge sort, since the cost of a comparison and a data read/write are roughly the same. However, when we’re working with data stored on files, the cost of data read/writes is orders of magnitude higher than the cost of a comparison between values, so we get a big performance boost from increasing the value of k to decrease the number of file read/writes even though it similarly increases the number of comparisons. Large values of k need fewer passes through the file – standard merge sort will take $\log_2 n$ passes through n elements, whereas a k-way merge will take $\log_k n$ passes. Each pass involves reading and writing all n elements, so one less pass means n fewer reads and writes.

Conceptually, the process by which we merge k blocks is pretty simple:

Load in k blocks of memory
Find the block, i, that has the smallest value
Write the value in block i to the new file
Advance block i by one value

Unfortunately, step (2) in this process holds a lot of hidden complexity. If we take a naive approach and simply scan all k blocks to find the lowest value each time, we’ll end up with an algorithm with runtime $\Theta(nk)$ and lose out on all our performance benefits.

If you’re like me, you’re probably thinking that this sounds like a great problem for a heap. This line of thinking is mostly correct, but unfortunately it’s not the right answer. Wikipedia’s “k-way merge algorithm” page says that “The heap is more commonly used, although a tournament tree is faster in practice”, but it maddeningly doesn’t (or didn’t at least, at the time) provide much further explanation. Both these algorithms have worst-case $\Theta(n \log k)$ runtime, where n is the number of elements and k is the number of blocks to use. Heaps are significantly easier to implement, so if you don’t need to eke out every last drop of performance, I’d just go with a heap. That said, I do need to eke out every last drop of performance, so I went ahead with the tournament tree approach.

Loser (Tournament) Trees

Conceptually, a “tournament tree” is a tournament bracket, like you might see at the World Cup or March Madness. For anyone unaware of what that is, it’s a structure with the following setup:

participants start at the leaves of a tree structure
at each internal node starting from the bottom, the two child nodes “play a game” against each other
the loser of the game stays at that node, and the winner of the game progresses to play another game at the parent node.

If we consider a theoretical (incredibly boring) game where the higher rated player always wins, the below is an example of how the tree might unfold:

Initialization (ratings are in parentheses):
          ?? vs ??
       /            \
    A vs B        C vs D
   /      \      /      \
A(1.0) B(2.2) C(3.2) D(1.5)


B beats A, C beats D:
           B vs C
       /            \
      A              D
   /      \      /      \
A(1.0) B(2.2) C(3.2) D(1.5)


C beats B, so C is the overall winner:
            (C)
             |
             B
       /            \
      A              D
   /      \      /      \
A(1.0) B(2.2) C(3.2) D(1.5)

When we update a value in the tree, we start by changing the value at the leaf. Let’s say that player C really slacked off, and down to a 0.9 rating. We don’t actually have to replay all the games, since we already know the result of most of them. Instead, we just update the value of C at the leaf, and then only replay the games that are a parent of the current node:

Previous Tree:
            (C)
             |
             B
       /            \
      A              D
   /      \      /      \
A(1.0) B(2.2) C(3.2) D(1.5)


C moves down to 0.9:
             ?
             |
             B
       /            \
      A              D
   /      \      /      \
A(1.0) B(2.2) C(0.9) D(1.5)


Replay parent game:
             ?
             |
             B
       /            \
      A           C vs D: D wins
   /      \      /      \
A(1.0) B(2.2) C(0.9) D(1.5)


Continue:
             ?
             |
          B vs D: B wins
       /            \
      A              C
   /      \      /      \
A(1.0) B(2.2) C(0.9) D(1.5)

Return new overall winner:
             B
             |
             D
       /            \
      A              C
   /      \      /      \
A(1.0) B(2.2) C(0.9) D(1.5)

This optimization only works if we update the value of the overall winner. However, that’s a very easy restraint to work around – when we’re sorting, we just want to pull out the largest (or smallest) value, so we’ll only be messing with the values at the top of the tree. I’ll show how that works shortly.

A loser tree is a type of tournament tree in which the loser advances to the next round, rather than the winner. In code, the “game” played is just asking which value is larger. This means that the larger value (the winner) will stay at the internal nodes, and the smaller value (the loser) will continue on up the tree. Eventually, the smallest value from the leaves ends up at the top of the tree.

How Does a Loser Tree Do a k-way Merge?

So we have this weird data structure – how does it help us actually merge blocks? Imagine that I have k presorted blocks, maybe stored ascending in separate arrays, and I want to merge them together into one big array sorted in ascending order. If I use a loser tree, the smallest value from the four arrays will always be stored on top. For example, let’s look at the following example of four arrays, each of size 2:

         ?
         |
         ?
    /        \
   ?          ?
 /   \      /   \
3     1    2     6
4     5    8     7
A     B    C     D

Here A,B,C,D are the names of the four blocks. I’m going to use their names to store the losers/winners of each game, rather than use the values themselves. If we initialize our loser tree, we’ll find that B loses to A and C loses to D, then C loses to B, so B is the overall loser. Remember, the winners stay, and the losers continue up the tree:

         B
         |
         C
    /        \
   A          D
 /   \      /   \
3     1    2     6
4     5    8     7
A     B    C     D

Now we know that the value in B is the smallest value, so we can take its topmost value out of B and put it in our output array:

         B
         |
         C
    /        \
   A          D
 /   \      /   \
3     5    2     6
4          8     7
A     B    C     D

Output: [1]

This means that we’ve changed the value of B (it’s now 5), so it may no longer be the smallest value in our array. However, this is okay! We’ve only changed the topmost value of the tree, so we can update the tree with the optimization previously described. We start at B and play games with each parent to update the values stored in parent nodes:

B vs. A results in B winning and A continuing up the tree
A vs. C results in A winning and C continuing up the tree

         C
         |
         A
    /        \
   B          D
 /   \      /   \
3     5    2     6
4          8     7
A     B    C     D

Output: [1, ]

Then we pop the value out of C as we did before:

         C
         |
         A
    /        \
   B          D
 /   \      /   \
3     5    8     6
4                7
A     B    C     D

Output: [1,2]

And since the value of C is now changed, we update its parents and continue until all the arrays are depleted. One question that arises is what happens when one array is depleted but the others are not. In this case, we just treat its value as infinity. For example, after a few more runs the tree will look like this:

         A
         |
         D
    /        \
   B          C
 /   \      /   \
4     5    8     6
                 7
A     B    C     D

Output: [1,2,3]

Popping A results in that array being empty, so we just treat its next value as infinity. Infinity always wins, so the tree updates to:

         B
         |
         D
    /        \
   A          C
 /   \      /   \
-     5    8     6
                 7
A     B    C     D

Output: [1,2,3,4]

Once all the arrays are empty, we return our sorted list.

Another reason these are great for files is that we can “refill” these arrays instead of setting them to infinity. Suppose we want to merge four blocks, each with a million elements. We could load all four million elements at once, but this would take a lot of RAM. Instead, we can just allocate four buffers of size m, and load the first m elements from each block. When the buffer empties, we can refill it with the next m elements before we update our tree again. Similarly, we can buffer output values into a fixed size array and then dump it to an output file when it gets full. This strategy allows us to work with arbitrarily large blocks in constant space while also being relatively optimized in terms of read/write since we’re writing sequentially and reading large blocks sequentially.

While the internal structure of a loser tree is a binary tree, we can actually use a trick to represent it as a flat array. If you’re familiar with heaps, this probably won’t be super surprising. Consider the above tree, and suppose we start numbering the nodes in level order starting from the top:

Number each node starting from the top in level order:
         0
         |
         1
    /        \
   2          3
 /   \      /   \
4     5    6     7

This gives us a really nice property: the index of the parent of each node is half the indices of its children (integer division will truncate the result of dividing an odd number by two). In other words, for a k-way merge, we can easily store the entire binary tree in an array of size 2k. The leaves will be populated in indices k, k+1, ..., 2k-1, and the internal nodes will exist in 0, 1, ..., k-1. When we update the value of leaf i, we just change the value of index k+i and then find its children by repeatedly halving the index until we get to zero.

Heaps vs. Loser Trees

The first time I learned about tournament trees is via Wikipedia telling me that they’re more efficient than heaps. But, why? What makes a tournament tree “faster in practice” than a heap? Both these data structures are essentially identical internally, right?

Suppose I have the values 2,4,3,1,6,5,3,7. A heap might look like this:

     1
   /   \
  2     5
 / \   / \
3   4 6   7

Whereas a loser tree would look like this:

           1
           |
           5
      /         \
     3           7
   /   \       /   \
  2     4     6     -
 / \   / \   / \   / \
1   2 3   4 5   6 7   -

What happens if I change the 1 to 8?

In the heap approach, we have the following steps:

replace 1 with 8
compare 8 to both children (2, 5)
swap 8 with child 5, since 8 > 5
compare 8 with new children (6,7)
swap 8 with child 7, since 8 > 7

In a loser tree, we would do the following:

replace 1 with 8
compare 8 against parent (2)
8 > 2, so store winner (8) at node and continue with loser (2)
compare 2 against parent (3)
2 > 3, so store winner (3) at node and continue with loser (2)
compare 2 against parent (5)
2 > 5, so store winner (5) at node and continue with loser (2)
this is the top node, so return 2 as overall loser

This seems like more steps, but it actually isn’t. When we move down the tree in the heap approach, we have to make two comparisons at each node. Notice how we do two pairs of comparisons (8 vs 2,3; 8 vs 6,7) to update the heap in the above example. In contrast, the loser tree approach moves up the tree, so we only compare with the parent of the current node in each case. Thus, the tournament tree uses $\log k$ comparisons to update a value in the tree, whereas a heap uses $2\log k$. If we’re processing a lot of data, that difference will add up over time. The tradeoff is that the tournament tree takes twice as much space as the heap (2k values instead of k), but given that the value of k is nearly always going to be orders of magnitude smaller than the amount of data to process, this isn’t a big deal.

Performance and Implementation

This post focuses on the conceptual implementation, but I’ll also include the code below. One of the goals of this process was to speed up how fast we could read in edges from a tsv file. Here’s how this looks in practice on a dataset with 74 million lines each encoding a bidirectional edge (percentage improvements are versus the naive approach):

Naive approach (random writes): 640 seconds
Merge sort: 189 seconds (70.5% faster)
Loser tree 8-way merge: 42 seconds (93.4% faster)

I agree with Bryan Boreham – that’s blazingly fast.

The main remaining issue with this implementation is it doubles our disk space consumption, since I’m sorting by reading from one file and outputting to another. The next task will be implementing this strategy with an in-place sort, which is much more complicated than it might initially seem…but that’s a task for another blog post (which you can now check out here!).

If you’re interested in the code, it’s included below. The data structure I wrote is data-agnostic, so it should work with any input datatype.

Header File: LoserTree.h

#ifndef LOSERTREE_H
#define LOSERTREE_H

/*
 * This is intended to be a type-unaware LoserTree (Tournament Tree)
 * implementation that supports dynamic refilling of bins for external sorting.
 *
 * The expected usage is the following:
 *  - Initialize tree with LT_alloc()
 *  - for each block to merge:
 *      | read some number of elements into a buffer b
 *      | call LT_fillBin() to assign the bin to b
 *  - initialize tree values with LT_initGame()
 *  - while elements remain in blocks:
 *      | call LT_popOutput(tree)
 *      | if output buffer is full, call LT_dumpOutput
 *      | if tree->empty_bin != -1, refill the empty bin
 *          (e.g., read new elements into buffer, then LT_fillBin() again)
 *      | call LT_updateTree(tree)
 *
 *
 * For processing with output streamed to a file, use LT_runFileGame(). This
 * will keep running games until a bin is emptied, at which point the index
 * of the bin is returned. The calling function should then call LT_refillBin().
 * If the bin does not need to be refilled, the caller still needs to call
 * LT_refillBin(tree, i, 0, NULL). Pseudocode would look like:
 *  - Initialize tree with LT_alloc()
 *  - for each block to merge:
 *      | read some number of elements into a buffer b
 *      | call LT_fillBin() to assign the bin to b
 *  - initialize tree values with LT_initGame()
 *  - while blocks remain in files:
 *      | call LT_runFileGame(), returns index of empty bin
 *      | refill empty bin with LT_fillBin()
 *      | (if no elements remain, still call with nelem=0 to clear set flags)
 *  - call LT_fdumpOutput() to dump any remaining values to file
 *
 * If you don't care about dynamic bin refilling this could be greatly
 * streamlined, but because I'm assuming block size >> bin size, this
 * added complexity is necessary.
 *
 *
 * NOTE: the responsibility of keeping track of the buffers is placed on the
 *  calling function, NOT the LoserTree struct. LoserTree->bins is simply an
 *  array of pointers that will iterate along allocated void* memory. The memory
 *  pointed to by each LoserTree->bins[i] will NOT be alloc'd or free'd, only
 *  the pointers themselves.
 */

typedef struct LoserTree {
  int nbins;
  int full_bins;
  int empty_bin;
  int output_size;
  int cur_output_i;
  size_t e_size; // element size, in bytes
  int *binsize;
  void **bins;
  void *output;
  int *values;
  long nwritten;
  int (*compare)(const void *a, const void *b);
} LoserTree;

LoserTree* LT_alloc(int nbins, int output_size, size_t element_size,
                    int (*compare)(const void *a, const void *b));
void LT_fillBin(LoserTree *tree, int bin, int nelem, void *input);
void LT_initGame(LoserTree *tree);
void LT_popOutput(LoserTree *tree);
void LT_updateTree(LoserTree *tree);
void LT_refillBin(LoserTree *tree, int bin, int nelem, void *input);
size_t LT_dumpOutput(LoserTree *tree, void *output_buffer);
size_t LT_fdumpOutput(LoserTree *tree, FILE *f);
int LT_runFileGame(LoserTree *tree, FILE *f);
void LT_free(LoserTree *tree);

#endif

Main File: LoserTree.c

#include "LoserTree.h"

/*
 * See comments on expected usage in LoserTree.h
 *
 * Other small notes:
 *  - error() is included from Rdefines.h since it's used with R. perror() also works.
 *  - LT_dumpOutput() dumps to a buffer, not really used since this is used with files.
 */

LoserTree* LT_alloc(int nbins, int output_size, size_t element_size,
                    int (*compare)(const void *a, const void *b)){
  // going to fill bins later, for now just leave them

  LoserTree *tree = malloc(sizeof(LoserTree));

  // we need an even power of two for the number of bins
  int actual_bins = 1;
  while(actual_bins < nbins) actual_bins <<= 1;

  // allocating the bins
  tree->nbins = actual_bins;
  tree->full_bins = 0;

  // values will always hold indices, so we know it'll be ints
  int *values = malloc(sizeof(int) * (actual_bins) * 2);
  int *binsize = malloc(sizeof(int) * actual_bins);
  void **bins = malloc(sizeof(void *) * actual_bins);

  for(int i=0; i<actual_bins; i++){
    binsize[i] = 0;
    bins[i] = NULL;
    values[i] = -1;
    values[i+actual_bins] = i; // fill second half of array with indices
  }
  tree->binsize = binsize;
  tree->bins = bins;
  tree->values = values;

  tree->empty_bin = -1;
  tree->output_size = output_size;
  tree->output = malloc(output_size*element_size);
  tree->cur_output_i = 0;
  tree->e_size = element_size;

  tree->compare = compare;

  tree->nwritten = 0;

  return tree;
}

static void LT_playgame(LoserTree *tree, int *a, int *b){
  // helper function
  // assigns the index of smaller value (loser) to a, and the other to b
  if(!tree->binsize[*b]) return;
  if(!tree->binsize[*a] || tree->compare(tree->bins[*a], tree->bins[*b]) > 0){
    // need to swap the values
    int tmp = *a;
    *a = *b;
    *b = tmp;
  }
  return;
}

static int LT_playRecursiveGameAtNodeI(LoserTree *tree, int i){
  if(i >= tree->nbins) return (i - tree->nbins);

  int left = LT_playRecursiveGameAtNodeI(tree, 2*i); // left
  int right = LT_playRecursiveGameAtNodeI(tree, 2*i+1); // right

  int smaller=left, larger=right;
  if(!tree->binsize[right]){
    // if right doesn't exist, it counts as infinity (larger)
    larger = right;
    smaller = left;
  } else if(!tree->binsize[left]){
    // if left doesn't exist, it counts as infinity (larger)
    larger = left;
    smaller = right;
  } else {
    // otherwise, larger is left if cmp(a,b) > 0
    larger = tree->compare(tree->bins[left], tree->bins[right]) < 0 ? right : left;
    smaller = larger == left ? right : left;
  }

  // now "left" is always the index of the smaller element
  // and "right" is always the index of the larger element

  // in loser trees, the smaller element goes on, larger stays
  tree->values[i] = larger;
  return smaller;
}

void LT_fillBin(LoserTree *tree, int bin, int nelem, void *input){
  // fills a bin of the LT with some amount of data
  // input data should be preallocated, no memory copying will be done
  if(bin > tree->nbins)
    error("Attempted to fill out-of-bounds bin in LoserTree!");
  if(tree->binsize[bin] == 0 && nelem) tree->full_bins++;
  tree->binsize[bin] = nelem;
  tree->bins[bin] = nelem ? input : NULL;
  if(nelem && tree->empty_bin == bin)
    tree->empty_bin = -1;
  return;
}

void LT_refillBin(LoserTree *tree, int bin, int nelem, void *input){
  // should be called when the last popOutput() emptied a bin
  // then the top element will still be the (new) empty bin
  if(nelem) LT_fillBin(tree, bin, nelem, input);
  LT_updateTree(tree);
  return;
}

void LT_initGame(LoserTree *tree){
  tree->values[0] = LT_playRecursiveGameAtNodeI(tree, 1);
  return;
}

void LT_popOutput(LoserTree *tree){
  if(tree->output_size <= tree->cur_output_i)
    error("Tried to pop output from LoserTree but buffer is full!");
  size_t size = tree->e_size;
  int cur_min = tree->values[0];
  if(!tree->binsize[cur_min])
    error("Tried to pop LoserTree output from an empty bin!");

  // void_deref(v, i, size) is equivalent to v[i] for a void*
  void *to_write = void_deref(tree->output, (tree->cur_output_i)++, size);

  // copy the top value into the output array
  memcpy(to_write, tree->bins[cur_min], size);


  if(--(tree->binsize[cur_min])){
    // if there's still elements in the bin, advance the pointer by one
    tree->bins[cur_min] = void_deref(tree->bins[cur_min], 1, size);
    tree->empty_bin = -1;
  } else {
    // otherwise the bin is empty, mark the signal value so we can populate it
    // in the calling function
    tree->empty_bin = cur_min;
    tree->full_bins--;
    tree->bins[cur_min] = NULL;
  }

  // note that I'm not going to update the tree here
  // we'll do that in LT_updateTree
  return;
}

void LT_updateTree(LoserTree *tree){
  // this function will update the tree with the next value in the bin
  // assume we've already called LT_popOutput
  int last_popped = tree->values[0];
  int cur_node = last_popped + tree->nbins;
  int v1 = last_popped;
  int v2;

  while(cur_node) {
    v2 = tree->values[cur_node];
    LT_playgame(tree, &v1, &v2);
    tree->values[cur_node] = v2;
    cur_node /= 2; // this moves to parent node
  };

  tree->values[cur_node] = v1;

  return;
}

int LT_runFileGame(LoserTree *tree, FILE *f){
  /*
   * this function will run games until a bin is emptied
   * when this happens, control returns to the caller to
   * allow the bin to be refilled
   */
  int retval = -1;
  while(tree->full_bins){
    LT_popOutput(tree);
    if(tree->cur_output_i == tree->output_size)
      LT_fdumpOutput(tree, f);
    if(tree->empty_bin != -1){
      retval = tree->empty_bin;
      tree->empty_bin = -1;
      return retval;
    }
    LT_updateTree(tree);
  }
  return -1;
}

size_t LT_dumpOutput(LoserTree *tree, void *output_buffer){
  size_t nbytes = tree->e_size * tree->cur_output_i;
  memcpy(output_buffer, tree->output, nbytes);
  tree->cur_output_i = 0;
  return nbytes;
}

size_t LT_fdumpOutput(LoserTree *tree, FILE *f){
  // assume f is a valid file
  //printf("\n\tWriting %d values\n", tree->cur_output_i);
  size_t nbytes = tree->e_size * tree->cur_output_i;
  if (!nbytes) return 0;
  size_t nwrote = fwrite(tree->output, 1, nbytes, f);
  if(nwrote != nbytes)
    error("Failed to write to file! (tried to write %zu bytes, wrote %zu bytes)",
      nbytes, nwrote);
  tree->cur_output_i = 0;
  tree->nwritten += nwrote / tree->e_size;
  return nwrote;
}

void LT_free(LoserTree *tree){
  free(tree->bins);
  free(tree->binsize);
  free(tree->output);
  free(tree->values);
  free(tree);
  return;
}

Share on

Facebook LinkedIn

Aidan Lakshman

Efficient File I/O, Part 3: Loser Trees and External Sorting

Why do this?

Loser (Tournament) Trees

How Does a Loser Tree Do a k-way Merge?

Heaps vs. Loser Trees

Performance and Implementation

Share on

You May Also Enjoy

My SWE Interview Experiences: Amazon vs. Meta vs. Google

The ExoLabel Post: Clustering Massive Networks with Limited Resources

A Story of a Sneaky Bug

The ‘Research Scientist’ Role at Meta: What is it, exactly?