Converting vectors to dendrograms in R

7 minute read

Published:

I do really like R’s dendrogram object, but it can be really clunky to work with them. I recently implemented random forests from scratch, and the decision trees within were stored as two vectors. However, it would be really nice to be able to plot these decision trees to visualize their contents. This got me thinking–is it possible to convert vectors of values into dendrogram objects in R if we know the vectors are ordered in a known tree traversal? In other languages this would be fairly simple, but the nested list structure of dendrogram objects (and their other attributes) makes it a little trickier in R.

Setup

Let me explain the problem a little more clearly. Our input is minimally an integer vector, with an optional second vector. The integer vector encodes nodes–positive values indicate internal nodes, whereas negative values indicate leaves. The second vector is paired with data we want to include at each node; this could be thresholds for a decision tree node, labels to print, or whatever. I’m assuming the input is ordered along a breadth-first traversal. Let’s see what that would look like for a small tree:

       a
     /   \
    b     c
   / \
  d   e

 labels: a  b  c  d  e
indices: 1  1 -1 -1 -1

I’m also assuming that this tree is fully bifurcating, with no interior nodes having outdegree 1. In other words, all interior nodes have exactly two children. The output we’re expecting is a dendrogram object with the same structure as the tree implied by the labels/indices vectors.

What makes a dendrogram?

dendrogram objects are nested lists with a set of attributes. Each internal node of the dendrogram has the following attributes:

  • members: number of leaves below this node (1 for leaves)
  • height: height of the node
  • midpoint: (optional) x offset relative to the leftmost child for plotting

Leaf nodes have a members attribute of 1, usually a height attribute of 0, and no midpoint attribute. They also have the following extra attributes:

  • label: label of the leaf
  • leaf: logical always set to TRUE

One important note is that these are largely nested list objects with a bunch of attributes. What happens at the leaves? Are they just empty lists?

The answer is no. In fact, if you call unlist on a dendrogram, you’ll get back a vector of integers. The leaf nodes are single integer values with additional attributes. In the previous example, the dendrogram object internally (without all its attributes) would be equivalent to:

list(list(1L, 2L), 3L)

This is important. Lots of stuff breaks if you build a dendrogram object by hand that doesn’t have integers in the leaves.

dendrogram objects have another nice property, although it isn’t specific to the dendrogram class. Sublists in any nested list structure can be accessed using either multiple [[ calls, or vectors of indices. For example:

x <- list(list(list(1,2), list(3,4)), list(list(5,6), list(7,8)))

x[[1]][[2]][[1]]
# 3

x[[c(1,2,1)]]
# 3

This latter accession mode is going to be extremely useful.

Building the nested list structure

Breadth-first searches are typically done using queues–for each node, we process it, then add its children to the end of the queue. This is what makes this operation difficult in R…we can’t store pointers to nodes like we could do in (for example) C. However, we can actually build a queue accessing the nodes of the tree by using vectors of accessors. We can start with the root and a vector of accessors list(1,2). At any particular node x, its children are x[[1]] and x[[2]]. That means that, given an accessor i, if we append list(c(i,1) and list(c(i,2)) to our vector of accessors, we now have “pointers” to the children of the node accessed at root[[i]].

f <- function(node_indices, node_data){
  ## initialize the root node
  root <- vector('list', length=2L)
  attr(root, "members") <- 0L
  attr(root, "height") <- 0
  attr(root, "midpoint") <- 0
  attr(root, "otherdata") <- node_data[[1]]
  class(root) <- "dendrogram"

  ## set up a queue
  cur_q <- list(1,2)
  ctr <- 2L
  leafctr <- 1L
  while(ctr <= length(node_indices)){
    accessor <- cur_q[[ctr]]
    is_leaf <- node_indices[ctr] > 0
    otherdata <- node_data[[ctr]]
    if(!is_leaf){
      ## interior nodes are lists
      cur_node <- vector('list', length=2L)
      attr(cur_node, 'midpoint') <- 0

      ## add children to the queue
      cur_q <- c(cur_q, list(c(accessor, 1L), c(accessor, 2L)))
    } else {
      ## leaves are integers
      cur_node <- leafctr

      ## extra leaf attributes
      attr(cur_node, 'leaf') <- TRUE
      attr(cur_node, 'label') <- as.character(leafctr)

      leafctr <- leafctr + 1L
    }

    ## add attributes for all nodes
    attr(cur_node, 'height') <- 0
    attr(cur_node, 'members') <- 0L
    attr(cur_node, 'otherdata') <- otherdata
    class(cur_node) <- 'dendrogram'

    ## assign as a child of the current node
    root[[accessor]] <- cur_node
    ctr <- ctr + 1L
  }

  ## What are we missing here?

  root
}

Fixing the nodes

This is pretty close! There’s one small issue that you may have noticed in the prior function–all the attributes are wrong! We set members and height (and midpoint) to zero for every node, which isn’t really correct. If we try to plot the resulting object, it’ll be garbage.

The reason these are set to zero is because there isn’t a great way to determine what the correct values of these should be during the initial processing. It’s much easier to build the dendrogram object and then correct the values later. For this, I’m going to use my version of dendrapply (available in SynExtend) because it supports post-order traversals, which are important for this.

For each node x, we have to correct members, height, and midpoint. The members attribute is the number of leaves, which we can get with length(unlist(x)). The height attribute is relatively arbitrary, but each node has to be higher than its children for plotting. We’ll just set leaves to zero, and then interior nodes to max(height(x[[1]]), height(x[[2]])) + 1L.

The last attribute is midpoint, which can be relatively complicated. This attribute determines where the node is placed on the x-axis relative to its leftmost child. I’m not entirely sure why dendrogram objects plot using this parameter, but the relative referencing makes things complex. Each leaf is plotted with distance 1.0 from subsequent nodes. Thus, if the children of x are both leaves, the midpoint is 0.5–halfway between the leaves, since the leaves are distance 1.0 apart. The other cases are more complicated, but I’ll include the code below and leave the reasoning as an exercise for the reader.

f <- function(node_indices, node_data){
  ## ... everything from before

  ## here's what we were missing:
  midpoint <- \(x) attr(x, 'midpoint')
  height <- \(x) attr(x, 'height')
  root <- dendrapply(root, \(x){
    if(!is.leaf(x)){
      ## internal nodes
      attr(x, 'members') <- length(unlist(x))
      attr(x, 'height') <- max(height(x[[1]]), height(x[[2]])) + 1L

      ## midpoint calculation
      mp <- 0
      child_is_leaf <- c(is.leaf(x[[1]]), is.leaf(x[[2]]))
      if(child_is_leaf[1] && child_is_leaf[2]){
        ## both children leaves
        mp <- 0.5
      } else if(child_is_leaf[1]){
        ## only left child a leaf
        mp <- ((midpoint(x[[2]]) + 1L) / 2)
      } else if(child_is_leaf[2]){
        ## only right child is a leaf
        mp <- ((midpoint(x[[1]]) + 1L) / 2) + midpoint(x[[1]])
      } else {
        ## both children non-leaves
        mp <- (attr(x[[1]], 'members') + midpoint(x[[1]]) + midpoint(x[[2]])) / 2
      }
      attr(x, 'midpoint') <- mp
    } else {
      ## leaves
      attr(x, 'members') <- 1L
    }

    x
  }, how='post.order')

  root
}

Conclusion

This doesn’t put together the entire function–if you’re interested in the complete thing, you can check out the source code on GitHub. Thanks for reading!