My next research project involves determining groups of genes that derive from a common ancestor (usually called ‘orthologs’). To do this, we often take a bunch of gene sequences, quantify their pairwise similarity in some fashion, and then cluster them into groups of genes that are more similar to each other than to other clusters. I’m oversimplifying a lot here to keep this brief, so keep in mind that the actual calculations of those “pairwise distances” can involve a bunch of stuff.

However, the main part of the problem I’m looking at is the next step of the equation. Given that we have a ton of things with pairwise similarities among them, how can we best cluster them? If you’re a mathematician or a computer scientist, the phrases “A ton of things” and “pairwise similarities” should immediately call to mind a graph structure. Graphs are everywhere, and this is no exception. Fortunately for me, graph clustering methods are an extremely well studied problem. The typical name for these methods is “community detection algorithms”, since they try to identify communities in a network. Think of something like a social network, in which the clusters/communities we identify are communities of people that interact with each other.

Now, the problem with genetic data is that it’s huge. The UniProt database has over 250 million sequenced proteins (as of this time of this post), and that number is only growing. While many algorithms exist for community detection, lots of them have quadratic scaling. Ideally, we’d have some linear time algorithm that can cluster massive datasets like this, even if it’s just a first-pass cluster that we refine later.

Linear-time algorithms do in fact exist, and the one I settled on is called Label Propagation. Label propagation has a few nice properties: it takes linear time and memory, it works on weighted graphs, and it’s super simple. The basic algorithm looks like this:

```
initialize each node to its own community
while node_has_been_changed:
for node in graph:
sum all outgoing weights by community
reassign node to community with most weight
```

We go through all the nodes, and update each node to whichever community is most represented from its neighbors. Pretty simple.

Unfortunately, this implementation is pretty efficient. Any time a node is changed, we reevaluate all nodes in the graph. This could end up approximating quadratic runtime if we have particularly bad graphs (the observed time complexity is around `O(m^1.23)`

, `m`

the number of edges). Subsequent work introduced the Fast Label Propagation (FLP) algorithm, which instead uses a queue structure to keep track of nodes that have changed. This modifies the rough structure of the algorithm to the following:

```
q <- queue(all_nodes)
initialize each node to its own community
while length(q) > 0:
node <- dequeue(q)
sum all outgoing weights by community
reassign node to community with most weight
for(neighbor in neighbors(node))
if(cluster(neighbor) != cluster(node) && !(neighbor in queue))
enqueue(q, neighbor)
```

This ends up being much faster, and actually shows better performance at community detection compared to the classic label propagation algorithm. You can check out the previously linked paper for more details, but the worst-case runtime complexity on average graphs tends to be around `O(m+(1/n-2))`

, where `m`

the number of edges and `n-1`

the average node degree. Other runtime statistics are derived, but suffice it to say the average runtime is approximately linear in the number of nodes rather than scaling superlinearly.

Identifying and implementing the algorithm isn’t actually the tough part. If you’re interested in an FLP algorithm, you can check out either the `igraph`

implementation (available on this GitHub branch) or my own R-compatible implementation (available on my GitHub repo). My own testing on weighted graphs with power-law distributed node degree has shown FLP to outperform standard label propagation on all the graphs I tested.

The bigger issue here is that huge graphs require huge amounts of memory. While this algorithm works, it’s not trivial to analyze a graph with hundreds of millions (or even billions!) of nodes. If you consider each edge to have an integer start and end and a `double`

weight, a graph with two hundred million nodes and average node degree of 4 would require 1.28GB of space (two 32-bit `int`

s for node indices, one 64-bit `double`

for weight = 16B per edge, 16*4*200mil = 1.28bil = 1.28GB). That’s just space for the edges–we’d need additional space to store all the names of the nodes, and even more to run the algorithm itself. Asking a computer for a gigabyte of contiguous memory is typically not the best strategy, and unfortunately these issues will only get worse as graphs get larger. 32-bit ints can only support numbers up to around 4 billion, so past that we’d double in space from using 64-bit ints.

We can always say “just get a super computer!”. However, not everyone has access to enormous amounts of RAM. Memory prices in general have been fairly stagnant recently, and expecting users to be able to huge amounts of RAM to a single analysis isn’t super feasible.

No one said we have to use RAM, though. Large database management systems don’t load their entire systems into RAM, they leverage disk space to keep most of the files and copy things into RAM as needed. With FLP, we really only need to know the edges connected to a single node at a time, not the entire graph. The focus of this algorithm is developing an FLP implementation that doesn’t rely on lots of RAM usage.

The trouble I always have with big data algorithms is the number of issues you encounter that are just totally insignificant when not at scale. Ideally, all of our nodes will be numbered `1-n`

so that we don’t have to deal with variable-length names like strings. Think about this implementation–what issues are we going to have to handle? Here’s a list of problems I had to solve in the final implementation:

- How do we convert node names to indices? We have to iterate over the nodes and number them, but how?
- Where do we store all the node names? (Eventually we have to convert back from indices to names)
- How do we store the graph itself?
- How do we iterate over the graph’s nodes?

Most of these issues aren’t even related to the core part of the algorithm, they’re just data handling.

Reindexing the nodes is one of the problems that seemed so simple but turned out to be relatively challenging. At small scales this is fairly trivial–just insert them into your favorite variety of hashmap/set/list/whatever, discarding copies. However, this means that we’d have to keep all the string names in memory. Again, if we have a hundred million nodes labeled with strings that average around 8 characters, thats 800MB just for the that set. Prefix trees like a Trie could work, but they’re still going to have large memory consumption.

The solution I landed on looks like this:

```
dir = temporary_directory()
set_working_directory(dir)
ctr = 0
for (node_name in edges)
hash = hash_string(node_name)
if(!file.exists(hash))
file.create(hash)
hashfile = file.open(hash)
for(line in hashfile)
if(node_name in line)
goto_next_node_name()
file.append(node_name, ctr)
ctr += 1
```

Depending on how many bits our hash function is, we can generate more shallower files or fewer deeper files. This solution lets us record a bunch of files in `name index`

format while only requiring us to store the value `ctr`

across the entire algorithm.

This lets us reindex the label names, but we still need to actually record the graph structure itself. I used a pretty standard representation for this called Compressed Sparse Row (CSR) format. Essentially, each edge is stored as its end point and a weight back-to-back in the file. For 64bit indices and `double`

weights, that’s 16 bytes per edge. The first `n+1`

values store offsets that indicate where each node’s edges reside. Each of these values are also 64bit `int`

values, so eight bytes. For example, if the third value is 44 and the fourth value is 55, this means that edges originating from the node with index 3 are edges 44-54.

The nice thing about this representation is it works really well with our hashing strategy for node names. As each node is added to the map, we check if it already exists. If it does, we increment the number of edges it has by one. Otherwise, we initialize its value to 1. This means that, once we’ve hashed all our node names, we’ll also have a separate file that contains the degree of each node. By converting that to cumulative sums, we obtain the first `n+1`

values for our CSR format. A second trip through the data allows us to populate the edges list with all the relevant edges. This lets us store the entire graph structure using a total of `8(v+1) + 16n`

bytes, where `v`

the number of vertices and `n`

the number of edges. In the undirected case, we store each edge twice (the forward and reverse direction), so we’ll need `8(v+1) + 32n`

bytes. In terms of average node degree `d`

, that’s `(8+32d)(v) + 8`

byte for undirected graphs. For our previous example of two hundred million nodes and average node degree of four, we’ll need about 27GB total. That’s a lot of memory, but it’s not a ton of disk space.

The last dilemma is iterating over the nodes. Traditionally we’d just use a queue, but a linked list with two hundred million entries is…not the best. I instead used a pair of files stored on disk for each queue. For each iteration, we dequeue from our first file and enqueue to the second. At the end of the iteration, we swap which file we enqueue to and dequeue from. This ensures that we’re always performing read/writes sequentially to optimize cache efficiency, and keeps our queue size small. We can also optimize our enqueue operation by keeping a third file with just a big bitfield, such that the n’th bit is 1 if that index is already in the queue, and 0 otherwise. At each node, we can easily look up its edges in the CSR file, determine what its new cluster should be, and then write it to the cluster file. Initializing the queue can be done randomly using the “inside-out” Fisher-Yates shuffle.

All the code for this can be found on the GitHub repo for this project.
**This is not a finished build**. There’s a lot I still need to optimize and refine; the goal was just to get a prototype working. The biggest things I need to change are switching to `mmap`

from `fseek`

strategies–`mmap`

offers better performance for random read/write to a file, but unfortunately it’s not cross-platform compatible. While there are workarounds, but I’d rather not put that burden on the user…better to just figure out how `MapViewOfFile`

in `windows.h`

works and use preprocessor directives to switch between depending on the platform.

Scalability is about what you’d expect from the theoretical analysis. In some preliminary testing, my implementation scales linearly with respect to the number of nodes and number of edges. Performance is basically identical to my in-memory FLP implementation. Memory overhead is extremely low, but the cost for that is that we rely heavily on read/writes to disk, so runtime is pretty slow. My SSD-based machine could do an undirected graph with 10,000 nodes and 100,000 edges in 0.04s in-memory and just short of 70s out-of-memory. There’s definitely room for improvement! Most of the runtime is in the initial pass of reading in and indexing the nodes–the clustering itself goes very fast. Some anecdotal evidence: on an undirected graph of 2,000 nodes and 20,000 edges, reading in the nodes took 2x as long as reading in all the edges, and the clustering took less than a fifth of the time of both node and edge reading combined. I’ll probably optimize this later by changing the data structure used to store the node labels (like a trie or something).

Anyway, thanks for reading.

]]>There are a bunch of steps involved, but before that, I’m going to address the most obvious question people will
ask: why? There already exists the `randomForest`

package for R that does a great job at implementing Random Forests,
why reinvent the wheel? The answer is twofold.

First, my lab puts a high premium on software that doesn’t have
external dependencies. My doctoral work partly depends on using random forest predictors, and we’d like to avoid
having to rely on the `randomForest`

package. Controlling all your dependencies means you know the codebase better,
have full control on updating it, and you can ensure that it is fully interoperable with your own code. Obviously
that isn’t always feasible, but in this case it is.

The second reason is a lot simpler: it’s a great learning experience, and a great excuse to finally use the Fortran skills I’ve been practicing over winter break.

Note to anyone planning to seriously follow this writeup: I use R, Fortran, and C together for my implementation.
If you’re unfamiliar with R internals or the various ways to interface between R and Fortran/C, you may have trouble
following some sections. I recommend looking at what is essentially the bible of R programming,
Writing R Extensions, for a comprehensive description on things
like `.Call`

syntax and Fortran interfaces. If you’re interested in a more comprehensive description of Random Forests,
check out the original publication by Leo Breiman.

**This will be a work in progress until I finish my implementation**

You can check out the current codebase at https://github.com/ahl27/machineRy. If you have suggestions for improvements, feel free to contact me or open an issue on GitHub.

This is a big project, and I’m looking for a good, robust implementation. For things like this, the best first step is to plan out what you’re going to implement based on your priorities. I usually start by defining my priorities and the steps involved, and figure out which languages I’m going to use for what based on those two things together.

For myself in this project, the priorities were as follows:

- It should work, at least for classification. No crashing R sessions.
- Runtime and accuracy should be similar to
`randomForest`

. Lower runtime is acceptable if accuracy is higher. - User experience should be the same as other models in R (e.g., call with something like
`rf(response ~ ., data=data)`

- Models should be able to be saved/loaded in R.

Now, for the steps involved. Random Forests are a relatively simple algorithm that consist of a train and test phase.
For a set of input data, we construct `n`

decision trees. Each tree is constructed using bootstrapped sample of the data
(sample some amount of rows with replacement). Thus, we need to be able to do the following things:

- Correctly recognize what the user is asking for in R (parse
`formula`

objects) - Partition input data to improve prediction accuracy (i.e., a decision tree node)
- Do (2) a bunch of times to make a decision tree
- Do (3) a bunch of times to make a forest
- Save (4) in such a way that it persists in R and can be saved without taking tons of memory
- Make predictions using (5)

Under the constraints and priorities I have, the languages to use were pretty clear. First, the whole solution has to be R-compatible, so the only languages available to me are R, C, and Fortran. (1) will necessarily be in R, since the user interface will be R-exposed. (3-4) are major computational components and require tree structures, so C is a natural choice. (5) is likely going to be a combination of C and R, since R-Fortran interfaces are now generally discouraged. (6) is also going to be an R/C combo, since there needs to be an R interface but it still relies on tree structures (tree structures are fast and simple to implement in C but challenging in R). That just leaves (2), which is the most numerically intensive operation. For this, I’m going to rely on Fortran, since it does a good job with numerical calculations and is much easier for me to debug than C.

The last step is figuring out how to actually implement it. I’d like to test early and often, so ideally I develop in such a way that I start with self-contained components that can be tested in a vacuum. I also like to start with the toughest components to get them out of the way. I ended up on the following order of application:

- Determine an R-compatible way to represent decision trees, and write code to read/write between R and C [C, some R]
- Make decision tree nodes for classification [Fortran, some C]
- Make a decision tree [C]
- Make random forests [R, some C]
- Figure out the
`formula`

syntax to train a decision tree [R] - Figure out the
`formula`

syntax to make predictions with a decision tree [R] - Extend to full random forests
- Make decision tree nodes for regression [Fortran, some C]
- Double check
`randomForest`

for features I may be missing - Optimize existing implementations

That being said, let’s jump into it!

Determining *how* we’re going to represent decision trees is an important point, because everything else
depends on it. If my structure is bad, then I’ll have to refactor significant amounts of code down the line. Looking
at the priorities of the project, we want it to be a robust implementation with models that can be saved/loaded and good
optimization in terms of memory usage and runtime.

Because of this, there are a few things we **cannot** do. The naive approach is just to build trees directly in R. However,
memory allocation and garbage collection in R is slow, and R doesn’t have great support for tree structures. End users don’t
really need direct access to each decision tree, so it’s okay if we obfuscate the internal model in something like C.

However, C presents its own challenges. While it is possible to save a pointer to a C object within R, these objects are super finicky. These “External Pointer” objects do not copy their values, and they cannot be saved across R sessions. That means that if your R session ever restarts (or if you try to save off the object), the external pointer will be garbage.

I settled on a mixed approach. We’ll have an R object that saves a compressed version of all the data required to reconstruct a decision tree, along with an external pointer object. Whenever we use this R object in C, we can check if the external pointer is a real object, or if its become garbage. If it is garbage, then we just reconstruct the object in C, point the external pointer at it and carry on.

We do still have to be careful here, since as mentioned previously, it isn’t super simple to store tree structures within R. I also want to make sure we’re not saving huge objects, since we’re going to have to make hundreds of these decision trees for each random forest. If a single decision tree takes 2MB to store, a 500-tree Random Forest will be 1GB!

I ended up with the following structure. First, we have this structure in C to define a decision tree node:

```
struct DTreeNode {
struct DTreeNode *left;
struct DTreeNode *right;
double threshold;
double gini_gain;
int index;
};
typedef struct DTreeNode DTN;
```

Here `DTN`

stands for Decision Tree Node. The basic structure is just a binary tree node, with pointers to the left and right
nodes. I also have three additional variables: `index`

, which defines which column of the data we split on, `threshold`

, which
determines the value of that column to split on, and `gini_gain`

, which is the Gini gain of that split. When I eventually move
on to regression, `gini_gain`

can also hold the residual improvement.

This implementation actually allows for a very simple compressed storage in R. Internal nodes of the decision tree will always have
a nonnegative value of `index`

. We can then create leaf nodes by setting `index`

to `-1`

, and using `threshold`

to store the
prediction for that leaf node. For classification, we just cast the result to `int`

, and for regression it’s already in the correct
format. Then, we can read it out to R by traversing the tree in a breadth-first search. This allows us to store the entire tree
as three vectors (one `int`

, two `double`

), which can be compressed in R using `rle`

.

This means that, given these three vectors, we can reconstruct a decision tree by calling the following function:

```
// basic queue structure
struct DTNqueue{
struct DTNqueue *next;
DTN *ptr;
};
typedef struct DTNqueue queue;
DTN *bfs_q2tree(int *indices, double *thresholds, double *gini, int length){
// set up a queue
queue *q = malloc(sizeof(queue));
queue *end = q;
queue *tmp_q = q;
DTN *tmp, *head;
// initialize decision tree
head = initNode();
q->ptr = head;
q->next = NULL;
int i=0, cur_ind;
while(q && i<length){
// load value into queue
cur_ind = indices[i];
tmp = q->ptr;
tmp->threshold = thresholds[i];
tmp->gini_gain = gini[i];
tmp->index = cur_ind;
if(cur_ind > -1){
// add both children of the node into the queue
end->next = malloc(sizeof(queue));
end = end->next;
tmp->left = initNode();
end->ptr = tmp->left;
end->next = malloc(sizeof(queue));
end=end->next;
tmp->right = initNode();
end->ptr = tmp->right;
end->next = NULL;
}
i++;
q = q->next;
}
// free the entire queue
while(tmp_q){
q = tmp_q;
tmp_q = tmp_q->next;
free(q);
}
// return the tree
return head;
}
```

Then, we just need an R interface:

```
SEXP R_get_treeptr(SEXP VolatilePtr, SEXP INDICES, SEXP THRESHOLDS, SEXP GINIS){
// if tree exists, just return the external pointer
// note that it seems R_NilValue can be treated as an external pointer address for whatever reason
if(VolatilePtr != R_NilValue && R_ExternalPtrAddr(VolatilePtr)) return(VolatilePtr);
// otherwise, create the tree
DTN *tree = bfs_q2tree(INTEGER(INDICES), REAL(THRESHOLDS), REAL(GINIS), LENGTH(INDICES));
// using LENGTH because it makes calling the function a lot easier --
// could be optimized slightly by calculating this on the R end
int madePtr = 0;
if(VolatilePtr == R_NilValue){
// if the pointer is just NULL, we make a pointer for it
VolatilePtr = PROTECT(R_MakeExternalPtr(tree, R_NilValue, R_NilValue));
madePtr = 1;
} else {
// else just set the address of the pointer to the tree we just made
R_SetExternalPtrAddr(VolatilePtr, tree);
}
R_RegisterCFinalizerEx(VolatilePtr, (R_CFinalizer_t) R_TreeFinalizer, TRUE);
if(madePtr) UNPROTECT(1);
return VolatilePtr;
}
```

It’s really important to test functions frequently! This is a set of functions that can be directly tested–even if we can’t “learn” decision trees, we can provide these functions with a set of dummy values to make sure it’s working properly. All we need are three vectors in R, and some way to print them. My logic for printing out a decision tree is…long, so I’m just going to say “trust me, it works”. If you’re interested in the full code, you can look at it on the project website. Just to give a sense of what this looks like, here’s one of my R functions for testing these functions:

`test_bfs_q2tree`

is a C function that essentially looks like this:

```
SEXP test_bfs_q2tree(SEXP INDICES, SEXP THRESHOLDS, SEXP GINI, SEXP LEN){
// reconstruct the tree
SEXP R_ptr = PROTECT(R_get_treeptr(R_NilValue, INDICES, THRESHOLDS, GINI, LEN));
// get the tree from the R external pointer
DTN *tree = (DTN *) R_ExternalPtrAddr(R_ptr);
// print it out
printDecisionTree(tree);
// free its memory
freeDecisionTree(tree);
// unprotect the R external pointer and return
UNPROTECT(1);
return R_NilValue
}
```

At this point, I have a data structure and a way to read it to/from R. It also can be saved across multiple R sessions. The next step is creating one of the building blocks of decision trees: the nodes it contains.

There are many ways to build decision trees, but the most well-known is the CART (Classification and Regression Trees)
algorithm. I first tried to figure out what they’re doing in the `randomForest`

package, but their code is…extremely
difficult to understand. In absence of that, I instead just started implementing based on what it should theoretically
look like.

The CART algorithm for each node of a classification tree in a random forest is fairly straightforward:

- Randomly choose
`n`

variables to evaluate - For each variable, determine the split point that maximizes the Gini Gain
- Split the data on the variable/threshold combination that maximizes Gini Gain

You can see that a lot of this revolves around the “Gini Gain”, but what exactly is that? Gini Gain is derived from the Gini Impurity, which measures “how often a randomly chosen element of a set would be incorrectly labeled if it were labeled randomly and independently according to the distribution of labels in the set” (Wikipedia, see previous link). The mathematics work out very cleanly, so the expression for Gini Impurity is just:

```
1 - sum_i(p_i^2)
```

Here `p_i^2`

is the probability of each class in the training set, and `sum_i`

is the sum over all categories. Essentially,
you take one minus the sum of squared probabilities for each class. The best possible value is 0, when all elements are in the
same class.

Gini Gain is then just the Gini Impurity of the total dataset minus the weighted Gini Impurity of each branch after the split.
To illustrate this, let’s walk through an example. Suppose I have a set of three classes, `{A,B,C}`

, and my starting dataset is:

```
{A, A, A, A, B, B, C, C, C}
```

The gini impurity is then one minus the squared sum of probabilities for each class. The probabilities for `A,B,C`

are
`4/9, 2/9, 3/9`

(respectively), so the Gini impurity is:

```
1 - ((4/9)^2 + (2/9)^2 + (3/9)^2)
= 1 - (16/81 + 4/81 + 9/81)
= 1 - 29/81
= 52/81
```

That’s about equal to 0.642.

Now, let’s say that we pick some split point that divides our dataset into two groups: `{A, A, A, A, | B, B, C, C, C}`

.
The left side has all the elements from classes `A`

, and the right has all from `B,C`

. The Gini Impurity of the right and
left are then:

```
Gini(left):
1 - ((4/4)^2)
= 1 - (1^2)
= 1 - 1
= 0
Gini(right):
1 - ((2/5)^2 + (3/5)^2)
= 1 - (4/25 + 9/25)
= 1 - 13/25
= 12/25
```

Now the Gini Gain of this split is the Gini Impurity at the beginning (0.642) minus the sum of *weighted* Gini Impurities
of the left and right nodes. This weighting is done by the number of elements in each node. In this case, since we split
our set of nine elements into two sets of size four and five (resp.), the final calculation is:

```
Gini Gain:
GiniImpurity(parent) - (weightedGini(left) + weightedGini(right))
= 52/81 - ((4/9)(0) + (5/9)(13/25))
= 52/81 - (0 + 65/225)
= 52/81 - 13/45
= 143/405
```

So our Gini Gain is 143/405, which is about 0.353. By maximizing the Gini Gain, we’ll consistently pick split points that reduce the Gini Impurity as much as possible. This is because the maximizing the Gini Gain corresponds to picking split points wherein the Gini Impurity of the child nodes is smallest relative to the Gini Impurity of the parent.

To illustrate this, if we had picked a
partition that split our nodes into `{A, A, B, C, C | A, A, B, C}`

, the Gini Impurity of the child nodes would be 9/25 for the left
and 6/16 for the right. Our Gini Gain would be `52/81 - [(5/9)(9/25) + (4/9)(6/16)] = 223/810 = 111.5/810`

, which is about 0.275.
This is a worse score than our previous example, and it corresponds to a case where the elements of the set are much less well
separated.

Let’s get back to coding. The first step was making a function to calculate the Gini Impurity of a vector of classes. I chose Fortran for this, since matrix operations are a little easier to write in Fortran.

```
pure subroutine gini_imp(classes, l, nclass, o_v)
! calculate gini impurity of a given vector of classes
! variable definitions:
! classes: vector of classes (integer, 1:n)
! l: length of `classes`
! nclass: number of unique classes
! o_v: output variable
use, intrinsic :: iso_c_binding, only: c_int, c_double
implicit none
integer(c_int), intent(in) :: l, nclass
integer(c_int), intent(in) :: classes(l)
real(c_double), intent(out) :: o_v
real(c_double) :: class_counts(nclass), total
integer(c_int) :: i
if(l == 0) then
o_v = 1.0
return
end if
! tabulate number of classes
do i=1, nclass
class_counts(i) = 0.0+count(classes==i) ! cast to double for later
end do
total = sum(class_counts)
! gini impurity is 1 - (squared probabilities)
o_v = 1.0-sum((class_counts / total)**2)
end subroutine gini_imp
```

This gives us a way to calculate the Gini Impurity given a single vector. Now, we just have to apply it to a set of observations to find the optimal split point for a given variable:

```
pure subroutine find_gini_split(v, response, l, nclass, o_v, o_gini_score) bind(C, name="find_gini_split_")
! Variable declarations:
! v: vector of values to split on (numeric)
! response: classes of each entry
! l: length of v and responses
! nclass: number of unique classes
! o_v: (output) value to split on
! o_gini_score: (output) Gini Gain of split
use, intrinsic :: iso_c_binding, only: c_int, c_double
implicit none
integer(c_int), intent(in) :: l, nclass
integer(c_int), intent(in) :: response(l)
real(c_double), intent(in) :: v(l)
real(c_double), intent(out) :: o_gini_score, o_v
integer(c_int) :: i, mloc
real(c_double) :: total_gini, gains(l)
logical :: tmpmask(l)
! calculate the base gini impurity
call gini_imp(response, l, nclass, total_gini)
gains(:) = total_gini
! Calculate the gini gain for every possible split point
do concurrent(i=1:l)
tmpmask(:) = v <= v(i)
if(count(tmpmask) == l) then
gains(i) = -1.0
else
call gini_imp(pack(response, tmpmask), count(tmpmask), nclass, total_gini)
gains(i) = gains(i) - (total_gini * count(tmpmask)) / l
tmpmask(:) = .not. tmpmask
call gini_imp(pack(response, tmpmask), count(tmpmask), nclass, total_gini)
gains(i) = gains(i) - (total_gini * count(tmpmask)) / l
end if
end do
! find the best split and the gini gain of that split
gains = gains / l
mloc = maxloc(gains, dim=1)
o_v = v(mloc)
o_gini_score = gains(mloc)
end subroutine find_gini_split
```

This gives us a way to calculate the best split point for a given variable. The final step is just determining what variables to pass to Fortran from C.

```
void split_decision_node_classif(DTN *node, double *data, int *class_response,
int nrows, int ncols, int nclass, int num_to_check){
// data should always be a numeric
// response should be an int ranging from 1:n
// nclass, num_to_check are constant throughout execution of the program
// data will be a matrix stored by column (first nrows entries are col1, second are col2, etc.)
// we'll just assume that all the preprocessing is done in R, no need to fiddle with that here
// processing the SEXPs will be done separately so we can repeatedly call this internally
// setting up a random sample of ints
int *cols = malloc(sizeof(int) * ncols);
for(int i=0; i<ncols; i++) cols[i] = i;
int choice, tmp;
// shuffle the columns, use R's random number generator
// this is a Fisher-Yates shuffle, for anyone interested
GetRNGstate();
for(int i=ncols-1; i>0; i--){
choice = floor(unif_rand()*i);
tmp = cols[choice];
cols[choice] = cols[i];
cols[i] = tmp;
}
PutRNGstate();
double *results = malloc(sizeof(double) * num_to_check);
double *gini_gain = malloc(sizeof(double) * num_to_check);
double curmax = -0.5;
choice = -1;
for(int i=0; i<num_to_check; i++){
// call Fortran to find the best split point
F77_CALL(find_gini_split)(&data[nrows*cols[i]], class_response, &nrows, &nclass, &results[i], &gini_gain[i]);
if(gini_gain[i] > curmax){
choice = i;
curmax = gini_gain[i];
}
}
// assign the threshold, index, and gini gain to the node
node->threshold = results[choice];
node->index = cols[choice];
node->gini_gain = curmax;
// cleanup
free(results);
free(gini_gain);
free(cols);
return;
}
```

And now we have a way to determine a split point in the nodes. Next up is doing it a bunch of times to generate a full decision tree.

If you actually implement this and try it out, you’ll find a couple issues. First, the runtime is abysmal compared to `randomForest`

. Second, the performance is middling. Third, it only works on classification–Gini Gain isn’t really defined for regression. These aren’t huge issues, though; the priority is getting something working that we can refine later. I’ll come back to optimizing these methods once the whole algorithm is working.

The next step is building a decision tree for some input data. My goal is getting a working model, so I’m going to just focus on an easier problem subset. I’ll assume that input data are a matrix of `numeric`

values (`double`

in C), and the output is a vector of `integer`

(`int`

in C) corresponding to which class each row belongs to. This isn’t actually that much of a simplification–R’s `formula`

parsing will transform the inputs into a matrix of `numeric`

values anyway.

We already have a way to calculate a split at each node, so we need to do two more things to make a decision tree:

- Create a node object and populate it with the correct split point and index
- Split up the matrix of data according to the split point

This function is going to be called recursively, so I’m also going to be sure to call `R_CheckUserInterrupt()`

to make sure that if the user wants to stop early, it actually exits. The code here is going to be into a few parts.

First, let’s do some preliminary checking on the classes themselves. If they’re all the same class, we can just finish. Additionally, if we’re already past the maximum depth specified by the user, we should stop splitting. I’m intentionally going to check if `cur_depth == max_depth`

rather than using `cur_depth > max_depth`

so that we can use `max_depth=-1`

to have no maximum.

First, let’s call the function we wrote earlier to find a split point:

```
void learntreeclassif_helper(DTN *node, double *data, int *class_response,
int nrows, int ncols, int nclasses, int num_to_check,
int cur_depth, int max_depth, int min_nodesize){
// Error checking and stuff is going to go here
// ...
// this will assign the splitpoint and stuff into node
split_decision_node_classif(node, data, class_response,
nrows, ncols, nclasses, num_to_check);
```

That takes care of (1), so now we need to split up the data and call the function for the node’s children.

```
// How big do we need the new arrays to be?
// get the values we just found
double splitpoint = node->threshold;
int ind = node->index;
double *v = &data[nrows*ind];
// determine how many rows the data passed to left/right nodes will have
int nrow_left = 0, nrow_right=0;
for(int i=0; i<nrows; i++){
if(v[i] <= splitpoint)
nrow_left++;
else
nrow_right++;
}
// allocate space for data and classes
double *left_data = malloc(sizeof(double) * nrow_left*ncols);
double *right_data = malloc(sizeof(double) * nrow_right*ncols);
int *left_class = malloc(sizeof(int) * nrow_left);
int *right_class = malloc(sizeof(int) * nrow_right);
int ctr_l=0, ctr_r=0;
// traverse the data, adding each row to the left or right data matrices
for(int i=0; i<nrows*ncols; i++){
if(v[i%nrows] <= splitpoint){
left_data[ctr_l] = data[i];
if(ctr_l < nrow_left)
left_class[ctr_l] = class_response[i%nrows];
ctr_l++;
} else {
right_data[ctr_r] = data[i];
if(ctr_r < nrow_right)
right_class[ctr_r] = class_response[i%nrows];
ctr_r++;
}
}
// create a new left and right node, then call function recursively
DTN *left_node = initNode();
DTN *right_node = initNode();
// left node
learntreeclassif_helper(left_node, left_data, left_class, nrow_left,
ncols, nclasses, num_to_check, cur_depth+1,
max_depth, min_nodesize);
// right node
learntreeclassif_helper(right_node, right_data, right_class, nrow_right,
ncols, nclasses, num_to_check, cur_depth+1,
max_depth, min_nodesize);
node->left = left_node;
node->right = right_node;
return;
}
```

That’s the brunt of the function. If you’ve used C before, you’ll notice a bunch of problems here. First, lots of `malloc`

calls without any `free`

calls. We’ll be leaking memory all over the place. Combine that with the fact that our recursion never ends, and we’ll be crashing CPUs left and right. Since this tutorial is already extremely long, I’m going to omit the code and do what every math student dreams of: leave it as an “exercise to the reader”.

Here’s a sketch of how the function works:

- Check if we should stop recursion. More specifically, are the entries all the same class? Are we deeper than the specified max depth? Do we have fewer rows than the
`nodesize`

parameter? - Try to split the node
- Check if we made a split. If every split increases the Gini Impurity, then we can’t make any more good splits and we should just return.
- If we made a split, allocate space for the children nodes and recurse.

The other little blip is the memory (de)allocation. My solution was to free `data`

and `classes`

ASAP in the function itself. Since we copy values into `left_data`

, `right_data`

, `left_class`

, and `right_class`

, we don’t really need `data`

or `classes`

anymore. Copy the data into new containers, free the old ones, pass the new ones to the recursion.

Oh, and we need an R interface:

```
SEXP R_learn_tree_classif(SEXP DATA, SEXP NROWS, SEXP NCOLS, SEXP CLASSES, SEXP NCLASSES, SEXP TO_CHECK, SEXP MAX_DEPTH, SEXP MIN_NODESIZE){
// array input
double *data = REAL(DATA);
int *class_response = INTEGER(CLASSES);
// variable inputs
int nrows = INTEGER(NROWS)[0];
int ncols = INTEGER(NCOLS)[0];
int nclasses = INTEGER(NCLASSES)[0];
int num_to_check = INTEGER(TO_CHECK)[0];
int max_depth = INTEGER(MAX_DEPTH)[0];
int min_nodesize = INTEGER(MIN_NODESIZE)[0];
// internal vars
DTN *head = initNode();
// helper function will destroy data and class_response, so duplicate them first
double *dup_data = malloc(sizeof(double)*nrows*ncols);
int *dup_class_response = malloc(sizeof(int)*nrows);
// these do not need to be free'd -- will be free'd in the helper function
dup_data = memcpy(dup_data, data, sizeof(double)*nrows*ncols);
dup_class_response = memcpy(dup_class_response, class_response, sizeof(int)*nrows);
learntreeclassif_helper(head, dup_data, dup_class_response, nrows, ncols, nclasses,
num_to_check, 0, max_depth, min_nodesize);
// now we should have our entire tree created, and our duplicated arrays destroyed.
// Now let's export it to an R object
// these objects will be allocated in `export_internal_tree`
int *indices = NULL;
double *thresholds = NULL, *gini_gain=NULL;
int l = 0;
export_internal_tree(head, &indices, &thresholds, &gini_gain, &l);
// This is one option, I'm instead just going to register the external
// pointer right away and return it, since I think that's easier.
// Avoids a double call, and most people will predict right after
// training anyway.
// Read values back into R
SEXP R_retval = PROTECT(allocVector(VECSXP, 4));
SEXP R_indices = PROTECT(allocVector(INTSXP, l));
SEXP R_thresholds = PROTECT(allocVector(REALSXP, l));
SEXP R_gini = PROTECT(allocVector(REALSXP, l));
memcpy(INTEGER(R_indices), indices, sizeof(int)*l);
memcpy(REAL(R_thresholds), thresholds, sizeof(double)*l);
memcpy(REAL(R_gini), gini_gain, sizeof(double)*l);
free(indices);
free(thresholds);
SET_VECTOR_ELT(R_retval, 1, R_indices);
SET_VECTOR_ELT(R_retval, 2, R_thresholds);
SET_VECTOR_ELT(R_retval, 3, R_gini);
UNPROTECT(3);
// register the external pointer and then return
SEXP R_ptr = PROTECT(R_MakeExternalPtr(head, R_NilValue, R_NilValue));
// R_TreeFinalizer just deallocates all the memory allocated to the decision tree
R_RegisterCFinalizerEx(R_ptr, (R_CFinalizer_t) R_TreeFinalizer, TRUE);
SET_VECTOR_ELT(R_retval, 0, R_ptr);
UNPROTECT(2);
return(R_retval);
}
```

Yes, I’m going to combine these steps into a single section. They’re all quick, so they can be combined.

Making a random forest is pretty simple. Random forest trees are actually easier than single decision trees, since Breiman’s implementation for random forests skips pruning the trees. Thus, we basically just loop from `1:n`

for `n`

trees, and make each of them with a sample of the data. All we need to do for that is just use `sample`

in R, then pass the values to the C functions we’ve made.

I wish I could write more about `formula`

objects. The truth is, they’re kind of a black box to me. Whenever I work with formulas, I usually just adapt the first few lines of `glm()`

or `lm()`

to parse the formula objects.

```
parse_formula <- function(formula, data, weights, na.action){
## copying a lot of this from glm()
if(missing(data))
data <- environment(formula)
mf <- match.call(expand.dots=FALSE)
m <- match(c("formula", "data", "subset", "weights", "na.action"),
names(mf), 0L)
mf <- mf[c(1L, m)]
mf$drop.unused.levels <- TRUE
mf[[1]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())
if(identical(method, "model.frame"))
return(mf)
mt <- attr(mf, 'terms')
y <- model.response(mf, "any")
if(length(dim(y)) == 1L){
nm <- rownames(y)
dim(y) <- NULL
if(!is.null(nm))
names(y) <- nm
}
if(!is.empty.model(mt))
x <- model.matrix(mt, mf, contrasts)
else
x <- matrix(NA_real_,nrow(y), 0L)
weights <- as.vector(model.weights(mf))
# do other stuff
}
```

This is roughly how most of R’s `base`

code parses `formula`

objects. At the end of this function, `x`

will store the input data as a `numeric`

matrix with consistent variable names and ordering. Storing the formula will allow us to make similar data structures for predictions:

```
predict.RandForest <- function(rf, newdata=NULL, na.action=na.pass){
tt <- terms(attr(rf, 'formula'), data=newdata)
noData <- (missing(newdata) || is.null(newdata))
if(noData){
x <- model.matrix(rf)
mmDone <- TRUE
return()
} else {
Terms <- delete.response(tt)
m <- model.frame(Terms, newdata, na.action = na.action)
x <- model.matrix(Terms, m, contrasts.arg=attr(rf, 'contrasts'))
mmDone <- FALSE
}
nentries <- nrow(x)
nc <- ncol(x)
results <- matrix(0.0, nrow=nentries, ncol=length(attr(rf, "class_levels")))
colnames(results) <- attr(rf, "class_levels")
# do other stuff
}
```

Here `rf`

is our model, and the `formula`

object used to generate it is stored as `attr(rf, 'formula')`

. I included the `contrasts`

argument in case I add something that uses it later, but right now it should always be `NULL`

.

Once we’ve parsed the formula into a consistent matrix, we just call our internal functions to build the tree. Prediction is a pretty simple routine as well:

```
double predict_for_input(DTN *tree, double *data){
DTN *tmp=tree;
while(tmp->index != -1){
if(data[tmp->index] <= tmp->threshold)
tmp = tmp->left;
else
tmp = tmp->right;
}
return(tmp->threshold);
}
```

So it’s around this time that I decided I’d benchmark the runtime of my implementation against the `randomForest`

package.
It turns out that mine is…bad. For evaluating a dataset with 1000 rows, `randomForest`

took around 0.25 seconds, whereas
mine took…over 11 seconds. That’s not great.

I decided to take some time to think about this a little more. The crux of the problem is in how we determine split points. This is easily verifiable by just commenting out the logic to find Gini Gain and replacing it with constant assignments that execute (basically) instantly. How would you go about optimizing the previous implementation?

First, some insights. My program runs around 50x slower than `randomForest`

, so this isn’t just a case of using less optimized
languages or functions–the algorithm itself is worse. I looked into it a little more, and came up with the following three
problems:

`find_gini_split`

calls`gini_imp`

twice per proposed threshold and creates two vectors. That could easily be a single vector and a single function call.- The logic to find a split point just checks for the highest raw Gini Gain. However, a negative Gini Gain means that our split is actually worse than we started. We should be stopping in these cases and just set the node to be a leaf node.
- We’re checking every single value in the vector as a possible threshold.

(1-2) are big improvements already–implementing these fixes brought my runtime down to just 0.55 seconds. That’s better, but still double that of `randomForest`

. The scaling of mine is also an issue; my algorithm slows down much faster than that of `randomForest`

.

This is all (likely) because of (3). This implementation checking every possible value in the vector as a threshold is super super inefficient. To see why, here are two examples.

First, imagine the predictor is just a simple true/false value. Our vector of values will be `n`

long, with all values either 0
or 1. My implementation will check `n`

different thresholds, even though we only really need to check a single one (whether or not the value is 0).

Second, imagine the predictor is some bimodal distribution, like sum of distinct normal distributions. Let’s also assume that the modes clearly distinguish between two classes. We’ll try every possible threshold, but we could instead just check the values in the center of the modes. There will be significantly fewer thresholds between the modes, and they’ll also all be better split points than any of the values around the modes.

In essence, I’m hinting at some smarter way to traverse the space. Neural networks often use gradient descent, but we unfortunately do not have access to a closed form solution for the derivative of the Gini Gain. However, we can easily approximate it. I’m not completely sure what the output space looks like, so I’m going to use a simulated annealing approach to traverse the space. In pseudocode, this roughly looks like this:

```
current_threshold = mean(values)
temp_max = 100
current_gini = gini_imp(currentthreshold)
for i in (temp_max-1:0):
shouldUpdate = false
new_threshold = current_threshold + runif()
new_gini = gini_imp(newthreshold)
if new_gini < current_gini:
shouldUpdate = true
else:
proposal_chance = exp( (new_gini - current_gini) / (1 - ((i)/tempmax)) )
roll = runif()
if roll <= proposal_chance:
shouldUpdate = true
if shouldUpdate:
current_threshold = new_threshold
current_gini = new_gini
```

Essentially, move around the space randomly and recalculate the Gini Impurity at the new point. If the Gini Impurity is less (meaning the Gini Gain would be larger), we take that as our new estimate. If the Gini Impurity is greater (Gini Gain would be less, a worse choice), we accept it with a probability proportional to the “temperature”. The temperature isn’t a real temperature, but it’s a parameter that decreases our acceptance probability over time. The algorithm is a simulated version of metals annealing.

Using this, the accuracy of my implementation is roughly the same as `randomForest`

(if not slightly better) on classification tasks with numeric variables. For some reason the accuracy of my algorithm drops significantly when categorical variables are added. I’m not yet quite sure why that happens, but it’s next on my list to investigate.

Conclusion? But we still have so much to go!

Yep. This takes me a while to write, and I’m also not completely done with the code. Code-wise, I’ve roughly completed through (7), but this blog post is already super super long. You can check out the most current state of the code at https://github.com/ahl27/machineRy. I’m planning on updating it as I have time, so stay tuned!

]]>I’m including example code in all three languages, and then I’ll benchmark their relative performance at the end.

For all of these examples, I’m going to be writing code to sort a vector of integers using either quicksort or mergesort. The setup for these is pretty easy:

We’re preallocating the vector so that when we benchmark, we can ignore the time required to make the testing vector. If you’re unfamiliar with either sorting algorithm, they’re both divide-and-conquer algorithms. Quicksort works by picking a “pivot” and sorting the array such that all values less than the pivot come before it, and all values larger come after. We then recursively apply the same strategy to the values before and after the pivot (picking new pivots) to sort the entire array. Mergesort works by first recursively partitioning the array into smaller and smaller blocks, and then iteratively combining these blocks into sorted order. You can check out example code here, or look at illustrated examples on Wikipedia (Mergesort, Quicksort).

If you’re here, you probably already know how to write and execute R code. This is the biggest
benefit of R: it’s (by definition) the easiest type of code to write and execute within R.
I’m going to use two sorting implementations in R. The first is the built-in `sort`

method,
using `method='quick'`

to call quicksort. This is basically just calling C on the backend,
so we should expect the performance to be roughly the same as a C implementation.

Now, I’m also going to add a raw R implementation to see what the difference between the (basically C) base R implementation and typical user-written R code. I’m not guaranteeing this is the most optimal implementation, but it does get the job done.

C code can be called from R in three main ways. The first is to use R code that basically runs
C in the background, such as with built-in `base`

functions like `sort`

. However, oftentimes
there aren’t base R functions for the use-case you want. In this case, we can write our own
functions in C and call them either using `.Call`

or `.C`

. `.Call`

takes in R objects (referred
to as `SEXP`

objects) and can return a new `SEXP`

object, whereas `.C`

takes in pointers to the
underlying C data behind R code, and modifies these values in-place. Functions called with `.C`

must be `void`

, and are unable to allocate any R objects. Here’s an example of a quicksort algorithm
implemented in a C file:

```
#include <R.h>
#include <Rdefines.h>
// helper functions to be called by the main function (c_quicksort)
inline void swapval(int *vec, int i1, int i2){
int tmp = vec[i2];
vec[i2] = vec[i1];
vec[i1] = tmp;
}
void c_quicksort_helper(int *vec, int n){
if(n <= 2){
if (n==2 && vec[1] < vec[0])
swapval(vec, 0, 1);
return;
}
int pivot = n / 2;
int pivotval = vec[pivot];
int tempind = 0;
// swap out the pivot point to the final element
swapval(vec, n-1, pivot);
for(int i=0; i<n-1; i++){
if(vec[i] < pivotval){
if(i > tempind) swapval(vec, i, tempind);
tempind++;
}
}
swapval(vec, tempind, n-1);
c_quicksort_helper(vec, tempind);
c_quicksort_helper(vec+tempind+1, n-tempind-1);
}
// note that all the inputs are pointers
void c_quicksort(int *vec, int *n){
int len = n[0];
c_quicksort_helper(vec, len);
}
```

To be able to use this in R, we first have to compile it into a format
R can understand. Assuming I saved the above in a file called `cquick.c`

,
we can compile it by using `R CMD SHLIB cquick.c -o cquick.so`

on the commandline.
This will create a shared object (`.so`

) file called `cquick.so`

in the same directory.
Once we’ve done this, we can call it from R with:

C and R are pretty commonly used together, but I personally had quite a bit of trouble figuring out
how to add Fortran code to R. The initial appeal of Fortran to me is that Fortran syntax *feels* a lot
like R–it supports slice indexing, vectorized functions, all that good stuff. Unlike C, you don’t
have to worry too much about memory or managing pointers. Let’s write a quicksort routine in Fortran first:

```
subroutine fquicksort(x, n)
implicit none
integer, intent(in) :: n
integer, intent(inout) :: x(n)
call helpersort(x, n)
end subroutine fquicksort
recursive pure subroutine helpersort(x, n)
implicit none
integer, intent(in) :: n
integer, intent(inout) :: x(n)
integer :: p, pless, pgr
! return if array is less than 3 long
if(n <= 2) then
! if it's 2 long, make sure they're sorted
if (n == 2 .and. x(2) < x(1)) x(:) = x(2:1:-1)
return
end if
! pivot here is just the center of the array
p = x(n/2)
! replacing values using pack() -- returns subset of array defined by mask (always 1D)
x(:) = [pack(x, x < p), pack(x, x==p), pack(x, x > p)]
! count up how many we have less than and greater than
pless = count(x < p)
pgr = count(x > p)
! recursive calls
call helpersort(x(:pless), pless)
call helpersort(x((n-pgr+1):), pgr)
end subroutine helpersort
```

Here we have two subroutines, and if you strip away all the Fortran setup stuff (e.g., variable declaration
and function annotations), the result is pretty similar to R code. `pack(x, x<p)`

is equivalent to
`x[x<p]`

, and the rest is pretty similar to R. If you can write something in R code, it’s (to me at least)
much easier to translate it to Fortran than to R.

Now, how do we call it from R? We’re going to compile it the same way, using
`R CMD SHLIB fquick.f90 -o fquick.so`

, and then we use almost the same syntax:

There are two main differences between calling this and calling the C implementation. First, we use
`.Fortran`

instead of `.C`

–that difference should be pretty self-explanatory. Second, we provide
named arguments and get the returned value with `$x`

. It’s possible to call Fortran code in the same
way as `.C`

, using `.Fortran('fquicksort', randvec, len)[[1L]]`

, but Fortran also supports providing
named arguments. When you do, the returned list will also be named, which lets us get the returned value
using `$x`

.

Now, it’s also possible to call Fortran within C code called from R. That seems like it may be overcomplicating things, but often it can be faster to call to Fortran than to callback to R. In the event you need to do something that C doesn’t do super well, it can be useful to make small subroutine calls to Fortran within larger C functions. I’m going to provide a mergesort implementation here that’s called from C–there are a couple things to be aware of when doing this.

First, the Fortran code:

```
! Fortran subroutine to run mergesort
module fmergemod
implicit none
private
public fmergesort
contains
pure subroutine fmergesort(x, n) bind(C, name="fmerge")
! use c-compatible types
use, intrinsic :: iso_c_binding, only: c_int
implicit none
integer(c_int), intent(in) :: n
integer(c_int), intent(inout) :: x(n)
call helpermsort(x, n)
end subroutine fmergesort
recursive pure subroutine helpermsort(x, n)
use, intrinsic :: iso_c_binding
implicit none
integer(c_int), intent(in) :: n
integer(c_int), intent(inout) :: x(n)
integer(c_int) :: center, temparr(n), il, ir, j
if(n == 1) return
! get the center of current array, may not be exact
center = n / 2
! recursive call on left and right sides
call helpermsort(x(:center), center)
call helpermsort(x((center+1):), n-center)
! sort the two halves into a complete value
il = 1
ir = center+1
do j=1, n
if(il > center .or. (ir <= n .and. x(ir) <= x(il))) then
temparr(j) = x(ir)
ir = ir+1
else
temparr(j) = x(il)
il = il+1
end if
end do
x(:) = temparr
end subroutine helpermsort
end module fmergemod
```

Wrapping subroutines in modules is generally encouraged, so that’s what we’ve done here. A couple things to note:

- We have to use C-compatible types, which we do with the
`iso_c_binding`

intrinsic. Integer variables are marked with`integer(c_int)`

(other C types are supported in ways you’d expect, e.g.,`c_double`

). - We have to make sure the function we’ll call from C is marked as
`public`

. - We specify that C-callable functions are bound to C using
`bind(C, name="xyz")`

. This makes the function available to C and callable as`xyz()`

.

Now we have to call it from C. This code is much shorter:

```
#include <R.h>
#include <Rdefines.h>
extern void fmerge(int *x, int *n);
SEXP run_fmerge(SEXP VEC, SEXP LEN){
int *v = INTEGER(VEC);
int *l = INTEGER(LEN);
// note that we can only pass POINTERS
// passing int instead of int * will break
fmerge(v, l);
return(VEC);
}
void better_fmerge(int *vec, int *n){
fmerge(vec, n);
}
```

I’ve included both an example of using the `.Call`

syntax (`run_fmerge`

) and the `.C`

syntax (`better_fmerge`

).
`.C`

is a better approach for this specific example, but using `.Call`

is much more common. Note that we’ve
protyped the Fortran function using `extern void fmerge(...)`

, and made sure that all the arguments to our
Fortran function are pointers. The function itself is defined within Fortran, so we have to compile them
together. Let’s assume they’re called `fmerge.f90`

and `fmerge.c`

, then we make our shared library with
`R CMD SHLIB fmerge.f90 fmerge.c -o fmerge.so`

. Once we’ve done that, calling it from R is the same as the C example:

The real question is, how do they compare? We have five total functions: two in R, one in C, and
two in Fortran. Let’s benchmark them with the `microbenchmark`

package (truncating results to just
the median and mean):

Now, these aren’t all the same algorithm, but they are relatively comparable in algorthmic complexity. Let’s break down the results, focusing on the median runtimes.

First, `c_quick`

and `r_base`

are almost identical. As mentioned before, R’s `sort()`

function is basically just C anyway, so it’s not surprising that these are about the
same. `c_quick`

outperforms `r_base`

because it does less error checking and R callbacks,
which save time but make the code less robust.
Both flavors of Fortran (`fortran_quick`

, called from R, and `cfortran_merge`

,
called from C from R) perform about the same, around 3-6x slower than the C implementation.
Finally, the strict R implementation `r_quick`

is by far the slowest, clocking in at roughly
10x slower than the Fortran methods and about 30-50x slower than the C implementations.

After writing this, I found older, much more comprehensive benchmarks that roughly parallel these results: good C code executes roughly 3-4x faster than good Fortran code. You can see lots of languages benchmarked against each other here. Fortran was dropped from the benchmark in 2012, so the newest results are only from 2011.

All this is to say: **C is definitely the fastest, and Fortran is relatively close.**
If you need ultra-high performant code, C is likely your best option. However,
if you’re writing code that’s tough to write in C, you can strike a nice balance
between performance and ease of coding using Fortran. If you’re a Fortran wizard
and can optimize your code more, you can probably approach C performance. I won’t
pretend that these are heavily optimized implementations, but they’re good enough
to get rough estimates of relative performances.

I feel like people tend to stick too much to the languages they’re familiar with. Having a diverse set of tools in your toolbox gives you more options to attack problems with. Have proficiency in Fortran, C, and R together gives you the ability to maximize runtime without sacrificing too much readibility. Give Fortran a try, you may find that it’s easier than it seems. Happy Coding!

]]>This effect is especially pronounced in languages with significantly different paradigms. For example, while there are certainly differences between C and C++, their core features are very similar. In contrast, a Haskell specialist will likely have a much different approach to problem solving than a C++ veteran. It’s these languages that introduce totally new ways to write code that really interest me, which brings me to Forth.

If you’ve never heard of Forth, it’s an interesting little language. Like Python, Forth is an interpreted language with a REPL and no enforcement of variable types. Like C, Forth gives you unfettered direct access to your system’s memory, and like Java, it compiles your code on the fly to optimize runtime. Oh, and did I mention it doesn’t have variables? (sort of)

Forth was also released prior to C in 1970. Since then, it has powered several spaceflight missions, the Open Firmware boot system, and many other projects. It also happens to double as an operating system, and is one of the most lightweight languages out there. A complete Forth system (OS, compiler, interpreter, and space for user data) could fit in memory on an 8-bit system. That’s less than 64KB!

Programming languages are typically classified by their features. For example, you’re probably heard of object-oriented programming languages, like Java or Python. Forth falls into a unique little realm of programming languages called “stack-oriented” languages. This is because everything Forth does revolves around a single stack—all your values get pushed to a single stack, and all functions work on that same stack.

Hopefully at this point, I’ve given a sufficiently good exposition of Forth to make you at least moderately interested in it. I mentioned earlier that using different programming languages helps change how your approach problems. However, one of the largest barriers for me to work in new languages is the entry point. Working in a new language is often frustrating, and experienced programmers usually have a workflow they’re used to for their normal languages.

That’s what I sought to address with the `froth`

package. Rather than making you go out and figure out how to download, install, and start working in Forth, `froth`

lets you work directly in a Forth-like environment from R. You can use all the aspects of R you’re used to (RStudio, R objects, the R profiler, etc.) but still try out working in Forth!

You can download the `froth`

package from CRAN with

```
install.packages("froth")
```

From there, you can enter a Forth environment by just running `froth()`

, and you can quit the environment by typing `exit`

or `quit`

.

I’ve shown you how to start a Forth environment in R, but not actually what to do with it. Like I mentioned, Forth is a complete programming language with its own syntax distinct from R. My implementation of `froth`

doesn’t have all the features of full Forth systems, but it has enough to get you started.

The go-to resource for Forth is the *Starting Forth* textbook. To help you get started, I’ve adapted this book into a tutorial for getting started with `froth`

, which is available on the the package website. I’ll leave you with the go-to first program for any new programming language:

```
> library(froth)
> froth()
fr> ." Hello, world!\n"
Hello, world!
ok.
```

Good luck in your Forth journey!

]]>`dendrapply`

function. After some initial feedback from R-devel, I was encouraged to apply for the R Project Sprint at the University of Warwick in the UK.
What an incredible experience! I had the opportunity to meet many members of R Core, and got to show my code to people much smarter than me. Most notably, I had the chance to show my `dendrapply`

implementation to the original author of the function, Martin Maechler. I’ve updated my code significantly since then, so I thought it would be a good idea to walk through how people can contribute to R. If you’re just interested in the current state of `dendrapply`

, you can check it out on Bugzilla.

The top advice for people learning a new language is to immerse yourself in it. Surrounding yourself with examples of a new language is incredibly helpful for learning, especially if those examples come from native speakers. The same is undoubtedly true of R; immersing yourself in the R development community helps learn the ins and outs of contribution.

There exist a number of R mailing lists (see here). The most important of these is arguably `R-devel`

, in which developers discuss contributions, bugs, and changes to the main R codebase. This mailing list gets quite a bit of traffic from the R-core team, as well as other excellent developers in the community.

The other main source of information is the codebase itself. R is maintained on subversion, but a mirror of the current codebase is available on GitHub. On there you can see all the internals of R…though if you’re like me, that’s probably quite a lot to take in all at once. Reasonbly good explanations for how the internals work can be found in the Writing R Extensions guide, as well as Hadley Wickham’s guide.

Okay, now we know a little bit about the internals of R. However, the codebase is still enourmous; how do we go about identifying *how* a particular part of R works?

If we’re looking at R code, the process is relatively simple. From within R, you can typically view the source code for a given method by just typing the name of the method without `()`

. For example:

```
> lapply
function (X, FUN, ...)
{
FUN <- match.fun(FUN)
if (!is.vector(X) || is.object(X))
X <- as.list(X)
.Internal(lapply(X, FUN))
}
<bytecode: 0x11d0e6798>
<environment: namespace:base>
```

In this case, typing `lapply`

returns the internal definition of the function, as well as the namespace it lives in (in this case, `base`

). Most of the builtin R namespaces can be found within `/src/library`

–for example, `base`

is found in `/src/library/base`

, and the code for `lapply`

is found in `/src/library/base/R/lapply.R`

.

Unfortunately, `lapply`

calls `.Internal(lapply(X, FUN))`

, which means there’s also some C code to be analyzed. C code can be tricker to track down, but most of it is found in `/src/main`

. On the command line, we can use `grep`

to search for lines within files. Let’s try looking for the source code for `lapply`

:

```
bash$ grep -n -e "lapply" src/main/*
src/main/apply.c:33:/* .Internal(lapply(X, FUN)) */
src/main/apply.c:41:attribute_hidden SEXP do_lapply(SEXP call, SEXP op, SEXP args, SEXP rho)
src/main/builtin.c:1047: /* There is a complication: if called from lapply
src/main/coerce.c:1977: necessarily correct, e.g. when called from lapply() */
src/main/deparse.c:918: // := structure(lapply(slotNms, slot, object=s), names=slotNms)
src/main/envir.c:3023: * Equivalent to lapply(as.list(env, all.names=all.names), FUN, ...)
src/main/logic.c:479: One exception is perhaps the result of lapply, but
src/main/names.c:658:{"lapply", do_lapply, 0, 10, 2, {PP_FUNCALL, PREC_FN, 0}},
src/main/saveload.c:2314: val <- lapply(list, get, envir = envir)
src/main/saveload.c:2318: Unfortunately, this will result in too much duplication in the lapply
```

`-n`

tells `grep`

to return line numbers, and `-e`

tells it to use a regular expression. You can also use the `-r`

flag to search recursively through directories.

In this case, we get a bunch of values returned. Most of them are comments, but two of these lines are important. `src/main/names.c:658:...`

tells us that, on line 658 of `src/main/names.c`

, the call `.Internal(lapply(...))`

calls `do_lapply`

. It just so happens that the definition of `do_lapply`

can be found on line 41 of `src/main/apply.c`

, as shown by our `grep`

call. If we wanted to investigate further, we could open those files and look at what’s going on internally.

Now we sort of understand R and can find the source code for a given function. The process for fixing bugs looks like the following:

- Make sure the bug is actually a bug
- Make absolutely sure the bug is real
- Make sure you can reproduce the bug on current R with a clean environment
- Do 1-3 again
- Check with the community to make sure you did (1-3) right
- Report the bug on Bugzilla, potentially with a patch

If it isn’t apparent, be really sure you’ve actually found a bug and haven’t just done something dumb (like the one time I accidentally overwrote the value of `c`

). If you’re unsure, you can ask on the R Contributors Slack channel or email `r-devel`

. Once you have a bug and can reproduce it, use Bugzilla to report the bug along with a minimal working reproduction of the problem.

If you’ve cloned the `r-source`

repository, you can create a patch by first changing the codebase, then using `git diff upstream/master > PatchFile.diff`

. This will create a patch file called `PatchFile.diff`

with your changes. Before submitting, make sure to test your patch! The easiest way to do this is with Docker, though if you have a Linux machine you should be able to do this without spinning up a VM. You can download the most up-to-date R source and apply your patch with:

```
svn checkout https://svn.r-project.org/R/trunk`
cd trunk
svn patch /path/to/PatchFile.diff
./configure --with-readline=no --with-x=no --without-recommended-packages && make
./bin/R
```

The second to last line builds R, and the last one runs your (patched) version of R. This often requires a lot of dependencies; you can use a prebuilt docker container I’ve created that has them all installed with these commandline commands:

```
docker pull peiple/r-sandbox
docker run -it --rm -v /PATH/TO/LOCAL/PATCH/FILE:/UserData peiple/r-sandbox
```

This will drop you in a Linux VM with all dependencies already installed, and then you can `svn checkout`

and `svn patch`

as normal.

`dendrapply`

`dendrapply`

has been updated and uploaded to Bugzilla (see here). This version has a lot of bugfixes as compared to the code uploaded in my original blog post, and should be ready for integration into R. Since it’s a relatively large change, the review process takes a while–I’m talking with R-core about what the best decision is regarding `dendrapply`

. We may instead end up refactoring it into an R implementation (without C code) if some proposed changes from Luke Tierney are integrated. TBD!

`dendrogram`

objects, which are essentially a series of nested lists. Each “node” of the tree is a list with multiple members (two if a binary tree, but `dendrogram`

objects are not constrained to be binary), each of which is another `dendrogram`

object. The leaves are special cases in that they have length 1 and an additional property `leaf`

, which is set to `TRUE`

.
R has a number of functions called `apply`

functions, whose primary purpose is to apply a function to a set of things in a particular way. Commonly used examples are `lapply`

to apply a function over a list-like object, `tapply`

to apply a function to a set of objects grouped based on a factor, `rapply`

to recursively apply a function to an object, or `apply`

to apply a function to a matrix/array.

`dendrapply`

is a special type of `apply`

statement intended specifically for applying functions to `dendrogram`

objects. The main requirement is to recursively apply a function to each node of the tree. While this may sound like a task for `rapply`

, `rapply`

is more intended to apply a function to non-list elements of a nested list, while `dendrapply`

is intended to apply the function to all nodes in the list. This means that `dendrapply`

specializes in applying the function to internal nodes of the tree, whereas `rapply`

applies the function only to the leaves (and doesn’t always preserve the original structure).

`dendrapply`

is currently implemented recursively, which has led to some users experiencing issues from stack overflows resulting from deep recursion on trees with many nodes (at least in users’ reports on `DECIPHER`

). Additionally, the recursive implementation makes the function slow for trees with many internal nodes. `rapply`

avoids this issue by factoring out the recursion internally, which gave me the idea to try to implement an optimized version of `dendrapply`

.

My focus for this implementation was removing the recursive calls in `dendrapply`

. The final implementation is only a modest runtime improvement, but this can hopefully be optimized prior to the final release.

The full implementation is written in C, which makes a lot of things easier at the cost of a few things becoming really hard. The algorithm proceeds as follows:

```
copytree = copy(input_dend)
initialize LinkedList
add root node to LinkedList
ptr = head(LinkedList)
while ptr is not NULL:
if node has children:
for child in ptr.node.children:
insert child in next position
else:
while(node has no unevaluated children):
apply function to node
merge node into parent
node = parent
ptr = parent.ptr
ptr = ptr.next
return head(LinkedList).node
```

It’s a little hard to write out, so I’ll use a small example to showcase. Imagine we have a dendrogram with two leaves, and we want to apply a function `f`

to each node `a,b,c`

.

```
Tree:
a
/ \
b c
Initialize LL:
HEAD -> NULL
Add root node:
HEAD -> a -> NULL
Set ptr to head:
HEAD -> a* -> NULL
Iterate over linked list:
HEAD -> a* -> c -> NULL (insert child, back to front to ensure correct traversal)
HEAD -> a* -> b -> c -> NULL (insert child)
HEAD -> a -> b* -> c -> NULL (increment pointer)
HEAD -> a -> B* -> c -> NULL (b has no children; apply f, f(b)=B)
HEAD -> aB* -> c -> NULL (merge B into parent)
HEAD -> aB -> c* -> NULL (a has an unevaluated child, so increment pointer)
HEAD -> aB -> C* -> NULL (apply f to c, f(c) = C)
HEAD -> aBC* -> NULL (merge C into parent)
HEAD -> ABC* -> NULL (a has no unevaluated children so apply f, f(a)=A)
HEAD -> ABC -> NULL* ( done, return ABC )
```

This implementation achieves the same result as a recursive operation, but doesn’t run the risk of overflowing the stack for deep trees. Note that this algorithm is a post-order traversal; pre-order is the default and is discussed later.

There are a few things about the implementation of this that make things difficult. First, the R garbage collector likes to collect R objects while they’re being worked on in C. We can get around this by using `PROTECT()`

to tell the garbage collector not to touch the object, but the protection stack has a fixed size that is extremely small (10,000 items by default). The tree as a whole can be protected when it’s first loaded, but calling `f()`

on a node of the tree produces a new R object that is unprotected, so we have to protect the result.

The naive approach is to just put all the nodes into a linked list, apply the function to every node, then rebuild the tree. Unfortunately, this rapidly exhausts the available space we have on the protection stack. Instead, we have to work slightly smarter.

We can get around this issue by immediately assigning the value of `f(node)`

to the parent node. If the parent node is protected, the value assigned to it will be protected as well. This means that, as long as we ensure the parent is always protected, we don’t need any extraneous protection calls. Some clever shuffling can ensure that we protect everything under the first `PROTECT`

call on the entire tree.

I’ve implemented two methods for `dendrapply`

: a pre-order traversal and a post-order traversal. The final implementation for both uses a maximum of 3 slots on the protection stack, and no recursive function calls. The original tree consumes the first `PROTECT`

call, which protects all its children until they’re modified. When each node is evaluated, we use a `PROTECT`

call to create the R expression to be called and a second to protect the `SEXP`

returned from the function. This value is then assigned to the parent node using `SET_VECTOR_ELT`

, which protects the value. Since protection is by value and not by reference, we can safely store this new value in the linked list. Applying the function to the root can safely be done by using `REPROTECT()`

to preserve protection on children.

The result is a little funky in the pre-order case: we use `VECTOR_ELT`

on the parent to get the node, call the function on it, replace the node using `SET_VECTOR_ELT`

on the parent, and then populate the reference in the linked list by calling `VECTOR_ELT`

on the parent a second time. This ensures we always have protected values.

With a post-order traversal, the implementation is a little cleaner. Since the children of each node are always evaluated prior to the node itself, we don’t need to recall `VECTOR_ELT`

after calling the function. Instead, we can just merge the nodes and continue.

The main difference for end users between the two traversals is that, for a given node `n`

, pre-order traversal will always evaluate `f(n)`

*before any of its children*, whereas post-order traversal will always evaluate `f(n)`

*after all of its children*. Pre-order is the default for backwards compatibility with the original `stats::dendrapply`

, which used a pre-order traversal. The post-order method allows for some new functions, such as the following:

This function assigns a new attribute equal to the label if it’s a leaf, or the concatenation of the child nodes’ new attributes if it’s an internal node. The default application of `dendrapply`

will only create new attributes for the leaves, and will return `character(0)`

for any internal nodes (since the children won’t have had their new attribute set yet). However, using `how='post.order'`

will ensure we evaluate the children first, meaning that internal nodes will be assigned a non-empty value:

This capability is something I have found myself wishing for often in `dendrapply`

. Calculating Fitch Parsimony for a phylogeny is a great example of a method that relies upon a post-order traversal.

Another weird thing I had to contend with is ensuring that the functionality *exactly* replicates that of the current `dendrapply`

so as not to break existing packages. I’m reminded of this tweet:

Users tend to do things with code the developers never expect, which is exactly how I found myself failing all build checks repeatedly when trying to integrate my code into my fork of the R codebase. It turns out, at one point, a test was added to ensure `dendrapply`

performs expectedly when doing the following operation:

```
> D <- as.dendrogram(hclust(dist(cbind(setNames(c(0,1,4), LETTERS[1:3])))))
> dendrapply(D, labels)
[[1]]
[1] "C"
[[2]]
[[2]][[1]]
[1] "A"
[[2]][[2]]
[1] "B"
[[3]]
[1] "C"
```

“Expectedly” means producing the output shown. This was surprising to me, because this dendrogram is fully bifurcating. Additionally, `labels(D)`

produces `c("C", "A", "B")`

, so it’s not even that we performed `labels()`

on the root node and then replaced the first and second entries with the result of their children. It turns out that, due to the way the original `dendrapply`

was coded, the results of each operation replicate if they’re not large enough to fill the list generated by the parent. This means that, after applying `labels()`

to the root node, we get a `list`

with 3 elements. Then, when we apply `labels`

to the children, we get two new values. The parent list has three values, so we replicate those values as many times as necessary. In this case, the result of the children are the first two entries shown above. Since the list has length 3, we replicate the first element to the third location, producing the output given.

According to the tests, many CRAN packages rely on this bug working as “expected”. This required a bit of a code refactor, but I did manage to get my C code to replicate values across the parent array to accurately reproduce this bug. Knowing how software development is, I’m sure that I’ll have to do something similar for future bugs people are relying on…but that’s an issue for another day.

A inorder traversal is definitely possible, but I’m not sure if it’s worth it. The implementation would use the almost same code as the post-order case, although nodes would be added by inserting the right element next and the left element at the end of the list. In-order traversal on a multifurcating tree is defined as evaluating all but the last child node, then the node, then the final (rightmost) child. I can’t think of a good use case for this type of traversal, especially for multifurcating trees.

Breadth-first traversals are also easily implementable by making elements insert at the end of the linked list rather than at the next position. However, the behavior of these can be a little odd with `dendrogram`

objects (the result is fairly counterintuitive to me at least). These may be worth exploring as options in the future.

Something I would like to implement is a “flat” application of `dendrapply`

, similar to the flexibility offered in `rapply`

. Providing an option to get the results as a flat list/vector could have very good usecases. To illustrate what this would look like, imagine the following tree with shown labels:

```
a
/ \
b c
/ \ / \
e f g h
```

I’d like the function to be able to do something like:

```
> dendrapply(exTree, \(x) attr(x, 'label'), how='post.order', flatten=TRUE)
[1] "e" "f" "b" "g" "h" "c" "a"
> dendrapply(exTree, \(x) attr(x, 'label'), how='pre.order', flatten=TRUE)
[1] "a" "b" "e" "f" "c" "g" "h"
```

Note that this is different from `dendrapply`

in that the result is a flat vector and not a nested list, and different from `rapply`

in that the result is the function applied to leaves *and* internal nodes. Adding breadth-first and in-order traversals would likely be more useful for this kind of function than for the standard `dendrapply`

.

Speed gains from this implementation are relatively modest, although I suspect that further optimization could improve runtime. As the main improvement of this is in the backend and not the R function calls themselves, it should have relatively consistent performance regardless of the input function. Testing was performed on a simple function to add an attribute to each node, as well as a recursive one that calls `rapply`

at every node. Speedup is relatively consistent regardless of function, with the average boost in runtime approximately 1.5-3x on my machine (2021 MacBook Pro, M1 Pro, 32GB RAM). Calling `rapply`

had less of a speedup compared to faster functions, likely because the runtime of the called R function dominates the overall runtime of the `dendrapply`

call. Benchmarking using a minimal function to measure only the impact of the new apply function resulted in an average speedup of 2.5-3x depending on tree size. Looking at the memory usage of the functions, my new implementation has significantly decreased usage due to fewer function call frames allocated on the stack. However, this is difficult to benchmark since the original function uses R to allocate memory and the new function allocates in C.

Compatibility with the previous version of `dendrapply`

was tested against the unit tests available in `dendextend`

, one of the largest packages that makes extensive usage of `stats::dendrapply`

in a variety of scenarios. The pre-order traversal version passed all unit tests.

Post-order failed many tests because methods in `dendextend`

depend on the pre-order traversal order–this is why pre-order is the default setting.

The complete code is available on Github. What follows are some comments on the code contained within the files.

This has some quirks to make it a drop-in replacement for `stats::dendrapply`

. The behavior of the original function when the provided function does not return a `dendrogram`

or `list`

-like object is a little counterintuitive to me, but after lots of testing this implementation should replicate it all accurately. There is a weird quirk where calling `VECTOR_ELT`

on an `SEXP`

seems to unclass the object, and I was running into problems with the input nodes not being of type `dendrogram`

. I thought it was fairly safe to just reclass the object as `'dendrogram'`

when it comes to the function, since we expect to be applying the function to `dendrogram`

objects anyway (and children of a `dendrogram`

should also be `dendrogram`

).

The C code is much longer. The main functions exposed to R are `do_dendrapply()`

via `.Call`

interface, and `free_dendrapply_list()`

via `.C`

interface. The brunt of the computation is done in `new_apply_dend_func`

, which will likely be either renamed in the future or rolled into `do_dendrapply()`

. Note that checking for leaf nodes depends on the leaves having correct values for `attr(node, 'leaf')`

; if someone messes with the nodes it could perform unexpectedly. The current implementation defines leaves as nodes such that `attr(node, 'leaf')==TRUE`

and non-leaves as `is.null(attr(node, 'leaf')) || attr(node,'leaf')==FALSE`

. It’s unfortunately not possible to just check for nodes with length 1, since we should be able to support dendrograms with arbitrary numbers of children at internal nodes (whether that’s 1, 2, or many). I regard this error as being in the same category as users being able to change S3 classes to arbitrary values–it’s known and documented that things could go wrong if users mess with the values, and those that choose to modify the values are proceeding at their own risk. I’m not completely sure what the function will do in this scenario–I believe it will probably run infinitely until the C memory limit is exceeded. I’ve added enough checks to ensure that the function execution is always rescuable via user interrupts, so I think this is overall okay.

`RMB0`

, `SMB0`

, `BBR0`

, `BBS0`

). I’ve also updated the GUI to graphically iterate through instructions when `(r)un`

is input, meaning you can set up an infinite loop and watch it iterate through. The iteration executes at the same speed the computer normally would (determined by the clock speed), so you can watch it step through programs at slow speeds if you’d like.
At this point, I’m considering the project tentatively done for the near future. There are some things I’d like to improve eventually, such as adding support for ROM and some other external chips, but those are lower priority than some other projects I have in my queue. I’m going to be moving on to focus on my FORTH interpreter and starting to get into making contributions to the Biostrings R package. Maybe eventually I’ll have time to come back to this emulator.

As always, you can check out the complete code on Github. If you try it out and find any bugs, please let me know!

]]>Most programmers have probably heard of the Curses library, but if you haven’t, let me be the first to say: it’s an awesome package of code. Curses (a pun on “cursor optimization”) allows people to easily design terminal-based GUIs that are cross-compatible with almost any terminal type. If you’ve ever used `nano`

before, everything you see is made using `curses`

. The current version of `curses`

is `ncurses`

(short for *new* curses), and after seeing some examples of it in action I was super excited to start writing a GUI.

As in my last post, I’m not going to go over the code line-by-line, but I will include some code sketches and images.

`curses`

works by handling all the nitty-gritty of printing to a specific window, and allows the programmer to focus on what they actually want to see displayed. Once a `WINDOW`

object is initialized, printing to it is very similar to normal I/O from C. The basic outline of an interactive GUI looks like this:

```
#include <ncurses.h>
void interactive_gui(){
// initialize window
// If you don't give it a height/width,
// it defaults to the entire screen
WINDOW *mem_win;
bool quit = false;
// Initialize the screen
initscr();
clear();
noecho();
cbreak();
curs_set(0); // turn off cursor
// create a subwindow within the screen
mem_win = newwin(0, 0, 0, 0);
// refresh it once so we can start displaying content
wrefresh(mem_win);
// This function prints out whatever we want
print_page_function();
// main loop to get user input then react
while(!quit){
c = wgetch(mem_win);
switch(c){
case 'q':
quit = true;
break;
// Other keycodes can go here
default:
break;
}
print_page_function();
}
// general cleanup
// calling endwin() is really important
// because it GC's everything allocated for the window
// not calling it has messed up my terminal
clrtoeol();
refresh();
endwin();
return;
}
void print_page_function(){
// print at a specific coordinate
mvprintw(0, 0, "Printing stuff from coordinates 0,0!\n");
// print from wherever the cursor is
printw("\n\nPress 'q' to quit.");
printw("\n> ");
refresh();
return;
}
```

Essentially, the only change from regular `printf`

calls is we first create a window and subwindow, then we use `printw`

and `mvprintw`

to print to it. `wgetch`

will wait until a character is received, and then we can `switch...case`

on it to check for certain inputs. `curses`

also implements special macros for certain keycodes, such as `KEY_UP`

or `KEY_DOWN`

for the up/down arrows (respectively). Since `wgetch`

only reads a character at a time, if you want to read in entire strings you’ll need to allocate a character buffer, read to it on character inputs, and then process the whole string at once on `KEY_ENTER`

.

GUIs always end up taking tons of time for me because there’s always some tiny aspect of the display that could be just a *little* better. At this point, though, I’m pretty happy with what I’ve designed. By the time anyone reads this I’ll probably have changed it again, but for now the final product looks like this (it’s a video, just relatively slow):

The top section displays the current memory page (`0x00-0xFF`

) as well as a hexdump of the memory contents. Below that, we have all the individual registers, the program counter, stack pointer, and the bytecode of the last operation executed (this is super helpful for debugging). Finally, the flags are displayed using red/green for clear/set on terminals that support color. If color isn’t supported, it instead displays the values of each flag using binary numbers.

User input is shown at the `>`

, and the following commands are supported:

- Moving memory page by one using left/right arrows (ex.
`0xA0`

to`0xA1`

) - Move memory page by
`0x10`

using up/down arrows or scrolling (ex.`0x80`

to`0x90`

) - Moving directly to a memory page by typing it in (ex. type
`10`

to go to`0x10`

) - Jump to whichever page the program counter is on with
`home`

- Executing individual commands by using
`step`

or`step n`

, the latter executing`n`

commands - Run until a
`BRK`

command with`run`

- Display help using
`help`

- Quit the program using
`quit`

Overall, I’m really pleased with it! It feels intuitive and easy to use (but to be fair, I’m the one who wrote it). The information is similar to that of Symon, but all within a terminal window.

Now that I have a GUI, I can finally start writing test suites. There are a lot of opcodes to test, but I’m slowly working through them. I’ve currently written one test suite to test all the addressing modes supported by loads/stores to all the registers, as well as load/store/transfer to all registers that support it. This led to me finding a ton of bugs, which was…expected but still disheartening. Thankfully it wasn’t *too* hard to fix the issues, and everything I’ve tested should be working properly now. I’ve also implemented tests for `ADC`

and `SBC`

, which are now working. These tests suites are located at loadstore.asm and [arithmetic.asm][https://github.com/ahl27/65C02Emulator/blob/main/tests/arithmetic.asm], and were compiled with `xa -C -M`

. Here’s an example of running the `loadstore`

test:

This program loads and stores a bunch of numbers using all the different addressing modes to various locations on the zero page. I think it would probably be possible to write an automated test that runs all the test cases and reports pass/fail without me having to manually check the result, but that will be a project for another day.

If you’ve made it this far, thanks for reading! I hope you learned something, or at least enjoyed the writeup. This emulator is approaching its final stages, I’m really just checking all the opcodes to make sure they’re working properly. I’m not planning to write another full blog post about this project since it’ll be mainly test suites and bugfixes, but I’ll probably put up a short post when I officially finish the project.

As always, you can check out the complete code on Github. Thanks for reading!

]]>Emulation definitely gets harder as systems get more complicated, but for a small 8-bit microprocessor with a small instruction set, it’s not super hard to set up a project in C to accurately read compiled 6502 assembly and reproduce the behavior one would expect.

The complete source code is available on Github. Rather than go through it line by line, I just wanted to give a broad overview of the intuition behind the code. If you’re interested in the complete code though, definitely check out the code repository!

Let’s start with assumptions. There are a lot of ways to set up an emulator depending on what you’re trying to accomplish–my primary goal here is to be able to run compiled 6502 assembly code, so I’m not trying to support every use case or setup. Any other ICs connected to the computer will be implemented entirely in software, with the exact internal connections obfuscated.

I’m also going to assume that the code is correctly compiled 6502 assembly code. At the beginning, I’m focusing specifically on 6502 code, and once I have that working I’ll move on to implementing the updates included in the 65c02 line. I’m doing my best to emulate the correct clock speeds, and once I have a working implementation I’ll also try to replicate known bugs from a real 6502 (ex. the `JMP`

instruction using indirect indexing to the bottom of the third page would incorrectly read from memory locations `$30FF`

and `$3000`

rather than `$30FF`

and `$4000`

).

The internal setup of the computer is pretty simple. 6502s have three 8-bit registers (`A`

the accumulator, `X`

, and `Y`

), an 8-bit register for status flags, an 8-bit stack pointer, and a 16-bit address register. In total, that’s five 8-bit registers and one 16-bit register. The system supports memory access from `$0000-$FFFF`

, so we also need an array of 65,536 bytes. In C this is pretty simple, we just initialize some global variables:

```
// defining byte an unsigned 8-bit integer
typedef uint8_t byte
// Accumulator, X, Y registers
byte a, x, y;
// Program Counter
uint16_t pc;
// Stack Pointer
byte stackpointer;
// Flags
// N V - B D I Z C
byte flags;
// Address bus
uint16_t address;
// Memory, 0x10000 has indices 0x0000-0xFFFF
byte memory[0x10000];
```

`memory`

is an array of bytes, so we can access it either with array syntax (ex. `memory[0x0001]`

is the second byte), or by using pointer arithmetic (ex. `*(memory+0x0001)`

would access the same value). This comes in pretty handy, since we can access the current byte the program is at at anytime using `*(memory+pc)`

.

Normally we’d also have a collection of 8 pins corresponding to the data bus, where the processor sends or receives data to/from external sources (resp.). However, as I mentioned previously, I’m going to skip this in favor of just implementing the connections in software. This means that, for now, we won’t have any errors with things like chips not being hooked up correctly or having incorrect address decoding. Programs are essentially treated as running on a system with a 6502 whose outputs are all mapped to 64KiB of RAM. In the future, I’ll probably add in a translation layer that implements software-level address decoding and correctly misses lookups on chips that aren’t connected.

For now though, it’s simple enough to implement read/writing bytes:

```
byte read_byte(byte *address){
sleep(CLOCK_TIME);
return (*address);
}
void write_byte(byte *address, byte value){
sleep(CLOCK_TIME);
*address = value;
return;
}
uint16_t read_address(byte offset){
uint16_t val = read_byte(memory+offset+1);
val <<= 8;
val |= read_byte(memory+offset);
return(val);
}
```

`CLOCK_TIME`

here is a constant defined elsewhere such that `sleep(CLOCK_TIME)`

pauses for a single clock cycle. This means that `read_byte()`

and `write_byte()`

both take a single clock cycle, which is roughly what happens on the 6502. There is some error in this timing due to the speed of the C code execution, but with how fast processors are nowadays that overhead should be negligible.

`read_address()`

is another little function I wrote to help me with reading out addresses, since this happens a lot in 6502 instructions. 6502 is a little endian computer, meaning the least significant byte is store at the lowest address. This function ensures I don’t ever mess that up (since I end up looking up the difference between little and big endian pretty much every time I write a function using addresses).

Okay, so we now have some data and registers. The next big part is a way to determine what to do for each operation. Every instruction for the 6502 is a single byte (so a number from `0x00`

to `0xFF`

). Based on its value, we may need to read up to two additional bytes depending on the instruction. For example, if we want to jump to the code at `$0140`

, we would need the opcode for `JMP`

with absolute addressing, then the address `$0140`

(two bytes). This would look like `6C 40 01`

, since `6C`

corresponds to `JMP(abs)`

and `$0140`

is entered in little endian format.

There’s a great resource online for some good rules to decode opcodes. The naive approach is to just `switch...case`

on all 255 possible values, but that’s pretty inefficient. Some of the opcodes aren’t used, and others have a lot of patterns.

Each opcode is a byte, meaning 8 bits. Based on their values, they fall into seven main groups: Groups 1-3 (G1,G2,G3), Single Byte 1-2 (SB1 and SB2), Conditionals, and Interrupt/Subroutine (I/S). Groups 1-3 are named based on the MC6500 Microcomputer Family Programming Manual, as mentioned in the previous link. The other three groups are ones I named myself based on their patterns–they’re a little bit of a mixed bag.

SB1 is the simplest group–if the low nibble of the opcode is `0x8`

, then its in SB1. These are all single byte instructions, meaning we don’t have to read any bytes past the value. An easy example is opcode `0xE8`

, which corresponds to `INX`

and increments the `X`

register by one.

SB2 are also single byte instructions, and are all instructions where the low nibble is `0xA`

and the upper nibble is at least `0x8`

, so anything of the form `1000 1010`

.

If the instruction isn’t in SB1 or SB2, it’s probably in one of the main groups, G1-3. All of the opcodes in these groups follow the same pattern: If we regard the bits as `aaa bbb cc`

, then `cc`

determines which group we’re in, `bbb`

determines the addressing mode, and `aaa`

determines the operation. For example, G1 has `cc=01`

, `LDA`

has `aaa=101`

, and the immediate addressing mode for G1 has `bbb=010`

. Thus, `LDA #`

has opcode `101 010 01 = 1010 1001 = 0xA9`

. The addressing modes are slightly different between G1 and G2-3, so I implemented two different functions to decode the addressing mode depending on which group it belongs to.

Since `read_byte`

and `write_byte`

take as input a pointer to a byte, the address mode decoding function can simply return the pointer to the value we’re going to access. For immediate instructions this is simply `memory+pc`

, for absolute instructions it’s `memory + *(memory+pc)`

, and similarly for other addressing modes. When we need to use a register, we can just pass the address of the register (ex. `&a`

for operations on the accumulator).

Within G3 is a special subgroup, the Conditionals group. These are all of the form `xxy 100 00`

, and branch conditionally on a specified value. `xx`

determines which flag to check (`N,V,C,Z`

for `00, 01, 10, 11`

respectively), and `y`

determines the value to check against. For example, `0xB0 = 101 100 00`

branches if `C == 1`

, which is the Branch if Carry Set (`BCS`

) instruction.

The last group is sort of a weird bag of leftovers, the I/S group. This only has four instructions: `BRK (0x00)`

, `JSR (0x20)`

, `RTI (0x40)`

, and `RTS (0x60)`

. These are all of the form `0aa 000 000`

.

The final flow of the opcode decoding logic looks like this:

```
byte opcode = some_value;
byte highnibble = opcode >> 4;
byte lownibble = opcode & 0x0F;
if (lownibble == 8){
// SB1 Logic
} else if (lownibble == 0xA && highnibble > 7){
// SB2 Logic
} else {
byte aaa = (opcode & 0xE0) >> 5;
byte bbb = (opcode & 0x1C) >> 2;
byte cc = opcode & 0x03;
switch(cc){
case 1:
// G1 address decoding and opcode logic
break;
case 2:
// G2 address decoding and opcode logic
break;
case 3:
if (bbb == 4){
// Conditional Branching logic
} else if (bbb == 0 && !(aaa & 0x4)){
// I/S Logic
} else {
// G3 address decoding and opcode logic
}
break;
}
}
```

The actual address mode decoding and opcodes are a bunch of `switch...case`

statements, so I’m not going to include them. The result is a lot fewer than 255 `case`

statements, but there are still tons of them in the codebase.

The opcodes themselves are pretty simple to implement at this point. We have a pointer to the data we’ll need and the operation, and the logic behind each operation is very basic. For example, for `ORA`

, we just do:

```
void ORA(byte *addr){
// OR value at addr with the accumulator
write_byte(&a, a | (*addr));
// Set N,Z flags (0x7D = 0111 1101)
flags = (flags & 0x7D) |
((a & 0x80)) | // N
((a==0) << 1); // Z
return;
}
```

The only trick here is making sure that we’re updating the flags correctly. Some of these are pretty funky, like the `V`

flag, and some updated in situations I wasn’t aware of (for instance, `LDA`

updates `N,Z`

depending on the value loaded). I also learned a lot about some instructions I didn’t know existed, like `PHP`

and `PLP`

.

At this point, I’ve written software to decode all the opcodes and run the logic behind them accordingly.

One of the interesting things about the actual 6502 is that it has a lot of bugs in it. In order to accurately emulate 6502s running, I’m eventually going to try to incorporate these bugs. For example, I learned that if a G1-3 opcode with `cc = 11`

is supplied, it will interpret it as *both* a G1 and G2 opcode. This means that if you somehow passed opcode `0x6B = 0110 1011`

, it’ll see a G1 code of `0x69`

(`ADC`

immediate) *and* the G2 opcode `0x6A`

(`ROR`

on accumulator). This actually executes both instructions *simultaneously*, meaning it will try to add a value to the accumulator while also rotating it. As mentioned in documentation, sometimes this resolves correctly (ex. when different registers are referenced), sometimes it has weird behavior (ex. when one instruction is a hidden instruction), and sometimes it’s totally random (ex. `STA`

and `STX`

stores `A AND X`

, sometimes with an extra constant depending on the manufacturer). “Hidden” instructions are already implemented with my setup, and are created by combining a function and opcode that don’t make sense. For example, `0x89`

corresponds to `STA`

immediate, which doesn’t make any sense (but will still execute).

That’s it for this blog post! I spent a lot of time learning how to make `Makefile`

s for this, as well as getting my include guards to correctly guard against endless recursive includes. I write a lot of C code for R, but it’s been a whiel since I wrote a standalone C project…so I was a little rusty. Next steps are going to be writing some logic to actually load a compiled file into memory, and then I’ll start going through all my functions to make sure they’re working as intended. As part of that, I’ll be writing a bunch of unit tests to comprehensively test everything.

As always, you can check out the complete code on Github. Thanks for reading!

]]>There’s a bunch of ways to implement division on computers, but I’m going to be using one of the simplest methods. For people that are interested, I highly recommend this post from SEGGER Microcontroller on how to use the Newton-Raphson Method with fixed point multiplication to quickly divide numbers. Unfortunately, implementing this requires that we have the capability to multiply two 16-bit numbers into a 32-bit number, which I didn’t feel like implementing.

Instead, I’m going to be using the following long division algorithm (shamelessly taken from Wikipedia):

```
let N=numerator, D=denominator
let Q=0, R=0 # Quotient, Remainder
for i in (n-1) to 0: # n is the number of bits in N
R = R << 1
R(0) = N(i) # Set the 0'th bit of R to the i'th bit of N
if R ≥ D:
R = R - D
Q(i) = 1 # Set the i'th bit of Q to 1
return R,Q
```

This is an easy algorithm that goes through the steps of long division on unsigned binary integers, and that doesn’t require more than 16 bits for each value. The one important thing to keep in mind is we need to calculate `n`

, the number of bits in the numerator after discarding any leading zeros. The algorithm as a whole is pretty simple, it just takes a fair amount of code when writing it in assembly.

As always, I’m using my implementation of a 16-bit stack. You can look at my previous posts for how it’s implemented, but essentially it’s a downward growing block of memory that starts at the top of the zero page. For consistency with all my other functions, this will remove the top two values of the stack and replace them with the result of the operation. Since we’re calculating the remainder and the quotient, we’re going to store *both* of these at the end. The stack before and after looks like this:

```
Top of Stack -----> Top of Stack
Divisor -----> Remainder
Numerator -----> Quotient
```

Notice here that we’re treating the top entry of the stack as the denominator of the operation, and the second entry as the numerator. This feels like a good decision, since we’d probably expect to have whatever number we’re working with on top of the stack. This makes it simpler to just push a divisor and then divide; if we treated the top value as the numerator, dividing the number at the top of the stack by another would require three calls (push, swap, divide).

Here’s the skeleton of the function:

Lots of lines here, but none of them are super complicated. As usual, we’re allocating space for the new values, doing some stuff in the middle, copying the values to lower spaces in the stack, then discarding the space we no longer need. The meat of the function is going to be the `findbits`

and `dloop`

routines. I’m also storing the value `16=0x10`

in the `y`

register, which I’ll use later in the `findbits`

routine. This value is going to store the value of `n`

(the number of bits in the numerator), which is at most 16.

We’ll start with the easier of these: `findbits`

. In my implementation of `mult16`

I talked about how we right/left shift 16-bit values, which will be very important here. All we have to do is left shift the numerator until the leading bit is a 1, at which point we continue on.

At the end of this, `y`

will store the number of bits in the numerator, and the numerator will have all its leading zeros trimmed. Notice that I’m using `y=n`

rather than `y=n-1`

as stated in the initial algorithm–looping from `n:1`

is a little cleaner to implement than looping from `(n-1):0`

.

This code is pretty close to correct, but there’s a major bug in it. What happens if the numerator is zero? Then our `bit`

call would never find a non-zero value, and we’d be trapped in an infinite loop. To fix this, we need to first test if the numerator is zero. While we’re at it, we may as well test if the denominator is zero as well.

This adds a few new lines–we’re checking if the numerator or the denominator are zero, and if so, we set both the quotient and remainder to 0, clean up the stack, and then return. This won’t work quite right as a subroutine, but the final implementation is going to replace all subroutine calls with code to avoid unneccessary `jsr`

and `rts`

calls.

At this point, we’ve trimmed off the leading zeros of the numerator, and we’ve stored the number of bits in it in the `y`

register. Now, we can move onto the main division loop.

This function is a little more involved. To recap, we have five main steps repeated `n`

times, with `i`

ranging from `n-1:0`

:

- Left shift the remainder
- Set the last bit of the remainder to bit
`i`

of the numerator - If the remainder is less than or equal to the denominator:
- Subtract the denominator from the remainder
- Set bit
`i`

of the quotient to 1

There are quite a few references to bit `i`

, which will be tricky to implement in the 6502. Instead, I’m going to modify the steps into a longer (but equivalent) form, that will end up being simpler to implement in assembly:

- Left shift the remainder
- Left shift the quotient
- Set the last bit of the remainder to the first bit of the numerator
- Right shift the numerator
- If the remainder is less than or equal to the denominator:
- Subtract the denominator from the remainder
- Add 1 to the quotient

This ends up being the same instructions as before, but we only need to reference the first or last bit of any given value. If you’re not convinced this works, try it out for yourself!

Now all that’s left is to implement it in code. I’ve covered shifting operations in my multiplication post, but it’s important to note that all the shifting operations shift the outgoing bit into the carry register. We can use this to combine steps `(3)`

and `(4)`

–right shifting the numerator stores the most significant bit in carry, which we can use to set the last bit of the remainder.

The main thing that took me a while to understand here is the result of the `cmp`

instruction. `cmp`

compares the value in the accumulator to the memory address provided (or immediate/indirect/whatever, it supports other addressing modes). The result of the operation sets the following flags:

```
Condition N Z C
---------------------------------
Register < Memory: 1 0 0
Register = Memory: 0 1 1
Register > Memory: 0 0 1
```

Since each number is stored in two bytes, we have to make at most two comparisons. The branching statements help streamline some of the calls. These were all new information to me since I don’t have a lot of experience with all the different branching functions, but if you’re experienced with this feel free to skip straight to the next section.

The first step is to compare the upper byte of the remainder (stored in the accumulator register) and the upper byte of the denominator (stored in memory). If the remainder’s upper byte is less than the denominator’s, we know that the total 16-bit value of the remainder must be less, so we can skip all the remaining logic for that loop. This situation corresponds to when the **N**egative flag is set, so we use `bmi`

(**B**ranch if **Mi**nus).

If the upper byte of the remainder is greater than the denominator, we know the total value of the remainder is greater without needing to look at the lower byte. In this case, we skip the lower byte comparison to go straight to subtracting values. This case happens when the **Z**ero flag is not set, so we use `bne`

(**B**ranch if **N**ot **E**qual). `bne`

is more like “Branch if Zero not Set”, but whatever.

The last case is when the upper bytes are exactly equal–in this case, we do need to look at the lower byte. This uses the same instructions, but we only need to check if the remainder is less than the denominator, since the other two cases lead to the same result (continuing to the `subtract`

label). As before, this happens when the `N`

flag is set, so we use `bmi`

to skip `subtract`

if the remainder is less than the denominator.

At the end of this function, we should have the quotient and remainder stored in the stack. All that’s left is to put it all together in a single function!

Here I’ve written the subroutines for `findbit`

and `dloop`

into the function directly, which avoids unneccessary `jsr`

and `rts`

calls. There’s a little bit of redundancy in the end cleanup and `earlyexit`

labels, but it’s not enough to make things crazy. I’m currently handling divide by zero by just setting the result to zero, but specific error behavior could be implemented later in the `earlyexit`

label.

I also wrote some quick helper functions to get just the quotient or remainder, since always storing both is a little tedious.

These just call `div16withmod`

, pop the unneeded value, and clean up the stack afterwards. As I mentioned at the beginning of this post, there are definitely faster ways to implement all these functions, but my priority right now is getting something that works. Division is especially notorious for being slow no matter how you slice it, and eking out every bit of optimization on hardware that’s pretty slow to begin with is not a huge concern at the moment.

I really hate testing, but unfortunately it’s a necessary evil. If only my code always worked perfectly on the first try…this implementation went through a couple revisions as a result of testing, but the code included here is all working successfully. I wrote three tests to check if the division and remainder operations were working correctly (again using my `stacktest.asm`

file I’ve been using for testing all these stack operations):

We should get the following as output:

```
Address Value
x=0xF9 junk
0xFA 0x1A ; modtest
0xFB 0x00
0xFC 0x00 ; divtest2
0xFD 0x8A
0xFE 0xA1 ; divtest1
0xFF 0x00
```

Running it through symon, we get the following:

Everything looks correct!

At this point, I don’t think I can put off working on the core interpreter any longer. Now that most of the auxilliary pieces are in place, I can go back to writing and testing my main Forth interpreter. The plan is to make sure calling dictionary words work correctly, populating the dictionary with words based on the stack functionality I’ve implemented thus far, and then working on a simple commandline REPL. After that, I’ll technically have a working Forth interpreter…but we’ll see how long it takes to get there.

I’ll also note that I haven’t included any way to handle negative numbers yet. I’m not sure when I’m going to implement that, but adding the ability down the line is fairly simple (just convert to positive, divide/multiply, then negative if exactly one of the numbers is negative). I haven’t settled on exactly how I’ll adjust things, but I’ll probably end up switching my computer from using unsigned integers to using two’s complement numbers, with a negation function that runs as a preprocessing step before any of these functions. We’ll see.

As always, you can follow this project on Github!

]]>