Yesterday I posted some slides about an idea I had recently which I call Human Guided Forests, or HGF for short. This is an attempt to marry crowdsourcing with machine learning to produce better class predictors for datasets with very large feature spaces. Specifically, the idea is to replace the ‘random’ in the Random Forest algorithm with ‘human’.
Random Forests basically work like this:
Given a labeled dataset with M input variables and N samples:
1. Choose m as the number of input variables allowed per tree in the forest.
2. For X iterations:
   1. Choose a subset of n samples from the training set.
   2. Select m random input variables.
   3. Build a decision tree using the randomly selected variables, all n samples (the ‘bootstrap’ or ‘in bag’ sample), and standard induction techniques (e.g. C4.5).
   4. Measure the error rate for that tree on the samples not used to train it (the ‘out of bag’ or ‘oob’ samples).
   5. Save the tree.
After the forest of decision trees has been constructed, classify new samples by running them through all the trees and choosing the class that is predicted most frequently. (This is a very successful kind of ‘ensemble classifier’ that is similar to one that I, because of my ignorance, reinvented as one of my first projects in bioinformatics.)
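To make the loop concrete, here is a minimal sketch in Python. It uses scikit-learn’s DecisionTreeClassifier as the tree inducer (scikit-learn implements CART rather than C4.5) and assumes integer class labels; the function names are mine, not part of any library.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, m, n_trees=100, seed=0):
    """Grow n_trees trees, each from a bootstrap sample and a random
    subset of m of the M input variables."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    forest = []
    for _ in range(n_trees):
        # the 'in bag' (bootstrap) sample: rows drawn with replacement
        in_bag = rng.choice(N, size=N, replace=True)
        oob = np.setdiff1d(np.arange(N), in_bag)
        # m randomly selected input variables (the step HGF will replace)
        feats = rng.choice(M, size=m, replace=False)
        tree = DecisionTreeClassifier().fit(X[in_bag][:, feats], y[in_bag])
        # out-of-bag error estimate for this tree
        oob_error = (np.mean(tree.predict(X[oob][:, feats]) != y[oob])
                     if len(oob) else None)
        forest.append((tree, feats, oob_error))
    return forest

def classify(forest, X):
    """Majority vote over the forest; assumes integer class labels."""
    votes = np.stack([tree.predict(X[:, feats]) for tree, feats, _ in forest])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```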
This algorithm has been shown to be very effective at extracting good classifiers from datasets automatically. However, as the random forest authors say:
“But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem.” Leo Breiman and Adele Cutler
So, the question I’m posing is this: by inserting humans into the learning process, can we improve it using their reasoning and background knowledge?
For HGF, we replace the random input variable selection (step 2 of the loop above) with expert-guided variable selection. (We may also let people guide the inference of the decision tree.) The hypothesis is that experts will choose feature sets that are better than the randomly selected ones: sets that generalize to novel datasets with less error and produce more easily understood classifiers.
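In code, the HGF change is a one-line substitution: the random feature draw becomes a call out to a person. Here ask_expert is a hypothetical stand-in for whatever interface (ultimately, the game described below) collects an expert’s feature choices; the rest is the loop from the sketch above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_hgf_forest(X, y, m, ask_expert, n_trees=100, seed=0):
    """Like build_forest above, but variable selection is delegated to
    a human expert (ask_expert is a placeholder callback)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    forest = []
    for i in range(n_trees):
        in_bag = rng.choice(N, size=N, replace=True)
        # the HGF substitution: an expert (ideally a different one per
        # tree) chooses the m features, guided by background knowledge
        feats = ask_expert(n_features=m, tree_index=i)
        tree = DecisionTreeClassifier().fit(X[in_bag][:, feats], y[in_bag])
        forest.append((tree, feats))
    return forest
```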
As with standard RF, we need the HGF to produce many trees for each dataset, with each tree having high classification performance and low overlap with the rest of the forest. Ideally, each bootstrap of the training data would be converted into a tree by a different expert, with the experts drawn from a pool of highly diverse expertise. This would require a significant investment of work from a large collection of expensive people.
Now, the next question is how on earth are we going to get a very large pool of skilled professionals to contribute their expertise to this project? The answer we have been gravitating towards is games. We hope to translate the feature selection problem into a game that knowledgeable biologists and interested lay people will play for fun.
The formulation of the game(s) that will be used to drive an HGF implementation is very much a work in progress. At the moment, the basic structure of our candidate games is that of a card game. One way or another, players compose ‘hands’ of cards that correspond to features in a particular dataset. For example, cards might correspond to genes from a gene expression dataset. Hands are scored by testing the predictive performance of classifier trees inferred from the features in the hand and the training data (like one cycle of a random forest run).
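As a sketch of that scoring step, a hand might be evaluated like this; the cross-validation scheme and the score_hand name are assumptions of mine, not a settled design.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def score_hand(X, y, hand, folds=5):
    """Score a hand (a list of feature column indices, e.g. genes) by the
    cross-validated accuracy of a single tree built from those features."""
    return cross_val_score(DecisionTreeClassifier(), X[:, hand], y, cv=folds).mean()

# e.g. score_hand(X, y, hand=[12, 405, 7731])  # hypothetical gene indices
```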
Relation to Network Guided Forests
This idea is highly related to the concept of ‘Network Guided Forests’ (NGF) described by Dutkowski and Ideker in a PLoS paper last fall. In that approach, the features used to build decision trees are constrained to related nodes within protein-protein interaction networks. Features are selected for a given tree by picking one at random and then walking out along the network to bring in others in close proximity. The algorithm did not improve classification performance compared with the standard random forest, as measured in cross-validation, but it did result in much more stable and coherent feature selection across several datasets. It tended to choose genes known to relate to the phenotype of interest (e.g. breast cancer prognosis) much more often than random selection did. Compared with HGF, NGF has the huge advantage that it can be used immediately, based on data already in databases, without any dependence on human intelligence. HGF has the theoretical advantage of tapping into a much broader collection of knowledge that is not limited to interaction data.
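For contrast, the NGF-style feature draw might look roughly like the following. This is my paraphrase of the procedure described in the paper, not the authors’ code; it assumes a networkx graph of protein-protein interactions whose nodes are genes.

```python
import random
import networkx as nx

def network_guided_features(ppi, m, rng=None):
    """Pick one gene at random, then walk outward through the network,
    collecting nearby genes until m features have been gathered."""
    rng = rng or random.Random(0)
    # ppi is a networkx Graph, e.g. ppi = nx.read_edgelist('ppi_edges.txt')
    # (a hypothetical edge-list file of interacting gene pairs)
    seed = rng.choice(list(ppi.nodes))
    selected = [seed]
    frontier = list(ppi.neighbors(seed))
    while frontier and len(selected) < m:
        nxt = frontier.pop(rng.randrange(len(frontier)))
        selected.append(nxt)
        frontier.extend(n for n in ppi.neighbors(nxt)
                        if n not in selected and n not in frontier)
    return selected
```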
Call for comments
At this point, this is just a nascent idea. I have no evidence beyond intuition that it will succeed, and there is quite a bit of difficult work ahead to find out. Any thoughts on it at this early point in time are most welcome!