Results from the Cancer Biology game: The Cure

Building intelligent systems for biology

Our research group has been exploring the concept of serious games for several months now.  Aside from providing nerdy entertainment, our games collect (and distribute) biological knowledge from broad audiences of players.  The hypothesis underlying this work is that, by capturing knowledge in forms suitable for computation, these games make it possible to build more intelligent programs.   

As one step in testing this general hypothesis, on Sept. 7, 2012, we released a game called ‘The Cure’. The objective of this game is to build a better (more intelligent) predictor of breast cancer survival time based on gene expression and copy number variation information from tumor samples. We selected this particular objective to align with the SAGE Breast Cancer Prognosis challenge.

In this game, available at http://genegames.org/cure/, the player competes with a computer opponent to select the highest scoring set of five genes from a board containing 25 different genes.  The boards are assembled in advance to include genes judged statistically ‘interesting’ using the METABRIC dataset provided for the SAGE Challenge.

Below is a game in progress.  I’m on the bottom and my opponent, Barney, is on the top.  We alternate turns selecting a card (a gene) from the board and adding it to our hand.  When we each complete a 5 card hand, the round finishes and whoever has the most points wins. Scores are determined by using training data to automatically infer and test decision tree classifiers that predict survival time.  The trees can use both RNA expression and CNV data for the selected genes to infer predictive rules.   The better the gene set performs in generating predictive decision trees, the higher the score.  When the player defeats their opponent, they move on to play another board.  (Multiple players play each board.)  
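To make the scoring idea concrete, here is a minimal sketch in Python. It is not the game's code: the real game trains WEKA decision trees on the server, while this stand-in fits a one-split decision stump and scores a hand by its cross-validated accuracy. The function names, the data layout, and the toy measurements are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Sample = Dict[str, float]        # gene symbol -> measurement (hypothetical layout)
Stump = Tuple[str, float, int]   # (gene, threshold, label predicted above threshold)


def fit_stump(rows: List[Sample], labels: List[int], genes: List[str]) -> Stump:
    """Pick the single split with the best accuracy on the training rows."""
    best: Stump = (genes[0], 0.0, 1)
    best_correct = -1
    for gene in genes:
        for threshold in sorted({row[gene] for row in rows}):
            for above in (0, 1):
                correct = sum(
                    (above if row[gene] > threshold else 1 - above) == label
                    for row, label in zip(rows, labels)
                )
                if correct > best_correct:
                    best_correct = correct
                    best = (gene, threshold, above)
    return best


def predict(stump: Stump, row: Sample) -> int:
    gene, threshold, above = stump
    return above if row[gene] > threshold else 1 - above


def score_hand(rows: List[Sample], labels: List[int], hand: List[str], folds: int = 5) -> float:
    """Cross-validated accuracy of a stump restricted to the genes in `hand`."""
    correct = 0
    for fold in range(folds):
        # Interleaved folds: sample i is held out in fold i % folds.
        train = [i for i in range(len(rows)) if i % folds != fold]
        test = [i for i in range(len(rows)) if i % folds == fold]
        stump = fit_stump([rows[i] for i in train], [labels[i] for i in train], hand)
        correct += sum(predict(stump, rows[i]) == labels[i] for i in test)
    return correct / len(rows)


# Toy data: GENE_A tracks the outcome, GENE_B does not (made-up values).
rows = [{"GENE_A": float(i), "GENE_B": float(i % 2)} for i in range(20)]
labels = [1 if i >= 10 else 0 for i in range(20)]
informative = score_hand(rows, labels, ["GENE_A"])
uninformative = score_hand(rows, labels, ["GENE_B"])
```

On this toy data the hand containing the informative gene scores higher than the other one, which is the property the game rewards.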

A game of The Cure. Barney (the bad guy) is winning. I am looking at the CPB1 gene and, using the search feature, I have highlighted in pink all genes that have the word cancer in any of their metadata.

As you can see to the right of the board, information from the Gene Ontology, RefSeq, and PubMed is provided through the game interface to aid players in selecting their genes.  Players are also encouraged to make use of external knowledge sources (in addition to their own brains).

Promotion, players and play

The Cure was promoted on launch day via a presentation by Andrew Su at Genome Informatics 2012, via Twitter and in several blog posts. As we first described in a post published on the Sage community site, more than 120 players registered and collectively played more than 2000 games in the first week that the game was live, with much of this activity happening within the first few days. Nearly half of the players self-reported having PhDs and half claimed knowledge of cancer biology. Following the initial buzz, game-playing activity slowed down to what is now a slow but persistent trickle.

Games played at The Cure since launch

As of last Friday, Oct. 26, 2012, we have had 214 people register and have recorded 3,954 total games (including training games). The player demographics have remained stable, with about 40% PhDs, nearly 50% declaring knowledge of cancer biology, and about 50% stating that they are biologists.

Predicting breast cancer prognosis

Aside from entertainment, the point of this particular game is to assemble a predictor for breast cancer prognosis. The main hypothesis is that biological knowledge, accessible from players, can be used to help select good sets of genes to use to train predictive models using machine learning algorithms. The premise is that injecting distributed biological knowledge (which cannot entirely be learned from any one training set) will help reduce overfitting by identifying the gene sets with biologically consistent associations with disease progression.

The data collected from game play includes information about the players (education, knowledge of cancer, profession) and the complete history of the genes that each player selects for each board that they play.  While we are still considering methods for making use of this data (such as the Human Guided Forest), we used the following protocol to build a predictor to submit to the SAGE challenge.

  1. Filter out games from players that indicated no knowledge of cancer biology.  
  2. Rank each gene according to the ratio of the number of times that it was selected by different players to the number of times that it appeared in any played game.  
  3. Select the top 20 genes according to this ranking.
  4. Insert this 20 gene ‘signature’ into the ‘Attractor Metagene’ algorithm that has dominated the SAGE challenge.  To do this, we kept all of the code related to the use of clinical variables unchanged, but replaced the genes selected by the Attractor team with the genes selected by our game players.  

Game-selected genes: CCL3L3, CXCL9, IL1B, BCL2, DUSP1, ERBB2, EGR1, JUN, PITX1, MAP3K1, IGFBP2, STAT1, BCAR3, HOXB2, BCL11B, MAPK15, WNT5A, APOA2, HLA-DRB4, CD163
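
Steps 1 to 3 of the protocol above can be sketched in a few lines of Python. This is one possible reading, not the code we ran: it assumes "selected by different players" means the number of distinct players who picked a gene, and it counts appearances only in the games that survive the filter. The `Game` record and its field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Game(NamedTuple):
    """One played board (hypothetical record layout)."""
    player: str
    knows_cancer: bool        # self-reported at registration
    board_genes: List[str]    # genes shown on the board
    picked_genes: List[str]   # genes the player added to their hand


def rank_genes(games: List[Game], top_n: int = 20) -> List[str]:
    pickers: Dict[str, Set[str]] = defaultdict(set)
    appearances: Dict[str, int] = defaultdict(int)
    for game in games:
        if not game.knows_cancer:
            continue  # step 1: drop players reporting no cancer knowledge
        for gene in game.board_genes:
            appearances[gene] += 1
        for gene in game.picked_genes:
            pickers[gene].add(game.player)
    # Step 2: distinct pickers divided by appearances in retained games.
    ratio = {gene: len(pickers[gene]) / count for gene, count in appearances.items()}
    # Step 3: keep the top_n genes; ties are broken alphabetically.
    return sorted(ratio, key=lambda gene: (-ratio[gene], gene))[:top_n]


# Toy example with made-up players and gene names.
games = [
    Game("ann", True, ["A", "B", "C"], ["A"]),
    Game("bob", True, ["A", "B", "C"], ["A", "B"]),
    Game("cy", False, ["A", "B", "C"], ["C"]),
]
top_two = rank_genes(games, top_n=2)
```

In the toy example, the game from the player without cancer knowledge is ignored, so gene C never receives a pick and drops out of the top two.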


The predictor generated with this protocol scored 69% on the survival concordance index on the SAGE challenge test dataset, just 3% behind the best submitted predictor and significantly above the median of hundreds of submitted models. (You can see the ranked results on the challenge leaderboard by searching for team HIVE and, with a free registration, you can inspect the model directly within the Synapse system operated by SAGE.)
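
For readers unfamiliar with the metric, the concordance index is the fraction of comparable patient pairs in which the patient predicted to be at higher risk is also the one who died earlier. The sketch below is a plain-Python version of Harrell's concordance index under simple assumptions (pairs with tied times are skipped, tied risk scores count as half). The challenge's own scoring code may handle ties and censoring differently, so treat this as an illustration only.

```python
from itertools import combinations
from typing import Sequence


def concordance_index(
    times: Sequence[float],    # observed survival or follow-up time
    events: Sequence[bool],    # True if death was observed, False if censored
    risks: Sequence[float],    # predicted risk: higher means shorter survival
) -> float:
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: ignore pairs with tied times
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # earlier patient was censored, so the true order is unknown
        comparable += 1
        if risks[first] > risks[second]:
            concordant += 1.0
        elif risks[first] == risks[second]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: four patients, all deaths observed, one pair ranked wrongly.
times = [2.0, 4.0, 6.0, 8.0]
events = [True, True, True, True]
c_perfect = concordance_index(times, events, [0.9, 0.7, 0.5, 0.1])
c_one_swap = concordance_index(times, events, [0.9, 0.5, 0.7, 0.1])
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ordering, which is the scale on which the 69% above should be read.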

In experiments conducted within the training dataset, we were able to consistently generate decision tree predictors of 10-year survival with an accuracy of 65% in 10-fold cross-validation using only genomic data (no clinical information). This was substantially better than classifiers produced using randomly selected genes (55%). Using an exhaustive search through the top 10 genes, we found 10 unique gene combinations that, when aggregated, produced statistically significant (FDR < 0.05) indicators of survival within: (1) the training dataset used in the game, (2) a validation cohort from the same study, and (3) an independent validation set from a completely different study.

Final Results from METABRIC round of BCC challenge


!! Update: the model submitted using The Cure data (Team HIVE) scored 0.70 on the official test dataset for the METABRIC round of this competition, putting it at #43 of 171 submitted models !!

Conclusions

These early results from The Cure show clearly that biologists with knowledge that is relevant to cancer biology will play scientific games, and that combined with even basic analytical techniques, meaningful knowledge for inferring predictors of disease progression can be captured from their play.  We suggest that this might open the door to a new form of ‘crowdsourcing’ that operates with much smaller, more specific crowds than are typically considered.   
Data
The data collected from the game so far is available as an SQL dump in our repository. This is the entire database used to drive and track the game with the exception of personal information such as email and IP addresses.
Implementation
The code that operates The Cure is freely available on our BitBucket account. It consists of a Java server application (running in Tomcat) that handles database interaction, board generation, and integration with the WEKA machine learning library. WEKA is used to dynamically train and test decision trees (though we could easily use other models) while the game is running. The interface is almost entirely CSS and JavaScript that communicates with the server via JSON requests. We would be thrilled if someone wanted to use this code to build another classification game!

Trees
One aspect of the code base that may be useful in a variety of different projects is the code that translates the Java objects that represent decision trees in WEKA into the Web-ready visualizations presented to the players. This is accomplished via server-side translation into a JSON structure that is rendered in the browser using code that builds on the D3 JavaScript visualization library.
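
The general shape of that translation can be sketched as follows. This is Python rather than the Java that runs on our server, and the `TreeNode` class, the `to_d3` function, and the gene labels are hypothetical stand-ins rather than WEKA's classes or real results. It emits nested `name`/`children` objects, a layout that D3's hierarchy tools commonly consume.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TreeNode:
    """Stand-in for a decision tree node (not a WEKA class)."""
    label: str                                   # split rule or leaf prediction
    children: List["TreeNode"] = field(default_factory=list)


def to_d3(node: TreeNode) -> Dict[str, Any]:
    """Recursively convert a tree into nested dicts ready for json.dumps."""
    out: Dict[str, Any] = {"name": node.label}
    if node.children:  # leaves carry no "children" key
        out["children"] = [to_d3(child) for child in node.children]
    return out


# Made-up tree with illustrative labels only.
tree = TreeNode(
    "GENE_A expression > 0.5",
    [
        TreeNode("predict: survives 10 years"),
        TreeNode(
            "GENE_B copy number > 2",
            [TreeNode("predict: does not survive"), TreeNode("predict: survives 10 years")],
        ),
    ],
)
payload = json.dumps(to_d3(tree))
```

The resulting `payload` string is what a server would send to the browser in response to a JSON request.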

Credits
Thanks to Max Nanis, Salvatore Loguercio, Chunlei Wu, Ian Macleod and Andrew Su for all of your help making The Cure. Thanks in particular to Max who authored 99% of everything you see when you play the game.

Barney
The opponent in The Cure came from a Wikimedia Commons image from the game “You have to Burn the Rope”. Thanks for sharing!
