Blog
Gene Wiki on NAR database cover
The Gene Wiki Rainbow just got a little bit more famous. Check it out on the cover of the 2012 Nucleic Acids Research Database issue. Thanks again, Martin K.!
Mining the Gene Wiki
Our article about mining ontology-based gene annotations from the text of the Gene Wiki just came out at BMC Genomics. Yay!
In the article, we discuss the results of what I think might be the simplest text-mining strategy that could possibly work. Based on the premise that each Gene Wiki article is fundamentally about one particular gene, we make the simplifying assumption that all of the concepts detectable in the article are descriptors of what that gene does. With those assumptions in place, we use the NCBO annotator to detect concepts from the Gene Ontology (GO) and the Human Disease Ontology (DO) in the text of articles about genes. Each detected occurrence thus produces a candidate annotation for the gene. From the article:
For example, we identified the GO term ‘embryonic development (GO:0009790)’ in the text of the article on the DAX1 gene: “DAX1 controls the activity of certain genes in the cells that form these tissues during embryonic development”. From this occurrence, our system proposed the structured annotation ‘DAX1 participates in the biological process of embryonic development’. Following the same pattern, we found a potential annotation to the DO term ‘Congenital Adrenal Hypoplasia’ (DOID:10492) in the sentence: “Mutations in this gene result in both X-linked congenital adrenal hypoplasia and hypogonadotropic hypogonadism”.
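The strategy described above can be sketched in a few lines. This is only a toy illustration, not the NCBO annotator or the pipeline from the paper: it swaps in a plain phrase lookup against a two-entry lexicon (the two term IDs quoted above), and the names `LEXICON` and `candidate_annotations` are made up for this example.

```python
import re

# Toy lexicon: term label -> ontology ID. Only the two terms quoted in the
# post are included; a real annotator works against the full GO and DO.
LEXICON = {
    "embryonic development": "GO:0009790",
    "congenital adrenal hypoplasia": "DOID:10492",
}


def candidate_annotations(gene, text, lexicon=LEXICON):
    """Return (gene, term_id, label) for every lexicon label found in text.

    Simplifying assumption from the post: the article is about one gene,
    so every concept detected in it is a candidate annotation for that gene.
    """
    found = []
    for label, term_id in lexicon.items():
        # Case-insensitive whole-phrase match.
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append((gene, term_id, label))
    return found


article_text = (
    "DAX1 controls the activity of certain genes in the cells that form "
    "these tissues during embryonic development. Mutations in this gene "
    "result in both X-linked congenital adrenal hypoplasia and "
    "hypogonadotropic hypogonadism."
)

annotations = candidate_annotations("DAX1", article_text)
for annotation in annotations:
    print(annotation)
```

A plain lookup like this misses synonyms and spelling variants, which is part of why a dedicated annotator is the better tool in practice.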
We found that, in terms of precision, this simple approach worked pretty well at detecting gene-disease annotations (90-93%) but not nearly as well at detecting gene-function (GO) annotations (48-64%). As you might expect, the recall side of the equation went in the opposite direction, with many more potential GO annotations discovered (11,022) than DO annotations (2,983). Though there was some overlap, the majority of the predicted annotations did not have any match in existing annotation databases, showing that the Gene Wiki contains some knowledge that centralized resources like the Gene Ontology Annotation database do not yet represent and that basic text mining provides a way to access that knowledge computationally.
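Splitting predictions into "already known" and "novel", and scoring a manually reviewed sample, comes down to simple set arithmetic. The sketch below uses invented placeholder data (the `GENE_X` pair, the reference set, and the review counts are all hypothetical), so it shows the bookkeeping only, not the paper's actual evaluation.

```python
def split_by_reference(predicted, reference):
    """Split predicted (gene, term) pairs into matched and novel sets."""
    predicted, reference = set(predicted), set(reference)
    matched = predicted & reference   # already in an existing database
    novel = predicted - reference     # no match in the reference
    return matched, novel


def precision(n_correct, n_reviewed):
    """Fraction of manually reviewed predictions judged correct."""
    return n_correct / n_reviewed


# Hypothetical data for illustration only.
predicted = [
    ("DAX1", "GO:0009790"),
    ("DAX1", "DOID:10492"),
    ("GENE_X", "GO:0000000"),  # placeholder gene and term
]
reference = [("DAX1", "DOID:10492")]  # stand-in for an annotation database

matched, novel = split_by_reference(predicted, reference)
print(len(matched), "matched;", len(novel), "novel")

# Hypothetical review outcome: 9 of 10 sampled predictions judged correct.
print(precision(9, 10))
```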
But, you say, that precision for the GO is really low, so what use is this really? For applications that require 100% accuracy, like a curated database, you would need to curate the predicted results, and that might be quite a lot faster than searching through PubMed to find them all from scratch. As it turns out, there are also other kinds of applications that can take advantage of noisy data like this. As long as there is a strong signal within the noise, probabilistic techniques like enrichment analysis can work. This is possible because, although many of the individual annotations might turn out to be incorrect, as a group they are far from random.
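To make the enrichment idea concrete, here is a minimal sketch of one common formulation, a one-sided hypergeometric test. It is not necessarily the analysis used in the paper; the function name and the numbers in the usage example are made up for illustration.

```python
from math import comb  # available in Python 3.8+


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes by chance.

    population: total number of genes considered
    annotated:  genes in the population carrying the term (noisy or not)
    sample:     size of the gene list being tested
    hits:       genes in the sample that carry the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 genes, 200 annotated with some term,
# and a 100-gene list in which 10 genes carry that term.
p_value = hypergeom_enrichment_p(20000, 200, 100, 10)
print(p_value)
```

Because the test looks at the annotations as a group, a fair number of wrong individual annotations dilute the signal rather than destroy it.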
For more details, read the paper ;).
Quotes from Reality is Broken
I’m currently supposed to be writing an article about scientific discovery games in biology, but I have writer’s block. So instead, I’m writing here, which is much easier! The article I am not currently writing will discuss recent successes like “Algorithm discovery by protein folding game players” by Firas Khatib and others. In preparing to write this article (i.e. more not-writing), I assembled some inspiring quotes from the fantastic book “Reality is Broken” by Jane McGonigal. I share them here because, well, they made me think a little bit and perhaps they will do the same for someone else, and this allows me to push back my real work by another 5 minutes.
“It is games that give us something to do when there is nothing to do. We thus call games “pastimes” and regard them as trifling fillers of the interstices of our lives. But they are much more important than that. They are clues to the future. And their serious cultivation now is perhaps our only salvation” – quote that opens McGonigal’s book – from Bernard Suits
“Games aren’t leading to the downfall of human civilization. They’re leading to its reinvention” (p354)
“Game developers know better than anyone else how to inspire extreme effort and reward hard work. They know how to facilitate cooperation and collaboration at previously unimaginable scales. And they are continuously innovating new ways to stick with harder challenges, for longer, and in much bigger groups.” (p13)
“Game design isn’t just technological craft. It’s a twenty-first-century way of thinking and leading. And gameplay isn’t just a pastime. It’s a twenty-first-century way of working together to accomplish real change.” (p13)
“Anything else you think you know about games, forget it for now. All the good that comes out of games – every single way that games can make us happier in our everyday lives and help us change the world – stems from their ability to organize us around a voluntary obstacle” (p34)
“Compared with games, reality is unproductive. Games give us clearer missions and more satisfying, hands-on work.” (p55)
“If you were able to focus the attention of the entire planet on a single goal, even just for one day, and even if it just involved dispatching aliens in a video game, it would be a truly awe-inspiring occasion. It would give the whole earth goose bumps.” (p112)
Dizeez to novel gene annotations
We've been hard at work mining the logs for the Dizeez game (see past posts for context). To summarize the take-home message, the Dizeez game resulted in the identification of several novel gene-disease annotations. We used a pseudo-gold standard set of 3439 candidate...
Gene Wiki article out today at NAR
The articles for the annual database issue are starting to appear in the NAR collection. My favorite one this year is, immodestly perhaps, ours about the Gene Wiki! The simple message here is that the Gene Wiki is continuing to grow and that the content remains very high quality overall. For more information, the abstract is below, and of course the paper is freely accessible online:
“The Gene Wiki is an open-access and openly editable collection of Wikipedia articles about human genes. Initiated in 2008, it has grown to include articles about more than 10 000 genes that, collectively, contain more than 1.4 million words of gene-centric text with extensive citations back to the primary scientific literature. This growing body of useful, gene-centric content is the result of the work of thousands of individuals throughout the scientific community. Here, we describe recent improvements to the automated system that keeps the structured data presented on Gene Wiki articles in sync with the data from trusted primary databases. We also describe the expanding contents, editors and users of the Gene Wiki. Finally, we introduce a new automated system, called WikiTrust, which can effectively compute the quality of Wikipedia articles, including Gene Wiki articles, at the word level. All articles in the Gene Wiki can be freely accessed and edited at Wikipedia, and additional links and information can be found at the project’s Wikipedia portal page: http://en.wikipedia.org/wiki/Portal:Gene_Wiki.”
Learning from the "Dizeez" game
We recently released a game called Dizeez that tests your knowledge of gene-disease links. (Haven't seen it yet? Play here.) Now that it's been live for a couple of weeks, we've had a chance to look at the game logs and make a few observations: Dizeez was played to...
Outsourcing the BioGPS iPhone app
We're starting a new experiment with the BioGPS iPhone app. As you may know the iPhone app does exist, and it's got a more-than-respectable four-star rating. Yet, it's not an avenue that we've been pushing very hard on since we first released it. I think it's fair to...
Dizeez – fun with gene-disease links
Think you know something about the genetic basis of human diseases? Prove it by playing our new game "Dizeez". The rules are simple. You are shown one gene and five diseases. Pick the disease that is known to be linked with the gene and you get some points. Get as...
Stepping towards a Semantic Wikipedia
(Update: check out our publication in Database for a full-length, peer-reviewed version of this article.) It is now possible to specify the nature of the relationships between things described by Wikipedia articles directly in the context of the article…
Twenty questions for genes — evaluation framework
Part 1: Introduction to the concept Part 2: The prototype game Part 3: Evaluation framework (this post) In our previous post, we described how we created a prototype Guesser program that plays the game 20 Questions on genes. The next natural question is: how accurate...