I blogged previously about my rule of 80% duplication, 20% innovation. That post discussed why there are so many gene-centric portals available that all present mostly overlapping annotation and data. I attributed this phenomenon to the lack of a mechanism for data providers and developers to easily integrate their new data with existing resources.
But many people probably thought, “What about DAS?” The Distributed Annotation System in theory targets exactly this — data exchange and integration between online resources. My hypothetical researcher in the 80%/20% post should have looked into using DAS to integrate his/her own data into an existing DAS gene portal, instead of creating a brand new one.
My guess, though, is that many people do look at DAS as an option but end up not using it. I think there are two related reasons. First, have you looked at the DAS specification? Yikes, over 9000 words with lots of dense XML. This protocol is clearly not aimed at the researcher with a cool new data set, who may have some basic programming/CGI skills but no formal bioinformatics training. To hammer home this point, I did an analysis a while back of all the DAS resources listed at dasregistry.org. Of the 53 working human and mouse DAS servers, 64% were written by just three organizations (all of which, incidentally, were heavily involved in developing the DAS specification itself).
Second, even if you do sort out the data interface, you’ll also find that DAS is focused on highly structured data. That means strongly typed annotation, usually features located by genomic coordinates. This is great for interchange of data between, for example, the ENCODE project and your favorite genome browser. But what if you have large-scale proteomics data to share? How do you visualize a gene network that shows physical or genetic interactions? Data structure is great if the structure accommodates your data type, and horrible if it doesn’t…
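To make that mismatch concrete, here is a minimal Python sketch. Everything in it is my own illustration: the `GenomicFeature` and `Interaction` record types, the `fits_genomic_axis` helper, and the gene names and coordinates are hypothetical placeholders, not the DAS schema or real annotation. It only shows that a coordinate-based record has a natural place on a genomic axis, while an interaction between two genes does not.

```python
from dataclasses import dataclass

# Hypothetical record types for illustration only; these are NOT the DAS schema.


@dataclass
class GenomicFeature:
    """An annotation that can be placed on a genomic axis."""
    segment: str       # e.g. a chromosome name
    start: int         # start coordinate on the segment
    end: int           # end coordinate on the segment
    feature_type: str  # e.g. "exon"
    label: str


@dataclass
class Interaction:
    """A physical or genetic interaction between two genes; it has no coordinates."""
    gene_a: str
    gene_b: str
    kind: str          # e.g. "physical" or "genetic"


def fits_genomic_axis(record) -> bool:
    """Return True if the record has the fields needed to draw it on a genomic axis."""
    return all(hasattr(record, name) for name in ("segment", "start", "end"))


# Placeholder values, not real annotation.
exon = GenomicFeature("chr1", 1000, 1200, "exon", "geneA exon 1")
edge = Interaction("geneA", "geneB", "physical")

print(fits_genomic_axis(exon))  # the feature carries segment/start/end
print(fits_genomic_axis(edge))  # the interaction does not
```

You could of course pin an interaction to the coordinates of one of its genes, but then the relationship itself, which is the interesting part, gets lost.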
So, in summary, I think it’s great if DAS is the solution catering to large genome centers with full-time professional bioinformaticians. But I think we as a community need to recognize that this should not be the only data exchange protocol in biology.
I’m still reading into this, but I guess one question is, what if someone, or some organization, provided a service that made it easy for your data to conform to the DAS spec?
Hmmm, obviously I’d be very interested. But since DAS is a very structured data format, I’d expect it to be applicable only to some data types (most notably data that can be presented as features on a genomic axis). But anyway, very interested to hear what you’re thinking…