ID mapping is a very common, and often not much fun, task for every bioinformatician. Suppose you have a list of gene symbols or reporter IDs from an upstream analysis, and your next analysis requires gene IDs (e.g. Entrez gene IDs or Ensembl gene IDs). Converting from one identifier to another is conceptually simple but often tedious.
Here we show how to use the mygene module in Python to do ID mapping quickly and easily. mygene is a convenient Python module for accessing the MyGene.info gene query web services.
Installing mygene
Installing mygene is easy, as pip is your friend:
pip install mygene
Now you just need to import it and instantiate the MyGeneInfo class:
import mygene
mg = mygene.MyGeneInfo()
Mapping gene symbols to Entrez gene IDs
Suppose xli is a list of gene symbols you want to convert to Entrez gene IDs:
xli = ['DDX26B', 'CCDC83', 'MAST3', 'FLOT1', 'RPL11', 'ZDHHC20',
       'LUC7L3', 'SNORD49A', 'CTSH', 'ACOT8']
out = mg.querymany(xli, scopes='symbol', fields='entrezgene', species='human')
scopes defines the type of the input identifier, fields defines the variable(s) to be returned, and species limits the species to search. The returned “out” looks like this:
[{u'_id': u'203522', u'entrezgene': 203522, u'query': u'DDX26B'},
{u'_id': u'220047', u'entrezgene': 220047, u'query': u'CCDC83'},
{u'_id': u'23031', u'entrezgene': 23031, u'query': u'MAST3'},
{u'_id': u'10211', u'entrezgene': 10211, u'query': u'FLOT1'},
{u'_id': u'6135', u'entrezgene': 6135, u'query': u'RPL11'},
{u'_id': u'253832', u'entrezgene': 253832, u'query': u'ZDHHC20'},
{u'_id': u'51747', u'entrezgene': 51747, u'query': u'LUC7L3'},
{u'_id': u'26800', u'entrezgene': 26800, u'query': u'SNORD49A'},
{u'_id': u'1512', u'entrezgene': 1512, u'query': u'CTSH'},
{u'_id': u'10005', u'entrezgene': 10005, u'query': u'ACOT8'}]
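A list like the one above is often easier to work with as a lookup table. The sketch below is one possible way to build it in plain Python and is not part of mygene itself: collect_entrez_ids is a hypothetical helper name, the sample list only mimics the shape of the output shown above, and the 'NOT_A_GENE' entry is made up to show how querymany marks a term it cannot match ('notfound': True). Depending on the service version, entrezgene may come back as a string rather than an integer; the helper stores whatever it receives.

```python
def collect_entrez_ids(results):
    """Group querymany-style hits by query term.

    Returns (mapping, missing):
      mapping -- {query: [entrezgene, ...]}, a list because one query
                 can match more than one gene
      missing -- queries that produced no Entrez gene ID at all
    """
    mapping = {}
    seen = []  # query terms in the order they first appear
    for hit in results:
        query = hit["query"]
        if query not in seen:
            seen.append(query)
        # Skip unmatched terms and hits that carry no Entrez gene ID
        if not hit.get("notfound") and "entrezgene" in hit:
            mapping.setdefault(query, []).append(hit["entrezgene"])
    missing = [q for q in seen if q not in mapping]
    return mapping, missing


# Sample data in the shape shown above; 'NOT_A_GENE' is a made-up term
sample = [
    {"_id": "23031", "entrezgene": 23031, "query": "MAST3"},
    {"_id": "10211", "entrezgene": 10211, "query": "FLOT1"},
    {"query": "NOT_A_GENE", "notfound": True},
]

mapping, missing = collect_entrez_ids(sample)
print(mapping)  # {'MAST3': [23031], 'FLOT1': [10211]}
print(missing)  # ['NOT_A_GENE']
```

With a live query, pass the list returned by mg.querymany in place of the sample data.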
Although the simple example above uses human gene symbols, MyGene.info supports over 30 common identifier types (see the list here) and almost all species indexed by NCBI. The annotation data are updated weekly.
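Switching to another identifier type mostly means changing the scopes and fields arguments. As a rough sketch, the hypothetical helper ensembl_to_symbol below maps Ensembl gene IDs to gene symbols using the 'ensembl.gene' scope and the 'symbol' field. OfflineClient is a stand-in written for this example so that it runs without network access; its IDs and symbols are placeholders, not real annotation data.

```python
def ensembl_to_symbol(client, ensembl_ids, species="human"):
    """Map Ensembl gene IDs to gene symbols via a MyGeneInfo-like client."""
    hits = client.querymany(
        ensembl_ids,
        scopes="ensembl.gene",  # input identifiers are Ensembl gene IDs
        fields="symbol",        # return the gene symbol
        species=species,
    )
    # Unmatched queries map to None; if a query yields several hits,
    # the last one wins in this simple sketch.
    return {hit["query"]: hit.get("symbol") for hit in hits}


class OfflineClient:
    """Stand-in for mygene.MyGeneInfo that answers from a canned table."""

    def querymany(self, qterms, scopes=None, **kwargs):
        known = {"ENSG_PLACEHOLDER": "GENE1"}  # made-up pair, not real data
        return [
            {"query": q, "symbol": known[q]} if q in known
            else {"query": q, "notfound": True}
            for q in qterms
        ]


result = ensembl_to_symbol(OfflineClient(), ["ENSG_PLACEHOLDER", "ENSG_UNKNOWN"])
print(result)
```

To query the real service, pass the mg instance created earlier instead of OfflineClient().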
Get the idea of how it works? Continue to read the full tutorial here, which covers slightly more advanced examples and edge cases…