GRCh38 (also known as hg38) is the latest human genome assembly, released almost a year ago. MyGene.info now supports GRCh38 by default for human genes, but data and queries based on the previous assembly (GRCh37/hg19) are still supported. Here are the details of this change:
- genomic_pos field is now based on hg38
This field contains the genomic location of the given gene. The start/end positions are now based on hg38:
In [1]: mg.getgene('1017', fields='genomic_pos')
Out[1]:
{'_id': '1017',
'genomic_pos': {'chr': '12',
'end': 55972784,
'start': 55966769,
'strand': 1}}
- exons field is now based on hg38
This field contains the genomic locations of exons, as well as the cdsstart/cdsend and txstart/txend data. All these values are now based on hg38:
In [2]: mg.getgene('1017', fields='exons')
Out[2]:
{'_id': '1017',
'exons': {'NM_001290230': {'cdsend': 55971625,
'cdsstart': 55967008,
'chr': '12',
'exons': [[55966768, 55967124],
[55968048, 55968169],
[55968777, 55968948],
[55971043, 55971247],
[55971520, 55972789]],
'strand': 1,
'txend': 55972789,
'txstart': 55966768},
'NM_001798': {'cdsend': 55971625,
'cdsstart': 55967008,
'chr': '12',
'exons': [[55966768, 55967124],
[55967856, 55967934],
[55968048, 55968169],
[55968777, 55968948],
[55969474, 55969576],
[55971043, 55971247],
[55971520, 55972789]],
'strand': 1,
'txend': 55972789,
'txstart': 55966768},
'NM_052827': {'cdsend': 55971625,
'cdsstart': 55967008,
'chr': '12',
'exons': [[55966768, 55967124],
[55967856, 55967934],
[55968048, 55968169],
[55968777, 55968948],
[55971043, 55971247],
[55971520, 55972789]],
'strand': 1,
'txend': 55972789,
'txstart': 55966768}}}
- Genome interval query is now based on hg38 by default:
In [3]: mg.query('chrX:151,073,054-151,383,976', species='human')
Out[3]:
{'hits': [{'_id': '100422930',
'_score': 5.1352987,
'entrezgene': 100422930,
'name': 'microRNA 4330',
'symbol': 'MIR4330',
'taxid': 9606},
{'_id': 'ENSG00000228717',
'_score': 5.1352987,
'symbol': 'AF013593.1',
'taxid': 9606},
{'_id': 'ENSG00000278724',
'_score': 5.1352987,
'name': 'Metazoan signal recognition particle RNA',
'symbol': 'Metazoa_SRP',
'taxid': 9606},
{'_id': '9248',
'_score': 5.1264653,
'entrezgene': 9248,
'name': 'G protein-coupled receptor 50',
'symbol': 'GPR50',
'taxid': 9606},
{'_id': 'ENSG00000234696',
'_score': 5.1264653,
'name': 'GPR50 antisense RNA 1',
'symbol': 'GPR50-AS1',
'taxid': 9606},
{'_id': 'ENSG00000269993',
'_score': 5.111959,
'symbol': 'AF003625.3',
'taxid': 9606}],
'max_score': 5.1352987,
'took': 1202,
'total': 6}
- A new field genomic_pos_hg19 is added to hold the genomic location data based on hg19:
In [4]: mg.getgene('1017', fields='genomic_pos_hg19')
Out[4]:
{'_id': '1017',
'genomic_pos_hg19': {'chr': '12',
'end': 56366568,
'start': 56360553,
'strand': 1}}
- A new field exons_hg19 is added to hold the exons data based on hg19:
In [5]: mg.getgene('1017', fields='exons_hg19')
Out[5]:
{'_id': '1017',
'exons_hg19': {'NM_001290230': {'cdsend': 56365409,
'cdsstart': 56360792,
'chr': '12',
'exons': [[56360552, 56360908],
[56361832, 56361953],
[56362561, 56362732],
[56364827, 56365031],
[56365304, 56366573]],
'strand': 1,
'txend': 56366573,
'txstart': 56360552},
'NM_001798': {'cdsend': 56365409,
'cdsstart': 56360792,
'chr': '12',
'exons': [[56360552, 56360908],
[56361640, 56361718],
[56361832, 56361953],
[56362561, 56362732],
[56363258, 56363360],
[56364827, 56365031],
[56365304, 56366573]],
'strand': 1,
'txend': 56366573,
'txstart': 56360552},
'NM_052827': {'cdsend': 56365409,
'cdsstart': 56360792,
'chr': '12',
'exons': [[56360552, 56360908],
[56361640, 56361718],
[56361832, 56361953],
[56362561, 56362732],
[56364827, 56365031],
[56365304, 56366573]],
'strand': 1,
'txend': 56366573,
'txstart': 56360552}}}
- You can still make genome interval queries based on hg19 by adding an "hg19." prefix:
In [6]: mg.query('hg19.chrX:151,073,054-151,383,976', species='human')
Out[6]:
{'hits': [{'_id': 'ENSG00000231937',
'_score': 6.9943757,
'symbol': 'RP11-329E24.6',
'taxid': 9606},
{'_id': '574412',
'_score': 6.9943757,
'entrezgene': 574412,
'name': 'microRNA 452',
'symbol': 'MIR452',
'taxid': 9606},
{'_id': 'ENSG00000228965',
'_score': 6.9943757,
'symbol': 'RP11-1007I13.2',
'taxid': 9606},
{'_id': '2564',
'_score': 6.9620624,
'entrezgene': 2564,
'name': 'gamma-aminobutyric acid (GABA) A receptor, epsilon',
'symbol': 'GABRE',
'taxid': 9606},
{'_id': '4109',
'_score': 6.9620624,
'entrezgene': 4109,
'name': 'melanoma antigen family A, 10',
'symbol': 'MAGEA10',
'taxid': 9606},
{'_id': '407009',
'_score': 6.9609237,
'entrezgene': 407009,
'name': 'microRNA 224',
'symbol': 'MIR224',
'taxid': 9606},
{'_id': 'ENSG00000229967',
'_score': 6.9609237,
'symbol': 'RP11-366F6.2',
'taxid': 9606},
{'_id': 'ENSG00000266560',
'_score': 6.9609237,
'symbol': 'RP11-1007I13.4',
'taxid': 9606},
{'_id': '2556',
'_score': 6.8982496,
'entrezgene': 2556,
'name': 'gamma-aminobutyric acid (GABA) A receptor, alpha 3',
'symbol': 'GABRA3',
'taxid': 9606},
{'_id': '4103',
'_score': 6.8982496,
'entrezgene': 4103,
'name': 'melanoma antigen family A, 4',
'symbol': 'MAGEA4',
'taxid': 9606}],
'max_score': 6.9943757,
'took': 1095,
'total': 12}
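If you switch between assemblies often, the two query forms above can be produced by one small helper. This is only a sketch: interval_query is a hypothetical function name, not part of the mygene client, and it simply reproduces the query strings shown in the examples above.

```python
def interval_query(chrom, start, end, assembly="hg38"):
    """Build a genome interval query string for mg.query().

    Hypothetical helper (not part of the mygene client). It formats the
    interval the same way as the examples above and adds the "hg19."
    prefix only when the older assembly is requested.
    """
    if assembly not in ("hg38", "hg19"):
        raise ValueError("assembly must be 'hg38' or 'hg19'")
    if start > end:
        raise ValueError("start must not be greater than end")
    # ',' in the format spec inserts thousands separators: 151,073,054
    interval = f"{chrom}:{start:,}-{end:,}"
    return interval if assembly == "hg38" else f"hg19.{interval}"


print(interval_query("chrX", 151073054, 151383976))
print(interval_query("chrX", 151073054, 151383976, assembly="hg19"))
```

The returned string can then be passed to the client as before, for example mg.query(interval_query('chrX', 151073054, 151383976, assembly='hg19'), species='human').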
As a final note, this change affects human genes only, of course.
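To see how the hg38 and hg19 fields relate for a single gene, you can subtract one set of coordinates from the other. The sketch below uses the genomic_pos and genomic_pos_hg19 values returned above for gene 1017. assembly_shift is a hypothetical helper name, and the resulting offset applies to this gene only, since the difference between assemblies is not constant across the genome.

```python
def assembly_shift(hg38_pos, hg19_pos):
    """Return (start_shift, end_shift), computed as hg19 minus hg38.

    Hypothetical helper: both arguments are dicts shaped like the
    'genomic_pos' and 'genomic_pos_hg19' fields shown above.
    """
    if hg38_pos["chr"] != hg19_pos["chr"]:
        raise ValueError("positions are on different chromosomes")
    return (hg19_pos["start"] - hg38_pos["start"],
            hg19_pos["end"] - hg38_pos["end"])


# Values copied from the getgene('1017', ...) outputs above.
hg38 = {"chr": "12", "start": 55966769, "end": 55972784, "strand": 1}
hg19 = {"chr": "12", "start": 56360553, "end": 56366568, "strand": 1}

shift = assembly_shift(hg38, hg19)
print(shift)  # (393784, 393784)
```

Both ends move by the same amount here, so the gene keeps the same length in either assembly. That need not hold for every gene.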