MyVariant.info is a high-performance API for querying aggregated, up-to-date annotations for human genetic variants. Developers typically query the MyVariant.info API from an analysis workflow or a live web application, so that they don’t have to spend time building and maintaining their own local annotation database. We recently had requests from users for a complete list of all variant IDs (i.e., the HGVS IDs we use) available from MyVariant.info. This is particularly useful for web applications that want to add hyperlinks to MyVariant.info whenever an annotation is available.
We heard your request. The MyVariant.info build process now outputs a list of all HGVS IDs, one per line: roughly 425 million IDs in our most recent release.
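As a quick refresher on the API itself, a single variant’s annotation can be retrieved from the /v1/variant endpoint. Here is a minimal sketch using the requests module (the HGVS ID is one of the example IDs shown later in this post):
import requests

# Fetch the aggregated annotation object for one HGVS ID.
hgvs_id = 'chr1:g.10583G>A'
r = requests.get('http://myvariant.info/v1/variant/' + hgvs_id)
annotation = r.json()  # a dict of annotations from multiple sources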
This tutorial will:
- show you how to download and decompress the ID lists, and
- show a way to stream and decompress the ID lists on the fly using Python.
Downloading and decompressing HGVS ID lists
You can access an xz-compressed file for all variant HGVS IDs annotated using either the hg19 or hg38 reference genome. The hg19 file is available at:
http://myvariant.info/all_ids/myvariant_hg19_ids.xz
The hg38 file is available at:
http://myvariant.info/all_ids/myvariant_hg38_ids.xz
You can download and decompress the hg19 file like this:
$ wget http://myvariant.info/all_ids/myvariant_hg19_ids.xz
$ xz -d myvariant_hg19_ids.xz
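If you prefer to do this step in Python rather than the shell, a rough equivalent using only the standard library might look like this (a sketch, assuming Python 3; the file names are just examples):
import lzma
import shutil
import urllib.request

URL = 'http://myvariant.info/all_ids/myvariant_hg19_ids.xz'

# Download the compressed ID list to disk.
urllib.request.urlretrieve(URL, 'myvariant_hg19_ids.xz')

# Decompress it to a plain-text file, one HGVS ID per line.
with lzma.open('myvariant_hg19_ids.xz', 'rb') as src:
    with open('myvariant_hg19_ids', 'wb') as dst:
        shutil.copyfileobj(src, dst)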
The resulting myvariant_hg19_ids file contains all HGVS IDs in MyVariant.info based on the hg19 reference genome. You can see the first ten below:
$ head myvariant_hg19_ids
chr1:g.10583G>A
chr1:g.10611C>G
chr1:g.11869G>A
chr1:g.11869G>C
chr1:g.11869G>T
chr1:g.11870T>A
chr1:g.11870T>C
chr1:g.11870T>G
chr1:g.11871T>A
chr1:g.11871T>C
Streaming and decompressing IDs on the fly
Alternatively, a streaming solution avoids downloading and decompressing local files altogether, and can be implemented entirely with Python’s requests module. If stream=True is passed to the requests.get function (as in the yield_ids function below), you can stream and decompress the file in chunks, yielding one ID at a time.
In [1]: HG19_IDS_URL = 'http://myvariant.info/all_ids/myvariant_hg19_ids.xz'
In [2]: HG38_IDS_URL = 'http://myvariant.info/all_ids/myvariant_hg38_ids.xz'
In [3]: import requests, lzma
In [4]: def yield_ids(url):
...: r = requests.get(url, stream=True)
...: with lzma.open(filename=r.raw, mode='rb') as xz_file:
...: for line in xz_file:
...: yield line.decode('utf-8').strip('\n')
...:
Here, r is a streaming GET request to our download URL. Its raw-data chunks are passed into the open function of Python’s lzma module. (Note: lzma is included in the standard library in CPython 3.3 and higher; for older Python versions, check out a backport.) Finally, for each line of the decompressed stream, we yield a newline-stripped, UTF-8 decoded string containing one HGVS ID.
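If you’d rather see what lzma.open is doing under the hood, the same idea can be written with explicit chunks and an incremental decompressor. This is just an illustrative sketch using requests’ iter_content and lzma.LZMADecompressor; the yield_ids function above is the simpler option:
import lzma
import requests

def yield_ids_chunked(url, chunk_size=1024 * 1024):
    decompressor = lzma.LZMADecompressor()
    buffer = b''
    r = requests.get(url, stream=True)
    for chunk in r.iter_content(chunk_size=chunk_size):
        # Decompress this chunk and split off any complete lines.
        buffer += decompressor.decompress(chunk)
        lines = buffer.split(b'\n')
        buffer = lines.pop()  # keep the trailing partial line
        for line in lines:
            yield line.decode('utf-8')
    if buffer:
        yield buffer.decode('utf-8')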
Let’s see how long it takes to get the first ID using this function:
In [5]: x = yield_ids(HG19_IDS_URL)
In [6]: %time next(x)
CPU times: user 24 ms, sys: 4 ms, total: 28 ms
Wall time: 440 ms
Out[6]: 'chr1:g.10583G>A'
The first ID is available within half a second, streaming and decompressing the IDs file on the fly. Next, let’s see how long it takes to iterate through all hg19 IDs, storing them in a list:
In [7]: x = yield_ids(HG19_IDS_URL)
In [8]: %time hg19_ids = list(x)
CPU times: user 17min 51s, sys: 1min 21s, total: 19min 13s
Wall time: 19min 13s
In [9]: len(hg19_ids)
Out[9]: 424515266
This took ~19 minutes to iterate through (and store) all ~425 million HGVS IDs; be careful doing this on machines with less than ~10 GB of RAM! For comparison, downloading the file, decompressing it, and then iterating through all IDs into a list took ~8-9 minutes.
And, again with all hg38 IDs:
In [11]: x = yield_ids(HG38_IDS_URL)
In [12]: %time hg38_ids = list(x)
CPU times: user 17min 25s, sys: 18.4 s, total: 17min 43s
Wall time: 17min 44s
In [13]: len(hg38_ids)
Out[13]: 412986445
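Finally, if you don’t need every ID held in memory at once, you can consume the generator incrementally and keep memory usage flat. For example (a sketch reusing the yield_ids function from above):
# Count IDs without storing them.
n_hg19 = sum(1 for _ in yield_ids(HG19_IDS_URL))

# Or keep only a subset of interest, e.g. chromosome 1 variants.
chr1_ids = [i for i in yield_ids(HG19_IDS_URL) if i.startswith('chr1:')]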