Raw data download

by Andrew Su | Dec 4, 2009 | BioGPS, feature | 29 comments

One of the primary reasons scientists come to BioGPS is to view our reference gene expression data sets in a simple bar chart form. And for the bioinformaticians, we’ve always provided the data sets for download on our BioGPS downloads page. Now, we have one more mechanism for users to access the raw data.

Using the new “Downloads” tab in the gene expression chart plugin, users can now download data on a gene-by-gene basis. This feature is useful for people who just want to plot data from one gene for a publication, and for cases when dominant expression in one tissue obscures variation in other tissues in the bar chart.

We hope you find this feature useful. As always, feedback welcome…

29 Comments

John on January 22, 2010 at 4:31 am

Where can I find more detail on the tumor samples? I can only guess at what tumor types from the graph.
Reply
Andrew on January 22, 2010 at 8:45 am

Hi John,

Sample annotations can be found on our downloads page at http://biogps.gnf.org/downloads/.

Cheers,
-andrew
Reply
Anonymous on January 29, 2010 at 4:07 pm

What is the best way to get the mappings between the Affy IDs and Entrez Gene IDs ?

Further, I found a lot of probes in the platform GNF1H that do not have any mappings to public databases, most of them being "predicted transcript (Celera)". How can I put these probes in context without gene or RNA information?

Thanks in advance.
Reply
Andrew on January 29, 2010 at 4:19 pm

Hi Anonymous,

The best source of mappings is Affymetrix itself. You can download annotation files from their website (e.g., Current NetAffx Annotation Files for U133A). Note that it does require a free NetAffx account…

As for how to handle the Celera transcripts on our custom-designed GNF1H and GNF1M chips, I'm afraid I don't have a good answer for you. We can't legally disclose the probe sequences (unless you are/were also a Celera customer). The good news is that we've mapped all the probe sets to public sequences whenever possible, and the vast majority of the rest aligns to genomic fluff.

Cheers,
-andrew
Reply
Anonymous on February 1, 2010 at 6:48 am

Hi Andrew,

Thanks for the response.
I got the file from http://www.affymetrix.com/Auth/analysis/downloads/na30/ivt/HG-U133A_2.na30.annot.csv.zip

However, only 132 probe names match the ones found in the data file available in BioGPS.

I may be missing some important point here and would appreciate some further clarifications…

Kind regards.
Reply
Anonymous on February 1, 2010 at 7:15 am

It is me again Andrew.

After some more digging I may have found the cause of the problem. The identifiers in the data file available at http://plugins.gnf.org/download/gnf1h-gcrma.zip overlap heavily with those in the file provided by Affy. However, the problem arises when I try to compare the identifiers with those in the GEO platform file GPL1074. Since that file is from 2004, I guess it has not been updated.
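A quick way to quantify this kind of identifier mismatch is a plain set comparison. The sketch below is only an illustration: the function name and the toy IDs are hypothetical, and the real files would first have to be parsed into lists of probe set IDs.

```python
def id_overlap(data_ids, annotation_ids):
    """Count IDs shared by, and unique to, a data file and an annotation file."""
    data_ids = set(data_ids)
    annotation_ids = set(annotation_ids)
    return {
        "shared": len(data_ids & annotation_ids),
        "data_only": len(data_ids - annotation_ids),
        "annotation_only": len(annotation_ids - data_ids),
    }


# Hypothetical toy IDs standing in for the parsed files.
overlap = id_overlap(
    ["probe_a", "probe_b", "probe_c"],
    ["probe_b", "probe_c", "probe_d", "probe_e"],
)
```

A large "data_only" count suggests the annotation file belongs to a different platform or version than the data file.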

Please let me know if I can use the data file from BioGPS with the affy annotations, without the GNF GEO file.

Best regards,
Reply
Andrew on February 1, 2010 at 8:39 am

Hi Anonymous,

Hmmm, I'm not exactly sure what your question is, but let me try to give you more info and you can tell me if anything is still unclear.

The data file for the "Human U133A/GNF1H Gene Atlas" contains ~33k probe sets. Of these, ~22k are directly from Affy's U133A chip, so you can get the most recent annotations from the file you downloaded from netaffx.

The remaining ~11k are custom-designed probe sets. You can download annotations for these probe sets from our downloads page. For most people, the links and files above are sufficient for most uses of these data.

There are also many probes on the GNF1H chip that aren't shown in the data file above because they no longer map to any current gene entry. Nevertheless, you can get information on all the probes from the GEO file for GPL1074. Note that the probe sequences that map to Celera data only are removed from that file.
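The two-source annotation described above could be combined along these lines. This is a minimal sketch, assuming both annotation files have already been parsed into simple probe-set-to-gene lookups; the function name, the dictionaries, and the custom probe set IDs are hypothetical and do not reflect the actual file layouts.

```python
def annotate_probesets(probeset_ids, affy_map, custom_map):
    """Return {probeset_id: gene symbol or None}.

    affy_map:   lookup for U133A probe sets (assumed parsed from the NetAffx file)
    custom_map: lookup for custom GNF1H probe sets (assumed parsed from the
                annotation table on the downloads page)
    """
    annotated = {}
    for pid in probeset_ids:
        if pid in affy_map:
            annotated[pid] = affy_map[pid]
        else:
            # Falls back to the custom table; None if the ID is in neither source.
            annotated[pid] = custom_map.get(pid)
    return annotated


# Toy example; the custom ID and gene name are placeholders.
affy_map = {"201959_s_at": "MYCBP2"}
custom_map = {"custom_probe_1": "GENE_X"}
ids = ["201959_s_at", "custom_probe_1", "custom_probe_2"]
result = annotate_probesets(ids, affy_map, custom_map)
```

Probe sets that come back as None are the ones worth checking against the GEO file for GPL1074.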

Hope that helps, and let us know if you need further clarification.

Cheers,
-andrew
Reply
Anonymous on February 1, 2010 at 8:56 am

Hi Andrew,

Thanks for the kind and detailed answers. I think I have now found my way around these files.

Sorry if my questions were very basic. Although I have years of experience in Bioinformatics, I am completely new to BioGPS and its data/annotations.

Best regards.
Reply
yxi on February 3, 2010 at 1:29 pm

Hi Andrew,

How can I download expression profiles for a list of genes in a specific tissue?

Thanks,

Yuanxin
Reply
Andrew on February 3, 2010 at 3:48 pm

Hi Yuanxin,

To download data for several genes in a single tissue, you'll want to just download the entire data matrix from the downloads page. However, you should also note that microarrays are best used to compare single probe sets across multiple conditions. Due to different characteristics, comparisons between multiple probe sets on a single condition should be done with caution.
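For anyone who wants to do this programmatically, here is a minimal sketch of pulling one tissue's values for a list of probe sets out of a downloaded matrix. The column names ("ProbesetID", "Liver") and the probe IDs are assumptions for illustration; check the header of the actual file before using them.

```python
import csv
import io


def tissue_values(csv_file, id_column, tissue, wanted_ids):
    """Return {probeset_id: value} for one tissue column and a set of IDs."""
    wanted = set(wanted_ids)
    values = {}
    for row in csv.DictReader(csv_file):
        pid = row[id_column]
        if pid in wanted:
            values[pid] = float(row[tissue])
    return values


# Stand-in for an opened data matrix; the layout is hypothetical.
demo = io.StringIO(
    "ProbesetID,Liver,Heart\n"
    "probe_a,120.5,30.0\n"
    "probe_b,8.0,410.2\n"
    "probe_c,55.1,60.3\n"
)
liver = tissue_values(demo, "ProbesetID", "Liver", ["probe_a", "probe_c"])
```

The caveat above still applies: values for different probe sets within one tissue are not directly comparable.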

Hope that helps…

Cheers,
-andrew
Reply
Anonymous on February 10, 2010 at 11:17 am

Is there any consensus on how to choose the best probe for a given gene?

For example, assume you have 10 genes and want to generate a figure, say a heatmap, with 10 rows. What would be the best way to pick the best one?

I gave some thought to this, but am still unsure. I could use the one with the lowest SD, the highest intensity, etc.

Any tips ? Thanks !
Reply
Andrew on February 10, 2010 at 11:44 am

Hi Anonymous,

Unfortunately there is no consensus on how to choose the "best" probe set for a given gene. Personally, I tend to filter out any probe set that has a low maximum expression (say, less than 150) because that means the probe set is not responding under any conditions. If there are replicates in the data set you're looking at, then large error bars for many samples tend to be a bad sign (even if n=2). Beyond that, you can look at where the individual probes map to see if different patterns can be attributed to the sequences being queried (splice variants, for example). Highest intensity is probably a reasonable thing to look at, provided it's not uniformly high signal (which might indicate high background).
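As a rough sketch of the first heuristic only (not a consensus method), the function below drops probe sets whose maximum expression is under a cutoff and ranks the rest by maximum intensity. The function name, the probe IDs, and the values are hypothetical; each profile is assumed to be a non-empty list of linear-scale values, and the checks for error bars and uniformly high signal are left out.

```python
def candidate_probesets(profiles, min_max_expression=150.0):
    """Rank probe sets for one gene after removing non-responding ones.

    profiles: {probeset_id: [expression values across samples]}
    Returns probe set IDs whose maximum is at least the cutoff,
    ordered from highest to lowest maximum.
    """
    kept = {
        pid: values
        for pid, values in profiles.items()
        if max(values) >= min_max_expression
    }
    return sorted(kept, key=lambda pid: max(kept[pid]), reverse=True)


# Toy profiles for three probe sets of the same gene.
profiles = {
    "probe_a": [12.0, 18.5, 9.3],
    "probe_b": [40.0, 950.0, 75.0],
    "probe_c": [160.0, 155.0, 170.0],
}
ranked = candidate_probesets(profiles)
```

Here "probe_c" survives the filter but is nearly flat, which is exactly the uniformly high pattern that deserves a second look before it is trusted.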

Hope that helps,
-andrew
Reply
Eric Ho on February 16, 2010 at 7:20 pm

Hi,

Since I need to study a set of genes, I downloaded the raw data as you recommended in a previous comment.

So I downloaded gnf1h-gcrma.zip and its annotation file gnf1h-anntable.zip, but I am having trouble cross-referencing the two files by probe set ID. For example, when I queried human MYCBP2 on the web, the expression chart indicated that MYCBP2 is probed by two probe sets, viz. 201959_s_at & 20960_s_at. Readings for both probe sets can be found in U133AGNF1B.gcrma.avg.csv (unzipped from gnf1h-gcrma.zip), but I can't find the two probe set IDs in gnf1h-anntable.zip. If they are not there, how can I figure out which gene a probe set targets from the downloaded files?

Thanks, Eric.
Reply
Eric Ho on February 17, 2010 at 5:02 am

Hi Andrew,

Please ignore my previous question about cross referencing the info. between gnf1h-gcrma.zip and its annotation file gnf1h-anntable.zip. I think your reply to someone previously can help. Sorry for the trouble. Eric.
Reply
Andrew on February 17, 2010 at 8:35 am

Hi Eric,

Glad you found your answer. Just so it's also posted here directly, the "gnf1h-gcrma.zip" data is actually a combination of our custom-designed GNF1H chip and the publicly-available U133A chip from Affymetrix. The best annotation file for the U133A chip can always be found directly on their website. (And it's no trouble at all, Eric…)

Cheers,
-andrew
Reply
Anonymous on February 22, 2010 at 7:03 am

Hi Andrew,
What do you think would be a good cutoff to define a probe as expressed or not? 150?

Thank you.
Reply
dmanagadze on February 22, 2010 at 2:04 pm

Hello,

I found two different files with the same name and have trouble determining which one is the correct one:
Two data sources from GNF containing the expression data:

1. There is a link to the file: http://plugins.gnf.org/download/gnf1h-gcrma.zip
I downloaded it a couple of weeks ago.
It contained a file: U133AGNF1B_public.gcrma.newid.avg.txt

2. There is another link:
http://plugins.gnf.org/download/gnf1h-gcrma.zip
that contains the file: U133AGNF1B.gcrma.avg.csv

The same file names, but the data are different!

(e.g. probe 201451_x_at in adipocyte is
10.7 in Source #1; 19.35 in Source #2;
UCSC Table Browser, which I thought contained copies of your data, has completely different values)

Today (22-Feb-10) I again downloaded the file from the Source #1. Now it contains the same data as Source #2 !

So, which data source is the correct one? Which one can I trust?
Reply
Andrew on March 3, 2010 at 11:07 am

Hi Anonymous (regarding thresholds),

I think there is no good global threshold. Every probe set has its own background characteristics. Generally, if I have a specific gene of interest, I trust my eyes to determine the right background level. Large error bars are a huge red flag for noisy probe sets (even though they are generally based on n=2).

But for global analyses, visual inspection isn't feasible. For those cases, 150 seems to be a reasonable threshold.
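To see how sensitive a global analysis is to the cutoff, one can count the probe sets that reach it in at least one tissue at several thresholds. This is a sketch under assumptions: the function name, the probe IDs, and the values are hypothetical, and the data are assumed to be on a linear scale.

```python
def count_expressed(profiles, threshold):
    """Count probe sets with at least one value at or above the threshold."""
    return sum(
        1
        for values in profiles.values()
        if any(v >= threshold for v in values)
    )


# Toy profiles: {probeset_id: [values across tissues]}.
profiles = {
    "probe_a": [12.0, 18.5, 9.3],
    "probe_b": [40.0, 950.0, 75.0],
    "probe_c": [160.0, 155.0, 170.0],
    "probe_d": [60.0, 80.0, 20.0],
}
counts = {t: count_expressed(profiles, t) for t in (50, 150, 300)}
```

Comparing the counts side by side makes it easier to judge whether 150 is too stringent for a particular data set.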

Cheers,
-andrew
Reply
Anonymous on March 5, 2010 at 11:50 am

Do you mean 150 threshold in the file provided in the website ?

By doing this, only 11833 probe sets would be considered "on" in at least one tissue.

Is this correct ? I think it is too stringent…
Reply
Anonymous on March 10, 2010 at 6:22 am

Nobody answers the blog anymore…
Reply
Andrew on March 10, 2010 at 10:27 am

Anonymous,

Apologies, I'm a little bit behind on answering questions. I handle almost all of the Q&A on the blog, and I'm in the middle of a busy travel period. Note also that questions posted to the BioGPS Google Group may receive a quicker response because more people monitor that forum.

Cheers,
-andrew
Reply
Andrew on March 10, 2010 at 10:51 am

Dear dmanagadze,

Apologies for the confusion. We need to do a better job of tracking versions of the files we provide for download. We recently did update the file you mention below (and confusingly, we updated the filename within the zip file, but not the zip file itself). Our goal is always to present the "best" file for download on the website, and we also aim to exactly match the data that is shown in the bar charts. If I'm remembering correctly, that was the issue that was recently corrected. There was a mismatch between the online data and the downloadable data, and so we put everything through the most recent analysis pipeline and updated both. In addition, a few weeks before that, we also added a few new brain regions to the Gene Atlas data set.
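Until downloads carry explicit version numbers, one way for users to notice that a file has changed under the same name is to record a checksum with each download. The sketch below is a suggestion rather than anything BioGPS provides; the function name is hypothetical, and the in-memory streams stand in for two downloads of the same zip file.

```python
import hashlib
import io


def stream_sha256(stream, chunk_size=65536):
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Stand-ins for the same file name downloaded on two different dates.
old_download = io.BytesIO(b"probe_a,10.7\n")
new_download = io.BytesIO(b"probe_a,19.35\n")
changed = stream_sha256(old_download) != stream_sha256(new_download)
```

With a real file, pass `open(path, "rb")` instead of the `io.BytesIO` objects and store the digest next to the download date.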

Sorry for the confusion, hope that helps…

Cheers,
-andrew
Reply
Andrew on March 10, 2010 at 11:38 am

Anonymous,

The fact that you find a threshold of 150 to be too stringent underscores the difficulty of defining a good global threshold. Unfortunately, there is no "right answer". Please also see this discussion on the Google Group.

Cheers,
-andrew
Reply
Anonymous on April 7, 2010 at 7:25 am

Are the values in the averaged file in log2 scale ?

Thanks !
Reply
Andrew on April 7, 2010 at 9:06 am

Most if not all of our data are presented on a linear scale. We just feel that's a better representation of the gene expression profile. (If there's a data set in particular that you're inquiring about, post again and I'll check for that specific file.)

Cheers,
-andrew
Reply
Anonymous on April 7, 2010 at 9:47 am

Sorry for the lack of details. I meant this file:

http://plugins.gnf.org/download/gnf1h-gcrma.zip

Thanks very much.
Reply
Andrew on April 7, 2010 at 9:54 am

Yes, that raw data file for our latest human GeneAtlas data set is linear scale.
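If an analysis needs log2 values, the linear data can be converted after loading. This is a minimal sketch; the pseudocount of 1 is a common convention for avoiding log2(0) and an assumption here, not something specified for this data set.

```python
import math


def to_log2(values, pseudocount=1.0):
    """Convert linear-scale expression values to log2(value + pseudocount)."""
    return [math.log2(v + pseudocount) for v in values]


# Toy linear-scale values.
linear = [0.0, 7.0, 1023.0]
logged = to_log2(linear)
```

A log scale can also help when dominant expression in one tissue hides variation in the others.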

Cheers,
-andrew
Reply
Anonymous on July 29, 2014 at 7:41 am

Hi Andrew,
I have a question regarding the probeset to gene mapping.
I found examples where the mapping in the latest annotation file and the BioGPS site are not the same. Is there a newer annotation file?
Example:
BioGPS site: http://biogps.org/#goto=genereport&id=161424
gene symbol=NOP9, probesetId=gnf1h08751_at
In the latest mapping file for Human GNF1H this probeset is mapped to gene symbol=CIDEB
Thank you very much.
Reply
ginger on July 31, 2014 at 1:28 pm

From a previous comment: “The data file for the “Human U133A/GNF1H Gene Atlas” contains ~33k probe sets. Of these, ~22k are directly from Affy’s U133A chip, so you can get the most recent annotations from the file you downloaded from netaffx.

The remaining ~11k are custom-designed probe sets. You can download annotations for these probe sets from our downloads page. For most people, the links and files above are sufficient for most uses of these data.”

For the most up-to-date annotation files on the non-custom probe sets, Affymetrix is your best bet. You’ll probably have to register with them in order to access their files.
Reply
