One of the primary reasons scientists come to BioGPS is to view our reference gene expression data sets in a simple bar chart form. And for the bioinformaticians, we’ve always provided the data sets for download on our BioGPS downloads page. Now, we have one more mechanism for users to access the raw data.
Using the new “Downloads” tab in the gene expression chart plugin, users can now download data on a gene-by-gene basis. This feature is useful for people who just want to plot data from one gene for a publication, and for cases when dominant expression in one tissue obscures variation in other tissues in the bar chart.
We hope you find this feature useful. As always, feedback welcome…
Where can I find more detail on the tumor samples? I can only guess at what tumor types from the graph.
Hi John,
Sample annotations can be found on our downloads page at http://biogps.gnf.org/downloads/.
Cheers,
-andrew
What is the best way to get the mappings between the Affy IDs and Entrez Gene IDs ?
Further, I found a lot of probes in the platform GNF1H that do not have any mappings to public databases, most of them being "predicted transcript (Celera)". How to put these guys in context without gene or RNA information ?
Thanks in advance.
Hi Anonymous,
The best source of mappings is Affymetrix itself. You can download annotation files from their website (e.g., Current NetAffx Annotation Files for U133A). Note that it does require a free NetAffx account…
As for how to handle the Celera transcripts on our custom-designed GNF1H and GNF1M chips, I'm afraid I don't have a good answer for you. We can't legally disclose the probe sequences (unless you are/were also a Celera customer). The good news is that we've mapped all the probe sets to public sequences whenever possible, and the vast majority of the rest aligns to genomic fluff.
Cheers,
-andrew
Hi Andrew,
Thanks for the response.
I got the file from http://www.affymetrix.com/Auth/analysis/downloads/na30/ivt/HG-U133A_2.na30.annot.csv.zip
However, only 132 probe names match the ones found in the data file available in BioGPS.
I may be missing some important point here and would appreciate some further clarifications…
Kind regards.
It is me again Andrew.
After some more digging I might have found the reason of the problem. The identifiers in the data file available at http://plugins.gnf.org/download/gnf1h-gcrma.zip highly overlap with the file provided by affy. However, the problem comes when I try to compare the identifiers with those in the GEO platform file GPL1074. Since it is from 2004, I guess it is not updated.
Please let me know if I can use the data file from BioGPS with the affy annotations, without the GNF GEO file.
Best regards,
Hi Anonymous,
Hmmm, I'm not exactly sure your question, but let me try to give you more info and you can tell me if anything is still not clear.
The data file for the "Human U133A/GNF1H Gene Atlas" contains ~33k probe sets. Of these, ~22k are directly from Affy's U133A chip, so you can get the most recent annotations from the file you downloaded from netaffx.
The remaining ~11k are custom-designed probe sets. You can download annotations for these probe sets from our downloads page. For most people, the links and files above are sufficient for most uses of these data.
There are also many probes on the GNF1H chip that aren't shown in the data file above because they no longer map to any current gene entry. Nevertheless, you can get information on all the probes from the GEO file for GPL1074. Note that the probe sequences that map to Celera data only are removed from that file.
Hope that helps, and let us know if you need further clarification.
Cheers,
-andrew
Hi Andrew,
Thanks for the kind and detailed answers. Now I think I found myself in these files.
Sorry if my questions were very basic. Although I have years of experience in Bioinformatics, I am completely new to BioGPS and its data/annotations.
Best regards.
Hi Andrew,
How can I download expression profiles for a list of genes in a specific tissue?
Thanks,
Yuanxin
Hi Yuanxin,
To download data for several genes in a single tissue, you'll want to just download the entire data matrix from the downloads page. However, you should also note that microarrays are best used to compare single probe sets across multiple conditions. Due to different characteristics, comparisons between multiple probe sets on a single condition should be done with caution.
Hope that helps…
Cheers,
-andrew
Is there any consensus on how to choose the best probe for a given gene?
For example, assume you have 10 genes and want to generate a figure, say a heatmap, with 10 rows. What would be the best way to pick the best one?
I gave some though on this, but am still unsure. I could use the one with lowest SD, highest intensity etc.
Any tips ? Thanks !
Hi Anonymous,
Unfortunately there is no consensus on how to choose the "best" probe set for a given gene. Personally, I tend to filter out any probe set that has a low maximum expression (say, less than 150) because that means the probe set is not responding under any conditions. If there are replicates in the data set you're looking at, then large error bars for many samples tends to be a bad sign (even if n=2). Beyond that, you can look at where the individual probes map to see if different patterns can be attributed to the sequences being queried (splice-variants, for example). Highest intensity is probably a reasonable thing to look at, provided it's not uniformly high signal (which might indicate high background).
Hope that helps,
-andrew
Hi,
Since I need to study a set of genes therefore I have downloaded the raw data according to what you have recommended in previous comment.
So I downloaded gnf1h-gcrma.zip and its annotation file gnf1h-anntable.zip. But I have problem to cross reference the data between the two files using probesetID. Eg. I queried human MYCBP2 from the web, the expression activity chart indicated that MYCBP2 is probed by two probesets viz. 201959_s_at & 20960_s_at. Readings of the two probesets can be found in U133AGNF1B.gcrma.avg.csv (unzip of gnf1h-gcrma.zip), but I can't found the 2 probesetIDs in gnf1h-anntable.zip. If not, then how can I figure out the gene probed by a probeset from the download files?
Thks, Eric.
Hi Andrew,
Please ignore my previous question about cross referencing the info. between gnf1h-gcrma.zip and its annotation file gnf1h-anntable.zip. I think your reply to someone previously can help. Sorry for the trouble. Eric.
Hi Eric,
Glad you found your answer. Just so it's also posted here directly, the "gnf1h-gcrma.zip" data is actually a combination of our custom-designed GNF1H chip and the publicly-available U133A chip from Affymetrix. The best annotation file for the U133A chip can always be found directly on their website. (And it's no trouble at all, Eric…)
Cheers,
-andrew
Hi Andrew,
What do you think would be a good cut off to define a probe as expressed or not ? 150 ?
Thank you.
Hello,
I found two different files with the same name and have trouble determining which one is the correct one:
Two data sources from GNF containing the expression data:
1. There is a link to the file: http://plugins.gnf.org/download/gnf1h-gcrma.zip
I downloaded it a couuple of weeks ago.
It contained a file: U133AGNF1B_public.gcrma.newid.avg.txt
2. There is another link:
http://plugins.gnf.org/download/gnf1h-gcrma.zip
that contains the file: U133AGNF1B.gcrma.avg.csv
The same file names, but the data are different!
(e.g. probe 201451_x_at in adipocyte is
10.7 in Source #1; 19.35 in Source #2;
USCS Table browser, which I thought contained the copies of your data, has completely different values)
Today (22-Feb-10) I again downloaded the file from the Source #1. Now it contains the same data as Source #2 !
So, which data source is the correct one? Which one can I trust?
Hi Anonymous (regarding thresholds),
I think there is no good global threshold. Every probe set has its own background characteristics. Generally if I have a specific gene of interest, I trust my eyes to determine the right background level. Large error bars are a huge red flag as far as noisy probe sets (even though they are generally based on n=2).
But for global analyses, visual inspection isn't feasible. For those cases, 150 seems to be a reasonable threshold.
Cheers,
-andrew
Do you mean 150 threshold in the file provided in the website ?
By doing this, only 11833 probe sets would be considered "on" in at least one tissue.
Is this correct ? I think it is too stringent…
Nobody answers the blog anymore…
Anonymous,
Apologies, I'm a little bit behind on answering questions. I handle almost all of the Q&A; on the blog, and I'm in the middle of a busy travel period. Note also that questions posted to the BioGPS Google Group may receive a quicker response because more people monitor that forum.
Cheers,
-andrew
Dear dmanagadze,
Apologies for the confusion. We need to do a better job of tracking versions of the files we provide for download. We recently did update the file you mention below (and confusingly, we updated the filename within the zip file, but not the zip file itself). Our goal is always to present the "best" file for download on the website, and we also aim to exactly match the data that is shown in the bar charts. If I'm remembering correctly, that was the issue that was recently corrected. There was a mismatch between the online data and the downloadable data, and so we put everything through the most recent analysis pipeline and updated both. In addition, a few weeks before that, we also added a few new brain regions to the Gene Atlas data set.
Sorry for the confusion, hope that helps…
Cheers,
-andrew
Anonymous,
The fact that you find a threshold of 150 to be too stringent underscores the difficult of defining a good global threshold. Unfortunately, there is no "right answer". Please also see this discussion on the Google Group.
Cheers,
-andrew
Are the values in the averaged file in log2 scale ?
Thanks !
Most if not all of our data are presented on a linear scale. We just feel that's a better representation of the gene expression profile. (If there's a data set in particular that you're inquiring about, post again and I'll check for that specific file.)
Cheers,
-andrew
Sorry for the lack of details. I meant this file:
http://plugins.gnf.org/download/gnf1h-gcrma.zip
Thanks very much.
Yes, that raw data file for our latest human GeneAtlas data set is linear scale.
Cheers,
-andrew
Hi Andrew,
I have a question regarding the probeset to gene mapping.
I found examples where the mapping in the latest annotation file and the BioGPS site are not the same. Is there a newer annotation file?
Example:
BioGPS site: http://biogps.org/#goto=genereport&id=161424
gene symbol=NOP9, probesetId=gnf1h08751_at
In the latest mapping file for Human GNF1H this probeset is mapped to gene symbol=CIDEB
Thank you very much.
From a previous comment: “The data file for the “Human U133A/GNF1H Gene Atlas” contains ~33k probe sets. Of these, ~22k are directly from Affy’s U133A chip, so you can get the most recent annotations from the file you downloaded from netaffx.
The remaining ~11k are custom-designed probe sets. You can download annotations for these probe sets from our downloads page. For most people, the links and files above are sufficient for most uses of these data.”
For the most up-to-date annotation files on the non-custom probe sets, Affymetrix is your best bet. You’ll probably have to register with them in order to access their files.