Tag Archives: entrez gene

Tip of the Week: GRAIL for prioritizing SNPs

Perusing my copy of Nature Genetics last week, I was flipping through the pages and noticed an unusual graphic.  I looked at it a little more closely and was convinced it was one of the Spirographs that I used to make as a kid.  (Remember those? I always liked that….)  I looked closer still and realized it was somewhat more informative than the Spirographs I used to draw.  This represented the relationships between genes, based on the literature.  Hmmm….how did they do this, exactly?

The paper I was reading was Genetic variants at CD28, PRDM1 and CD2/CD58 are associated with rheumatoid arthritis risk by Raychaudhuri et al, which was interesting enough.  I like to read the GWAS papers to see what the current techniques and strategies are, not only for the specific genes themselves.  And this paper reported the strategy that they used to prioritize their SNPs, and that they used GRAIL to generate the data for this graphic of gene relationships.  Check out Figure 1 for the strategy.

When I saw the name GRAIL I thought–huh….GRAIL is back with a new use?  I thought that was…ah…retired…at this point.  But this isn’t that GRAIL (http://compbio.ornl.gov/Grail-1.3/, Gene Recognition and Assembly Internet Link).  This is a different GRAIL–the new one is Gene Relationships Among Implicated Loci. So I had to go and read that paper, which is  Identifying Relationships among Genomic Disease Regions: Predicting Genes at Pathogenic SNP Associations and Rare Deletions by Raychaudhuri et al.

This new GRAIL is all about text mining.  It is a tool that relies on statistical text mining of the literature for genes in a region and examines the relationships among those genes in the text.  The focus in their case is disease regions, but there’s no reason that you couldn’t use it for a variety of other topics.   As the authors state:

Given only a collection of disease regions, GRAIL uses our text-based definition of relatedness (or alternative metrics of relatedness) to identify a subset of genes, more highly related than by chance; it also assigns a select set of keywords that suggest putative biological pathways.

So you pull a set of genes out of the literature based on SNPs or locations of interest, and you can begin to assess what’s interesting in the set.   Now, the tool makes a lot of assumptions that you should be aware of if you are going to use it.  It assumes each region contains a single pathogenic gene.  I’m not sure that’s always going to be the case, but as long as you know that going in, it’s a fair assumption for this tool.  They suggest this helps keep multigenic regions from dominating the analysis.  Fair enough, but…what if that is the interesting aspect?  Still–that’s ok as long as you know.
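To make the idea concrete, here is a toy sketch of text-based relatedness with the one-gene-per-region assumption. This is not GRAIL's actual statistic: the gene names, the "abstract" text, the cosine similarity on raw word counts, and the scoring rule are all assumptions I made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy data (not from the GRAIL paper): each region lists its
# candidate genes, and each gene has some made-up "abstract" text.
REGIONS = {
    "region_1": ["GENE_A", "GENE_B"],
    "region_2": ["GENE_C", "GENE_D"],
    "region_3": ["GENE_E"],
}
ABSTRACTS = {
    "GENE_A": "cholesterol biosynthesis enzyme lipid metabolism",
    "GENE_B": "olfactory receptor neuron signaling",
    "GENE_C": "lipid transport cholesterol plasma",
    "GENE_D": "muscle contraction calcium channel",
    "GENE_E": "cholesterol lipid receptor uptake",
}


def word_vector(text):
    """Turn text into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gene_score(gene, region, regions, vectors):
    """Sum, over every OTHER region, the best similarity to any gene there.

    Taking only the best match per region means a region crowded with
    genes counts once, which mirrors the single-gene assumption.
    """
    total = 0.0
    for other, genes in regions.items():
        if other == region:
            continue
        total += max((cosine(vectors[gene], vectors[g]) for g in genes), default=0.0)
    return total


def best_gene_per_region(regions, abstracts):
    """Pick the one gene per region most related to genes in other regions."""
    vectors = {gene: word_vector(text) for gene, text in abstracts.items()}
    return {
        region: max(genes, key=lambda g: gene_score(g, region, regions, vectors))
        for region, genes in regions.items()
    }


picks = best_gene_per_region(REGIONS, ABSTRACTS)
print(picks)
```

With this made-up data the lipid-flavored genes win their regions, because their text overlaps with genes in the other regions. The real tool works from PubMed abstracts and a proper statistical framework, so treat this only as a way to picture the approach.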

In the paper they use validated SNPs from 4 different research areas:

  • SNPs associated with serum lipid levels: GRAIL finds genes in the cholesterol biosynthesis pathway.
  • SNPs associated with height; they identify pathways they consider plausible.
  • Crohn’s disease; they confirm associations that have been seen.
  • Schizophrenia–and here they used rare deletions as the items of interest; they find related genes, many highly enriched in the CNS. This suggests the strategy may be useful not only for SNPs but for CNVs as well.

Their Figure 1 nicely summarizes the strategy.

One curious tweak of the data analysis was that they used the literature prior to December 2006, because right after that there was an onslaught of GWAS papers that would list a whole bunch of genes associated with regions that might be more tenuous still.  I understand this in theory, but I imagine it also eliminates more current research on genes of interest from other methods too.  I saw in the tool you could choose either pre-Dec 06 or a more up-to-date literature set.  It would be useful to try both if you use GRAIL and keep that in mind.
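If you want to picture what the date cutoff costs you, here is a small sketch that compares which genes are covered before the cutoff versus in the whole literature set. The records, gene names, and the exact boundary date are hypothetical; I am assuming "prior to December 2006" means before December 1, 2006, and the tool may draw the line differently.

```python
from datetime import date

# Assumed boundary for "prior to December 2006"; the tool's exact cutoff may differ.
CUTOFF = date(2006, 12, 1)

# Hypothetical abstract records: (gene symbol, publication date). Made-up data.
RECORDS = [
    ("GENE_A", date(2004, 5, 17)),
    ("GENE_B", date(2006, 11, 30)),
    ("GENE_B", date(2008, 2, 1)),
    ("GENE_C", date(2007, 3, 9)),
]


def genes_covered(records, cutoff=None):
    """Genes with at least one abstract before the cutoff (or any abstract if cutoff is None)."""
    return {gene for gene, published in records if cutoff is None or published < cutoff}


early = genes_covered(RECORDS, CUTOFF)   # literature before the cutoff
full = genes_covered(RECORDS)            # the more up-to-date literature set
only_recent = full - early               # genes you would miss with the early set

print(sorted(only_recent))
```

Here GENE_C only shows up after the cutoff, so an analysis on the earlier literature set would never see it. That is the trade-off to keep in mind when you pick between the two sets.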

Another point to keep in mind: some genes are just not found in the abstracts, and they mention that is an issue.   So the genes you can examine are those that appeared in the abstracts and were identified properly with nomenclature, spelling, etc.  Text mining is cool, but it has a lot of limitations around those aspects, and around the use of synonyms in general. It’s not just an issue for GRAIL, but for all text mining tools at this point.

They also devise a way to use Gene Ontology (GO) and some expression data in GRAIL as other “relatedness” metrics.  You’ll find those available from the GRAIL tool as well.

They don’t show any spirographs in their figures in this first GRAIL paper.  That one that drew me in was Figure 2 in the arthritis paper.  So I went over to the software to try to generate these myself.  The outcome at this point is a web page with text and links to UCSC Genome Browser, and Entrez Gene (from the individual genes and from the keyword list–keywords collect multiple Entrez Genes).  I was a little surprised that the keyword link wasn’t to PubMed as well.  Currently it doesn’t provide the graphic, but maybe that will come along over time.  If it does I’ll be sure to mention it on the blog.

One final note on the paper: in the supplemental section they compare GRAIL to other tools in this arena.  If you are interested in tools like we are here you may find some of them interesting as well.   The tools are listed with URLs in Table S5, and the comparison outcome is in Text S1:

Prioritizer [2], Gene2Disease (G2D) [3,4,5], Commonality of Functional Annotation (CFA) [6], and Prospectr [7]. There were five supervised tools: Endeavour [8], GeneSeeker [9], SUSPECTS [10], TOM [11], and CANDID [12]

So check out GRAIL and see if you find gene relationships.  But don’t forget those caveats about the genes not listed in the abstracts, or the literature coverage dates.  The software can be found here:  http://www.broad.mit.edu/mpg/grail/

I know it’s a beta.  But I think it has a lot of potential to help people sift through the results they are getting from a variety of techniques.  Check it out.

NOTE: you may find periods when you can’t run GRAIL because it puts a burden on the servers.  If you are having problems getting it to run, try again during off hours. This happened to me during my testing of it last week.

The list of GWAS data I used to test GRAIL came from the NHGRI catalog, which we discussed here:  List of GWAS studies.  I tried the straight hair SNP list, and got a pretty interesting set of results that certainly included “epidermis” and “skin” as keywords, among other things.

++++++++++++ Citations ++++++++++++
Raychaudhuri, S., Plenge, R., Rossin, E., Ng, A., International Schizophrenia Consortium, Purcell, S., Sklar, P., Scolnick, E., Xavier, R., Altshuler, D., & Daly, M. (2009). Identifying Relationships among Genomic Disease Regions: Predicting Genes at Pathogenic SNP Associations and Rare Deletions PLoS Genetics, 5 (6) DOI: 10.1371/journal.pgen.1000534

Raychaudhuri, S., Thomson, B., Remmers, E., Eyre, S., Hinks, A., Guiducci, C., Catanese, J., Xie, G., Stahl, E., Chen, R., Alfredsson, L., Amos, C., Ardlie, K., Barton, A., Bowes, J., Burtt, N., Chang, M., Coblyn, J., Costenbader, K., Criswell, L., Crusius, J., Cui, J., De Jager, P., Ding, B., Emery, P., Flynn, E., Harrison, P., Hocking, L., Huizinga, T., Kastner, D., Ke, X., Kurreeman, F., Lee, A., Liu, X., Li, Y., Martin, P., Morgan, A., Padyukov, L., Reid, D., Seielstad, M., Seldin, M., Shadick, N., Steer, S., Tak, P., Thomson, W., van der Helm-van Mil, A., van der Horst-Bruinsma, I., Weinblatt, M., Wilson, A., Wolbink, G., Wordsworth, P., Altshuler, D., Karlson, E., Toes, R., de Vries, N., Begovich, A., Siminovitch, K., Worthington, J., Klareskog, L., Gregersen, P., Daly, M., & Plenge, R. (2009). Genetic variants at CD28, PRDM1 and CD2/CD58 are associated with rheumatoid arthritis risk Nature Genetics, 41 (12), 1313-1318 DOI: 10.1038/ng.479

Medland, S., Nyholt, D., Painter, J., McEvoy, B., McRae, A., Zhu, G., Gordon, S., Ferreira, M., Wright, M., & Henders, A. (2009). Common Variants in the Trichohyalin Gene Are Associated with Straight Hair in Europeans The American Journal of Human Genetics, 85 (5), 750-755 DOI: 10.1016/j.ajhg.2009.10.009

BioGene: iPhone app for NCBI searches from MSKCC team

I can’t remember how I got on this email list–but I like it :)  Today I was notified that there was a handy iPhone app to quickly get gene info out of the NCBI resources.  I wish I had this last week at the ASHG meeting.  You know what happens: you catch a gene name or see a symbol in a talk, it’s just one of several on a slide…but you must know what that is right now!!  This handy-dandy quick interface will let you search for the symbol and links you to Entrez Gene info, which also links to references in PubMed.

I like it.  I expect to use it.  The first reviewer over in the iTunes store says it has already expanded their conversation.  I wish it also covered OMIM, but I haven’t used it too hard yet, maybe I’ll get there.  That also would have been a help last week.  I was hearing about a disease and I wanted some information.

Check out the MSKCC team page here for more details, and download it from the iTunes store (for free) if you like the sound of it.

BioGene: http://cbio.mskcc.org/tools/iphone_ipodtouch.html

For other iPhone apps we’ve come across, check out our earlier post on the iPhone and research.

Database "openness"

We train on publicly available databases and resources. For our purposes in deciding when to develop training, the definition is relatively straightforward: Can the academic researcher access the data without cost or license restriction? If the answer is yes, our next step is to determine if we can develop training materials based on the resource without cost or license restriction and to ask the providers specifically for permission to do so. We ask permission for several reasons: to let the developer know what we are doing, to verify the restrictions or lack thereof, to build good relationships, etc.

That first decision, “is it publicly available?”, would seem a relatively clear-cut criterion, but we have found that it isn’t always. There are several problems. Often, the ‘terms of use’ or copyright documentation is difficult to find on the web site or non-existent. Even when it is available, the terms, language and restrictions can vary quite a bit across databases, countries and even within a resource at times. Determining what “publicly available” is and which resource fits that definition can be less than simple, to say the least.

There is an attempt to offer a definition of “open” using the Creative Commons license.

A HuGE database

:) That was fun, writing that title. A recent correspondence in Nature Genetics outlined some changes in the HuGE Navigator. This database has been available in some form since 2001. The basic purpose of the database is to…

navigate and mine the growing scientific literature on human gene-disease associations and related data in human genome epidemiology. As an interconnected system of applications that users can enter by using genes, diseases, or risk factors as the starting point, HuGE Navigator provides a potential bridge between epidemiologic and genetic research domains.
