Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

What’s the Answer? (domain and lollipop mutation diagrams)

Biostars is a site for asking, answering and discussing bioinformatics questions and issues. We are members of the Biostars community and find it very useful. Often questions and answers arise at Biostars that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those items or discussions here in this thread. You can ask questions in this thread, or you can always join in at Biostars.

This week’s highlighted question is one that was initially raised a while back, but had new answers recently added so it floated back up to the top. And one of the new answers is a very nifty web-based quick solution that our readers often find particularly handy–I’ll mention it specifically below.

Question: How To Create Mutation Diagram In R Or In Any Tools?

Please let me know any tools or R packages that can create a mutation diagram showing mutations in protein domains like this figure from the MSKCC cBio Cancer Genomic Portal? Thanks in advance

[Image: TP53 mutation diagram]

henryvuong

There was some chatter about doing some DIY stuff, and some possible R packages that can get there, as well as existing ways to see these at some resource providers. But the new solution that was just added by Jeffk, an easy-to-use web-based implementation at the cBioPortal, is what I wanted to focus on. Their “Mutation Mapper” interface will do the trick, with just a little bit of organizing your data in the right columns. There isn’t a lot of documentation with it; according to the notes, it was only recently released. It seems limited to human genes. For other species you can try the other options in the answers. It would be great to see this made more widely available for other species as well.
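As a rough idea of what "organizing your data in the right columns" might look like, here is a minimal Python sketch that writes a handful of mutation records as tab-delimited text ready for pasting into a web form. The column names (`Hugo_Symbol`, `Sample_ID`, `Protein_Change`, `Mutation_Type`) are my assumption about what such a tool expects, and the sample IDs are made up for illustration; check the Mutation Mapper page itself for the columns it actually requires.

```python
import csv
import io

# ASSUMPTION: these column names are illustrative; confirm the required
# headers on the Mutation Mapper page before using this for real data.
ASSUMED_COLUMNS = ["Hugo_Symbol", "Sample_ID", "Protein_Change", "Mutation_Type"]


def to_tab_delimited(records):
    """Return the records as tab-delimited text with a header row.

    Keys that are not in ASSUMED_COLUMNS are dropped; missing keys
    become empty cells.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=ASSUMED_COLUMNS,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


# Hypothetical records: the sample IDs are invented for this example.
example_records = [
    {"Hugo_Symbol": "TP53", "Sample_ID": "sample_1",
     "Protein_Change": "R175H", "Mutation_Type": "Missense_Mutation"},
    {"Hugo_Symbol": "TP53", "Sample_ID": "sample_2",
     "Protein_Change": "R248Q", "Mutation_Type": "Missense_Mutation",
     "notes": "this extra key is ignored"},
]

print(to_tab_delimited(example_records))
```

The point is only that a few lines of scripting can get a spreadsheet or variant list into a consistent, pasteable shape; a spreadsheet program would do the same job.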

PS: Other simple web interfaces for domains that we’ve talked about before that remain popular include DomainDraw and MyDomains. You could accomplish diagrams with some of the features with those as well.

Reference:
Gao J., U. Dogrusoz, G. Dresdner, B. Gross, S. O. Sumer, Y. Sun, A. Jacobsen, R. Sinha, E. Larsson & E. Cerami (2013). Integrative Analysis of Complex Cancer Genomics and Clinical Profiles Using the cBioPortal, Science Signaling, 6 (269) pl1-pl1. DOI: http://dx.doi.org/10.1126/scisignal.2004088

Cambridge Healthtech Institute Announces the Acquisition of OpenHelix

Cambridge Healthtech Institute (CHI) announced the purchase of Washington–based OpenHelix, the provider of online and onsite training on some of the most popular and powerful open-access bioinformatics resources on the web.

“Knowing how to use the latest bioinformatics tools is critical to genomics research, which will only grow in importance,” said Phillips Kuhl, President of Cambridge Healthtech Institute. “With an over ten year track record of developing and presenting training on open access bioinformatics databases and programs, OpenHelix is an instrumental service to researchers and a key addition to CHI’s family of conference and training products.”

OpenHelix will join the Cambridge Healthtech Institute as a division of Bio-IT World, a leading source of news and opinion on technology and strategic innovation in the life sciences, including drug discovery and development. “OpenHelix brings Bio-IT World an extensive and solid audience in the academic research community, as well as the opportunity to extend to our existing audience a valuable training product line,” said Lisa Scimemi, Publisher of Bio-IT World, “training that many of our readers need for themselves or their staff or students but may not be aware of.”

“We are proud of the success we have had in the past, with some of the top universities and medical schools subscribing to OpenHelix,” said Scott Lathe, CEO of OpenHelix. “Working with Bio-IT World will bring us the infrastructure, resources, and market reach we need to further grow our tutorials, subscriptions, and product offerings.”

As part of the acquisition, Scott Lathe, CEO and co-founder of OpenHelix, will become General Manager of the OpenHelix unit, and Mary Mangan, President and co-founder of OpenHelix, will become Director, Product and Content, of the OpenHelix unit.

About Bio-IT World (www.Bio-ITWorld.com)
Bio-IT World provides outstanding coverage of cutting-edge trends and technologies that impact the management and analysis of life sciences data, including next-generation sequencing, drug discovery, predictive and systems biology, informatics tools, clinical trials, and personalized medicine. Through a variety of sources, including Bio-ITWorld.com, the Weekly Update Newsletter, and the Bio-IT World News Bulletins, Bio-IT World is a leading source of news and opinion on technology and strategic innovation in the life sciences, including drug discovery and development.

About Cambridge Healthtech Institute (www.chicorporate.com)
Cambridge Healthtech Institute (CHI), founded in 1992, is the industry leader in providing superior-quality scientific information to eminent researchers and business experts from top pharmaceutical, biotech, and academic organizations. Delivering an assortment of resources such as events, reports, publications and eNewsletters, CHI’s portfolio of products includes Cambridge Healthtech Institute Conferences, Barnett Educational Services, Insight Pharma Reports, Cambridge Marketing Consultants, Cambridge Meeting Planners, Knowledge Foundation and Cambridge Healthtech Media Group, which includes Bio-IT World and Clinical Informatics News.

About OpenHelix (www.openhelix.com)
OpenHelix, a Washington State company, was founded in 2003 to provide training on what was then a fledgling but quickly growing market of open-access, web-based bioinformatics resources. OpenHelix has provided training and outreach services for many providers of resources, such as the UCSC Genome Browser, OMIM, and the Protein Data Bank (RCSB PDB). OpenHelix received a $1.2 million grant in 2007 to create a search engine for bioinformatics resources and to expand its tutorial suites. In 2009, it launched the subscription service to over 100 tutorial suites.

Video Tip of the Week: StratomeX for genomic stratification of diseases

The Caleydo team and the tools they develop have been on my short list of favorites for a long time. I’ve been talking about their clever visualizations for years now. My first post on their work was in 2010, with the tip I did on their Caleydo tool that combined gene expression and pathway visualization. They’ve continued to refine their visualizations, and enable new data types to be brought into the analysis, and earlier this year we featured Entourage, enRoute, LineUp, and also StratomeX. They have lots of options for wrangling “big data”. But recently they published a paper on StratomeX and a nice video overview, so I wanted to bring it to your attention again now that the paper is out.

The emphasis in this paper is cancer subtype analysis, using some data from The Cancer Genome Atlas (TCGA). But it’s certainly not limited to cancer analysis–any research area that’s currently flooded with multiple types of data and outcomes could be run through this stratification and visualization software. I find the weighting of the lines and connections among the subsets really effective when thinking about relationships among the data types. The recent schizophrenia work that used this sort of stratification and clustering to suss out the relationships among different sub-types is the kind of thing that’s going to be really useful (but I don’t know what software they used, because paywall…). And I expect that strategy to become increasingly important for a lot of conditions.

So have a look at this new paper (below), and their well-crafted video with examples.

If you are going to start working with StratomeX, be sure to also see their documentation pages. There are some features and options there that aren’t covered in the intro video and that you’ll want to know about.

The team is a cross-institutional and international bunch: this is a joint project between a lab at Harvard, led by Hanspeter Pfister, Peter Park’s lab at the Center for Biomedical Informatics at Harvard Medical School, and collaborators at Johannes Kepler University in Linz and the Graz University of Technology (both in Austria). And look for upcoming tools from them as well–there’s new stuff over at their site. They keep developing useful items, and I expect to be highlighting those in future Tips of the Week.

Quick links:

StratomeX project page: http://caleydo.org/projects/stratomex/

Caleydo tools homepage: http://www.caleydo.org/

Reference:

Marc Streit, Alexander Lex, Samuel Gratzl, Christian Partl, Dieter Schmalstieg, Hanspeter Pfister, Peter J Park & Nils Gehlenborg (2014). Guided visual exploration of genomic stratifications in cancer, Nature Methods, 11 (9) 884-885. DOI: http://dx.doi.org/10.1038/nmeth.3088

What’s the Answer? (zero- or one-based coordinate systems)

Biostars is a site for asking, answering and discussing bioinformatics questions and issues. We are members of the Biostars community and find it very useful. Often questions and answers arise at Biostars that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those items or discussions here in this thread. You can ask questions in this thread, or you can always join in at Biostars.

This week’s highlighted answer isn’t actually found at Biostars itself–but relied on the institutional knowledge at Biostars to assemble this helpful guide. This is a question that comes up so frequently, and burns both novices and seasoned practitioners on a regular basis, that I wanted to make sure people saw this curated information on which tools start with zeros and which ones with ones (er, and those with both….).

I won’t bring the whole post over like I usually do with Biostars, but here’s the link and a snip–go read it all:

Chromosome coordinate systems: 0-based, 1-based

….I’ve tried to figure out which website-application are using each coordinate system. The results can be found bellow. For each source, I provide the URL of the reference website where I found the information, and a caption where the system is described….
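To make the off-by-one trap concrete, here is a minimal Python sketch of converting between the two most common conventions: 0-based half-open intervals (as in BED files) and 1-based closed intervals (as in GFF files). The `Interval` type and the function names are made up for this illustration; they are not from the linked post or from any particular library.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A genomic interval; the caller tracks which convention it uses."""
    chrom: str
    start: int
    end: int


def zero_to_one_based(iv: Interval) -> Interval:
    # 0-based half-open [start, end) -> 1-based closed [start, end].
    # Only the start shifts; the end value stays the same.
    return Interval(iv.chrom, iv.start + 1, iv.end)


def one_to_zero_based(iv: Interval) -> Interval:
    # 1-based closed [start, end] -> 0-based half-open [start, end).
    return Interval(iv.chrom, iv.start - 1, iv.end)


def length_zero_based(iv: Interval) -> int:
    # Half-open intervals: length is simply end minus start.
    return iv.end - iv.start


def length_one_based(iv: Interval) -> int:
    # Closed intervals: both endpoints are included, so add one.
    return iv.end - iv.start + 1


# The first 100 bases of a chromosome, written in each convention.
bed_style = Interval("chr1", 0, 100)
gff_style = zero_to_one_based(bed_style)
print(bed_style, gff_style)
```

Both intervals describe the same 100 bases; only the start value and the length formula differ. Mixing the two silently shifts features by one base, which is why it matters to know which convention each tool uses.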

Video Tip of the Week: GOLD, Genomes OnLine Database

Yes, I know some people suffer from YAGS-malaise (Yet Another Genome Syndrome), but I don’t. I continue to be psyched for every genome I hear about. I even liked the salmon lice one. And Yaks. The crowd-funded Puerto Rican parrot project was so very neat. These genomes may not matter much for your everyday life, and may not exactly be celebrities among species. But we’ll learn something new and interesting from every one of them. It’s also very cool that it’s bringing new researchers, trainees, and citizens into the field.

The good news is there is opportunity still for many, many more species. And decreasing costs will make it possible for more research teams to do locally-important species. But–it would be a shame if we wasted resources by doing 30 versions of something cute, rather than tackling new problems. A central registry for sequencing projects may help to manage this. Genomes OnLine Database has been cataloging projects for years, and it would be great if folks would register their research there.

I was reminded of this by a tweet I saw come through my #bioinformatics column. This is what I saw flying by:

As much as I enjoy Twitter and think that science nerds are pretty good at it, it’s hard to know if the right people will see a tweet. Anyway, I suggested that this researcher check out GOLD and BioProject to see if anyone had registered anything.

I realized that although we have talked about GOLD in the past, it hadn’t been highlighted in our Tips of the Week before. So here I will include a video from a talk about GOLD. Ioanna Pagani gives an overview of GOLD, the foundations and the purpose. And then she goes on to demonstrate how to enter project metadata into their registry (~12min). Watching this will help you to understand the usefulness of GOLD, and what you can expect to find there. She describes both single-species project entry, and another option for entering metagenome data projects (~25min).

In the News at GOLD, they mention that their update this summer resulted in some changes to the interface–so the specifics might be a bit different from the video. But the basic structural features are still going to be useful to understand the goals and strategies. It may also help to convey the importance of appropriate metadata for genome projects. If you are involved with these projects, checking out the team’s paper on the structure and use of metadata is certainly worthwhile.

With all this sequencing capacity, people are going to start looking for new organisms to cover. Of course, some people will want to look at another strain, isolate, or geographical sample for good reasons–but keeping a lot of unnecessary duplication from happening would be nice too. And it would be great if submitters also conformed to the standards for genome metadata–the ‘Minimum Information about a Genome Sequence’ (MIGS, now in the broader collection of standard checklists in the MIxS project) standards being developed by the Genomic Standards Consortium. (You can see how GOLD conformed to this in their other paper below.) Let’s spread the resources around to get new knowledge when we can. I would like to see a more formal mechanism that connects people who have some genome of interest with researchers who might have the bandwidth to do it, as well. Social sequencing?

Quick links:

GOLD: http://www.genomesonline.org

Genomic Standards Consortium: http://gensc.org/

References:
Pagani I., J. Jansson, I.-M. A. Chen, T. Smirnova, B. Nosrat, V. M. Markowitz & N. C. Kyrpides (2011). The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata, Nucleic Acids Research, 40 (D1) D571-D579. DOI: http://dx.doi.org/10.1093/nar/gkr1100

Liolios K., Lynette Hirschman, Ioanna Pagani, Bahador Nosrat, Peter Sterk, Owen White, Philippe Rocca-Serra, Susanna-Assunta Sansone, Chris Taylor & Nikos C. Kyrpides (2012). The Metadata Coverage Index (MCI): A standardized metric for quantifying database metadata richness, Standards in Genomic Sciences, 6 (3) 444-453. DOI: http://dx.doi.org/10.4056/sigs.2675953

Field D., Tanya Gray, Norman Morrison, Jeremy Selengut, Peter Sterk, Tatiana Tatusova, Nicholas Thomson, Michael J Allen, Samuel V Angiuoli & Michael Ashburner (2008). The minimum information about a genome sequence (MIGS) specification, Nature Biotechnology, 26 (5) 541-547. DOI: http://dx.doi.org/10.1038/nbt1360

What’s The Answer? (gene essentiality)

Biostars is a site for asking, answering and discussing bioinformatics questions and issues. We are members of the Biostars community and find it very useful. Often questions and answers arise at Biostars that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those items or discussions here in this thread. You can ask questions in this thread, or you can always join in at Biostars.

This was a new and interesting question, one I haven’t seen before. Are there resources specifically highlighting essential genes in fly? I can see how a dedicated set of these would be useful, and how it could be challenging to extract that from other, more broad, tools and collections.

Question: database of gene essentiality in Drosophila?

I am looking for annotation of gene essentiality in Drosophila. The ideal resource would be a knockout or a RNAi screening which would tell me, for every gene, whether its deletion or silencing is lethal or not.

I saw that there are many resources online, from flybase to UCSC, but I could not find any annotation on gene essentiality there. There are also a lot of screenings published, but they all seem to be related to specific conditions (e.g. exposure to a DNA damaging factor, stress, etc..), but I could not find any screening in which no special conditions were applied. In general, I not familiar with Drosophila, and I am not sure what an expert in the field would use. Which resource do you recommend me?

–Giovanni M Dall’Olio

Giovanni found a couple of answers and brought them over, but if you know of any other useful collections it would be handy to have that information. I know there are various species knock-out projects, and likely more to come. But I was not familiar with the OGEE (Online GEne Essentiality Database) set. It’s not limited to flies, btw. And as I was reading up on OGEE, I saw a reference in PubMed to another essential gene database that was new to me: DEG, Database of Essential Genes. Reading up on that now too.

References:

Chen W.H., M. J. Lercher & P. Bork (2011). OGEE: an online gene essentiality database, Nucleic Acids Research, 40 (D1) D901-D906. DOI: http://dx.doi.org/10.1093/nar/gkr986

Luo H., Lin Y., Gao F., Zhang C.T. & Zhang R. (2013). DEG 10, an update of the database of essential genes that includes both protein-coding genes and noncoding genomic elements., Nucleic acids research, PMID: http://www.ncbi.nlm.nih.gov/pubmed/24243843

Video Tip of the Week: #Docker, shipping containers for software and data

Breaking into the zeitgeist recently, Docker popped into my sphere from several disparate sources. Seems to me that this is a potential problem-solver for some of the reproducibility and sharing dramas that we have been wrestling with in genomics. Sharing of data sets and versions of analysis software is being tackled in a number of ways. FigShare, Github, and some publishers have been making strides among the genoscenti. We’ve seen virtual machines offered as a way to get access to some data and tool collections*. But Docker offers a lighter-weight way to package and deliver these types of things more quickly and straightforwardly.

One of the discussions I saw about Docker came from Melissa Gymrek, with this post about the potential to use it for managing these things: Using docker for reproducible computational publications. Other chatter led me to this piece as well: Continuous, reproducible genome assembler benchmarking. And at the same time as all this was bubbling up, a discussion on Reddit covered other details: Question: Does using docker hit performance?

Of course, balancing the hype and reality is important, and this discussion thrashed that about a bit (click the timestamp on the Nextflow tweet to see the chatter):

To get a better handle on the utility of Docker, I went looking for some videos, and these are now the video tip of the week. This is different from our usual topics, but because users might find themselves on the receiving end of these containers at some point, it seemed relevant for our readers.

The first one I’ll mention gave me a good overview of the concept. The CTO of Docker, Solomon Hykes, talks at Twitter University about the basis and benefits of their software (Introduction to Docker). He describes Docker as being like the innovation of shipping containers–which don’t really sound particularly remarkable to most of us, but in fact the case has been made that they changed the global economy completely. I read that book that Bill Gates recommended last year, The Box, and it was quite astonishing to see how metal boxes changed everything. This brought standardization and efficiencies that were previously unavailable. And those are two things we really need in genomics data and software.

Hykes explains that the problem of shipping stuff–coffee beans, or whatever–had to be solved at each place the goods might end up. This is a good analogy, as explained in the shipping container book. How to handle an item, appropriate infrastructure, local expertise, etc, was a real barrier to sharing goods. And this happens with bioinformatics tools and data right now. But with containerization, everyone could agree on the size of the piece, the locks, the label position and contents, and everything standardized on that system. This brought efficiency, automation, and really changed the world economy. As Hykes concisely describes [~8min in]:

“So the goal really is to try and do the same thing for software, right? Because I think it’s embarrassing, personally, that on average, it’ll take more time and energy to get…a collection of software to move from one data center to the next, than it is to ship physical goods from one side of the planet to the other. I think we can do better than that….”

This high-level overview of the concept in less than 10min is really effective. He then takes a question about Docker vs a VM (virtual machine). I think this is the essential take-away: containerizing the necessary items [~18min]:

“…Which means we can now define a new unit of software delivery, that’s more lightweight than a VM [virtual machine], but can ship more than just the application-specific piece…”

After this point there’s a live demo of Docker to cover some of the features. But if you really do want to get started with Docker, I’d recommend a second video from the Docker team. They have a Docker 101 explanation that covers things starting from installation, to poking around, destroying stuff in the container to show how that works, demoing some of the other nuts and bolts, and the ease of sharing a container.

So this is making waves among the genomics folks. This also drifted through my feed:

Check it out–there seem to be some really nice features of Docker that can impact this field. It doesn’t solve everything–and it shouldn’t be used as an escape mechanism to not put your data into standard formats. And Melissa addresses a number of unmet challenges too. But it does seem that it can help with the reproducibility and data-access issues that are currently hurdles (or, plagues) in this field. Docker is also under active development and they appear to want to make it better. Sharing our stuff is not trivial–there are real consequences to public health from inaccessible data and tools (1). There are broader applications beyond bioinformatics too, of course. And wide appeal and adoption seem to be a good thing for ongoing development and support. More chatter on the larger picture of Docker:

And this discussion was helpful: IDF 2014: Bare Metal, Docker Containers, and Virtualization.

And, er…

I laughed. And wrote this anyway.

Quick links:

Docker main site: http://www.docker.com/

Docker Github: http://github.com/docker/

Reference:
(1) Baggerly K. (2010). Disclose all data in publications, Nature, 467 (7314) 401-401. DOI: http://dx.doi.org/10.1038/467401b

*Ironically, this ENCODE VM is gone, illustrating the problem:

[Image: the ENCODE VM, now gone]