I did this tip about two years ago. That was in our old system, and I wanted to update it to our Scivee system. We also did that tip using Galaxy, and Galaxy has changed a lot since then. In this week’s tip I am going to walk you through a quick task: getting the flanking sequence for a list of chromosomal locations. In Galaxy this is relatively simple, as you will see from the tip. There is a lot more you can do once you’ve obtained the sequence, such as manipulating the text to get the columns of data you need. You might want to check out our tutorial on Galaxy or the Galaxy screencasts to learn more.
BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday* we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.
BioStar Question of the Week:
Dump upstream sequence. I am looking for transcription factor motifs. I have a list of refseq IDs of genes that I am interested in. How would I export a multi-fasta of all sequences from TSS to 1000bp before TSS?…
Answer from Ian: I definitely endorse the use of Galaxy due to its flexibility in handling genome-coordinate-based data. If you would like to retrieve the coordinates of a particular RefSeq transcript (NM_xxxxxx) from RefSeq data, you can also extract them from the UCSC Table Browser.
- select ‘Table Browser’ from the left-hand side panel
- select mammal/human/hg18 from the top row of options
- group: ‘genes and gene prediction tracks’; track: ‘RefSeq genes’
- get output
You can load the resulting file into Galaxy and retrieve the lines of information you want by comparing your RefSeq IDs to the second column of the table browser data.
Just remember that txStart = TSS if the gene is on the + strand. txEnd = TSS if the gene is on the – strand.
I just want to point out we go through this in both the tutorial and exercises in Galaxy (sponsored/free tutorial). Galaxy is excellent for this.
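If you would rather see Ian's txStart/txEnd rule written out, here is a minimal Python sketch. It assumes the Table Browser rows have already been read into dictionaries keyed by the column names name, chrom, strand, txStart and txEnd, and that coordinates are 0-based and half-open as in UCSC tables. The function names and the example ID are made up for illustration; this is not what Galaxy runs internally.

```python
def upstream_window(tx_start, tx_end, strand, length=1000):
    """Return (start, end) of the region `length` bp upstream of the TSS.

    Assumes 0-based, half-open coordinates (UCSC table convention).
    """
    if strand == "+":
        tss = tx_start                      # TSS is txStart on the + strand
        return max(0, tss - length), tss    # clip at the chromosome start
    if strand == "-":
        tss = tx_end                        # TSS is txEnd on the - strand
        return tss, tss + length            # not clipped at the chromosome end
    raise ValueError(f"unexpected strand: {strand!r}")


def select_upstream(rows, wanted_ids, length=1000):
    """Yield BED-like tuples for rows whose RefSeq ID is in wanted_ids."""
    for row in rows:
        if row["name"] not in wanted_ids:
            continue
        start, end = upstream_window(
            int(row["txStart"]), int(row["txEnd"]), row["strand"], length
        )
        yield row["chrom"], start, end, row["name"], row["strand"]


# Hypothetical example row; the ID and coordinates are invented.
example_rows = [
    {"name": "NM_EXAMPLE", "chrom": "chr1", "strand": "-",
     "txStart": "5000", "txEnd": "9000"},
]
for record in select_upstream(example_rows, {"NM_EXAMPLE"}):
    print("\t".join(str(field) for field in record))
```

The resulting intervals could then be handed to whatever tool you use to fetch sequence. Note that the sequence for a gene on the – strand would still need to be reverse-complemented to read in the direction of transcription.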
Check out the other answers, or provide one if you have insights into the problem.
*This thread will replace the “WYP” threads.
In an earlier What’s Your Problem thread, a researcher had hundreds of SNP locations and wanted an easy way to obtain the flanking sequence for each one, without going to every location in the UCSC Genome Browser and eyeballing it. There are probably a few ways to do this, but I found that Galaxy was a good place to start. So the tip this week takes two SNP locations on the human genome, obtains the flanking sequence for those locations, and returns a file that can be saved as a spreadsheet or text file, or even turned back into a UCSC Genome Browser custom track that can be uploaded, viewed and searched at UCSC. The process for individual researchers will differ a bit depending on the data and how the Excel worksheet or file is laid out, but hopefully you’ll get the idea. The steps are:
1. Upload your file (tab-delimited text)
2. Convert file to the ‘interval’ format
3. Cut out any columns of data from original file to save for later use.
4. Get flanking chromosomal locations (then merge upstream and downstream records into one record)
5. Get flanking sequence
6. Paste data columns from step 3 to the data columns (chromosomal location and sequence) from step 5.
Voila, now you have a tab-delimited text file that can be opened in Excel, made into a custom track (in Galaxy), etc.
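For readers who like to see the arithmetic behind steps 4 and 5, here is a rough Python sketch of what "get flanks, merge, get sequence" amounts to. It assumes 0-based, half-open intervals and a chromosome sequence already held in memory as a string. The function names are hypothetical and the toy sequence is invented; Galaxy's own tools have more options than this.

```python
def flanks(start, end, size, chrom_length=None):
    """Return the upstream and downstream flanking intervals of [start, end)."""
    upstream = (max(0, start - size), start)
    down_end = end + size
    if chrom_length is not None:
        down_end = min(down_end, chrom_length)   # do not run off the chromosome
    downstream = (end, down_end)
    return upstream, downstream


def merged_flank(start, end, size, chrom_length=None):
    """Merge both flanks and the feature itself into one interval."""
    upstream, downstream = flanks(start, end, size, chrom_length)
    return upstream[0], downstream[1]


def extract(sequence, interval):
    """Slice the sequence for a 0-based, half-open interval."""
    start, end = interval
    return sequence[start:end]


# Toy example: a 10 bp "chromosome" with a SNP at 0-based position 4.
toy_chrom = "ACGTACGTAC"
snp = (4, 5)
region = merged_flank(*snp, size=3, chrom_length=len(toy_chrom))
print(region, extract(toy_chrom, region))
```

Pasting the saved columns back on (step 6) is then just a matter of writing the original fields and the extracted sequence on the same tab-separated line.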
Any suggestions on other methods for doing this?
(OpenHelix does training on Galaxy and UCSC Genome Browser).
People are using the UCSC Genome Browser to visualize and analyze many important genomic features. During our training sessions we have found that people sometimes don’t realize how they can see the amino acid sequences in the browser. When you have zoomed into a region enough you can actually see a three-frame translation of the genomic sequence in the viewer.
Ah-ha, you say! But there should be six frames, right?
Watch this “tip of the week” to see how to display the other three frames.
To learn more about other aspects of visualizations in the Genome Browser you can see the full tutorial.