GenomicTools: an open source platform for developing high-throughput analytics in genomics - Open Source Software in Life Science Research

8.1 Introduction

The advent of high-throughput sequencing techniques initiated by pyrosequencing in 2004 [1] is expected to accelerate the pace of discovery in life sciences. Indeed, the rapidly and inexpensively produced super-exponential amount of data (e.g. short sequence patterns referred to as reads) from various high-throughput sequencing platforms allows the scientific community to study specific biological problems in depth, such as quantification of alternative splicing in tissues [2, 3], human disease [4], discovery of new fusion genes in cancer [5, 6], improvement of genome assembly [7], and transcript identification [8-11].

The common steps in many high-throughput sequencing studies include: (1) alignment of reads directly to a reference transcriptome or genome ('read mapping'); (2) identification of expressed genes, isoforms or binding sites; and (3) differential analysis across samples. An in-depth review of the standard steps in RNA-seq and ChIP-seq computational pipelines has been published by Pepke and colleagues [12]. It is worth pointing out that genome-wide data, such as transcripts/genes, exons/introns, promoter sites, sequences, multiple sequence alignments, transcription factor binding sites, intergenic regions, repeat elements, microarray probes (expression, SNP, CNV, etc.), sequencing data (RNA-seq, ChIP-seq, DNA-seq, etc.), chromosomal conformations (3C-seq, 4C-seq, etc.), and inter-chromosomal associations, can easily be represented as sets of genomic intervals (see Figure 8.1).
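To make the interval representation concrete, the following minimal Python sketch models a genomic feature as a chromosome, a start, an end, and a strand, and tests whether two features overlap. It is an illustration only and not the GenomicTools API: the class name `GenomicInterval`, the helper `overlapping`, the field names, the sample coordinates, and the choice of 0-based half-open coordinates are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GenomicInterval:
    """Hypothetical record for one genomic feature (exon, read, peak, ...)."""
    chrom: str          # chromosome name, e.g. "chr1"
    start: int          # 0-based, inclusive (assumed convention)
    end: int            # exclusive (assumed convention)
    strand: str = "+"   # "+" or "-"
    label: str = ""     # free-text description of the feature

    def overlaps(self, other: "GenomicInterval") -> bool:
        # Two half-open intervals overlap when they lie on the same
        # chromosome and each one starts before the other one ends.
        return (self.chrom == other.chrom
                and self.start < other.end
                and other.start < self.end)


def overlapping(query: GenomicInterval,
                features: List[GenomicInterval]) -> List[GenomicInterval]:
    """Return the features that overlap the query (linear scan)."""
    return [f for f in features if query.overlaps(f)]


# Made-up coordinates, for illustration only.
exon = GenomicInterval("chr1", 1000, 1200, "+", "exon")
read = GenomicInterval("chr1", 1150, 1186, "+", "read")
peak = GenomicInterval("chr2", 500, 900, "+", "ChIP-seq peak")

hits = overlapping(read, [exon, peak])
print([h.label for h in hits])
```

The linear scan keeps the sketch short; for genome-scale data one would normally sort the intervals or index them so that overlap queries do not require comparing every pair.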

Given the huge volume of available data, new computational tools are required to efficiently perform analysis tasks such as those outlined above [13]. Currently, freely available computational tools for large-scale data analytics include Bioconductor [14], Galaxy [15], the Genomic Regions Enrichment of Annotations Tool (GREAT) [16], the UCSC Genome Browser [17] and the Integrated Genome Browser (IGB) [18]. For the readers' convenience, we report here the fundamental aspects of each tool. Bioconductor uses the R statistical programming framework to provide tools for the analysis and comprehension of high-throughput genomic data. The functional scope of Bioconductor packages includes the analysis of DNA microarray, sequence, flow, and SNP data. Galaxy is an open web-based platform for genomic research, based around reusable analysis templates that users can manipulate and run repeatedly on different data sets. Galaxy has been used for different types of genomic research, for example investigations of epigenetics, chromatin profiling, transcriptional enhancers, and genome-environment interactions. GREAT is available as a web application that was designed to analyze the
