Integrated data analysis with KNIME - Open Source Software in Life Science Research


into a BEDGraph [9] file for visualization using, for example, GBrowse

[10] or UCSC [11] genome browser (lower part). Regions of interest are

also being identified.

As this workflow is also quite complex, we only highlight the most

interesting parts. A complete description of an even more elaborate

workflow is available in a separate publication [7]. Similar to the previous

workflows, this one also makes use of the KNIME Community

Contributions. The NGS package offers special nodes for dealing with

NGS-related data. One example is the FastQ Reader, which reads the de

facto standard FastQ file format using the BioJava library. Its output is a

data table containing the cluster ID and the sequence along with the quality

information. The File Reader reads parameters for the subsequent Adapter

Removal Adv node, such as adapter sequences or other contaminating

sequences, similarity threshold, quality threshold, and minimum overlap.

This latter node compares each sequence from the FastQ file (target) with

all sequences from the parameter file (query) and removes contaminations.
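The comparison described above can be illustrated with a minimal Python sketch. It is not the implementation of the Adapter Removal Adv node: the function names, the default thresholds, and the restriction to adapters at the 3' end of a read are assumptions made for illustration. The quality threshold is left out because the text does not say how the node applies it.

```python
def trim_adapter(seq, qual, adapter, min_overlap=5, min_similarity=0.9):
    """Trim a 3' adapter from one read (hypothetical helper).

    Scans the read from left to right and cuts at the first position where
    the tail of the read matches the head of the adapter over at least
    `min_overlap` bases with at least `min_similarity` identity.
    """
    for start in range(len(seq) - min_overlap + 1):
        overlap = min(len(seq) - start, len(adapter))
        if overlap < min_overlap:
            break  # adapter itself is shorter than the required overlap
        matches = sum(
            a == b for a, b in zip(seq[start:start + overlap], adapter[:overlap])
        )
        if matches / overlap >= min_similarity:
            return seq[:start], qual[:start]
    return seq, qual


def remove_adapters(seq, qual, adapters, min_overlap=5, min_similarity=0.9):
    """Apply trim_adapter for every query sequence from the parameter file."""
    for adapter in adapters:
        seq, qual = trim_adapter(seq, qual, adapter, min_overlap, min_similarity)
    return seq, qual
```

For example, a read consisting of ten `A` bases followed by the first ten bases of the adapter `TGGAATTCTCGGGTGCCAAGG` is cut back to the ten `A` bases, while a read with no adapter-like tail is returned unchanged.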

The output is the second input table with adapters removed from the input

sequences. The following nodes compute the sequence length and filter out

very short sequences before writing everything back into a FastQ file. The

subsequent 'Bash' node is only connected to its predecessor with a variable

port. This ensures that it is not executed before the FastQ file has been

written. The node executes a Bash script, which calls the bowtie program

to align all sequences to the reference genome (hg19). Its output is a SAM

formatted file with the information from the alignment. This is read in

with the SAM Reader (again connected with a variable port to its

predecessor). The next nodes select only sequences that align to the

reference genome (Row Filter) and apply various transformations that

result in a data table holding the original sequences, the chromosomes

from the reference genome and the positions in the sequences where there

has been a mismatch in the alignment process (Meta Node). This

information is subsequently written out into a BEDGraph file.
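The filtering and transformation steps after the SAM Reader can be sketched in Python as follows. This is an illustrative sketch, not the workflow's Row Filter or Meta Node: the function names are hypothetical, and it assumes ungapped alignments and that the aligner writes the optional `MD:Z` tag from the SAM specification. Unaligned reads are recognized by flag bit `0x4`.

```python
import re

# MD tag tokens: a run of matches, a deletion (^BASES), or a single mismatch base.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Z]+)|([A-Z])")


def mismatch_offsets(md):
    """Return 0-based offsets of mismatches encoded in an MD:Z value."""
    offsets = []
    pos = 0
    for number, _deletion, base in MD_TOKEN.findall(md):
        if number:
            pos += int(number)
        elif base:
            offsets.append(pos)
            pos += 1
        # Deleted bases are absent from the read, so the offset is unchanged.
    return offsets


def aligned_records(sam_lines):
    """Yield (read, chromosome, position, sequence, mismatches) for aligned reads."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:
            continue  # flag bit 0x4: read did not align
        md = next((f[5:] for f in fields[11:] if f.startswith("MD:Z:")), "")
        yield fields[0], fields[2], int(fields[3]), fields[9], mismatch_offsets(md)
```

Each yielded tuple carries the original sequence, the chromosome name, the 1-based alignment position, and the mismatch offsets, which is the kind of table the text describes before the export step.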

In the second branch, the ROI meta-node (see Figure 6.17) extracts the

regions of interest (i.e. consecutive regions of coverage between the

sequences and the reference chromosome). This is especially interesting

when analyzing small RNA. The input is a list of positions from the

reference genome with associated coverage. This list is already sorted by

chromosome and position. The GetRegions node identifies regions of

interest (ROIs). An ROI is defined as a run of entries in the input table with

the same chromosome name and positions that increase by one, that is,

consecutive regions of coverage. Values are stored in a string column and

concatenated using a space as a separator. Next, a Java Snippet retrieves
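The ROI definition given above (same chromosome name, positions increasing by one) can be sketched in Python as follows. This is not the code of the GetRegions node: the function name `get_regions`, the tuple layout of the rows, and the assumption that the concatenated values are the per-position coverage values are all illustrative choices.

```python
def get_regions(rows):
    """Group sorted (chromosome, position, coverage) rows into regions of interest.

    A region is a run of rows with the same chromosome and positions that
    increase by exactly one. Each result is a tuple
    (chromosome, start, end, values), where values holds the coverage
    values joined with a single space.
    """
    regions = []
    current = None  # [chromosome, start, end, list of coverage values]

    def close(region):
        chrom, start, end, values = region
        regions.append((chrom, start, end, " ".join(str(v) for v in values)))

    for chrom, pos, cov in rows:
        if current is not None and current[0] == chrom and pos == current[2] + 1:
            current[2] = pos          # extend the open region by one position
            current[3].append(cov)
        else:
            if current is not None:
                close(current)        # gap or new chromosome: finish the region
            current = [chrom, pos, pos, [cov]]

    if current is not None:
        close(current)
    return regions
```

Because the input is already sorted by chromosome and position, a single pass is enough; a gap in positions or a change of chromosome closes the current region and opens a new one.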
