Biomedical Engineering Reference
In-Depth Information
into a BEDGraph [9] file for visualization using, for example, the GBrowse
[10] or UCSC [11] genome browser (lower part). Regions of interest are
also identified.
As this workflow is also quite complex, we only highlight the most
interesting parts. A complete description of an even more elaborate
workflow is available in a separate publication [7]. Similar to the previous
workflows, this one also makes use of the KNIME Community
Contributions. The NGS package offers special nodes for dealing with
NGS-related data. One example is the FastQ Reader, which reads the de
facto standard FastQ file format using the BioJava library. Its output is a
data table containing the cluster ID and the sequence along with the quality
information. The File Reader reads parameters for the subsequent Adapter
Removal Adv node, such as adapter sequences or other contaminating
sequences, similarity threshold, quality threshold, and minimum overlap.
This latter node compares each sequence from the FastQ file (target) with
all sequences from the parameter file (query) and removes contaminations.
The output is the second input table with adapters removed from the input
sequences. The following nodes compute the sequence length and filter out
very short sequences before writing everything back into a FastQ file. The
subsequent 'Bash' node is only connected to its predecessor with a variable
port. This ensures that it is not executed before the FastQ file has been
written. The node executes a bash script, which calls the bowtie program
to align all sequences to the reference genome (hg19). Its output is a
SAM-formatted file with the information from the alignment. This is read in
with the SAM Reader (again connected with a variable port to its
predecessor). The next nodes select only sequences that align to the
reference genome (Row Filter) and apply various transformations that
result in a data table holding the original sequences, the chromosomes
from the reference genomes and the positions in the sequences where there
has been a mismatch in the alignment process (Meta Node). This
information is subsequently written out into a BEDGraph file.
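
The preprocessing steps described above (read FastQ records, remove adapter
sequences, drop very short reads) can be illustrated outside KNIME with a
minimal Python sketch. It is not the implementation of the Adapter Removal Adv
node: the function names, the default thresholds, and the simple left-to-right
matching rule are assumptions made for illustration, and the quality threshold
is left out entirely.

```python
from typing import Iterable, Iterator, List, Tuple

# (read ID, sequence, quality string); hypothetical record layout
FastqRecord = Tuple[str, str, str]


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield (id, sequence, quality) from consecutive four-line FastQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = (line.strip() for line in lines[i:i + 4])
        yield header[1:], seq, qual  # drop the leading '@' of the header


def find_adapter(seq: str, adapter: str, min_overlap: int,
                 min_similarity: float) -> int:
    """Return the position where the adapter starts, or len(seq) if not found.

    Assumed rule: scan start positions from left to right and accept the
    first one where the overlapping part of the adapter matches the read
    with at least min_similarity (fraction of identical bases).
    """
    if not adapter:
        return len(seq)
    for start in range(len(seq) - min_overlap + 1):
        overlap = min(len(adapter), len(seq) - start)
        window = seq[start:start + overlap]
        matches = sum(1 for a, b in zip(window, adapter) if a == b)
        if matches / overlap >= min_similarity:
            return start
    return len(seq)


def preprocess(records: Iterable[FastqRecord], adapters: List[str],
               min_overlap: int = 5, min_similarity: float = 0.9,
               min_length: int = 15) -> List[FastqRecord]:
    """Trim each read at the earliest adapter hit and drop very short reads."""
    kept = []
    for read_id, seq, qual in records:
        cut = min((find_adapter(seq, a, min_overlap, min_similarity)
                   for a in adapters), default=len(seq))
        seq, qual = seq[:cut], qual[:cut]  # keep sequence and quality aligned
        if len(seq) >= min_length:
            kept.append((read_id, seq, qual))
    return kept


if __name__ == "__main__":
    # Made-up reads and adapter, for illustration only.
    adapter = "TGGAATTCTCGGGTGCCAAGG"
    fastq_lines = [
        "@r1", "ACGTACGTACGTACGTACGT" + adapter[:10], "+", "I" * 30,
        "@r2", "ACGT" + adapter[:17], "+", "I" * 21,
    ]
    for record in preprocess(parse_fastq(fastq_lines), [adapter]):
        print(record)
```

Trimming the quality string together with the sequence keeps the record valid
when it is written back in FastQ format for the alignment step.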
In the second branch, the ROI meta-node (see Figure 6.17) extracts the
regions of interest (i.e. consecutive regions of coverage between the
sequences and the reference chromosome). This is especially interesting
when analyzing small RNA. The input is a list of positions from the
reference genome with associated coverage. This list is already sorted by
chromosome and position. The GetRegions node identifies regions of
interest (ROIs). An ROI is defined as a run of entries in the input table with
the same chromosome name and positions that increase by one, that is,
consecutive regions of coverage. Values are stored in a string column and
concatenated using a space as a separator. Next, a Java Snippet retrieves