Database Reference
In-Depth Information
Discussion
cvt_file.pl
(see
Recipe 11.6
) serves as a tool that converts entire files from one format to
another. Another common datafile operation is to manipulate columns. This is necesā
sary, for example, when importing a file into a program that doesn't understand how to
extract or rearrange input columns for itself. To work around this problem, rearrange
the datafile instead.
Recall that this chapter began with a description of a scenario involving a 12-column
CSV file
somedata.csv
from which only columns 2, 11, 5, and 9 were needed. To convert
the file to tab-delimited format, do this:
%
cvt_file.pl --iformat=csv somedata.csv > somedata.txt
But then what? If you just want to knock out a short script to extract those specific four
columns, that's fairly easy: write a loop that reads input lines and writes only the desired
columns, in the proper order. But that would be a special-purpose script, useful only
within a highly limited context. With just a little more effort, it's possible to write a more
general utility
yank_col.pl
that enables you to extract any set of columns. With such a
tool, you specify the column list on the command line like this:
%
yank_col.pl --columns=2,11,5,9 somedata.txt > tmp.txt
Because the script doesn't use a hardcoded column list, it can be used to extract an
arbitrary set of columns in any order. Columns can be specified as a comma-separated
list of column numbers or column ranges. (For example,
--columns=1,10,4-7
means
columns 1, 10, 4, 5, 6, and 7.)
yank_col.pl
looks like this:
#!/usr/bin/perl
# yank_col.pl: Extract columns from input.
# Example: yank_col.pl --columns=2,11,5,9 filename
# Assumes tab-delimited, linefeed-terminated input lines.
#
...
process
command
-
line
options
(
not
shown
)
...
#
...
to
get
column
list
into
@col_list
array
...
while
(
<>
)
# read input
{
chomp
;
my
@val
=
split
(
/\t/
,
$_
,
10000
);
# split, preserving all fields
# extract desired columns, mapping undef to empty string (can
# occur if an index exceeds number of columns present in line)
@val
=
map
{
defined
(
$_
) ?
$_
:
""
}
@val
[
@col_list
];
print
join
(
"\t"
,
@val
)
.
"\n"
;
}
The input processing loop converts each line to an array of values, then pulls out from
the array the values corresponding to the requested columns. To avoid looping through