should be used to avoid confusion. For example, raw data can be given the extension raw, while verified data can be given the extension ver.
9.2 DATA VALIDATION
In these days of powerful personal computers, most data validation is done with
automated tools; however, a manual review is still highly recommended. Validation
software can be obtained from some data logger vendors, and commercial software is
also available. Firms that do a lot of data validation often create their own automated
methods using spreadsheets or custom software written in languages such as Fortran,
Visual Basic, C++, or R.
Whatever method is used, data validation usually proceeds in two phases: automated
screening and in-depth review. The automated screening uses a series of algorithms to
flag suspect data records. Suspect records contain values that fall outside the normal
range based on either prior knowledge or information from other sensors on the same
tower. The algorithms commonly include relational tests, range tests, and trend tests.
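A minimal sketch of how these three tests might look in code follows. The record fields (ws_50m, ws_30m) and every threshold here are assumptions chosen for illustration, not published limits; in practice they would be tuned to the site and sensors.

def range_test(speed, lo=0.0, hi=35.0):
    # Flag a value outside the plausible range for the site.
    return not (lo <= speed <= hi)

def relational_test(upper, lower, max_diff=5.0):
    # Flag records where two sensors on the same tower disagree too much.
    return abs(upper - lower) > max_diff

def trend_test(prev, curr, max_step=8.0):
    # Flag an implausibly large change between consecutive 10-min averages.
    return abs(curr - prev) > max_step

def screen(records):
    # Return (index, reasons) pairs for every suspect record.
    flags = []
    for i, rec in enumerate(records):
        reasons = []
        if range_test(rec["ws_50m"]):
            reasons.append("range")
        if relational_test(rec["ws_50m"], rec["ws_30m"]):
            reasons.append("relational")
        if i > 0 and trend_test(records[i - 1]["ws_50m"], rec["ws_50m"]):
            reasons.append("trend")
        if reasons:
            flags.append((i, reasons))
    return flags

For example, a record of {"ws_50m": 41.0, "ws_30m": 6.1} following a normal one would be flagged by all three tests at once, a strong hint that something is wrong with that interval.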
The second phase, sometimes called verification, involves a case-by-case decision
about what to do with the suspect values—retain them as valid or reject them as
invalid. This is where judgment by an experienced person familiar with the monitoring
equipment and local meteorology is most helpful. Information that is not part of the
automated screening, such as regional weather data, may also be brought into play.
As an example of how this process can unfold, the automated screening might flag
a brief series of 10-min wind speeds as questionable because they are much higher
than the speeds immediately before and after. Was this spike real, or was it caused
by a glitch in the logger electronics, such as a loose connection?
During the review phase, the reviewer might check other sensors on the same mast
and observe the same spike; this would suggest that it is not a problem with a single
sensor or logger channel. Then he or she might look at regional weather records and
find that there was thunderstorm activity in the area at the time. The conclusion is
that the spike was most likely caused by a passing thunderstorm and should not be
excluded from the data analysis.
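Part of this reasoning can itself be automated. A hedged sketch, assuming 10-min averages stored as plain lists and an arbitrary 3:1 spike ratio:

def isolated_spike(series, i, ratio=3.0):
    # True if record i is far above both of its immediate neighbors.
    if i <= 0 or i >= len(series) - 1:
        return False
    return series[i] > ratio * series[i - 1] and series[i] > ratio * series[i + 1]

def spike_corroborated(primary, secondary, i, ratio=3.0):
    # A spike seen simultaneously by two independent sensors on the same
    # mast is unlikely to be an electronics glitch in one channel.
    return isolated_spike(primary, i, ratio) and isolated_spike(secondary, i, ratio)

Corroboration across sensors narrows the question, but the final call, such as checking regional weather records for thunderstorm activity, still rests with the reviewer.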
Another example is presented in Figure 9-1. After a period of apparently normal
operation, the 10-min average speed readings from an anemometer dropped to the offset value (indicating no detectable wind), while the standard deviation dropped to zero. Later, both appeared to return to their normal behavior. The reviewer checks the temperature and finds it hovered near freezing before the event and rose above freezing at
the end. Furthermore, the direction standard deviation (not shown) fell to zero shortly
before the speed standard deviation did and resumed normal behavior at about the same
time. The conclusion is that this was a likely icing event and should be excluded.
In such a two-phase validation approach, it is reasonable for the automated screening to be somewhat overly sensitive, meaning it produces a greater number of false
positives (data flagged as bad, although they are actually good) than false negatives
(data that are cleared as good but are actually bad). One reason for this bias toward
overdetection is that there will be an opportunity to reexamine flagged data records in the verification phase, whereas records that pass the screening may never be examined again.
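A toy illustration of the trade-off, using made-up 10-min averages and two arbitrary trend thresholds:

ws = [5.1, 5.3, 5.0, 14.8, 5.2, 5.4]  # one suspicious spike at index 3

strict = [i for i in range(1, len(ws)) if abs(ws[i] - ws[i - 1]) > 4.0]
loose = [i for i in range(1, len(ws)) if abs(ws[i] - ws[i - 1]) > 12.0]
print(strict)  # [3, 4]: both the jump up and the drop back are flagged
print(loose)   # []: the spike slips through and is never reviewed

The stricter threshold costs only reviewer time spent clearing false positives, while the looser one lets a possibly bad record pass as valid.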