example, a biologist studying the sleep habits of small mammals would not
want to include any measures belonging to large mammals in the dataset even
though the data is valid.
Open the CmpltHomes.csv dataset.
View the dataset in a parallel coordinate plot.
For readability, you may want to drag all axes belonging to nominal
attributes to the right to keep their labels from obscuring the numeric
attributes of interest.
Look at the distributions of lot and price. Almost all homes in the dataset have
lots less than one acre in size, while there are a few homes with lots ranging up
to 200 acres. Most of the homes are priced under about $750,000. The highest
priced home is $8.9 million. Even though the data for these large-lot or
high-priced homes may be valid, the homes may fall outside our area of interest
with respect to our data mining objectives. Like the single observation in
Figure 6.2c, these observations, if included in the analyses, may bias the
models we generate.
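The visual impression from the plot can also be checked numerically. The following is a minimal sketch, not part of the tutorial's tool: the helper names `summarize` and `share_above` are hypothetical, and the values passed in are assumed to be a numeric column such as lot or price, already read from the dataset.

```python
import statistics


def summarize(values):
    """Return the minimum, median, and maximum of a numeric column."""
    ordered = sorted(values)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }


def share_above(values, threshold):
    """Return the fraction of observations strictly above a threshold."""
    return sum(v > threshold for v in values) / len(values)
```

A maximum far above the median, combined with a small share of observations above a threshold of interest (for example, one acre for lot), is the numeric counterpart of the few long lines seen at the top of an axis in the parallel coordinate plot.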
Suppose that the reason for generating estimates of a home's sale price is to
identify homes for purchase as an investment. If the predicted price is well above
the asking price, then the potential for a good return on the investment is greater.
Suppose also that the investors have a policy to only invest in homes on lots under
two acres, priced less than $750,000, and under 5,000 square feet in size.
Including observations that fall outside these restrictions in the model
building process risks producing models that are biased by those observations.
Use the parallel coordinate plot or the Control Center's “Create filtered
dataset” option to eliminate all homes with lots over two acres, priced over
$750,000, or over 5,000 square feet in size.
Name this set “selectedHomes”.
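For readers working outside the tool, the same filter can be sketched in plain Python. This is an illustration under stated assumptions, not the tool's own mechanism: the column names `lot` and `price` follow the attribute names mentioned above, while `sqft` is a hypothetical name for the square-footage attribute, and `filter_homes` and `load_rows` are hypothetical helpers.

```python
import csv

# Thresholds taken from the investors' policy described in the text.
MAX_LOT_ACRES = 2.0
MAX_PRICE = 750_000
MAX_SQFT = 5_000


def within_policy(row, lot_col="lot", price_col="price", sqft_col="sqft"):
    """Return True if a home is inside the investors' area of interest.

    The column names are assumptions; "sqft" in particular is hypothetical
    and should be replaced by the actual attribute name in the dataset.
    """
    return (
        float(row[lot_col]) <= MAX_LOT_ACRES
        and float(row[price_col]) <= MAX_PRICE
        and float(row[sqft_col]) <= MAX_SQFT
    )


def filter_homes(rows):
    """Keep only the observations that satisfy every restriction."""
    return [row for row in rows if within_policy(row)]


def load_rows(path):
    """Read a CSV file into a list of dictionaries keyed by column name."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```

With these helpers, `selected_homes = filter_homes(load_rows("CmpltHomes.csv"))` would play the role of the selectedHomes dataset. A home is dropped as soon as it violates any one of the three limits, which matches the "or" in the instruction above.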
Create a dataset derived from selectedHomes named “homes” containing only
those attributes deemed acceptable to the regression modeling process.
(Include all but: city, daysOnMarket, elementary, jrHigh, state, street, and zip.)
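The attribute selection in this step amounts to dropping seven columns. A minimal sketch, assuming each observation is held as a dictionary keyed by attribute name (the helper `drop_attributes` is hypothetical; the excluded names are those listed in the step above):

```python
# Attributes excluded from the regression modeling process.
EXCLUDED = {
    "city", "daysOnMarket", "elementary", "jrHigh",
    "state", "street", "zip",
}


def drop_attributes(rows, excluded=EXCLUDED):
    """Return copies of the rows without the excluded attributes."""
    return [
        {name: value for name, value in row.items() if name not in excluded}
        for row in rows
    ]
```

Applied to the filtered observations, the result corresponds to the homes dataset used in the rest of the tutorial. The original rows are left unchanged, so selectedHomes remains available.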
Note: in the tutorial that follows, measures and statistics reported may differ
slightly from those that you experience as you follow along. This is because
when filtering out observations, your slider positions may have been slightly
different. The dataset (homes) in the examples below contained 3,111
observations. Yours may have a few more or a few fewer; however, your modeling
results will not be significantly different.
The process of input selection can be approached from a bottom-up or a
top-down perspective. In the bottom-up approach, attributes are evaluated and selected