Regression Analysis - Visual Data Mining: The VisMiner Approach

example, a biologist studying the sleep habits of small mammals would not

want to include any measures belonging to large mammals in the dataset even

though the data is valid.

Open the CmpltHomes.csv dataset.

View the dataset in a parallel coordinate plot.

For readability, you may want to drag all axes belonging to nominal

attributes to the right to keep their labels from obscuring the numeric

attributes of interest.

Look at the distributions of lot and price. Almost all homes in the dataset have

lots less than one acre in size, while there are a few homes with lots ranging up

to 200 acres. Most of the homes are priced under about $750,000. The highest

priced home is $8.9 million. Even though the data for these large-lot or

high-priced homes may be valid, the homes may be outside our area of interest with respect

to our data mining objectives. Like the single observation in Figure 6.2c, these

observations, if included in the analyses, may bias the models we generate.
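A numeric summary can complement the visual inspection of lot and price. The following is a minimal sketch using only the Python standard library, not a VisMiner feature. The helper name summarize and the sample lot sizes are made up for illustration and do not come from CmpltHomes.csv.

```python
import statistics


def summarize(values, threshold):
    """Return min, median, max, and the share of values below `threshold`."""
    ordered = sorted(values)
    below = sum(1 for v in ordered if v < threshold)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
        "share_below": below / len(ordered),
    }


# Made-up lot sizes in acres, for illustration only.
lots = [0.15, 0.2, 0.25, 0.3, 0.5, 12.0, 200.0]
lot_summary = summarize(lots, threshold=1.0)
print(lot_summary)
```

A median far below the maximum, together with a large share of values under the threshold, is the numeric counterpart of the long tail seen on the lot axis of the plot.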

Suppose that the reason for generating estimates of a home's sale price is to

identify homes for purchase as an investment. If the predicted price is well above

the asking price, then the potential for a good return on the investment is greater.

Suppose also that the investors have a policy to only invest in homes on lots under

two acres, priced less than $750,000, and under 5,000 square feet in size.

Including observations outside these restrictions in the model building process

risks constructing models that are biased by those observations.
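The investment rationale described above, comparing a predicted price with the asking price, can be sketched as a simple ranking. This is only an illustration: the field names asking and predicted, the function rank_by_margin, and the listings are all hypothetical, and the predicted values stand in for the output of whatever model is eventually built.

```python
def rank_by_margin(listings):
    """Sort listings by (predicted - asking), largest margin first."""
    return sorted(
        listings,
        key=lambda home: home["predicted"] - home["asking"],
        reverse=True,
    )


# Made-up listings; "predicted" is a placeholder for a model's estimate.
listings = [
    {"id": "A", "asking": 300_000, "predicted": 310_000},
    {"id": "B", "asking": 450_000, "predicted": 520_000},
    {"id": "C", "asking": 600_000, "predicted": 580_000},
]
ranked = rank_by_margin(listings)
print([home["id"] for home in ranked])
```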

Use the parallel coordinate plot or the Control Center's “Create filtered

dataset” option to eliminate all homes with lots over two acres, priced over

$750,000, or over 5,000 square feet in size.

Name this set “selectedHomes”.
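For readers who want to reproduce the filter outside VisMiner, the same selection can be sketched in plain Python. This is an assumption-laden sketch: lot and price are taken to be the attribute names (lot in acres), sqft is a hypothetical name for the square-footage attribute, and the rows are made up. Homes exactly at a limit are kept, since the step above eliminates only homes over each limit.

```python
def filter_homes(rows, max_lot=2.0, max_price=750_000, max_sqft=5_000):
    """Keep homes that do not exceed any of the three limits."""
    return [
        row
        for row in rows
        if row["lot"] <= max_lot
        and row["price"] <= max_price
        and row["sqft"] <= max_sqft  # "sqft" is a hypothetical attribute name
    ]


# Made-up rows for illustration; not taken from CmpltHomes.csv.
all_homes = [
    {"lot": 0.25, "price": 310_000, "sqft": 2_100},
    {"lot": 200.0, "price": 640_000, "sqft": 3_000},  # lot too large
    {"lot": 0.5, "price": 8_900_000, "sqft": 4_800},  # price too high
    {"lot": 0.4, "price": 700_000, "sqft": 6_200},    # house too large
    {"lot": 1.9, "price": 749_000, "sqft": 4_900},
]
selected_homes = filter_homes(all_homes)
print(len(selected_homes))
```

Because the limits are fixed numbers rather than slider positions, a scripted filter like this gives the same observation count on every run.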

Create a dataset derived from selectedHomes named “homes” containing only

those attributes deemed acceptable to the regression modeling process.

(Include all but: city, daysOnMarket, elementary, jrHigh, state, street, and zip.)
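Outside VisMiner, dropping the excluded attributes amounts to removing keys from each record. The sketch below uses the seven attribute names listed above; the function name drop_attributes and the sample row (including its values) are made up for illustration.

```python
EXCLUDED = {"city", "daysOnMarket", "elementary", "jrHigh", "state", "street", "zip"}


def drop_attributes(rows, excluded=EXCLUDED):
    """Return copies of the rows without the excluded attributes."""
    return [
        {name: value for name, value in row.items() if name not in excluded}
        for row in rows
    ]


# One made-up row for illustration; the values are not from the dataset.
selected_rows = [
    {
        "lot": 0.25, "price": 310_000,
        "city": "Exampleville", "daysOnMarket": 30, "elementary": "E1",
        "jrHigh": "J1", "state": "XX", "street": "1 Example St", "zip": "00000",
    }
]
homes = drop_attributes(selected_rows)
print(sorted(homes[0]))
```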

Note: in the tutorial that follows, measures and statistics reported may differ

slightly from those that you experience as you follow along. This is because

when filtering out observations, your slider positions may have been slightly

different. The dataset (homes) in the examples below contained 3,111

observations. Yours may have a few more or a few fewer; however, your modeling results

will not be significantly different.

The process of input selection can be approached from a bottom-up or a

top-down perspective. In the bottom-up approach, attributes are evaluated and selected
