Advanced Topics in Initial Exploration and Dataset Preparation Using VisMiner - Visual Data Mining: The VisMiner Approach

introduced at this time to allow you to compare this method with other filtering

options available in VisMiner using the parallel plot and location plot viewers.

For most datasets, the parallel plot is the recommended tool to use when

generating filtered subsets. It has the advantage of visual feedback as

observations are filtered out. Control Center filtering can be more effective when

applied to nominal data types. For example, in the just-completed practice

example, only homes in the Alpine and Provo school districts were selected.

Such a selection is not possible when filtering via the parallel plot, where

filtering is specified using sliders, so the values to be filtered out must be

adjacent in the plot. Because the nominal values are listed alphabetically,

Nebo falls between Alpine and Provo, making it impossible to keep the Alpine

and Provo observations while eliminating the Nebo observations.
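The difference between the two filtering styles can be sketched in a few lines of Python. This is only an illustration of the idea, not VisMiner's implementation: the district names come from the text, while the prices and the helper names slider_filter and membership_filter are hypothetical.

```python
# Hypothetical observations: district names are from the text,
# prices are made up purely for illustration.
homes = [
    {"district": "Alpine", "price": 310000},
    {"district": "Nebo", "price": 245000},
    {"district": "Provo", "price": 275000},
    {"district": "Alpine", "price": 420000},
    {"district": "Nebo", "price": 198000},
]

# Nominal values are laid out alphabetically along the plot axis.
axis = sorted({h["district"] for h in homes})


def slider_filter(rows, low, high):
    """Keep rows whose district lies in the contiguous axis range low..high."""
    kept = set(axis[axis.index(low):axis.index(high) + 1])
    return [r for r in rows if r["district"] in kept]


def membership_filter(rows, wanted):
    """Keep rows whose district belongs to an arbitrary set of values."""
    return [r for r in rows if r["district"] in wanted]


# A slider range from Alpine to Provo cannot skip Nebo, which sits between them.
by_slider = slider_filter(homes, "Alpine", "Provo")

# Selecting by membership keeps exactly the two requested districts.
by_set = membership_filter(homes, {"Alpine", "Provo"})

print(sorted({r["district"] for r in by_slider}))  # ['Alpine', 'Nebo', 'Provo']
print(sorted({r["district"] for r in by_set}))     # ['Alpine', 'Provo']
```

A range-based filter can only keep a contiguous run of axis positions, whereas a membership test accepts any combination of values, which is why it suits nominal columns.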

Exercise 3.1

Use the CmpltHomes.csv dataset prepared in the previous tutorial.

a. Look for patterns in the relationship between location and year built. What

areas have mostly newer homes?

b. When evaluating the relationship between lot size and location, as with

price, the few homes on very large lots (up to 200 acres) dominate the color

encoding. Using the range sliders alone to restrict the selection lacks

precision because over 90% of the homes are on lots of less than one acre, yet

the range slider moves in one-acre increments. Thus, moving the left “Lot”

range slider cannot gradually remove the smaller-lot homes. At zero,

they are all there; at the next slider position, they are all gone. Use the

parallel coordinate plot to first create a subset of the homes having lot sizes

less than two acres, then use the location plot to evaluate the relationship

between lot size and location.

c. In many areas, proximity to a lake increases a home's value. Does this

appear to be the case for the Provo Metropolitan area homes? What

geographic setting appears to add value to a home in this area?
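The slider-precision problem described in part b can be illustrated with a small Python sketch. The lot sizes below are hypothetical and do not reproduce the proportions of CmpltHomes.csv, and slider_select is an assumed helper, not a VisMiner function.

```python
# Hypothetical lot sizes in acres; most are under one acre, one is very large.
lots = [0.18, 0.22, 0.25, 0.31, 0.40, 0.52, 0.75, 0.90, 0.95, 1.6, 3.0, 200.0]


def slider_select(values, low, high):
    """Keep values inside the slider range low..high (inclusive)."""
    return [v for v in values if low <= v <= high]


# The slider only stops at whole acres, so the first two positions are 0 and 1.
at_zero = slider_select(lots, 0, 200)
at_one = slider_select(lots, 1, 200)

# Building a subset first (lots under two acres) removes the extreme values,
# so the remaining range is narrow enough to show differences among small lots.
subset = [v for v in lots if v < 2.0]

print(len(at_zero), len(at_one))  # 12 3
print(max(lots), max(subset))     # 200.0 1.6
```

One slider step removes every sub-acre lot at once, while the subset keeps them and shrinks the maximum from 200 acres to under two, so a color scale built on the subset is no longer dominated by the largest lots.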

Dataset preparation - creating computed columns

Occasionally, needed or desirable columns in a dataset are not included, but

may be computed using other values in the set. For example, suppose that in the

CmpltHomes.csv dataset, the relationship between price and location is to be

explored. However, looking at price alone is not sufficient, since price is mostly

determined by the size of the home. A possible measure representing both price
