Multivariate Visualization by Density Estimation - Data Visualization

indicating that the variance will be reduced for large bin width. On the other hand, the bias of $\hat{f}(x)$ for $x \in B_j$ is dominated by

$$\left(\frac{t_j + t_{j+1}}{2} - x\right) f'(x) = (m_j - x)\, f'(x),$$

where $m_j$ is the midpoint of bin $B_j$. Not surprisingly, the piecewise-constant histogram has greatest bias in bins where the true density has the largest (positive or negative) slope. However, this effect can be reduced by the use of smaller bins, since $|m_j - x| \le h_j/2$ for all $x \in B_j$.
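As a rough numerical check of the leading bias term, the sketch below compares the exact bias of the expected histogram height with $(m_j - x) f'(x)$. The standard normal density, the bin $[0.5, 0.7)$, the point $x = 0.55$, and all function names are assumptions chosen for illustration only.

```python
import math

def normal_pdf(x):
    """Standard normal density (assumed true density for this illustration)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def histogram_bias(x, left, right):
    """Exact bias at x of the expected histogram height over the bin [left, right)."""
    h = right - left
    expected_height = (normal_cdf(right) - normal_cdf(left)) / h
    return expected_height - normal_pdf(x)

def leading_bias_term(x, left, right):
    """Leading term (m_j - x) f'(x), with m_j the bin midpoint."""
    midpoint = 0.5 * (left + right)
    slope = -x * normal_pdf(x)  # derivative of the standard normal density
    return (midpoint - x) * slope

exact = histogram_bias(0.55, 0.5, 0.7)
approx = leading_bias_term(0.55, 0.5, 0.7)
print(f"exact bias {exact:.5f}, leading term {approx:.5f}")
```

Both quantities are negative here because the density is decreasing and $x$ lies to the left of the midpoint; narrowing the bin shrinks both.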

Clearly, no single bin width will perform optimally for both variance and bias, but we can balance these competing forces by considering the mean integrated squared error (MISE), found as the sum of the integrated variance and the integrated square of the bias. Optimization over a single bin width $h$ suggests that asymptotic MISE will be minimized for bin width

$$h^* = \left[\frac{6}{R(f')}\right]^{1/3} n^{-1/3},$$

where the "roughness" functional, $R$, is defined by $R(g) = \int g(x)^2\,dx$. Unfortunately, the presence of $R(f')$ limits the applicability of this rule, as it is highly unlikely to be known when $f$ itself must be estimated.
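To make the formula concrete, the following sketch evaluates $h^*$ in a case where $R(f')$ happens to be known. It assumes, purely for illustration, that the true density is normal with standard deviation `sigma`; the function names and grid settings are hypothetical, and the roughness is approximated by a simple Riemann sum rather than any particular library routine.

```python
import numpy as np

def normal_slope_roughness(sigma, half_width=10.0, points=200_001):
    """Approximate R(f') = integral of f'(x)^2 dx for a N(0, sigma^2) density."""
    x = np.linspace(-half_width * sigma, half_width * sigma, points)
    f = np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    f_prime = -(x / sigma**2) * f          # derivative of the normal density
    dx = x[1] - x[0]
    return float(np.sum(f_prime**2) * dx)  # Riemann sum on a fine grid

def amise_bin_width(roughness, n):
    """Bin width h* = [6 / R(f')]^(1/3) * n^(-1/3)."""
    return (6.0 / roughness) ** (1.0 / 3.0) * n ** (-1.0 / 3.0)

sigma = 1.0
r_numeric = normal_slope_roughness(sigma)
r_closed = 1.0 / (4.0 * np.sqrt(np.pi) * sigma**3)  # closed form for the normal case
for n in (100, 1_000, 10_000):
    print(n, round(amise_bin_width(r_numeric, n), 4))
```

For the normal density the closed form $R(f') = 1/(4\sqrt{\pi}\,\sigma^3)$ gives $h^* = (24\sqrt{\pi})^{1/3}\,\sigma n^{-1/3}$, a constant of roughly 3.49 multiplying $\sigma n^{-1/3}$.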

As a more practical alternative, many computer packages follow a recommendation by Sturges (1926) that the number of bins be roughly $1 + \log_2(n)$. Sturges chose the bin counts $\nu_j = \binom{K-1}{j}$ for $j = 0, 1, \ldots, K-1$, so that

$$n = \sum_{j=0}^{K-1} \nu_j = (1 + 1)^{K-1} = 2^{K-1}.$$

Hence the number of bins $K = 1 + \log_2(n)$. Sturges' bin counts are proportional to the binomial $B(K-1, 1/2)$ probabilities, which is approximately normal for moderate $K$. Hence, Sturges' rule is a version of a normal reference rule but motivated by binomial approximations rather than by mean squared error considerations.
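Sturges' construction is easy to verify directly. The sketch below builds the idealized binomial bin counts and inverts the relation $n = 2^{K-1}$; the function names are hypothetical, and rounding up to an integer number of bins is an assumption, since the text only says "roughly".

```python
import math

def sturges_counts(k):
    """Idealized bin counts nu_j = C(k-1, j) for j = 0, ..., k-1."""
    return [math.comb(k - 1, j) for j in range(k)]

def sturges_bins(n):
    """Number of bins K = 1 + log2(n), rounded up (rounding is an assumption)."""
    return math.ceil(1 + math.log2(n))

counts = sturges_counts(6)
print(counts, sum(counts))                    # counts sum to 2**(6 - 1)
print(sturges_bins(32), sturges_bins(1000))   # bins suggested for two sample sizes
```

Because the bin count grows only logarithmically in $n$, a thousandfold increase in sample size adds about ten bins.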

While Sturges' rule gives fairly reasonable widths for small samples from smooth densities, the number of bins increases (and therefore $h$ decreases) at a rate far slower than optimal for AMISE purposes. Better rules replace $R(f')$ in the theoretical formula with the value for a normal distribution, resulting in $h = 3.5\,\sigma n^{-1/3}$. Scott (1979) suggests using the sample standard deviation $s$ in $h_S = 3.5\, s\, n^{-1/3}$, while Freedman and Diaconis (1981) suggest using the more robust interquartile range IQR in $h_{FD} = 2\,\mathrm{IQR}\, n^{-1/3}$. Usually, $h_{FD} < h_S$, since $\sigma \approx \mathrm{IQR}/1.348$ for the normal density and hence $3.5\,\sigma \approx 2.6\,\mathrm{IQR}$, which is 30% wider than $h_{FD}$. Although less oversmoothed than the estimates generated by Sturges' rule, estimates using the normal-based rules can still be smoother than optimal for more complicated underlying densities. This may not be terrible; oversmoothed histograms often look better to many viewers, as the bias of overly large bins can be easier to mentally smooth out than the multiple small modes that can appear in undersmoothed estimates. Nonetheless, the strong ef-
