assigned to class 1; the remaining corners assigned to class 2. The second method is
deterministic. Corners with an odd number of −1 coordinates are assigned to class 1 and
the remaining corners are assigned to class 2. The points were generated using three
methods. In the first one, the points are generated from multidimensional Gaussian
distribution with mean zero and standard deviation one and assigned to the nearest
corner. In the second method, the multidimensional uniform distribution spanned on
the (−1, 1) interval was used instead of the Gaussian. In the third one, points were drawn
from 2^D multidimensional Gaussian distributions with standard deviation 0.1, each
centred on the respective corner of the hypercube. Then two classes of additional
features were added to each data set. Features from the first class were obtained as a
linear combination of original variables. Features from the second class were drawn
randomly from the normal distribution. As a result we obtain the data set described
with three types of features. The generative features are the original variables used to
define the value of decision variable. The combination features are obtained as linear
combinations of generative features and hence they are also connected with decision
variable. These two sets of features are by definition relevant. One should note, however,
that features of both types are only weakly relevant: any of the features can be replaced
with a combination of other features. The remaining variables are random
features; they are not connected with the decision variable.
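The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function names are hypothetical, the random class assignment is assumed to pick half of the corners, the deterministic rule is assumed to count −1 coordinates (counting +1 instead would at most swap the two class labels), the combination weights are assumed to be uniform in (−1, 1), and the random features are assumed to be standard normal.

```python
import itertools
import random


def nearest_corner(point):
    """Nearest corner of the hypercube with corners at -1/+1 is the sign pattern."""
    return tuple(1 if v >= 0 else -1 for v in point)


def deterministic_classes(d, counted=-1):
    """Class 1 for corners with an odd number of `counted` coordinates, else class 2.

    Assumption: the counted value is -1 (the default here).
    """
    return {
        corner: 1 if sum(c == counted for c in corner) % 2 == 1 else 2
        for corner in itertools.product((-1, 1), repeat=d)
    }


def random_classes(d, rng):
    """Assumption: a randomly chosen half of the corners goes to class 1."""
    corners = list(itertools.product((-1, 1), repeat=d))
    class1 = set(rng.sample(corners, len(corners) // 2))
    return {corner: 1 if corner in class1 else 2 for corner in corners}


def draw_point(d, method, rng):
    """Return (point, corner) for one of the three point-generation methods."""
    if method == "gauss_centre":    # single Gaussian, mean 0, sd 1
        point = [rng.gauss(0.0, 1.0) for _ in range(d)]
        return point, nearest_corner(point)
    if method == "uniform":         # uniform on (-1, 1) in every dimension
        point = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        return point, nearest_corner(point)
    if method == "gauss_corners":   # one narrow Gaussian (sd 0.1) per corner
        # Assumption: the corner is picked uniformly and defines the class.
        corner = tuple(rng.choice((-1, 1)) for _ in range(d))
        point = [c + rng.gauss(0.0, 0.1) for c in corner]
        return point, corner
    raise ValueError(f"unknown method: {method}")


def make_dataset(n, d, n_comb, n_rand, method, classes, rng):
    """Build rows of generative + combination + random features, plus labels."""
    # Assumption: combination weights are drawn once per feature, uniform in (-1, 1).
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n_comb)]
    rows, labels = [], []
    for _ in range(n):
        point, corner = draw_point(d, method, rng)
        combos = [sum(w * x for w, x in zip(ws, point)) for ws in weights]
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_rand)]  # unrelated to the class
        rows.append(point + combos + noise)
        labels.append(classes[corner])
    return rows, labels, weights


rng = random.Random(0)
classes = deterministic_classes(3)
rows, labels, weights = make_dataset(200, 3, n_comb=2, n_rand=4,
                                     method="uniform", classes=classes, rng=rng)
```

Each row then holds the three generative features first, followed by two combination features and four random features, so a feature selection method can be checked against a known ground truth.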
Multiple data sets with varying numbers of generative, combination and random
features, as well as varying numbers of objects, were generated using four combinations
of the class assignment and point distribution methods, resulting in four series of data
sets. The first series, denoted as NORM, used the deterministic class assignment and
a single Gaussian at the centre for generation of data points. The second one, denoted as
UNI, used the deterministic class assignment and a uniform distribution of points. The third
one used the random class assignment and a uniform distribution of points. The last series
was obtained using the random class assignment and Gaussians centred on the corners of
the hypercube for point generation. The last two series were generated with the functions
mlbench.xor and mlbench.hypercube from the mlbench package [10] in R [16], with
default parameters for data dispersion, and are denoted as XOR and HYPER.
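As a rough illustration, the four series and a grid of data set sizes can be enumerated as follows. The series table restates the combinations listed above; the grid values and the names `SERIES` and `configurations` are hypothetical placeholders chosen for the sketch, since the actual values are not listed in this passage.

```python
import itertools

# Series name -> (class assignment, point distribution), as listed in the text.
SERIES = {
    "NORM": ("deterministic", "single Gaussian at the centre"),
    "UNI": ("deterministic", "uniform"),
    "XOR": ("random", "uniform"),
    "HYPER": ("random", "Gaussians centred on corners"),
}

# Hypothetical grid values, for illustration only.
N_GENERATIVE = (2, 4)
N_COMBINATION = (0, 4)
N_RANDOM = (0, 10)
N_OBJECTS = (100, 1000)


def configurations():
    """Yield one configuration per series and per combination of grid values."""
    grid = itertools.product(SERIES, N_GENERATIVE, N_COMBINATION,
                             N_RANDOM, N_OBJECTS)
    for name, n_gen, n_comb, n_rand, n_obj in grid:
        assignment, distribution = SERIES[name]
        yield {
            "series": name,
            "class_assignment": assignment,
            "point_distribution": distribution,
            "generative": n_gen,
            "combination": n_comb,
            "random": n_rand,
            "objects": n_obj,
        }


configs = list(configurations())
```

Keeping the series definitions separate from the size grid makes it easy to vary one without touching the other.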
In addition to the analysis of synthetic data sets, the relevance of the variables was
examined for four recently published data sets deposited in the UCI repository [1]:
MicroMass, QSAR biodegradation (Q-b) [12], Turkiye Student Evaluation Data Set
(TSE) [5] and Amazon Commerce Reviews Set (ACRS).
2.2.2 Classification
The tests of classification accuracy for random forest were performed for the four series
of data sets described earlier. However, the data sets used for the survey of classification
results were simpler than those used for feature selection. The number of generative
variables varied between 2 and 8, the number of combination features was either
zero or twice the number of original features, and the number of objects varied
between 100 and 2,000. The systems were not extended with random features; the
goal of this survey was to find the region of parameter space that is feasible for
 