nosocomial infection. At this stage the data and variables were also aggregated into
classes in order to model the problem more accurately. The following classes
were created:
Age Class: aggregation of the patients' age in ranges that correspond to
different age groups;
Intubation: aggregation of all the invasive devices related to intubation in a
single class (nasogastric intubation and nasotracheal intubation);
Catheterization: aggregation of all the invasive devices related to
catheterization in a single class (Urinary Catheter, Peripheral Catheter and
Central Catheter).
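The aggregation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the age cut-offs, the field names (`age`, `devices`) and the device identifiers are hypothetical, since the text does not state the actual ranges or the schema of the forms.

```python
# Hypothetical age ranges; the source does not state the actual cut-offs.
AGE_BINS = [(0, 17, "0-17"), (18, 39, "18-39"), (40, 64, "40-64"), (65, 200, "65+")]

# Device names are illustrative identifiers for the devices listed in the text.
INTUBATION = {"nasogastric_intubation", "nasotracheal_intubation"}
CATHETERIZATION = {"urinary_catheter", "peripheral_catheter", "central_catheter"}


def age_class(age):
    """Map an age in years to the label of the range that contains it."""
    for low, high, label in AGE_BINS:
        if low <= age <= high:
            return label
    raise ValueError(f"age out of range: {age}")


def aggregate(record):
    """Return the aggregated view of one patient record.

    `record` is a dict with an integer 'age' and a collection of device
    names under 'devices' (both field names are assumptions).
    """
    devices = set(record["devices"])
    return {
        "age_class": age_class(record["age"]),
        "intubation": bool(devices & INTUBATION),
        "catheterization": bool(devices & CATHETERIZATION),
    }


example = aggregate({"age": 72, "devices": {"urinary_catheter"}})
print(example)
```

Collapsing several device flags into one class reduces the number of sparse attributes the classifier has to handle, at the cost of no longer distinguishing between the individual devices.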
Moreover, an oversampling technique was applied to the dataset in order to replicate
the data associated with the occurrence of a nosocomial infection. It was thus
possible to obtain a number of records associated with the occurrence of a
nosocomial infection close to the number of records associated with the
non-occurrence of an infection. This technique consists of replicating the minority
class data (full set) in order to increase its weight; this is necessary because classifiers
tend to produce more classification errors in the presence of minority classes [10]. In
this work, the technique was applied because the difference between the
number of forms associated with the occurrence of a nosocomial infection and the
number of forms associated with the non-occurrence of an infection was very
large. Without it, the significance of the infection occurrences could be lost because of
their low rate in the study population. After the oversampling, the
dataset had 517 records. At this stage of CRISP-DM, three datasets were
created: a dataset without replicated data (Approach A), a dataset with replicated data
(Approach B) and a dataset with replicated data and the variable age aggregated into
classes (Approach C).
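As a rough illustration of oversampling by replication, the sketch below appends full copies of the minority class until its size is close to that of the majority class. It is only one plausible reading of the procedure: the field name `infection`, the rounding rule and the toy records are assumptions, and the toy data is not the study's data.

```python
from collections import Counter


def oversample(records, label_key="infection"):
    """Replicate the whole minority class until it is roughly as large
    as the majority class.

    `records` is a list of dicts; `label_key` (a hypothetical field name)
    holds a binary class label. The original records are kept and the
    minority set is appended as full copies.
    """
    counts = Counter(r[label_key] for r in records)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    (major, n_major), (minor, n_minor) = counts.most_common(2)
    minority = [r for r in records if r[label_key] == minor]
    # Total number of copies of the minority set, including the original.
    copies = max(1, round(n_major / n_minor))
    extra = [dict(r) for _ in range(copies - 1) for r in minority]
    return records + extra


# Toy data (invented for illustration): 10 records without infection, 2 with.
data = [{"id": i, "infection": False} for i in range(10)]
data += [{"id": 100 + i, "infection": True} for i in range(2)]

balanced = oversample(data)
print(Counter(r["infection"] for r in balanced))
```

Because the replicated records are exact copies, evaluation on a test set that also contains copies can be optimistic; this is a general property of replication-based oversampling rather than a statement about the study's protocol.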
3.5 Modeling
In this study, Support Vector Machines (SVM) and Naïve Bayes (NB) were the
classification techniques used to perform DM. These techniques were used to
automatically induce the classification models with Oracle Data Miner¹, a SQL
Developer extension that allows users to build, evaluate and apply DM models. Other
techniques were explored, but their first results were not satisfactory.
SVM is a powerful algorithm based on statistical learning theory. It finds the
best decision planes that split the data into different sets, can be used to model
complex problems and generalizes well to new data
[8] [11].
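For reference, the idea of a best separating plane is usually written as the soft-margin optimization problem below. This is the standard textbook formulation for the linear case, not necessarily the exact variant implemented in Oracle Data Miner; the symbols $w$, $b$, $\xi_i$ and $C$ do not appear in the original text.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i \\
\text{subject to}\quad & y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1,\dots,n
\end{aligned}
```

Here $x_i$ is the feature vector of record $i$, $y_i \in \{-1, +1\}$ is its class (for example, infection or no infection), $\xi_i$ measures how far record $i$ violates the margin, and $C$ trades margin width against training errors.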
NB is based on conditional probabilities: it makes predictions using
Bayes' theorem, and it is very fast and scalable [11].
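To make the use of Bayes' theorem concrete, here is a minimal categorical Naïve Bayes sketch in Python. It is not the Oracle Data Miner implementation; the feature names, the add-one smoothing and the four toy records are assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(feature, label)][value] -> number of training rows
    value_counts = defaultdict(Counter)
    values_seen = defaultdict(set)
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[(feature, label)][value] += 1
            values_seen[feature].add(value)
    return class_counts, value_counts, values_seen


def predict_nb(model, row):
    """Return the class with the highest posterior (computed in log space)."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)  # log prior P(class)
        for feature, value in row.items():
            # Laplace (add-one) smoothing avoids zero probabilities.
            num = value_counts[(feature, label)][value] + 1
            den = n_label + len(values_seen[feature])
            score += math.log(num / den)  # log P(value | class)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy records invented for illustration only.
rows = [
    {"catheterization": "yes", "intubation": "yes"},
    {"catheterization": "yes", "intubation": "no"},
    {"catheterization": "no", "intubation": "no"},
    {"catheterization": "no", "intubation": "no"},
]
labels = ["infection", "infection", "no_infection", "no_infection"]

model = train_nb(rows, labels)
prediction = predict_nb(model, {"catheterization": "yes", "intubation": "yes"})
print(prediction)
```

Training only requires counting, which is why the method is fast and scales well; the "naïve" part is the assumption that features are conditionally independent given the class.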
Given the different variables chosen, several scenarios were defined to
build the models:
1 http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/dataminerworkflow-168677.html