structures, leaves represent classifications and
branches represent conjunctions of features that
lead to those classifications (Utgoff 2004). The
machine learning technique for inducing a deci-
sion tree from data is called decision tree learning.
A well-known decision tree algorithm is C4.5
(Quinlan 1993).
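As an illustration of how a decision tree learner might choose a feature to branch on, the sketch below computes the gain ratio, the split criterion associated with C4.5. It is a minimal sketch rather than Quinlan's implementation; the toy records, the feature names (`volume`, `news`) and the class labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels or feature values."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, labels, feature):
    """Information gain from splitting on `feature`, divided by the split information."""
    total = len(labels)
    # Group the class labels by the value each row takes for the feature.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[feature] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: four trading days described by two categorical features.
rows = [
    {"volume": "high", "news": "yes"},
    {"volume": "high", "news": "no"},
    {"volume": "low", "news": "yes"},
    {"volume": "low", "news": "no"},
]
labels = ["alert", "alert", "normal", "normal"]

# The feature with the highest gain ratio would become the root of the tree.
best = max(["volume", "news"], key=lambda f: gain_ratio(rows, labels, f))
```

In this toy data, `volume` separates the two classes perfectly while `news` carries no information about the class, so `volume` is selected.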
Logistic regression is a model used to predict
the probability of occurrence of an event by fit-
ting data to a logistic curve (Hosmer & Stanley
2000). It makes use of several predictor variables
that may be either numerical or categorical. For
example, the probability that a person has a heart
attack within a specified time period might be
predicted from knowledge of the person's age, sex
and body mass index. Logistic regression is used
extensively in the medical and social sciences as
well as marketing applications, such as prediction
of a customer's propensity to purchase a product
or cease a subscription.
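The heart attack example can be sketched as follows. The logistic curve maps a weighted sum of the predictors to a probability between 0 and 1. The intercept and coefficients here are hypothetical values chosen only for illustration; in practice they would be estimated from data.

```python
from math import exp


def logistic(z):
    """Logistic curve: maps any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


def predict_probability(intercept, coefficients, features):
    """Probability of the event given predictor values and fitted coefficients."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return logistic(z)


# Hypothetical coefficients, NOT estimated from any real data set.
intercept = -9.0
coefficients = {"age": 0.08, "male": 0.5, "bmi": 0.1}

# Sex is a categorical predictor, encoded here as a 0/1 indicator.
person = {"age": 60, "male": 1, "bmi": 30.0}
older_person = {"age": 75, "male": 1, "bmi": 30.0}

p = predict_probability(intercept, coefficients, person)
p_older = predict_probability(intercept, coefficients, older_person)
```

With a positive coefficient on age, the predicted probability for the older person is higher, all other predictors being equal.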
A neural network (NN) is a network of artificial
neurons that uses a mathematical or computational
model for information processing (Muller & Insua
1995). In most cases, a neural network is an
adaptive system that changes its structure based
on external or internal information that flows
through the network.
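The adaptive behaviour can be illustrated with the smallest possible case, a single artificial neuron whose weights are adjusted whenever its output disagrees with the target. This is a minimal sketch using the classic perceptron update rule, not a full multi-layer network; the training data (the logical OR function), the learning rate and the number of epochs are assumptions made for the example.

```python
def step(z):
    """Threshold activation: the neuron fires (1) only for positive input."""
    return 1 if z > 0 else 0


def predict(weights, bias, inputs):
    """Output of a single neuron for one input vector."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(samples, epochs=20, rate=0.1):
    """Adapt the weights and bias from the errors observed on each sample."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Weights change only when the neuron's output was wrong.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Hypothetical training data: the logical OR of two binary inputs.
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(samples)
outputs = [predict(weights, bias, inputs) for inputs, _ in samples]
```

After training, the neuron reproduces the target for each of the four samples.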
Donoho (2003) investigates the early detection of
insider trading using data mining
technologies. His research was inspired
by McMillian's hypothesis that people with inside
information leave evidence in option trading data
that might predict news. In order to automate the
analysis and discover unknown relationships, he
made use of different data mining technologies
to replace the large amount of human intuition
and manual analysis in McMillian's method. The
utilized technologies include C4.5, backward
stepwise logistic regression and neural networks.
The experimental data in the research came from
three sources: option trading, stock trading, and
news. Stock and option data were available on
all U.S. companies for which options are traded
(about 2160 companies). News covered these
companies plus others. The date range for which
all three data sources were available covered a
six-month period from March 11, 2003 to September
17, 2003. An expert model was used in order to
evaluate the results. All three algorithms produced
lift over random and over the expert model, but no
algorithm clearly outperformed the others.
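Lift can be read as the hit rate among the cases a model ranks highest, divided by the hit rate expected from random selection. The text does not state the exact definition used in the study, so the sketch below assumes this common top-k form; the scores and outcomes are hypothetical.

```python
def lift_at_k(scores, outcomes, k):
    """Hit rate among the k highest-scored cases divided by the overall base rate."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    top_hits = sum(outcome for _, outcome in ranked[:k])
    base_rate = sum(outcomes) / len(outcomes)
    return (top_hits / k) / base_rate


# Hypothetical model scores and outcomes (1 = news followed, 0 = no news).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 0, 1, 0, 0, 0]

lift = lift_at_k(scores, outcomes, k=2)
lift_all = lift_at_k(scores, outcomes, k=len(scores))
```

A lift above 1 means the model's top-ranked cases contain more true events than random selection would. Taking every case (`k` equal to the number of cases) always gives a lift of exactly 1.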
Outlier Mining on Multiple
Time Series in Stock Market
From the literature review, we can see that most
exception detection techniques handle only
a single time series. It would be beneficial to
integrate multiple time series, such as price,
index, trade amount, etc. This is the motivation
of our research on outlier mining on multiple
time series (OMM). This section presents the
design of OMM, the experiments and the
evaluation of OMM.
Outlier Mining on Multiple
Time Series (OMM)
OMM is motivated by the need to improve the
accuracy of stock market surveillance. In Shannon's
information theory, information is defined
as that which removes or reduces uncertainty
(Cover & Thomas 1991). For an outlier detection
task, more information means higher accuracy
of an outlier detection model, since the identified
outliers are more likely to be different from the
remaining data. For example, it is less accurate to
measure a stock and identify the outliers by using
price information only. The results will be more
reasonable if we add one or more measures, such
as volume, volatility and liquidity.
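The effect of adding a second measure can be sketched with standardized scores. The example below is not the OMM design itself; it simply assumes that each series is converted to z-scores and that the per-day scores are combined by their Euclidean norm. The price change and volume figures are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(series):
    """Standardize a series: distance of each point from the mean in standard deviations."""
    mu, sigma = mean(series), pstdev(series)
    return [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]


def combined_outlier_scores(*series_list):
    """Euclidean norm of the per-series z-scores at each time point."""
    per_series = [z_scores(s) for s in series_list]
    return [sum(z * z for z in point) ** 0.5 for point in zip(*per_series)]


def most_outlying(scores):
    """Index of the time point with the largest outlier score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical daily price changes (%) and trade volumes for one stock.
price_change = [0.5, -0.4, 0.1, 0.6, -0.5, 0.6, 0.2, -0.3]
volume = [100, 110, 95, 105, 98, 400, 102, 97]

price_only = [abs(z) for z in z_scores(price_change)]
combined = combined_outlier_scores(price_change, volume)

day_by_price = most_outlying(price_only)
day_by_both = most_outlying(combined)
```

On price alone, the most unusual day is index 4, which is only marginally more extreme than several others. Once volume is included, index 5 stands out clearly because an ordinary price move coincides with an exceptional volume.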
In the design of OMM, the key issue is how to
integrate multiple time series. There are two main
potential approaches for this. One is to integrate
the multiple time series before the outlier mining
process, and the other is to run outlier mining on
individual time series first and then integrate the