3.2.3.1 Integration
Data integration is the cornerstone of modern commercial informatics: it combines data from different sources and provides users with a uniform view of the data [38]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: the data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform and Load). Extraction involves connecting to the source systems and selecting, collecting, analyzing, and processing the necessary data. Transformation is the execution of a series of rules that convert the extracted data into standard formats. Loading means importing the extracted and transformed data into the target storage infrastructure; it is the most complex of the three procedures and includes operations such as transformation, copying, clearing, standardization, screening, and data organization. Data federation, in turn, builds a virtual database that can query and aggregate data from different data sources; such a database contains no data itself but instead holds information or metadata about the actual data and its locations. These two "storage-reading" approaches do not satisfy the high performance requirements of data streams or of search programs and applications: compared with query workloads, the data in such applications is more dynamic and must be processed while it is being transmitted. Generally, data integration methods are therefore accompanied by stream processing engines and search engines [39, 40].
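
To make the ETL steps concrete, the following is a minimal sketch in Python, assuming a hypothetical CSV export as the source system and a local SQLite file as the target warehouse; the file names, field names, and transformation rules are illustrative assumptions, not part of the original text.

import csv
import sqlite3

def extract(path):
    """Extract: read raw records from a source system (here, a CSV export)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: apply rules that bring the extracted data into a standard format."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "customer_id": int(row["customer_id"]),
            "country": row["country"].strip().upper(),  # standardize country codes
            "amount": round(float(row["amount"]), 2),   # normalize monetary values
        })
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load: import the transformed records into the target storage infrastructure."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales "
                "(customer_id INTEGER, country TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:customer_id, :country, :amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("sales_export.csv")))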
3.2.3.2 Cleaning
Data cleaning is the process of identifying inaccurate, incomplete, or unreasonable data and then modifying or deleting such data to improve data quality. Generally, data cleaning includes five complementary procedures [41]: defining and determining error types, searching for and identifying errors, correcting the errors, documenting error instances and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and constraints should be inspected. Data cleaning is vital for maintaining data consistency and is widely applied in many fields, such as banking, insurance, retail, telecommunications, and traffic control.
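
As an illustration of these procedures, the following Python sketch runs simple format, completeness, and rationality checks over hypothetical customer records and documents every violation; the record layout, the specific rules, and the choice of dropping versus flagging a record are assumptions made for this example.

import re

# Assumed record layout: each customer record carries an email and an age.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def find_errors(record):
    """Identify format, completeness, and rationality violations in one record."""
    errors = []
    if not record.get("email") or not EMAIL_RE.match(record["email"]):
        errors.append("bad_email")          # format check
    if record.get("age") is None:
        errors.append("missing_age")        # completeness check
    elif not 0 < record["age"] < 120:
        errors.append("unreasonable_age")   # rationality/constraint check
    return errors

def clean(records, log):
    """Correct or drop erroneous records, documenting every error instance."""
    kept = []
    for rec in records:
        errs = find_errors(rec)
        if errs:
            log.append({"record": rec, "errors": errs})  # document error examples/types
        if "bad_email" in errs:
            continue                                     # delete irreparable records
        if "missing_age" in errs or "unreasonable_age" in errs:
            rec = {**rec, "age": None}                   # flag the value for later repair
        kept.append(rec)
    return kept

error_log = []
data = [{"email": "a@example.com", "age": 34},
        {"email": "not-an-email", "age": 27},
        {"email": "b@example.com", "age": 430}]
print(clean(data, error_log), error_log)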
In e-commerce, most data is collected electronically and may have serious data quality problems. Classic data quality problems mainly arise from software defects, customization errors, or system misconfiguration. The authors of [42] discussed data cleaning in e-commerce, where data is gathered by crawlers and by regularly re-copying customer and account information. In [43], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, raw RFID data is of low quality and contains many abnormal readings, owing to limitations of the physical design and to environmental noise. In [44], a probabilistic model was developed to cope with data loss in mobile environments. Khoussainova et al. [45] proposed a system that automatically corrects errors in input data using defined global integrity constraints. Herbert et al. [46]
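
To make the RFID cleaning problem mentioned above more concrete, the following Python sketch applies window-based smoothing to a raw stream of tag readings, a common way to mask the sporadic missed reads that raw RFID streams contain; the epoch/window representation and the window size are assumptions chosen for illustration and are not the specific method of [43].

from collections import deque

def smooth_readings(raw_epochs, window=5):
    """Window-based smoothing for a raw RFID stream.

    raw_epochs: one set of observed tag IDs per reader epoch.
    A tag counts as present in an epoch if it appeared anywhere in the
    last `window` epochs, which hides isolated missed readings.
    """
    recent = deque(maxlen=window)
    smoothed = []
    for reads in raw_epochs:
        recent.append(reads)
        smoothed.append(set().union(*recent))  # tags seen within the window
    return smoothed

# A tag ("t1") that the reader occasionally misses is still reported as present.
raw = [{"t1"}, set(), {"t1"}, set(), set(), {"t2"}]
print(smooth_readings(raw, window=3))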