A Heuristic Approach for Designing Regional Language
Based Raw-Text Extractor and Unicode
Font-Mapping Tool
Debnath Bhattacharyya¹, Poulami Das¹, Debashis Ganguly¹, Kheyali Mitra¹,
Swarnendu Mukherjee¹, Samir Kumar Bandyopadhyay², and Tai-hoon Kim³
¹ Computer Science and Engineering Department, Heritage Institute of Technology,
Kolkata-700107, India
{debnathb,dasp88,DebashisGanguly,kheyalimitra,mukherjee.swarnendu}@gmail.com
² Department of Computer Science and Engineering, University of Calcutta,
Kolkata-700009, India
skb1@vsnl.com
³ Hannam University, Daejeon - 306791, Korea
taihoonn@empal.com
Abstract. Information Extraction (IE) is a type of information retrieval meant
for extracting structured information. In general, information on the web is
well structured in HTML or XML format, and IE structures the content of these
documents by applying learning techniques for pattern matching. A typical
application of IE is to scan a set of documents written in a natural language
and populate a database with the extracted information. In this paper, we
present a heuristic approach to interactive information extraction for content
written in Indian regional languages. It enables even a naive user to
efficiently extract Indian regional-language text from a web document. The
approach behaves much like a pre-programmed information extraction engine.
Keywords: Information Extraction, Indian Regional Language, Search engine,
pattern matching.
1 Introduction
The internet provides a vast source of textual information precisely and at very
low cost (sometimes free of charge), which is why the World Wide Web offers a
tremendously rich source of data. Unfortunately, it fails to satisfy a user's
information needs. The reason is that information providers are limited in
their ability to present data or information to end users. They have little
flexibility to present exactly the data that end users demand, and the challenge
is greatest when extracting the information requires heterogeneous sources.
Thus a new field of research related to Information Extraction (IE)
incorporating a wide range of new knowledge-driven applications and services have
 