A Heuristic Approach for Designing Regional Language Based Raw–Text Extractor and Unicode Font–Mapping Tool - Advances in Computational Science and Engineering

Not only that, it must convert the extracted raw text by reading it character by character and writing the corresponding UTF-8 glyphs supported by the proprietary font of that regional language.
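The character-by-character conversion described above can be sketched as a simple table lookup. This is only a minimal illustration: the table `HYPOTHETICAL_FONT_MAP`, its three entries, and the function name `font_map` are assumptions made for the example and do not reproduce the mapping of any real proprietary font.

```python
from typing import Mapping

# Hypothetical mapping from characters as stored by a proprietary font
# to Unicode Devanagari code points. Illustrative only; a real font
# would need its own, much larger table.
HYPOTHETICAL_FONT_MAP = {
    "d": "\u0915",  # DEVANAGARI LETTER KA
    "e": "\u092e",  # DEVANAGARI LETTER MA
    "y": "\u0932",  # DEVANAGARI LETTER LA
}


def font_map(raw_text: str, table: Mapping[str, str]) -> str:
    """Read raw_text one character at a time and emit its mapped form.

    Characters without an entry in the table (spaces, digits,
    punctuation) are passed through unchanged.
    """
    return "".join(table.get(ch, ch) for ch in raw_text)


def write_utf8(path: str, text: str) -> None:
    """Store the mapped text on disk encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
```

A one-to-one lookup like this does not cover glyphs that need reordering or that combine into conjuncts; those special cases would need extra rules on top of the table.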

The special glyphs that need particular care here are not easy to maintain. Handling them requires a detailed study of ISO 8859-15, UTF-8, Hindi fonts and proprietary fonts [1, 2, 3]. A detailed study of Hindi fonts in particular is essential, as the algorithm is based on this language [4, 5]. The approach to raw-text extraction is the same for any other language, since it follows a generic heuristic: a regular expression searches for a pattern of tags associated with the keyword that specifies the proprietary font name. Here, however, Hindi is used for our test set.
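One possible reading of this tag-matching heuristic is sketched below. The function names, the font name `ExampleHindiFont` used in the usage note, and the exact pattern are assumptions for illustration; the source does not give the regular expression it uses.

```python
import re
from html import unescape


def build_font_pattern(font_name: str) -> "re.Pattern[str]":
    """Match an element whose opening tag mentions font_name.

    The named group 'text' captures everything up to the first
    matching closing tag. Nested elements of the same tag name are
    not handled; this is a heuristic, not an HTML parser.
    """
    return re.compile(
        r"<(?P<tag>\w+)\b[^>]*" + re.escape(font_name) + r"[^>]*>"
        r"(?P<text>.*?)</(?P=tag)\s*>",
        re.IGNORECASE | re.DOTALL,
    )


def extract_raw_text(html: str, font_name: str) -> str:
    """Return the text of every element styled with font_name."""
    pieces = []
    for match in build_font_pattern(font_name).finditer(html):
        # Drop any inner markup, decode entities, normalise whitespace.
        inner = re.sub(r"<[^>]+>", " ", match.group("text"))
        cleaned = " ".join(unescape(inner).split())
        if cleaned:
            pieces.append(cleaned)
    return "\n".join(pieces)
```

For a page containing `<font face="ExampleHindiFont">dey <b>dy</b></font>`, calling `extract_raw_text(page, "ExampleHindiFont")` keeps only the text inside that element and ignores the rest of the page. Because only the font name changes, the same sketch would apply to another language's proprietary font.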

The whole technique is divided into a set of modules, discussed in detail below, which together make up the actual heuristic by interacting with one another.

3.1 Main ( )

This is the main method of our algorithm. It takes the InputFolderPath and OutputFolderPath from the user and calls the major module of our algorithm, namely Process. It mainly represents the user-friendly interface of the algorithm.

1. Ask the user for the absolute path of the folder that contains the web pages, distributed across numerous subfolders, representing the complete corpus website.
2. Also ask for the absolute path of the folder that will contain the extracted and font-mapped raw documents after successful completion of the algorithm.
3. Call Process (InputFolderPath, OutputFolderPath).
4. Return.
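The four steps above can be sketched as follows. The prompt strings, the parameter names `ask` and `run`, and the absolute-path check are assumptions added for the example; `process` is only a placeholder standing in for the Process module of Section 3.2.

```python
from pathlib import Path


def process(input_folder_path: Path, output_folder_path: Path) -> None:
    """Placeholder for the Process module described in Section 3.2."""
    raise NotImplementedError("Process is described in Section 3.2")


def main(ask=input, run=process) -> None:
    """Collect both folder paths from the user and hand them to Process.

    ask and run are injectable so the sketch can be exercised without
    a console; by default they are input() and the placeholder above.
    """
    # Step 1: folder holding the corpus website and its subfolders.
    input_folder_path = Path(ask("Absolute path of the corpus folder: ").strip())
    # Step 2: folder that will receive the extracted, font-mapped documents.
    output_folder_path = Path(ask("Absolute path of the output folder: ").strip())

    # The steps ask for absolute paths, so reject relative ones early.
    for folder in (input_folder_path, output_folder_path):
        if not folder.is_absolute():
            raise ValueError(f"not an absolute path: {folder}")

    # Step 3: call Process; Step 4: return.
    run(input_folder_path, output_folder_path)
```

Calling `main()` with no arguments prompts on the console; passing a different `run` callable lets the same interface drive a real Process implementation.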

3.2 Process (InputFolderPath, OutputFolderPath)

This method calls the other modules of the algorithm: RecursiveTraverse, RawTextExtractor and FontMap.

Arguments: This function takes InputFolderPath and OutputFolderPath as arguments and finally stores the resultant RawTextDocuments in the output directory.

1. Call the RecursiveTraverse module with InputFolderPath as its argument.
2. Store all the web pages, i.e., files with web extensions, in a list.
3. For each item in the web-page list, execute the steps up to Step 6.
4. Call the RawTextExtractor module with the web-page element as its argument and store the extracted regional-language raw-text string.
