In addition, it must convert the extracted raw text by reading it character by
character and writing the corresponding UTF-8 glyphs supported by the proprietary
font of that regional language.
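The character-by-character conversion described above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function names `font_map` and `to_utf8` and the table `SAMPLE_FONT_TABLE` are hypothetical, and the table entries are invented for the example rather than taken from any real proprietary font.

```python
from typing import Mapping

# Hypothetical excerpt of a proprietary-font-to-Unicode table. The entries are
# illustrative assumptions only and do not describe a real font.
SAMPLE_FONT_TABLE = {
    "d": "\u0915",  # DEVANAGARI LETTER KA
    "k": "\u093e",  # DEVANAGARI VOWEL SIGN AA
    "e": "\u092e",  # DEVANAGARI LETTER MA
}


def font_map(raw_text: str, table: Mapping[str, str]) -> str:
    """Read raw_text one character at a time and emit the mapped glyph.

    Characters that are missing from the table are copied through unchanged.
    """
    return "".join(table.get(ch, ch) for ch in raw_text)


def to_utf8(raw_text: str, table: Mapping[str, str]) -> bytes:
    """Map the text and encode the result as UTF-8 bytes for writing to disk."""
    return font_map(raw_text, table).encode("utf-8")
```

A one-to-one table like this is the simplest possible case. A real font would probably also need multi-character sequences and glyph reordering, which this sketch does not attempt.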
The special glyphs that need particular care here are not easy to handle. Handling
them requires a detailed study of ISO 8859-15, UTF-8, Hindi fonts and proprietary
fonts [1, 2, 3]. A detailed study of Hindi fonts is especially important because the
algorithm is based on this language [4, 5]. The approach to raw text extraction is
the same for any other language, as it follows a generic heuristic: a regular
expression searches for a pattern of tags associated with the keyword that specifies
the proprietary font name. Here, however, Hindi is used for our test set.
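A regular-expression heuristic of the kind described above might look like the sketch below. It assumes that the proprietary font is named in the `face` attribute of an HTML `<font>` tag. The function name `extract_font_text` and the font name `ExampleHindiFont` are hypothetical, and the pattern is an illustration rather than the authors' actual expression.

```python
import re
from typing import List

# Matches any remaining markup inside the captured text.
_TAG = re.compile(r"<[^>]+>")


def extract_font_text(page_source: str, font_name: str) -> List[str]:
    """Return the text enclosed by <font> tags whose face attribute
    mentions font_name (assumption: the font is declared this way)."""
    pattern = re.compile(
        r"<font\b[^>]*\bface\s*=[^>]*" + re.escape(font_name) + r"[^>]*>(.*?)</font>",
        re.IGNORECASE | re.DOTALL,
    )
    fragments = []
    for match in pattern.finditer(page_source):
        # Drop nested markup such as <b> or <br> and trim surrounding space.
        text = _TAG.sub("", match.group(1)).strip()
        if text:
            fragments.append(text)
    return fragments
```

Because the font name is passed in as a parameter, the same function could be reused for another language by changing only the keyword. Nested `<font>` tags are not handled in this sketch.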
The whole technique is divided into a set of modules, discussed in detail below,
which together make up the actual heuristic by interacting with each other.
3.1 Main ( )
This is the main method of our algorithm. It takes the InputFolderPath and
OutputFolderPath from the user and calls the major module of the algorithm, namely
Process. It mainly serves as the user-friendly interface of the algorithm.
1. Ask the user for the absolute path of the folder containing the web pages,
   distributed in numerous subfolders, that represent the complete corpus website.
2. Also ask for the absolute path of the folder that will contain the extracted
   and font-mapped raw documents after successful completion of the algorithm.
3. Call PROCESS (INPUTFOLDERPATH, OUTPUTFOLDERPATH).
4. Return.
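The four steps of Main can be sketched as below. This is a minimal sketch under stated assumptions: `process` is only a placeholder for the Process module of Section 3.2, and the `ask` and `run` parameters are hypothetical additions that let the prompts and the callee be swapped out, for example when testing.

```python
from typing import Callable


def process(input_folder_path: str, output_folder_path: str) -> None:
    """Placeholder for the Process module; the real work is described in 3.2."""
    print("Processing", input_folder_path, "->", output_folder_path)


def main(
    ask: Callable[[str], str] = input,
    run: Callable[[str, str], None] = process,
) -> None:
    # Step 1: absolute path of the folder holding the corpus web pages.
    input_folder_path = ask("Absolute path of the input folder: ").strip()
    # Step 2: absolute path of the folder for the extracted, font-mapped text.
    output_folder_path = ask("Absolute path of the output folder: ").strip()
    # Step 3: hand both paths over to Process.
    run(input_folder_path, output_folder_path)
    # Step 4: return to the caller.
    return None
```

Calling `main()` with no arguments prompts on the console and forwards the two answers to `process`.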
3.2 Process (InputFolderPath, OutputFolderPath)
This method calls the other modules of the algorithm, such as RecursiveTraverse,
RawTextExtractor and FontMap.
Arguments: This function takes InputFolderPath and OutputFolderPath as arguments
and finally stores the resulting RawTextDocuments in the output directory.
1. Call the RecursiveTraverse module with InputFolderPath as its argument.
2. Store all the web pages, i.e., files with web extensions, in a list.
3. For each item in the web-page list, execute the steps up to Step 6.
4. Call the RawTextExtractor module with the web-page element as its argument
   and store the extracted regional-language raw text string.
 