the source of text, we will find that it is marked up with HTML tags. These are not
necessarily relevant to the analysis process and may need to be removed.
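As a rough illustration of removing such markup, the following sketch strips anything that looks like a tag and collapses the leftover whitespace. The class and method names (HtmlTagStripper, stripTags) are hypothetical and chosen only for this example. A regular expression is not a real HTML parser: it does not decode entities such as &amp;amp; and it keeps the contents of script and style elements, so a dedicated HTML parsing library is usually the safer choice for real pages.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of any NLP API.
class HtmlTagStripper {
    // Matches anything that looks like a tag, e.g. <p>, </div>, <a href="...">
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    // Matches runs of whitespace left behind once the tags are gone
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        // Replace each tag with a space so adjacent words do not run together
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        // Collapse repeated whitespace and trim both ends
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1><p>Some <b>bold</b> text</p>";
        System.out.println(stripTags(html)); // prints "Title Some bold text"
    }
}
```

Replacing each tag with a space, rather than deleting it, keeps words on either side of a tag boundary from being fused into a single token. The trade-off is a stray space when a tag sits directly before punctuation.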
The Multipurpose Internet Mail Extensions (MIME) type is used to characterize the
format of a file. Common file types are listed in the following table. We need either
to explicitly remove or alter the markup found in a file, or to use specialized software
that handles it for us. Some of the NLP APIs provide tools to deal with specialized file formats.
File format            MIME type                                  Description
Text                   text/plain                                 Simple text file
Office Type Document   application/msword                         Microsoft Office
Office Type Document   application/vnd.oasis.opendocument.text    Open Office
PDF                    application/pdf                            Adobe Portable Document Format
HTML                   text/html                                  Web pages
XML                    text/xml                                   eXtensible Markup Language
Database               Not applicable                             Data can be in a number of different formats
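To make the mapping in the table concrete, here is a minimal sketch that looks up a MIME type from a file name's extension. The class name, the method name, and the choice of extensions (txt, doc, odt, pdf, html, xml) are assumptions made for this example; they do not come from any NLP API. The fallback value application/octet-stream is the conventional type for unknown binary data. Note that the registered type for plain text is text/plain. The sketch uses Map.of, so it assumes Java 9 or later.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical lookup table for illustration; extensions are assumed, not exhaustive.
class MimeTypeLookup {
    private static final String UNKNOWN = "application/octet-stream";

    private static final Map<String, String> MIME_BY_EXTENSION = Map.of(
            "txt", "text/plain",
            "doc", "application/msword",
            "odt", "application/vnd.oasis.opendocument.text",
            "pdf", "application/pdf",
            "html", "text/html",
            "xml", "text/xml");

    static String mimeTypeFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // No extension, or the name ends with a dot: type is unknown
        if (dot < 0 || dot == fileName.length() - 1) {
            return UNKNOWN;
        }
        // Compare extensions case-insensitively, independent of the default locale
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return MIME_BY_EXTENSION.getOrDefault(extension, UNKNOWN);
    }

    public static void main(String[] args) {
        System.out.println(mimeTypeFor("report.PDF")); // prints "application/pdf"
        System.out.println(mimeTypeFor("notes"));      // prints "application/octet-stream"
    }
}
```

An extension is only a hint about a file's contents. The standard library method java.nio.file.Files.probeContentType can be used instead of a hand-written table, although its result depends on the platform's installed file type detectors.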
Many of the NLP APIs assume that the data is clean. When it is not, it needs to be cleaned
lest we get unreliable and misleading results.