Analytics for Noisy Unstructured Text Data I (Artificial Intelligence)

INTRODUCTION

Unfortunately, computing systems are not yet as smart as the human mind. Over the last couple of years a significant number of researchers have been focusing on noisy text analytics. Noisy text data is found in informal settings (online chat, SMS, e-mails, message boards, among others) and in text produced by automated speech recognition or optical character recognition systems. Noise can degrade the performance of other information processing algorithms such as classification, clustering, summarization and information extraction. We identify some of the key research areas for noisy text and give a brief overview of the state of the art. These areas are (i) classification of noisy text, (ii) correcting noisy text, and (iii) information extraction from noisy text. We cover the first in this chapter and the latter two in the next chapter.

We define noise in text as any kind of difference in the surface form of an electronic text from the intended, correct or original text. We see such noisy text every day in various forms. Each form has unique characteristics and hence requires special handling. We introduce some of these forms of noisy textual data in this section.

Online Noisy Documents: E-mails, chat logs, scrap-topic entries, newsgroup postings, threads in discussion fora, blogs, etc., fall under this category. People are typically less careful about the sanity of written content in such informal modes of communication. These documents are characterized by frequent misspellings, common and not-so-common abbreviations, incomplete sentences, missing punctuation and so on. Noisy documents are almost always human interpretable, if not by everyone, at least by their intended readers.

SMS: Short Message Services are becoming more and more common. Language usage in SMS text differs significantly from the standard form of the language. An urge towards shorter message length, which facilitates faster typing, and the need for semantic clarity shape the structure of this non-standard form, known as the texting language (Choudhury et al., 2007).

Text Generated by ASR Devices: ASR is the process of converting a speech signal to a sequence of words. An ASR system takes a speech signal such as monologs, discussions between people, telephonic conversations, etc. as input and produces a string of words, typically not demarcated by punctuation, as a transcript. An ASR system consists of an acoustic model, a language model and a decoding algorithm. The acoustic model is trained on speech data and the corresponding manual transcripts. The language model is trained on a large monolingual corpus. The ASR system converts audio into text by searching the acoustic model and language model space using the decoding algorithm. Most conversations between agents and customers at contact centers today are recorded. To process this data and obtain customer intelligence, it is necessary to convert the audio into text.

Text Generated by OCR Devices: Optical character recognition, or 'OCR', is a technology that allows digital images of typed or handwritten text to be converted into an editable text document. It takes a picture of the text and translates the text into Unicode or ASCII. For handwritten optical character recognition, the rate of recognition is 80% to 90% with clean handwriting.

Call Logs in Contact Centers: Today’s contact centers (also known as call centers, BPOs, KPOs) produce huge amounts of unstructured data in the form of call logs, apart from emails, call transcriptions, SMS, chat transcripts, etc. Agents are expected to summarize an interaction as soon as they are done with it and before picking up the next one. Because the agents work under immense time pressure, the summary logs are very poorly written and are sometimes difficult even for humans to interpret. Analysis of such call logs is important for identifying problem areas, agent performance, evolving problems, etc.

In this chapter we focus on automatic classification of noisy text. Automatic text classification refers to segregating documents into different topics depending on their content, for example, categorizing customer emails according to topics such as billing problem, address change, product enquiry, etc. It has important applications in email categorization, building and maintaining web directories (e.g. DMoz), spam filtering, automatic call and email routing in contact centers, pornographic material filtering and so on.

NOISY TEXT CATEGORIZATION

The text classification task consists of learning models for a given set of classes and applying these models to new, unseen documents for class assignment. This is an important component in many knowledge extraction tasks: real-time sorting of email or files into folder hierarchies, topic identification to support topic-specific processing operations, structured search and/or browsing, or finding documents corresponding to long-term standing interests or more dynamic task-based interests. Two types of classifiers are commonly found, viz. statistical classifiers and rule-based classifiers.

In statistical techniques a model is typically trained on a corpus of labelled data; once trained, the system can be used for automatic class assignment of unseen data. A survey of text classification can be found in the work by Aas & Eikvil (Aas & Eikvil, 1999). Given a training document collection D = {d1, d2, ..., dM} with true classes {y1, y2, ..., yM}, the task is to learn a model.

This model is used for categorizing a new unlabelled document du. Typically, words appearing in the text are used as features. Other applications, including search, rely heavily on the markup or link structure of documents, but classifiers depend only on the content of the documents, that is, the collection of words present in them. Once features are extracted from documents, each document is converted into a document vector. Documents are represented in a vector space; each dimension of this space represents a single feature, and the importance of that feature in the document gives the distance from the origin along that dimension. The simplest representation of document vectors uses the binary event model, where if a feature j ∈ K appears in document di, then the jth component of di is 1; otherwise it is 0. One of the most popular statistical classification techniques is naive Bayes (McCallum, 1998). In the naive Bayes technique the probability of a document di belonging to class c is computed as follows, where wj denotes the jth feature (word) of di:

\[
P(c \mid d_i) = \frac{P(c)\, P(d_i \mid c)}{P(d_i)} \approx \frac{P(c) \prod_{j} P(w_j \mid c)}{P(d_i)}
\]

The final approximation of the above equation refers to the naive part of such a model, i.e., the assumption of word independence which means the features are assumed to be conditionally independent, given the class variable.
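The binary event model and the independence assumption can be illustrated with a minimal Python sketch. It is not the implementation used in any of the cited works: the function names, the toy corpus and the use of Laplace smoothing are assumptions made for illustration. Under the binary event model, both the presence and the absence of each vocabulary word contribute a factor to the product.

```python
import math

def binary_vector(text, vocabulary):
    """Binary event model: component j is 1 if vocabulary word j occurs in the text."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

def train_naive_bayes(docs, labels, vocabulary):
    """Estimate P(c) and P(w_j = 1 | c); Laplace smoothing is an assumption here."""
    priors, cond = {}, {}
    for c in sorted(set(labels)):
        vecs = [binary_vector(d, vocabulary) for d, y in zip(docs, labels) if y == c]
        priors[c] = len(vecs) / len(docs)
        cond[c] = [(sum(v[j] for v in vecs) + 1) / (len(vecs) + 2)
                   for j in range(len(vocabulary))]
    return priors, cond

def classify(text, vocabulary, priors, cond):
    """Pick the class maximizing log P(c) + sum_j log P(x_j | c).

    P(d) is the same for every class, so it is dropped from the comparison.
    """
    x = binary_vector(text, vocabulary)
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for xj, p in zip(x, cond[c]):
            score += math.log(p if xj else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy corpus, for illustration only.
docs = ["exciting offer get a free laptop",
        "free offer win a prize",
        "meeting agenda for monday",
        "project report and meeting notes"]
labels = ["spam", "spam", "ham", "ham"]
vocabulary = sorted({w for d in docs for w in d.split()})
priors, cond = train_naive_bayes(docs, labels, vocabulary)

print(classify("free laptop offer", vocabulary, priors, cond))
print(classify("monday meeting report", vocabulary, priors, cond))
```

Working in log space avoids numerical underflow when the product runs over a large vocabulary.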

Rule-based learning systems have been adopted for the document classification problem because of their considerable appeal. They perform well at finding simple axis-parallel frontiers. A typical rule-based classification scheme for a category, say C, has the form:

Assign category C if antecedent, or
Do not assign category C if antecedent

The antecedent in the premise of a rule usually involves some kind of feature value comparison. A rule is said to cover a document, or a document is said to satisfy a rule, if all the feature value comparisons in the antecedent of the rule are true for the document. One of the well-known works in the rule-based text classification domain is RIPPER. Like a standard separate-and-conquer algorithm, it builds a rule set incrementally. When a rule is found, all documents covered by the rule, both positive and negative, are discarded. The rule is then added to the rule set. The remaining documents are used to build other rules in the next iteration.
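The separate-and-conquer loop described above can be sketched as follows. This is a deliberately simplified illustration and not RIPPER itself: it has no pruning or optimization phase, a rule is assumed to be a plain conjunction of "word is present" tests, and the greedy precision criterion, the function names and the toy documents are all assumptions.

```python
def covers(rule, doc_words):
    """A rule (a set of required words) covers a document if every word is present."""
    return rule <= doc_words

def grow_rule(pos, neg):
    """Greedily add words until the rule covers no remaining negative document."""
    rule = set()
    pos, neg = list(pos), list(neg)
    while neg or not rule:
        candidates = set().union(*pos) - rule
        if not candidates:
            break

        def precision(word):
            p = sum(word in d for d in pos)
            n = sum(word in d for d in neg)
            return (p / (p + n), p)

        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates), key=precision)
        rule.add(best)
        pos = [d for d in pos if best in d]
        neg = [d for d in neg if best in d]
    return rule

def separate_and_conquer(pos, neg):
    """Learn rules one at a time, discarding every document each new rule covers."""
    rules = []
    pos, neg = list(pos), list(neg)
    while pos:
        rule = grow_rule(pos, neg)
        if not rule:
            break
        rules.append(rule)
        # Covered documents, positive and negative alike, are removed.
        pos = [d for d in pos if not covers(rule, d)]
        neg = [d for d in neg if not covers(rule, d)]
    return rules

# Hypothetical documents represented as sets of words.
positives = [{"play", "rule", "match"}, {"exercise", "outdoor"}, {"sports", "news"}]
negatives = [{"exercise", "homework"}, {"exam", "rule"}, {"news", "weather"}]
rules = separate_and_conquer(positives, negatives)
print(rules)
```

Each learned rule plays the role of one antecedent in the "Assign category C if antecedent" scheme; a document is assigned to the category if any rule covers it.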

In both statistical and rule-based text classification techniques, the content of the text is the sole determiner of the category to be assigned. However, noise in the text distorts the content, and hence one can expect categorization performance to be affected by noise in the text. Classifiers are essentially trained to identify correlations between extracted features (words) and different categories, which can later be used to categorize new documents. For example, words like "exciting offer get a free laptop" might have a stronger correlation with the category spam emails than with non-spam emails. Noise in text distorts this feature space: "excitinng ofer get frree lap top" will be a new set of features, and the categorizer will not be able to relate it to the spam emails category. The feature space explodes, as the same feature can appear in different forms due to spelling errors, poor recognition, wrong transcription, etc. In the remaining part of this section we give an overview of how people have approached the problem of categorizing noisy text.
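The distortion of the feature space in the spam example can be made concrete with a small Python sketch. It assumes the simplest possible feature extraction, whitespace tokenization into a set of words; the function and variable names are illustrative.

```python
def features(text):
    """Bag of distinct lowercase words; whitespace tokenization is an assumption."""
    return set(text.lower().split())

clean = features("exciting offer get a free laptop")
noisy = features("excitinng ofer get frree lap top")

shared = clean & noisy      # features a classifier trained on clean text can still use
unseen = noisy - clean      # features it has never observed
combined = clean | noisy    # vocabulary if both forms occur in the corpus

print(sorted(shared))
print(sorted(unseen))
print(len(clean), len(combined))
```

Only one of the six clean features survives in the noisy message, while the combined vocabulary nearly doubles, which is the feature space explosion described above.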

Categorization of OCRed Documents

Electronically recognized handwritten documents and documents generated by an OCR process are typical examples of noisy text because of the errors introduced by the recognition process. Vinciarelli (Vinciarelli, 2004) has studied the characteristics of the noise present in such data and its effect on categorization accuracy. A subset of documents from the Reuters-21578 text classification dataset was taken and noise was introduced using two methods. In the first, a subset of documents was manually written and recognized using an offline handwriting recognition system. In the second, the OCR-based extraction process was simulated by randomly changing a certain percentage of characters. According to the study, for recall values up to 60-70 percent, depending on the source, the categorization system is robust to noise even when the Term Error Rate is higher than 40 percent. It was also observed that the results on the handwritten data appeared to be lower than those obtained from the OCR simulations.

Generic systems for text categorization based on statistical analysis of representative text corpora have been proposed (Bayer et al., 1998). Features are extracted from training texts by selecting substrings from actual word forms and applying statistical information and general linguistic knowledge, followed by dimensionality reduction by linear transformation. The actual categorization system is based on a minimum least-squares approach. The system is evaluated on the tasks of categorizing abstracts of paper-based German technical reports and business letters concerning complaints. Approximately 80% classification accuracy is obtained, and the system is seen to be very robust against recognition or typing errors.
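Simulating OCR noise by randomly changing a percentage of characters can be sketched in Python as follows. The sketch is an assumption-laden illustration rather than the procedure of the cited study: replacement characters are drawn uniformly from lowercase letters, spaces are never altered, and the term error rate is approximated as the fraction of word positions whose surface form changed.

```python
import random
import string

def corrupt(text, rate, rng):
    """Replace each non-space character with a random lowercase letter with probability `rate`."""
    out = []
    for ch in text:
        if ch != " " and rng.random() < rate:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)

def term_error_rate(reference, noisy):
    """Fraction of word positions that differ (an assumed, simplified definition).

    Because spaces are preserved, both texts have the same number of words.
    """
    ref, hyp = reference.split(), noisy.split()
    wrong = sum(r != h for r, h in zip(ref, hyp))
    return wrong / len(ref)

rng = random.Random(0)  # fixed seed so that the run is repeatable
text = "the company reported higher quarterly earnings and raised its dividend"
noisy = corrupt(text, 0.10, rng)
print(noisy)
print(term_error_rate(text, noisy))
```

Even a modest character-level change rate can produce a much higher term-level error rate, since a single wrong character is enough to turn a word into a different feature.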

Issues with categorizing OCRed documents are also discussed by many other authors (Brooks & Teahan, 2007), (Hoch, 1994) and (Taghva et al., 2001).

Categorization of ASRed Documents

Automatic Speech Recognition (ASR) is simply the process of converting an acoustic signal to a sequence of words. Researchers have proposed different techniques for speech recognition tasks based on Hidden Markov Models (HMM), neural networks and Dynamic Time Warping (DTW) (Trentin & Gori, 2001). The performance of an ASR system is typically measured in terms of Word Error Rate (WER), which is derived from the Levenshtein distance, working at the word level instead of the character level. WER can be computed as

\[
\mathrm{WER} = \frac{S + D + I}{N}
\]

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, and N is the number of words in the reference. Bahl et al. (Bahl et al., 1995) have built an ASR system and demonstrated its capability on benchmark datasets.
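As a minimal sketch, WER can be computed with the standard dynamic-programming formulation of the Levenshtein distance applied to words. The function name and the example sentences are illustrative assumptions, and the reference is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = minimum number of substitutions, deletions and insertions
    # needed to turn the first i reference words into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[-1][-1] / len(ref)

reference = "please change my billing address"
hypothesis = "please change billing a dress"
print(word_error_rate(reference, hypothesis))
```

In this example the minimum edit cost is 3 (S + D + I) against N = 5 reference words, giving a WER of 0.6. Note that WER can exceed 1 when the hypothesis contains many insertions.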

ASR systems give rise to word substitutions, deletions and insertions, while OCR systems produce essentially word substitutions. Moreover, ASR systems are constrained by a lexicon and can output only words belonging to it, while OCR systems can work without a lexicon (this corresponds to the possibility of transcribing any character string) and can output sequences of symbols that do not necessarily correspond to actual words. Such differences are expected to have a strong influence on the performance of systems designed for categorizing ASRed documents in comparison to the categorization of OCRed documents. A lot of work has appeared on automatic call type classification for the purpose of categorizing calls (Tang et al., 2003), call routing (Kuo and Lee, 2003; Haffner et al., 2003), obtaining call log summaries (Douglas et al., 2005), and agent assisting and monitoring (Mishne et al., 2005). Here calls are classified based on the transcription from an ASR system. One interesting study of the effect of ASR noise on text classification was done on a subset of the benchmark text classification dataset Reuters-21578 (Agarwal et al., 2007). The authors read out and automatically transcribed 200 documents and applied a text classifier trained on the clean Reuters-21578 training corpus. Surprisingly, in spite of the high degree of noise, they did not observe much degradation in accuracy.

Effect of Spelling Errors on Categorization

Spelling errors are an integral part of written text, electronic as well as non-electronic. Every reader of this topic must have been scolded by a teacher in school for spelling words wrongly! In this era of electronic text people have become less careful while writing, resulting in poorly written text containing abbreviations, short forms, acronyms and wrong spellings. Such electronic text documents, including emails, chat logs, postings and SMSs, are sometimes difficult to interpret even for human beings. It goes without saying that text analytics on such noisy data is a non-trivial task.

Wrong spellings can affect automatic classification performance in multiple ways, depending on the nature of the classification technique being used. In the case of statistical techniques, spelling differences distort the feature space. If both the training and the test corpus are noisy, the classifier will treat variants of the same word as different features while learning the model. As a result, the observed joint probability distribution will be different from the actual distribution. If the proportion of wrongly spelt words is high, the distortion can be significant and will hurt the accuracy of the resulting classifier. However, if the classifier is trained on a clean corpus and the test documents are noisy, then wrongly spelt words will be treated as unseen words and will not help in classification. In an unlikely situation, a wrongly spelt word present in a test document may coincide with a different valid feature and, worse, may be a valid indicative feature of a different class. A standard technique in the text classification process is feature selection, which happens after feature extraction and before training. Feature selection typically employs some statistical measure over the training corpus and ranks features in order of the amount of information (correlation) they have with respect to the class labels of the classification task at hand. After the feature set has been ranked, the top few features are retained (typically of the order of hundreds or a few thousand) and the others are discarded. Feature selection should be able to eliminate wrongly spelt words present in the training data provided (i) the proportion of wrongly spelt words is not very large and (ii) there is no regular pattern in the spelling errors. However, it has been observed that even at a high degree of spelling errors the classification accuracy does not suffer much (Agarwal et al., 2007).
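The ranking step of feature selection can be sketched in Python as follows. The text does not name a specific statistical measure; mutual information between word presence and the class label is used here as one common choice, which is an assumption, as are the function names and the toy corpus (which contains the misspelling "ofer").

```python
import math
from collections import Counter

def mutual_information(docs, labels, word):
    """Mutual information (in bits) between presence of `word` and the class label."""
    n = len(docs)
    joint = Counter((word in d, y) for d, y in zip(docs, labels))
    present = Counter(word in d for d in docs)
    classes = Counter(labels)
    mi = 0.0
    for (flag, y), count in joint.items():  # only non-zero cells, so no log(0)
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((present[flag] / n) * (classes[y] / n)))
    return mi

def select_features(docs, labels, k):
    """Rank the vocabulary by mutual information and keep the top k features."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(vocabulary,
                    key=lambda w: mutual_information(docs, labels, w),
                    reverse=True)
    return ranked[:k]

# Hypothetical training documents as sets of words.
docs = [{"free", "offer", "laptop"},
        {"free", "ofer", "prize"},
        {"free", "offer", "win"},
        {"meeting", "agenda"},
        {"meeting", "report"},
        {"meeting", "notes"}]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

top = select_features(docs, labels, 3)
print(top)
```

Because the misspelt form occurs only once, it carries little information about the class and falls below the cut-off, which corresponds to condition (i) above. A misspelling that recurs systematically in one class would score higher, which is why condition (ii) matters.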

Rule-based classification techniques are also negatively affected by spelling errors. If the training data contains spelling errors, then some of the rules may not attain the required statistical significance. Due to spelling errors present in the test data, a valid rule may not fire and, worse, an invalid rule may fire, leading to a wrong categorization. Suppose RIPPER has learnt a rule set like:

Assign category "sports" IF
(the document contains sports) OR
(the document contains exercise AND outdoor) OR
(the document contains exercise but not homework, exam) OR
(the document contains play AND rule) OR

A hypothetical test document containing repeated occurrences of exercise, but each time wrongly spelt as exarcise, will not be categorized to the sports category and hence lead to misclassification.
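The effect can be reproduced with a small Python sketch that encodes only the four clauses listed above; any further clauses of the rule set are not known and are not modelled. Reading "but not homework, exam" as "contains neither homework nor exam" is an assumption, as are the function name and the sample sentences.

```python
def is_sports(text):
    """Apply the four listed clauses of the hypothetical 'sports' rule set."""
    words = set(text.lower().split())
    return ("sports" in words
            or ("exercise" in words and "outdoor" in words)
            # Assumed reading: exercise present, neither homework nor exam present.
            or ("exercise" in words and not ({"homework", "exam"} & words))
            or ("play" in words and "rule" in words))

clean_doc = "daily exercise keeps you fit and exercise builds stamina"
noisy_doc = "daily exarcise keeps you fit and exarcise builds stamina"

print(is_sports(clean_doc))
print(is_sports(noisy_doc))
```

The clean document fires the third clause, while the misspelt version fires no clause at all, no matter how often the misspelt word is repeated.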

CONCLUSION

In this chapter we have looked at noisy text analytics. This topic is gaining in importance as more and more noisy data gets generated and needs processing. In particular, we have looked at techniques for classifying noisy text; correcting noisy text and extracting information from it are covered in the next chapter. We have presented a survey of existing techniques in the area and have shown that, even though it is a difficult problem, it is possible to address it with a combination of new and existing techniques.

KEY TERMS

Automatic Speech Recognition: Machine recognition and conversion of spoken words into text.

Data Mining: The application of analytical methods and tools to data for the purpose of identifying patterns, relationships or obtaining systems that perform useful tasks such as classification, prediction, estimation, or affinity grouping.

Information Extraction: Automatic extraction of structured knowledge from unstructured documents.

Noisy Text: Text with any kind of difference in the surface form from the intended, correct or original text.

Optical Character Recognition: Translation of images of handwritten or typewritten text (usually captured by a scanner) into machine-editable text.

Rule Induction: Process of learning, from cases or instances, if-then rule relationships that consist of an antecedent (if-part, defining the preconditions or coverage of the rule) and a consequent (then-part, stating a classification, prediction, or other expression of a property that holds for cases defined in the antecedent).

Text Analytics: The process of extracting useful and structured knowledge from unstructured documents to find useful associations and insights.

Text Classification (or Text Categorization): The task of learning models for a given set of classes and applying these models to new unseen documents for class assignment.