(RJ) Process and the Candidate Features Finding (CFF) Process. In the preprocessing
step, we apply stemming [22] and TreeTagger [23] to the input documents and then
split the documents into sentences. The Relationship Judgment step extracts the
sentences that are most distinctive. Transition, related, positive and negative words
have a stronger influence on judging contradictions in patent documents. A sentence
that relates to at least one of these important word sets is defined as a strong
sentence. We design an algorithm named Verb Including Split and Associate Termsets
(VISAT), which is part of the CFF Process, to generate more meaningful termsets and
to find candidate features in documents. The details of the CFF process and the
VISAT algorithm are given in Section 3.2.
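To make the strong-sentence definition concrete, the following is a minimal sketch of such a filter. The word lists, the naive sentence splitter and the rule "a sentence relates to a word set if it contains one of its words" are all assumptions made for illustration; the actual word sets and the stemming and TreeTagger preprocessing are not reproduced here.

```python
import re

# Hypothetical word sets (assumption); the real lists are not given in this section.
IMPORTANT_WORD_SETS = {
    "transition": {"but", "however", "although", "whereas"},
    "related": {"increase", "decrease", "cause", "affect"},
    "positive": {"improve", "enhance", "better"},
    "negative": {"worsen", "degrade", "reduce"},
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_strong(sentence):
    """A sentence is treated as strong if it shares a word with any word set."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return any(tokens & words for words in IMPORTANT_WORD_SETS.values())


def strong_sentences(text):
    """Return the strong sentences of a document, in their original order."""
    return [s for s in split_sentences(text) if is_strong(s)]
```

In a fuller pipeline the tokens would be stemmed first, so that inflected forms such as "increases" also match the word sets.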
All of the other blocks are mainly used to classify the contradictions in testing
patent documents based on Engineering Parameters. As shown in Fig. 1, the Most
Similar Document Extraction is the first layer of classification, which extracts the
most similar training document. If such a training document can be extracted, the
classes of that training document are assigned to the testing document.
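A first-layer lookup of this kind could be sketched as follows. The similarity measure (cosine similarity over raw term counts), the `threshold` parameter and the class labels in the usage example are assumptions; this section does not state how similarity is measured or when a document counts as extractable.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar_classes(test_tokens, training, threshold=0.8):
    """Return the classes of the most similar training document, or None.

    training: list of (tokens, classes) pairs.
    threshold: assumed cut-off below which no document is extracted.
    """
    test_vec = Counter(test_tokens)
    best_score, best_classes = 0.0, None
    for tokens, classes in training:
        score = cosine(test_vec, Counter(tokens))
        if score > best_score:
            best_score, best_classes = score, classes
    return best_classes if best_score >= threshold else None
```

For example, with the hypothetical training set `[(["weight", "strength", "steel"], {"P1", "P14"}), (["speed", "noise"], {"P9"})]`, a testing document with the tokens `["weight", "strength", "steel"]` receives the classes `{"P1", "P14"}`, while a document sharing no terms with any training document receives `None` and would be left to the later layers.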
The Termset-based Classification is the second layer of classification. It is a
rule-based classifier that tries to find training termset rules that match the
termsets in the testing document. If some termset rules successfully match the
termsets in the testing document, the class labels in these rules are assigned to
the testing document.
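The rule matching in this layer can be illustrated with a short sketch. It assumes that a rule is a pair (termset, class label) and that a rule matches when its termset is a subset of some termset of the testing document; the exact matching criterion is not spelled out in this section, and the labels below are placeholders.

```python
def classify_by_termset_rules(test_termsets, rules):
    """Collect the labels of all rules whose termset is contained
    in at least one termset of the testing document.

    test_termsets: iterable of sets of terms (for example one per sentence).
    rules: iterable of (frozenset_of_terms, class_label) pairs.
    """
    test_termsets = [set(ts) for ts in test_termsets]
    labels = set()
    for rule_terms, label in rules:
        if any(rule_terms <= ts for ts in test_termsets):
            labels.add(label)
    return labels


# Hypothetical rules and testing termsets, for illustration only.
rules = [
    (frozenset({"weight", "increase"}), "P1"),
    (frozenset({"strength", "improve"}), "P14"),
    (frozenset({"noise", "reduce"}), "P31"),
]
test_termsets = [
    {"weight", "increase", "housing"},
    {"strength", "improve", "steel"},
]
matched = classify_by_termset_rules(test_termsets, rules)
```

Here `matched` contains the labels of the first two rules only, because no termset of the testing document contains both "noise" and "reduce".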
The Weaker Pattern Based Classification is the third layer of classification and is
also a rule-based classifier. It is very similar to the second layer, but it only
judges whether the patent documents belong to some very frequent classes, using the
sequential-termset rules and the one-word-termset rules.
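One possible reading of this third layer is sketched below. It assumes that a sequential-termset rule matches when its terms occur in the given order (not necessarily adjacent) within a sentence, that a one-word-termset rule matches when its word occurs in a sentence, and that only rules for a given set of frequent classes are considered. All three points are assumptions, as are the function names and labels.

```python
def is_subsequence(pattern, tokens):
    """True if the terms of pattern appear in tokens in the same order."""
    remaining = iter(tokens)
    # Each `term in remaining` consumes the iterator up to the match,
    # so later terms must occur after earlier ones.
    return all(term in remaining for term in pattern)


def weak_pattern_labels(sentences, seq_rules, word_rules, frequent_classes):
    """Assign frequent-class labels using sequential and one-word rules.

    sentences: list of token lists.
    seq_rules: list of (tuple_of_terms, label) pairs.
    word_rules: list of (word, label) pairs.
    frequent_classes: set of labels this layer is allowed to assign.
    """
    labels = set()
    for tokens in sentences:
        for pattern, label in seq_rules:
            if label in frequent_classes and is_subsequence(pattern, tokens):
                labels.add(label)
        for word, label in word_rules:
            if label in frequent_classes and word in tokens:
                labels.add(label)
    return labels
```

Restricting the output to `frequent_classes` reflects the statement that this layer only decides membership in very frequent classes; rarer classes would be left to the earlier layers.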
After all of the above processes have run, the possibly conflicting Engineering
Parameters have been found. The final process of MCIVC, named Contradiction
Judgment, is then performed to classify the type of technical contradiction in the
testing patents.
This type of dataset has some challenging properties: the amount of data is very
limited, the distribution is imbalanced, and the data are partially labeled or
incomplete. Because of these properties, the most commonly used method,
bag-of-words, cannot extract sufficiently discriminative features, and some
classification methods such as the SVM are not directly suitable for these datasets.
Therefore, we propose the VISAT algorithm to find more meaningful termset features,
and we combine VISAT with the knowledge base and the rule-based classifiers, which
consider the semantic relationships among terms, to classify patent contradictions
based on Engineering Parameters.
3.2 Candidate Features Finding Process (CFF Process) and the VISAT Algorithm
The Candidate Features Finding process (CFF process) is used to find candidate
features. It generates two types of features: the TFIDF-type vectors of the set of
sentences and the candidate termsets of each sentence. As shown in Fig. 2, the
inputs of the CFF process include the strong sentences and all sentences in the
training and testing documents.
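As an illustration of the first feature type, the sketch below computes a TF-IDF vector per sentence. The weighting variant (relative term frequency multiplied by log(N / df), with each sentence treated as one "document") is an assumption; the exact TFIDF formulation is not specified in this section.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Compute one sparse TF-IDF vector (a dict) per tokenized sentence.

    tf  = term count / sentence length
    idf = log(number of sentences / number of sentences containing the term)
    """
    n = len(sentences)
    # Document frequency: in how many sentences each term occurs.
    df = Counter(term for tokens in sentences for term in set(tokens))
    vectors = []
    for tokens in sentences:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return vectors
```

With this variant, a term that occurs in every sentence receives weight zero, so only terms that distinguish sentences from one another contribute to the vectors.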