Translation of Natural Language Patterns to Object and Process Modeling

INTRODUCTION

To shorten the engineering period of complex information systems (ISs), model integration is needed in order to uniformly: (1) traverse all phases of the IS lifecycle (analysis, design, coding, testing), and (2) represent the object, functional, process and organizational models of the business domain.
The seamless integration of these four models has not yet found a satisfactory solution from the conceptual, notational, semantic and logical viewpoints. The existing tools direct designers toward object-oriented modeling, possibly combined with “use case” and “state transition” diagrams. However, the functional, process and organizational models are not fully integrated with the object models. Moreover, when agent-based technology is used for IS implementation, knowledge modeling and integration also become necessary.
The integration solutions proposed in important methodologies like UML (Unified Modeling Language) (OMG, 2003) and IDEF (KBSI, 2000) (for object, functional and process models) or the Workflow Reference Model (WfMC, 2003) (for organizational, process and functional models) mainly merge the models. Their conceptual integration is devolved to the developers of CASE tools or to the human designers and is accomplished during the coding phase. The existing methodologies, which usually rely on symbolic notation, do not provide a seamless and explicit (outside the code) integration of the object-like and activity-like symbols from the semantic and logical viewpoints.
The integration abilities of natural language (NL) follow from the observation that people use NL to describe any kind of information about objects, processes, information flows, the organization of their life and work, their knowledge, beliefs, intentions, rationale and so forth. The universality and syntactic stability of a linguistic model are expected to facilitate communication among distributed ISs and users.
In computational linguistics, the main objective is to resolve NL ambiguities (when more than one meaning is possible in a sentence) and to correctly identify the syntactic categories. Some NL analyzers build models for knowledge representation (logical models, semantic networks, frames, conceptual dependencies, conceptual graphs, etc.).
By contrast, the discourse in conceptual models (CMs) is considered unambiguous a priori and is expected to have a sound theoretical background. The objectives of NL analysis for model integration should be: (1) a linguistic theoretical foundation for the modeling interface, and (2) a uniform translation of NL to all types of CMs, without information loss. This article focuses on the second objective.

BACKGROUND

This section gives the technical reasons for research on the translation of natural language (NL) to conceptual models (CMs), mainly for the purpose of model integration, and reviews the state of the art, the basic concepts and the problems in this domain.
The complexity of information system (IS) representation comes from the complexity and diversity of the concepts it should integrate, defined in object, functional, process and organizational models. The integration of these models should be accomplished during the IS analysis or design phase.
These models result from the abstraction of the IS requirements, which the analysts express in NL. A comparison between the NL model (Allen, 1995; Sag, 1999) and the existing CMs reveals the conceptual relationships between NL and conceptual modeling. The most important relationships are:
• Between categories in CMs and NL. For example, between objects and nouns, activities and verbs, object attributes and adjectives, activity attributes and adverbs. Also, the noun and verb determiners/modifiers/substitutes in NL have counterparts in CMs.
• Between semantic relationships in CMs and NL. For example, between object aggregation/fragmentation and noun meronymy/holonymy; between object specialization/generalization and noun hyponymy/hypernymy; between functional composition/decomposition of processes and verb meronymy/holonymy, and so forth.
• Between syntactic roles in NL (subject, predicate, direct/indirect/prepositional object, complement, adverbial modifier) and primitives in CMs. For example, the predicate is represented by an activity/event; the subject in active voice is the object-like sender of the message or initiator of an activity; the direct, indirect, prepositional objects are object-like parameters in activity execution; and so forth.
• Between structures in NL and CMs. For example, the simple sentence in NL is represented by the activity signature (list of object-like parameters that participate in the activity execution); the complex sentences are represented by (sub)process diagrams, and so forth.
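As an illustration of the role-level correspondences listed above, the following minimal Python sketch maps an already-parsed simple sentence onto an activity signature. The role labels, the `ActivitySignature` class and the `sentence_to_signature` function are hypothetical names introduced here for illustration only; they do not come from any of the cited methodologies, and the sketch assumes that the syntactic analysis has been done beforehand.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical labels for the NL syntactic roles that become
# object-like parameters of an activity.
OBJECT_ROLES = ("direct_object", "indirect_object", "prepositional_object")


@dataclass
class ActivitySignature:
    """An activity plus the object-like participants in its execution."""
    activity: str                      # taken from the predicate
    initiator: Optional[str] = None    # taken from the active-voice subject
    parameters: List[Tuple[str, str]] = field(default_factory=list)


def sentence_to_signature(roles: Dict[str, str]) -> ActivitySignature:
    """Translate the roles of one simple sentence into an activity signature."""
    if "predicate" not in roles:
        raise ValueError("a simple sentence needs a predicate")
    signature = ActivitySignature(
        activity=roles["predicate"],
        initiator=roles.get("subject"),
    )
    # Keep the role name next to each parameter so that the syntactic
    # information is not lost in the translation.
    for role in OBJECT_ROLES:
        if role in roles:
            signature.parameters.append((role, roles[role]))
    return signature


# "The clerk sends the customer an invoice."
parsed = {
    "subject": "clerk",
    "predicate": "send",
    "direct_object": "invoice",
    "indirect_object": "customer",
}
signature = sentence_to_signature(parsed)
```

Here the predicate becomes the activity, the active-voice subject becomes its initiator, and the remaining objects become parameters. A complex sentence would need several such signatures linked into a (sub)process, which this sketch does not attempt.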
More recently, the co-references between concepts in two or more sentences, as well as the ellipses in NL, have also been represented in CMs.
This similarity has led researchers to believe that conceptual modeling could be as powerful as NL for representing reality. However, the translation of NL patterns into a CM that integrates the four models is still an open problem.
NL-CM translation has been attempted in several research domains: linguistic interpretation of the models (mainly entity-relationship and object-oriented models), semantic integration of conceptual schemas, modeling of the systems’ dynamics, human-computer interaction, requirements engineering, organization modeling, knowledge representation, formal ontologies and their application to Web search, business communication modeling based on speech act theory, and so forth.
The most important results have been obtained for the translation of NL to object and event models. Among the NL-oriented representations of these models, the most important are the functional grammar (FG) and the semantic networks (especially, conceptual dependencies and conceptual graphs). They propose the representation of the object models by syntactic categories and rules.
Functional grammar represents the functional aspects of NL by describing and classifying predicate frames. FG has been further used for defining CPL (conceptual prototyping language), which focuses on simple NL sentences (the intersentential relations are not explicitly revealed). Also, CPL does not address general semantic relationships inside the lexical categories (e.g., noun/verb synonymy, antonymy, homonymy, etc.). COLOR-X (Riet, 1998) can be considered the most important application of CPL and the most important practical result for CMs’ integration. It integrates static object and event models of information and communication systems, abstracted from their textual descriptions. It relies on OMT (Object Modeling Technique). As in any object-oriented model, the processes merely trace events that compose scenarios, similar to use case diagrams. More recently, textual requirements have been transformed into UML schemata (e.g., Fliedl, 2000).
The conceptual dependencies and, more recently, the conceptual graphs (CG) (Sowa, 2000) are other linguistic representations for CMs. The syntactic categories are suggested by the roles they play with respect to each other (meaning relationships between the nouns and the verb that governs them in a simple sentence, e.g., agent, patient, instrument, recipient, location, time, source, destination, etc.). A similar representation is the frame description in FrameNet (Filmore, 2002). However, all these representations are data-centric (the ISs’ dynamic behaviour is not an important modeling goal).
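To make the role-based representation concrete, the following minimal Python sketch stores the relationships between a verb and the nouns it governs as triples and prints them in a form loosely modeled on the linear notation of conceptual graphs. The role abbreviations, the function names and the example sentence are assumptions made for this illustration and are not taken from the cited works.

```python
from typing import Dict, List, Tuple

# Hypothetical abbreviations for a few of the roles named in the text.
ROLE_LABELS = {
    "agent": "Agnt",
    "patient": "Ptnt",
    "instrument": "Inst",
    "recipient": "Rcpt",
}

Triple = Tuple[str, str, str]  # (verb concept, role label, noun concept)


def role_triples(verb: str, fillers: Dict[str, str]) -> List[Triple]:
    """Link each noun to its governing verb through the role it plays."""
    return [(verb, ROLE_LABELS[role], noun) for role, noun in fillers.items()]


def linear_form(triples: List[Triple]) -> List[str]:
    """Render each triple as a '[Verb] -> (Role) -> [Noun]' line."""
    return [f"[{verb}] -> ({role}) -> [{noun}]" for verb, role, noun in triples]


# "The clerk sends the customer an invoice."
triples = role_triples(
    "Send",
    {"agent": "Clerk", "patient": "Invoice", "recipient": "Customer"},
)
lines = linear_form(triples)
```

The structure records which concepts take part in the action and in which role, but it carries no ordering, conditions or control flow. This is one way to see the data-centric character of such representations.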
The translation of NL to organizational models (workflows) has been achieved mainly with respect to the modeling of business communication (e.g., Steuten, 2000). The theory of communicative actions and, more recently, speech act theory (Johannesson, 2001) are the main linguistic representations of the communication aspects in ISs.
The globalization of the organizations has a great impact on IS representation, especially with respect to the common vocabularies and the interoperability between distributed and heterogeneous applications. In this context, the conceptual modeling must step into a new era and intersect a new field: formal ontologies. A first benefit from ontologies for IS representation is that they describe, categorize and constrain concepts and relationships at the development time (Guarino, 2000). Using CMs, the constraints are basically imposed at run time. Another benefit is that the ontology specification is outside the code, while many object-oriented modeling specifications (especially constraints) are implemented inside the code.
Unfortunately, for the conceptual integration and for the representation of the ISs’ dynamics, the existing ontologies have the same limits as the CMs. Most of them are object-oriented, relying on the OKBC (open knowledge base connectivity) specification. For building business process ontologies, PSL (process specification language) (initiated by NIST) is recommended (e.g., Gruninger, 2003), mainly because it can be logically integrated with KIF (knowledge interchange format), which is appropriate for object and knowledge description and exchange.
For ontology integration, two alternatives can be considered: (1) an upper-level ontology, able to represent all aspects of the real world, or (2) a translation and correlation algorithm between the concepts and rules in different ontologies. Such an algorithm is mostly embedded in the code, whereas ontology integration should preferably be accomplished at development time. So, the first alternative appears to be the better solution. For the conceptual integration of all aspects of the real world (and, implicitly, of ISs), upper-level linguistic ontologies have been proposed (see next section), with benefits for ontology integration as well.

MAIN THRUST OF THE ARTICLE

This section first gives the limits of the existing research on natural language (NL) translation to conceptual models (CMs) and, more recently, to ontologies. Then, the representation of the linguistic meta-models or upper-level ontologies is briefly analyzed from the syntactic and semantic viewpoints.
The most important limits of the existing translations of NL patterns to object and process modeling, and the basic open problems, are:
• NL translation only deals with the object-oriented aspects of the real world (RW). The functional, process and organizational aspects are not retrieved or are not well integrated in the translation results. Also, the correlations between the concepts resulting from translation are usually confined to the inter-object relationships. The inter-activity relationships and the system’s dynamic behaviour are not properly treated.
• NL translation of different aspects of RW is accomplished with different software tools and results in different NL-oriented representations (languages). For example, the object-oriented aspects are translated into functional grammar structures or semantic networks, and the communication aspects are translated into communicative acts (proposed in speech act theory). The resulting ISs’ representation is still not integrated from the conceptual, notational, semantic and logical viewpoints. The only benefits of the translation are the automated abstraction of RW from textual requirements and a user interface closer to human reasoning and understanding, because it is closer to NL.
• The concepts resulting from NL translation cannot be used to express coherent and unambiguous ideas on any aspect of RW and ISs, mainly because these concepts are heterogeneously represented (see the previous point). Brainstorming is becoming important in business management (several arguments are given in Galatescu & Greceanu, 2002). Today, its automation for a virtual organization is confined to the exchange of ideas in NL, by electronic mail or chat, which substantially increases the virtual traffic. Communication by structures of ideas or modeling structures, issued during brainstorming sessions, and the automated comparison of these structures, must be further objectives of NL-CM translation.
Other reasons that make it difficult to express modeling ideas coherently using NL translation results are: (a) the intersentential relations in complex sentences and many semantic relations in NL are usually not considered; (b) the resulting concepts representing objects and processes are not stable and general enough. Also, many of them are not known at development time.
• The logical consistency of the IS representation resulting from NL translation is verified with difficulty, and only incompletely, at development time, mainly because of the varied representations and, implicitly, formalizations of the different aspects of ISs. This consistency is a precondition for further automatic reasoning, simultaneously on objects, processes, workflows, and so forth.
The logical consistency of the results of both NL-object model and NL-process/workflow model translation is generally proved using first-order or higher-order logic or their variants (especially sorted, propositional, modal and temporal logics). The logical consistency of the resulting process models is incompletely verified at development time, because the logic of the procedural aspects cannot be represented with these formalisms.
• The meaning of the objects/activities and of the relationships between them in the models resulting from NL translation is usually incomplete and is not represented outside the code. This limit affects the epistemological aspects of ISs, which are increasingly conceived as multi-agent systems. NL translation is mainly concerned with NL syntax and only partially with its semantics.
The linguistic meta-models or the upper-level linguistic ontologies have recently emerged as a solution to the model (or ontology) integration. Two research directions are important for their representation:
• The abstraction of NL semantics, using a predefined taxonomy of universal types of objects, processes, activities and so forth (and relationships among them). This taxonomy is supposed to allow the subsumption (from the semantic point of view) of the words found in NL expressions, belonging to any category (noun, verb, adjective, adverb).
• The abstraction of NL syntax, using predefined rules for building sentence-like structures, that stylize the NL (simple, compound or complex) sentences and comply with NL syntax. These structures are supposed to be used for the unambiguous description of any type of object, process, activity and so forth, in the modeled reality, as well as for the representation of coherent and unambiguous ideas about them.
These two directions are complementary and should both be considered in the definition of a linguistic meta-model or ontology.
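The first direction, subsumption of words under a predefined taxonomy, can be sketched in a few lines of Python. The taxonomy below is a toy example whose type names are invented for this illustration; it does not reproduce any of the taxonomies proposed in the literature, and the `ancestors` and `subsumes` helpers are hypothetical.

```python
from typing import Dict, List, Optional

# Toy taxonomy: each type points to its more general parent.
# The type names are invented for this example.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "object": "entity",
    "process": "entity",
    "document": "object",
    "invoice": "document",
    "activity": "process",
    "send": "activity",
}


def ancestors(concept: str) -> List[str]:
    """Return the chain of increasingly general types above a concept."""
    chain = []
    parent = PARENT[concept]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def subsumes(general: str, specific: str) -> bool:
    """True if `general` is `specific` itself or one of its ancestors."""
    return general == specific or general in ancestors(specific)
```

With this structure, a noun such as "invoice" and a verb such as "send" are both placed under universal types ("object" and "process" respectively), so that words of different categories can be compared inside a single hierarchy.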
From the semantic viewpoint, there are several proposals for taxonomies of the concepts in RW. They are compared in Bateman (2003). Among the linguistically oriented taxonomies, the most important are those proposed in Oltramari (2002) and Sowa (2000).
From the syntactic viewpoint, the limits (emphasized in the previous section) of the functional grammar, conceptual dependencies and conceptual graphs, used today for NL translation to CMs, call for an improvement with respect to model integration. Based on activity-oriented conceptual graphs, a new linguistic representation is proposed in Galatescu (2001, 2002), mainly for the purpose of model integration. The same representation has been used as an upper-level linguistic ontology for the integration of three ontologies (re-engineering, domain and communication ontologies) and for the expression of ideas (Galatescu & Greceanu, 2002).

FUTURE TRENDS

Given the limits and existing solutions sketched in the previous sections, the expected results of future research on natural language (NL) translation to object and process modeling should be:
• the translation from textual requirements to a uniform representation of all aspects in information systems (ISs), both externally (during the analysis and design phases) and internally (during the coding phase);
• the unification of all existing proposals for NL abstraction from the semantic point of view and the standardization of a unique universal taxonomy for all concepts in the real world and NL;
• the automatic subsumption of the words found in the textual requirements (and belonging to any syntactic category) inside a universal taxonomy of the concepts in the real world;
• ontology-driven modeling (i.e., based on concepts in a minimal upper-level ontology) (Guarino, 2000), expected to support the globalization of the modeling activities and the completeness of modeling at IS analysis or design time;
• the integration of the object, functional, process, organizational and epistemological views on organizations by means of linguistic meta-models or upper-level ontologies;
• the linguistic representation of the epistemological and communication aspects in IS multi-agent architectures. This representation should be integrated with the linguistic models for objects, processes, workflows and so forth; and
• the multi-lingual representation of ISs. In this case, a problem in the abstraction of NL syntax will arise and must be solved, due to the differences in the syntactic rules of various languages.
The expected results enumerated previously will have an important impact on the IS lifecycle and on the developers’ performance.
With respect to the future technological means and user interface for the linguistic conceptual modeling, new aspects should be considered:
• the interfaces of the CASE tools should become natural, close to human understanding. This means a shift from the exclusive symbolic notation to a notation that abstracts NL syntax and semantics;
• the natural language processing (NLP) technology should be more deeply involved in the resolution of the problems in the conceptual modeling, as a basic translation technology; and
• the future CASE tools should involve practical and theoretical results from other domains, at least from NLP and linguistic ontologies.

CONCLUSION

This article has motivated the research on natural language translation to conceptual models and has reviewed the main limits of the existing models, the open problems and the expected theoretical and practical results (seen as future research trends).
The presentation has focused on the need for the conceptual integration of the existing models and for their representation outside the code. In this respect, as a requirement imposed by the globalization of the organizations, a recent trend was mentioned: the use of linguistic ontologies, able to represent all aspects of the real world (in particular, of any type of organization).
The limits and expectations enumerated for both domains (linguistic modeling and linguistic ontologies) lead to the conclusion that these domains are in an incipient stage, far from results that can be standardized and used in commercial products. The convergence and the integration of research activities and results in several fields (at least, conceptual modeling, NLP, and ontologies) appear necessary for obtaining, as soon as possible, the expected results.

KEY TERMS

Business Process: The sequence of activities, the people and the technology involved in carrying out some business or achieving some desired results in an organization.
Category in NL: Noun, verb, adjective, adverb. Nouns are described by noun head, substitute, determiner, and modifier. Verbs are described by verb determiner/modifier.
Conceptual Model: Abstraction of the real world/domain and a mechanism for understanding and representing organizations and the information systems that support them. The most important types of models are:
• object model: describes objects by data and operations on the data. The object’s identity encapsulates its state (attributes and relationships with other objects) and its behaviour (allowed operations on/with that object);
• process model: describes (sub)processes by the activities they involve, the activity order, decision points, pre-/post-conditions for the activity execution;
• functional model: describes the information flow and transformation, as well as the constraints and functional dependencies among the activities in a process;
• organizational model: describes the workflow (activities for the creation and movement of the documents) within an organization, the people’s roles and the communication among people for performing the activities.
Object: An entity (e.g., “person”) or a value (e.g., “phone number”). In object-oriented models, it is the instance of a class.
Ontology: Hierarchical structuring of knowledge about things by subcategorizing them according to their essential (or at least relevant and/or cognitive) qualities. Practically, a vocabulary used to describe a certain reality, plus a set of explicit assumptions regarding the intended meaning of the concepts in the vocabulary. The relations between concepts allow inferences (e.g., information interpretation and the derivation of new information/knowledge). The explicit axioms allow the approximation of the term meaning and the validation of the ontology specification at the development time.
Semantic Relationship in NL: Relation between the words in the same category (mainly, noun and verb category). Some relationships are bi-directional. The most important relationships are: noun or verb meronymy/holonymy, hyponymy/hypernymy, synonymy, antonymy, homonymy; verb troponymy or entailment; cause-effect relations between verbs (see the glossary of terms for WordNet, at http://www.cogsci.princeton.edu/~wn/man/wngloss.7WN.html).
Sentences in NL: Simple sentence: built around the static or dynamic action of the verb; compound sentence: joins independent simple sentences by conjunctions; complex sentence: composed of subordinated sentences (subclauses) correlated to a main sentence (clause).
Structure in NL: Sentence (simple, compound, complex); phrase (e.g., verb, noun, adverb, prepositional, adjectival phrase); clause (noun, adverbial, relative, appositive).