Natural Language Processing and Biological Methods (Artificial Intelligence)

INTRODUCTION

During the 20th century, biology, and molecular biology in particular, became a pilot science, and many disciplines formulated their theories on models taken from it. Computer science has become almost a bio-inspired field thanks to the great development of natural computing and DNA computing.

Interactions between linguistics and biology, by contrast, were not frequent during the 20th century. Nevertheless, because the genetic code has been regarded as “linguistic”, molecular biology has borrowed several models from formal language theory in order to explain the structure and working of DNA. Such attempts have focused on the design of grammar-based approaches that define a combinatorics of protein and DNA sequences (Searls, 1993). The linguistics of natural language has also contributed to this field through Collado (1989), who applied generativist approaches to the analysis of the genetic code.

On the other hand, and strictly out of theoretical interest, several attempts have been made to establish structural parallelisms between DNA sequences and verbal language (Jakobson, 1973; Marcus, 1998; Ji, 2002). However, there is a lack of theory that attempts to explain the structure of human language from the results of the semiosis of the genetic code. This is probably the only arrow that remains incomplete in order to close the path between computer science, molecular biology, biosemiotics and linguistics.

Natural Language Processing (NLP), a subfield of Artificial Intelligence that concerns the automated generation and understanding of natural languages, can take great advantage of the structural and “semantic” similarities between those codes. Specifically, by taking the systemic code units and the methods of combination of the genetic code, the methods of genetics can be translated to the study of natural language. NLP could therefore become another “bio-inspired” science by means of theoretical computer science, which provides the theoretical tools and formalizations necessary for such an exchange of methodology.

In this way, we obtain a theoretical framework where biology, NLP and computer science exchange methods and interact, thanks to the semiotic parallelism between the genetic code and natural language.

BACKGROUND

Most current approaches to natural language show several facts that invite the search for new formalisms that account for natural languages in a simpler and more natural way. Two main facts lead us to look for a more natural computational system to give a formal account of natural languages: a) natural language sentences cannot be placed in any of the families of the Chomsky hierarchy (Chomsky, 1956) on which current computational models are largely based, and b) the rewriting methods used in a large number of natural language approaches seem not to be very adequate, from a cognitive perspective, to account for the processing of language.

If we add to these (1) that languages generated following a molecular computational model are placed in between the Context-Sensitive and Context-Free families; (2) that the genetic model offers simpler alternatives to rewriting rules; and (3) that genetics is a natural informational system, as natural language is, we have the ideal scenario for proposing biological models in NLP.

The idea of using biological methods in the description and processing of natural languages is backed up by a long tradition of interchanging methods in biology and natural/formal language theory:

1. Results and methods in the field of formal language theory have been applied to biology:

(1) Pawlak (1965) dependency grammars as an approach in the study of protein formation; (2) transformational grammars for modeling gene regulations (Collado, 1989); (3) stochastic context-free grammars for modeling RNA (Sakakibara et al., 1994); (4) definite clause grammars and cut grammars to investigate gene structure and mutations and rearrangement in it (Searls, 1989); (5) tree-adjoining grammars for predicting RNA structure of biological data (Uemura et al., 1999).

2. Natural languages as models for biology: (1) Watson (1968) understanding of heredity as a form of communication; (2) Asimov (1968) idea that nucleotide bases are letters and they form an alphabet; (3) Jacob (1970) consideration that the sense of the genetic message is given by the combination of its signs in words and by the arrangement of words in phrases; (4) Jakobson (1970) ideas about taking the nucleotide bases as phonemes of the genetic code or about the binary oppositions in phonemes and in the nucleic code.

3. Biological ideas in linguistics: (1) the “tree model” proposed by Schleicher (1863); (2) the “wave model” due to Schmidt (1872); (3) the “geometric network model” proposed by Forster (1997); or (4) the naturalistic metaphor in Linguistics defended by Jakobson (1970, 1973).

4. Using DNA as a support for computation is the basic idea of Molecular Computing (Paun et al., 1998). Speculations about this possibility can be found in Feynman (1961), Bennett (1973) and Conrad (1995).

BIOLOGICAL METHODS IN NLP

Here, we present an overview of different bio-inspired methods that in recent years have been successfully applied to several NLP issues, from syntax to pragmatics. Those methods are taken mainly from computer science and are basically the following: DNA computing, membrane computing and networks of evolutionary processors.

DNA Computing

One of the most developed lines of research in natural computing is so-called molecular computing, a model based on molecular biology, which arose mainly after Adleman (1994). An active area in molecular computing is DNA computing (Paun et al., 1998), inspired by the way that DNA performs operations to generate, replicate or change the configuration of the strings.

Application of molecular computing methods to natural language syntax gives rise to molecular syntax (Bel-Enguix & Jimenez-Lopez, 2005a). Molecular syntax takes as a model two types of mechanisms used in biology in order to modify or generate DNA sequences: mutations and splicing. Mutations refer to changes performed in a single linguistic string, whether a phrase, a sentence or a text. Splicing is a process that involves two or more linguistic sequences. It is a good framework for approaching syntax from both the sentential and the dialogical perspective.

Methods used by molecular syntax are based on basic genetic processes: cut, paste, delete and move. By combining these elementary rules, most of the complex structures of natural language can be obtained with a high degree of simplicity.
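The four elementary operations can be illustrated with a minimal Python sketch. It assumes that a linguistic string is represented as a list of word tokens and that positions are given as list indices; the function signatures and the example sentence are hypothetical choices made for illustration and are not taken from the molecular syntax literature.

```python
from typing import List, Tuple

Tokens = List[str]  # assumption: a linguistic string is a list of word tokens


def cut(seq: Tokens, start: int, end: int) -> Tuple[Tokens, Tokens]:
    """Remove seq[start:end]; return (remainder, removed fragment)."""
    return seq[:start] + seq[end:], seq[start:end]


def paste(seq: Tokens, fragment: Tokens, pos: int) -> Tokens:
    """Insert fragment into seq at index pos."""
    return seq[:pos] + fragment + seq[pos:]


def delete(seq: Tokens, start: int, end: int) -> Tokens:
    """Discard seq[start:end] and keep only the remainder."""
    remainder, _ = cut(seq, start, end)
    return remainder


def move(seq: Tokens, start: int, end: int, pos: int) -> Tokens:
    """Cut seq[start:end] and paste it at index pos of the remainder."""
    remainder, fragment = cut(seq, start, end)
    return paste(remainder, fragment, pos)


sentence = "the cat chased the mouse yesterday".split()

# Fronting the adverb: a single move derives a different word order.
fronted = move(sentence, 5, 6, 0)

# Dropping the adverb: a single delete derives a shorter sentence.
shortened = delete(sentence, 5, 6)

print(" ".join(fronted))
print(" ".join(shortened))
```

In this sketch move is defined as a cut followed by a paste, which reflects the idea that complex rearrangements are obtained by combining the elementary rules.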

This approach is a test of the generative power of splicing for syntax. According to the results achieved, splicing seems quite powerful for generating, in a very simple way, most of the patterns of traditional syntax. Moreover, the new perspectives and results it provides could transform the general perspective of syntax.
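Splicing, as defined in the Key Terms section, splits two strings and recombines their halves. The following Python sketch illustrates direct and inverse splicing on token lists. The function name splice and the explicit cut positions i and j are assumptions made for illustration; splicing systems in the formal language literature normally determine the cut points through rules, which this sketch omits.

```python
from typing import List, Tuple

Tokens = List[str]  # assumption: a linguistic string is a list of word tokens


def splice(x: Tokens, i: int, y: Tokens, j: int) -> Tuple[Tokens, Tokens]:
    """Split x at index i and y at index j, then recombine the halves.

    direct:  left side of x followed by right side of y
    inverse: left side of y followed by right side of x
    """
    direct = x[:i] + y[j:]
    inverse = y[:j] + x[i:]
    return direct, inverse


first = "the dog barks loudly".split()
second = "a bird sings".split()

# Both sentences are cut between the subject and the predicate.
direct, inverse = splice(first, 2, second, 2)

print(" ".join(direct))
print(" ".join(inverse))
```

With the cut placed between subject and predicate, the two results exchange the predicates of the original sentences, which shows how one operation on two sequences yields new well-formed strings.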

On this basis, we think that bio-NLP, applied in a methodological and clear way, is a powerful and simple model that can be very useful to a) formulate systems capable of generating most of the structures of language, and b) define a formalization that can be implemented and may be able to describe and predict the behavior of natural language structures.

Membrane Computing

Membrane Systems (MS) (Paun, 2000) are models of computation inspired by some basic features of biological membranes. They can be viewed as a new paradigm in the field of natural computing based on the functioning of membranes inside the cell. MS can be used as generative, computing or decidability devices. This computing model has several intrinsically interesting features, such as the use of multisets, the inherent parallelism in its evolution and the possibility of devising computations that can solve exponential problems in polynomial time. This framework provides a powerful tool for formalizing any kind of interaction, both among agents and between agents and their environment. One of the key ideas of MS is that generation is made by evolution. Therefore, most evolving systems can be formalized by means of membrane systems.
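The use of multisets and maximally parallel rule application can be illustrated with a minimal Python sketch of one evolution step. It is restricted, as a simplifying assumption, to a single region without nested membranes, communication or dissolution. The names Rule, applicable and step are hypothetical, and every rule is assumed to consume at least one object.

```python
import random
from collections import Counter
from typing import List, Tuple

# Assumption: a rule is a pair (consumed multiset, produced multiset),
# and the consumed multiset is never empty.
Rule = Tuple[Counter, Counter]


def applicable(rule: Rule, pool: Counter) -> bool:
    """A rule can fire if the pool still holds every object it consumes."""
    consumed, _ = rule
    return all(pool[obj] >= n for obj, n in consumed.items())


def step(region: Counter, rules: List[Rule], rng: random.Random) -> Counter:
    """Perform one maximally parallel, nondeterministic step in one region."""
    pool = Counter(region)  # objects still available during this step
    produced = Counter()    # products only become usable in the next step
    while True:
        candidates = [r for r in rules if applicable(r, pool)]
        if not candidates:
            break  # maximality: no further rule can be applied
        consumed, created = rng.choice(candidates)  # nondeterministic choice
        pool -= consumed
        produced += created
    return pool + produced


rules = [
    (Counter("a"), Counter("bb")),  # a -> bb
    (Counter("b"), Counter("c")),   # b -> c
]
region = Counter({"a": 3, "b": 1})

result = step(region, rules, random.Random(0))
print(sorted(result.items()))
```

Products are kept apart from the pool until the step ends, so an object created in a step cannot be consumed in that same step. In this example the two rules do not compete for objects, so the outcome does not depend on the random choices.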

Linguistic Membrane Systems (LMS) (Bel-Enguix & Jimenez-Lopez, 2005b) aim to model linguistic processes, taking advantage of the flexibility of MS and their suitability for dealing with some fields where contexts are a central part of the theory. LMS can be easily adapted to deal with different aspects of the description and processing of natural languages. The most developed applications of LMS are semantics and dialogue.

MS are a good framework for developing a semantic theory because they are evolving systems by definition, in the same sense that we take meaning to be a dynamic entity. Moreover, MS provide a model in which contexts, either isolated or interacting, are an important element which is already formalized and can give us the theoretical tools we need. Semantic membranes may be seen as an integrative approach to semantics coming from formal languages, biology and linguistics. Taking into account results obtained in the field of computer science as well as the naturalness and simplicity of the formalism, it seems the formalization of contexts by means of membranes is a promising area of research for the future. Examples of application of MS to semantics can be found in Bel-Enguix and Jimenez-Lopez (2007).

A topic where context and interaction among agents are essential is the field of dialogue modeling and its applications to the design of effective and user-friendly computer dialogue systems. Taking a pragmatic perspective on dialogue, and based on speech acts, multi-agent theory and dialogue games, Dialogue Membrane Systems have arisen as an attempt to compute speech acts by means of MS. Considering membranes as agents, and domains as a personal background and linguistic competence, the application to dialogue is almost natural, and simple from the formal point of view. For examples of this application see Bel-Enguix and Jimenez-Lopez (2006b).

NEPs - Networks of Evolutionary Processors

Networks of Evolutionary Processors (NEPs) are a new computing mechanism directly inspired by the behavior of cell populations. Every cell is described by a set of words (DNA) evolving by mutations, which are represented by operations on these words. At the end of the process, only the cells with correct strings will survive. In spite of the biological inspiration, the architecture of the system is directly related to the Connection Machine (Hillis, 1985) and the Logic Flow paradigm (Errico et al., 1994). Moreover, the global framework for the development of NEPs has to be completed with the biological background of DNA computing (Paun et al., 1998), membrane computing (Paun, 2000) and, especially, with grammar systems (Csuhaj-Varju et al., 1994), which share with NEPs the idea of several devices working together and exchanging results.
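The idea of processors that hold sets of words, mutate them and exchange results can be illustrated with a simplified Python sketch. It makes several assumptions that are not part of the description above: only substitution rules are modeled, the network is a complete graph, filters are plain predicates, and a word that leaves a processor but is accepted by no other processor is lost. The names Processor, evolve and communicate, as well as the two example processors, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

Word = Tuple[str, ...]  # assumption: a word is a tuple of symbols


@dataclass
class Processor:
    words: Set[Word]
    rules: List[Tuple[str, str]]     # substitution rules: (old, new)
    accepts: Callable[[Word], bool]  # input filter
    releases: Callable[[Word], bool] # output filter


def evolve(p: Processor) -> None:
    """Evolutionary step: rewrite one matching symbol per rule application."""
    new_words: Set[Word] = set()
    for w in p.words:
        changed = False
        for old, new in p.rules:
            for i, sym in enumerate(w):
                if sym == old:
                    new_words.add(w[:i] + (new,) + w[i + 1:])
                    changed = True
        if not changed:
            new_words.add(w)  # words that no rule can rewrite are kept
    p.words = new_words


def communicate(net: List[Processor]) -> None:
    """Communication step: released words go to every processor accepting them."""
    outgoing = []
    for p in net:
        leaving = {w for w in p.words if p.releases(w)}
        p.words -= leaving
        outgoing.append(leaving)
    for k, p in enumerate(net):
        for j, leaving in enumerate(outgoing):
            if j != k:
                p.words |= {w for w in leaving if p.accepts(w)}


# Hypothetical two-processor network: one rewrites N, the other rewrites V.
noun_node = Processor({("N", "V")}, [("N", "dogs")],
                      accepts=lambda w: False,
                      releases=lambda w: "N" not in w)
verb_node = Processor(set(), [("V", "bark")],
                      accepts=lambda w: "V" in w,
                      releases=lambda w: False)
net = [noun_node, verb_node]

for _ in range(2):  # alternate evolutionary and communication steps
    for p in net:
        evolve(p)
    communicate(net)

print(verb_node.words)
```

Each processor is specialized in one kind of rewriting, and the filters decide which words travel, so the final sentence emerges from the cooperation of both processors rather than from a single device.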

The first precedents of NEPs as generating devices can be found in Csuhaj-Varju & Salomaa (1997) and Csuhaj-Varju & Mitrana (2000). The topic was introduced in Castellanos et al. (2003) and Martin-Vide et al. (2003), and further developed in Castellanos et al. (2005) and Csuhaj-Varju et al. (2005).

With this background and these theoretical connections, it is easy to understand how NEPs can be described as agential bio-inspired context-sensitive systems. Many disciplines are in need of models of this type, which are able to support a biological framework in a collaborative environment. The conjunction of these features allows the system to be applied to a number of areas beyond generation and recognition in formal language theory. NLP is one of the fields with a lack of biological models and with a clear suitability for agential approaches.

NEPs have significant intrinsic multi-agent capabilities together with the environmental adaptability that is typical of bio-inspired models. Some of the characteristics of the NEP architecture are the following: modularization, contextualization and redefinition of agent capabilities, synchronization, evolvability and learnability.

Inside of the construct, every agent is autonomous, specialized, context-interactive and learning-capable.

Regarding the functioning of NEPs, two main features deserve to be highlighted: emergence and parallelism.

Because of those features, NEPs seem to be a suitable model for tackling natural languages. One of the main problems of natural language is that it is generated in the brain, and there is a lack of knowledge of the mental processes the mind undergoes to bring about a sentence. While awaiting new advances in neuroscience, we have to use models that seem to fit NLP better. Modularity has proven to be an important idea in a wide range of fields: cognitive science, computer science and, of course, NLP. NEPs provide a suitable theoretical framework for the formalization of modularity in NLP.

Another chief problem for the formalization and processing of natural language is its changing nature. Not only words, but also rules, meaning and phonemes can take different shapes during the process of computation. Formal models based on mathematical language lack the flexibility to describe natural language. Biological models seem to be better suited to this task, since biological entities share with languages the concept of “evolution”. From this perspective, NEPs offer enough flexibility to model any change at any moment in any part of the system. Besides, as a bio-inspired method of computation, they have the capability of simulating natural evolution in a highly pertinent and specialized way.

Some linguistic disciplines, such as pragmatics or semantics, are context-driven areas, where the same utterance has different meanings in different contexts. To model such variation, a system with a good definition of environment is needed. NEPs offer one way of approaching formal semantics and formal pragmatics from a natural computing perspective.

Finally, the multimodal approach to communication, where not just production, but also gestures, vision and supra-segmental features of sounds have to be tackled, calls for a parallel way of processing. NEPs allow modules to work in parallel. The autonomy of each processor and the possible miscoordination between processors can also account for several problems of speech.

Examples of NEP applications to NLP can be found in Bel-Enguix and Jimenez-Lopez (2005c, 2006a).

FUTURE TRENDS

Three general formalisms for dealing with NLP by means of biological methods have been introduced, focusing on the formal definition of several frameworks that adapt models coming from the area of bio-inspired computation to NLP needs. The main trends for the future focus on the implementation of these models in order to test their computational advantages over classical models of NLP without biological inspiration.

CONCLUSION

The coincidences between several structures of language and biology allow us, in the field of NLP, to take advantage of the bio-inspired models formalized by theoretical computer science. Moreover, the multi-agent capabilities of some of these models make them a suitable tool for simulating the processes of generation and recognition in natural language.

Biological methods coming from computer science can be very useful in the field of natural language, since they provide simple, flexible and intuitive tools for describing natural languages and for making their implementation in NLP systems easier.

This research provides an integrative path for biology, computer science and NLP, three branches of human knowledge that have to work together in the development of new systems of communication for a future global society.

KEY TERMS

Grammar Systems Theory: A consolidated and active branch in the field of formal languages that provides syntactic models for describing multi-agent systems at the symbolic level using tools from formal languages and grammars.

Membrane Systems: In a membrane system multisets of objects are placed in the compartments defined by the membrane structure that delimits the system from its environment. Each membrane identifies a region, the space between it and all directly inner membranes. Objects evolve by means of reaction rules associated with compartments, and applied in a maximally parallel, nondeterministic manner. Objects can pass through membranes, membranes can change their permeability, dissolve and divide.

Multi-Agent System: A system composed of a set of computational agents that perform local problem solving and cooperatively interact to solve a single problem (or reach a goal) that is difficult for an individual agent to solve (or achieve).

Mutations: Several types of transformations in a single string.

Natural Computing: Research field that deals with computational techniques inspired by nature and natural systems. This type of computing includes evolutionary algorithms, neural networks, molecular computing and quantum computing.

Neural Network: Interconnected group of artificial neurons that uses a mathematical or a computational model for information processing based on a connectionist approach to computation. It involves a network of simple processing elements that can exhibit complex global behaviour.

Splicing: Operation which consists of splitting up two strings in an arbitrary way and sticking the left side of the first one to the right side of the second one (direct splicing), and the left side of the second one to the right side of the first one (inverse splicing).