and exclusively select those tuples that satisfy a condition C. The notation for
the theta-join of relations R and S based on condition C is R ⋈_C S. We use the SQL
keyword ON to keep this condition C separate from the other WHERE conditions,
since it reflects a database requirement and should not be matched to anything
in the NL question (e.g. city JOIN state ON name = ).
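As a minimal sketch of the theta-join just described, the snippet below filters the Cartesian product of two relations by a condition C. The relations, the attribute names a and b, and the helper theta_join are illustrative assumptions, not part of the system described here.

```python
from itertools import product

def theta_join(r, s, condition):
    """Keep only the pairs of R x S that satisfy the condition C."""
    return [(x, y) for x, y in product(r, s) if condition(x, y)]

# Hypothetical toy relations; attribute names are placeholders.
R = [{"a": 1}, {"a": 3}]
S = [{"b": 2}, {"b": 4}]

# C is the join condition, kept apart from any later selection (WHERE) step.
joined = theta_join(R, S, lambda x, y: x["a"] < y["b"])
pairs = [(x["a"], y["b"]) for x, y in joined]
print(pairs)
```

Passing C as its own argument mirrors the role of ON: the join condition stays separate from any selection applied afterwards.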
The generated queries can be fairly complex, since we can deal with questions
that require nesting, aggregation and negation in addition to basic projection,
selection and joining (e.g. “How many states have major non-capital cities
excluding Texas”).
2.3 Problem Definition
The question answering task of finding an SQL query that retrieves an answer
for a given NL question reduces to the following problem.
Given a question q represented by means of one typed dependency collapsed
list SDC_q, generate the three sets of clauses S, F, W (arguments of SELECT,
FROM and WHERE, respectively) such that:

    ∃s ∈ S, ∃f ∈ F, ∃w ∈ W s.t. π_s(σ_w(f)) answers q

The query answer π_s(σ_w(f)) is chosen among the set of all possible queries
in a way that maximizes the probability of generating a result set answering
question q.
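The selection step can be sketched as an exhaustive search over the three clause sets. This is only an illustration: the candidate clauses, the weights and the score function are hypothetical placeholders, since how the probability is actually estimated is not specified at this point.

```python
from itertools import product

def best_query(S, F, W, score):
    """Enumerate every (s, f, w) triple and return the highest-scoring one."""
    return max(product(S, F, W), key=lambda sfw: score(*sfw))

def to_sql(s, f, w):
    """Render pi_s(sigma_w(f)) as an SQL string."""
    return f"SELECT {s} FROM {f} WHERE {w}"

# Hypothetical candidate clauses for "What is the capital of Texas?".
S = ["capital", "state_name"]
F = ["state"]
W = ["state.state_name='Texas'", "state.capital='Texas'"]

# Placeholder weights standing in for the real probability estimate.
s_weight = {"capital": 0.9, "state_name": 0.1}
w_weight = {"state.state_name='Texas'": 0.8, "state.capital='Texas'": 0.2}

def score(s, f, w):
    return s_weight[s] * w_weight[w]

best = best_query(S, F, W, score)
print(to_sql(*best))
```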
3 Building Clauses Sets
In order to generate all possible queries for a question q we need to find their
possible SELECT, FROM and WHERE clauses (S, F and W). We start from a
dependency list SDC_q and (a) prune and stem its components, (b) add synonyms,
(c) create the set of stems used to build S and W and (d) keep only the
dependencies possibly used in the recursive step to generate nested queries.
Building the set F from S and W is straightforward.
We now briefly discuss some examples to introduce the objective of the
individual steps and clarify how the entire process is carried out. The
first question we take into account is the simplest one: “What is the capital
of Texas?”. Its answer can be retrieved by executing the query: SELECT capital
FROM state WHERE state.state_name='Texas'. We can see that the question and
the query share only two stems, capital and Texas. The key to categorizing
stems (Section 3.2) is to recognize that the first stem will be used in S and
the second one in W. In particular, since the word Texas is not a value in the
IS, it is used as an r-value in the WHERE expression, while the l-value is
derived from the name of the column under which it appears (Section 3.4).
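The categorization in this example can be sketched with an in-memory SQLite database. The two-column state table, its rows and the categorize helper are assumptions made for illustration; SQLite's PRAGMA table_info stands in for the IS lookup, assuming IS denotes the database metadata.

```python
import sqlite3

# Toy stand-in for the geography database; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state (state_name TEXT, capital TEXT)")
conn.executemany(
    "INSERT INTO state VALUES (?, ?)",
    [("Texas", "Austin"), ("Ohio", "Columbus")],
)

def categorize(stems, table="state"):
    """Split stems into projection (S) and selection (W) clause candidates."""
    # Column names play the role of the metadata (IS) in this sketch.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    select_clauses, where_clauses = [], []
    for stem in stems:
        if stem.lower() in columns:
            # The stem names a column, so it is projection oriented.
            select_clauses.append(stem.lower())
            continue
        for col in columns:
            # Otherwise look for the stem among the stored values; the
            # column under which it appears gives the l-value.
            hit = conn.execute(
                f"SELECT 1 FROM {table} WHERE {col} = ? LIMIT 1", (stem,)
            ).fetchone()
            if hit:
                where_clauses.append((f"{table}.{col}", stem))
    return select_clauses, where_clauses

s, w = categorize(["capital", "Texas"])
lvalue, rvalue = w[0]
sql = f"SELECT {s[0]} FROM state WHERE {lvalue} = ?"
answer = conn.execute(sql, (rvalue,)).fetchall()
print(sql, answer)
```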
The fact that the two stems are respectively projection and selection oriented
can be inferred by looking at their grammatical relations, i.e. by inspecting
the dependency list (e.g. the root of the sentence together with its subject
dependent are typically used for projections). This list needs to be
preprocessed (Section 3.1) to take into account
