Text to SQL, Natural Language Processing (NLP)

Overview

Text to SQL is a system that converts English statements into SQL queries. It lets users retrieve information stored in a database by expressing commands in a natural language (e.g. English) rather than in SQL.

Prior Work and Challenges

Past attempts can be placed into three broad categories, based on the type of approach involved:

Symbolic Approach (Rule-Based Approach): In systems developed with a symbolic approach, entities are picked from a list of query tokens and are translated based on general grammar rules and self-defined sketches on a case-by-case basis.
Empirical Approach (Corpus-Based Approach): Some systems have been developed with an empirical approach in mind: queries are translated based on statistical analyses of large collections of texts (i.e. a corpus).
Connectionist Approach (Using Neural Networks): In this approach, Sequence-to-Sequence models are used to translate one language to another – in this case, English to SQL. The model is trained with an extensive collection of English commands and the corresponding SQL queries (such as WikiSQL) and finally used to predict a SQL query from any given English query and the corresponding database schema.
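
To make the symbolic approach more concrete, here is a minimal sketch of rule-based translation in Python. The patterns, templates, table names, and column names are hypothetical and are not taken from any of the systems described above; a real system would use far richer grammar rules.

```python
import re

# Hypothetical rules: each pairs an English pattern with a SQL template.
RULES = [
    (re.compile(r"^show all (\w+)$"), "SELECT * FROM {0}"),
    (re.compile(r"^count (\w+)$"), "SELECT COUNT(*) FROM {0}"),
    (re.compile(r"^show (\w+) of (\w+) where (\w+) is (\w+)$"),
     "SELECT {0} FROM {1} WHERE {2} = '{3}'"),
]


def translate(command):
    """Return the SQL for the first matching rule, or None if no rule applies."""
    text = command.strip().lower()
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            # Fill the template with the words captured by the pattern.
            return template.format(*match.groups())
    return None


print(translate("Show all customers"))
print(translate("Show name of customers where city is paris"))
print(translate("Which customers spent the most last year?"))
```

The last command matches no rule, so the function returns None. This is the kind of limited query scope that rule-based systems tend to run into.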

Limitations of Prior Work

The limitations of the Symbolic and Empirical approaches include varied processing times, the limited scope of queries, and high memory usage.
The limitation of the Connectionist approach is reduced accuracy for complex databases.

Our Approach

Considering the complex nature of both the target databases and the English-language queries, we developed a hybrid of the existing approaches.
This hybrid approach had the following components:

A rule-based system that used the corpus and WordNet (a popular lexical database) to generate the structure of the SQL query.
Word embeddings, trained using neural networks, that were used to generate the measures and dimensions of the query (the values to aggregate and the attributes to group or filter by).
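
As an illustration of the second component, the sketch below matches a word from the query to the closest column by cosine similarity between word vectors. It is only a sketch under stated assumptions: the three-dimensional vectors are hand-written stand-ins for trained embeddings, and the column names and their measure/dimension roles are hypothetical.

```python
import math

# Toy, hand-written vectors standing in for trained word embeddings (assumption).
EMBEDDINGS = {
    "spend": [0.9, 0.1, 0.0],
    "revenue": [0.8, 0.3, 0.1],
    "region": [0.1, 0.9, 0.2],
    "country": [0.2, 0.8, 0.3],
    "expenditure": [0.95, 0.05, 0.0],
    "area": [0.05, 0.95, 0.1],
}

# Hypothetical schema: each column is tagged as a measure or a dimension.
COLUMN_ROLES = {
    "spend": "measure",
    "revenue": "measure",
    "region": "dimension",
    "country": "dimension",
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def best_column(word):
    """Return the most similar column and its role for a query word."""
    vector = EMBEDDINGS[word]
    column = max(COLUMN_ROLES, key=lambda name: cosine(vector, EMBEDDINGS[name]))
    return column, COLUMN_ROLES[column]


for word in ("expenditure", "area"):
    print(word, "->", best_column(word))
```

With these toy vectors, 'expenditure' maps to the measure 'spend' and 'area' maps to the dimension 'region'.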

Limitations

Even with the Hybrid approach, there are some limitations:

Highly contextual or vague queries like ‘last 5 years’ and ‘top 10’ are not processed accurately.
Queries like ‘average spend’ lead to ambiguity (i.e. ‘average of overall spend’ / ‘sum of average spend’ / ‘average of average spend’).
Having too many similar columns leads to erroneous results, because the distances between the corresponding words in the Word2Vec model differ only minimally.
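
The ambiguity of a query like ‘average spend’ can be illustrated with a small sketch using Python's built-in sqlite3 module. The table, its columns, and the data are hypothetical; the point is only that the three readings translate to different SQL and give different answers.

```python
import sqlite3

# Hypothetical in-memory table of per-order spend by customer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("A", 10), ("A", 20), ("A", 30), ("B", 100)],
)

# Average spend per customer, reused as a subquery below.
PER_CUSTOMER = "SELECT AVG(spend) AS avg_spend FROM orders GROUP BY customer"

# Three plausible readings of 'average spend'.
INTERPRETATIONS = {
    "average of overall spend": "SELECT AVG(spend) FROM orders",
    "sum of average spend": f"SELECT SUM(avg_spend) FROM ({PER_CUSTOMER})",
    "average of average spend": f"SELECT AVG(avg_spend) FROM ({PER_CUSTOMER})",
}

results = {
    label: conn.execute(sql).fetchone()[0]
    for label, sql in INTERPRETATIONS.items()
}
for label, value in results.items():
    print(f"{label}: {value}")
```

On this toy data the three readings return 40.0, 120.0, and 60.0 respectively, so the system cannot pick the right query without more context.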

Improvements

Recent developments in the NLP space, such as Transformer-based architectures, may help to address these limitations. However, such models are very computationally expensive.

References

Nihalani, Neelu. “Natural Language Interface for Database: A Brief Review.” International Journal of Computer Science Issues, 2011, www.ijcsi.org/papers/IJCSI-8-2-600-608.pdf.
Salesforce. “Salesforce/WikiSQL.” GitHub, 23 July 2018, github.com/salesforce/WikiSQL.
Xu, Xiaojun. “SQLNet: Generating Structured Queries from Natural Language without Reinforcement Learning.” arXiv, https://arxiv.org/pdf/1711.04436.pdf
Yu, Tao. “TypeSQL: Knowledge-based Type-Aware Neural Text-to-SQL Generation.” arXiv, https://arxiv.org/pdf/1804.09769.pdf
Zhong, Victor. “Seq2SQL: Generating Structured Queries from Natural Language Using Reinforcement Learning.” Salesforce Research, https://arxiv.org/pdf/1709.00103.pdf
Alexander, Rukshan. “Natural Language Web Interface for Database (NLWIDB).” arXiv, https://arxiv.org/ftp/arxiv/papers/1308/1308.3830.pdf
Couderc, Benoît, and Jérémy Ferrero. “fr2sql : Interrogation de bases de données.” 2015, researchgate.net/publication/280700277_fr2sql_Interrogation_de_bases_de_donnees_en_francais
Singh, Garima. “An Algorithm to Transform Natural Language into SQL Queries for Relational Databases.” International Academy of Ecology and Environmental Sciences, http://www.iaees.org/publications/journals/selforganizology/articles/2016-3(3)/algorithm-to-transform-natural-language-into-SQL-queries.pdf
NSS. “Intuitive Understanding of Word Embeddings: Count Vectors to Word2Vec.” Analytics Vidhya, 22 Mar. 2018, www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/.

-Authored by Paresh Pradhan, Data Scientist at Absolutdata

Technical articles are published from the Absolutdata Labs group, and hail from The Absolutdata Data Science Center of Excellence. These articles also appear in BrainWave, Absolutdata’s quarterly data science digest.
