
Improving Column Prediction Accuracy of NLIDB with Historical Column Occurrences Score

Title
Improving Column Prediction Accuracy of NLIDB with Historical Column Occurrences Score
Author
Shanza Abbas
Alternative Author(s)
수엔저어바스
Advisor(s)
Scott Uk-Jin Lee
Issue Date
2022. 8
Publisher
Hanyang University
Degree
Doctor
Abstract
Natural Language Interfaces to Databases (NLIDB) have been a research topic for more than a decade. Large data collections are available in the form of databases, and a system that can translate a natural language query into a structured one can make a substantial difference to how effectively data science research uses them. Efforts toward such systems have combined methods from the data science and natural language processing fields; NLIDB systems that integrate natural language processing techniques with data science methods are known as pipelining NLIDB systems. With significant advances in machine learning and natural language processing, NLIDB built on techniques from these two areas has emerged as a new research trend, and deep learning has shown potential for rapid growth and improvement in text-to-SQL tasks. In deep learning NLIDB, closing the semantic gap in predicting users' intended columns has arisen as one of the critical and fundamental problems. Despite various significant efforts, it remains an open issue. Working towards closing the semantic gap between user intention and predicted columns, we present an approach for deep learning text-to-SQL tasks that includes previous columns' occurrence scores as an additional input feature. Because the overall exact match accuracy of a text-to-SQL task depends significantly on column prediction, it can be improved by focusing on column prediction accuracy. For this purpose, we extract fragments from previously executed queries and obtain the columns' occurrence and co-occurrence scores, which are then processed as input features for the encoder-decoder-based text-to-SQL model. We evaluated our approach on Spider, a currently popular and complex text-to-SQL dataset that contains multiple databases and includes question-query pairs along with database schema information. We compared our exact match accuracy with that of a base model using the same test and training data splits. Our approach outperformed the base model, and accuracy was further boosted in experiments with the pre-trained language model BERT; in both cases, inducing the column occurrences graph enhanced accuracy.
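As a rough illustration of the occurrence and co-occurrence scores mentioned in the abstract, the sketch below counts how often known schema columns appear, alone and in pairs, across a list of previously executed queries. The function name `column_scores`, the regex-based column extraction, the normalization by number of queries, and the sample queries are all assumptions made for illustration; the abstract does not specify how the thesis computes or normalizes these scores.

```python
import re
from collections import Counter
from itertools import combinations


def column_scores(queries, schema_columns):
    """Return (occurrence, co_occurrence) scores from past SQL queries.

    Assumption: a column "occurs" in a query if its name appears as an
    identifier token; scores are counts divided by the number of queries.
    """
    known = {c.lower() for c in schema_columns}
    occ, co = Counter(), Counter()
    for query in queries:
        # Identifier-like tokens; "T1.name" yields both "t1" and "name".
        tokens = set(re.findall(r"[a-z_][a-z0-9_]*", query.lower()))
        cols = sorted(tokens & known)
        occ.update(cols)
        # Each unordered pair of columns seen together in one query.
        co.update(combinations(cols, 2))
    n = len(queries) or 1
    occurrence = {col: count / n for col, count in occ.items()}
    co_occurrence = {pair: count / n for pair, count in co.items()}
    return occurrence, co_occurrence


# Hypothetical query history over a single table.
history = [
    "SELECT name, age FROM singer WHERE age > 30",
    "SELECT name FROM singer ORDER BY age",
    "SELECT country FROM singer",
]
occurrence, co_occurrence = column_scores(history, ["name", "age", "country"])
```

In a model like the one the abstract describes, scores of this kind would be supplied alongside the question and schema as extra input features; how they are encoded is not stated here.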
URI
http://hanyang.dcollection.net/common/orgView/200000627064
https://repository.hanyang.ac.kr/handle/20.500.11754/174236
Appears in Collections:
GRADUATE SCHOOL[S](대학원) > COMPUTER SCIENCE & ENGINEERING(컴퓨터공학과) > Theses (Ph.D.)
Files in This Item:
There are no files associated with this item.
