Georgia Koutrika - Data Science and Information Technologies

Georgia Koutrika’s Topics

The general direction for the following topics is:

Empowering Data Access Through Intelligent Data Exploration Tools

Data growth and availability, as well as data democratization, have radically changed data exploration. Many different data sets, generated by users, systems and sensors, are continuously being collected. These data sets contain information about scientific experiments, health, energy, education etc., and they are highly heterogeneous in nature, ranging from highly structured data in tabular form to unstructured text, images or videos. These data can potentially benefit many types of users, from analysts exploring data sets for insight and scientists looking for patterns to dashboard interactors and consumers looking for information. While the benefit of data exploration becomes increasingly prominent, the limitations of existing tools make accessing and combining data from different data sources a non-trivial, time-consuming, and often fruitless endeavor.

The suggested topics combine a research component and a development component, in a different mix per topic.

Bibliography search: https://dblp.uni-trier.de/

Topic 1: Query Recommendations from Natural Language and SQL Queries

Databases have been designed for the specific purpose of answering queries, i.e., a user issues a query, the query is executed on the underlying data, and the system returns matching results. However, in several realistic situations, users are only partially familiar with data and its structure, or their needs are partially expressed. As a result, they do not know what type of questions to ask over a dataset. Additionally, query languages such as SQL or SPARQL are cumbersome and “unnatural” for users with no specific technical background. While query suggestions or recommendations are the norm when you are looking for information on a search engine like Google, there are relatively few efforts when it comes to querying structured data.

The purpose of this thesis is to combine modern NLP techniques (embeddings) and recommendation methods (e.g., content-based or collaborative filtering) to build a system that provides recommendations for queries expressed in natural language or SQL. The proposed system will be compared with other systems to determine which one performs best in different scenarios.

Implementation of a system with different recommendation methods
Detailed experiments for evaluating the performance of the different algorithms
Use and evaluation on real data
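
As a rough illustration of the content-based flavor of query recommendation, the sketch below ranks the queries of a small log by their similarity to the user's current input. It is only a toy under stated assumptions: the query log is invented, and a plain bag-of-tokens vector with cosine similarity stands in for the learned embeddings mentioned above. The names `QUERY_LOG`, `vectorize` and `recommend` are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical query log mixing SQL and natural language queries.
QUERY_LOG = [
    "SELECT name FROM movies WHERE year > 2010",
    "SELECT title, salary FROM employees",
    "show me movies directed by Nolan",
    "SELECT name, rating FROM movies ORDER BY rating DESC",
]


def vectorize(query: str) -> Counter:
    # Bag-of-tokens vector; a real system would use a learned embedding here.
    return Counter(re.findall(r"[a-z_]+", query.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse token-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(current: str, log: list[str], k: int = 2) -> list[str]:
    # Rank logged queries by similarity to the current one and keep the
    # top k that share at least one token with it.
    cur = vectorize(current)
    scored = [(cosine(cur, vectorize(q)), q) for q in log if q != current]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [q for score, q in scored[:k] if score > 0]


if __name__ == "__main__":
    for suggestion in recommend("movies with high rating", QUERY_LOG):
        print(suggestion)
```

A collaborative-filtering variant would instead compare users or sessions by the queries they issued; swapping `vectorize` for a sentence or query embedding model leaves the rest of the sketch unchanged.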

Related readings:

Customized query auto-completion and suggestion — A review (2020)
QueRIE reloaded: Using matrix factorization to improve database query recommendations
Cluster-Driven Navigation of the Query Space
Query2Vec: NLP Meets Databases for Generalized Workload Analytics
Similarity metrics for SQL Query clustering

Topic 2: Interactive Translation of Natural Language Queries to SQL

A plethora of data is stored in databases. However, query languages such as SQL or SPARQL are cumbersome and “unnatural” for users with no specific technical background. In the last few years, there has been a growing need to enable users to query data in natural language (NL). Several systems have emerged that enable users to ask NL queries, combining methods from the database, NLP, and machine learning communities. All of them assume that the user completes the NL query before the system tries to translate it to SQL.

The goal of this thesis is to build a system that translates an NL query to SQL as the user types it! This translate-as-you-type system will combine database and deep learning techniques.

Implementation of a system with different methods that builds on existing work and makes it possible to do incremental translation
Detailed experiments for evaluating the performance
Use and evaluation on real data
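
To make the translate-as-you-type idea concrete, the sketch below re-translates the input after every typed word and emits a partial SQL query whenever the translation changes. It is a toy under stated assumptions: the schema is invented, and a keyword-matching function (`translate_prefix`, a hypothetical name) stands in for the deep learning translator the thesis would actually build. Unknown parts of the query are shown as `?`.

```python
import re
from collections.abc import Iterator

# Hypothetical schema: table name -> column names.
SCHEMA = {
    "movies": ["title", "year", "rating"],
    "actors": ["name", "age"],
}


def translate_prefix(prefix: str) -> str:
    # Toy stand-in for a neural translator: match schema terms in the prefix.
    words = set(re.findall(r"\w+", prefix.lower()))
    table = next((t for t in SCHEMA if t in words), None)
    # Until a table is mentioned, consider the columns of every table.
    if table:
        candidates = SCHEMA[table]
    else:
        candidates = [c for cols in SCHEMA.values() for c in cols]
    cols = [c for c in candidates if c in words]
    select = ", ".join(cols) if cols else "?"
    return f"SELECT {select} FROM {table or '?'}"


def translate_as_you_type(text: str) -> Iterator[tuple[str, str]]:
    # Simulate typing word by word; yield (prefix, partial SQL) only when
    # the partial translation differs from the previous one.
    words = text.split()
    last_sql = None
    for i in range(1, len(words) + 1):
        prefix = " ".join(words[:i])
        sql = translate_prefix(prefix)
        if sql != last_sql:
            yield prefix, sql
            last_sql = sql


if __name__ == "__main__":
    for prefix, sql in translate_as_you_type("show title and rating of movies"):
        print(f"{prefix!r:40} -> {sql}")
```

Re-translating the whole prefix at every step is the simplest design; an incremental system would more likely reuse intermediate state between keystrokes, which is part of what the thesis would need to investigate.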

Related readings:

Constructing an Interactive Natural Language Interface for Relational Databases
RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers
Interactive SQL(NL) Query Suggestion: Making Databases User-Friendly
Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Topic 3: Building a Meta-Search Engine for Natural Language Queries over Databases

A plethora of data is stored in databases, but query languages such as SQL or SPARQL are cumbersome and “unnatural” for users with no specific technical background. In the last few years, there has been a growing need to enable users to query data in natural language (NL). Several systems have emerged that enable users to ask NL queries, combining methods from the database, NLP, and machine learning communities. Each of them has its pros and cons.

The goal of this thesis is to leverage the pros and hide the cons of different systems by building a meta-search engine on top of them. The challenge here is not just to take an NL query and execute it over all systems. Doing so is inefficient, and the returned results may be diverse and hard to combine. Hence, two challenges need to be solved. First, how to determine which system is best suited to a query, so as to avoid probing the different systems unnecessarily. Second, how to combine and rank the results returned by the different systems. Which result is the best? Which system should be trusted more? This requires a mix of Information Retrieval and Machine Learning methods.

Implementation of a system with different methods
Detailed experiments for evaluating the system's performance (how fast is it?) and its effectiveness (is it wiser than the individual systems?)
Use and evaluation on real data
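
For the second challenge, combining ranked results from several systems, one standard Information Retrieval baseline is reciprocal rank fusion (RRF): each candidate receives a score of 1 / (k + rank) from every system that returns it, and the scores are summed. The sketch below is an illustration only, not the method the thesis prescribes. The system names and candidate queries are invented, and `k = 60` is the constant commonly used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: dict[str, list[str]], k: int = 60
) -> list[tuple[str, float]]:
    # Sum 1 / (k + rank) over every system that returned the candidate,
    # then sort candidates by their fused score, best first.
    scores: dict[str, float] = defaultdict(float)
    for ranked in rankings.values():
        for rank, candidate in enumerate(ranked, start=1):
            scores[candidate] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical ranked SQL candidates returned by three NL-to-SQL systems
# for the same NL query.
Q1 = "SELECT title FROM movies WHERE year = 2020"
Q2 = "SELECT title FROM movies WHERE year >= 2020"
Q3 = "SELECT * FROM movies"
Q4 = "SELECT title, year FROM movies"

RANKINGS = {
    "system_a": [Q1, Q2, Q3],
    "system_b": [Q2, Q1],
    "system_c": [Q2, Q4],
}

if __name__ == "__main__":
    for candidate, score in reciprocal_rank_fusion(RANKINGS):
        print(f"{score:.4f}  {candidate}")
```

RRF treats every system as equally trustworthy. Answering the question of which system to trust more would require per-system weights, for example learned from how often each system was right on similar queries.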