1.TITLE: End-to-end neural Text-to-Speech synthesis
ABSTRACT (up to 100 words): Text-to-Speech (TTS) synthesis is the problem of transforming written sentences into waveforms of speech utterances. The aim of the thesis is to investigate a neural end-to-end generative speech synthesis architecture derived from Tacotron2. The network will be trained on text-audio pairs, and the training behaviour of its constituent components will be analyzed. The proposed system will be compared with current TTS systems in terms of output quality.
2.TITLE: Sequence-to-Sequence Learning for Music Improvisation based on Accompaniment
ABSTRACT (up to 100 words): Sequence-to-Sequence (S2S) learning has proven effective in automatic translation, where a text sequence/sentence in one language is the input and its translation into another language is the output. The most basic form of music improvisation based on accompaniment involves a “fixed” harmonic and rhythmic background, over which an improviser performs. The hypothesis maintained in this thesis is that the note sequences of the “fixed” background and those of the improvisation can be considered as sentences of two different languages: the “harmonic” language of the background and the “melodic” language of the improvisation. Translation can be attempted in both directions, i.e. a) given the background, generate a melody and b) given a melody, generate an accompanying background. The research questions that this thesis could attempt to answer are the following:
- To what extent are S2S models capable of modeling musical improvisation based on accompaniment?
- Do S2S models perform better on translating harmonic background to melodic improvisations or vice versa?
- What adjustments would be necessary to make S2S systems capable of performing in real-time setups?
Jazz standards will be used as training material, since they incorporate clear harmonic/rhythmical backgrounds that accompany melodies which can be considered as improvisations.
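The "two languages" framing above can be illustrated with a minimal sketch of how one training example might be prepared for either translation direction. The token formats (chord symbols for the background, note names for the melody), the function names and the special tokens are all hypothetical; the proposal does not specify an encoding.

```python
# Hypothetical special tokens marking the start and end of a target sequence.
BOS, EOS = "<bos>", "<eos>"


def build_vocab(sequences):
    """Map every token seen in the sequences to an integer id."""
    tokens = sorted({tok for seq in sequences for tok in seq})
    return {tok: i for i, tok in enumerate([BOS, EOS] + tokens)}


def make_pair(background, melody, direction="background_to_melody"):
    """Return (source, target) token lists for the chosen direction."""
    if direction == "background_to_melody":
        source, target = background, melody
    elif direction == "melody_to_background":
        source, target = melody, background
    else:
        raise ValueError(f"unknown direction: {direction}")
    # The target is wrapped in start/end markers, as is common for decoders.
    return list(source), [BOS] + list(target) + [EOS]


# Assumed toy excerpt: a ii-V-I background and a short melodic line.
background = ["Dm7", "G7", "Cmaj7", "Cmaj7"]
melody = ["F4", "A4", "B4", "D5", "E5", "G4"]

src, tgt = make_pair(background, melody)
rev_src, rev_tgt = make_pair(background, melody, "melody_to_background")
vocab = build_vocab([background, melody])
```

Keeping the direction as a parameter lets the same data feed both experiments, which is what a comparison of the two translation directions would need.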
3.TITLE: Studying the Particularities of Image-Inspired representations of Musical Data for Machine Learning
ABSTRACT (up to 100 words): Given sufficient data and enough computational power to model a large number of variables, modern Deep Learning methods can perform impressively in tasks related to image and sound/music processing. Generally, methods for image processing rely on a “static” representation of data (e.g. learning convolutional filters by traversing images), while methods for music processing mainly incorporate dynamic modeling (e.g. with recurrent nodes). The research questions that can be approached in the context of this thesis are the following:
- How efficiently can several methods that follow the static, image-related, data modeling paradigm actually model musical data?
- Which architectures show the largest performance gap between modeling image data and modeling musical data, and what could explain their inefficiency on musical data?
- Which architectures and what musical data transformations perform better?
This thesis could start as a comparative study of the aforementioned performance gap across several architectures that perform well on the MNIST handwritten digits dataset. Images of musical score data could incorporate fixed-size musical excerpts with given musical characteristics (e.g. 4 measures of 4/4 music at a rhythmic resolution of 16th notes). Initial studies suggest that most architectures that perform well on the MNIST dataset “collapse” when they process musical data in image-like form.
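One possible image-like form for such a fixed-size excerpt is a binary piano roll, sketched below. The pitch range, the note tuple format and all names are assumptions made for illustration; the proposal does not fix a particular representation.

```python
import numpy as np

STEPS_PER_MEASURE = 16                    # 4/4 time at sixteenth-note resolution
N_MEASURES = 4                            # fixed excerpt length
N_STEPS = STEPS_PER_MEASURE * N_MEASURES  # 64 time steps in total
LOW_PITCH, HIGH_PITCH = 36, 100           # assumed MIDI range, giving 64 rows


def notes_to_piano_roll(notes):
    """Convert (midi_pitch, start_step, duration_steps) tuples to a 2D array.

    Rows are pitches, columns are sixteenth-note time steps; a cell is 1
    while the corresponding pitch is sounding and 0 otherwise.
    """
    roll = np.zeros((HIGH_PITCH - LOW_PITCH, N_STEPS), dtype=np.uint8)
    for pitch, start, duration in notes:
        if not LOW_PITCH <= pitch < HIGH_PITCH:
            continue  # ignore pitches outside the assumed range
        end = min(start + duration, N_STEPS)  # clip notes at the excerpt end
        roll[pitch - LOW_PITCH, start:end] = 1
    return roll


# Assumed toy excerpt: an ascending C major arpeggio.
excerpt = [(60, 0, 4), (64, 4, 4), (67, 8, 8), (72, 16, 16)]
image = notes_to_piano_roll(excerpt)
```

With these assumed settings the result is a 64 x 64 binary array, larger than the 28 x 28 MNIST images but comparable in kind, so the same static architectures could in principle be applied to it unchanged.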
4.TITLE: Text detection and recognition in natural scenes
ABSTRACT (up to 100 words): Text in natural scenes appears in many situations of daily life, from labels on the facades of buildings and signs in the streets to the covers of books. Automated text detection and recognition makes it possible to filter out the complexity of the environment, focus on the useful information and recognize the text. The aim of the project is to develop and apply algorithms for the localization of text regions in natural scene images, their segmentation into words, and optical character recognition. The proposed algorithms will be tested on datasets from the ICDAR Robust Reading Competition series.
NAME & POSITION OF THE SUPERVISOR: Vassilis Katsouros, Research Director at Athena R.C. & Vassilis Papavassiliou, Associate Researcher at Athena R.C.
LAB/GROUP, DEPARTMENT, INSTITUTION where the thesis will be executed: Institute for Language and Speech Processing