The Challenge of Speaker Diarization

Introduction

Speaker diarization is the task of partitioning an audio recording into segments according to the identity of the speaker; in short, determining “who spoke when”. Speaker diarization can be used to analyze audio data in various scenarios, such as business meetings, court proceedings, and online videos, just to name a few. However, it is also a very challenging task, since the characteristics of several speakers have to be modelled jointly.

Traditionally, a diarization system is a combination of multiple independent sub-modules, each optimized individually. Recently, end-to-end deep learning methods have been gaining popularity. The standard metric for speaker diarization is the diarization error rate (DER), which is the sum of false alarm, missed detection, and confusion between speaker labels.

In this article, we first introduce our upcoming speaker diarization system, then give an overview of the latest research in end-to-end speaker diarization.

 

Our Method

Our product uses a traditional diarization pipeline, which consists of several components: speech turn segmentation, speaker embedding extraction, and clustering. We utilize pyannote.audio [2], an open-source toolkit for speaker diarization, to train most of our models. 

 

1. Speech Turn Segmentation

The first step of our diarization system is to partition the audio recording into likely speaker turn segments. A voice activity detection model detects speech regions and removes non-speech parts, while a speaker change detection model detects speaker change points in the audio. Each of these models is trained as a sequence labeling task: given a sequence of audio features as input, it outputs a sequence of labels.

For both models, there are tunable hyperparameters that determine how sensitively the models segment the audio. For the speaker change detection model, only audio frames with a detection score higher than the threshold “alpha” are marked as speaker change points. We have found that it is generally beneficial to segment more aggressively (i.e., split the audio into more segments) at this stage, so that most segments contain only one speaker. After the clustering stage, we can merge segments assigned to the same speaker, which also improves speech recognition performance in subsequent stages.
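
As a minimal illustration of the thresholding step (the actual models and exact peak-picking logic in pyannote.audio differ), the sketch below marks frames whose detection score exceeds alpha as change-point candidates:

```python
import numpy as np

def pick_change_points(scores, alpha=0.5, min_gap=10):
    """Toy peak picking over frame-level speaker-change scores.

    scores:  1-D array of per-frame detection scores in [0, 1]
    alpha:   detection threshold (the "alpha" hyperparameter above)
    min_gap: minimum number of frames between two accepted change points
    """
    change_points = []
    for frame in np.where(scores > alpha)[0]:
        # keep a frame only if it is far enough from the last accepted one
        if not change_points or frame - change_points[-1] >= min_gap:
            change_points.append(int(frame))
    return change_points

# Lower alpha => more frames exceed the threshold => more aggressive segmentation.
print(pick_change_points(np.random.rand(1000), alpha=0.4))
```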

 

2. Speaker Embedding Extraction

After speech turn segmentation, a speaker embedding model is used to extract a compact, fine-grained representation of each segment, which facilitates the subsequent clustering step.

The model we chose is SincTDNN, which is essentially an x-vector architecture in which the filters are learned instead of handcrafted. We train the model with an additive angular margin loss.
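
For reference, here is a minimal PyTorch sketch of an additive angular margin (ArcFace-style) classification head; the margin and scale values are placeholders, not our production settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Minimal additive angular margin (ArcFace-style) classification head."""
    def __init__(self, embed_dim, n_speakers, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_speakers, embed_dim))
        self.margin, self.scale = margin, scale

    def forward(self, embeddings, labels):
        # cosine similarity between L2-normalised embeddings and class weights
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular margin only to the target-speaker logit
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)

head = AAMSoftmax(embed_dim=512, n_speakers=1000)
loss = head(torch.randn(8, 512), torch.randint(0, 1000, (8,)))
```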

We use CN-Celeb [3], an open-source, large-scale Chinese speaker recognition dataset, to fine-tune a model pretrained on VoxCeleb. CN-Celeb is a challenging, multi-domain dataset consisting of 3,000 speakers across 11 different genres. By using such a diverse dataset, we expect our model to remain stable when facing various real-life scenarios. We also observe that only a few epochs are needed to fine-tune the model, and that a speaker embedding with lower EER does not always imply a lower diarization error rate.

 

3. Clustering

Traditional clustering methods can be used to cluster the speaker embeddings and identify which speaker each segment belongs to. In our system, we use affinity propagation for short audio and KMeans for long audio. Since we do not know how many speakers are in a recording, affinity propagation performs relatively better than KMeans because it determines the number of clusters directly. For KMeans, we estimate the best number of clusters by finding the elbow of the derivatives of the MSCD curve. However, we resort to KMeans instead of affinity propagation for longer audio files, since affinity propagation tends to be slow in that situation.
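
A simplified sketch of this clustering logic using scikit-learn is shown below; the duration threshold and the fallback speaker-count estimate are placeholders (our system estimates the number of clusters from the MSCD elbow instead):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans

def cluster_segments(embeddings, duration_sec, long_audio_threshold=600,
                     n_speakers_estimate=4):
    """Toy version of the clustering step described above.

    embeddings: (n_segments, embed_dim) speaker embeddings
    Short audio -> affinity propagation (number of clusters found automatically).
    Long audio  -> KMeans with an estimated number of speakers.
    """
    if duration_sec <= long_audio_threshold:
        labels = AffinityPropagation(damping=0.7).fit_predict(embeddings)
    else:
        labels = KMeans(n_clusters=n_speakers_estimate, n_init=10).fit_predict(embeddings)
    return labels

labels = cluster_segments(np.random.rand(50, 512), duration_sec=300)
```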

Combining all components

With all the aforementioned components trained, we combine them into a complete diarization pipeline. Several hyperparameters are then optimized jointly to minimize the diarization error rate on a given dataset.

Compared to our previous system, the current system obtains around 30% relative DER improvement on our internal dataset flow-inc, which consists of a few thousand news recordings. Most of the improvement stems from the better speaker embedding and clustering method, which reduced DER from 15% to 8% in the oracle speech turn segmentation setting.

In our production system, we also allow some extra customization for our users. For instance, if the user provides the number of speakers in the recording, we use KMeans to re-cluster with that number of clusters. Also, if the user provides or corrects speaker labels for some utterances, this extra information is used to update the cluster centers; re-clustering then proceeds with a partial initialization based on the updated centers, resulting in more accurate predictions.
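
The re-clustering step can be sketched roughly as follows; the padding strategy for uncorrected speakers is a placeholder, not our production logic:

```python
import numpy as np
from sklearn.cluster import KMeans

def recluster_with_feedback(embeddings, corrected, n_speakers):
    """Re-cluster after user feedback (illustrative only).

    embeddings: (n_segments, dim) speaker embeddings
    corrected:  dict mapping speaker index -> list of segment indices the user
                has confirmed for that speaker (assumed <= n_speakers entries)
    """
    # initialize centers from the user-confirmed segments
    centers = np.stack([embeddings[idx].mean(axis=0) for idx in corrected.values()])
    if len(centers) < n_speakers:
        # pad the remaining centers with random segments (placeholder strategy)
        extra = embeddings[np.random.choice(len(embeddings),
                                            n_speakers - len(centers), replace=False)]
        centers = np.vstack([centers, extra])
    km = KMeans(n_clusters=n_speakers, init=centers, n_init=1)
    return km.fit_predict(embeddings)
```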

Finally, the output of speaker diarization can be fed into an ASR system, so that the final output is a transcript for each speaker turn. It is not yet clear whether diarization can improve ASR performance; however, recent work has shown that ASR combined with diarization can achieve WER comparable to a state-of-the-art ASR system, with speaker labels as extra information.

 

What’s Next?

While most production systems still follow the traditional pipeline of speech turn segmentation plus clustering, end-to-end diarization systems are receiving more and more attention. Because the traditional pipeline assumes a single speaker per segment when extracting speaker embeddings, it cannot handle overlapping speech. End-to-end diarization systems take frame-level speech features as input and directly output frame-level speaker activity for each speaker, so they can handle overlapping speech naturally. In addition, end-to-end systems minimize diarization error directly, removing the need to tune the hyperparameters of several components. Today, end-to-end systems perform on par with, or even better than, traditional pipeline systems on many datasets.
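
For intuition, the sketch below shows the permutation-invariant binary cross-entropy objective used in EEND-style training (a simplified version; the factorial search over permutations is only tractable for small speaker counts):

```python
from itertools import permutations
import torch
import torch.nn.functional as F

def pit_bce_loss(pred, target):
    """Permutation-invariant BCE loss for frame-level speaker activities (sketch).

    pred, target: (frames, n_speakers); pred holds logits, target holds 0/1
    labels, where overlapping speech simply means several 1s in one frame.
    """
    n_spk = target.size(1)
    losses = []
    for perm in permutations(range(n_spk)):
        # try every assignment of output channels to reference speakers
        losses.append(F.binary_cross_entropy_with_logits(pred, target[:, list(perm)]))
    return torch.stack(losses).min()

loss = pit_bce_loss(torch.randn(200, 2), torch.randint(0, 2, (200, 2)).float())
```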

However, end-to-end systems still have several drawbacks. First, they cannot easily handle an arbitrary number of speakers: the originally proposed EEND [4] can only deal with a fixed number. Several follow-up works attempt to address this problem, but their systems still struggle to generalize to real-world settings, where there may be more than three speakers.

Second, while end-to-end systems are optimized directly for diarization error rate, they can easily overfit to the training dataset, for instance to the number of speakers, speaker characteristics, and background conditions. In contrast, traditional clustering-based approaches have been shown to be more robust across datasets. Lastly, most end-to-end diarization systems use a Transformer or Conformer as their core architecture. The self-attention mechanism in the Transformer has quadratic complexity, which hinders the model’s ability to process long utterances or meetings.

Recently, hybrid systems integrating an end-to-end neural network with clustering have been proposed [5][6], trying to take the best of both worlds. Hybrid systems handle overlapping speech with an end-to-end network, while handling an arbitrary number of speakers with clustering algorithms. Also, block-wise processing of the input audio [5] significantly shortens the processing time for long recordings. This line of research is very promising and deserves more attention.

Finally, many state-of-the-art systems still rely on simulated mixture recordings for training, and then fine-tune and evaluate on real datasets. This indicates a lack of large-scale real-life datasets for end-to-end training. By introducing more challenging real-life datasets, diarization performance can likely be improved further.

 

Conclusion

To address the challenging task of speaker diarization, we use a system consisting of several neural networks, for speech activity detection, speaker change detection, and speaker embedding, together with a clustering algorithm. While these components are trained separately, they are optimized jointly by tuning hyperparameters. In future work, we would like to incorporate an end-to-end segmentation model to handle overlapping speech, and to make our system run closer to real time.

 

References

[1] A Review of Speaker Diarization: Recent Advances with Deep Learning https://arxiv.org/abs/2101.09624 

[2] pyannote.audio: neural building blocks for speaker diarization https://arxiv.org/abs/1911.01255

[3] CN-Celeb: multi-genre speaker recognition https://arxiv.org/abs/2012.12468   

[4] End-to-End Neural Speaker Diarization with Self-attention https://arxiv.org/abs/1909.06247 

[5] Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech https://arxiv.org/abs/2105.09040 

[6] Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors  https://arxiv.org/abs/2107.01545 

[7] The yellow brick road of diarization, challenges and other neural paths https://dihardchallenge.github.io/dihard3workshop/slide/The%20yellow%20brick%20road%20of%20diarization,%20challenges%20and%20other%20neural%20paths.pdf 

 

Lyrics-free Singing Voice Generation

The conventional approach to generating singing voices is through singing voice synthesis (SVS) techniques. A human user feeds lyrics and MIDI scores (a sequence of notes) to a well-trained SVS model, and the model generates audio recordings that follow the given lyrics and scores faithfully. The synthesis models have little freedom in deciding what to “sing.”

In contrast to this conventional approach, we are interested in a singing voice generation model that does not take any lyrics or MIDI scores as input, but instead decides the phonemes and pitches underlying its singing pretty much on its own. We consider this setting more interesting, as it emphasizes the “creativity” of the model.

Accordingly, we chose to work directly on audio recordings without explicitly extracting pitches or lyrics from the singing audio. The first problem was how to gather enough singing voice audio to train neural network models. Luckily, we had experience in music source separation, and our source separation models could already separate the singing voices with reasonably good quality.

To allow the singing voice generation model to work together with our piano generation models (introduced here), we also want the model, or a variant of it, to sing according to a given piano track without singing the exact notes of the piano track.

To model the singing voices, in the winter of 2019 we designed a GAN model that can generate singing voices freely, as well as a variant that can sing along with a given accompaniment audio. Unlike conventional GANs for image generation, the proposed model can generate audio of indefinite duration. This work was published at the International Conference on Computational Creativity (ICCC) 2020. We will refer to these models as the first generation.

 

The first generation of models produced very creepy singing voices that could give you nightmares, so we developed a second generation. These were also GAN-based models, but both the architecture and the vocoder were improved. One notable extra mechanism is cycle regularization, which largely improved the pronunciation and the intelligibility of phones. This work was published at INTERSPEECH 2020.

 

Although the sound quality was largely improved in the second generation, the generated singing voices did not sound good musically. Therefore, we started to build new models with better sequence modeling capability. Around the same time, OpenAI proposed Jukebox, a two-step method for music generation that works directly on audio. In this two-step method, 1) the audio is converted into a sequence of discrete tokens with a VQ-VAE, and 2) the tokens are modeled by a Transformer. In our third generation of models, we adopted this two-step method, while redesigning the architecture around our goals.

 

One particular improvement was that the model can run in real time. A variant of it can accompany piano playing in real time. The music team here at Taiwan AI Labs has deployed it for real-time interaction in various exhibitions and art installations since the autumn of 2020.

 

We have been working on various improvements and variants of this model for different applications. Our best model can now generate singing voices with much better quality than those described above. The new model is likely to make its debut soon. Stay tuned with us for more fun!

 

By: Jen-Yu Liu and Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

MuseMorphose: Music Style Transfer with A Transformer VAE

At Taiwan AI Labs, we are constantly pushing the frontier of deep music generation models. In the past year, we have rolled out Guitar Transformer (blog), which can compose human-readable guitar tabs with plausible fingerings, and Compound Word Transformer (blog), which vastly accelerated model training and inference thanks to carefully re-engineered music representation. Today, proudly making its debut is MuseMorphose, our brand new model for music style transfer.

Unlike our previous works, which offered limited possibilities for user interaction, MuseMorphose is designed to extensively engage users in the machine’s creative process. With MuseMorphose, users may input their favorite song, of any length, and set two musical style attributes, namely rhythmic intensity and polyphony (i.e., harmonic fullness), of every bar to their desired level (8 levels, 0 to 7). The model then re-creates the song, taking into account the user-specified sequence of bar-level style attributes.

Listening Samples

To showcase MuseMorphose’s capabilities, let us present some of its compositions first!

In the first example, we feed the famous 8-bar theme of Mozart’s “Twinkle, Twinkle, Little Star” to the model and ask it to generate a style-transferred version with increasing rhythmic and harmonic intensities (going up by one level per bar). For the second, we choose an 8-bar excerpt from our AILabs1K7 pop song dataset (published here) and pick three drastically different style settings for MuseMorphose to wield its creativity.

Across all the samples, MuseMorphose responds precisely to the style settings. What’s more, it sticks faithfully to the musical flow of the input song while adding its own creative and harmonious touches.

MuseMorphose: The Model

Now that we have had some pleasant music, it is time to delve into the technical underpinnings of MuseMorphose.

Architecture of MuseMorphose


MuseMorphose is based on two popular deep learning models for music generation: the Variational Autoencoder (VAE) and the Transformer. The VAE (see also: MusicVAE & Attr-Aware VAE) grants users the freedom to steer the generation through operations on its learned latent space, much as we have shown in the examples above; however, its RNN backbone greatly limits the length of music it can model.

The Transformer (check out: Music Transformer & MuseNet), on the other hand, can generate music of up to 5 minutes long, but its conditional generation use cases remain underexplored. People could only use global condition tokens to affect a composition’s style or instrumentation, or, in other cases, had to supply a full melody or track for the model to write an accompaniment for. It has not been possible to freely edit a piece according to its high-level musical flow. Therefore, we integrate the two models to construct MuseMorphose, which exhibits both of their strengths and avoids their weaknesses.

In MuseMorphose, a Transformer encoder is tasked with extracting the musical skeleton of each bar of a piece as a vector (called a “latent condition” in the figure above). These skeletons are then concatenated with the bar-level style inputs from the user, i.e., the rhythmic intensity and polyphony levels, mapped to learned embedding vectors. Finally, the concatenated conditions enter the Transformer decoder sequentially, i.e., bar by bar, through the in-attention mechanism we developed, to produce the style-transferred piece.

This asymmetric encoder-decoder design ensures the fine granularity of the conditions while maintaining the Transformer’s inherent ability to generate long sequences. Moreover, the in-attention mechanism, which injects each bar-level condition into the corresponding timesteps at all layers of the Transformer decoder, is the key to effective conditioning.
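
To make the idea concrete, here is a minimal PyTorch sketch of in-attention conditioning. It is not the released MuseMorphose code: the layer internals are simplified (a stock encoder layer with a causal mask stands in for the decoder), and all tensor names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class InAttentionDecoder(nn.Module):
    """Sketch of in-attention conditioning: the bar-level condition vector is
    projected and added to the hidden states of every decoder layer, at the
    timesteps belonging to that bar."""

    def __init__(self, d_model=512, d_cond=256, n_layers=6, n_heads=8):
        super().__init__()
        self.cond_proj = nn.Linear(d_cond, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, x, bar_conditions, bar_ids, causal_mask):
        # x: (batch, seq_len, d_model) token embeddings
        # bar_conditions: (batch, n_bars, d_cond) latent + style embeddings per bar
        # bar_ids: (batch, seq_len) index of the bar each timestep belongs to
        cond = self.cond_proj(bar_conditions)                   # (batch, n_bars, d_model)
        cond_per_step = torch.gather(
            cond, 1, bar_ids.unsqueeze(-1).expand(-1, -1, cond.size(-1)))
        for layer in self.layers:
            x = layer(x + cond_per_step, src_mask=causal_mask)  # inject at every layer
        return x
```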

Further Materials

You may take a look at our paper to find more discussion of the architecture design, training objective, and evaluation. Our demo website provides even more compositions by MuseMorphose, as well as those by the baselines it outperforms, for you to listen to. Want to compose some music with MuseMorphose? No problem! Just check out our open-source implementation and pre-trained checkpoint.

If you find our work exciting, and have some thoughts/suggestions about it (e.g., what other style attributes may be added to MuseMorphose), feel free to drop us a mail. We are definitely looking forward to a lively discussion.

By: Shih-Lun Wu, Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

Compound Word Transformer: Generate Pop Piano Music of Full-Song Length

Over the past months, we have been attempting to let Transformer models learn to generate full-song music, and here is our first attempt toward that goal: the Compound Word Transformer. A paper describing this work will be published as a full paper at AAAI 2021, a premier conference in the field of artificial intelligence.

You can find a preprint of the paper, and the open source code of the model, from the links below:

We made two major improvements in this work. First, we present a new representation, Compound Word, which lets Transformer models accept longer input sequences. Second, we adopt the linear Transformer as our backbone model. The memory complexity of the linear Transformer scales linearly with the sequence length, which enables us to train on longer sequences with a limited hardware budget.

To the best of our knowledge, this is also the first AI model that deals with music at the full-song scale instead of music segments. Currently, we focus our model on learning to compose pop piano music, whose average length is approximately 3 to 5 minutes. Before diving into the technical details, let’s listen to some generated pieces:

All clips above are generated from scratch. Aside from this, our model can also read a lead sheet, which conveys only the melody line and chord progression, and translate it into an expressive piano performance:

How did we nail it? We revisited our previous work, REMI, and found that the way we represent music can be further condensed. If the required sequence length is shorter, we can feed longer music into the model. The figure below shows how we convert REMI into the Compound Word (CP) representation. Instead of predicting only one token per timestep, CP groups consecutive and related tokens and predicts them at once, so as to reduce the sequence length. Each field of a CP relates to a certain aspect of music, such as duration, velocity, or chord.
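
As a rough illustration (the token names and grouping below are ours, not the exact vocabulary used in the paper), grouping a REMI-style stream into compound words might look like this:

```python
# Hypothetical REMI-style stream for a chord change plus one note:
remi = ["Bar", "Position_1/16", "Chord_C_maj", "Tempo_110",
        "Position_1/16", "Pitch_60", "Duration_8", "Velocity_20"]

# The same information grouped into compound words: each CP bundles related
# fields that are predicted together in a single timestep, so the sequence
# becomes much shorter.
compound_words = [
    {"type": "metric", "beat": "1/16", "chord": "C_maj", "tempo": 110},
    {"type": "note",   "beat": "1/16", "pitch": 60, "duration": 8, "velocity": 20},
]
```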

To model such CP sequences, we designed specialized input and output modules for the Transformer. The figure below illustrates the proposed architecture in action at a certain timestep. On the right half, where the model makes token predictions, you can see that there are multiple feed-forward heads, each accounting for one field of a CP, corresponding to a single row of the CP shown in the figure above. On the left half, each field of the CP has its own token embedding; these embeddings are concatenated into one vector and then reshaped by a linear layer to become the final input to the Transformer.

Because each head concentrates on only one field of the CP, we have more precise control when modeling or generating music. For example, during training we can assign embedding sizes to token types according to their difficulty: larger embeddings for harder types such as duration and pitch, and smaller ones for easier types such as beats and bars. At inference time, we can adopt different sampling policies per head. For example, we can use a higher temperature to get more randomness in the prediction of velocity tokens, and a lower temperature for pitch tokens to avoid mistakes.
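
A minimal PyTorch sketch of such input/output modules is given below. The field names, vocabulary sizes, and embedding sizes are placeholders rather than the values used in the paper, and the backbone (linear Transformer) is omitted.

```python
import torch
import torch.nn as nn

# Per-field (vocabulary size, embedding size); illustrative numbers only:
# larger embeddings for "harder" fields such as pitch and duration.
FIELDS = {"type": (4, 32), "beat": (17, 64), "pitch": (86, 512),
          "duration": (17, 256), "velocity": (24, 128)}

class CPInputOutput(nn.Module):
    """Sketch of the CP input/output modules around a sequence backbone."""
    def __init__(self, d_model=512):
        super().__init__()
        self.embs = nn.ModuleDict({k: nn.Embedding(v, e) for k, (v, e) in FIELDS.items()})
        self.in_proj = nn.Linear(sum(e for _, e in FIELDS.values()), d_model)
        self.heads = nn.ModuleDict({k: nn.Linear(d_model, v) for k, (v, _) in FIELDS.items()})

    def embed(self, tokens):          # tokens: dict of (batch, seq) index tensors
        x = torch.cat([self.embs[k](tokens[k]) for k in FIELDS], dim=-1)
        return self.in_proj(x)        # concatenated embeddings -> backbone input

    def sample(self, hidden, temperatures):
        # hidden: (batch, d_model) backbone output at the current timestep
        out = {}
        for k in FIELDS:              # one feed-forward head per CP field
            logits = self.heads[k](hidden) / temperatures[k]
            out[k] = torch.multinomial(torch.softmax(logits, -1), 1)
        return out                    # e.g. higher temperature for velocity, lower for pitch
```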

The proposed model shows good training and inference efficiency. We can now train our model on a single GPU with 11 GB of memory within just one day. At inference time, generating a 3-minute song takes only about 20 seconds, which is much faster than real time.

Learning to generate full songs means that the model can take a whole song as input and knows when to stop when generating. However, the music generated by the current model still does not exhibit clear structural patterns, such as AABA form or repetitive phrases. We are working on this, hoping that one day our AI can write a hit song.

More examples of the generated music can be found at:

https://drive.google.com/drive/folders/1G_tTpcAuVpYO-4IUGS8i8XdwoIsUix8o?usp=sharing

 

By: Wen-Yi Hsiao, Jen-Yu Liu,  Yin-Cheng Yeh, Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

Guitar Transformer and Jazz Transformer

At the Yating Music Team of Taiwan AI Labs, we are developing new music composing AI models extending our previous Pop Music Transformer model (see the previous blog). In October 2020, we will present two full papers documenting some of our latest results at the International Society for Music Information Retrieval Conference (ISMIR), the premier international conference on music information retrieval and music generation.

 

The first paper, entitled “Automatic composition of guitar tabs by Transformers and groove modeling,” talks about a Guitar Transformer model that learns to generate guitar tabs.

Here are two tabs in the fingerstyle guitar idiom, generated fully automatically by this model with no human curation involved.

The highlight of this work is the design and incorporation of what we call “grooving” tokens into the representation we use for a piece of symbolic guitar music. Groove, which can in general be considered a rhythmic feeling arising from a changing or repeated pattern, or “humans’ pleasurable urge to move their bodies rhythmically in response to music,” is not explicitly specified in either a MIDI or a TAB file. Instead, groove is implied by the arrangement of note onsets over time. Therefore, existing methods for representing music do not involve groove-related tokens.

What we did was apply music information retrieval (MIR) techniques to extract a 16-dimensional vector representing the occurrence of note onsets over 16 equally-spaced quantized positions of a bar, and then use the classical k-means algorithm to cluster such 16-dimensional vectors from all the bars of all the pieces in our training data, leading to k (=32) clusters (we need clustering because otherwise there would be too many unique 16-dimensional vectors). We then treat these cluster IDs as “grooving tokens” and assign a grooving token to each bar of a music piece. In this way, what our Transformer model (specifically, Transformer-XL) sees during training is not only the note-related tokens but also these bar-level grooving tokens. It turns out that this considerably improves the quality of the generated music compared to a baseline model that does not use grooving tokens.
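
A rough sketch of the grooving-token extraction is shown below, assuming each bar has already been reduced to its set of quantized onset positions (the data loading and the exact variants used in the paper are omitted):

```python
import numpy as np
from sklearn.cluster import KMeans

def bar_onset_vector(note_onsets_in_bar, positions=16):
    """16-dim indicator of which of the 16 quantized positions in a bar
    contain at least one note onset (onsets given as positions 0..15)."""
    vec = np.zeros(positions)
    vec[list(note_onsets_in_bar)] = 1.0
    return vec

# Collect one vector per bar over the whole training corpus, then cluster.
# `all_bars` is a hypothetical list of per-bar onset-position sets.
rng = np.random.default_rng(0)
all_bars = [set(rng.choice(16, size=rng.integers(1, 8), replace=False))
            for _ in range(500)]
X = np.stack([bar_onset_vector(b) for b in all_bars])
kmeans = KMeans(n_clusters=32, n_init=10).fit(X)

# The cluster id of a bar becomes its "grooving token".
grooving_token = kmeans.predict(bar_onset_vector({0, 4, 8, 12})[None, :])[0]
```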

The following figure shows the result of a user study asking subjects to choose the best among three continuations generated by different models, with or without the grooving tokens, given a short human-made prompt. The result is broken down according to the self-reported guitar proficiency of the subjects. We can see that the professionals are aware of the difference between the grooving-agnostic model and the two groove-aware models (we implemented two variants here, a `hard grooving` model and a `soft grooving` model, which differ in the way we represent the musical onsets).

 

The second paper, entitled “The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures,” talks about a Jazz Transformer model that learns to generate Jazz-style lead sheets, using the Jazzomat dataset.

The focus of this paper is the development of objective metrics tailored for symbolic music generation tasks (rather than general sequence generation tasks). Specifically, we proposed the following metrics, two of which are sketched in code after the list:

  • Pitch Usage: Entropy of 1- & 4-bar chromagrams;
  • Rhythm: Cross-bar similarity of grooving patterns;
  • Harmony: Percentage of unique chord trigrams;
  • Repeated structures: Short-, mid-, & long-term structureness indicators (computed from the fitness scape plot);
  • Overall musical knowledge: Multiple-choice continuation prediction (given 8 bars, predict the next 8 bars).
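
As a rough illustration of two of these metrics, the toy functions below compute the percentage of unique chord trigrams and a simple cross-bar similarity over per-bar onset vectors; the exact formulations in the paper (entropy-based pitch usage, fitness scape plots, etc.) are more involved.

```python
import numpy as np

def unique_chord_trigram_ratio(chords):
    """Harmony metric: percentage of unique chord trigrams in a chord sequence."""
    trigrams = list(zip(chords, chords[1:], chords[2:]))
    return len(set(trigrams)) / max(len(trigrams), 1)

def cross_bar_groove_similarity(groove_vectors):
    """Rhythm metric: mean pairwise similarity of per-bar 16-dim onset vectors."""
    sims = []
    for i in range(len(groove_vectors)):
        for j in range(i + 1, len(groove_vectors)):
            a, b = groove_vectors[i], groove_vectors[j]
            sims.append(1.0 - np.abs(a - b).mean())  # simple 0-1 agreement score
    return float(np.mean(sims))

print(unique_chord_trigram_ratio(["C", "F", "G", "C", "F", "G", "Am", "F"]))
print(cross_bar_groove_similarity([np.random.randint(0, 2, 16) for _ in range(8)]))
```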

Implementations of these evaluation metrics can all be found in the `MusDr` repository listed above. They have also been integrated into `MusPy`, an open-source Python library for symbolic music generation developed by Hao-Wen Dong et al. at UCSD.

These objective metrics can help us form an idea of the quality of machine-generated music, dispensing with the need to run expensive user studies too many times. They also help elucidate the differences between machine-composed and human-composed music. For example, from the structureness indicators, we can clearly see that Transformer-XL based music composing models, which represent the current state of the art, still fall short of generating music with reasonable mid- and long-term structure. See the following figure for a comparison between the fitness scape plot of a piece composed by the Jazz Transformer (marked as `Model (B)`) and that of a human-composed piece.

We are actively using these new metrics to guide and improve our models. In particular, we are looking for ways to induce structure in machine-composed music. Let us know if you like what we are doing and/or have some ideas to chat about with us!

 

Ref:
[1] Yu-Hua Chen, Yu-Siang Huang, Wen-Yi Hsiao, and Yi-Hsuan Yang, “Automatic composition of guitar tabs by Transformers and groove modeling,” in Proc. Int. Society for Music Information Retrieval Conf. 2020 (ISMIR’20).
[2] Shih-Lun Wu and Yi-Hsuan Yang, “The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures,” in Proc. Int. Society for Music Information Retrieval Conf. 2020 (ISMIR’20).

By: Yu-Hua Chen, Shih-Lun Wu, Yu-Siang Huang, Wen-Yi Hsiao and Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

Missing facts and source classifier of daily news


Introduction

With the development of the Internet, one can conveniently get news online very quickly. Nonetheless, this is also risky, since a person might prefer a specific news medium over others, and that news medium may take a position when reporting while omitting facts it does not want people to know. For this purpose, we propose a system to detect the missing facts of a news report.


Method

Workflow

First, we group the news reports using their embeddings derived from the Universal Sentence Encoder (USE) [1]. Within each group, the news reports are highly related; in fact, most of them report the same event, as expected. Meanwhile, we extract a summary for each news report using the PageRank algorithm. Then, we further summarize each news group using the summaries of the reports in the group. Afterwards, we compare the summary of each news report with the group summary to obtain the missing facts of that report.

Grouping news

This part was implemented by Yu-An Wang, so I omit it here.

PageRank

For each news report, we split the raw content into sentences. Then we compute the similarity of each sentence pair using USE and edit distance. Once the similarity matrix is ready, we can rank the sentences with PageRank. We then choose the top k sentences as the report’s summary, where k is a hyperparameter.

By combining the summaries of the reports in a group, we can summarize the news group in a similar way.
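
A minimal sketch of this PageRank-based summarization is shown below, assuming a user-supplied `similarity_fn` that blends USE similarity and edit distance (the exact weighting is not specified in this post):

```python
import numpy as np
import networkx as nx

def summarize(sentences, similarity_fn, k=3):
    """TextRank-style extractive summary (sketch of the step described above).

    similarity_fn(a, b) should return a non-negative similarity score for two
    sentences, e.g. a blend of USE cosine similarity and normalized edit distance.
    """
    n = len(sentences)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                sim[i, j] = similarity_fn(sentences[i], sentences[j])
    graph = nx.from_numpy_array(sim)            # weighted sentence graph
    scores = nx.pagerank(graph)                 # PageRank over the graph
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # keep original sentence order
```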

Missing facts

We call each sentence in a news summary “a fact.” Now we can obtain the difference between a news summary and the group summary: the sentences that exist in the group summary but not in the news summary are the missing facts of the report. Conversely, the sentences that exist in the news summary but not in the group summary are the exclusive content of the report.


Application

島民衛星 (Islander): https://islander.cc/

Source Classifier for Daily News and Farm Post


Introduction

Sometimes the news media are not as neutral as expected. Because of these biases, we can train a satisfactory classifier; put differently, the classifier works because each news medium has its own specific writing style. We use BERT [2] as a contextual representation extractor and train a classifier to predict the source (news medium) of a news report given its content (or title). Beyond daily news, this kind of classifier can also be used to classify the source of content-farm posts, which are usually biased and often contain fake information.


Data

  • Daily news
    • # classes: 4
    • Taiwan’s four major newspapers: China Times (中國時報), United Daily News (聯合報), Liberty Times (自由時報), Apple Daily (蘋果日報)
    • Preprocessing
      • Sometimes the reporter name, the news medium itself, or some slogans are in the raw content. We filter them using hand-crafted rules.
  • Farm post
    • # classes: 16
    • mission-tw, qiqu.live, qiqu.world, hssszn, qiqi.today, cnba.live, 77s.today, i77.today, nooho.net, hellotw, qiqu.pro, taiwan-politicalnews, readthis.one, twgreatdaily.live, taiwan.cnitaiwannews.cn

Method

C-512

  • As shown in the figure below, this classifier is composed of a BERT model followed by a linear layer. It can handle inputs whose content length is at most 512 tokens. Note that the latent vector corresponds to the [CLS] token (see the sketch below).
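
A minimal sketch of C-512 using the Hugging Face transformers library (the library choice and the `bert-base-chinese` checkpoint are our assumptions; the post does not specify them):

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class C512(nn.Module):
    """Sketch of the C-512 classifier: BERT + a linear layer on the [CLS] vector."""
    def __init__(self, n_classes=4, pretrained="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]   # vector at the [CLS] position
        return self.classifier(cls_vec)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
batch = tokenizer(["範例新聞內容"], truncation=True, max_length=512,
                  padding=True, return_tensors="pt")
logits = C512()(batch["input_ids"], batch["attention_mask"])
```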

C-whole

  • Since the pretrained BERT model handles inputs of at most 512 tokens, we propose a method to deal with longer inputs (see the sketch below).
  • First, we train a C-512 model. The C-whole model uses the BERT module of the trained C-512 model as its representation extractor. Given an input of arbitrary text length l, we split it into ⌈l/512⌉ segments. The representation of each segment is derived by the BERT module and then passed through a linear layer. The outputs of the linear layer are averaged, and the resulting vector is fed into another linear layer to obtain the final output of the C-whole model.
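
A corresponding sketch of C-whole, reusing the BERT module of a trained C-512 model; the segmentation and pooling details are simplified:

```python
import torch
import torch.nn as nn

class CWhole(nn.Module):
    """Sketch of C-whole: reuse the BERT module of a trained C-512 model,
    encode each <=512-token segment separately, average, then classify."""
    def __init__(self, trained_c512, n_classes=4):
        super().__init__()
        self.bert = trained_c512.bert            # representation extractor from C-512
        hidden = self.bert.config.hidden_size
        self.segment_proj = nn.Linear(hidden, hidden)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, segment_ids, segment_masks):
        # segment_ids/masks: lists of (1, <=512) tensors, one per ceil(l/512) segment
        seg_vecs = []
        for ids, mask in zip(segment_ids, segment_masks):
            cls_vec = self.bert(input_ids=ids, attention_mask=mask).last_hidden_state[:, 0]
            seg_vecs.append(self.segment_proj(cls_vec))
        pooled = torch.stack(seg_vecs).mean(dim=0)   # average over segments
        return self.classifier(pooled)
```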

Application

島民衛星 (Islander): https://islander.cc/


[1] Cer, Daniel, et al. “Universal Sentence Encoder.” arXiv preprint arXiv:1803.11175 (2018).

[2] Devlin, Jacob, et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv preprint arXiv:1810.04805 (2018).

Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions

Paper (ACM Multimedia 2020):  https://arxiv.org/abs/2002.00212 (pre-print)

Code (GitHub):  https://github.com/YatingMusic/remi

 

We’ve developed Pop Music Transformer, a deep learning model that can generate pieces of expressive pop piano music several minutes long. Unlike existing models for music composition, our model learns to compose music over a metrical structure defined in terms of bars, beats, and sub-beats. As a result, our model can generate music with a more salient and consistent rhythmic structure.

 

Here are nine piano performances generated by our model in three different styles. While generating the music, our model takes no human input (e.g., prompt or chord progression) at all. Moreover, no post-processing steps are needed to refine the generated music. The model learns to generate expressive and coherent music automatically.

 

 

From a technical point of view, the major improvement we have made is to invent and employ a new approach to representing musical content. The new representation, called REMI (REvamped MIDI-derived events), provides a deep learning model with more contextual information for modeling music than the existing MIDI-like representation. Specifically, REMI uses position and bar events to provide a metrical context for models to “count the beats.” And it uses supportive musical tokens to capture high-level musical information such as tempo and chord. Please see the figure below for a comparison between REMI and the commonly adopted MIDI-like token representation of music.
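
As a rough illustration (the token names below are ours, not the exact REMI vocabulary), the same short passage might be encoded as follows in a MIDI-like event stream versus a REMI-style stream:

```python
# MIDI-like stream: time is implicit in time-shift events, no bar/beat context.
midi_like = ["NoteOn_60", "TimeShift_240", "NoteOff_60", "TimeShift_240",
             "NoteOn_64", "TimeShift_480", "NoteOff_64"]

# REMI-style stream: explicit Bar and Position events give a metrical grid,
# plus supportive Tempo and Chord tokens for high-level context.
remi_like = ["Bar",
             "Position_1/16", "Tempo_110", "Chord_C_maj",
             "Position_1/16", "NotePitch_60", "NoteDuration_4", "NoteVelocity_20",
             "Position_9/16", "NotePitch_64", "NoteDuration_8", "NoteVelocity_20"]
```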

 

The new model can generate music with explicit harmonic and rhythmic structure, while still allowing for expressive rhythmic freedom (e.g., tempo rubato). While it can generate chord events and tempo change events on its own, it also provides a mechanism for human users to control and manipulate the chord progression and local tempo of the music being generated as they wish.

 

The figure below shows the piano rolls of piano music generated by two baseline models (the first two rows) and the proposed model (the last row), when these models are asked to “continue” a 4-bar prompt excerpted from a human-composed piece. We can see that the proposed model continues the music better.

 

The figure below shows the piano roll of a generated piece when we constrain the model not to use the same musical chord (F:minor) as the 4-bar prompt.

 

The paper describing this new model has been accepted for publication at ACM Multimedia 2020, the premier international conference in the field of multimedia computing.  You can find more details in our paper (see the link below the title) and try the model yourself with the code we’ve released!  We provide not only the source code but also the pre-trained model for developers to play with.

 

More examples of the generated music can be found at:

https://drive.google.com/drive/folders/1LzPBjHPip4S0CBOLquk5CNapvXSfys54

 

Enjoy listening!

 

By: Yu-Siang Huang, Chung-Yang Wang, Wen-Yi Hsiao, Yin-Cheng Yeh and Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

Telling Something from Your Face: Age, Beauty, and Anti-Spoofing

Figure 1. Demonstration of facial age, beauty, and pose estimation.

In the last few decades, facial analysis has been thoroughly explored due to its huge commercial potential. For example, learning the semantic attributes of a human face can help us target potential customers more easily and more efficiently. While face recognition is a well-established technology for surveillance systems and security control, it is also crucial to know whether the captured image comes from a real person or a spoofing device. In this blog post, we discuss what we do at Taiwan AI Labs and our demo system.

Predicting a person’s age or beauty from appearance is a very subjective task. A person can look younger than their real age, while beauty is not even a quantifiable value. Nevertheless, knowing the approximate apparent age and beauty is still helpful for merchandise recommendation. For instance, recommending soft drinks to an elderly person may not be a good idea, nor is placing a massage chair in a children’s play zone.

There are several ways to estimate these two values with proper supervision. With classification, we can use independent bins, each representing a range of the age or beauty score; the drawback of such a method is that the ranges are set manually, which can lead to quantization error. On the other hand, it is also natural to use regression to predict continuous values, but this can lead to overfitting without constraints. In addition, the face pose and resolution have a huge impact on prediction performance. These challenges make the prediction of age and beauty even more difficult.

 

 

Method:

Age and Beauty estimation

To resolve the above challenges at very low cost, the demo system is built upon the recently published FSA-Net [1] from CVPR 2019. It adopts the Soft-Stagewise Regression (SSR) scheme [2] to eliminate the quantization error while maintaining low memory overhead.

For training, we choose 128x128x3 as the cropped face resolution and add an auxiliary loss that provides quantized (binned) supervision.
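
The core idea behind such soft-bin regression can be sketched as follows; note this is a simplified expectation-over-bins illustration, not the exact multi-stage SSR formulation used in FSA-Net, and the bin layout is a placeholder:

```python
import torch
import torch.nn.functional as F

def soft_regression_age(logits, bin_centers):
    """Simplified soft-bin regression: the predicted age is the expectation of
    the bin centers under the softmax distribution."""
    probs = F.softmax(logits, dim=-1)
    return (probs * bin_centers).sum(dim=-1)

bin_centers = torch.arange(0, 100, 10).float() + 5          # 10-year bins
logits = torch.randn(4, 10)                                  # batch of 4 faces
pred_age = soft_regression_age(logits, bin_centers)          # continuous ages

# Auxiliary quantized supervision: a classification loss over the same bins.
target_age = torch.tensor([23., 41., 67., 8.])
target_bin = (target_age // 10).long()
aux_loss = F.cross_entropy(logits, target_bin)
```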

A drawback of such a scheme is that the prediction may be unstable across different input images. We use a sequential frame-selection pipeline to stabilize the final prediction.

Figure 2. Sequential frame selecting pipeline.

Finally, a detected face is considered valid and passed to the estimation pipeline only when it has enough resolution and a very small pose angle. The replacement policy can be altered, as long as the selected face image is of high quality with enough resolution.

 

Face anti-spoofing

To achieve anti-spoofing with a pure RGB image, we divide the process into two tasks: cell phone detection, and denoising-based anti-spoofing estimation.

Figure 3. Face anti-spoofing pipeline.

We adopt the well-known YOLOv3 [3] as the detector for cell phones, laptops, and monitors. In task 1, a face detected inside a phone area is considered a spoof. Task 2 uses the quality and the noise of the image to determine whether the face is real or fake.
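
A toy sketch of the decision logic combining the two tasks (the bounding-box format, the noise score, and the threshold are all placeholders):

```python
def inside(inner, outer):
    """Return True if bounding box `inner` lies inside `outer` (x1, y1, x2, y2)."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1] and
            inner[2] <= outer[2] and inner[3] <= outer[3])

def is_spoof(face_box, device_boxes, noise_score, noise_threshold=0.5):
    """Toy decision rule combining the two tasks described above.

    Task 1: a face detected inside a phone/laptop/monitor box is a spoof.
    Task 2: otherwise, fall back to an image-quality / noise score from the
            denoising-based estimator.
    """
    if any(inside(face_box, box) for box in device_boxes):
        return True
    return noise_score > noise_threshold

print(is_spoof((120, 80, 220, 200), [(100, 60, 400, 500)], noise_score=0.2))
```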

 

Demo images:

The following demo shows that our system can predict age and beauty with decent accuracy.

 Figure 4. Demo images for age and beauty estimation.

 

Figure 5. Demo image for face anti-spoofing. (Red: Fake. Blue: Real.)

Summary:

Face attribute estimation, such as age and beauty, is subjectively determined by the labeled data, but it is still useful for commercial analysis and recommendation. For face recognition, anti-spoofing is also very important for the security and robustness of the whole identity-verification pipeline. We built these prototypes to show that there is much more potential in these topics and that AI can truly help people make better decisions with these estimations.

 

Reference:

[1] Yang, Tsun-Yi, et al. “FSA-Net: Learning Fine-Grained Structure Aggregation for Head Pose Estimation from a Single Image.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

[2] Yang, Tsun-Yi, et al. “SSR-Net: A Compact Soft Stagewise Regression Network for Age Estimation.” IJCAI. Vol. 5. No. 6. 2018.

[3] Redmon, Joseph, and Ali Farhadi. “Yolov3: An incremental improvement.” arXiv preprint arXiv:1804.02767 (2018).

Compete for the NT$20 Million Reward in the AI Chinese Comprehension Championship Challenge – 2020 “Formosa Grand Challenge – Talk to AI” –

2020 “Formosa Grand Challenge – Talk to AI”

Date: September 9, 2019

The 2020 “Formosa Grand Challenge – Talk to AI” officially kicks off today (September 9) and begins accepting applications. The Ministry of Science and Technology invites experts to challenge the Band C level of the reading comprehension test of the “Test of Chinese as a Foreign Language” (TOCFL), competing for a first prize of NT$20 million.

Unlike the previous, first “Talk to AI” competition, this challenge will be conducted in a single-choice mode, but will include voice reading and continuous dialogue tests to fully challenge AI’s ability to comprehend the Chinese language. The goal is to have AI read subject content from different academic fields, test its comprehension and inference over words and sentences, and further organize the whole dataset into applicable knowledge to achieve AI self-learning. The competition will be held in two stages, with the preliminary contest scheduled for December 2019 and the final round for April 2020.

Dr. Wen Lian Hsu and his research team at the Academia Sinica Institute of Information Science are entrusted with preparing the test dataset, working with the National Academy for Educational Research to co-establish the dataset for training comprehension and dialogue. It is scheduled to be opened for use by the contestant teams in October 2019. In addition, the speech-to-text API uses Ya Ting Verbatim from AI Labs. Ya Ting Verbatim is developed by AI Labs, led by Ethan Tu. It is a great honor for the Formosa Grand Challenge to cooperate with AI Labs.

To help AI take root, this “Talk to AI” competition will also hold a FUN CUP team event on November 16 to encourage general and vocational high school students, as well as college students, to participate. The FUN CUP will work with the Association for Computational Linguistics and Chinese Language Processing to help young students understand the application of AI human-machine dialogue, use existing commercial AI tools to train machines to learn, and fully utilize existing voice materials and resources, with the goal of achieving small-scale scientific research results of a certain quality.

The Ministry of Science and Technology will continue to enlarge the voice dataset during the competition, allowing the teams to carry out technology development and testing. The research team of Associate Professor Yuan-Fu Liao of National Taipei University of Technology is invited to do the speech data transcription, releasing about 600 hours of AI voice data to the data market of the National Center for High-Performance Computing (NCHC DATA MARKET), to be used by paid, authorized users. The fee is NT$2,000 per 150 hours.

Through the “Formosa Grand Challenge” competition, the Ministry of Science and Technology hopes to encourage innovators to use the potential of AI development, technology, and creativity to solve the challenges of voice applications, and looks forward to any possible progress and new thinking. It also expects that the event will attract more enterprises as well as academic and research institutions to get involved and work together to advance Taiwan’s AI voice recognition technology and assist Taiwanese enterprises in digital transformation.

 



AI Labs released an annotation system: Long live the medical diagnosis experience.

The Dilemma of Taiwan’s Medical Laboratory Sciences

Thanks to the Bureau of National Health Insurance in Taiwan, abundant medical data are properly recorded. This is surely good news for us, an AI-based company. However, most of the medical data have not been labeled yet. What’s worse, Taiwan currently faces a serious shortage of medical talent: the number of experienced masters of medical laboratory science is getting smaller and smaller. Take malaria diagnosis for example. Malaria parasites belong to the genus Plasmodium (phylum Apicomplexa). In humans, malaria is caused by P. falciparum, P. malariae, P. ovale, P. vivax, and P. knowlesi. It is undoubtedly arduous work for a human to detect the affected cells and classify them into these five classes. Unfortunately, there is only one retiring master in this field in Taiwan who can truly confirm the correctness of such a diagnosis. We must take remedial action right away, yet it costs too much time and money to train a human being to become a malaria master. Only through the power of technology can we preserve this valuable diagnostic experience.

Now, we have decided to solve the problem by transferring human experience to machines, and the first step is to annotate the medical data. Since the sole master cannot handle the overwhelming amount of data by himself, he needs helpers to do the first-pass detection, with the master making the final confirmation at the end. In this case, we need a system that allows multiple users to cooperate. It should be able to record the file path of the labeled data, the annotators, and the labeling time. We searched assorted off-the-shelf annotation systems, but, unfortunately again, none of them met our specifications. So we decided to roll up our sleeves and adapt the most relevant one for our purpose.

An Improved Annotation System

Starting from an open-source annotation system [1], AI Labs revised it and released a new, easy-to-use annotation system to help those who want to create valuable labeled data. With our labeling system, you know who the previous annotator was and can systematically revise others’ work.

This system is designed for object labeling. By creating a rectangular box on your target object, you can assign a label name and label coordinates to the chosen target. See the example below.

You will also obtain the label categories and the label coordinates in an XML file in PASCAL VOC format. You can then use the output XML file as the input of your machine learning programs.
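
For example, a minimal Python snippet for reading the resulting PASCAL VOC XML back into a training pipeline might look like this (the file path is hypothetical):

```python
import xml.etree.ElementTree as ET

def read_voc_labels(xml_path):
    """Read object names and bounding boxes from a PASCAL VOC XML file
    produced by the annotation tool."""
    root = ET.parse(xml_path).getroot()
    labels = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        labels.append({
            "name": obj.findtext("name"),
            "xmin": int(box.findtext("xmin")),
            "ymin": int(box.findtext("ymin")),
            "xmax": int(box.findtext("xmax")),
            "ymax": int(box.findtext("ymax")),
        })
    return labels

# e.g. read_voc_labels("annotations/sample_blood_smear.xml")
```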

How does it work?

Three steps: Load, Draw and Save.

In Load: Feed an image into the system. It is fine if you do not have an XML file yet, since it is your first time operating the system.

In Draw: Create as many labels in an image as you want. Don’t forget that you can zoom in/out if the image is not clear enough.

In Save: Click the Save button and everything is done. The system will output an XML file including all the label data for the image.

What’s Next?

With sufficient annotated data, we can then train our machines on the labels annotated by the medical master, which will enable the machines to make diagnoses as brilliant as those of the last master. We will keep working on it!

Citation

[1] Tzutalin. LabelImg. Git code (2015). https://github.com/tzutalin/labelImg