Speaker Verification and Speaker Retrieval

A. Introduction

Speaker verification is the task that given two segments of audio signal, we are supposed to verify whether both signals came from the same identity of the speaker or not. While inferencing, we first transform the audio signal into the corresponding speaker embedding, then calculate the similarity between two embeddings which is used to compare with the certain threshold later.

Speaker Retrieval is the task that given a paragraph of audio content, usually long, and a segment of audio represented one specific speaker identity as the input, we have to precisely point out the corresponding parts in the audio paragraph which owns the same identity as the input signal. In our experiment setting, we construct a pool set which includes a plenty of voice segments of several speakers, along with a target set which is composed of voice segments of speakers same as those in the pool set, each speaker has one corresponding segment within the target set.

In this article, we will first introduce the results of the speaker verification task and several skills to improve the performance, then show a new method that add a quantization part in the loss function, in order to fasten the process and reduce the memory usage in the speaker retrieval task.


B. Speaker Verification

First, we utilize four different speaker embedding models [1][2][3][4] on three datasets [5][6][7] and evaluate the performance of the results through two metrics. Also note that pyannote.audio (pyannote) and Efficient-TDNN (sugar) are both pretrained on the voxceleb dataset, while ECAPA-TDNN (spkrec) is pretrained on the cnceleb dataset.

In the result, we can quickly observe that the language information somehow makes an impact on the performance of the speaker verification task. To alleviate such influence, we implement several skills toward the speaker embeddings generated from the pre-trained models, including the concatenation and domain adaptation.

The table below shows the speaker verification results generating from the concatenated embeddings. The corresponding models are pretrained Efficient-TDNN (A), pretrained ECAPA-TDNN (B), Efficient-TDNN trained on cnceleb for 12 epochs (C), and Efficient-TDNN trained on cnceleb for 64 epochs (D).

To realize the domain adaptation [8], we subtract the mean of the whole dataset from each embedding before the comparison. This also works even when the dataset that used to generate the mean is different from the target dataset.

Also, since we are more interested in the performance on Chinese datasets, we train the Efficient-TDNN, which is three times smaller than the ECAPA-TDNN, with the cnceleb dataset. What’s more, we further train the Efficient-TDNN with both voxceleb and cnceleb datasets to see whether such model has the ability to handle both languages itself.

The table below shows the results on the cnceleb dataset by using the embeddings from the Efficient-TDNN trained with such dataset under dynamic architectures and different training epochs. We list the results of training the Efficient-TDNN model with both voxceleb and cnceleb datasets, too.


C. Speaker Retrieval

When it comes to details of the new-added quantization part, the overall loss function can be separated into two terms. The first term is the cross-entropy loss for the classification during training, and the second term is the L2 loss, which is added as a quantization loss. The target of the L2 loss is the integer 1 or -1, which forces the content of embeddings to be as close to 1 or -1 as they can, thus leads to a smaller influence while directly clamping them to 1 and -1. [9][10]

We use only the cnceleb dataset to fine-tune the Efficient-TDNN and starts from the checkpoint which has trained 64 epochs on the very model and dataset, but with the original loss function. First, we show the result on the speaker verification task.

For the speaker retrieval task, we utilize three different architectures to generate embeddings and inference on three datasets. The metric to evaluate the performance is mAP, which is defined as:

The detailed of three architectures are listed below. The first architecture is composed by HuBERT, an SSL model, followed with an additional linear layer, and a hyperbolic tangent layer. [11] The output of the hyperbolic tangent layer then uses to calculate the L2 quantization loss and used to calculate the cross-entropy loss after passing another linear layer. The second architecture uses the fine-tuned Efficient-TDNN mentioned above rather than the SSL model. Though the third architecture seems to be similar to the second one, it uses the 64-epoch pretrained model with the original loss function and thus generating floating number embeddings as a baseline for the quantization method. Note that the first architecture is trained on the voxceleb1 dataset while the other two are trained on the cnceleb dataset.

After the embeddings are generated, there are two ways for us to calculate distances between voice segments in the target set and those in the pool set, cosine distance for embeddings with floating numbers and hamming distance for those with quantized numbers. For the number of files in the pool set, we choose 10 files for each speaker in voxceleb2 and cnceleb datasets and choose all of the files in the voxceleb1 dataset. The experiment result is shown as below.

Despite that the performance of the quantization method has an approximately 11%~14% decrease compared to the baseline, the computation speed can gain a nearly 60% increase while calculating the distance and comparing through the pool set. What’s more, the memory usage can be significantly reduced, since a k-bit floating number requires k times of storage spaces than a boolean number.


D. Reference

[1]     pyannote.audio: neural building blocks for speaker diarization


[2]     A Comparison of Metric Learning Loss Functions for End-To-End Speaker Verification


[3]     EfficientTDNN: Efficient Architecture Search for Speaker Recognition


[4]     SpeechBrain: A General-Purpose Speech Toolkit


[5]     VoxCeleb: a large-scale speaker identification dataset


[6]     CN-CELEB: a challenging Chinese speaker recognition dataset


[7]     FFSVC

FFSVC 2022

[8]     The SpeakIn System for Far-Field Speaker Verification Challenge 2022

speakin_22.pdf (ffsvc.github.io)

[9]     Scalable Identity-Oriented Speech Retrieval

Scalable Identity-Oriented Speech Retrieval | IEEE Journals & Magazine | IEEE Xplore

[10]   Deep Hashing for Speaker Identification and Retrieval

INTERSPEECH19_DAMH.pdf (jiangqy.github.io)

[11]    Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

[2204.12765] Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition? (arxiv.org)




Enhance MIDI Generation with Harmonic and Rhythmic Features

Music generation at Taiwan AI Labs is based on the generation of note sequences. This approach preserves most details of a piece of music, but to the human ear, music is not only a set of notes, but the patterns that are formed by them. This is exactly what chorder and groover aim to achieve; they are packages designed to extract information on harmony and groove respectively.

From extracting harmonic and rhythmic features, computers are now able to look at a piece of music on a larger scale than plain notes. However, readers should still be aware of the fact that there is no 100% objective right and wrong in the perception of music, so the harmonic and rhythmic features are far from being the sole correct analysis to a piece of music.


The main feature of chorder is chord detection from MIDI files. To successfully accomplish this task, chorder uses a 12-dimensional vectors to represents semitone distribution. For a certain time period, there is a vector v that sums up the duration of each note by their pitch class. For reference, there are different weight vectors w for different chord qualities. For example,

w_{\text{major}} = [1.0, -0.2, -0.1, -0.2, 1.0, -0.5, -0.2, 1.0, -0.2, -0.1, -0.2, 0.0]

Meaning semitones that are 0, 4, and 7 semitones away from the root note of a major chord are the most contributing factors, while the semitone that is 5 semitones away from the root is a reverse indicator. The quality and the pitch class of the root can then be expressed as follows:

\text{argmax}_{R \times Q} \text{  } v_r \cdot w_q

Where r is an integer between 0 to 11, representing the pitch class of the root. vr is v rotated left for r positions to find the root that best fits the chord’s weights wq. For now, Q contains six type of basic chords: major, minor, diminished, augmented, sus2 and sus4. One thing to note is that a segment will be determined as no chord if the dot product of vr and wq is less than the total duration of the segment.

The bass note, different from the root note, is simply the lowest notes found in the segment that has a combined duration of at least 1/8 of the duration of the entire segment. With knowledge of bass note and basic quality, certain rules are applied to correct the quality of the chord or to add seventh notes. For example, a Fmaj chord with D on the bass should be considered a Dmin7 chord instead.

The lengths of segments chorder use is 1 and 2 beats. If the 2-beat segment has a higher alignment score than both 1-beat segments, the chord of the 2-beat segment is applied to the 2 beats. The 2 beats will be assigned their separate chords if that is not the case.


Unlike chords, there are no symbols on grooving that are universally agreed on. That’s why in groover,  the grooving representations are simply classification of rhythmic patterns derived from a given MIDI dataset.

The rhythmic patterns are set at a certain length (for example, a measure), and divided into quantized periods. Each note contributes an intensity value to the pattern, and the intensity increases with lower pitch and higher note velocity. Then, the patterns are clustered, not with Euclidean distance, but a modified version of cosine similarity. The modification is to do a “blurring” before calculating cosine similarity. Let v(i, l) be the left shift of v for i positions and v(i, r) be the right shift of v for i positions, then we can define the blur of n positions as

v_{\text{blurred}} = v + \sum_{i=1}^{n} (1 – \frac{i}{n})(v_{i, l} + v_{i, r})

The magnitude can be altered to make it a weighted average, but for the purpose of applying cosine similarity, this is not necessary. The blurring makes positions that are closer to each other value more, instead of a mere 0 if they are not exactly the same. With this customized similarity, we can now apply clustering algorithms such as k-means, which is the one currently used by groover to label rhythmic patterns.

Below is an audiation of harmonic and rhythmic features of an 8-bar excerpt. The velocity of the chords are representative of the intensity value the detected pattern holds.

The original 8-bar excerpt:

Its audiation of harmonic and rhythmic features:


By:  Joshua Chang and Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

Investigating the Impact of Facebook Polarization on the 2021 Taiwanese Referendum

Taiwan has four important upcoming referendums on December 18, 2021. “Pork Imports”, “Referendum Dates”, “Nuclear Plant”, “Algae Reef Protection.” According to previous research, disinformation and manipulation of the media could influence public opinion on topics like these referendums. This study will examine how Facebook has affected the polarization of public opinion on these topics over the previous eleven months. The phenomenon of public opinion polarization and the factors affecting policy support and political attitudes will be analyzed using artificial intelligence technology. The polarization impact of pages that demonstrated evidence of potential media manipulation through coordinated behavior were considered. Within these pages, those that were shared most often are referred to as “amplifiers.”


Exploring Atypical Online Coincidental Behavior on PTT

This study focuses on atypical coincidental behavior on the Taiwan social media, PTT1, to discover attempts to manipulate public opinion during the outbreak of the COVID-19 in Taiwan from May through August 2021. The research team aims to identify atypical coincidental behaviors to uncover suspicious collaborative efforts which attempt to manipulate public opinions, together with developing AI tools to analyze the information comprehensively and efficiently over the outbreak period and assists researchers by saving human labor and time.

Since its launch in 1995, PTT has become one of the most used Chinese-language online services. In Taiwan, many users make it a habit to post news and information from various sources, which leads to a diverse spectrum of discussions, and often for many, the discussion board becomes the first stop of the news outlet. Almost all the information related to Taiwan can be discovered on PTT. The information on PTT is discussed by users with their opinions and ideas.

Such discussions are often later presented in news media for larger public consumption. This effect not only amplifies these ideas for a wider audience, but it also affects businesses and governmental institutions during their decision-making process when they need to adhere to public opinions. For example, when whistle-blower Dr. Wenliang Li of China first broke evidence in Chinese media of human-to-human transmission of COVID-19, a Taiwanese physician began sharing the information on PTT. Because of these discussions, authorities at the Taiwan Centers for Disease Control were alerted and began to take action much earlier than other countries. In this way, one could compare PTT to a speech stand in the center of the town square, upon which users can take space to give public addresses where the majority of the townsfolks are the audience. The more time someone spends on the speaker stand, the greater potential influence they will have on the town. Users who have the power to influence discussions could be likened to a bullying majority dominating the use of the speaker stands for their own ends.

Since 2018, Taiwan media and academia have been observed the potential of cyber armies conducting strategic information operations through social networks such as PTT, Line, and Facebook. A Pilot Study on PTT in the Context of Influence Operations introduces the interface, functions, and terminology of PTT. Other studies identified groups of cyber armies on the Gossiping Board of PTT. Compared to previous papers, this research contains not only case studies but also a data-driven and evidence-based approach, comprehensively quantifies “atypical coincidental behavior” and summarizes the differences between user groups.

The research team grouped users by the phi-coefficient score to measure coincidental behavior in many aspects, including shared IP addresses, time pattern of activity, narratives, emotion, and incitement of comments. Furthermore, the study compared several metrics of behavior among user groups in various events2 to discover the evidence of information manipulation and observed the correlation between ideological slogans used and groups of users. By observing the opinion manipulation on social media, users may be able to more readily distinguish atypical coincidental behavior and therefore decrease their chances of being manipulated. Through exposing more context behind user content, this research hopes to decrease the negative impact of atypical coincidental behavior on public opinion.

Furthermore, the researchers compared the behavior of coincidental users. Between the shared IP addresses, active times, narratives, etc., the researchers conclude that these patterns reflect more than just random chance. The evidence shows those coincidental users were engaged in various forms of intentional, collaborative effort in their posts, comments, and other behaviors. Although one cannot claim for certain that the intent of these users is to manipulate public opinion, the researchers cannot conclude any other plausible explanation to justify such behavior because it is so different from typical general users.

The following collaborative behaviors were observed during this time period:

  1. Users in each coincidental user groups were active during similar times
  2. Coincidental users were more likely to participate in events with more comments, and in events with higher incite scores
  3. Coincidental users showed much higher rates of manipulative patterns
  4. Each coincidental user group demonstrated different preferences in their patterns of behavior
  5. Some coincidental user groups favored just one negative/manipulative slogan, while others used more than one, or none at all.
  6. Different coincidental user groups favored different specific narratives and word choices.

In summary, the researchers want to emphasize that behavior indicative of the intent to manipulate opinions on social media platforms appears to be very active. In the time period selected for this study, there were 880 atypical events out of 1,985 targeted events. This means that there were collaborative users demonstrating intentionally manipulative behaviors in almost half of all targeted events. This report also showed that manipulation could and does happen on a variety of topics, including sporting, business, entertainment, politics, etc. The research team believes that more work should be done to study and catalog atypical, manipulative behaviors on social media across a variety of platforms. We believe that users of social media platforms should be made aware of collaborative, manipulative behavior in order to know when others may be attempting to influence public opinion.

to download the complete version of the research paper, please submit the request form at the end of this blog


This study focused on atypical coincidental behavior attempting to uncover the types of events that were manipulated via social media and the ways in which they were manipulated. To do so, the important events that happened during the Taiwanese outbreak of the COVID-19 pandemic, as well as the users with coincidental behavior, had to be identified to study the interactions between the events and the coincidental users.

Figure 1 presents an overview of the methodology framework. With the framework provided, all the data posted on the PTT Gossiping Board from May 1, 2021, to August 31, 2021, including 130,099 users and 8,413,675 comments on 293,370 posts, were analyzed to detect events and group users. Additionally, through the comment content analysis, features were extracted for the study of the interactions between the events and the coincidental users. With the results obtained from event user analysis and comment content analysis, coincidental users were first analyzed, then coincidental groups were analyzed.

GitHub: https://github.com/ailabstw/opinion-analysis


1 PTT is the largest local forum in Taiwan.

2 This research defines an ”event” as the sum total of collected news articles on a topic, combined with all social media reactions to it, on given platforms. For more details about how the data was clustered, please download the complete essay and refer to Chapter II Methodology.

The Challenge of Speaker Diarization


Speaker Diarization is the task to partition audio recordings into segments corresponding to the identity of the speaker. In short, a task to identify “who spoke when”. Speaker diarization can be used to analyze audio data in various scenarios, such as business meetings, court proceedings, online videos, just to name a few. However, it is also a very challenging task since characteristics of several speakers have to be jointly modelled.

Traditionally, a diarization system is a combination of multiple, independent sub-modules, which are optimized individually. Recently, end-to-end deep learning methods are becoming more popular. The metric for speaker diarization is diarization error rate (DER), which is the sum of false alarm, missed detection and confusion between speaker labels.

In this article, we will introduce our upcoming speaker diarization system first, then give an overview of the latest research in end-to-end speaker diarization.


Our Method

Our product uses a traditional diarization pipeline, which consists of several components: speech turn segmentation, speaker embedding extraction, and clustering. We utilize pyannote.audio [2], an open-source toolkit for speaker diarization, to train most of our models. 


1. Speech Turn Segmentation

The first step of our diarization system is to partition the audio recordings into possible speaker turn segments. A voice activity detection model is used to detect speech regions, while removing non-speech parts. Speaker change detection model is used to detect speaker change points in the audio. Each of these models are trained to optimize a sequence labeling task: With sequence of audio features as input, output a sequence of labels.

For both models, there are some tunable hyperparameters which determine how sensitive the models are on segmenting the audio. For the speaker change detection model, only audio frames with detection score higher than the threshold “alpha” are marked as speaker change points. We notice that it is generally more beneficial to segment more aggressively (i.e. split the whole audio into more segments) in the speech turn segmentation stage, so as to make sure most segments only contain one speaker. After the clustering stage, we can merge segments assigned to the same speaker for better Speech Recognition performance in following stages.


2. Speaker Embedding Extraction

After performing speech turn segmentation, in order to facilitate clustering of speaker segments in the clustering module, a speaker embedding model is used to obtain a compact, fine-grained representation for each segment.

The model we choose is SincTDNN which is basically a x-vector architecture where filters are learned instead of being handcrafted. Additive angular margin loss is applied to train the model. 

We utilize CNCELEB [3], an open source large-scale chinese speaker recognition dataset, to fine tune the model pretrained on voxceleb. CNCELEB is a challenging, multi-domain dataset consisting of 3,000 speakers in 11 different genres. By using such a diverse dataset, we expect our model to be stable enough when facing various real life scenarios. We also notice that only very few epoch is needed to fine tune the model, and that speaker embedding with lower EER may not always imply lower diarization error rate.


3. Clustering

Traditional cluster methods can be used to cluster the speaker embedding, identifying which speaker each segment belongs to. In our system, we leverage affinity propagation for short audio, and KMeans for long audio. Since we do not know how many speakers are in the recording, affinity propagation has relatively better performance than KMeans as it can determine the number of clusters directly. While for KMeans, we estimate the best number of clusters by finding the elbow of derivatives in MSCD. However, we resort to KMeans instead of affinity propagation for longer audio files since affinity propagation tends to be slower in this situation.

Combining all components

With all the aforementioned components trained, we can combine them into a complete pipeline for diarization. Several hyperparameters are optimized jointly to minimize diarization error rate on a certain dataset. 

Compared to our previous system, the current system obtains around 30% DER relative improvement on our internal dataset flow-inc, which consists of a few thousands news recordings.  Most improvements stem from improved speaker embedding and clustering method, which improved from 15% to 8% DER for oracle speech turn segmentation setting.

In our production system, we also allow some extra customization for our users. For instance, if the user provides the number of speakers in the recording, we can use KMeans to recluster according to the number of speakers. Also, if the user provides or corrects some speaker labels of the utterances, this extra information is involved when updating the cluster centers. Re-cluster will progress with partial initialization based on updated centers, resulting in more accurate predictions.

Finally, the output of speaker diarization can be further fed to the input of an ASR system, and the final output is the transcript for each speaker turn. It is not very certain if diarization can help improve ASR performance; however, recent work has shown that ASR with diarization can obtain comparable WER as state-of-the-art ASR system, but with extra speaker label information.


What’s Next?

While most systems in production still follow the traditional pipeline of speech turn segmentation + clustering, End-to-end diarization systems are receiving more and more attention. As traditional pipeline assumes single-speaker per block while extracting speaker embedding, it can by no means handle the problem of overlapping speech. End-to-end diarization systems receive frame-level speech features as input, and directly output frame-level speaker activity for each speaker, thus can handle overlapping speech in an easy manner. Also, end-to-end systems minimize diarization error directly, getting rid of the need to tune hyper-parameters of several components. Today, end-to-end systems are able to perform on par with, or even better than traditional pipeline systems in many datasets.

However, end-to-end systems still have several drawbacks. Firstly, these systems cannot easily handle an arbitrary number of speakers. The originally proposed EEND [4] can only deal with a fixed number. Several follow-up works attempt to address this problem, but their systems still struggle to generalize to real world settings, where there may be more than 3 speakers.

Secondly, while end-to-end systems are optimized directly for diarization error rate, they may easily overfit to the training dataset, for instance the number of speakers, speaker characteristics, as well as background conditions. In contrast, traditional clustering-based approaches are shown to be more robust across datasets. Lastly, most end-to-end diarization systems use transformer or conformer as their core architecture. Self-attention mechanism in transformer has quadratic complexity, which hinders the model’s capability to process long utterances or meetings.

Recently, hybrid systems integrating both an end-to-end neural network and clustering are proposed [5][6], which try to take the pros of both worlds. Hybrid systems handling overlapping speech with an end-to-end network, and able to handle an arbitrary number of speakers using clustering algorithms. Also, block processing of input audio data  [5] significantly shortens the process time of long recordings. This series of research is very promising and deserves more attention.

Finally, many state-of-the-art systems still leverage simulated mixture recordings for training, then fine tune and evaluate on real dataset. This indicates the lack of large-scale real-life dataset for end-to-end training. By introducing more challenging real-life dataset, it is believed that diarization performance can be further improved. 



To address the challenging task of speaker diarization, we leverage a system consisting of several neural networks: speech activity detection, speaker change detection, speaker embedding, as well as clustering algorithm. While these components are trained separately, they are optimized jointly by tuning hyper-parameters. For further work, we would like to incorporate an end-to-end segmentation model to handle overlapping speech, as well as making our system more real-time.



[1] A Review of Speaker Diarization: Recent Advances with Deep Learning https://arxiv.org/abs/2101.09624 

[2] pyannote.audio: neural building blocks for speaker diarization https://arxiv.org/abs/1911.01255

[3] CN-Celeb: multi-genre speaker recognition https://arxiv.org/abs/2012.12468   

[4] End-to-End Neural Speaker Diarization with Self-attention https://arxiv.org/abs/1909.06247 

[5] Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech https://arxiv.org/abs/2105.09040 

[6] Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors  https://arxiv.org/abs/2107.01545 

[7] The yellow brick road of diarization, challenges and other neural paths https://dihardchallenge.github.io/dihard3workshop/slide/The%20yellow%20brick%20road%20of%20diarization,%20challenges%20and%20other%20neural%20paths.pdf 


Lyrics-free Singing Voice Generation

The conventional approach to generate singing voices is through singing voice synthesis (SVS) techniques. A human user feeds lyrics and MIDI scores (a sequence of notes) to a well-trained SVS model, and the model generates audio recordings following the given lyrics and scores faithfully. The synthesis models have little freedom deciding what to “sing.”

In contrast to this conventional approach, we are interested in a singing voice generation model that does not take any lyrics and MIDI scores as input, but instead decides the phonemes and pitches underlying its singing pretty much on its own.  We consider this setting as more interesting as it emphasizes the “creativity” of the model.

Accordingly, we chose to directly work on audio recordings without explicitly extracting pitches or lyrics from the singing audios. The first problem was how to gather enough singing voice audios in order to train neural network models. Luckily, we were experienced in music source separation and the music source separation can already separate the singing voices with reasonably good quality.

To allow the singing voice generation model to work together with our piano generation models (introduced here), we also want the model or a variant of it to sing according to a given piano track but not sing the exact notes of the piano track.

To model the singing voices, in the winter of 2019, we designed a GAN model that can generate singing voices freely as well as a variant of it that can sing along with a given accompaniment audio. Different from conventional GANs for image generation, the proposed model can generate audios with indefinite durations. This work was published in the International Conference on Computational Creativity (ICCC) 2020. This version of models will be referred to as the first generation.


The first generation of models produce very creepy singing voices that could make you have nightmares, so we developed the second generation of models. They were also GAN-based models, but the architecture and the vocoder were both improved. One particular extra mechanism is the cycle regularization, which largely improved the pronunciation and the intelligibility of phones. This work was published in INTERSPEECH 2020.


Although the sound quality was largely improved in the second generation, the generated singing voices did not sound good musically. Therefore, we started to build new models for better sequence modeling capability. Almost at the same time, OpenAI proposed JukeBox, a two-step method for music generation that directly worked on audios. In this two-step method, 1) an audio was converted into a sequence of discrete tokens with VQ-VAE, and 2) the tokens were modeled by a Transformer. In our third generation of models, we adopted this two-step method, while redesigning the architecture based on our goals.


One particular improvement was that the model can run in realtime. A variant of it could accompany the piano playing in realtime. The music team here at the Taiwan AI Labs has deployed it for realtime interaction in various exhibitions and art installations since the autumn of 2020.


We have been working on various improvements and variants of this model to have different usages for different applications.  Our best model now can generate singing voices with quality much better than what described above.  The new model is likely to make it debut soon.  Stay tuned with us for more fun!


By: Jen-Yu Liu and Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

MuseMorphose: Music Style Transfer with A Transformer VAE

At Taiwan AI Labs, we are constantly pushing the frontier of deep music generation models. In the past year, we have rolled out Guitar Transformer (blog), which can compose human-readable guitar tabs with plausible fingerings, and Compound Word Transformer (blog), which vastly accelerated model training and inference thanks to carefully re-engineered music representation. Today, proudly making its debut is MuseMorphose, our brand new model for music style transfer.

Unlike our previous works, which offered limited possibilities for user interaction, MuseMorphose is designed to extensively engage users in the machine creative process. With MuseMorphose, one may input his/her favorite song, the length of which unlimited, and set two musical style attributes, namely, rhythmic intensity and polyphony (i.e., harmonic fullness), of every bar to his/her desired level (0~7 possible). The model will then re-create the song, taking into account the user-specified sequence of bar-level style attributes.

Listening Samples

To showcase MuseMorphose’s capabilities, let us present some of its compositions first!

In the upper example, we feed the famous 8-bar theme of Mozart’s “Twinkle, Twinkle, Little Star” to the model, and ask it to generate a style-transferred version with increasing rhythmic and harmonic intensities (going up by 1 level per bar). For the next one, we choose an 8-bar excerpt from our AILabs1K7 pop song dataset (published here) and pick three drastically different style settings for MuseMorphose to wield its creativity.

Across all the samples, MuseMorphose responds precisely to the style settings. What’s more, it sticks faithfully to the musical flow of input song while adding its own creative and harmonious touches.

MuseMorphose: The Model

Now that we have had some pleasant music, it is time to delve into the technical underpinnings of MuseMorphose.

Architecture of MuseMorphose

Architecture of MuseMorphose

MuseMorphose is based on two popular deep learning models for music generation: Variational Autoencoder (VAE), and Transformer. VAE (see also: MusicVAE & Attr-Aware VAE) grants users the freedom to harness the generation through operations on its learned latent space, much like we have shown in the examples; however, its RNN backbone greatly limits the length of music it can model.

Transformer (check out: Music Transformer & MuseNet), on the other hand, can generate music of up to 5 minutes long, but its conditional generation use cases remain underexplored. People could only use global condition tokens to affect a composition’s style or instrumentation, or, in some other cases, they have to supply a full melody/track for the model to come up with an accompaniment. It has not been possible to freely edit a piece according to its high-level musical flow. Therefore, we integrate the two models to construct MuseMorphose, which exhibits both of their strengths, and gets rid of their weaknesses.

In MuseMorphose, a Transformer encoder is tasked with extracting the musical skeleton of each bar in a piece as vectors (called “latent conditions” in the figure above). Then, these skeletons are concatenated with bar-level style inputs from the user, i.e., the rhythmic intensity and polyphony levels, mapped to learned embedding vectors. Finally, the concatenated conditions enter sequentially, i.e., bar by bar, into a Transformer decoder through the in-attention mechanism developed by us, to produce the style-transferred piece.

This asymmetric encoder-decoder design ensures the fine granularity of conditions, and maintains Transformer’s inherent ability to generate long sequences. Moreover, the in-attention mechanism, which injects each bar-level condition to corresponding timesteps and all layers of the Transformer decoder, is the key to effective conditioning.

Further Materials

You may take a look at our paper to find more discussions on the architecture design, training objective, and evaluation. Our demo website provides even more compositions by MuseMorphose, as well as those by the baselines it outperforms, for you to listen. Want to compose some music with MuseMorphose? No problem! Just check out our open-source implementation and pre-trained checkpoint.

If you find our work exciting, and have some thoughts/suggestions about it (e.g., what other style attributes may be added to MuseMorphose), feel free to drop us a mail. We are definitely looking forward to a lively discussion.

By: Shih-Lun Wu, Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

Compound Word Transformer: Generate Pop Piano Music of Full-Song Length

Over the past months, we attempted to let transformer models learn to generate full-song music, and here is our first attempt towards that, the Compound Word Transformer. A  paper describing this work is going to be published as a full paper at AAAI 2021, the premier conference in the field of artificial intelligence.

You can find a preprint of the paper, and the open source code of the model, from the links below:

We made two major improvements in this work: First, a new representation – Compound Word is presented, which can let transformer models accept longer input sequence. Second, we adopt the linear transformer as our backbone model. The memory complexity of linear transformer scales linearly with respect to the sequence length, and such property enables us to train longer sequences on limited hardware budgets.

To our best knowledge, this is also the first AI model that deals with music at the full-song scale, instead of music segments. Currently, we let our model focus on learning to compose pop piano music, whose average length is approximate 3 to 5 minutes long. Before diving into the technical details, let’s listen to some generated pieces:

All clips above are generated from scratch. Aside from this, our model can also understand the lead sheet, which conveys only melody line and chord progression, and translate it into an expressive piano performance:

How do we nail it? We reviewed our previous work REMI and found the way to represent music can be further condensed. If the required sequence length is shorter, we can feed longer music into the model. The figure below displays the process of how we convert REMI into Compound Words representation. Instead of predicting only one token per timestep, CP groups consecutive and related tokens and predict them at once so as to reduce the sequence length. Each field of CP is related to a certain aspect/attribute of music, such as duration, velocity, chords, and etc

For modeling such CP sequences, we design the specialized input and output modules of the transformer. The figure below illustrates the proposed architecture in action at a certain timestep.  On the right half part, where the model makes token prediction, you can see that there are multiple feedforward heads,  each accounting for a field of CP, which corresponds to a single row of CP shown in the figure above. On the left half part, each field of CP has its own token embedding, which will be concatenated as the vector and then reshaped by a linear layer, to become the final input of the transformer.

Because each head only concentrates on a certain field of CP, we can have more precise control when either modeling or generating music. For example, in training, we can assign different sizes of embedding to tokens of different types, according to the difficulty level associated with each type of token. We set larger for harder ones like duration and pitches, and smaller for the easier ones like beats and bars. In the inference time, we can adopt different sampling policies. For example, we can use larger temperature to have more randomness in the prediction of velocity tokens; and smaller temperature for pitch tokens to avoid mistakes.

The proposed model shows good training and inference efficiency. Now we can train our model on a single GPU with 11GB memory within just one day. In inference time, to generate a 3-minute song takes only about 20 seconds, which is much faster than real-time.

Learning to generate full songs means that the model can take the whole song as input and knows when to start when generation. However, the music generated by the current model still does not exhibit clear structural patterns, like AABA form, or repetitive phrases, etc. We are working on this, hoping one day our AI can write a hit song.

More examples of the generated music can be found at:



By: Wen-Yi Hsiao, Jen-Yu Liu,  Yin-Cheng Yeh, Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

Guitar Transformer and Jazz Transformer

At the Yating Music Team of the Taiwan AI Labs, we are developing new music composing AI models extending from our previous Pop Music Transformer model (see the previous blog).  In October 2020, we are going to present two full papers documenting some of our latest result at the International Society for Music Information Retrieval Conference (ISMIR), the premier international conference on music information retrieval and music generation.


The first paper, entitled “Automatic composition of guitar tabs by Transformers and groove modeling,” talks about a Guitar Transformer model that learns to generate guitar tabs.

Here are two tabs in the style of guitar fingerstyle  generated fully automatically by this model; no human curation involved.

The highlight of this work is to design and incorporate what we call the “grooving” tokens to the representation we use to represent a piece of symbolic guitar music.  Groove,  which can be in general considered as a rhythmic feeling of a changing or repeated pattern, or “humans’ pleasurable urge to move their bodies rhythmically in response to music,” is not explicitly specified in either a MIDI or TAB file.  Instead, groove is implicitly implied as a result of the arrangement of note onsets over time.  Therefore, existing methods for representing music do not involve the use of groove-related tokens.

What we did is to apply music information retrieval (MIR) techniques to extract a 16-dimensional vector representing the occurrence of note onsets over 16 possible equally-spaced quantized positions of a bar, and then use the classical kmeans algorithm to cluster such 16-dim vectors from all the bars from all the pieces of our training data, leading to k (=32) clusters (we need clustering for otherwise there will be too many unique such 16-dim vectors).  We then treat these cluster IDs as “grooving tokens” and assign a grooving token to each bar of a music piece.  In this way, what our Transformer model (specifically, we use Transformer-XL) sees during model training would be not only the note-related tokens but also such bar-level grooving tokens.  It turns that this improves quite a lot the quality of the generated music, compared to the baseline model that does not use grooving tokens.

The following figure shows the result of a user study asking subjects to choose the best among the three continuations generated by different models, with or without the grooving tokens, given a short human-made prompt. The result is broken down according to the self-report guitar proficiency level of the subjects.  We can see that the professionals are aware of the difference between the grooving-agnostic model and the two groove-aware models (we implemented two variants here, a `hard grooving` model and a `soft grooving model`, which differ in the way we represent the musical onsets).


The second paper, entitled “The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures,” talks about a Jazz Transformer model that learns to generate Jazz-style lead sheets, using the Jazzomat dataset.

The focus of this paper can be said to be about the development of objective metrics tailored for symbolic music generation tasks (namely, not for general sequence generation tasks).  Specifically, we proposed the following metrics:

  • Pitch Usage: Entropy of 1- & 4-bar chromagrams;
  • Rhythm: Cross-bar similarity of grooving patterns;
  • Harmony: Percentage of unique chord trigrams;
  • Repeated structures: Short-, mid-, & long-term structureness indicators (computed from the fitness scape plot);
  • Overall musical knowledge: Multiple-choice continuation prediction (given 8 bars, predict the next 8 bars).

Implementation of these evaluation metrics can all be found in the `MusDr` repository listed above.  It has also been integrated into `MusPy,` an open source Python library for symbolic music generation developed by Hao-Wen Dong et al at UCSD.

These objective metrics can help us gain some ideas about the quality of the machine-generated music, dispensing the need to run the expensive user studies too many times.  These metrics also help elucidate the difference between machine-composed music and human-composed ones.  For example, from the structureness indicators, we can see clearly that Transformer-XL based music composing models, which represent a current state-of-the-art, still fall short of generating music with reasonable mid- and long-term structure.  See the following figure for a comparison between the fitness scape plot of a piece composed by the Jazz Transformer (marked as `Model (B)’), and that of a human-composed one.

We are actively using these new metrics to guide and to improve our models.  In particular, we are finding ways to induce structures in machine-composed music.  Let us know if you like what we are doing and/or have some ideas to chat with us!


[1] Yu-Hua Chen, Yu-Siang Huang, Wen-Yi Hsiao, and Yi-Hsuan Yang, “Automatic composition of guitar tabs by Transformers and groove modeling,” in Proc. Int. Society for Music Information Retrieval Conf. 2020 (ISMIR’20).
[2] Shih-Lun Wu and Yi-Hsuan Yang, “The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures,” in Proc. Int. Society for Music Information Retrieval Conf. 2020 (ISMIR’20).

By: Yu-Hua Chen, Shih-Lun Wu, Yu-Siang Huang, Wen-Yi Hsiao and Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

Missing facts and source classifier of daily news


With the development of the Internet, it is convenient that one can get news from network rapidly. Nonetheless, it is also dangerous since a person might have a preference for a specific news medium over others, and the news medium may have its position to report news while missing some facts that they do not want people to know. For the purpose, we propose a system to detect missing facts from a news report.


Work flow

First, we group the news using their embeddings derived by Universal Sentence Encoder(USE)[1]. Within each group, the news are highly related. In fact, most of them report the same event as exptected. Meanwhile, we extract the summary for each news report using an algorithm called PageRank. Then, we further summarize each news group using the summaries of the news in the group. Afterwards, we compare the summary of each news report with the group summary to get the missing facts of the report.

Grouping news

This part is implemented by Yu-An, Wang, and I omit it here.


For each news report, we split the raw content into sentences. Then we count the similarity of each sentence pair using USE and editdistance. Once a similarity matrix is ready, we can sort the sentences by PageRank. We then choose top k sentences to be the report’s summary, where k is a hyperparameter.

Combining the summaries of the news in a group, we can summarize the news group in a similar way.

Missing facts

We call each sentence in a news summary “a fact”. Now we can obtain the difference between a news summary and the group summary, and the sentences existing in the group summary but not in the news summary are the missing facts of the report. Otherwise, the sentences existing in the news summary but not in the group summary is the exclusive content of the news report.


島民衛星 https://islander.cc/

Source Classifier for Daily News and Farm Post


Sometimes the news media are not neutral as expected. Hence, we can train a satisfactory classifier due to the biases. Or to say, the classifier will work since a news medium might have its specific writing style. We use BERT[2] as the contexual representation extractor to train a classifier in order to predict the source (news medium) of a news report given its content (or title). Except for the daily news, this kind of classifier can also be adopted to classify the source of farm posts, which are usually biased and contain fake information.


  • Daily news
    • # classes: 4
    • 台灣四大報(中國時報、聯合報、自由時報、蘋果日報)
    • Preprocessing
      • Sometimes the reporter name, the news medium itself, or some slogans are in the raw content. We filter them using hand-crafted rules.
  • Farm post
    • # classes: 16
    • mission-tw, qiqu.live, qiqu.world, hssszn, qiqi.today, cnba.live, 77s.today, i77.today, nooho.net, hellotw, qiqu.pro, taiwan-politicalnews, readthis.one, twgreatdaily.live, taiwan.cnitaiwannews.cn



  • As shown in the figure below, this classifier is composed of a BERT model followed by a linear layer. This model can handle the input whose content length is less than 512. Note that the latent vector is corresponding to the CLS token.


  • Since the pretrained BERT model is for the cases with less-than-512 texts, we propose a method to deal the cases with more-than-512 texts.
  • First, we train a C-512 model. The C-whole model uses the BERT module in the trained C-512 model as its representation extractor. Given an input content with arbitary text length, ll, we split it into ⌈l/512⌉⌈l/512⌉ segments. The representation for each segment is derived by the BERT module, and then it is passed through a linear layer. Averaging all these outputs from the linear layer, the vector is fed into another linear layer to obtain the final output of the C-whole model.


島民衛星 https://islander.cc/

1. Cer, Daniel, et al. “Universal sentence encoder.” arXiv preprint arXiv:1803.11175 (2018). ↩︎

2. Jacob Devlin, et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” arXiv preprint arXiv:1810.04805 (2018). ↩︎