Exploring Atypical Online Coincidental Behavior on PTT

This study focuses on atypical coincidental behavior on the Taiwanese social media platform PTT¹ to discover attempts to manipulate public opinion during the COVID-19 outbreak in Taiwan from May through August 2021. The research team aims to identify atypical coincidental behaviors in order to uncover suspicious collaborative efforts to manipulate public opinion, and to develop AI tools that analyze the information from the outbreak period comprehensively and efficiently, saving researchers labor and time.

Since its launch in 1995, PTT has become one of the most used Chinese-language online services. Many users in Taiwan habitually post news and information from various sources, which leads to a diverse spectrum of discussions, and for many the discussion board is the first stop for news. Almost any information related to Taiwan can be found on PTT, where users discuss it and share their opinions and ideas.

Such discussions are often later presented in news media for larger public consumption. This effect not only amplifies these ideas for a wider audience, but also affects businesses and governmental institutions when their decision-making needs to take public opinion into account. For example, when whistle-blower Dr. Wenliang Li of China first broke evidence in Chinese media of human-to-human transmission of COVID-19, a Taiwanese physician began sharing the information on PTT. Because of these discussions, authorities at the Taiwan Centers for Disease Control were alerted and began to take action much earlier than other countries. In this way, one could compare PTT to a speaker stand in the center of the town square, from which users can give public addresses to an audience made up of the majority of the townsfolk. The more time someone spends on the speaker stand, the greater their potential influence on the town. Users who have the power to influence discussions could be likened to a bullying majority dominating the use of the speaker stand for their own ends.

Since 2018, Taiwanese media and academia have observed the potential for cyber armies to conduct strategic information operations through social networks such as PTT, Line, and Facebook. “A Pilot Study on PTT in the Context of Influence Operations” introduces the interface, functions, and terminology of PTT. Other studies identified groups of cyber armies on the Gossiping Board of PTT. Compared to previous papers, this research offers not only case studies but also a data-driven, evidence-based approach that comprehensively quantifies “atypical coincidental behavior” and summarizes the differences between user groups.

The research team grouped users by phi-coefficient score to measure coincidental behavior in many aspects, including shared IP addresses, time patterns of activity, narratives, emotion, and incitement of comments. Furthermore, the study compared several behavioral metrics among user groups in various events² to find evidence of information manipulation, and observed the correlation between the ideological slogans used and groups of users. By observing opinion manipulation on social media, users may be able to distinguish atypical coincidental behavior more readily and therefore decrease their chances of being manipulated. By exposing more context behind user content, this research hopes to decrease the negative impact of atypical coincidental behavior on public opinion.
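To make the phi-coefficient concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that each user is described by a binary vector marking the events in which they commented. The study's actual features and thresholds are described in the full paper, and the names `phi_coefficient`, `user_a`, `user_b`, and `user_c` are hypothetical.

```python
import math


def phi_coefficient(a, b):
    """Phi coefficient between two equal-length binary sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    # Cells of the 2x2 contingency table.
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # One of the users is always active or never active: undefined, report 0.
        return 0.0
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical participation vectors over eight events (1 = commented).
user_a = [1, 1, 0, 1, 0, 0, 1, 0]
user_b = [1, 1, 0, 1, 0, 0, 0, 0]  # overlaps heavily with user_a
user_c = [0, 0, 1, 0, 1, 1, 0, 1]  # active exactly when user_a is not

print(round(phi_coefficient(user_a, user_b), 3))
print(round(phi_coefficient(user_a, user_c), 3))
```

A score near 1 means two users appear in the same events far more often than independent behavior would suggest, a score near 0 means no association, and a score near -1 means they systematically avoid each other.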

Furthermore, the researchers compared the behavior of coincidental users. Given the shared IP addresses, active times, narratives, and so on, the researchers conclude that these patterns reflect more than random chance. The evidence shows that those coincidental users were engaged in various forms of intentional, collaborative effort in their posts, comments, and other behaviors. Although one cannot claim with certainty that these users intend to manipulate public opinion, the researchers could find no other plausible explanation for such behavior, because it differs so much from that of typical users.

The following collaborative behaviors were observed during this time period:

  1. Users in each coincidental user group were active during similar times.
  2. Coincidental users were more likely to participate in events with more comments, and in events with higher incite scores.
  3. Coincidental users showed much higher rates of manipulative patterns.
  4. Each coincidental user group demonstrated different preferences in their patterns of behavior.
  5. Some coincidental user groups favored just one negative/manipulative slogan, while others used more than one, or none at all.
  6. Different coincidental user groups favored different specific narratives and word choices.

In summary, the researchers want to emphasize that behavior indicative of the intent to manipulate opinions on social media platforms appears to be very active. In the time period selected for this study, there were 880 atypical events out of 1,985 targeted events. This means that collaborative users demonstrated intentionally manipulative behaviors in about 44%, or almost half, of all targeted events. This report also showed that manipulation can and does happen on a variety of topics, including sports, business, entertainment, and politics. The research team believes that more work should be done to study and catalog atypical, manipulative behaviors on social media across a variety of platforms. We believe that users of social media platforms should be made aware of collaborative, manipulative behavior in order to know when others may be attempting to influence public opinion.

To download the complete version of the research paper, please submit the request form at the end of this blog.


This study focused on atypical coincidental behavior in an attempt to uncover the types of events that were manipulated via social media and the ways in which they were manipulated. To do so, the important events that happened during the Taiwanese outbreak of the COVID-19 pandemic, as well as the users with coincidental behavior, had to be identified in order to study the interactions between the events and the coincidental users.

Figure 1 presents an overview of the methodology framework. With the framework provided, all the data posted on the PTT Gossiping Board from May 1, 2021, to August 31, 2021, including 130,099 users and 8,413,675 comments on 293,370 posts, were analyzed to detect events and group users. Additionally, through the comment content analysis, features were extracted for the study of the interactions between the events and the coincidental users. With the results obtained from event user analysis and comment content analysis, coincidental users were first analyzed, then coincidental groups were analyzed.

GitHub: https://github.com/ailabstw/opinion-analysis


1 PTT is the largest local forum in Taiwan.

2 This research defines an “event” as the sum total of collected news articles on a topic, combined with all social media reactions to it, on given platforms. For more details about how the data was clustered, please download the complete essay and refer to Chapter II Methodology.

The Challenge of Speaker Diarization


Speaker diarization is the task of partitioning audio recordings into segments according to the identity of the speaker; in short, identifying “who spoke when”. It can be used to analyze audio data in various scenarios, such as business meetings, court proceedings, and online videos. However, it is also a very challenging task, since the characteristics of several speakers have to be modeled jointly.

Traditionally, a diarization system is a combination of multiple independent sub-modules that are optimized individually. Recently, end-to-end deep learning methods have become more popular. The standard metric for speaker diarization is the diarization error rate (DER): the sum of the false alarm, missed detection, and speaker confusion durations, divided by the total duration of reference speech.
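As a small worked example of the metric, DER is conventionally reported as the three error durations summed and divided by the total duration of reference speech. The sketch below uses made-up durations, and the function name `diarization_error_rate` is an illustrative choice rather than part of any toolkit.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """DER from error durations in seconds, relative to reference speech time."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


# Hypothetical recording: 600 s of reference speech, with 12 s of false alarm,
# 18 s of missed speech, and 30 s attributed to the wrong speaker.
der = diarization_error_rate(false_alarm=12.0, missed=18.0, confusion=30.0,
                             total_speech=600.0)
print(f"DER = {der:.1%}")
```

Because false alarms are counted against reference speech time only, DER can exceed 100% on recordings where the system hallucinates a lot of speech.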

In this article, we will introduce our upcoming speaker diarization system first, then give an overview of the latest research in end-to-end speaker diarization.


Our Method

Our product uses a traditional diarization pipeline, which consists of several components: speech turn segmentation, speaker embedding extraction, and clustering. We utilize pyannote.audio [2], an open-source toolkit for speaker diarization, to train most of our models. 


1. Speech Turn Segmentation

The first step of our diarization system is to partition the audio recordings into possible speaker turn segments. A voice activity detection model detects speech regions and removes non-speech parts. A speaker change detection model detects speaker change points in the audio. Each of these models is trained on a sequence labeling task: given a sequence of audio features as input, output a sequence of labels.

Both models have tunable hyperparameters that determine how sensitively they segment the audio. For the speaker change detection model, only audio frames with a detection score higher than the threshold “alpha” are marked as speaker change points. We have found that it is generally more beneficial to segment aggressively (i.e., split the audio into more segments) in the speech turn segmentation stage, to make sure most segments contain only one speaker. After the clustering stage, we can merge segments assigned to the same speaker for better speech recognition performance in the following stages.
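The effect of the threshold can be illustrated with a short sketch. It assumes the change detection model emits one score per frame and, as an additional assumption not stated above, that only local peaks above “alpha” count as change points, so a run of high scores yields a single boundary. The function name and the scores are hypothetical.

```python
def split_at_change_points(scores, alpha):
    """Return (start_frame, end_frame) segments cut at peaks scoring above alpha."""
    boundaries = [0]
    for i in range(1, len(scores) - 1):
        # Assumption: a change point is a local peak, not every frame above alpha.
        is_peak = scores[i] > scores[i - 1] and scores[i] >= scores[i + 1]
        if is_peak and scores[i] > alpha:
            boundaries.append(i)
    boundaries.append(len(scores))
    return list(zip(boundaries, boundaries[1:]))


# Hypothetical per-frame speaker change scores.
scores = [0.1, 0.2, 0.9, 0.3, 0.1, 0.4, 0.6, 0.5, 0.1, 0.1]

conservative = split_at_change_points(scores, alpha=0.7)
aggressive = split_at_change_points(scores, alpha=0.5)
print(conservative)
print(aggressive)
```

Lowering alpha yields more, shorter segments, which is the aggressive setting described above; segments that turn out to belong to the same speaker can be merged again after clustering.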


2. Speaker Embedding Extraction

After performing speech turn segmentation, in order to facilitate clustering of speaker segments in the clustering module, a speaker embedding model is used to obtain a compact, fine-grained representation for each segment.

The model we chose is SincTDNN, which is essentially an x-vector architecture in which the filters are learned instead of handcrafted. The model is trained with additive angular margin loss.

We use CN-Celeb [3], a large-scale open-source Chinese speaker recognition dataset, to fine-tune a model pretrained on VoxCeleb. CN-Celeb is a challenging, multi-domain dataset consisting of 3,000 speakers in 11 different genres. By using such a diverse dataset, we expect our model to be stable when facing various real-life scenarios. We also found that only a few epochs are needed to fine-tune the model, and that a speaker embedding with a lower EER does not always imply a lower diarization error rate.


3. Clustering

Traditional clustering methods can be used to cluster the speaker embeddings and identify which speaker each segment belongs to. In our system, we use affinity propagation for short audio and KMeans for long audio. Since we do not know how many speakers are in a recording, affinity propagation performs relatively better than KMeans, as it determines the number of clusters directly. For KMeans, we estimate the best number of clusters by finding the elbow of the derivatives in MSCD. We nevertheless resort to KMeans for longer audio files, since affinity propagation tends to be slower in that situation.
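The elbow idea can be sketched as follows. This is only an illustration under stated assumptions: the clustering cost per candidate k is taken as given (it stands in for the MSCD value mentioned above, whose exact definition is not spelled out here), the candidate values of k are consecutive integers, and the elbow is taken to be the k where the drop in cost slows down the most. The function name `elbow_k` and the cost values are hypothetical.

```python
def elbow_k(costs):
    """Pick the k with the sharpest bend in a {k: clustering cost} curve."""
    ks = sorted(costs)
    if len(ks) < 3:
        raise ValueError("need at least three candidate values of k")
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        # Discrete second derivative: gain from reaching k minus gain from going past it.
        bend = (costs[prev] - costs[k]) - (costs[k] - costs[nxt])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical costs from running KMeans with k = 1..6 on one recording.
costs = {1: 100.0, 2: 70.0, 3: 20.0, 4: 17.0, 5: 15.0, 6: 14.0}
print(elbow_k(costs))
```

In this toy curve the cost falls steeply up to three clusters and flattens afterwards, so three speakers would be the estimate.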

Combining all components

With all the aforementioned components trained, we can combine them into a complete pipeline for diarization. Several hyperparameters are optimized jointly to minimize diarization error rate on a certain dataset. 

Compared to our previous system, the current system obtains around a 30% relative DER improvement on our internal dataset flow-inc, which consists of a few thousand news recordings. Most of the improvement stems from the improved speaker embedding and clustering method, which reduced DER from 15% to 8% in the oracle speech turn segmentation setting.

In our production system, we also allow some extra customization for our users. For instance, if the user provides the number of speakers in the recording, we can use KMeans to re-cluster with that number of clusters. Also, if the user provides or corrects some speaker labels of the utterances, this extra information is used when updating the cluster centers. Re-clustering then proceeds with a partial initialization based on the updated centers, resulting in more accurate predictions.

Finally, the output of speaker diarization can be fed into an ASR system, so that the final output is a transcript for each speaker turn. It is not yet clear whether diarization helps improve ASR performance; however, recent work has shown that ASR with diarization can obtain a WER comparable to that of state-of-the-art ASR systems, with the added benefit of speaker labels.


What’s Next?

While most systems in production still follow the traditional pipeline of speech turn segmentation + clustering, end-to-end diarization systems are receiving more and more attention. Because the traditional pipeline assumes a single speaker per block when extracting speaker embeddings, it cannot handle overlapping speech. End-to-end diarization systems receive frame-level speech features as input and directly output frame-level speaker activity for each speaker, so they handle overlapping speech naturally. Also, end-to-end systems minimize diarization error directly, removing the need to tune the hyperparameters of several components. Today, end-to-end systems perform on par with, or even better than, traditional pipeline systems on many datasets.

However, end-to-end systems still have several drawbacks. Firstly, these systems cannot easily handle an arbitrary number of speakers. The originally proposed EEND [4] can only deal with a fixed number. Several follow-up works attempt to address this problem, but their systems still struggle to generalize to real world settings, where there may be more than 3 speakers.

Secondly, while end-to-end systems are optimized directly for diarization error rate, they may easily overfit to properties of the training dataset, such as the number of speakers, speaker characteristics, and background conditions. In contrast, traditional clustering-based approaches have been shown to be more robust across datasets. Lastly, most end-to-end diarization systems use a transformer or conformer as their core architecture. The self-attention mechanism in the transformer has quadratic complexity in the sequence length, which hinders the model’s ability to process long utterances or meetings.

Recently, hybrid systems that integrate an end-to-end neural network with clustering have been proposed [5][6], aiming for the best of both worlds. Hybrid systems handle overlapping speech with an end-to-end network and handle an arbitrary number of speakers using clustering algorithms. Also, block processing of input audio data [5] significantly shortens the processing time of long recordings. This line of research is very promising and deserves more attention.

Finally, many state-of-the-art systems still use simulated mixture recordings for training, then fine-tune and evaluate on real datasets. This indicates a lack of large-scale real-life datasets for end-to-end training. With more challenging real-life datasets, diarization performance can likely be improved further.



To address the challenging task of speaker diarization, we use a system consisting of several neural networks (speech activity detection, speaker change detection, and speaker embedding) together with a clustering algorithm. While these components are trained separately, they are optimized jointly by tuning hyperparameters. For future work, we would like to incorporate an end-to-end segmentation model to handle overlapping speech, and to make our system more real-time.



[1] A Review of Speaker Diarization: Recent Advances with Deep Learning https://arxiv.org/abs/2101.09624 

[2] pyannote.audio: neural building blocks for speaker diarization https://arxiv.org/abs/1911.01255

[3] CN-Celeb: multi-genre speaker recognition https://arxiv.org/abs/2012.12468   

[4] End-to-End Neural Speaker Diarization with Self-attention https://arxiv.org/abs/1909.06247 

[5] Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech https://arxiv.org/abs/2105.09040 

[6] Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors  https://arxiv.org/abs/2107.01545 

[7] The yellow brick road of diarization, challenges and other neural paths https://dihardchallenge.github.io/dihard3workshop/slide/The%20yellow%20brick%20road%20of%20diarization,%20challenges%20and%20other%20neural%20paths.pdf 


Lyrics-free Singing Voice Generation

The conventional approach to generate singing voices is through singing voice synthesis (SVS) techniques. A human user feeds lyrics and MIDI scores (a sequence of notes) to a well-trained SVS model, and the model generates audio recordings following the given lyrics and scores faithfully. The synthesis models have little freedom deciding what to “sing.”

In contrast to this conventional approach, we are interested in a singing voice generation model that does not take any lyrics and MIDI scores as input, but instead decides the phonemes and pitches underlying its singing pretty much on its own.  We consider this setting as more interesting as it emphasizes the “creativity” of the model.

Accordingly, we chose to work directly on audio recordings without explicitly extracting pitches or lyrics from the singing audio. The first problem was how to gather enough singing voice audio to train neural network models. Luckily, we were experienced in music source separation, and music source separation could already separate singing voices with reasonably good quality.

To allow the singing voice generation model to work together with our piano generation models (introduced here), we also want the model or a variant of it to sing according to a given piano track but not sing the exact notes of the piano track.

To model the singing voices, in the winter of 2019, we designed a GAN model that can generate singing voices freely, as well as a variant of it that can sing along with a given accompaniment audio. Different from conventional GANs for image generation, the proposed model can generate audio of indefinite duration. This work was published at the International Conference on Computational Creativity (ICCC) 2020. We refer to these models as the first generation.


The first generation of models produced very creepy singing voices that could give you nightmares, so we developed a second generation of models. They were also GAN-based, but both the architecture and the vocoder were improved. One notable extra mechanism is cycle regularization, which largely improved the pronunciation and the intelligibility of phones. This work was published at INTERSPEECH 2020.


Although the sound quality was largely improved in the second generation, the generated singing voices did not sound good musically. Therefore, we started to build new models with better sequence modeling capability. At almost the same time, OpenAI proposed Jukebox, a two-step method for music generation that works directly on audio. In this two-step method, 1) audio is converted into a sequence of discrete tokens with a VQ-VAE, and 2) the tokens are modeled by a Transformer. In our third generation of models, we adopted this two-step method while redesigning the architecture based on our goals.
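To give a feel for the first step, the sketch below shows only the vector quantization lookup at the heart of a VQ-VAE: each encoded audio frame is replaced by the index of its nearest codebook vector, and the resulting indices are the discrete tokens that a Transformer then models. The encoder, the decoder, and the training procedure are omitted, and the tiny two-dimensional codebook and frames are invented for illustration; this is not the architecture used in our models.

```python
def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codebook entry.
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical codebook with three entries and four encoded frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [1.1, 0.9]]

tokens = quantize(frames, codebook)
print(tokens)
```

Real systems use much larger codebooks and higher-dimensional frames, but the lookup itself works the same way.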


One particular improvement was that the model can run in realtime. A variant of it could accompany the piano playing in realtime. The music team here at the Taiwan AI Labs has deployed it for realtime interaction in various exhibitions and art installations since the autumn of 2020.


We have been working on various improvements and variants of this model for different applications. Our best model can now generate singing voices of much better quality than those described above. The new model is likely to make its debut soon. Stay tuned for more fun!


By: Jen-Yu Liu and Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

MuseMorphose: Music Style Transfer with A Transformer VAE

At Taiwan AI Labs, we are constantly pushing the frontier of deep music generation models. In the past year, we have rolled out Guitar Transformer (blog), which can compose human-readable guitar tabs with plausible fingerings, and Compound Word Transformer (blog), which vastly accelerated model training and inference thanks to carefully re-engineered music representation. Today, proudly making its debut is MuseMorphose, our brand new model for music style transfer.

Unlike our previous works, which offered limited possibilities for user interaction, MuseMorphose is designed to extensively engage users in the machine creative process. With MuseMorphose, users may input a favorite song of any length and set two musical style attributes, namely rhythmic intensity and polyphony (i.e., harmonic fullness), for every bar to a desired level (from 0 to 7). The model will then re-create the song, taking into account the user-specified sequence of bar-level style attributes.

Listening Samples

To showcase MuseMorphose’s capabilities, let us present some of its compositions first!

In the upper example, we feed the famous 8-bar theme of Mozart’s “Twinkle, Twinkle, Little Star” to the model, and ask it to generate a style-transferred version with increasing rhythmic and harmonic intensities (going up by 1 level per bar). For the next one, we choose an 8-bar excerpt from our AILabs1K7 pop song dataset (published here) and pick three drastically different style settings for MuseMorphose to wield its creativity.

Across all the samples, MuseMorphose responds precisely to the style settings. What’s more, it sticks faithfully to the musical flow of the input song while adding its own creative and harmonious touches.

MuseMorphose: The Model

Now that we have had some pleasant music, it is time to delve into the technical underpinnings of MuseMorphose.

Architecture of MuseMorphose


MuseMorphose is based on two popular deep learning models for music generation: Variational Autoencoder (VAE), and Transformer. VAE (see also: MusicVAE & Attr-Aware VAE) grants users the freedom to harness the generation through operations on its learned latent space, much like we have shown in the examples; however, its RNN backbone greatly limits the length of music it can model.

Transformer (check out: Music Transformer & MuseNet), on the other hand, can generate music of up to 5 minutes long, but its conditional generation use cases remain underexplored. People could only use global condition tokens to affect a composition’s style or instrumentation, or, in some other cases, they have to supply a full melody/track for the model to come up with an accompaniment. It has not been possible to freely edit a piece according to its high-level musical flow. Therefore, we integrate the two models to construct MuseMorphose, which exhibits both of their strengths, and gets rid of their weaknesses.

In MuseMorphose, a Transformer encoder is tasked with extracting the musical skeleton of each bar in a piece as vectors (called “latent conditions” in the figure above). Then, these skeletons are concatenated with bar-level style inputs from the user, i.e., the rhythmic intensity and polyphony levels, mapped to learned embedding vectors. Finally, the concatenated conditions enter sequentially, i.e., bar by bar, into a Transformer decoder through the in-attention mechanism developed by us, to produce the style-transferred piece.

This asymmetric encoder-decoder design ensures the fine granularity of conditions, and maintains Transformer’s inherent ability to generate long sequences. Moreover, the in-attention mechanism, which injects each bar-level condition to corresponding timesteps and all layers of the Transformer decoder, is the key to effective conditioning.
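The injection pattern can be sketched in a few lines. This is a toy illustration, not the released implementation: hidden states and conditions are plain Python lists, the conditions are assumed to be already projected to the hidden size, the decoder layers are stand-in functions, and the condition is combined with the hidden state by addition, which is an assumption of this sketch. The names `inject_conditions` and `decode` are hypothetical.

```python
def inject_conditions(hidden, bar_ids, bar_conditions):
    """Add each timestep's bar-level condition vector to its hidden state."""
    return [
        [h + c for h, c in zip(state, bar_conditions[bar])]
        for state, bar in zip(hidden, bar_ids)
    ]


def decode(hidden, bar_ids, bar_conditions, layers):
    """Run stand-in decoder layers, re-injecting the conditions before each one."""
    for layer in layers:
        hidden = layer(inject_conditions(hidden, bar_ids, bar_conditions))
    return hidden


# Three timesteps: the first two belong to bar 0, the last one to bar 1.
hidden = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
bar_ids = [0, 0, 1]
bar_conditions = [[1.0, 2.0], [10.0, 20.0]]

# Identity functions stand in for real self-attention layers.
identity_layers = [lambda states: states, lambda states: states]
output = decode(hidden, bar_ids, bar_conditions, identity_layers)
print(output)
```

Because the condition is re-applied before every layer, its influence cannot fade with depth, and each timestep only ever sees the condition of its own bar.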

Further Materials

You may take a look at our paper to find more discussions on the architecture design, training objective, and evaluation. Our demo website provides even more compositions by MuseMorphose, as well as those by the baselines it outperforms, for you to listen to. Want to compose some music with MuseMorphose? No problem! Just check out our open-source implementation and pre-trained checkpoint.

If you find our work exciting, and have some thoughts/suggestions about it (e.g., what other style attributes may be added to MuseMorphose), feel free to drop us a mail. We are definitely looking forward to a lively discussion.

By: Shih-Lun Wu, Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

Compound Word Transformer: Generate Pop Piano Music of Full-Song Length

Over the past months, we have been trying to let transformer models learn to generate full-song music, and here is our first attempt: the Compound Word Transformer. A paper describing this work is going to be published as a full paper at AAAI 2021, the premier conference in the field of artificial intelligence.

You can find a preprint of the paper, and the open source code of the model, from the links below:

We made two major improvements in this work. First, we present a new representation, Compound Word, which lets transformer models accept longer input sequences. Second, we adopt the linear transformer as our backbone model. The memory complexity of the linear transformer scales linearly with the sequence length, which enables us to train on longer sequences with a limited hardware budget.

To the best of our knowledge, this is also the first AI model that deals with music at the full-song scale instead of music segments. Currently, we let our model focus on learning to compose pop piano music, with pieces averaging approximately 3 to 5 minutes in length. Before diving into the technical details, let’s listen to some generated pieces:

All clips above are generated from scratch. Aside from this, our model can also understand the lead sheet, which conveys only melody line and chord progression, and translate it into an expressive piano performance:

How do we nail it? We reviewed our previous work REMI and found that the way music is represented can be condensed further. If the sequence needed to represent a piece is shorter, we can feed longer music into the model. The figure below shows how we convert REMI into the Compound Word (CP) representation. Instead of predicting only one token per timestep, CP groups consecutive, related tokens and predicts them at once, reducing the sequence length. Each field of CP relates to a certain aspect or attribute of music, such as duration, velocity, or chords.
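The grouping idea can be sketched as follows. The rule used here is a simplification chosen for illustration: consecutive tokens are merged into one compound word until either the token family changes (note-related versus metric-related) or a field would repeat. The field names, the token values, and the function `to_compound_words` are hypothetical and do not reproduce the exact vocabulary of the paper.

```python
NOTE_FIELDS = {"pitch", "velocity", "duration"}


def to_compound_words(tokens):
    """Group a flat list of (field, value) tokens into compound words (dicts)."""
    words, current, family = [], {}, None
    for field, value in tokens:
        token_family = "note" if field in NOTE_FIELDS else "metric"
        # Start a new compound word on a family change or a repeated field.
        if current and (token_family != family or field in current):
            words.append(current)
            current = {}
        current[field] = value
        family = token_family
    if current:
        words.append(current)
    return words


# A hypothetical REMI-like token sequence with one token per timestep.
remi_tokens = [
    ("beat", 0), ("chord", "C:maj"),
    ("pitch", 60), ("velocity", 80), ("duration", 4),
    ("pitch", 64), ("velocity", 76), ("duration", 4),
    ("beat", 8),
    ("pitch", 67), ("velocity", 82), ("duration", 8),
]

compound_words = to_compound_words(remi_tokens)
print(len(remi_tokens), "tokens ->", len(compound_words), "compound words")
```

Here twelve single tokens collapse into five compound words, so the model needs fewer timesteps to cover the same music.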

To model such CP sequences, we design specialized input and output modules for the transformer. The figure below illustrates the proposed architecture in action at a certain timestep. On the right half, where the model makes token predictions, there are multiple feedforward heads, each accounting for one field of CP, which corresponds to a single row of CP in the figure above. On the left half, each field of CP has its own token embedding; these embeddings are concatenated into a single vector and then reshaped by a linear layer to become the final input of the transformer.

Because each head only concentrates on a certain field of CP, we can have more precise control when either modeling or generating music. For example, in training, we can assign different embedding sizes to tokens of different types, according to the difficulty level associated with each type of token. We use larger embeddings for harder types like duration and pitch, and smaller ones for easier types like beats and bars. At inference time, we can adopt different sampling policies. For example, we can use a higher temperature for more randomness when predicting velocity tokens, and a lower temperature for pitch tokens to avoid mistakes.
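A minimal sketch of per-field temperature sampling is shown below. The temperature values, the logits, and the function names are made up for illustration; only the general technique, dividing the logits by a temperature before the softmax, is standard.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature gives a sharper distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Sample one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical policy: be conservative on pitch, more adventurous on velocity.
TEMPERATURES = {"pitch": 0.8, "velocity": 1.5}
head_logits = {"pitch": [2.0, 1.0, 0.0], "velocity": [2.0, 1.0, 0.0]}

rng = random.Random(0)
sampled = {
    field: sample_token(head_logits[field], TEMPERATURES[field], rng)
    for field in head_logits
}
print(sampled)
```

With identical logits, the pitch head puts more probability mass on its top choice than the velocity head does, which is the trade-off between avoiding wrong notes and keeping the dynamics varied.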

The proposed model shows good training and inference efficiency. We can now train our model on a single GPU with 11 GB of memory within just one day. At inference time, generating a 3-minute song takes only about 20 seconds, which is much faster than real time.

Learning to generate full songs means that the model can take a whole song as input and knows when to stop generating. However, the music generated by the current model still does not exhibit clear structural patterns, such as AABA form or repetitive phrases. We are working on this, hoping that one day our AI can write a hit song.

More examples of the generated music can be found at:



By: Wen-Yi Hsiao, Jen-Yu Liu,  Yin-Cheng Yeh, Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

Guitar Transformer and Jazz Transformer

At the Yating Music Team of the Taiwan AI Labs, we are developing new music-composing AI models that extend our previous Pop Music Transformer model (see the previous blog). In October 2020, we are going to present two full papers documenting some of our latest results at the International Society for Music Information Retrieval Conference (ISMIR), the premier international conference on music information retrieval and music generation.


The first paper, entitled “Automatic composition of guitar tabs by Transformers and groove modeling,” talks about a Guitar Transformer model that learns to generate guitar tabs.

Here are two fingerstyle guitar tabs generated fully automatically by this model, with no human curation involved.

The highlight of this work is the design and incorporation of what we call “grooving” tokens into the representation we use for a piece of symbolic guitar music. Groove, which can in general be considered a rhythmic feeling of a changing or repeated pattern, or “humans’ pleasurable urge to move their bodies rhythmically in response to music,” is not explicitly specified in either a MIDI or TAB file. Instead, groove is implied by the arrangement of note onsets over time. Therefore, existing methods for representing music do not use groove-related tokens.

What we did was apply music information retrieval (MIR) techniques to extract a 16-dimensional vector representing the occurrence of note onsets over 16 equally spaced quantized positions in a bar, and then use the classical k-means algorithm to cluster these 16-dim vectors from all bars of all pieces in our training data into k (=32) clusters (clustering is needed because there would otherwise be too many unique 16-dim vectors). We then treat the cluster IDs as “grooving tokens” and assign a grooving token to each bar of a music piece. In this way, what our Transformer model (specifically, Transformer-XL) sees during training is not only the note-related tokens but also these bar-level grooving tokens. It turns out that this substantially improves the quality of the generated music compared to the baseline model that does not use grooving tokens.
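The per-bar feature can be sketched like this. The example assumes onsets are given in beats relative to the start of a 4-beat bar and records a binary occurrence per position; whether occurrences are binary or counted, and how onsets are extracted, are assumptions of this sketch. The k-means step that turns these vectors into 32 grooving tokens is omitted, and the name `groove_vector` is hypothetical.

```python
def groove_vector(onsets_in_beats, beats_per_bar=4, positions=16):
    """Mark which of the equally spaced positions in a bar contain a note onset."""
    vector = [0] * positions
    for onset in onsets_in_beats:
        # Snap the onset to the nearest grid position, clamped to stay inside the bar.
        index = int(round(onset / beats_per_bar * positions))
        vector[min(index, positions - 1)] = 1
    return vector


# Hypothetical bar with onsets on beats 1, 2, the "and" of 2, 3, and 4.
bar_onsets = [0.0, 1.0, 1.5, 2.0, 3.0]
groove = groove_vector(bar_onsets)
print(groove)
```

Bars with similar onset patterns produce nearby vectors, so clustering them gives a small vocabulary of rhythm patterns that can be attached to each bar as a token.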

The following figure shows the result of a user study asking subjects to choose the best among the three continuations generated by different models, with or without the grooving tokens, given a short human-made prompt. The result is broken down according to the self-reported guitar proficiency level of the subjects. We can see that the professionals are aware of the difference between the grooving-agnostic model and the two groove-aware models (we implemented two variants here, a `hard grooving` model and a `soft grooving` model, which differ in the way we represent the musical onsets).


The second paper, entitled “The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures,” talks about a Jazz Transformer model that learns to generate Jazz-style lead sheets, using the Jazzomat dataset.

The focus of this paper can be said to be about the development of objective metrics tailored for symbolic music generation tasks (namely, not for general sequence generation tasks).  Specifically, we proposed the following metrics:

  • Pitch Usage: Entropy of 1- & 4-bar chromagrams;
  • Rhythm: Cross-bar similarity of grooving patterns;
  • Harmony: Percentage of unique chord trigrams;
  • Repeated structures: Short-, mid-, & long-term structureness indicators (computed from the fitness scape plot);
  • Overall musical knowledge: Multiple-choice continuation prediction (given 8 bars, predict the next 8 bars).

Implementations of these evaluation metrics can all be found in the `MusDr` repository listed above.  They have also been integrated into `MusPy`, an open-source Python library for symbolic music generation developed by Hao-Wen Dong et al. at UCSD.
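As a rough illustration of two of the metrics listed above, the sketch below computes the entropy of a pitch-class histogram and the fraction of unique chord trigrams. It is a simplified stand-in, not the `MusDr` implementation: the function names are hypothetical, pitches are assumed to be MIDI note numbers, and chords are assumed to be plain string labels.

```python
import math
from collections import Counter

def pitch_class_entropy(midi_pitches):
    """Shannon entropy (in bits) of the 12-bin pitch-class histogram."""
    counts = Counter(p % 12 for p in midi_pitches)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def unique_trigram_ratio(chords):
    """Fraction of chord trigrams in the sequence that are distinct."""
    trigrams = [tuple(chords[i:i + 3]) for i in range(len(chords) - 2)]
    return len(set(trigrams)) / len(trigrams) if trigrams else 0.0

# One repeated pitch gives zero entropy; four equally used pitch classes give 2 bits.
single_note = pitch_class_entropy([60, 60, 60])
four_notes = pitch_class_entropy([60, 62, 64, 65])

# A looping four-chord progression yields few distinct trigrams.
loop_ratio = unique_trigram_ratio(["C", "Am", "F", "G", "C", "Am", "F", "G"])
```

Applying `pitch_class_entropy` to the notes of one bar or of four consecutive bars corresponds to the 1-bar and 4-bar variants of the pitch usage metric.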

These objective metrics can give us some idea of the quality of machine-generated music, dispensing with the need to run expensive user studies too many times.  They also help elucidate the difference between machine-composed and human-composed music.  For example, from the structureness indicators, we can see clearly that Transformer-XL based music composing models, which represent the current state of the art, still fall short of generating music with reasonable mid- and long-term structure.  See the following figure for a comparison between the fitness scape plot of a piece composed by the Jazz Transformer (marked as “Model (B)”) and that of a human-composed one.

We are actively using these new metrics to guide and to improve our models.  In particular, we are finding ways to induce structures in machine-composed music.  Let us know if you like what we are doing and/or have some ideas to chat with us!


[1] Yu-Hua Chen, Yu-Siang Huang, Wen-Yi Hsiao, and Yi-Hsuan Yang, “Automatic composition of guitar tabs by Transformers and groove modeling,” in Proc. Int. Society for Music Information Retrieval Conf. 2020 (ISMIR’20).
[2] Shih-Lun Wu and Yi-Hsuan Yang, “The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures,” in Proc. Int. Society for Music Information Retrieval Conf. 2020 (ISMIR’20).

By: Yu-Hua Chen, Shih-Lun Wu, Yu-Siang Huang, Wen-Yi Hsiao and Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

Missing facts and source classifier of daily news


With the development of the Internet, one can conveniently get news online in no time. Nonetheless, this is also dangerous, since a person might prefer a specific news medium over others, and that medium may report news from its own position while leaving out facts that it does not want people to know. For this reason, we propose a system to detect missing facts in a news report.


Work flow

First, we group the news reports using their embeddings derived by the Universal Sentence Encoder (USE) [1]. Within each group, the reports are highly related; in fact, most of them report the same event, as expected. Meanwhile, we extract a summary of each news report using the PageRank algorithm. Then, we summarize each news group using the summaries of the reports in the group. Finally, we compare the summary of each news report with the group summary to get the missing facts of the report.

Grouping news

This part was implemented by Yu-An Wang, and I omit it here.


For each news report, we split the raw content into sentences. Then we compute the similarity of each sentence pair using USE and edit distance. Once the similarity matrix is ready, we rank the sentences with PageRank. We then choose the top k sentences as the report’s summary, where k is a hyperparameter.

Combining the summaries of the news in a group, we can summarize the news group in a similar way.
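The summarization step can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's code: a simple word-overlap (Jaccard) score stands in for the USE and edit-distance similarity, and the function names and example sentences are hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity; a stand-in for the USE + edit-distance score."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def pagerank(sim, damping=0.85, iters=50):
    """Power iteration on a weighted similarity graph (self-links ignored)."""
    n = len(sim)
    out = [sum(sim[j][m] for m in range(n) if m != j) for j in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            (1 - damping) / n + damping * sum(
                sim[j][i] / out[j] * scores[j]
                for j in range(n) if j != i and out[j] > 0)
            for i in range(n)
        ]
    return scores

def summarize(sentences, k):
    """Return the k highest-ranked sentences, kept in their original order."""
    sim = [[jaccard(a, b) for b in sentences] for a in sentences]
    scores = pagerank(sim)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

report = [
    "the mayor opened the new bridge today",
    "the new bridge was opened by the mayor",
    "the mayor said the bridge cost ten million",
    "rain is expected tomorrow",
]
summary = summarize(report, k=2)
```

The group summary can be produced with the same `summarize` call, applied to the concatenated summaries of all reports in a group.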

Missing facts

We call each sentence in a news summary “a fact”. We can now take the difference between a news summary and the group summary: the sentences that exist in the group summary but not in the news summary are the missing facts of the report. Conversely, the sentences that exist in the news summary but not in the group summary are the exclusive content of the news report.
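The comparison amounts to a two-way set difference. The sketch below assumes that two facts count as the same only when the sentences match exactly, which is a simplification; the function name and the example sentences are hypothetical.

```python
def compare_summaries(report_summary, group_summary):
    """Return (missing_facts, exclusive_content) for one news report."""
    report, group = set(report_summary), set(group_summary)
    missing = [s for s in group_summary if s not in report]
    exclusive = [s for s in report_summary if s not in group]
    return missing, exclusive

group_summary = ["The bridge opened today.", "The project cost ten million."]
report_summary = ["The bridge opened today.", "Residents welcomed the opening."]
missing, exclusive = compare_summaries(report_summary, group_summary)
```

In practice, a similarity threshold on sentence embeddings would probably be needed in place of exact matching, since different outlets rarely phrase the same fact identically.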


島民衛星 https://islander.cc/

Source Classifier for Daily News and Farm Post


News media are sometimes not as neutral as expected. Because of these biases, and because a news medium might have its own writing style, it is possible to train a satisfactory classifier. We use BERT [2] as the contextual representation extractor and train a classifier to predict the source (news medium) of a news report given its content (or title). Besides daily news, this kind of classifier can also be used to classify the source of farm posts, which are usually biased and contain fake information.


  • Daily news
    • # classes: 4
    • Taiwan’s four major newspapers: China Times (中國時報), United Daily News (聯合報), Liberty Times (自由時報), and Apple Daily (蘋果日報)
    • Preprocessing
      • Sometimes the reporter name, the news medium itself, or some slogans are in the raw content. We filter them using hand-crafted rules.
  • Farm post
    • # classes: 16
    • mission-tw, qiqu.live, qiqu.world, hssszn, qiqi.today, cnba.live, 77s.today, i77.today, nooho.net, hellotw, qiqu.pro, taiwan-politicalnews, readthis.one, twgreatdaily.live, taiwan.cnitaiwannews.cn



  • As shown in the figure below, this classifier is composed of a BERT model followed by a linear layer. This model can handle inputs whose content length is less than 512 tokens. Note that the latent vector corresponds to the CLS token.


  • Since the pretrained BERT model only handles texts shorter than 512, we propose a method to deal with longer texts.
  • First, we train a C-512 model. The C-whole model uses the BERT module of the trained C-512 model as its representation extractor. Given an input of arbitrary text length l, we split it into ⌈l/512⌉ segments. The representation of each segment is derived by the BERT module and then passed through a linear layer. The outputs of this linear layer are averaged, and the resulting vector is fed into another linear layer to obtain the final output of the C-whole model.
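The segment-and-average logic of the C-whole model can be sketched without any deep learning framework. In the sketch below, `encode_segment` (the BERT module plus the first linear layer) and `final_layer` (the second linear layer) are hypothetical callables supplied by the caller, and the toy lambdas only stand in for trained networks.

```python
import math

def split_segments(tokens, max_len=512):
    """Split a token list of length l into ceil(l / max_len) consecutive segments."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def c_whole_forward(tokens, encode_segment, final_layer, max_len=512):
    """Encode each segment, average the per-segment vectors, apply the last layer."""
    outputs = [encode_segment(seg) for seg in split_segments(tokens, max_len)]
    mean = [sum(col) / len(outputs) for col in zip(*outputs)]
    return final_layer(mean)

tokens = list(range(1300))               # stand-in for 1300 token ids
segments = split_segments(tokens)
expected_segments = math.ceil(1300 / 512)

# Toy stand-ins: the "encoder" returns [segment length, 1.0]; the last layer is identity.
output = c_whole_forward(tokens,
                         encode_segment=lambda seg: [float(len(seg)), 1.0],
                         final_layer=lambda vec: vec)
```

A real implementation would also need to reserve room in each segment for the special tokens that BERT expects, a detail this sketch leaves out.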


島民衛星 https://islander.cc/

1. Cer, Daniel, et al. “Universal sentence encoder.” arXiv preprint arXiv:1803.11175 (2018). ↩︎

2. Jacob Devlin, et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” arXiv preprint arXiv:1810.04805 (2018). ↩︎

Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions

Paper (ACM Multimedia 2020):  https://arxiv.org/abs/2002.00212 (pre-print)

Code (GitHub):  https://github.com/YatingMusic/remi


We’ve developed Pop Music Transformer, a deep learning model that can generate pieces of expressive Pop piano music of several minutes.  Unlike existing models for music composition, our model learns to compose music over a metrical structure defined in terms of bars, beats, and sub-beats.  As a result, our model can generate music with more salient and consistent rhythmic structure.


Here are nine pieces of piano performances generated by our model in three different styles.  While generating the music, our model takes no human input (e.g., prompt or chord progressions) at all.  Moreover, no post-processing steps are needed to refine the generated music.  The model learns to generate expressive and coherent music automatically.



From a technical point of view, the major improvement we have made is to invent and employ a new approach to represent musical content.  The new representation, called REMI (REvamped MIDI-derived events), provides a deep learning model with more contextual information for modeling music than existing MIDI-like representations.  Specifically, REMI uses position and bar events to provide a metrical context for models to “count the beats.”  It also uses supportive musical tokens that capture the high-level music information of tempo and chord.  Please see the figure below for a comparison between REMI and the commonly adopted MIDI-like token representation of music.
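To make the role of the bar and position events concrete, here is a small sketch that turns a list of notes into a REMI-like token sequence. It is an illustration only, not the released code: the token spellings, the `notes_to_remi` helper, and the note tuple layout are assumptions, and the tempo and chord tokens are left out for brevity.

```python
def notes_to_remi(notes, positions_per_bar=16):
    """Convert (bar, position, pitch, velocity, duration) notes into event tokens.

    Bars and positions are 0-indexed; duration is counted in position steps.
    """
    events = []
    current_bar, current_pos = -1, None
    for bar, pos, pitch, velocity, duration in sorted(notes):
        while current_bar < bar:          # one Bar event per bar, even if empty
            events.append("Bar")
            current_bar += 1
            current_pos = None
        if pos != current_pos:            # Position is emitted once per onset time
            events.append(f"Position_{pos + 1}/{positions_per_bar}")
            current_pos = pos
        events += [f"Note-Velocity_{velocity}",
                   f"Note-On_{pitch}",
                   f"Note-Duration_{duration}"]
    return events

notes = [
    (0, 0, 60, 80, 4),   # C4 on the downbeat of bar 0
    (0, 0, 64, 80, 4),   # E4 at the same position
    (0, 8, 67, 70, 2),   # G4 halfway through bar 0
    (1, 0, 72, 90, 8),   # C5 on the downbeat of bar 1
]
events = notes_to_remi(notes)
```

Because every note is preceded by explicit Bar and Position tokens, the model does not have to infer where it is in the bar by accumulating time shifts.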


The new model can generate music with explicit harmonic and rhythmic structure, while allowing for expressive rhythmic freedom in music (e.g., tempo rubato).  While it can generate chord events and tempo changes events on its own, it also provides a mechanism for human users to control and manipulate the chord progression and local tempo of the music being generated as they wish.


The figure below shows the piano-rolls of piano music generated by two baseline models (the first two rows) and the proposed model (the last row), when these models are asked to “continue” a 4-bar prompt excerpted from a human-composed piece.  We can see that the proposed model continues the music better.


The figure below shows the piano-roll of a generated piano piece when we constrain the model not to use the same chord (F:minor) as the 4-bar prompt.


The paper describing this new model has been accepted for publication at ACM Multimedia 2020, the premier international conference in the field of multimedia computing.  You can find more details in our paper (see the link below the title) and try the model yourself with the code we’ve released!  We provide not only the source code but also the pre-trained model for developers to play with.


More examples of the generated music can be found at:



Enjoy listening!


By: Yu-Siang Huang, Chung-Yang Wang, Wen-Yi Hsiao, Yin-Cheng Yeh and Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

Telling Something from Your Face: Age, Beauty, and Anti-Spoofing

Figure 1. Demonstration of the facial age, beauty, pose estimation.

In the last few decades, facial analysis has been thoroughly explored due to its huge commercial potential. For example, learning the semantic meaning of a human face can help us target potential customers more easily and more efficiently. While face recognition is a well-established technology for surveillance systems and security control, it is also crucial to know whether the captured image comes from a real person or a spoofing device. In this blog post, we discuss what we do at Taiwan AI Labs and present our demo system.

Predicting a person’s age or beauty from appearance is a very subjective task. A person can look younger than his or her real age, while beauty is not even a quantifiable value. Nevertheless, knowing the approximate apparent age and beauty is still helpful for merchandise recommendation. For instance, recommending soft drinks to an elderly person may not be a good idea, nor is placing a massage chair in a children’s play zone.

There are several ways to estimate these two values with proper supervision. One is classification, which uses independent bins to represent ranges of the age or beauty score. The drawback of this method is that the ranges are set manually, which might lead to quantization error. Alternatively, it is natural to use regression to predict continuous values, but without constraints this might lead to overfitting. In addition, the face pose and resolution also have a huge impact on the prediction performance. These challenges make the prediction of age and beauty even more difficult.




Age and Beauty estimation

To resolve the above challenges at a very low cost, the demo system is built upon FSA-Net [1], recently published at CVPR 2019. It adopts the Soft-Stagewise Regression (SSR) scheme [2] to eliminate the quantization error while maintaining low memory overhead.
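The quantization error mentioned above can be seen in a toy example that contrasts a hard bin decision with a soft, probability-weighted estimate. This is a generic expected-value readout used only for illustration; it is not the stagewise formulation of SSR-Net, and the bin centers and probabilities are made up.

```python
def hard_bin_estimate(probs, bin_centers):
    """Pick the center of the most probable bin (pure classification)."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return bin_centers[best]

def soft_estimate(probs, bin_centers):
    """Probability-weighted average of bin centers (a continuous estimate)."""
    return sum(p * c for p, c in zip(probs, bin_centers))

bin_centers = [5, 15, 25, 35]      # hypothetical age bins of width 10
probs = [0.0, 0.6, 0.4, 0.0]       # hypothetical classifier output

hard_age = hard_bin_estimate(probs, bin_centers)   # snaps to a bin center
soft_age = soft_estimate(probs, bin_centers)       # falls between bin centers
```

The hard estimate can only take one of the four bin centers, while the soft estimate uses the probability mass in neighboring bins to produce a value in between.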

For training, we use 128x128x3 as the cropped face resolution, and add an auxiliary loss that supervises the quantized prediction.

A drawback of this scheme is that predictions might be unstable across different input images. We use a sequential frame-selecting pipeline to stabilize the final prediction.

Figure 2. Sequential frame selecting pipeline.

Finally, a detected face is considered valid and proceeds through the estimation pipeline only when it has enough resolution and a very small pose angle. The replacement policy can be altered as long as the selected face image is of high quality with enough resolution.
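One possible shape for such a frame-selecting pipeline is sketched below. Everything specific here is an assumption made for illustration: the thresholds (`min_size`, `max_angle`), the buffer capacity, the replacement policy of dropping the lowest-resolution face, and the averaging of per-frame predictions are not taken from the demo system.

```python
def is_valid_face(face, min_size=128, max_angle=15.0):
    """Accept a face only if it is large enough and close to frontal."""
    large_enough = min(face["width"], face["height"]) >= min_size
    frontal = all(abs(face[axis]) <= max_angle for axis in ("yaw", "pitch", "roll"))
    return large_enough and frontal

def update_buffer(buffer, face, capacity=3):
    """Add a valid face; if the buffer overflows, drop the lowest-resolution one."""
    if not is_valid_face(face):
        return buffer
    buffer = buffer + [face]
    if len(buffer) > capacity:
        buffer.remove(min(buffer, key=lambda f: min(f["width"], f["height"])))
    return buffer

def stable_estimate(buffer, key="age"):
    """Average the per-frame predictions kept in the buffer."""
    return sum(f[key] for f in buffer) / len(buffer) if buffer else None

frames = [
    {"width": 200, "height": 200, "yaw": 3, "pitch": 2, "roll": 1, "age": 30.0},
    {"width": 60, "height": 60, "yaw": 0, "pitch": 0, "roll": 0, "age": 45.0},     # too small
    {"width": 220, "height": 210, "yaw": 40, "pitch": 0, "roll": 0, "age": 20.0},  # turned away
    {"width": 180, "height": 190, "yaw": -5, "pitch": 4, "roll": 0, "age": 32.0},
]
buffer = []
for frame in frames:
    buffer = update_buffer(buffer, frame)
final_age = stable_estimate(buffer)
```

Only the two frontal, high-resolution frames contribute to the final estimate, so a single blurry or turned-away frame cannot swing the displayed value.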


Face anti-spoofing

In order to achieve anti-spoofing with a pure RGB image, we divide the process into two different tasks: cell phone detection and denoising-based anti-spoofing estimation.

Figure 3. Face anti-spoofing pipeline.

We adopt the well-known YOLOv3 [3] as the detector for cell phones, laptops, and monitors. In task 1, a face detected inside the phone area is considered a spoof. Task 2 uses the quality and the noise of the image to determine whether it is Real or Fake.
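Task 1 reduces to a geometric check between bounding boxes. The sketch below assumes boxes in `(x1, y1, x2, y2)` pixel coordinates and treats full containment as the criterion; both the helper names and the containment rule are assumptions, and an overlap-ratio threshold would be a reasonable alternative.

```python
def box_inside(inner, outer):
    """True if box inner = (x1, y1, x2, y2) lies entirely within box outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def is_device_spoof(face_box, device_boxes):
    """Flag a face as fake if it sits inside any detected phone, laptop, or monitor."""
    return any(box_inside(face_box, device) for device in device_boxes)

phone = (100, 50, 300, 450)            # hypothetical detector output
face_on_phone = (150, 120, 250, 260)
face_in_room = (400, 100, 520, 260)

spoofed = is_device_spoof(face_on_phone, [phone])
genuine = is_device_spoof(face_in_room, [phone])
```

A face that passes this check would still go through task 2, since a printed photo or a cropped screen leaves no device box to detect.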


Demo images:

The following demo shows that our system can predict age and beauty with decent

Figure 4. Demo images for age and beauty estimation.


Figure 5. Demo image for face anti-spoofing. (Red: Fake. Blue: Real.)


Face attribute estimation, such as age and beauty, is subjectively determined by the labeled data, but it is still useful for commercial analysis and recommendation. For face recognition, anti-spoofing is also very important for the security and robustness of the whole identity verification pipeline. We built these prototypes to show that there is much more potential in these topics and that AI can truly help people make better decisions with these estimations.



[1] Yang, Tsun-Yi, et al. “FSA-Net: Learning Fine-Grained Structure Aggregation for Head Pose Estimation from a Single Image.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

[2] Yang, Tsun-Yi, et al. “SSR-Net: A Compact Soft Stagewise Regression Network for Age Estimation.” IJCAI. Vol. 5. No. 6. 2018.

[3] Redmon, Joseph, and Ali Farhadi. “Yolov3: An incremental improvement.” arXiv preprint arXiv:1804.02767 (2018).

Compete for the NT$20 Million Reward in the AI Chinese Comprehension Championship Challenge – 2020 “Formosa Grand Challenge – Talk to AI”

2020 “Formosa Grand Challenge – Talk to AI”

Date: September 9, 2019

The 2020 “Formosa Grand Challenge – Talk to AI” officially kicks off today (September 9) to accept applications. The Ministry of Science and Technology invites experts to challenge the Band C Level of the Reading Comprehension Test for the “Test of Chinese as a Foreign Language,” or TOCFL for short, to compete for the first prize reward of NT$20 million.

Unlike the previous 1st “Talk to AI” competition, this challenge will be conducted in a single-choice mode, but will include voice reading and continuous dialogue tests to fully challenge AI’s ability to comprehend the Chinese language. The purpose is to allow AI to read the subject contents of different academic fields to test its comprehension and inference of the words and sentences, and further organize the entire dataset into application knowledge to achieve AI self-learning. The competition will be held in two stages, where the preliminary contest is scheduled for December 2019, and the final round in April 2020.

Dr. Wen Lian Hsu and his research team at the Academia Sinica Institute of Information Science are entrusted to work with the National Academy for Educational Research on the test dataset and to co-establish the dataset for training comprehension and dialogue. It is scheduled to be opened to the contestant teams in October 2019. Meanwhile, the Speech-to-Text API uses Ya Ting Verbatim, which is developed by AI Labs led by Ethan Tu. It is indeed a great honor for Formosa Grand Challenge to cooperate with AI Labs.

To allow AI to take root, this “Talk to AI” competition is scheduled to hold a FUN CUP team event on November 16 to encourage general and vocational high school students as well as college students to participate. The FUN CUP team will work with the Association for Computational Linguistics and Chinese Language Processing to enable young students to understand the application of AI man-machine dialogue and use existing commercial AI tools to train the machine to learn and fully utilize existing voice materials and resources with the goal of achieving small-scale scientific research results of a certain quality.

The Ministry of Science and Technology will continue to enhance the scale of the voice dataset during the competition, allowing the teams to carry out technology development and testing. The research team of Associate Professor Yuan-Fu Liao of National Taipei University of Technology is invited to do the speech data transcription, releasing about 600 hours of the AI voice dataset to the data market of the National Center for High-Performance Computing (NCHC DATA MARKET), which is to be used by paid authorized users. The fee is 2000 NTD per 150 hours.

The Ministry of Science and Technology hopes to encourage innovators to use the potential of AI development, technology, and creativity to solve the challenges of voice applications through the “Formosa Grand Challenge” competition, and looks forward to any possible progress and thinking. It also expects that the event will attract more enterprises as well as academic and research institutions to get involved and work together to promote the upswing of Taiwan’s AI voice recognition technology and assist Taiwanese enterprises in digital transformation.

