## Enhance MIDI Generation with Harmonic and Rhythmic Features

Music generation at Taiwan AI Labs is based on generating note sequences. This approach preserves most details of a piece of music, but to the human ear, music is not only a set of notes but also the patterns those notes form. Capturing such patterns is what chorder and groover aim to do: they are packages designed to extract information on harmony and groove, respectively.

By extracting harmonic and rhythmic features, computers can look at a piece of music on a larger scale than individual notes. However, readers should keep in mind that there is no fully objective right or wrong in the perception of music, so the extracted harmonic and rhythmic features are far from the only valid analysis of a piece.

### chorder

The main feature of chorder is chord detection from MIDI files. To accomplish this, chorder uses 12-dimensional vectors to represent the distribution of semitones. For a given time period, a vector $v$ sums the durations of the notes in each pitch class. For reference, there is a separate weight vector $w$ for each chord quality. For example,

$$w_{\text{major}} = [1.0, -0.2, -0.1, -0.2, 1.0, -0.5, -0.2, 1.0, -0.2, -0.1, -0.2, 0.0]$$

This means that the pitch classes 0, 4, and 7 semitones above the root of a major chord contribute the most, while the pitch class 5 semitones above the root is a negative indicator. The quality and the root pitch class can then be expressed as follows:

$$(\hat{r}, \hat{q}) = \operatorname*{argmax}_{(r, q) \in R \times Q} \; v_r \cdot w_q$$

Here $r$ is an integer from 0 to 11 representing the pitch class of the root, and $v_r$ is $v$ rotated left by $r$ positions, so that the search finds the root that best fits the chord weights $w_q$. For now, $Q$ contains six types of basic chords: major, minor, diminished, augmented, sus2, and sus4. Note that a segment is labeled as having no chord if the dot product of $v_r$ and $w_q$ is less than the total duration of the segment.
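
The search over roots and qualities can be sketched in a few lines of Python. This is a minimal illustration rather than chorder's actual code: the function names are made up here, and only the major template quoted above is included because the other five weight vectors are not given in this post.

```python
# Major-chord weights as quoted in the text; other qualities would be added
# to the templates dict in the same way (their values are not given here).
W_MAJOR = [1.0, -0.2, -0.1, -0.2, 1.0, -0.5, -0.2, 1.0, -0.2, -0.1, -0.2, 0.0]


def rotate_left(v, r):
    """Rotate v left by r positions, so that v[r] lands at index 0."""
    return v[r:] + v[:r]


def detect_chord(v, templates, segment_duration):
    """Return the (root, quality) maximising v_r . w_q, or None for no chord.

    v is the 12-dimensional pitch-class duration vector of the segment.
    """
    best_score, best = None, None
    for quality, w in templates.items():
        for root in range(12):
            score = sum(a * b for a, b in zip(rotate_left(v, root), w))
            if best_score is None or score > best_score:
                best_score, best = score, (root, quality)
    # "No chord" rule: the best score must reach the segment duration.
    if best_score is None or best_score < segment_duration:
        return None
    return best
```

For a one-beat segment in which C, E, and G are each held for the full beat, the best match is root 0 (C) with the major template; a segment containing only a very short note falls under the no-chord rule.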

The bass note, which may differ from the root note, is simply the lowest note in the segment whose combined duration is at least 1/8 of the duration of the entire segment. Given the bass note and the basic quality, certain rules are applied to correct the quality of the chord or to add a seventh. For example, an Fmaj chord with D in the bass is considered a Dmin7 chord instead.
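
The bass-note rule can be sketched as follows. The note format (MIDI pitch plus duration inside the segment) and the function name are assumptions made for this example, as is the choice to add up durations per MIDI pitch rather than per pitch class.

```python
from collections import defaultdict


def bass_note(notes, segment_duration, min_fraction=1 / 8):
    """Return the lowest pitch whose combined duration is long enough.

    notes is an iterable of (midi_pitch, duration_within_segment) pairs.
    Returns None if no pitch reaches the required share of the segment.
    """
    total = defaultdict(float)
    for pitch, duration in notes:
        total[pitch] += duration
    for pitch in sorted(total):  # scan from the lowest pitch upward
        if total[pitch] >= min_fraction * segment_duration:
            return pitch
    return None
```

A very short low note (for example a grace note) is skipped, so the next pitch up that is held long enough becomes the bass.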

chorder uses segments of 1 and 2 beats. If a 2-beat segment has a higher alignment score than both of its 1-beat segments, the chord of the 2-beat segment is applied to both beats. Otherwise, each beat is assigned its own chord.
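
A small sketch of this decision, assuming each candidate is a hypothetical `(chord, alignment_score)` pair and that the scores of 1-beat and 2-beat segments are directly comparable (the post does not say how they are normalized):

```python
def choose_chords(two_beat, first_beat, second_beat):
    """Return the chords assigned to beat 1 and beat 2.

    Each argument is a (chord, alignment_score) pair.
    """
    chord, score = two_beat
    if score > first_beat[1] and score > second_beat[1]:
        return [chord, chord]  # the 2-beat chord covers both beats
    return [first_beat[0], second_beat[0]]
```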

### groover

Unlike chords, there are no universally agreed-upon symbols for groove. That is why, in groover, the groove representations are simply classes of rhythmic patterns derived from a given MIDI dataset.

The rhythmic patterns are set at a certain length (for example, a measure) and divided into quantized periods. Each note contributes an intensity value to the pattern, and the intensity increases with lower pitch and higher note velocity. Then the patterns are clustered, not with Euclidean distance, but with a modified version of cosine similarity. The modification applies a “blurring” step before the cosine similarity is calculated. Let $v_{i,l}$ be $v$ shifted left by $i$ positions and $v_{i,r}$ be $v$ shifted right by $i$ positions; the blur of $n$ positions is then defined as

$$v_{\text{blurred}} = v + \sum_{i=1}^{n} \left(1 - \frac{i}{n}\right)(v_{i,l} + v_{i,r})$$

The magnitude can be rescaled to make this a weighted average, but that is unnecessary for cosine similarity. Blurring gives partial credit to onsets at nearby positions, instead of a similarity of 0 whenever positions do not match exactly. With this customized similarity, we can apply clustering algorithms such as k-means, which groover currently uses to label rhythmic patterns.
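
The blurred similarity can be written out directly from the formula. The sketch below is illustrative only: the function names are invented here, and it assumes the shifts wrap around the pattern (circular shifts), which the post does not specify.

```python
import math


def shift(v, i):
    """Circularly shift v left by i positions (negative i shifts right)."""
    n = len(v)
    return [v[(j + i) % n] for j in range(n)]


def blur(v, n):
    """v + sum_{i=1..n} (1 - i/n) * (left shift by i + right shift by i)."""
    out = list(v)
    for i in range(1, n + 1):
        weight = 1 - i / n
        left, right = shift(v, i), shift(v, -i)
        out = [o + weight * (l + r) for o, l, r in zip(out, left, right)]
    return out


def blurred_cosine(a, b, n):
    """Cosine similarity of the two patterns after blurring both."""
    a, b = blur(a, n), blur(b, n)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Two patterns with a single onset one position apart have a plain cosine similarity of 0, but a blurred similarity of about 0.67 with `n = 2`. With `n = 1` the only term has weight 0, so the formula as written leaves the pattern unchanged.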

Below is an audiation of the harmonic and rhythmic features of an 8-bar excerpt. The velocity of each chord represents the intensity value of the detected pattern.

The original 8-bar excerpt:

Its audiation of harmonic and rhythmic features:

By:  Joshua Chang and Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

## Investigating the Impact of Facebook Polarization on the 2021 Taiwanese Referendum

Taiwan will hold four important referendums on December 18, 2021: “Pork Imports,” “Referendum Dates,” “Nuclear Plant,” and “Algae Reef Protection.” According to previous research, disinformation and media manipulation can influence public opinion on topics such as these. This study examines how Facebook has affected the polarization of public opinion on these topics over the previous eleven months. The phenomenon of public opinion polarization and the factors affecting policy support and political attitudes are analyzed using artificial intelligence technology. The polarization impact of pages that showed evidence of potential media manipulation through coordinated behavior was considered. Among these pages, those that were shared most often are referred to as “amplifiers.”


## Exploring Atypical Online Coincidental Behavior on PTT

This study focuses on atypical coincidental behavior on the Taiwanese social media platform PTT¹ to discover attempts to manipulate public opinion during the COVID-19 outbreak in Taiwan from May through August 2021. The research team aims to identify atypical coincidental behaviors in order to uncover suspicious collaborative efforts to manipulate public opinion, and to develop AI tools that analyze the information from the outbreak period comprehensively and efficiently, saving researchers labor and time.

Since its launch in 1995, PTT has become one of the most used Chinese-language online services. In Taiwan, many users make it a habit to post news and information from various sources, which leads to a diverse spectrum of discussions, and for many people the discussion board is the first stop for news. Almost all information related to Taiwan can be found on PTT, where users discuss it and share their opinions and ideas.

Such discussions are often later presented in news media for wider public consumption. This not only amplifies these ideas for a larger audience but also affects businesses and governmental institutions, whose decision-making needs to take public opinion into account. For example, when whistle-blower Dr. Wenliang Li of China first broke evidence in Chinese media of human-to-human transmission of COVID-19, a Taiwanese physician began sharing the information on PTT. Because of these discussions, authorities at the Taiwan Centers for Disease Control were alerted and began to take action much earlier than other countries. In this way, one could compare PTT to a speaker’s stand in the center of the town square, from which users can give public addresses with the majority of the townsfolk as the audience. The more time someone spends on the speaker’s stand, the greater their potential influence on the town. Users who have the power to influence discussions could be likened to a bullying majority dominating the speaker’s stand for their own ends.

Since 2018, Taiwanese media and academia have observed the potential for cyber armies to conduct strategic information operations through social networks such as PTT, Line, and Facebook. “A Pilot Study on PTT in the Context of Influence Operations” introduces the interface, functions, and terminology of PTT. Other studies identified groups of cyber armies on the Gossiping Board of PTT. Compared with previous papers, this research offers not only case studies but also a data-driven, evidence-based approach that comprehensively quantifies “atypical coincidental behavior” and summarizes the differences between user groups.

The research team grouped users by phi-coefficient score to measure coincidental behavior in many aspects, including shared IP addresses, time patterns of activity, narratives, emotion, and incitement in comments. Furthermore, the study compared several behavioral metrics among user groups across various events² to uncover evidence of information manipulation, and observed the correlation between the ideological slogans used and groups of users. By observing opinion manipulation on social media, users may be able to more readily recognize atypical coincidental behavior and therefore decrease their chances of being manipulated. By exposing more of the context behind user content, this research hopes to decrease the negative impact of atypical coincidental behavior on public opinion.
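
For readers unfamiliar with the phi coefficient, it is the correlation between two binary variables. The sketch below shows the standard formula on two 0/1 sequences; how the study builds such sequences (for instance, one entry per post indicating whether a user commented on it) is an assumption made here for illustration, not a description of the research pipeline.

```python
import math


def phi_coefficient(a, b):
    """Phi coefficient of two equal-length binary (0/1) sequences."""
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # If one sequence is constant the coefficient is undefined; return 0.0.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0
```

Two users who always appear in the same posts score 1, users who never overlap score -1, and unrelated activity scores close to 0.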

Furthermore, the researchers compared the behavior of coincidental users. Considering the shared IP addresses, active times, narratives, and so on, the researchers conclude that these patterns reflect more than random chance. The evidence shows that these coincidental users were engaged in various forms of intentional, collaborative effort in their posts, comments, and other behaviors. Although one cannot claim for certain that these users intend to manipulate public opinion, the researchers could not find any other plausible explanation for such behavior, because it is so different from that of typical users.

The following collaborative behaviors were observed during this time period:

1. Users in each coincidental user group were active during similar times.
2. Coincidental users were more likely to participate in events with more comments and in events with higher incite scores.
3. Coincidental users showed much higher rates of manipulative patterns.
4. Each coincidental user group demonstrated different preferences in its patterns of behavior.
5. Some coincidental user groups favored just one negative/manipulative slogan, while others used more than one or none at all.
6. Different coincidental user groups favored different specific narratives and word choices.

In summary, the researchers want to emphasize that behavior indicative of an intent to manipulate opinions on social media platforms appears to be very active. In the period selected for this study, 880 of the 1,985 targeted events (about 44%) were atypical. This means that collaborative users demonstrated intentionally manipulative behaviors in almost half of all targeted events. This report also shows that manipulation can and does happen on a variety of topics, including sports, business, entertainment, and politics. The research team believes that more work should be done to study and catalog atypical, manipulative behaviors across a variety of social media platforms, and that users of these platforms should be made aware of collaborative, manipulative behavior so that they know when others may be attempting to influence public opinion.

To download the complete version of the research paper, please submit the request form at the end of this blog.

### Methodology

This study focused on atypical coincidental behavior, aiming to uncover the types of events that were manipulated via social media and the ways in which they were manipulated. To do so, the important events that occurred during the Taiwanese COVID-19 outbreak, as well as the users exhibiting coincidental behavior, had to be identified in order to study the interactions between the events and the coincidental users.

Figure 1 presents an overview of the methodology framework. With the framework provided, all the data posted on the PTT Gossiping Board from May 1, 2021, to August 31, 2021, including 130,099 users and 8,413,675 comments on 293,370 posts, were analyzed to detect events and group users. Additionally, through the comment content analysis, features were extracted for the study of the interactions between the events and the coincidental users. With the results obtained from event user analysis and comment content analysis, coincidental users were first analyzed, then coincidental groups were analyzed.

_______________________________________________________________________________________

¹ PTT is the largest local forum in Taiwan.

² This research defines an “event” as the sum total of collected news articles on a topic, combined with all social media reactions to it, on given platforms. For more details about how the data was clustered, please download the complete essay and refer to Chapter II Methodology.

 

## The Challenge of Speaker Diarization

### Introduction

Speaker diarization is the task of partitioning audio recordings into segments according to the identity of the speaker; in short, it identifies “who spoke when.” Speaker diarization can be used to analyze audio data in various scenarios, such as business meetings, court proceedings, and online videos, to name a few. However, it is also a very challenging task, since the characteristics of several speakers have to be modeled jointly.

Traditionally, a diarization system is a combination of multiple independent sub-modules that are optimized individually. Recently, end-to-end deep learning methods have become more popular. The standard metric for speaker diarization is the diarization error rate (DER), which is the sum of false alarm, missed detection, and speaker confusion errors.
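
Written out, DER is usually reported relative to the total duration of reference speech. The symbols below are introduced here only for illustration:

$$\text{DER} = \frac{T_{\text{false alarm}} + T_{\text{missed}} + T_{\text{confusion}}}{T_{\text{speech}}}$$

Here $T_{\text{false alarm}}$ is the duration of non-speech labeled as speech, $T_{\text{missed}}$ the duration of speech that was not detected, $T_{\text{confusion}}$ the duration of speech attributed to the wrong speaker, and $T_{\text{speech}}$ the total duration of speech in the reference annotation. Because false alarms are counted in the numerator, DER can exceed 100%.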

In this article, we will introduce our upcoming speaker diarization system first, then give an overview of the latest research in end-to-end speaker diarization.

### Our Method

Our product uses a traditional diarization pipeline, which consists of several components: speech turn segmentation, speaker embedding extraction, and clustering. We utilize pyannote.audio [2], an open-source toolkit for speaker diarization, to train most of our models.

1. Speech Turn Segmentation

The first step of our diarization system is to partition the audio recording into possible speaker turn segments. A voice activity detection model detects speech regions and removes non-speech parts, and a speaker change detection model detects speaker change points in the audio. Each of these models is trained on a sequence labeling task: given a sequence of audio features as input, it outputs a sequence of labels.

For both models, tunable hyperparameters determine how sensitive the models are when segmenting the audio. For the speaker change detection model, only audio frames whose detection score is higher than the threshold “alpha” are marked as speaker change points. We find that it is generally more beneficial to segment more aggressively (i.e., split the audio into more segments) in the speech turn segmentation stage, so that most segments contain only one speaker. After the clustering stage, we can merge segments assigned to the same speaker for better speech recognition performance in the following stages.
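
The effect of the threshold and the later merge step can be illustrated with a short sketch. It works on a plain list of per-frame change scores and is not the pyannote.audio API; all function names are made up for this example.

```python
def change_points(scores, alpha):
    """Indices of frames whose speaker-change score exceeds alpha."""
    return [i for i, s in enumerate(scores) if s > alpha]


def split_at(num_frames, points):
    """Split the frame range [0, num_frames) at the given change points."""
    bounds = [0] + [p for p in points if 0 < p < num_frames] + [num_frames]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]


def merge_same_speaker(segments, labels):
    """Merge adjacent segments that clustering assigned to the same speaker."""
    merged = []
    for (start, end), label in zip(segments, labels):
        if merged and merged[-1][2] == label and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end, label)
        else:
            merged.append((start, end, label))
    return merged
```

Lowering `alpha` produces more change points and therefore more, shorter segments; any over-segmentation inside a single speaker's turn is undone by `merge_same_speaker` once the cluster labels are known.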

2. Speaker Embedding Extraction

After performing speech turn segmentation, in order to facilitate clustering of speaker segments in the clustering module, a speaker embedding model is used to obtain a compact, fine-grained representation for each segment.

The model we chose is SincTDNN, which is essentially an x-vector architecture in which the filters are learned instead of handcrafted. The model is trained with additive angular margin loss.

We use CN-Celeb [3], a large-scale open-source Chinese speaker recognition dataset, to fine-tune the model pretrained on VoxCeleb. CN-Celeb is a challenging, multi-domain dataset consisting of 3,000 speakers in 11 different genres. By using such a diverse dataset, we expect our model to be stable across various real-life scenarios. We also find that only a few epochs are needed to fine-tune the model, and that a speaker embedding with a lower equal error rate (EER) does not always imply a lower diarization error rate.

3. Clustering

Traditional clustering methods can be used to cluster the speaker embeddings, identifying which speaker each segment belongs to. In our system, we use affinity propagation for short audio and KMeans for long audio. Since we do not know how many speakers are in a recording, affinity propagation performs relatively better than KMeans, as it determines the number of clusters directly. For KMeans, we estimate the best number of clusters by finding the elbow of the derivatives of MSCD. However, we resort to KMeans for longer audio files, since affinity propagation tends to be slower in this situation.
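
The idea of picking the number of clusters at an elbow can be sketched as below. The exact criterion used in our system is not spelled out in this post, so the sketch makes an assumption: it takes a list of per-k cost values (however they are computed) and places the elbow where the discrete second difference is largest.

```python
def elbow_k(costs):
    """Pick k at the sharpest bend of a decreasing cost curve.

    costs[i] is the clustering cost obtained with k = i + 1 clusters.
    """
    if len(costs) < 3:
        raise ValueError("need costs for at least three values of k")
    first = [b - a for a, b in zip(costs, costs[1:])]
    second = [b - a for a, b in zip(first, first[1:])]
    best = max(range(len(second)), key=lambda i: second[i])
    # second[i] is centred on costs[i + 1], which corresponds to k = i + 2.
    return best + 2
```

For a cost curve such as `[10.0, 5.0, 1.0, 0.8, 0.7]`, the cost stops dropping sharply after three clusters, so the function returns 3.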

Combining all components

With all the aforementioned components trained, we can combine them into a complete pipeline for diarization. Several hyperparameters are optimized jointly to minimize diarization error rate on a certain dataset.

Compared with our previous system, the current system obtains around a 30% relative DER improvement on our internal dataset flow-inc, which consists of a few thousand news recordings. Most of the improvement stems from the better speaker embedding and clustering method, which reduced DER from 15% to 8% in the oracle speech turn segmentation setting.

In our production system, we also allow some extra customization by users. For instance, if the user provides the number of speakers in the recording, we can use KMeans to recluster with that number of clusters. Also, if the user provides or corrects the speaker labels of some utterances, this extra information is used when updating the cluster centers. Reclustering then proceeds with a partial initialization based on the updated centers, resulting in more accurate predictions.
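
Reclustering with partial initialization could look roughly like the following plain-Python k-means. It is a sketch under stated assumptions, not our production code: the names are hypothetical, the centers without user labels are seeded with a farthest-point heuristic, and user-labeled segments are kept fixed to their speaker during the iterations.

```python
def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recluster(embeddings, k, user_labels, iters=10):
    """K-means over segment embeddings, seeded by user-corrected labels.

    user_labels maps a segment index to a speaker id. Returns one cluster
    index per segment; labeled speakers get indices in sorted-id order.
    """
    speakers = sorted(set(user_labels.values()))
    if len(speakers) > k:
        raise ValueError("more labeled speakers than clusters")
    # Partial initialization: one center per user-labeled speaker.
    centers = [mean([embeddings[i] for i, s in user_labels.items() if s == spk])
               for spk in speakers]
    # Assumption: seed the remaining centers with the farthest points.
    if not centers:
        centers.append(list(embeddings[0]))
    while len(centers) < k:
        far = max(embeddings, key=lambda e: min(sq_dist(e, c) for c in centers))
        centers.append(list(far))
    pinned = {i: speakers.index(s) for i, s in user_labels.items()}
    assign = []
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: sq_dist(e, centers[j]))
                  for e in embeddings]
        for i, j in pinned.items():  # assumption: user labels stay fixed
            assign[i] = j
        for j in range(k):
            members = [e for e, a in zip(embeddings, assign) if a == j]
            if members:
                centers[j] = mean(members)
    return assign
```

Calling `recluster(embeddings, k=2, user_labels={0: "alice", 2: "bob"})` starts from the centers implied by the two corrected segments, so the remaining segments are pulled toward the speakers the user has already identified.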

Finally, the output of speaker diarization can be fed into an ASR system, so that the final output is a transcript for each speaker turn. It is not yet certain whether diarization helps improve ASR performance; however, recent work has shown that ASR with diarization can obtain a WER comparable to that of state-of-the-art ASR systems while providing extra speaker label information.

### What’s Next?

While most systems in production still follow the traditional pipeline of speech turn segmentation followed by clustering, end-to-end diarization systems are receiving more and more attention. Because the traditional pipeline assumes a single speaker per block when extracting speaker embeddings, it cannot handle overlapping speech. End-to-end diarization systems take frame-level speech features as input and directly output frame-level activity for each speaker, and can therefore handle overlapping speech naturally. In addition, end-to-end systems minimize diarization error directly, removing the need to tune the hyperparameters of several components. Today, end-to-end systems perform on par with, or even better than, traditional pipeline systems on many datasets.

However, end-to-end systems still have several drawbacks. Firstly, these systems cannot easily handle an arbitrary number of speakers. The originally proposed EEND [4] can only deal with a fixed number. Several follow-up works attempt to address this problem, but their systems still struggle to generalize to real world settings, where there may be more than 3 speakers.

Secondly, while end-to-end systems are optimized directly for diarization error rate, they may easily overfit to properties of the training dataset, such as the number of speakers, speaker characteristics, and background conditions. In contrast, traditional clustering-based approaches have been shown to be more robust across datasets. Lastly, most end-to-end diarization systems use a Transformer or Conformer as their core architecture. The self-attention mechanism in the Transformer has complexity quadratic in the sequence length, which hinders the model’s ability to process long utterances or meetings.

Recently, hybrid systems that integrate an end-to-end neural network with clustering have been proposed [5][6], which try to get the best of both worlds. Hybrid systems handle overlapping speech with an end-to-end network and handle an arbitrary number of speakers with clustering algorithms. Also, block processing of the input audio [5] significantly shortens the processing time for long recordings. This line of research is very promising and deserves more attention.

Finally, many state-of-the-art systems still rely on simulated mixture recordings for training, and are then fine-tuned and evaluated on real datasets. This indicates a lack of large-scale real-life datasets for end-to-end training. With more challenging real-life datasets, we believe diarization performance can be further improved.

### Conclusion

To address the challenging task of speaker diarization, we use a system consisting of several neural networks (speech activity detection, speaker change detection, and speaker embedding) together with a clustering algorithm. While these components are trained separately, they are optimized jointly by tuning hyperparameters. In future work, we would like to incorporate an end-to-end segmentation model to handle overlapping speech and to make our system closer to real time.

### References

[1] A Review of Speaker Diarization: Recent Advances with Deep Learning https://arxiv.org/abs/2101.09624

[2] pyannote.audio: neural building blocks for speaker diarization https://arxiv.org/abs/1911.01255

[3] CN-Celeb: multi-genre speaker recognition https://arxiv.org/abs/2012.12468

[4] End-to-End Neural Speaker Diarization with Self-attention https://arxiv.org/abs/1909.06247

[5] Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech https://arxiv.org/abs/2105.09040

[6] Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors  https://arxiv.org/abs/2107.01545

[7] The yellow brick road of diarization, challenges and other neural paths https://dihardchallenge.github.io/dihard3workshop/slide/The%20yellow%20brick%20road%20of%20diarization,%20challenges%20and%20other%20neural%20paths.pdf

## Lyrics-free Singing Voice Generation

The conventional approach to generating singing voices is through singing voice synthesis (SVS) techniques. A human user feeds lyrics and MIDI scores (a sequence of notes) to a well-trained SVS model, and the model generates audio recordings that follow the given lyrics and scores faithfully. The synthesis models have little freedom in deciding what to “sing.”

In contrast to this conventional approach, we are interested in a singing voice generation model that does not take any lyrics or MIDI scores as input, but instead decides the phonemes and pitches underlying its singing largely on its own. We find this setting more interesting because it emphasizes the “creativity” of the model.

Accordingly, we chose to work directly on audio recordings without explicitly extracting pitches or lyrics from the singing audio. The first problem was how to gather enough singing voice audio to train neural network models. Luckily, we were experienced in music source separation, which can already isolate singing voices with reasonably good quality.

To allow the singing voice generation model to work together with our piano generation models (introduced here), we also want the model or a variant of it to sing according to a given piano track but not sing the exact notes of the piano track.

To model the singing voices, in the winter of 2019 we designed a GAN model that can generate singing voices freely, as well as a variant of it that can sing along with a given accompaniment audio. Unlike conventional GANs for image generation, the proposed model can generate audio of arbitrary duration. This work was published at the International Conference on Computational Creativity (ICCC) 2020. We refer to these models as the first generation.

The first generation of models produced very creepy singing voices that could give you nightmares, so we developed a second generation of models. They were also GAN-based, but both the architecture and the vocoder were improved. One notable extra mechanism is cycle regularization, which greatly improved pronunciation and the intelligibility of phones. This work was published at INTERSPEECH 2020.

Although the sound quality was greatly improved in the second generation, the generated singing voices did not sound good musically. Therefore, we started to build new models with better sequence modeling capability. At almost the same time, OpenAI proposed Jukebox, a two-step method for music generation that works directly on audio. In this two-step method, 1) the audio is converted into a sequence of discrete tokens with a VQ-VAE, and 2) the tokens are modeled by a Transformer. In our third generation of models, we adopted this two-step method while redesigning the architecture around our goals.
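
To make the first step concrete, the sketch below shows only the vector-quantization lookup at the heart of a VQ-VAE: each continuous frame vector is replaced by the index of its nearest codebook entry. It is an illustration with made-up names, not Jukebox's or our model's code; the encoder and decoder networks and the training of the codebook are omitted.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames, codebook):
    """Map each continuous frame vector to the index of its nearest code."""
    return [min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
            for f in frames]


def dequantize(tokens, codebook):
    """Look the code vectors back up from their token indices."""
    return [codebook[t] for t in tokens]
```

The resulting list of integers is the discrete token sequence that the second step, the Transformer, learns to model and generate.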

One notable improvement was that the model can run in real time. A variant of it can accompany piano playing in real time. The music team here at Taiwan AI Labs has deployed it for real-time interaction in various exhibitions and art installations since the autumn of 2020.

We have been working on various improvements and variants of this model to suit different applications. Our best model can now generate singing voices with much better quality than what is described above. The new model is likely to make its debut soon. Stay tuned for more fun!

By: Jen-Yu Liu and Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

## MuseMorphose: Music Style Transfer with A Transformer VAE

At Taiwan AI Labs, we are constantly pushing the frontier of deep music generation models. In the past year, we have rolled out Guitar Transformer (blog), which can compose human-readable guitar tabs with plausible fingerings, and Compound Word Transformer (blog), which vastly accelerated model training and inference thanks to carefully re-engineered music representation. Today, proudly making its debut is MuseMorphose, our brand new model for music style transfer.

Unlike our previous works, which offered limited possibilities for user interaction, MuseMorphose is designed to engage users extensively in the machine's creative process. With MuseMorphose, users may input a favorite song of any length and set two musical style attributes, namely rhythmic intensity and polyphony (i.e., harmonic fullness), for every bar to a desired level (from 0 to 7). The model will then re-create the song, taking into account the user-specified sequence of bar-level style attributes.

## Listening Samples

To showcase MuseMorphose’s capabilities, let us present some of its compositions first!

In the upper example, we feed the famous 8-bar theme of Mozart’s “Twinkle, Twinkle, Little Star” to the model, and ask it to generate a style-transferred version with increasing rhythmic and harmonic intensities (going up by 1 level per bar). For the next one, we choose an 8-bar excerpt from our AILabs1K7 pop song dataset (published here) and pick three drastically different style settings for MuseMorphose to wield its creativity.

Across all the samples, MuseMorphose responds precisely to the style settings. What’s more, it sticks faithfully to the musical flow of the input song while adding its own creative and harmonious touches.

## MuseMorphose: The Model

Now that we have had some pleasant music, it is time to delve into the technical underpinnings of MuseMorphose.

Figure: Architecture of MuseMorphose

MuseMorphose is based on two popular deep learning models for music generation: Variational Autoencoder (VAE), and Transformer. VAE (see also: MusicVAE & Attr-Aware VAE) grants users the freedom to harness the generation through operations on its learned latent space, much like we have shown in the examples; however, its RNN backbone greatly limits the length of music it can model.

Transformer (check out: Music Transformer & MuseNet), on the other hand, can generate music of up to 5 minutes long, but its conditional generation use cases remain underexplored. People could only use global condition tokens to affect a composition’s style or instrumentation, or, in some other cases, they have to supply a full melody/track for the model to come up with an accompaniment. It has not been possible to freely edit a piece according to its high-level musical flow. Therefore, we integrate the two models to construct MuseMorphose, which exhibits both of their strengths, and gets rid of their weaknesses.

In MuseMorphose, a Transformer encoder is tasked with extracting the musical skeleton of each bar in a piece as vectors (called “latent conditions” in the figure above). Then, these skeletons are concatenated with bar-level style inputs from the user, i.e., the rhythmic intensity and polyphony levels, mapped to learned embedding vectors. Finally, the concatenated conditions enter sequentially, i.e., bar by bar, into a Transformer decoder through the in-attention mechanism developed by us, to produce the style-transferred piece.

This asymmetric encoder-decoder design ensures the fine granularity of conditions, and maintains Transformer’s inherent ability to generate long sequences. Moreover, the in-attention mechanism, which injects each bar-level condition to corresponding timesteps and all layers of the Transformer decoder, is the key to effective conditioning.
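
The conditioning pattern described above can be sketched schematically. The code below is a simplified reading, not the released implementation: it assumes that each bar's condition is projected to the hidden size and added to the hidden states of that bar's timesteps before every decoder layer, and it treats the layers and the projection as opaque callables with hypothetical names.

```python
def in_attention_decoder(token_states, bar_ids, bar_conditions, layers, project):
    """Run decoder layers, re-injecting each bar's condition before every layer.

    token_states:   list of hidden vectors, one per timestep
    bar_ids:        index of the bar each timestep belongs to
    bar_conditions: one condition vector per bar (latent plus style embeddings)
    layers:         callables mapping a list of hidden vectors to a new list
    project:        callable mapping a condition vector to the hidden size
    """
    projected = [project(c) for c in bar_conditions]
    hidden = token_states
    for layer in layers:
        # Add the condition of the bar that each timestep belongs to.
        hidden = [[h + c for h, c in zip(state, projected[bar])]
                  for state, bar in zip(hidden, bar_ids)]
        hidden = layer(hidden)
    return hidden
```

Because the condition is re-applied at every layer rather than only at the input, its influence cannot fade as the sequence passes through a deep decoder.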

## Further Materials

You may take a look at our paper to find more discussion of the architecture design, training objective, and evaluation. Our demo website provides even more compositions by MuseMorphose, as well as those by the baselines it outperforms, for you to listen to. Want to compose some music with MuseMorphose? No problem! Just check out our open-source implementation and pre-trained checkpoint.

If you find our work exciting, and have some thoughts/suggestions about it (e.g., what other style attributes may be added to MuseMorphose), feel free to drop us a mail. We are definitely looking forward to a lively discussion.

By: Shih-Lun Wu, Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

## Compound Word Transformer: Generate Pop Piano Music of Full-Song Length

Over the past months, we have worked on letting transformer models learn to generate full-song music, and here is our first attempt: the Compound Word Transformer. A paper describing this work will be published as a full paper at AAAI 2021, a premier conference in the field of artificial intelligence.

You can find a preprint of the paper, and the open source code of the model, from the links below:

We made two major improvements in this work. First, we present a new representation, Compound Word, which lets transformer models accept longer input sequences. Second, we adopt the linear transformer as our backbone model. Its memory complexity scales linearly with the sequence length, which enables us to train on longer sequences with a limited hardware budget.

To the best of our knowledge, this is also the first AI model that deals with music at the full-song scale instead of music segments. Currently, we let our model focus on learning to compose pop piano music, whose average length is approximately 3 to 5 minutes. Before diving into the technical details, let’s listen to some generated pieces:

All clips above are generated from scratch. Aside from this, our model can also understand the lead sheet, which conveys only melody line and chord progression, and translate it into an expressive piano performance:

How did we do it? We reviewed our previous work, REMI, and found that the way music is represented can be condensed further. If the sequence needed to represent a piece is shorter, we can feed longer music into the model. The figure below shows how we convert REMI into the Compound Word (CP) representation. Instead of predicting only one token per timestep, CP groups consecutive, related tokens and predicts them all at once, which reduces the sequence length. Each field of CP relates to a certain aspect or attribute of music, such as duration, velocity, or chords.
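
The grouping idea can be illustrated with a toy example. The token format and field names below are hypothetical and cover only note-related tokens; the actual REMI and CP vocabularies are richer.

```python
def to_compound_words(tokens, fields=("pitch", "duration", "velocity")):
    """Group runs of (field, value) tokens into one compound word per note.

    A new compound word starts whenever the first field is seen again.
    """
    words, current = [], None
    for field, value in tokens:
        if field == fields[0] or current is None:
            current = dict.fromkeys(fields)  # unfilled fields stay None
            words.append(current)
        current[field] = value
    return words


remi_like = [("pitch", 60), ("duration", 4), ("velocity", 80),
             ("pitch", 64), ("duration", 2), ("velocity", 72)]
compound = to_compound_words(remi_like)
```

Here six single tokens become two compound words, so the sequence the transformer has to model is three times shorter for the same music.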

To model such CP sequences, we design specialized input and output modules for the transformer. The figure below illustrates the proposed architecture in action at a certain timestep. On the right half, where the model makes token predictions, there are multiple feedforward heads, each accounting for one field of CP, which corresponds to a single row of CP in the figure above. On the left half, each field of CP has its own token embedding; these embeddings are concatenated into a single vector and then reshaped by a linear layer to become the final input of the transformer.
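As a rough sketch of the input module, the snippet below looks up one embedding per field, concatenates them, and projects the result to the model dimension. The field names, vocabulary sizes, embedding sizes, and the randomly initialized weights are all placeholders chosen for illustration, not the values used in the paper.

```python
import random

# Hypothetical fields and sizes, chosen only for illustration.
FIELDS = ["pitch", "duration", "velocity", "beat"]
VOCAB_SIZE = {"pitch": 128, "duration": 32, "velocity": 32, "beat": 17}
EMBED_DIM = {"pitch": 8, "duration": 8, "velocity": 4, "beat": 2}
MODEL_DIM = 6

rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# One embedding table per field, plus one linear projection (bias omitted).
tables = {f: random_matrix(VOCAB_SIZE[f], EMBED_DIM[f]) for f in FIELDS}
projection = random_matrix(MODEL_DIM, sum(EMBED_DIM.values()))

def embed_compound_word(word):
    """Concatenate the per-field embeddings, then project to MODEL_DIM."""
    concat = []
    for field in FIELDS:
        concat.extend(tables[field][word[field]])
    return [sum(w * x for w, x in zip(row, concat)) for row in projection]

vector = embed_compound_word({"pitch": 60, "duration": 4, "velocity": 20, "beat": 1})
```

Because each field has its own table, the embedding width can differ per field while the transformer still receives a single vector of fixed size per timestep.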

Because each head concentrates on only one field of CP, we have more precise control when either modeling or generating music. For example, in training, we can assign different embedding sizes to tokens of different types, according to the difficulty associated with each type. We use larger embeddings for the harder types, like duration and pitch, and smaller ones for the easier types, like beats and bars. At inference time, we can adopt different sampling policies for different fields. For example, we can use a higher temperature to get more randomness in the prediction of velocity tokens, and a lower temperature for pitch tokens to avoid mistakes.
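Per-field sampling policies can be illustrated with plain temperature sampling. The temperatures and logits below are made-up values, and the field names are assumptions; the sketch only shows the mechanism of giving each output head its own temperature.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-field temperatures: conservative for pitch, looser for velocity.
TEMPERATURE = {"pitch": 0.8, "velocity": 1.5}

def sample_compound_word(logits_per_field, rng):
    """Sample one token id per field, each with its own temperature."""
    word = {}
    for field, logits in logits_per_field.items():
        probs = softmax_with_temperature(logits, TEMPERATURE[field])
        word[field] = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return word

sharp = softmax_with_temperature([2.0, 1.0, 0.0], 0.5)
flat = softmax_with_temperature([2.0, 1.0, 0.0], 2.0)
word = sample_compound_word(
    {"pitch": [2.0, 1.0, 0.0], "velocity": [0.5, 0.4, 0.3, 0.2]},
    random.Random(0),
)
```

A lower temperature concentrates probability on the top-scoring token (`sharp`), while a higher one spreads it out (`flat`), which matches the intuition of being careful with pitch and freer with velocity.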

The proposed model shows good training and inference efficiency. We can now train our model on a single GPU with 11 GB of memory within just one day. At inference time, generating a 3-minute song takes only about 20 seconds, which is much faster than real time.

Learning to generate full songs means that the model can take a whole song as input and knows when to stop generating. However, the music generated by the current model still does not exhibit clear structural patterns, such as AABA form or repeated phrases. We are working on this, hoping that one day our AI can write a hit song.

More examples of the generated music can be found at:

By: Wen-Yi Hsiao, Jen-Yu Liu, Yin-Cheng Yeh, Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

## Guitar Transformer and Jazz Transformer

At the Yating Music Team of Taiwan AI Labs, we are developing new music composing AI models that extend our previous Pop Music Transformer model (see the previous blog). In October 2020, we will present two full papers documenting some of our latest results at the International Society for Music Information Retrieval Conference (ISMIR), the premier international conference on music information retrieval and music generation.

The first paper, entitled “Automatic composition of guitar tabs by Transformers and groove modeling,” talks about a Guitar Transformer model that learns to generate guitar tabs.

Here are two fingerstyle guitar tabs generated fully automatically by this model, with no human curation involved.

The highlight of this work is the design and incorporation of what we call “grooving” tokens into the representation we use for a piece of symbolic guitar music. Groove, which can in general be considered a rhythmic feeling of a changing or repeated pattern, or “humans’ pleasurable urge to move their bodies rhythmically in response to music,” is not explicitly specified in either a MIDI or a TAB file. Instead, groove is implied by the arrangement of note onsets over time. As a result, existing methods for representing music do not use groove-related tokens.

What we did was apply music information retrieval (MIR) techniques to extract, for each bar, a 16-dimensional vector representing the occurrence of note onsets over 16 equally spaced, quantized positions of the bar. We then used the classical k-means algorithm to cluster these 16-dimensional vectors from all the bars of all the pieces in our training data into k (=32) clusters (clustering is needed because otherwise there would be too many unique 16-dimensional vectors). We then treat the cluster IDs as “grooving tokens” and assign one grooving token to each bar of a music piece. In this way, what our Transformer model (specifically, Transformer-XL) sees during training is not only the note-related tokens but also these bar-level grooving tokens. It turns out that this considerably improves the quality of the generated music, compared to the baseline model that does not use grooving tokens.
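The pipeline above can be sketched with a tiny pure-Python k-means. This is an illustration under stated assumptions: onsets are already quantized to positions 0 to 15, the onset vector is binary, the data is a toy set of eight bars, and k is 2 instead of 32. None of the function names come from the released code.

```python
import random

def onset_vector(onset_positions, steps=16):
    """Binary vector marking which of the 16 bar positions carry a note onset."""
    vec = [0] * steps
    for position in onset_positions:
        vec[position] = 1
    return vec

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, centroids):
    return min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))

def kmeans(vectors, k, iterations=20, seed=0):
    rng = random.Random(seed)
    # Initialise from distinct vectors so that no two centroids start identical.
    unique = list(dict.fromkeys(tuple(v) for v in vectors))
    centroids = [list(v) for v in rng.sample(unique, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(v, centroids)].append(v)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Toy data: four bars with onsets on the beats, four bars with off-beat onsets.
bars = [[0, 4, 8, 12]] * 4 + [[2, 6, 10, 14]] * 4
vectors = [onset_vector(bar) for bar in bars]
centroids = kmeans(vectors, k=2)
grooving_tokens = [nearest(v, centroids) for v in vectors]
```

Each bar ends up with a single cluster ID, which can then be inserted into the token sequence as that bar's grooving token.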

The following figure shows the result of a user study in which subjects were asked to choose the best of three continuations generated by different models, with or without the grooving tokens, given a short human-made prompt. The result is broken down by the subjects’ self-reported guitar proficiency level. We can see that the professionals are aware of the difference between the grooving-agnostic model and the two groove-aware models (we implemented two variants here, a hard grooving model and a soft grooving model, which differ in the way we represent the musical onsets).

The second paper, entitled “The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures,” talks about a Jazz Transformer model that learns to generate Jazz-style lead sheets, using the Jazzomat dataset.

This paper focuses on developing objective metrics tailored for symbolic music generation tasks (as opposed to general sequence generation tasks). Specifically, we proposed the following metrics:

• Pitch Usage: Entropy of 1- & 4-bar chromagrams;
• Rhythm: Cross-bar similarity of grooving patterns;
• Harmony: Percentage of unique chord trigrams;
• Repeated structures: Short-, mid-, & long-term structureness indicators (computed from the fitness scape plot);
• Overall musical knowledge: Multiple-choice continuation prediction (given 8 bars, predict the next 8 bars).
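To make two of these metrics more concrete, here is a simplified sketch of a pitch-class entropy and a unique chord trigram ratio. It is not the MusDr implementation: the entropy below counts notes rather than building a per-bar chromagram, and both function names are hypothetical.

```python
import math
from collections import Counter

def pitch_class_entropy(midi_pitches):
    """Shannon entropy (in bits) of the pitch-class histogram of a segment."""
    counts = Counter(pitch % 12 for pitch in midi_pitches)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def unique_trigram_ratio(chords):
    """Fraction of chord trigrams that are distinct within a chord sequence."""
    trigrams = [tuple(chords[i:i + 3]) for i in range(len(chords) - 2)]
    if not trigrams:
        return 0.0
    return len(set(trigrams)) / len(trigrams)

one_pitch_class = pitch_class_entropy([48, 60, 72])  # three Cs in different octaves
two_pitch_classes = pitch_class_entropy([60, 62])    # C and D, used equally
looped = unique_trigram_ratio(["C", "G", "Am", "F", "C", "G", "Am", "F"])
```

A segment that uses a single pitch class has zero entropy, and a chord loop that repeats itself has a ratio below 1 because some trigrams occur more than once.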

Implementations of these evaluation metrics can all be found in the MusDr repository listed above. They have also been integrated into MusPy, an open source Python library for symbolic music generation developed by Hao-Wen Dong et al. at UCSD.

These objective metrics can give us some idea of the quality of machine-generated music, dispensing with the need to run expensive user studies too many times. They also help elucidate the differences between machine-composed and human-composed music. For example, the structureness indicators show clearly that Transformer-XL based music composing models, which represent the current state of the art, still fall short of generating music with reasonable mid- and long-term structure. See the following figure for a comparison between the fitness scape plot of a piece composed by the Jazz Transformer (marked as “Model (B)”) and that of a human-composed one.

We are actively using these new metrics to guide and improve our models. In particular, we are looking for ways to induce structure in machine-composed music. Let us know if you like what we are doing or have ideas to discuss with us!

Ref:
[1] Yu-Hua Chen, Yu-Siang Huang, Wen-Yi Hsiao, and Yi-Hsuan Yang, “Automatic composition of guitar tabs by Transformers and groove modeling,” in Proc. Int. Society for Music Information Retrieval Conf. 2020 (ISMIR’20).
[2] Shih-Lun Wu and Yi-Hsuan Yang, “The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures,” in Proc. Int. Society for Music Information Retrieval Conf. 2020 (ISMIR’20).

By: Yu-Hua Chen, Shih-Lun Wu, Yu-Siang Huang, Wen-Yi Hsiao and Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

## Introduction

With the development of the Internet, people can get news online quickly and conveniently. Nonetheless, this is also dangerous: a person might prefer one news medium over others, and that medium may report from its own position while leaving out facts that it does not want people to know. For this purpose, we propose a system that detects the facts missing from a news report.

## Method

#### Work flow

First, we group the news reports using their embeddings derived by the Universal Sentence Encoder (USE)[1]. Within each group, the reports are highly related; in fact, as expected, most of them cover the same event. Meanwhile, we extract a summary for each news report using the PageRank algorithm. Then, we summarize each news group using the summaries of the reports in that group. Finally, we compare the summary of each news report with the group summary to get the missing facts of the report.

#### Grouping news

This part was implemented by Yu-An Wang and is omitted here.

#### PageRank

For each news report, we split the raw content into sentences. Then we compute the similarity of each sentence pair using USE and edit distance. Once the similarity matrix is ready, we rank the sentences by PageRank and choose the top k sentences as the report’s summary, where k is a hyperparameter.

By combining the summaries of the reports in a group, we can summarize the news group in the same way.
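The ranking step can be sketched as a PageRank power iteration over a sentence similarity matrix. To keep the example self-contained, it swaps in a simple word-overlap (Jaccard) similarity as a stand-in for the USE and edit-distance similarity described above; the function names, the damping factor, and the example sentences are assumptions made for illustration.

```python
def jaccard(a, b):
    # Stand-in similarity; the system described above uses USE and edit distance.
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)

def similarity_matrix(sentences):
    n = len(sentences)
    return [
        [0.0 if i == j else jaccard(sentences[i], sentences[j]) for j in range(n)]
        for i in range(n)
    ]

def pagerank(similarity, damping=0.85, iterations=50):
    n = len(similarity)
    scores = [1.0 / n] * n
    row_sums = [sum(row) for row in similarity]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            incoming = 0.0
            for j in range(n):
                if row_sums[j] > 0:
                    incoming += similarity[j][i] / row_sums[j] * scores[j]
                else:
                    incoming += scores[j] / n  # isolated sentence: spread evenly
            updated.append((1 - damping) / n + damping * incoming)
        scores = updated
    return scores

def summarize(sentences, k):
    """Return the k highest-ranked sentences, kept in their original order."""
    scores = pagerank(similarity_matrix(sentences))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

sentences = [
    "the mayor opened the new bridge today",
    "the new bridge was opened by the mayor",
    "the mayor said the bridge cost ten million",
    "weather was sunny",
]
scores = pagerank(similarity_matrix(sentences))
summary = summarize(sentences, k=2)
```

Sentences that resemble many other sentences accumulate a higher score, so an off-topic sentence such as the last one is unlikely to make it into the summary.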

#### Missing facts

We call each sentence in a news summary a “fact”. We can now take the difference between a news summary and the group summary: the sentences that appear in the group summary but not in the news summary are the missing facts of the report. Conversely, the sentences that appear in the news summary but not in the group summary are the exclusive content of the news report.
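The comparison described above amounts to a two-way set difference. The sketch below assumes that two sentences count as the same fact only when their text matches exactly, which is a simplification; the function name and the sample sentences are hypothetical.

```python
def compare_summaries(report_summary, group_summary):
    """Return (missing_facts, exclusive_content) for one news report."""
    report_facts = set(report_summary)
    group_facts = set(group_summary)
    missing = [s for s in group_summary if s not in report_facts]
    exclusive = [s for s in report_summary if s not in group_facts]
    return missing, exclusive

group_summary = ["The bridge opened today.", "It cost ten million.", "Two workers were injured."]
report_summary = ["The bridge opened today.", "The mayor gave a speech."]
missing, exclusive = compare_summaries(report_summary, group_summary)
```

In practice, a similarity threshold on sentence embeddings would probably be needed instead of exact matching, since different outlets rarely phrase the same fact identically.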

## Source Classifier for Daily News and Farm Post

### Introduction

News media are sometimes not as neutral as expected. These biases are what make it possible to train a satisfactory classifier: in other words, the classifier works because each news medium may have its own writing style. We use BERT[2] as the contextual representation extractor and train a classifier to predict the source (news medium) of a news report given its content (or title). Besides daily news, this kind of classifier can also be used to classify the source of farm posts, which are usually biased and contain fake information.

### Data

• Daily news
  • # classes: 4
    • The four major newspapers in Taiwan (China Times, United Daily News, Liberty Times, Apple Daily)
  • Preprocessing
    • Sometimes the reporter name, the name of the news medium, or some slogans appear in the raw content. We filter them out using hand-crafted rules.
• Farm post
  • # classes: 16
    • mission-tw, qiqu.live, qiqu.world, hssszn, qiqi.today, cnba.live, 77s.today, i77.today, nooho.net, hellotw, qiqu.pro, taiwan-politicalnews, readthis.one, twgreatdaily.live, taiwan.cnitaiwannews.cn

### Method

#### C-512

• As shown in the figure below, this classifier is composed of a BERT model followed by a linear layer. This model can handle inputs whose content length is less than 512. Note that the latent vector is the one corresponding to the CLS token.

#### C-whole

• Since the pretrained BERT model only handles texts shorter than 512, we propose a method to deal with longer texts.
• First, we train a C-512 model. The C-whole model uses the BERT module of the trained C-512 model as its representation extractor. Given an input content of arbitrary text length l, we split it into ⌈l/512⌉ segments. The representation of each segment is derived by the BERT module and then passed through a linear layer. The outputs of this linear layer are averaged over all segments, and the resulting vector is fed into another linear layer to obtain the final output of the C-whole model.
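The data flow of C-whole can be sketched without any deep learning library. In the snippet below, `encode_segment`, `first_linear`, and `second_linear` are hypothetical callables standing in for the trained BERT module and the two linear layers; only the segment splitting and the averaging follow the description above.

```python
import math

def split_into_segments(tokens, max_len=512):
    """Split a token list into ceil(len / max_len) consecutive segments."""
    n_segments = math.ceil(len(tokens) / max_len)
    return [tokens[i * max_len:(i + 1) * max_len] for i in range(n_segments)]

def average_vectors(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def c_whole_forward(tokens, encode_segment, first_linear, second_linear):
    """Encode each segment, average the per-segment outputs, then classify."""
    segments = split_into_segments(tokens)
    hidden = [first_linear(encode_segment(segment)) for segment in segments]
    return second_linear(average_vectors(hidden))

# Toy stand-ins so the data flow can be run end to end.
tokens = list(range(1200))
segments = split_into_segments(tokens)
output = c_whole_forward(
    tokens,
    encode_segment=lambda segment: [float(len(segment)), 1.0],
    first_linear=lambda vector: vector,
    second_linear=lambda vector: vector,
)
```

With 1200 tokens, the input is split into three segments of 512, 512, and 176 tokens, and the averaged vector is what the final layer receives.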

### Application

[1] Daniel Cer, et al. “Universal sentence encoder.” arXiv preprint arXiv:1803.11175 (2018).

[2] Jacob Devlin, et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv preprint arXiv:1810.04805 (2018).

## Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions

Paper (ACM Multimedia 2020):  https://arxiv.org/abs/2002.00212 (pre-print)

Code (GitHub):  https://github.com/YatingMusic/remi

We’ve developed Pop Music Transformer, a deep learning model that can generate pieces of expressive Pop piano music of several minutes.  Unlike existing models for music composition, our model learns to compose music over a metrical structure defined in terms of bars, beats, and sub-beats.  As a result, our model can generate music with more salient and consistent rhythmic structure.

Here are nine pieces of piano performances generated by our model in three different styles.  While generating the music, our model takes no human input (e.g., prompt or chord progressions) at all.  Moreover, no post-processing steps are needed to refine the generated music.  The model learns to generate expressive and coherent music automatically.

From a technical point of view, the major improvement we have made is to invent and employ a new approach to representing musical content. The new representation, called REMI (REvamped MIDI-derived events), provides a deep learning model with more contextual information for modeling music than existing MIDI-like representations. Specifically, REMI uses position and bar events to provide a metrical context for models to “count the beats.” It also uses supportive musical tokens that capture high-level musical information about tempo and chords. Please see the figure below for a comparison between REMI and the commonly adopted MIDI-like token representation of music.
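As a rough illustration of how bar and position events give a model a metrical grid, the sketch below converts a few notes into a REMI-like event list. The event names and the note format are simplified assumptions, and chord and tempo events are left out; see the released code for the actual vocabulary.

```python
def notes_to_remi_like(notes, steps_per_bar=16):
    """Convert (start_step, pitch, duration, velocity) notes into a REMI-like event list.

    start_step counts sixteenth-note steps from the beginning of the piece.
    """
    events = []
    current_bar = -1
    for start, pitch, duration, velocity in sorted(notes):
        bar, position = divmod(start, steps_per_bar)
        while current_bar < bar:
            events.append("Bar")  # one event per bar line, including empty bars
            current_bar += 1
        events.append(f"Position_{position + 1}/{steps_per_bar}")
        events.append(f"Velocity_{velocity}")
        events.append(f"Pitch_{pitch}")
        events.append(f"Duration_{duration}")
    return events

notes = [(0, 60, 4, 80), (4, 64, 4, 80), (16, 67, 8, 90)]
events = notes_to_remi_like(notes)
```

Because every note is preceded by a position within an explicitly marked bar, the model does not have to infer the beat by adding up time shifts, as it would with a MIDI-like representation.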

The new model can generate music with explicit harmonic and rhythmic structure, while allowing for expressive rhythmic freedom in music (e.g., tempo rubato). While it can generate chord events and tempo change events on its own, it also provides a mechanism for human users to control and manipulate the chord progression and local tempo of the music being generated as they wish.

The figure below shows the piano rolls of piano music generated by two baseline models (the first two rows) and the proposed model (the last row), when these models are asked to “continue” a 4-bar prompt excerpted from a human-composed piece. We can see that the proposed model continues the music better.

The figure below shows the piano roll of a piece of piano music generated when we constrain the model not to use the same chord (F:minor) as the 4-bar prompt.

The paper describing this new model has been accepted for publication at ACM Multimedia 2020, the premier international conference in the field of multimedia computing.  You can find more details in our paper (see the link below the title) and try the model yourself with the code we’ve released!  We provide not only the source code but also the pre-trained model for developers to play with.

More examples of the generated music can be found at: