Guitar Transformer and Jazz Transformer

At the Yating Music Team of Taiwan AI Labs, we are developing new AI models for music composition that extend our previous Pop Music Transformer model (see the previous blog).  In October 2020, we are going to present two full papers documenting some of our latest results at the International Society for Music Information Retrieval Conference (ISMIR), the premier international conference on music information retrieval and music generation.

 

The first paper, entitled “Automatic composition of guitar tabs by Transformers and groove modeling,” talks about a Guitar Transformer model that learns to generate guitar tabs.

Here are two fingerstyle guitar tabs generated fully automatically by this model, with no human curation involved.

The highlight of this work is the design of what we call “grooving” tokens and their incorporation into the representation we use for a piece of symbolic guitar music.  Groove, which can in general be considered a rhythmic feeling of a changing or repeated pattern, or “humans’ pleasurable urge to move their bodies rhythmically in response to music,” is not explicitly specified in either a MIDI or TAB file.  Instead, groove is implied by the arrangement of note onsets over time.  As a result, existing methods for representing music do not use groove-related tokens.

What we did was to apply music information retrieval (MIR) techniques to extract, for each bar, a 16-dimensional vector representing the occurrence of note onsets over 16 equally spaced quantized positions of the bar.  We then use the classical k-means algorithm to cluster these 16-dim vectors from all the bars of all the pieces in our training data, leading to k (=32) clusters (clustering is needed because otherwise there would be too many unique 16-dim vectors).  We treat the cluster IDs as “grooving tokens” and assign one grooving token to each bar of a music piece.  In this way, what our Transformer model (specifically, Transformer-XL) sees during training is not only the note-related tokens but also these bar-level grooving tokens.  It turns out that this improves the quality of the generated music considerably, compared to the baseline model that does not use grooving tokens.
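The grooving-token idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the paper: onsets are given as fractions of a bar, the onset vector is binary, k-means is a plain textbook version, and the function names (`onset_vector`, `kmeans`, `grooving_token`) are hypothetical. The toy example uses k = 2 instead of 32.

```python
import random

POSITIONS_PER_BAR = 16  # 16 equally spaced quantized positions, as in the text


def onset_vector(onsets_in_bar):
    """Binary 16-dim vector marking which positions of a bar hold a note onset.

    `onsets_in_bar` holds onset times as fractions of the bar in [0, 1)
    (an assumed input format).
    """
    vec = [0] * POSITIONS_PER_BAR
    for t in onsets_in_bar:
        pos = int(round(t * POSITIONS_PER_BAR)) % POSITIONS_PER_BAR
        vec[pos] = 1
    return vec


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means; requires at least k distinct vectors."""
    rng = random.Random(seed)
    unique = sorted(set(tuple(v) for v in vectors))
    centroids = [list(v) for v in rng.sample(unique, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def grooving_token(vec, centroids):
    """The grooving token of a bar is the ID of its nearest cluster centroid."""
    return min(range(len(centroids)), key=lambda c: squared_distance(vec, centroids[c]))


# Toy data: two on-beat bars and one off-beat bar.
bars = [[0, 0.25, 0.5, 0.75], [0.125, 0.375, 0.625, 0.875], [0, 0.25, 0.5, 0.75]]
vectors = [onset_vector(b) for b in bars]
centroids = kmeans(vectors, k=2)
tokens = [grooving_token(v, centroids) for v in vectors]
```

In this toy run the two on-beat bars receive the same token and the off-beat bar a different one, which is the bar-level rhythmic information the model sees in addition to the note tokens.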

The following figure shows the result of a user study in which subjects were asked to choose the best among three continuations of a short human-made prompt, generated by different models with or without the grooving tokens.  The result is broken down by the self-reported guitar proficiency level of the subjects.  We can see that the professionals are aware of the difference between the grooving-agnostic model and the two groove-aware models (we implemented two variants here, a “hard grooving” model and a “soft grooving” model, which differ in the way we represent the musical onsets).

 

The second paper, entitled “The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures,” talks about a Jazz Transformer model that learns to generate Jazz-style lead sheets, using the Jazzomat dataset.

The focus of this paper is the development of objective metrics tailored for symbolic music generation tasks (as opposed to general sequence generation tasks).  Specifically, we proposed the following metrics:

  • Pitch Usage: Entropy of 1- & 4-bar chromagrams;
  • Rhythm: Cross-bar similarity of grooving patterns;
  • Harmony: Percentage of unique chord trigrams;
  • Repeated structures: Short-, mid-, & long-term structureness indicators (computed from the fitness scape plot);
  • Overall musical knowledge: Multiple-choice continuation prediction (given 8 bars, predict the next 8 bars).
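To give a feel for two of these metrics, here is a minimal sketch under simplifying assumptions; it is not the `MusDr` implementation. The entropy is computed over a 12-bin pitch-class histogram of the notes in one window (for example, 1 or 4 bars), and the harmony metric is read as the fraction of distinct chord trigrams in a chord sequence. Both function names are hypothetical.

```python
import math
from collections import Counter


def pitch_class_entropy(midi_pitches):
    """Shannon entropy (in bits) of the 12-bin pitch-class histogram."""
    counts = Counter(p % 12 for p in midi_pitches)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def unique_trigram_ratio(chords):
    """Fraction of chord trigrams that are distinct within the sequence."""
    trigrams = [tuple(chords[i:i + 3]) for i in range(len(chords) - 2)]
    if not trigrams:
        return 0.0
    return len(set(trigrams)) / len(trigrams)


# A window using four different pitch classes equally often.
window_entropy = pitch_class_entropy([60, 62, 64, 65])

# A progression that loops C-F-G repeats its trigrams.
loop_ratio = unique_trigram_ratio(["C", "F", "G", "C", "F", "G", "C"])
```

A low entropy suggests the window dwells on few pitch classes, and a low trigram ratio suggests a repetitive chord progression.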

Implementations of these evaluation metrics can all be found in the `MusDr` repository listed above.  They have also been integrated into `MusPy`, an open source Python library for symbolic music generation developed by Hao-Wen Dong et al. at UCSD.

These objective metrics can help us gain some idea of the quality of the machine-generated music, dispensing with the need to run expensive user studies too many times.  These metrics also help elucidate the difference between machine-composed and human-composed music.  For example, from the structureness indicators, we can see clearly that Transformer-XL based music composing models, which represent the current state of the art, still fall short of generating music with reasonable mid- and long-term structure.  See the following figure for a comparison between the fitness scape plot of a piece composed by the Jazz Transformer (marked as “Model (B)”) and that of a human-composed one.

We are actively using these new metrics to guide and improve our models.  In particular, we are finding ways to induce structures in machine-composed music.  Let us know if you like what we are doing or have ideas you would like to discuss with us!

 

Ref:
[1] Yu-Hua Chen, Yu-Siang Huang, Wen-Yi Hsiao, and Yi-Hsuan Yang, “Automatic composition of guitar tabs by Transformers and groove modeling,” in Proc. Int. Society for Music Information Retrieval Conf. 2020 (ISMIR’20).
[2] Shih-Lun Wu and Yi-Hsuan Yang, “The Jazz Transformer on the front line: Exploring the shortcomings of AI-composed music through quantitative measures,” in Proc. Int. Society for Music Information Retrieval Conf. 2020 (ISMIR’20).

By: Yu-Hua Chen, Shih-Lun Wu, Yu-Siang Huang, Wen-Yi Hsiao and Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

Missing facts and source classifier of daily news


Introduction

With the development of the Internet, people can conveniently and rapidly get news online. Nonetheless, this is also dangerous: a person might prefer one specific news medium over others, and that medium may report news from its own position while leaving out some facts that it does not want people to know. To address this, we propose a system that detects missing facts in a news report.


Method

Work flow

First, we group the news reports using their embeddings derived by Universal Sentence Encoder (USE) [1]. Within each group, the reports are highly related; in fact, most of them report the same event, as expected. Meanwhile, we extract a summary of each news report using the PageRank algorithm. Then, we further summarize each news group using the summaries of the reports in the group. Afterwards, we compare the summary of each news report with the group summary to get the missing facts of the report.

Grouping news

This part was implemented by Yu-An Wang and is omitted here.

PageRank

For each news report, we split the raw content into sentences. Then we compute the similarity of each sentence pair using USE and edit distance. Once the similarity matrix is ready, we can rank the sentences by PageRank. We then choose the top k sentences as the report’s summary, where k is a hyperparameter.

Combining the summaries of the news in a group, we can summarize the news group in a similar way.
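The ranking step can be sketched as follows. This is a minimal illustration, not our production code: word-overlap (Jaccard) similarity stands in for the USE and edit-distance similarity described above, and the function names, the damping factor of 0.85, and the iteration count are all assumptions.

```python
def jaccard(a, b):
    """Stand-in similarity: word overlap between two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def pagerank(sim, damping=0.85, iterations=50):
    """Power iteration on a similarity matrix (self-similarity is ignored)."""
    n = len(sim)
    if n == 0:
        return []
    scores = [1.0 / n] * n
    # Total outgoing weight of each sentence, excluding itself.
    out_weight = [sum(sim[j][i] for i in range(n) if i != j) for j in range(n)]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(
                scores[j] * sim[j][i] / out_weight[j]
                for j in range(n)
                if j != i and out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def summarize(sentences, k):
    """Return the k highest-ranked sentences, kept in their original order."""
    sim = [[jaccard(a, b) for b in sentences] for a in sentences]
    scores = pagerank(sim)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


report = [
    "the mayor opened the new bridge today",
    "the new bridge was opened by the mayor",
    "traffic on the bridge will start monday",
    "bananas are yellow",
]
summary = summarize(report, k=2)
```

Sentences that share content with many others rank highest, so the unrelated last sentence is left out of the summary. The same `summarize` call can be applied to the pooled summaries of a news group.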

Missing facts

We call each sentence in a news summary “a fact”. We can now take the difference between a news summary and the group summary: the sentences that exist in the group summary but not in the news summary are the missing facts of the report. Conversely, the sentences that exist in the news summary but not in the group summary are the exclusive content of the news report.
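As a minimal sketch of this comparison, assuming that facts are matched by exact string equality (a real system would more plausibly match sentences by semantic similarity), the two differences can be computed like this. The function name `compare_facts` is hypothetical.

```python
def compare_facts(news_summary, group_summary):
    """Return (missing facts, exclusive content) for one news report."""
    news, group = set(news_summary), set(group_summary)
    missing = [s for s in group_summary if s not in news]
    exclusive = [s for s in news_summary if s not in group]
    return missing, exclusive


group_summary = ["The bridge opened today.", "The project ran over budget."]
news_summary = ["The bridge opened today.", "The mayor praised the builders."]
missing, exclusive = compare_facts(news_summary, group_summary)
```

Here the report misses the budget overrun, while the mayor's praise is content exclusive to it.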


Application

Islander (島民衛星): https://islander.cc/

Source Classifier for Daily News and Farm Post


Introduction

News media are sometimes not as neutral as expected. Because of these biases, or more precisely because each news medium tends to have its own writing style, it is possible to train a satisfactory classifier. We use BERT [2] as the contextual representation extractor and train a classifier to predict the source (news medium) of a news report given its content (or title). Besides daily news, this kind of classifier can also be used to classify the source of farm posts, which are usually biased and contain fake information.


Data

  • Daily news
    • # classes: 4
    • Taiwan’s four major newspapers (China Times, United Daily News, Liberty Times, Apple Daily)
    • Preprocessing
      • Sometimes the reporter name, the news medium itself, or some slogans are in the raw content. We filter them using hand-crafted rules.
  • Farm post
    • # classes: 16
    • mission-tw, qiqu.live, qiqu.world, hssszn, qiqi.today, cnba.live, 77s.today, i77.today, nooho.net, hellotw, qiqu.pro, taiwan-politicalnews, readthis.one, twgreatdaily.live, taiwan.cnitaiwannews.cn

Method

C-512

  • As shown in the figure below, this classifier is composed of a BERT model followed by a linear layer. This model can handle inputs shorter than 512 tokens. Note that the latent vector is the one corresponding to the CLS token.

C-whole

  • Since the pretrained BERT model only handles texts shorter than 512 tokens, we propose a method to deal with longer texts.
  • First, we train a C-512 model. The C-whole model uses the BERT module of the trained C-512 model as its representation extractor. Given an input content of arbitrary text length l, we split it into ⌈l/512⌉ segments. The representation of each segment is derived by the BERT module and then passed through a linear layer. The outputs of this linear layer are averaged over all segments, and the averaged vector is fed into another linear layer to obtain the final output of the C-whole model.
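The data flow of C-whole can be sketched without any deep learning library. In the sketch below, `encode`, `segment_head`, and `output_head` are hypothetical stand-ins for the BERT module and the two linear layers; only the segment splitting and the averaging follow the description above.

```python
import math

MAX_LEN = 512  # segment length from the text


def split_segments(tokens, max_len=MAX_LEN):
    """Split a token list of length l into ceil(l / max_len) segments."""
    n_segments = math.ceil(len(tokens) / max_len)
    return [tokens[i * max_len:(i + 1) * max_len] for i in range(n_segments)]


def mean_pool(vectors):
    """Element-wise average of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def c_whole_forward(tokens, encode, segment_head, output_head):
    """encode -> first linear layer per segment -> average -> second linear layer."""
    segments = split_segments(tokens)
    per_segment = [segment_head(encode(seg)) for seg in segments]
    return output_head(mean_pool(per_segment))


# Toy stand-ins: the "encoder" returns the segment length as a 1-dim vector.
tokens = list(range(1300))
segments = split_segments(tokens)  # segment lengths: 512, 512, 276
result = c_whole_forward(
    tokens,
    encode=lambda seg: [float(len(seg))],
    segment_head=lambda v: v,
    output_head=lambda v: v[0],
)
```

Averaging makes the final layer independent of the number of segments, so one trained model can score inputs of any length.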

Application

Islander (島民衛星): https://islander.cc/


1. Cer, Daniel, et al. “Universal sentence encoder.” arXiv preprint arXiv:1803.11175 (2018).

2. Devlin, Jacob, et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv preprint arXiv:1810.04805 (2018).

Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions

Paper (ACM Multimedia 2020):  https://arxiv.org/abs/2002.00212 (pre-print)

Code (GitHub):  https://github.com/YatingMusic/remi

 

We’ve developed Pop Music Transformer, a deep learning model that can generate pieces of expressive Pop piano music of several minutes.  Unlike existing models for music composition, our model learns to compose music over a metrical structure defined in terms of bars, beats, and sub-beats.  As a result, our model can generate music with more salient and consistent rhythmic structure.

 

Here are nine pieces of piano performances generated by our model in three different styles.  While generating the music, our model takes no human input (e.g., prompt or chord progressions) at all.  Moreover, no post-processing steps are needed to refine the generated music.  The model learns to generate expressive and coherent music automatically.

 

 

From a technical point of view, the major improvement we have made is a new approach to representing musical content.  The new representation, called REMI (REvamped MIDI-derived events), gives a deep learning model more contextual information for modeling music than the existing MIDI-like representation.  Specifically, REMI uses position and bar events to provide a metrical context for models to “count the beats.”  It also uses supportive musical tokens that capture the high-level music information of tempo and chord.  Please see the figure below for a comparison between REMI and the commonly adopted MIDI-like token representation of music.
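To make the idea of bar and position events concrete, here is a simplified sketch that turns a list of notes into a REMI-like token sequence. It is an illustration only, not the released `remi` code: the token names and the input format (times in 16th-note steps) are assumptions, and the tempo and chord tokens are omitted.

```python
POSITIONS_PER_BAR = 16  # assumed grid: 16 positions per bar


def notes_to_remi_like(notes):
    """Convert notes to a REMI-like token list.

    `notes` is a list of (start, pitch, velocity, duration) tuples,
    with start and duration counted in 16th-note steps.
    """
    events = []
    current_bar = -1
    for start, pitch, velocity, duration in sorted(notes):
        bar, position = divmod(start, POSITIONS_PER_BAR)
        while current_bar < bar:  # emit one Bar event per bar line crossed
            events.append("Bar")
            current_bar += 1
        events.append(f"Position_{position + 1}/{POSITIONS_PER_BAR}")
        events.append(f"Velocity_{velocity}")
        events.append(f"Pitch_{pitch}")
        events.append(f"Duration_{duration}")
    return events


tokens = notes_to_remi_like([(0, 60, 80, 4), (4, 64, 80, 4), (16, 67, 90, 8)])
```

Because every note is preceded by its position within an explicitly marked bar, the model does not have to infer the metrical grid by accumulating time shifts, as it would with a MIDI-like representation.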

 

The new model can generate music with explicit harmonic and rhythmic structure, while allowing for expressive rhythmic freedom in music (e.g., tempo rubato).  While it can generate chord events and tempo change events on its own, it also provides a mechanism for human users to control and manipulate the chord progression and local tempo of the music being generated as they wish.

 

The figure below shows the piano-rolls of piano music generated by two baseline models (the first two rows) and the proposed model (the last one), when these models are asked to “continue” a 4-bar prompt excerpted from a human-composed piece.  We can see that the proposed model continues the music better.

 

The figure below shows the piano-roll of a generated piano piece when we constrain the model not to use the same musical chord (F:minor) as the 4-bar prompt.

 

The paper describing this new model has been accepted for publication at ACM Multimedia 2020, the premier international conference in the field of multimedia computing.  You can find more details in our paper (see the link below the title) and try the model yourself with the code we’ve released!  We provide not only the source code but also the pre-trained model for developers to play with.

 

More examples of the generated music can be found at:

https://drive.google.com/drive/folders/1LzPBjHPip4S0CBOLquk5CNapvXSfys54

 

Enjoy listening!

 

By: Yu-Siang Huang, Chung-Yang Wang, Wen-Yi Hsiao, Yin-Cheng Yeh and Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)

Telling Something from Your Face: Age, Beauty, and Anti-Spoofing

Figure 1. Demonstration of facial age, beauty, and pose estimation.

In the last few decades, facial analysis has been thoroughly explored due to its huge commercial potential. For example, learning the semantic meaning of a human face can help us target potential customers more easily and more efficiently. While face recognition is a well-established technology for surveillance systems and security control, it is also crucial to know whether the captured image comes from a real person or a spoofing device. In this blog post, we discuss what we do at Taiwan AI Labs and present our demo system.

Predicting a person’s age or beauty from appearance is a very subjective task. A person can look younger than his or her real age, while beauty is not even a quantifiable value. Nevertheless, knowing the approximate apparent age and beauty is still helpful for merchandise recommendation. For instance, recommending soft drinks to an elderly person may not be a good idea, nor is placing a massage chair in a children’s play zone.

There are several ways to estimate these two values with proper supervision. With classification, we can use independent bins, each representing a range of the age or beauty score. The drawback of this method is that the ranges are set manually, which might lead to quantization error. On the other hand, it is also natural to use regression to predict continuous values, but without constraints this might lead to overfitting. In addition, the face pose and resolution have a huge impact on the prediction performance. These challenges make the prediction of age and beauty even more difficult.

 

 

Method:

Age and Beauty estimation

To resolve the above challenges at very low cost, the demo system is built upon the recently published FSA-Net [1] from CVPR 2019. It adopts the Soft-Stagewise Regression (SSR) scheme [2] to eliminate the quantization error while maintaining low memory overhead.

For training, we use 128x128x3 as the cropped face resolution and add an auxiliary loss that supervises the quantized prediction.

A drawback of this scheme is that the prediction may be unstable across different input images. We use a sequential frame selection pipeline to stabilize the final prediction.

Figure 2. Sequential frame selecting pipeline.

Finally, a detected face is considered valid and proceeds through the estimation pipeline only when it has enough resolution and a very small pose angle. The replacement policy can be altered as long as the selected face image is of high quality with enough resolution.
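The validity check and one possible replacement policy can be sketched as below. The thresholds, the `FaceObservation` fields, and the "smallest pose angle wins" policy are all assumptions made for illustration; the demo system's actual values and policy are not given here.

```python
from dataclasses import dataclass

MIN_FACE_SIZE = 128     # hypothetical minimum side length, in pixels
MAX_POSE_ANGLE = 15.0   # hypothetical maximum pose angle, in degrees


@dataclass
class FaceObservation:
    width: int
    height: int
    yaw: float
    pitch: float
    roll: float


def largest_angle(face):
    return max(abs(face.yaw), abs(face.pitch), abs(face.roll))


def is_valid_face(face):
    """A face is valid if it is large enough and nearly frontal."""
    big_enough = min(face.width, face.height) >= MIN_FACE_SIZE
    return big_enough and largest_angle(face) <= MAX_POSE_ANGLE


def select_best(frames):
    """Keep the valid face with the smallest pose angle; None if none is valid."""
    valid = [f for f in frames if is_valid_face(f)]
    if not valid:
        return None
    return min(valid, key=largest_angle)


frames = [
    FaceObservation(200, 200, 5.0, 2.0, 1.0),   # valid
    FaceObservation(64, 64, 0.0, 0.0, 0.0),     # too small
    FaceObservation(200, 200, 40.0, 0.0, 0.0),  # turned away
    FaceObservation(150, 150, 1.0, 1.0, 1.0),   # valid and most frontal
]
best = select_best(frames)
```

Running the estimator only on the selected frame is one way to avoid the unstable predictions caused by small or strongly rotated faces.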

 

Face anti-spoofing

In order to achieve anti-spoofing with a pure RGB image, we divide the process into two tasks: cell phone detection and denoising-based anti-spoofing estimation.

Figure 3. Face anti-spoofing pipeline.

We adopt the well-known YOLOv3 [3] as the detector for cell phones, laptops, and monitors. In task 1, a face detected inside the phone area is considered a spoof (Fake). Task 2 uses the quality and the noise of the image to determine whether it is Real or Fake.
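The decision logic of the two tasks can be sketched as follows. This is a guess at how the results might be combined, not the demo system's code: boxes are assumed to be (x_min, y_min, x_max, y_max) tuples from the detector, `quality_says_real` stands in for the output of the denoising-based estimator, and a face is assumed to be Fake if either task flags it.

```python
def box_inside(inner, outer):
    """True if box `inner` lies entirely within box `outer`."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def classify_face(face_box, device_boxes, quality_says_real):
    """Task 1: a face inside a detected device is Fake. Task 2: image quality decides."""
    if any(box_inside(face_box, d) for d in device_boxes):
        return "Fake"
    return "Real" if quality_says_real else "Fake"


# A face shown on a phone screen is rejected regardless of image quality.
on_phone = classify_face((10, 10, 20, 20), [(0, 0, 100, 100)], quality_says_real=True)

# A face outside every detected device falls through to the quality check.
in_person = classify_face((10, 10, 20, 20), [(50, 50, 100, 100)], quality_says_real=True)
```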

 

Demo images:

The following demo shows that our system can predict age and beauty with decent

 Figure 4. Demo images for age and beauty estimation.

 

Figure 5. Demo image for face anti-spoofing. (Red: Fake. Blue: Real.)

Summary:

The estimation of face attributes such as age and beauty depends on subjectively labeled data, but it is still useful for commercial analysis and recommendation. For face recognition, anti-spoofing is also very important for the security and robustness of the whole identity verification pipeline. We built these prototypes to show that there is much more potential in these topics and that AI can truly help people make better decisions with these estimations.

 

Reference:

[1] Yang, Tsun-Yi, et al. “FSA-Net: Learning Fine-Grained Structure Aggregation for Head Pose Estimation from a Single Image.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

[2] Yang, Tsun-Yi, et al. “SSR-Net: A Compact Soft Stagewise Regression Network for Age Estimation.” IJCAI. Vol. 5. No. 6. 2018.

[3] Redmon, Joseph, and Ali Farhadi. “Yolov3: An incremental improvement.” arXiv preprint arXiv:1804.02767 (2018).

Compete for the NT$20 Million Reward in the AI Chinese Comprehension Championship Challenge – 2020 “Formosa Grand Challenge – Talk to AI” –

2020 “Formosa Grand Challenge – Talk to AI”

Date: September 9, 2019

The 2020 “Formosa Grand Challenge – Talk to AI” officially kicks off today (September 9) to accept applications. The Ministry of Science and Technology invites experts to challenge the Band C Level of the Reading Comprehension Test for the “Test of Chinese as a Foreign Language,” or TOCFL for short, to compete for the first prize reward of NT$20 million.

Unlike the previous 1st “Talk to AI” competition, this challenge will be conducted in a single-choice mode, but will include voice reading and continuous dialogue tests to fully challenge AI’s ability to comprehend the Chinese language. The purpose is to allow AI to read the subject contents of different academic fields to test its comprehension and inference of the words and sentences, and further organize the entire dataset into application knowledge to achieve AI self-learning. The competition will be held in two stages, where the preliminary contest is scheduled for December 2019, and the final round in April 2020.

Dr. Wen Lian Hsu and his research team at the Academia Sinica Institute of Information Science are entrusted with preparing the test dataset and, in conjunction with the National Academy for Educational Research, co-establishing the dataset for training comprehension and dialogue. It is scheduled to be opened to the contestant teams in October 2019. The Speech-to-Text API, on the other hand, uses Ya Ting Verbatim, which is developed by AI Labs under the leadership of Ethan Tu. It is indeed a great honor for Formosa Grand Challenge to cooperate with AI Labs.

To allow AI to take root, this “Talk to AI” competition is scheduled to hold a FUN CUP team event on November 16 to encourage general and vocational high school students as well as college students to participate. The FUN CUP team will work with the Association for Computational Linguistics and Chinese Language Processing to enable young students to understand the application of AI man-machine dialogue and use existing commercial AI tools to train the machine to learn and fully utilize existing voice materials and resources with the goal of achieving small-scale scientific research results of a certain quality.

The Ministry of Science and Technology will continue to enhance the scale of the voice dataset during the competition, allowing the teams to carry out technology development and testing. The research team of Associate Professor Yuan-Fu Liao of National Taipei University of Technology is invited to do the speech data transcription, releasing about 600 hours of the AI voice dataset to the data market of the National Center for High-Performance Computing (NCHC DATA MARKET) for use by paying authorized users. The fee is NT$2,000 per 150 hours.

The Ministry of Science and Technology hopes to encourage innovators to use the potential of AI development, technology, and creativity to solve the challenges of voice applications through the “Formosa Grand Challenge” competition, and looks forward to any possible progress and thinking. It also expects that the event will attract more enterprises as well as academic and research institutions to get involved and work together to promote the upswing of Taiwan’s AI voice recognition technology and assist Taiwanese enterprises in digital transformation.

 



AI Labs released an annotation system: Long live the medical diagnosis experience.

The Dilemma of Taiwan’s Medical Laboratory Sciences

Thanks to the Bureau of National Health Insurance in Taiwan, abundant medical data are appropriately recorded. This is surely good news for us, an AI-based company. However, most of the medical data have not been labeled yet. What’s worse, Taiwan currently faces a severe shortage of medical talent. The number of experienced masters of medical laboratory sciences is getting smaller and smaller. Take malaria diagnosis for example. Malaria parasites belong to the genus Plasmodium (phylum Apicomplexa). In humans, malaria is caused by P. falciparum, P. malariae, P. ovale, P. vivax and P. knowlesi. It is undoubtedly arduous work for a human to detect affected cells and classify them into these five classes. Unfortunately, there is only one master in this field in Taiwan, now nearing retirement, who can confirm the correctness of a diagnosis. We must take remedial action right away, yet it costs too much time and money to train a person to be a malaria master. Only through the power of technology can we preserve this valuable diagnostic experience.

Now, we have decided to solve the problem by transferring human experience to machines, and the first step is to annotate the medical data. Since the only master cannot handle the overwhelming amount of data by himself, he needs helpers to do the first round of detection, after which the master does the final confirmation. We therefore need a system that allows multiple users to cooperate. It should be able to record the file path of the labeled data, the annotators, and the labeling time. We searched assorted off-the-shelf annotation systems; unfortunately, none of them met our specification. So we decided to roll up our sleeves and revise the most relevant one for our purpose.

An Improved Annotation System

Starting from an open-source annotation system [1], AI Labs revised it and released a new, easy-to-use annotation system to help those who want to create their own valuable labeled data. With our labeling system, you will know who the previous annotator was and can systematically revise others’ work.

This system is designed for object labeling. By creating a rectangular box on your target object, you will be able to assign the label name and label coordinates to the chosen target. See the example below.

Also, you will obtain the label categories and the label coordinates in an XML file in PASCAL VOC format. You can then use the output XML file as the input of your machine learning programs.
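As a minimal sketch of how such an output file might be consumed, the snippet below reads the label names and box coordinates with Python's standard library. The element names (`object`, `name`, `bndbox`, `xmin`, `ymin`, `xmax`, `ymax`) are the standard PASCAL VOC ones; the sample file content and the function name `read_voc_boxes` are made up for illustration, and any extra fields our system records (such as the annotator) are not shown.

```python
import xml.etree.ElementTree as ET

# Made-up sample annotation in PASCAL VOC layout.
SAMPLE_XML = """<annotation>
  <filename>smear_001.jpg</filename>
  <object>
    <name>P. falciparum</name>
    <bndbox><xmin>48</xmin><ymin>60</ymin><xmax>120</xmax><ymax>132</ymax></bndbox>
  </object>
  <object>
    <name>P. vivax</name>
    <bndbox><xmin>200</xmin><ymin>40</ymin><xmax>260</xmax><ymax>110</ymax></bndbox>
  </object>
</annotation>"""


def read_voc_boxes(xml_text):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) from VOC XML text."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        coords = tuple(
            int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.findtext("name"), coords))
    return boxes


boxes = read_voc_boxes(SAMPLE_XML)
```

Each entry pairs a label with its rectangle, which is the form most object detection training code expects.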

How does it work?

Three steps: Load, Draw and Save.

In Load: Feed an image into the system. It is fine if you do not have an XML file yet, since it is your first time operating the system.

In Draw: Create as many labels in an image as you want. Don’t forget that you can zoom in/out if the image is not clear enough.

In Save: Click the save button, and everything is done. The system will output an XML file including all the label data for the image.

What’s Next?

With sufficient annotated data, we can then train our machines on the labels annotated by the medical master, so that the machines can make diagnoses as brilliantly as the last master. We will keep working on it!

Citation

[1] Tzutalin. LabelImg. Git code (2015). https://github.com/tzutalin/labelImg

Music Analysis for Automatic Music Composition: Source Separation and Music Transcription

AI needs a lot of music examples to learn to compose music. The quality and diversity of the music examples can be the key to the success of the AI. Typically, researchers begin with training an AI music composition model by learning from symbolic music data such as MIDI files. This is how we developed the AI Jazz bass player introduced in our last blog post.

 

However, relying on the MIDI files as the major data source has a few clear limitations. First, not all the music out there has MIDI files that are publicly and widely available. This is especially the case for certain music genres such as Jazz, which features improvisation. Second, MIDI files are notoriously noisy [1]. A great effort is needed in preprocessing and cleansing the MIDI data before they can be used to train a machine learning model. Such a process may come with assumptions, simplifications, and imprecisions that limit the performance of the resulting AI model. Third, not all MIDI files contain performance-level attributes of music such as the velocities (dynamics) and microtimings (timing offsets) of the musical notes. The music generated may sound mechanical and not expressive enough [2].

To free Yating from such limitations, we have a team of data engineers, machine learning engineers, and musicians working on tasks that can in general be referred to as music analysis, or music information retrieval. Our goal is to enable Yating to learn to compose and perform music directly from audio recordings of music performances, an approach the Google Magenta team is also exploring [3]. This new approach, when successful, can unlock much of the potential of AI music composition models.

While the Google Magenta team dealt exclusively with piano-only music in [3], we are interested in building a data processing pipeline that allows us to learn from music played by any instrument.

In doing so, we are building an “AI Listener” that can (one day) comprehend the content of arbitrary music signals as well as well-trained human listeners do. The first two music analysis tasks we are focusing on now are “source separation” and “music transcription,” because the output of such models, after some other processing, can be used to train AI music composition models.

A core task of source separation [4] is to isolate out the sounds of specific instruments from an audio mixture. For example, a Jazz piano trio usually consists of the sounds played by a pianist, a bass player and a drummer. While human ears can focus on the sounds from one of the instruments while listening to the music, it may be hard for a machine to do so, as the sounds from these instruments (musical sources) have been mixed together in the audio signal.

The task of music transcription [5], on the other hand, can be described as converting music from the audio domain (audio signal) to the symbolic domain (e.g., MIDI file). For single-instrument music, we may want to transcribe the pitch, onset/offset timings, and even the velocity of all the musical notes. For multi-instrument music, the task is even more challenging, as we need to decide which note is played by which instrument.

For now, we built a source separation model to isolate out the piano track from an audio mixture, and a music transcription model to convert the (separated) piano track from the audio domain to the symbolic domain. We focus on the piano now because there are more public-domain datasets for piano transcriptions (such as the MAESTRO dataset [3] and the MAPS database [6]). But, thanks to the source separation model, we can learn from not only piano-only music but also multi-instrument music that contains piano.

In other words, the two models are cascaded to transcribe the piano part of an audio mixture. The transcription result can then be used to train an AI composition model. This process is illustrated below.

 

We present below four examples showing the performance of our models in isolating out and transcribing the piano. In each set of audio files, we show the original audio mixture first, then the separated piano track, and finally the transcribed result. The transcribed result is rendered using an electric piano sound font by a VSTi. We also show the pianoroll demonstrating the transcription result for each song. 

 

 

 

 

 

 

 

(Please note that, because our music transcription model does not predict the usage of sustain pedal thus far (this is a function we will add soon), we occasionally apply sustain pedal by hand to the transcribed result in the above examples.)

In general, the separation result is fairly good. The separation model removes the sounds from other instruments, and the remaining piano sounds do not suffer from distortion or other artefacts. This is quite remarkable: as far as we know, this may be the first demonstration of a successful piano source separation model in the world. People working on musical source separation usually aim to isolate the singing voice, drums, and bass (see the SiSEC challenge [7] for example), not the piano. We are currently extending the model to deal with other instruments, such as the guitar.

The transcription result is not perfect, yet it already seems good enough to be used for training music composition models.

While it’s still our ongoing work to leverage such transcription result for training AI music composition models, our in-house musicians already find ways to play with the separated piano tracks. Check the video below to see how they used the output of our separation model for making hip-hop style music.

 

References:

[1] C Raffel and DPW Ellis, “Extracting Ground-Truth Information from MIDI Files: A MIDIfesto,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2016.

[2] B Wang and YH Yang, “PerformanceNet: Score-to-audio Music Generation with Multi-band Convolutional Residual Network,” in Proc. AAAI Conference on Artificial Intelligence (AAAI), 2019.

[3] C Hawthorne, A Stasyuk, A Roberts, I Simon, CZ Anna Huang, S Dieleman, E Elsen, J Engel, and D Eck, “Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset,” in Proc. International Conference on Learning Representations (ICLR), 2019.

[4] JY Liu and YH Yang, “Dilated Convolution with Dilated GRU for Music Source Separation,” in Proc. International Joint Conference on Artificial Intelligence (IJCAI), 2019.

[5] C Hawthorne, E Elsen, J Song, A Roberts, I Simon, C Raffel, J Engel, S Oore, and D Eck, “Onsets and Frames: Dual-Objective Piano Transcription,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2018.

[6] V Emiya, R Badeau, and B David, “Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle,” IEEE Transactions on Audio, Speech and Language Processing, 2010.

[7] https://sigsep.github.io/datasets/musdb.html

 

 

By: Jen-Yu Liu, Chung-Yang Wang, Tsu-Kuang Hsieh and Yi-Hsuan Yang (Taiwan AI Labs Yating/Music Team)

AI Jazz Bass Player: Bass Accompaniment in A Jazz Piano Trio Setting

November 2018 marks the debut of Yating, an AI Pianist that learns to compose and perform keyboard-style music by means of the AI technology we are developing here at the Taiwan AI Labs.  Instead of playing pre-existing music, Yating listens to you and composes original piano music on-the-fly in response to the affective cues found in your voice input.  This is done with a combination of our technology in automatic speech recognition, affective computing, human-computer interaction, and automatic music composition.  In November 2018 Yating gave a public concert as her debut at the Taiwan Social Innovation Lab (社會創新實驗中心) in Taipei (see the trailer here; in Mandarin).  Now, you can download the App we developed (both iOS and Android versions are available) to listen to the piano music Yating creates for you any time through your smartphone.

Yating has kept growing her skillset since then.  One of the most important skills we want Yating to have is the ability to create original multi-track music, i.e., a music piece that is composed of multiple instruments.  Unlike the previous case of composing keyboard-only, single-instrument music, composing multi-track music demands consideration of the relationship among the multiple tracks/instruments that are involved in the piece of music.  Each track must sound “right” on its own, and collectively the tracks interact with one another closely and unfold over time interdependently.

We begin with the so-called Jazz Piano Trio setting, which is composed of a pianist playing the melody and chord, a double bass player playing the bass, and a drummer that plays the drums.  We find this setting interesting, because it involves a reasonable number of tracks with different roles, and because it’s a direct extension of the previous piano-only setting.  Our goal here is therefore to learn to compose original music with these three tracks in the style of Jazz.

In this blog post, we share how Yating learns to play the part of the bass player.  We may talk about the parts of the pianist and drummer in the near future.

Specifically, we consider the case of bass accompaniment over some given chords and rhythm.  This can be understood as the case where the pianist plays only the chords (but not the melody), the drummer plays the rhythm, and the bass player has to compose the bassline over the provided harmonic and rhythmic progression.  In this case, we only have to compose music for one specific track, but while composing the track we need to take into account the interdependence among all three tracks.

We use an in-house collection of 1,500+ eight-bar phrases segmented from MIDI files of Jazz piano trio music to train a deep recurrent neural network to do this. For MIDI I/O, we use pretty_midi (link).

You can first listen to a few examples of the bass this model composes given human-composed chords and rhythm.

 

You can find below a video demonstrating a music piece our in-house musicians created in collaboration with this AI bass player.

The architecture of our bass composition neural network is shown in the figure below.  It can be considered as a many-to-many recurrent network.  The input to the model comprises a chord progression and a drum pattern, both eight bars long, and the intended tempo of the music.  The target output of the model is a bass solo that is also eight bars long, comprising the pitch and velocity (which is related to the loudness) associated with each note.

The drum pattern and chord sequence are processed by separate stacks of two recurrent layers of bidirectional long short-term memory (BiLSTM) units.

The input drum pattern is represented by a sequence of eight 16-dimensional vectors, one vector for each bar.  Each element in the vector represents the activity of drums for each 16th beat of the bar, calculated mainly by counting the number of active drums for that 16th beat over the following nine drums: kick drum, snare drum, closed/open hi-hat, low/mid/high toms, crash cymbal, and ride cymbal.  We weigh the kick drum a bit more to differentiate it from the other drums.  The output of the last BiLSTM layer of the drum branch is another sequence of eight K1-dimensional vectors, again one vector for each bar.  Here, K1 denotes the number of hidden units of the last BiLSTM layer of the drum branch.
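The per-bar drum vector described above can be sketched as follows. This is a minimal illustration, not the team's code: the drum names, the `(drum_name, step)` input format, and the kick weight of 2.0 are assumptions made for the example (the post only says the kick is weighted "a bit more").

```python
# Nine drum classes named in the post; the string labels are hypothetical.
DRUMS = (
    "kick", "snare", "closed_hihat", "open_hihat",
    "low_tom", "mid_tom", "high_tom", "crash", "ride",
)
KICK_WEIGHT = 2.0   # assumed value; the actual weight is not given in the post
STEPS_PER_BAR = 16  # one slot per 16th position of the bar


def drum_activity(bar_onsets, kick_weight=KICK_WEIGHT):
    """Return a 16-dimensional activity vector for one bar.

    bar_onsets is an iterable of (drum_name, step) pairs, step in 0..15.
    Each element counts the distinct drums active at that step, with the
    kick drum counted as kick_weight instead of 1.
    """
    active = [set() for _ in range(STEPS_PER_BAR)]
    for drum, step in bar_onsets:
        if drum not in DRUMS:
            raise ValueError(f"unknown drum: {drum}")
        if not 0 <= step < STEPS_PER_BAR:
            raise ValueError(f"step out of range: {step}")
        active[step].add(drum)  # a set ignores duplicate hits of one drum

    return [
        sum(kick_weight if drum == "kick" else 1.0 for drum in drums)
        for drums in active
    ]
```

An eight-bar drum pattern would then be a list of eight such vectors, one per bar, which is the shape the drum branch expects.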

The input chord progression, on the other hand, is represented by a sequence of thirty-two 24-dimensional vectors, one vector for each beat.  We use a higher temporal resolution here to reflect the fact that the chord may change every beat (while the rhythm may be more often perceived at the bar level).  Each vector is composed of two parts, a 12-dimensional multi-hot “pitch class profile” (PCP) vector representing the activity of the twelve pitch classes (C, C#, …, B, in a chromatic scale) in that beat, and another 12-dimensional one-hot vector marking the pitch class of the bass note (not the root note) in that chord.  The output of the last BiLSTM layer of the chord branch is a sequence of thirty-two K2-dimensional vectors, again one vector for each beat.  Here, K2 denotes the number of hidden units of the last BiLSTM layer of the chord branch.
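As an illustration of the 24-dimensional chord vector, here is a small sketch that builds the multi-hot pitch class profile and the one-hot bass pitch class from MIDI note numbers. The function name and the choice of MIDI note numbers as input are assumptions for the example; only the 12 + 12 layout comes from the description above.

```python
def chord_vector(chord_pitches, bass_pitch):
    """Return a 24-dim vector: 12-dim multi-hot PCP + 12-dim one-hot bass.

    chord_pitches: MIDI note numbers sounding in this beat.
    bass_pitch:    MIDI note number of the lowest (bass) note of the chord,
                   which is not necessarily the root.
    Pitch class 0 is C, 1 is C#, ..., 11 is B.
    """
    pcp = [0] * 12
    for pitch in chord_pitches:
        pcp[pitch % 12] = 1  # octave is discarded, only the pitch class is kept

    bass = [0] * 12
    bass[bass_pitch % 12] = 1

    return pcp + bass
```

For example, a C major chord in first inversion (E in the bass) could be passed as `chord_vector([52, 60, 67], 52)`: the PCP part marks C, E and G, while the bass part marks E rather than the root C.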

The tempo of the 8-bar segment, which is available from the MIDI file, is represented as a 35-dimensional one-hot vector after quantizing (non-uniformly) the tempo into 35 bins (the number of bins was chosen quite arbitrarily).  The vector is used as the input to a fully-connected layer to get a K3-dimensional vector representing the tempo information for the whole segment.
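A non-uniform quantization into 35 bins can be sketched with a sorted list of 34 bin edges. The edges below (5 BPM steps from 60 to 195, then 10 BPM steps from 200 to 250) are purely hypothetical, since the post does not give the actual bin boundaries.

```python
from bisect import bisect_right

# Hypothetical bin edges in BPM: 28 edges spaced by 5, then 6 edges spaced by 10.
# 34 edges split the tempo axis into 35 bins.
TEMPO_EDGES = list(range(60, 200, 5)) + list(range(200, 260, 10))
NUM_TEMPO_BINS = len(TEMPO_EDGES) + 1  # 35


def tempo_one_hot(tempo_bpm, edges=TEMPO_EDGES):
    """Return a one-hot vector marking the bin that tempo_bpm falls into.

    Bin 0 holds tempi below the first edge and the last bin holds tempi
    at or above the last edge.
    """
    index = bisect_right(edges, tempo_bpm)
    vector = [0] * (len(edges) + 1)
    vector[index] = 1
    return vector
```

The same pattern, with a different edge list, would apply to the non-uniform velocity quantization used for the bass output.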

Compared to drum and chords, we use an even finer temporal resolution for the bass generator (the second half of the bass composition model shown in the figure above): we aim to generate the bass for every 16th beat.  The input of the bass generator is therefore a sequence of 128 (K1+K2+K3)-dimensional vectors, one vector for each 16th beat (8 bars × 16 positions per bar).  Each vector is obtained by concatenating the output from the drum branch, chord branch, and tempo branch of the corresponding 16th beat.  The bass generator is again implemented by a stack of two BiLSTM layers.  From the output of the last BiLSTM layer, we aim to generate a 39-dimensional one-hot vector representing the pitch and a 14-dimensional one-hot vector representing the velocity used by the bass for that 16th beat.  Here, the pitch vector is 39-dimensional because we consider 37 pitches (MIDI numbers 28 to 64, which corresponds to the pitch range of the double bass) plus one rest token and one “repeat-the-note” token.  The velocity vector is 14-dimensional because we quantize (non-uniformly) the velocity value (which originally ranges from 0 to 127) into 14 bins.  Because the model has to predict both the pitch and velocity of the bass, it can be said that the model is doing multi-task learning.
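The two bookkeeping steps in this paragraph can be sketched in a few lines. The sketch rests on assumptions the post does not confirm: that bar-level and beat-level vectors are aligned to the 16th-beat grid by simple repetition, that token indices 0 to 36 map to MIDI pitches 28 to 64 with the rest and repeat tokens at indices 37 and 38, and that the “repeat-the-note” token extends the note sounding in the previous step.

```python
LOW_PITCH, HIGH_PITCH = 28, 64
NUM_PITCHES = HIGH_PITCH - LOW_PITCH + 1  # 37
REST = NUM_PITCHES                        # assumed index 37
REPEAT = NUM_PITCHES + 1                  # assumed index 38
VOCAB_SIZE = NUM_PITCHES + 2              # 39


def align_inputs(drum_out, chord_out, tempo_out):
    """Concatenate branch outputs on a 128-step grid (8 bars x 16 steps).

    drum_out:  8 vectors, one per bar   (each reused for 16 steps)
    chord_out: 32 vectors, one per beat (each reused for 4 steps)
    tempo_out: 1 vector for the whole segment (reused for every step)
    """
    if len(drum_out) != 8 or len(chord_out) != 32:
        raise ValueError("expected 8 bar vectors and 32 beat vectors")
    return [
        list(drum_out[t // 16]) + list(chord_out[t // 4]) + list(tempo_out)
        for t in range(128)
    ]


def decode_pitch_tokens(tokens):
    """Turn per-step pitch tokens into (midi_pitch, start_step, length) notes."""
    notes = []
    for step, token in enumerate(tokens):
        if token == REST:
            continue
        if token == REPEAT:
            # Extend the previous note only if it is still sounding.
            if notes and notes[-1][1] + notes[-1][2] == step:
                notes[-1][2] += 1
            continue
        notes.append([LOW_PITCH + token, step, 1])
    return [tuple(note) for note in notes]
```

With this layout, the token sequence `[0, 38, 38, 37, 12, 38]` decodes to a three-step note on MIDI pitch 28, one step of rest, and a two-step note on MIDI pitch 40.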

After training the model for tens of epochs, we found that it starts to generate reasonable basslines, but the pitch contour is sometimes too fragmented.  It might be possible to further improve the result by collecting more training data, but we decided instead to apply some simple postprocessing rules based on music knowledge.  We are in general happy with the current result: the bass fits with the drum pattern nicely and has pleasant grooving.

You can listen to more music we generated below.

This is just the beginning of Yating’s journey in learning to compose multi-track music.  The bass accompaniment model itself can be further improved, but for now we’d like to move on and have fun learning to compose the melody, chords, and drum in the setting of Jazz piano trio.

 

 

By: Yin-Cheng Yeh, Chung-Yang Wang, Yi-Pai Liang, and Yi-Hsuan Yang (Taiwan AI Labs Yating/Music Team)


ptt.ai, open source blockchain for AI Data Justice

[ Link to the Chinese-language report in 換日線 (Crossing) ]

The beginning of the “Data Justice” movement

In collaboration with online citizens and social science workers in Taiwan, Taiwan AILabs promotes “Data Justice” under the following principles:

  1. Prioritize privacy and integrity, with goodwill toward applications, before data collection
    • In addition to enforcing privacy protection acts, review tech giants for potential abuse of a monopoly position that forces users to give up their privacy, or that misuses user content and data for other purposes. In particular, organizations that have become market monopolies should be reviewed regularly by the local administration to determine whether data is being abused when users unwillingly give up their privacy.
  2. Users’ data and activities belong to users
    • The platform should remain neutral to avoid misuse of user data and creations.
  3. Public data that is collected should be open for public research
    • A government organization that holds data is responsible for its openness, while privacy and integrity are secured. Examples include health insurance data for public health research and smart city data for traffic research.
  4. Regulate mandatory data openness
    • For data that is critical to major public welfare and controlled by a private monopoly, we shall give the administration the power to mandate data openness.
    • An example is Taipower’s electric power usage data in Taiwan.

“Monopoly now is worse than oil monopoly”

In 1882, the American oil giant John D. Rockefeller founded the Standard Oil Trust and united with 40 oil-related companies to control prices. In 1890, the U.S. government sued the Standard Oil Trust to prevent unfair monopoly. Antitrust laws were formulated to ensure fair trade and fair competition and to prevent price manipulation. Governments of various countries followed the movement and established anti-monopoly laws. In 1984, the telecom giant AT&T was split into several companies under antitrust law. Microsoft was sued in 2001 for bundling Internet Explorer with its operating systems.

In 2003, the Network Neutrality principle mandated that ISPs (Internet Service Providers) treat all data on the Internet the same. The FCC (Federal Communications Commission) successfully stopped Comcast, AT&T, Verizon and other giants from slowing down or throttling traffic at the application or domain level. Apple FaceTime, Google YouTube and Netflix benefited from the principle. Ten years later, oil companies and ISPs are no longer among the top 10 most valuable companies in the world. Instead, the Internet companies protected by Network Neutrality a decade ago have become the new giants. In the US, the most valuable companies in the world dominate the market in many areas. In February 2018, Apple reached 50% of the smartphone market, Google held more than 60% of search traffic, and Facebook controlled nearly 70% of social traffic. Together, Facebook and Google controlled 73% of the online ad market. Amazon is on a path to grabbing 50% of online shopping revenue. In China, the situation is even worse. Alipay is owned by Alibaba and WeChat Pay is owned by Tencent. Together, the two companies account for 90% of China’s payment market.

When data became weapons, innovations and users become meatloaf

After a series of AI breakthroughs in the 2010s, big data became as important as crude oil. In the Internet era, users grant Internet companies permission to collect their personal data in exchange for the convenience of connecting with credible users and content. For example, a magazine publishes articles on Facebook because Facebook allows users to subscribe to its articles. At the same time, the publisher can manage its relationship with subscribers through the messenger system. The recommendation system helps rank users and the content they publish. All of these free services are sponsored by advertisements, which pay for the cost of Internet space and traffic. This model encouraged more users to join the platform, and the users and content accumulated on the platform attracted still more users to participate. In the 4G mobile era, mobile users are always online, which pushed data aggregation to a whole new level. After mergers and acquisitions among Internet companies, a few companies now stand out and dominate users’ daily data. New initiatives can no longer reach users easily by launching a new website or app. Internet giants, on the other hand, can easily issue a copycat of an innovation and leverage their traffic, funding and data resources to gain the territory. Startups have little choice but to be acquired or to burn out under unfair competition. There are fewer and fewer stories about innovation from garages, and more and more stories about tech giants copying startup ideas before they take shape. There is a well-quoted statement in China, for example: “Be acquired or die; a new start-up today will never bypass the giants.” The monopoly also limits users’ choices: if a user does not consent to the data collection policy, there is usually no alternative platform.

Net Neutrality repealed, giants eat the world

Nasim Aghdam’s anger at YouTube casts a nightmarish shadow over how it deals with creators and advertisers. She opened fire at the YouTube headquarters, injuring three people, and then killed herself. At the beginning of the Internet era, innovative content creators could be reasonably rewarded for their own creations. However, after the platforms became monopolies, content providers found that their content was ranked by opaque algorithms that pushed it farther and farther away from their loyal subscribers. Before subscribers can reach the content, poor advertising and fake news stand in the way. A content creator who wants to retain the original popularity also needs to pay for advertisement. Suddenly, reputable content providers are being charged to reach their own loyal subscribers. Even worse, their subscribers’ information and behavior are consumed by the platform’s machine learning algorithms to serve targeted ads. At the same time, the platform does not effectively screen advertisers, so low-quality fake news and fake ads are served, a problem known from scams and elections. After the Facebook scandal, users discovered that their own private data was being used through analysis tools to attack their minds. However, during the #deletefacebook movement, users found no alternative platform because of the tech giants’ monopoly: their friends and other users are all on the platform.

In December 2017, the FCC voted to repeal the Net Neutrality principle on the grounds that the US had failed to achieve Net Neutrality. ISPs are not the ones to blame. After a decade, the Internet companies that benefited from Net Neutrality are now the monopoly giants, and Net Neutrality could not be applied to their private ranking and censorship algorithms. Facebook, for example, offers mobile access to selected sites on its platform at a different data service charge, which was widely panned for violating net neutrality principles. It is still active in 63 other countries around the world. The situation is getting worse in the era of AI. Tech giants have leveraged their data power and stepped into the automotive, medical, home, manufacturing, retail, and financial sectors. Through acquisitions, the giants are rapidly accumulating new types of vertical data and forcing traditional industries to open up their data ownership. Traditional industries are facing an even larger and smarter technology monopoly than the ISPs or oil companies of past decades.

Taiwan experience may mitigate global data monopoly

Starting from the root cause, from the vertical point of view, “the user who contributed the data” was motivated by “trust” in their friends or in a reputable content provider. For convenience and better service, the user consents to the collection of their private data and grants the platform the right to analyze it further. “The user who contributed the content” consents to publishing their creation on the platform because the users are already on the platform. The platform now holds power over data and content that should originally belong to the users and publishers. For the sake of privacy, safety and convenience, the platform prevents other platforms or users from consuming the data. Over time, this results in a platform that is exclusive for users and content providers.

From the horizontal point of view, in order to reach users and obtain data and traffic, a startup signs an unfair agreement with the platform. In the end, a good innovation is usually swallowed by the platform, because the platform also owns the data and traffic the innovation depends on. The platform therefore grows larger and larger by either merging with or copying good innovations.

To break this vicious cycle and create a fair competitive environment for AI research, Taiwan AILabs presented at the Taipei Global Smart City Expo on March 27, 2018, and joined a panel at the Taiwan German 2018 Global Solution Workshop on March 28 with visiting experts and scholars on data policy making. Taiwan AILabs shared Taiwan’s unique experience with Data Justice. In the discussion, we identified opportunities that could potentially break the cycle.

The opportunities come from the following observations in Taiwan. Currently, the world’s mainstream online social network platforms are provided by private companies and optimized for advertising revenue. Taiwan has a mature network of users, open source workers and open data campaigns. “Internet users” in Taiwan are closer to “online citizens”. The Taiwanese Internet platform PTT (ptt.cc), for example, is not run for profit. Its users elect the managers directly. Over the years, this culture has not cooled down, and PTT is still dominant. Because of its equality of voice, it is difficult to manipulate through ad spending. Fake news and fraud can be easily detected through its online evidence. In Taiwan, PTT is more of a major platform for public opinion than Facebook is. Through the collaboration between PTT and Taiwan AILabs, it now has an AI news writer that reports news based on its users’ activities. The AI-based news writer can minimize editorial bias.

g0v.tw is another non-profit group in Taiwan, focusing on citizen science and technology. It promotes the transparency and openness of government organizations through hackathons. It collaborates with the government, academia, non-governmental organizations, and international organizations on opening up public data through open source collaboration in various fields.

Introducing the ptt.ai project: using blockchain for “Data Justice” in the AI era

PTT is Taiwan’s most impactful online platform and has been running for 23 years. It has its own digital currency (P coin), instant messaging, e-mail, users, elections and administrators elected by users. However, the services hosting the online platform are still relatively centralized.

In the past, users chose a trusted platform for trusted information. For convenience and Internet space, users and content providers consented to unfair data collection. Blockchain technology offers a new direction for avoiding centralized data storage. A blockchain can certify users and content through its chain of trust. The credit system is not built on top of a single owner, and the content storage system is also built on top of the chain. This avoids control by a single organization that becomes a superpower.

Ptt.ai is a research project that starts by learning from PTT’s data economy, combines it with the latest blockchain encryption technology, and implements it with a decentralized approach.

The mainstream social network platforms in China and the United States created a new superpower of data from the content created by users and their friends. They will continue to collect more information by horizontally merging with industries that hold unequal data power. The launch of ptt.ai is a way of thinking about data ownership in a different direction. We hope to study how to upgrade the PTT system for the era of AI, and to use this platform as the basis for enabling more industries to cooperate with data platforms. It gives control of data back to users and mitigates the ongoing data monopoly. Ptt.ai will also collaborate with leading players in the automotive, medical, smart home, manufacturing, retail, and financial sectors who are interested in creating an open community platform.

Currently, the technology experiments are run on an independent platform. They do not yet involve the operation or migration of the current PTT. Please follow the latest news about ptt.ai at http://ptt.ai .

 

[2018/10/24 Updates]:

The open source project is on github now: https://github.com/ailabstw/go-pttai

[2019/4/2 Updates]:

More open source projects are on github now:

 


Humanity with Privacy and Integrity is Taiwan AI Mindset

The 2018 Smart City Summit & Expo (SCSE), along with three sub-expos, took place at the Taipei Nangang Exhibition Center on March 27th with 210 exhibitors from around the world, exhibiting a diversity of innovative applications and solutions for building smart cities. Taiwan is known for its friendly and healthy business environment, ranked 11th by the World Bank. With 40+ years in ICT manufacturing and top-level embedded systems, companies form a vigorous ecosystem in Taiwan. With an openness toward innovation, 17 out of 22 Taiwan cities have made it to the top in the Intelligent Community Forum (ICF).

Ethan Tu, Taiwan AILabs Founder, gave a talk titled “AI in Smart Society for City Governance” and laid out Taiwan’s position on AI: smart cities are for “humanity with privacy and integrity” in addition to “safety and convenience”. He said, “AI in Taiwan is for humanity. Privacy and integrity will also be protected.” The maturity of crowd participation, transparency and an open data mindset are the key assets that drive Taiwan’s smart cities to deliver humanity with privacy and integrity. Taiwan AILabs took http://news.ptt.cc, an open-source news site edited collaboratively through social participation and AI, as an example. City governments now consume the news to detect social events happening in Taiwan in real time, relying on the AI news’ robustness and reliability at scale. AILabs collaborated with Tainan city on an AI drone project to simulate the work of “Beyond Beauty” director Chi Po-lin, who died in a helicopter crash. AILabs also established the “Taipei Traffic Density Network (TTDN)” for Taipei city, supporting real-time traffic detection and prediction with citizens’ privacy secured: no person or car can be identified without necessity.

The Global Solutions (GS) Taipei Workshop 2018, themed “Shaping the Future of an Inclusive Digital Society”, took place at the Ambassador Hotel in Taipei on March 28, 2018. It was co-organized by the Chung-Hua Institution for Economic Research (CIER) and the Kiel Institute for the World Economy. The panel session “Using Big Data to Support Economic and Societal Development” was hosted by Dennis Görlich, Head of the Global Challenges Center at the Kiel Institute for the World Economy. Chien-Chih Liu, Founder of the Asia IoT Alliance (AIOTA); Thomas Losse-Müller, Senior Fellow at the Hertie School of Governance; and Reuben Ng, Assistant Professor at the Lee Kuan Yew School of Public Policy, National University of Singapore, all participated in the discussion. Big data has been identified as the oil for AI and economic growth. Ethan shared his vision in the panel: “We don’t have to sacrifice for safety or convenience. On the other hand, Facebook movement is a good example that the tech giants who overlook privacy and integrity will be dumped.”

Ethan explained three key principles from Taiwanese society on big data collection. These principles exist thanks to, and are contributed by, the mature open Internet communities and movements in Taiwan. AILabs will promote them as fundamental guidance for data collection on medical records, government records, open communities and so on.

1. Data produced by users belongs to users. Policy makers shall ensure that no single authority, such as a social media platform, becomes so dominant that it can force users to give up data ownership.

2. Data collected by public agencies belongs to the public. Policy makers shall ensure that public agencies provide a roadmap for opening the data they collect to the general public for research. g0v.tw, for example, is an NPO for the open data movement.

3. “Net Neutrality” applies not only to ISPs but also to social media and content hosting services. Ptt.cc, for example, persists in equality of voice without ads. Over time, the equality of voice has overcome fake news by letting evidence stand out.

“Humanity is the direction for AILabs. Privacy and Integrity are what we insist.” said Ethan.

Smart City workshop with Amsterdam Innovation Exchange Lab from the Netherlands

SITEC from Malaysia visiting AILabs.tw