AI music composition passed Turing test

Music composition by computers has long been of great research interest. Many techniques, such as rules, grammars, probabilistic graphical models, neural networks, and evolutionary methods, have been applied to automatic music generation. In this article we describe our approach and the corresponding results.

AI music recognition test

Before describing our method, let's test whether you can distinguish AI music from human music. Five AI tunes and five human tunes were gathered and shuffled, and you are encouraged to pick the five you consider machine-made. The true composers will be revealed later in this article.

Breaking music into components

To compose a tune with a computer, we break the tune into several components and generate each component separately, though not independently of the others. A musical work, e.g., a classical piece or a modern pop song, usually consists of several voices played by several instruments. In some works we can easily recognize one voice as the main melody and the others as accompaniment. In this article, we focus on generating monophonic main melodies.

A monophonic melody is a sequence of notes, each with a pitch and a duration. Collecting the pitches of all the notes gives what is called the voice leading, and collecting the durations yields the rhythm. There is usually another musical element underlying the main melody, the chord progression, which controls the primary transitions of mood. One can think of the chord progression as the supporting branches and the melody as the blooming flowers.

Techniques for musical components

Above we introduced three musical components: chord progression, rhythm, and voice leading. Our composition method generates chord progressions and voice leading with probabilistic graphical models, and rhythms with rules.

The procedure for generating a song is as follows. The time configuration, such as how long the song is and how many chords the chord progression contains, is decided by a human. The chord progression and the rhythm are then generated independently. Finally, the voice leading is generated to fit the chord progression and the rhythm, completing the composition.
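The models themselves are described only at a high level here, so the following is a minimal sketch rather than our actual system: a first-order Markov chain (the simplest probabilistic graphical model) for the chord progression, and a simple rule for the rhythm. The chord names and transition probabilities below are illustrative assumptions.

```python
import random

# Hypothetical transition table over common chords in C major;
# the real model's chords and probabilities are not published.
CHORD_TRANSITIONS = {
    "C":  [("F", 0.4), ("G", 0.4), ("Am", 0.2)],
    "F":  [("G", 0.5), ("C", 0.3), ("Am", 0.2)],
    "G":  [("C", 0.7), ("Am", 0.3)],
    "Am": [("F", 0.6), ("G", 0.4)],
}

def generate_progression(length, start="C"):
    """First-order Markov chain: the next chord depends only on the current one."""
    progression = [start]
    while len(progression) < length:
        choices, weights = zip(*CHORD_TRANSITIONS[progression[-1]])
        progression.append(random.choices(choices, weights=weights)[0])
    return progression

def generate_rhythm(beats_per_bar=4):
    """Rule-based rhythm: split each beat into either one note or two."""
    rhythm = []
    for _ in range(beats_per_bar):
        rhythm.extend([0.5, 0.5] if random.random() < 0.5 else [1.0])
    return rhythm
```

The voice leading would then be sampled conditioned on both outputs, which is where the heavier probabilistic machinery comes in.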

The answers to the AI music recognition test

Now we come back to the AI music recognition test. In the track list earlier in this article, A, D, H, I, and J were composed by computers with the procedure described above. The others are extracted from Johann Sebastian Bach's Well-Tempered Clavier, Book 1, as listed below.

B: Prelude No.2, bar 18.

C: Prelude No.10, bar 33.

E: Prelude No.2, bar 25.

F: Prelude No.5, bar 25.

G: Prelude No.2, bar 5.

Statistics of the AI music recognition test

Did you guess all the composers right? Let's see how other people performed. We held this test on Taiwan's PTT Bulletin Board System with 85 participants. The resulting statistics are gathered below.

correct guesses (out of 5)    0    1    2    3    4    5    total
# testees                     6    9   37   24    6    3       85

tune   composer   # testees judging it right   % testees judging it right
A      AI         51                           60%
B      Bach       48                           56%
C      Bach       24                           28%
D      AI         43                           51%
E      Bach       41                           48%
F      Bach       39                           46%
G      Bach       42                           49%
H      AI         44                           52%
I      AI         19                           22%
J      AI         37                           44%

Average per-tune accuracy: 46%

Most people made 2 to 3 correct guesses out of 5, an accuracy similar to random selection, and even the test holder mixes them up when not paying attention. So don't feel too bad if you were fooled.
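A quick sanity check of the numbers above: averaging the per-tune hit counts over the 85 participants gives the 46% figure, close to the 50% expected from purely random selection.

```python
# Per-tune counts of testees judging the composer correctly, from the table above.
correct_counts = {
    "A": 51, "B": 48, "C": 24, "D": 43, "E": 41,
    "F": 39, "G": 42, "H": 44, "I": 19, "J": 37,
}
n_testees = 85

average_accuracy = sum(correct_counts.values()) / (len(correct_counts) * n_testees)
print(round(average_accuracy, 2))  # 0.46

# Under random selection of 5 tunes out of 10, each tune is picked with
# probability 0.5, so the expected per-tune accuracy is exactly 0.5.
```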


Featured photo by Brandon Giesbrecht / CC BY 2.0

Doppelgänger app – Can someone unlock your iPhone?

Could your doppelgänger trick your iPhone's facial recognition feature into believing that you are the same person? The answer might lie within our newly built facial recognition software, the "Doppelgänger app".

One of social media's hottest topics is "How can two celebrities, without any blood relation, look identical?" This discussion went viral on PTT, one of Taiwan's largest bulletin board systems (BBS), right after Apple released the "Face ID" feature with the iPhone X in November 2017. Many people were wondering: can Elva Hsiao (蕭亞軒) unlock Landy Wen (溫嵐)'s iPhone?




Meet JARVIS – The Engine Behind AILabs

In Taiwan AI Labs, we are constantly teaching computers to see the world, hear the world, and feel the world so that computers can make sense of them and interact with people in exciting new ways. The process requires moving a large amount of data through various training and evaluation stages, wherein each stage consumes a substantial amount of resources to compute. In other words, the computations we perform are both CPU/GPU bound and I/O bound.

This imposes a tremendous challenge in engineering such a computing environment, as conventional systems are either CPU bound or I/O bound, but rarely both.

We recognized this need and crafted our own computing environment from day one. We call it Jarvis internally, named after the system that runs everything for Iron Man. It primarily comprises a frontdoor endpoint that accepts media and control streams from the outside world, a cluster master that manages bare metal resources within the cluster, a set of streaming and routing endpoints that are capable of muxing and demuxing media streams for each computing stage, and a storage system to store and feed data to cluster members.

The core system is written in C++ with a Python adapter layer to integrate with various machine learning libraries.



The design of Jarvis emphasizes real-time processing capability. The core of Jarvis lets data streams flow between computing processors with minimal latency, and each processing stage is engineered to achieve a required throughput per second. We break a long, complex procedure down into smaller sub-tasks and use Jarvis to form a computing pipeline that achieves the target throughput. We also use muxing and demuxing techniques to process portions of the data stream in parallel, further increasing throughput without incurring too much latency. Once the computational tasks are defined, the blueprint is handed over to the cluster master, which allocates the underlying hardware resources and dispatches tasks to run on them. The allocation algorithm has to take special care of GPUs, as they are scarce resources that cannot be virtualized at the moment.
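Jarvis itself is a C++ system, but the demux/process/mux pattern can be illustrated with a small Python sketch: split a stream into shards, process the shards in parallel, then merge the results back into the original order. This is a toy under our own assumptions, not Jarvis code.

```python
from concurrent.futures import ThreadPoolExecutor

def demux(stream, n_workers):
    """Split a stream into per-worker sub-streams (round-robin),
    tagging each item with its original position."""
    shards = [[] for _ in range(n_workers)]
    for i, item in enumerate(stream):
        shards[i % n_workers].append((i, item))
    return shards

def process(shard, stage):
    # Each worker runs one computing stage on its shard.
    return [(i, stage(x)) for i, x in shard]

def mux(results):
    """Reassemble processed items back into original stream order."""
    merged = [item for shard in results for item in shard]
    return [x for _, x in sorted(merged)]

def run_pipeline(stream, stage, n_workers=4):
    shards = demux(stream, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(lambda s: process(s, stage), shards)
        return mux(list(results))
```

The position tags are what let the mux step restore ordering no matter how the parallel stages interleave, which is the property that keeps latency bounded while throughput scales with workers.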

Altogether, Jarvis is a powerful yet agile platform for machine learning tasks. It handles huge amounts of work with minimal overhead. Moreover, Jarvis can be scaled horizontally with little effort by simply adding new machines to the cluster. It suits our needs well. We have re-engineered Jarvis several times in the past few months, and will continue to evolve it. Jarvis is our engine for moving fast in this fast-changing AI field.


Featured image by Nathan Rupert / CC BY

Face Recognition – The essential part of “Face ID”

Upon seeing a person, what enters our eyes is the person's face. The human face plays an important role in our daily lives when we interact and communicate with others. Unlike other biometrics such as fingerprints, identifying people by their faces can be a non-contact process. We can easily acquire face images of a person from a distance and recognize the person without interacting with them directly. As a result, it is intuitive to use the human face as the key to building a Face Recognition system.



For a long time, Face Recognition was a popular research area only within computer vision. With the rapid development of deep learning techniques in recent years, however, Face Recognition has become a mainstream AI topic, and more and more people are interested in the field. Many companies, such as Google, Microsoft, and Amazon, have developed their own Face Recognition tools and applications. In late 2017, Apple also introduced the iPhone X with Face ID, a Face Recognition system aimed at replacing the fingerprint-scanning Touch ID for unlocking the phone.


What Can Face Recognition Be Used For?

  • automated border system for arrival and departure in the airport
  • access control system for a company
  • criminal surveillance system for government
  • transaction certification for consumer
  • unlocking system for phone or computer


How Does Face Recognition Work?

A Face Recognition system can be divided into three parts:

  • Face Detection: tell where the face is in the image
  • Face Representation: encode the facial features of a face image
  • Face Classification: determine which person it is

Face Detection

Face Detection locates the face in an image and finds its size. It is essentially an object-class detection problem where the class is the human face. In classical computer vision, a set of features is first extracted from the image, and classifiers or localizers are run in a sliding window over the whole image to find potential bounding boxes, which is time-consuming and complex. With deep learning, object detection can be accomplished by a single neural network, going from image pixels straight to bounding box coordinates and class probabilities, with the benefits of end-to-end training and real-time prediction. We use YOLO, an open-source real-time object detection system, for Face Detection in our Face Recognition pipeline.
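Whatever detector is used, one standard post-processing step is non-maximum suppression over the candidate bounding boxes it emits. As an illustrative sketch (not code from our pipeline):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < threshold]
    return keep
```

For example, two heavily overlapping detections of the same face collapse to the single higher-scoring box, while a face elsewhere in the frame survives untouched.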


Face Representation

To compare two faces, computing the distance between two face images pixel by pixel is impractical because of the computing time and resources required. What we need instead is to extract facial features that represent the face image.

"The distance between your eyes and ears", "the size of your nose and mouth", and so on.

Such facial features give us an easy measurement for deciding whether two unknown faces represent the same person. Eigenfaces and genetic algorithms were used in the old days to help discover these features. With modern deep learning techniques, a deep neural network projects each face image onto a 128-dimensional unit hypersphere and generates a feature vector for each image.

For transforming face images into face representations, OpenFace and DLIB are two commonly used models for generating feature vectors. We ran experiments with both and found that the DLIB model's face representations are more consistent across frames for the same person, and it indeed outperformed the OpenFace model in our accuracy tests. As a result, we adopted DLIB as our face representation model.
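Once faces live on a unit hypersphere, comparing two faces reduces to a distance check between their embeddings. A minimal sketch follows; the 0.6 cutoff is the threshold commonly cited for DLIB's 128-dimensional face embeddings, but treat both it and the tiny 2-d vectors as illustrative assumptions.

```python
import math

def normalize(v):
    """Project a raw embedding onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def same_person(emb_a, emb_b, threshold=0.6):
    """Two embeddings closer than the threshold are judged the same person.
    The 0.6 value is an assumption borrowed from common DLIB usage."""
    return euclidean(normalize(emb_a), normalize(emb_b)) < threshold
```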


Each vertical slice represents a face representation of a specific person from one image frame. The x-axis is the timestamp of each video frame. The results show that the DLIB model does a better job of producing consistent image-to-representation transformations for face images of the same person across frames.


Face Classification

After gathering the face representations of each person into a face database, a classifier can be trained to recognize each person. To stabilize the final classification results, a weighted moving average is introduced into our system: we take the classification results of previous frames into consideration when determining the current one. We found that this mechanism smooths the final classification results and yields better accuracy than classifying from a single image.
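As an illustrative sketch of the weighted-moving-average idea (the weights used in our actual system are not disclosed, so the ones below are assumptions): blend each frame's per-person scores with the scores of the preceding frames before picking the best label.

```python
def smooth_predictions(frame_scores, weights=(0.5, 0.3, 0.2)):
    """Blend each frame's class scores with those of the previous frames.

    frame_scores: list of dicts mapping person -> score, one dict per frame.
    weights: how much the current frame and the two before it count;
             these particular values are an illustrative choice.
    """
    smoothed = []
    for t, scores in enumerate(frame_scores):
        blended = {}
        for person in scores:
            total, wsum = 0.0, 0.0
            for k, w in enumerate(weights):
                if t - k >= 0:
                    total += w * frame_scores[t - k].get(person, 0.0)
                    wsum += w
            blended[person] = total / wsum
        smoothed.append(max(blended, key=blended.get))
    return smoothed
```

A single flickering frame where the wrong person briefly scores highest gets outvoted by its neighbors, which is exactly the stabilizing effect described above.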


Featured image by Ars Electronica / CC BY


AI frontdesk – improve office security and working conditions

Imagine someone in your office who serves as a doorkeeper, takes care of visitors, and even watches over your working conditions, 24/7. One of our missions is to explore AI solutions that address society's problems and improve people's quality of life, so we have developed an AI-powered front desk to do all of the tasks mentioned above.

According to the 2016 annual report from Taiwan's MOL (Ministry of Labor), the average annual work time of a Taiwanese employee is 2,106 hours. Compared with OECD statistics, this number ranks No. 3 in the world, just below Mexico and Costa Rica.

Recently, on December 4, 2017, the first review of the Labor Standards Act revision was passed. The new version of the law will allow flexible work-time arrangements and expand the monthly maximum work hours up to 300. Other major changes in the amendment include conditionally allowing employees to work 12 days in a row and reducing the minimum break between shifts from 11 hours down to 8. The ruling party plans to finish the second and third readings of this revision early next year (2018), which would put 9 million Taiwanese workers in a worse working environment. To shed the bad reputation of "Taiwan, the Island of Overwork", we need a system that notifies both employee and employer when someone has been severely overworked, and whose attendance reports cannot easily be manipulated.

In May 2017, Luo Yufen, an employee of Pxmart, one of Taiwan's major supermarket chains, died of long-term overwork after seven days in a coma. However, the OSHA (Occupational Safety and Health Administration) initially found no evidence of overwork after reviewing the clocking records provided by Pxmart, which looked 'normal'. It wasn't until August, when Luo's case was sent back for further investigation, that her real working hours before her death proved her overwork.


Recognize The Speech of Taiwan

We are exploring new ways for people to interact with technology in the age of AI, and speech is one of the most common and natural means of communication. In this post we introduce the core recipes of our automatic speech recognition system for Taiwan.

Cornerstone of Natural Human-Computer Interaction

Mobiles, IoT, wearable devices, and robots: our daily lives will be more and more surrounded by smart devices in the future. To interact with them naturally, just as with human beings, we need to develop related AI techniques such as machine learning, computer vision, natural language processing, and speech processing.

Speech recognition, also known as ASR (Automatic Speech Recognition), is one of the cornerstones that link all of these interactions together. With deep-learning-based models and graph-based decoders, ASR is becoming more reliable in both accuracy and speed.


Unique Language Habits in Taiwan

New word usages, phrases, and sentence structures are generated every day in modern society and across cultures. This is especially true in Taiwan, where the language habits of Taiwanese people differ from those of other Mandarin speakers.

For these reasons, current ASR solutions in the Mandarin-speaking space are limited when it comes to supporting the everyday usage of Taiwanese people. For example, PTT, the biggest forum and Internet community in Taiwan, coins hundreds of words and phrases every month. These newly created words may be used repeatedly and spread quickly by millions of users in online chats and posts.

Therefore, the challenge of building a localized ASR system is not only training a local neural network model, but also making the system update and adapt rapidly to a dynamically evolving language.
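As a toy illustration of rapid language-model adaptation (our production recipe is more involved than this), a count-based bigram model can absorb newly coined words simply by updating its counts with fresh text:

```python
from collections import Counter

class BigramLM:
    """Tiny count-based bigram language model that can absorb new text
    incrementally, so freshly coined words start scoring immediately."""

    def __init__(self):
        self.unigrams = Counter()
        self.bigrams = Counter()

    def update(self, sentences):
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))

    def prob(self, word, prev, alpha=1.0):
        # Add-alpha smoothing so unseen pairs still get nonzero probability.
        vocab = max(len(self.unigrams), 1)
        return (self.bigrams[(prev, word)] + alpha) / (self.unigrams[prev] + alpha * vocab)
```

Feeding the model a fresh snapshot of forum text immediately raises the scores of the new phrases it contains; a real ASR system would additionally need those words in its lexicon and decoding graph.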



With a Taiwan-specific language model, our ASR can be much friendlier to speech-related applications in Taiwan.


Multi-Language Speech Recognition

Although Mandarin is the official language of Taiwan, a Mandarin-only ASR system cannot satisfy our goals. Taiwan is a place with many different cultures. In addition to Mandarin, other languages such as English, Taiwanese, Hakka, and Indigenous languages are also used quite often. To deal with this, we gathered linguistics, phonetics, and machine learning experts to set up a standard process for when the ASR system faces cross-language requirements.



This process includes enriching the language model with multiple languages and handling mixed-language words and sentences. Our early ASR experiments on Taiwanese are working, and we are now bringing the system up to production level.


ASR Applications

Our ASR system is already powering the front-desk system. When employees arrive at the office, they interact with the ASR system for door access and no longer need ID cards or badges.

An employee asks the ASR system for door access

Another application is generating transcripts or captions automatically. Videos of news, conferences, and interviews can be converted to text in real time using ASR.

News videos can now get live captions generated with ASR

Our ASR API is ready to be opened up; contact us if you are interested in further cooperation.


Looking Forward

Speed, accuracy, multi-language support, and rapid updates are the core aspects of an easy-to-use ASR system. We are continuously improving these and trying different deep learning algorithms to reach the point where AI does a better job than humans in this field. If you are interested in working on this problem, please contact us; we are actively hiring!


Featured image by Peter Coombe / CC BY