Azoospermia with Deep Learning Object Detection

Introduction to Azoospermia

Azoospermia is the medical term for the condition in which there is no measurable sperm in a man’s semen, and it is the main challenge in male infertility. Azoospermia can be divided into two classes: obstructive azoospermia (OA) and non-obstructive azoospermia (NOA). In OA, the testicular size and serum hormone profile are normal. NOA, on the other hand, means the process of spermatogenesis is abnormal, and to make a further diagnosis the doctor has to examine the cell findings of testicular specimens. Originally, it took 2-3 days to complete a pathological diagnosis. To make the process more efficient, the Department of Urology of Taipei Veterans General Hospital has developed a standard process using the testicular touch print smear (TPS) technique to make real-time diagnoses. However, learning to interpret TPS takes a long time. Machine learning and deep learning technologies have been applied to many kinds of medical images and have become a field with many researchers involved. AI Labs is cooperating with Dr. William J. Huang (黃志賢醫師) from Taipei Veterans General Hospital to let AI assist surgeons in reading TPS slides. We aim to apply object detection techniques to testicular specimens to find six cell types: Sertoli cells, primary spermatocytes, round spermatids, elongated spermatids, immature sperm, and mature sperm. These cells are essential for determining the different stages of azoospermia.



In this task, our goal is to detect cells of the above six classes in an individual image; the desired outputs are accurate bounding boxes and their labels. Since no open dataset exists for this task, we had to build our own. The provided input images are 2D testicular specimens captured from an electron microscope, with ground-truth boxes and labels. The cell dataset currently contains 120 images with over 4,500 cells, annotated and reviewed by Dr. Huang and his assistant. Considering the input size, the number of classes, and performance, we chose EfficientDet as our model for detecting cells in TPS.


Example of an input image with annotations. Different colors indicate different classes.


What is EfficientDet?

EfficientDet is the current state-of-the-art network for object detection. It consists of a backbone network and a feature network. Inputs are fed into the backbone network; features are extracted from different layers of the backbone and sent to the feature network. The feature maps are combined with different strategies depending on the network used. At the end of the feature network, two heads with several layers predict the final bounding-box positions and class labels. In our setting, EfficientDet uses an EfficientNet pretrained on ImageNet as the backbone and BiFPN as the feature network.

EfficientDet Model Structure.


EfficientNet is a model that applies a compound scaling strategy to improve accuracy. When pursuing higher performance, researchers often scale up model width, depth, or resolution. However, the results are often contrary to expectations if the model becomes too complicated. The authors of EfficientNet combined the different scaling dimensions through Neural Architecture Search to find a suitable composite, hence the name compound scaling. There are 8 levels of EfficientNet in total, and we chose EfficientNet-B3 as our backbone considering the difficulty of our task, the input size, and the model size.


Illustration of compound scaling strategy.


BiFPN is a feature pyramid network (FPN) with both top-down and bottom-up paths for combining feature maps, whereas the original FPN has only a top-down path. The purpose is to enrich the feature representations so that bounding-box regression and classification perform better.


Comparison of different FPNs.

Implementation Details

We made some modifications to apply EfficientDet to our data. First, we reduced the anchor size and the intersection-over-union (IoU) threshold for matching anchors to ground-truth boxes. The reason is that the smallest cell is only around 8 px, much smaller than the default base anchor size of 32 px. Also, since the boxes are quite small, matching ground-truth boxes and anchors under a looser condition makes learning easier. Furthermore, we sample K = 20 matched anchors, instead of all candidates, for computing losses and updates. The image size is set to 768×1024.
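The matching-and-sampling scheme can be sketched as follows. This is an illustrative sketch only: the 0.3 IoU threshold is an assumed example of a “looser condition” (the text does not state the exact value we used), and the helper names are ours; only K = 20 comes from the setting above.

```python
import numpy as np

MATCH_IOU_THRESHOLD = 0.3  # assumed "looser" threshold, for illustration
K = 20                     # number of matched anchors sampled per image

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def sample_matched_anchors(gt_boxes, anchors, rng=np.random.default_rng(0)):
    """Collect anchors whose IoU with any ground-truth box exceeds the
    threshold, then sample at most K of them for the loss computation."""
    matched = set()
    for gt in gt_boxes:
        idx = np.where(iou(gt, anchors) >= MATCH_IOU_THRESHOLD)[0]
        matched.update(idx.tolist())
    matched = np.array(sorted(matched))
    if len(matched) > K:
        matched = rng.choice(matched, size=K, replace=False)
    return matched
```

Lowering the threshold lets tiny 8 px cells collect enough positive anchors to learn from, while capping at K keeps the loss from being dominated by a few large, easy boxes.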

On this small dataset, with 108 training images and 12 testing images, we reach 71% mAP and 76% recall.

Figure of input ground-truth data (bold frames) and predicted boxes (thin frames). The numbers are confidence scores.



We demonstrate that modern machine learning and deep learning methods can be applied to medical images and achieve satisfying performance. This model will help surgeons interpret the smear more easily, and may even speed up the surgery, as we are actively working on improving the model with more data.



[1] Tan, M. & Le, Q. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. Proceedings of the 36th International Conference on Machine Learning, PMLR 97:6105-6114.

[2] M. Tan, R. Pang and Q. V. Le, “EfficientDet: Scalable and Efficient Object Detection,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 10778-10787, doi: 10.1109/CVPR42600.2020.01079.

DockCoV2: a drug database against SARS-CoV-2

The COVID-19 pandemic is a global health crisis. From December 2019 to September 2020, SARS-CoV-2 infected over 32 million people and caused more than one million deaths worldwide. One of the best-known ways to fight the novel coronavirus is to block enzymes essential for viral entry or replication. The Genomics team at Taiwan AI Labs collaborated with Professor Hsuan-Cheng Huang at National Yang-Ming University, Professor Chien-Yu Chen and Distinguished Professor Hsueh-Fen Juan at National Taiwan University, with support from MOST and NTU, to develop DockCoV2, a database that aims to find effective antiviral drugs against SARS-CoV-2.

The research team explores new opportunities for drug repurposing, the process of finding new uses for existing approved drugs, which is believed to offer great benefits over de novo drug discovery and to enable rapid clinical trials and regulatory review for COVID-19 therapy.

We developed DockCoV2 by performing molecular docking analyses of seven proteins (spike, 3CLpro, PLpro, RdRp, N protein, ACE2, and TMPRSS2) against 2,285 FDA-approved and 1,478 NHI drugs. DockCoV2 also provides appropriate validation information with literature support. Several databases focus on delivering repurposed drugs against SARS-CoV-2, but to our knowledge none provides a more up-to-date and comprehensive resource of drug-target docking results for drugs repurposed against SARS-CoV-2.

DockCoV2 is easy to use and search, is well cross-linked to external databases, and provides state-of-the-art prediction results in one site. It offers not only docking structures and ligand information but also experimental data, including biological assays, pathway information, and gene set enrichment analyses recruited from other validated databases. Users can download the drug-protein docking data of interest and examine additional drug-related information on DockCoV2. We have also released our scripts and source code on GitHub.

Article link:
Ting-Fu Chen, Yu-Chuan Chang, Yi Hsiao, Ko-Han Lee, Yu-Chun Hsiao, Yu-Hsiang Lin, Yi-Chin Ethan Tu, Hsuan-Cheng Huang, Chien-Yu Chen, Hsueh-Fen Juan. DockCoV2: a drug database against SARS-CoV-2. Nucleic Acids Research (2020)


Figure. Overview of the database content. In addition to the docking scores, DockCoV2 provides a joint panel section with the following related information: docking structure, ligand information, and experimental data.

Harmonia: an Open-source Federated Learning Framework

Federated learning is a machine learning method that enables multiple parties (e.g., mobile devices or organizations) to collaboratively train a model, orchestrated by a trustable central server, while keeping data local. It has gained a lot of attention recently due to increasing awareness of data privacy.

At Taiwan AI Labs, we started an open-source project that aims at developing systems, infrastructure, and libraries to ease the adoption of federated learning for research and production use. It is named Harmonia, after the Greek goddess of harmony, to reflect the spirit of federated learning: multiple parties collaboratively building an ML model for the common good.

System Architecture

Figure 1: Harmonia system architecture


The design of the Harmonia system is inspired by GitOps. GitOps is a popular DevOps practice in which a Git repository maintains declarative descriptions of the production infrastructure, and updates to the repository trigger an automated process to make the production environment match the state described in the repository. Harmonia leverages Git for access control, model version control, and synchronization among the server and participants in a federated learning (FL) training run. The FL training strategy, global models, and local models/gradients are kept in Git repositories. Updates to these Git repositories trigger FL system state transitions, which automates the FL training process.

An FL participant runs as a Kubernetes (K8s) pod composed of an operator container and an application container. The operator container is in charge of maintaining the FL system states and communicates with the application container via gRPC. Local training and aggregation functions are encapsulated in application containers. This design enables easy deployment in a Kubernetes cluster environment and quick plug-in of existing machine learning (ML) workflows.

Figure 2: Illustration of FL with two clients

Figure 2 illustrates the Harmonia workflow with two local training nodes. The numbers in the figure indicate the steps of an FL run in the first Harmonia release. To start an FL training run, a training plan is registered in the Git registry (1), and the registry notifies all participants via webhooks (2). The two local nodes are then triggered to load a pretrained global model (3) and start local training for a predefined number of epochs (4). When a local node completes its training, the resulting model (called a local model) is pushed to the registry (5), and the aggregator pulls it (6). Once the aggregator has received the local models of all participants, it performs model aggregation (7), and the aggregated model is pushed to the Git registry (8). The aggregated model is then pulled by the local nodes to start another round of local training (9). These steps are repeated until a user-defined convergence condition is met, e.g., a number of rounds. The sequence diagram of an FL run is shown in Figure 3.
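The numbered round loop above can be sketched in a few lines of Python. This is a toy stand-in: real Harmonia edges train full models and exchange weights through the Git registry and gRPC, not through in-process function calls, and the function bodies here are illustrative placeholders.

```python
def local_train(global_model, data, epochs):
    # Stand-in for an edge's local training (steps 3-4): nudge each
    # weight toward the edge's data value for a few epochs.
    model = dict(global_model)
    for _ in range(epochs):
        model = {k: v + 0.1 * (data - v) for k, v in model.items()}
    return model

def aggregate(local_models):
    # Stand-in for the aggregator (step 7): plain average of edge models.
    keys = local_models[0].keys()
    n = len(local_models)
    return {k: sum(m[k] for m in local_models) / n for k in keys}

def run_federated_training(global_model, edge_data, rounds, epochs):
    # Each iteration corresponds to steps (3)-(9) of one FL round.
    for _ in range(rounds):
        locals_ = [local_train(global_model, d, epochs) for d in edge_data]
        global_model = aggregate(locals_)  # steps (7)-(8)
    return global_model
```

In Harmonia, the push (5, 8) and pull (6, 9) steps between these calls are Git commits to the edge-model and aggregated-model repositories, with webhooks driving the state transitions.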

Figure 3: Sequence diagram

Below we detail the design of the Git repositories, workflows of an operator container for both an aggregator and a local node, and an application container.

Git Repositories

We have three types of repositories in the registry:

  1. Training Plan: it stores the required parameters for an FL run. The training plan is a JSON file:

    {
        "edgeCount": 2,
        "roundCount": 100,
        "epochs": 100
    }

  2. Aggregated Model: it stores aggregated models pushed by the aggregator container. The final aggregated model is tagged with inference-<commit_hash_of_train_plan>.
  3. Edge Model: these repositories store local models pushed by each node separately.


Edge and aggregator operator containers control the FL states. Figures 4 and 5 show the workflows of the edge and aggregator operators, respectively.

Edge Operator

Figure 4: Workflow in an edge node

When a training plan is registered in Git, an edge node starts a training process with its local data. The resulting new local model weights are pushed to the Git registry, and the edge node then waits for the aggregator to merge the new model updates from all participating edge nodes. Another round of local training is performed once the aggregated model is ready in the Git registry. This process repeats until it reaches the number of rounds specified in the training plan.

Aggregator Operator

Figure 5: Workflow in Aggregator

For a new FL run, an aggregator operator starts in a state where it waits for all edges to finish their local training, then notifies the application container in the aggregator server to perform model aggregation. The newly aggregated model is pushed to the Git registry. The process iterates until it reaches the number of rounds specified in the training plan.


The local training and model aggregation tasks are encapsulated in an application container, which is implemented by users. An application container communicates with its operator container via gRPC. Harmonia works with any ML framework. In the SDK, we provide an application container template so that users can easily plug in their training pipelines and aggregation functions without handling the gRPC communication themselves.


We demonstrate the usage of Harmonia with pneumonia detection on chest X-rays. The experiment is based on a neural network architecture developed by Taiwan AI Labs.

We took the open RSNA Pneumonia dataset [1] and composed two different FL datasets. In this experiment, we assumed 3 hospitals. We first randomly split the whole dataset into a training set (80%) and a testing set (20%). In the first FL dataset, we randomly assigned training data to edges. In a real-world scenario, data from different hospitals are often non-IID (not independent and identically distributed). Therefore, in the second FL dataset, the ratios of positive and negative data on each edge are set differently. Table 1 shows the numbers of positive and negative training samples for centralized training and for federated training with IID and non-IID data, respectively.


Table 1: Number of positive and negative training data of each training method

We adopted Federated Averaging (FedAvg) [2] as our aggregation method in this experiment. Local models are averaged by the aggregator proportionally to the number of training samples on each edge. Edges train for one epoch in each round, and the total number of epochs is the same as in centralized training.
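The FedAvg weighting described above can be written down directly. A sketch only: in practice the weights are per-layer tensors rather than scalars, and the sample counts come from each edge's training plan.

```python
def fedavg(local_weights, sample_counts):
    """Average local models proportionally to the number of training
    samples on each edge, as in FedAvg [2]."""
    total = sum(sample_counts)
    keys = local_weights[0].keys()
    return {
        k: sum(w[k] * n for w, n in zip(local_weights, sample_counts)) / total
        for k in keys
    }
```

For example, an edge holding three times as many samples contributes three times as much to each averaged weight.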

Figure 6: Classification accuracy


The results are shown in Figure 6. Both IID and non-IID FL achieve classification accuracy comparable to centralized training, but the non-IID dataset takes more epochs to reach convergence; that is, non-IID FL converges more slowly than IID FL.

Privacy Module

To enforce differential privacy (DP), Harmonia provides a PyTorch-based package that implements two types of DP mechanisms. The first is based on the algorithm proposed in [3, 5], a differentially private version of SGD that randomly adds noise to SGD updates. Users can simply replace the original training optimizer with the DPSGD optimizer provided by Harmonia. The second is the Sparse Vector Technique (SVT) [4], which protects models by selectively sharing distorted components of the weights. To adopt this protection mechanism, a user passes a trained model to the ModelSanitizer function provided by Harmonia.
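The core idea of the DP-SGD mechanism in [3, 5] can be illustrated in a few lines. This is a NumPy sketch of the idea, not Harmonia's PyTorch implementation, and the function name and parameters are ours: each per-example gradient is clipped to a norm bound, the clipped gradients are averaged, and calibrated Gaussian noise is added before the update.

```python
import numpy as np

def dp_sgd_update(weights, per_example_grads, lr, clip_norm,
                  noise_multiplier, rng=np.random.default_rng(0)):
    """One differentially private SGD step in the spirit of [3, 5]:
    clip each per-example gradient, average, add Gaussian noise."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose norm exceeds clip_norm.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    avg = np.mean(clipped, axis=0)
    # Noise scale is proportional to the clipping bound, divided by batch size.
    noise = rng.normal(
        0.0, noise_multiplier * clip_norm / len(per_example_grads),
        size=avg.shape)
    return weights - lr * (avg + noise)
```

Clipping bounds each example's influence on the update, which is what makes the added Gaussian noise sufficient for a formal privacy guarantee.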


The first release includes the Harmonia-Operator SDK and the differential privacy modules. We will continue to develop essential FL components (e.g., participant selection) and enable more flexible ways of describing FL training strategies. We welcome contributions of new aggregation algorithms, privacy mechanisms, datasets, and more. Let’s work together to help federated learning flourish.


[1] RSNA Pneumonia Detection Challenge.

[2] Communication-Efficient Learning of Deep Networks from Decentralized Data. Brendan McMahan et al., in Proceedings of AISTATS, 2017

[3] Deep Learning with Differential Privacy. Martín Abadi et al., in Proceedings of ACM CCS, 2016

[4] Understanding the sparse vector technique for differential privacy. Min Lyu et al., in Proceedings of VLDB Endowment, 2017

[5] Stochastic gradient descent with differentially private updates. Shuang Song et al., in Proceedings of GlobalSIP Conference, 2013


AI Labs released an annotation system: Long live the medical diagnosis experience.

The Dilemma of Taiwan’s Medical Laboratory Sciences

Thanks to the Bureau of National Health Insurance in Taiwan, abundant medical data are appropriately recorded. This is surely good news for us, an AI-based company. However, most of the medical data have not been labeled yet. What’s worse, Taiwan currently faces a terrible shortage of medical talent: the number of experienced masters of medical laboratory sciences is getting smaller and smaller. Take malaria diagnosis for example. Malaria parasites belong to the genus Plasmodium (phylum Apicomplexa). In humans, malaria is caused by P. falciparum, P. malariae, P. ovale, P. vivax, and P. knowlesi. It is undoubtedly arduous work for a human to detect the affected cells and classify them into these five classes. Unfortunately, there is only one retiring master in this field in Taiwan who can truly confirm the correctness of the diagnosis. We must take remedial action right away, yet it costs too much time and money to train a human being to become a malaria master. Only through the power of technology can we preserve this valuable diagnostic experience.

Now, we have decided to solve the problem by transferring human experience to machines, and the first step is to annotate the medical data. Since the only master cannot address the overwhelming amount of data by himself, he needs helpers to do the first-pass detection, with the master giving the final confirmation at the end. In this case, we need a system that allows multiple users to cooperate. It should record the file path of the labeled data, the annotators, and the label time. We searched for assorted off-the-shelf annotation systems, but, unfortunately again, none of them met our specification. So we rolled up our sleeves and revised the most relevant one for our purpose.

An Improved Annotation System

Starting from an open-source annotation system [1], AILabs revised it and released a new, easy-to-use annotation system to help those who want to create their own valuable labeled data. With our labeling system, you will know who the previous annotator was and can systematically revise others’ work.

This system is designed for object labeling. By creating a rectangular box around your target object, you can assign a label name and label coordinates to the chosen target. See the example below.

You will also obtain the label categories and the label coordinates in an XML file in PASCAL VOC format. You can then use the output XML file directly as the input of your machine learning programs.
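For instance, a PASCAL VOC annotation can be read back with the standard library alone. The file name and label below are made-up examples, not actual output of our system; the XML layout follows the standard VOC format with one <object> element per labeled box.

```python
import xml.etree.ElementTree as ET

# A hypothetical PASCAL VOC-style annotation, one <object> per labeled box.
VOC_XML = """<annotation>
  <filename>smear_001.png</filename>
  <object>
    <name>infected_cell</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>60</xmax><ymax>80</ymax></bndbox>
  </object>
</annotation>"""

def parse_voc(xml_text):
    """Return (label, (xmin, ymin, xmax, ymax)) tuples from a VOC annotation."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        coords = tuple(int(bb.findtext(t))
                       for t in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name, coords))
    return boxes
```

Because the format is standard, the same parsing step feeds the labels into most object detection training pipelines unchanged.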

How does it work?

Three steps: Load, Draw and Save.

In Load: feed an image into the system. It is fine if you do not have an XML file yet, since it is your first time operating the system.

In Draw: create as many labels in an image as you want. Don’t forget that you can zoom in/out if the image is not clear enough.

In Save: click the Save button and everything is done. The system will output an XML file including all the label data for the image.

What’s Next?

With sufficient annotated data, we can then train our machines on the labels annotated by the medical master, making them able to make a diagnosis as brilliant as the last master’s. We will keep working on it!


[1] Tzutalin. LabelImg. Git code (2015).

AI reads Medical Literature “variant2literature”

The importance of AI

Before the existence of medical AI systems, medical professionals and genome researchers were limited by the extensive time and money required to compensate professionals for their work in genomic standardization, analysis, and the comparison of gene variants and symptoms. Moreover, full genomic analysis is difficult to achieve because the human genome is composed of over twenty thousand individual genes.

AI applies to many areas of genomic analysis, such as gene-assisted diagnosis, human genome annotation, and quantification of the correlation between gene variants and diseases. There is still a long way to go before we understand the purposes and mechanisms of all of the more than twenty thousand human genes and can predict how gene variants affect them, but AI can assist in establishing correlation matrices and prediction models, except in cases such as genetically transmitted diseases, drug reactions, and cancerous genes, where data are significantly scarcer.

As such, we employed AI to develop variant2literature, which peruses a large amount of medical literature and finds variants related to diseases of interest, assisting medical professionals in efficiently predicting possible underlying diseases correlated with a variant while raising the precision of diagnosis. In addition to finding the literature containing the variants of interest, variant2literature also predicts the association when a variant appears along with a disease name in a single sentence.

The following paragraphs detail our methods and experimental results.


Data collection and test results

To determine the associations between diseases and variants, the first indispensable step is extracting biomedical terms from the literature. In variant2literature, we employed GNormPlus, tmVar, and DNorm to identify the genes, variants, and diseases mentioned in PubMed Central (PMC). These tools are provided by the National Center for Biotechnology Information (NCBI), a part of the U.S. National Library of Medicine (NLM).

GNormPlus is a system that identifies gene mentions in text. It is composed of two components: mention recognition and concept normalization. For mention recognition, GNormPlus uses conditional random fields (CRFs) to recognize gene descriptions. However, each description still needs to be matched to the gene it describes, which is why GNormPlus uses GenNorm in its concept normalization module to find the matching gene via exact matching or “bag-of-words” matching on descriptions.

tmVar is also a CRF-based text-mining approach, used to extract a wide range of sequence variants for both proteins and genes. These variants are defined according to the standard sequence variant nomenclature developed by the Human Genome Variation Society (HGVS). tmVar pre-processes the input text via tokenization and uses a CRF-based model to extract variant mentions for the final output.

DNorm is used to identify disease mentions, which are recognized by the BANNER named entity recognizer, a trainable system that also uses CRFs. Mentions output by BANNER are then normalized and identified using pairwise learning-to-rank.

Finally, a recurrent neural network (RNN) deep learning model is used to predict the association between variants and diseases. This model was trained on a small set of literature annotated by our experts. We identified genes, variants, and diseases in these articles and categorized the types of relationships. Whenever a disease and a variant appear in the same sentence, experts denote whether the pair is correlated: “Y” for yes and “O” for no. In other words, this is a machine learning algorithm for binary classification. Each labeled sentence first passes through our self-trained Word2Vec model, which converts the tokenized sentence into vectors; these vectors are input to our RNN model, which outputs the relationship between the two instances (“Y” or “O”). After training, this RNN model can be applied to the entirety of PMCOA to identify all relationships between diseases and variants.
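The tokenize → embed → recur → classify pipeline can be illustrated with a toy forward pass. Everything here is a stand-in: the embeddings and weights are random and untrained, whereas the production system uses our self-trained Word2Vec vectors and trained RNN weights; only the overall data flow matches the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in vocabulary with random 8-d "word vectors" (real system: Word2Vec).
EMB = {w: rng.normal(size=8) for w in
       ["the", "variant", "causes", "disease", "in", "patients"]}

# Untrained recurrent and output weights, for illustration only.
W_xh = rng.normal(size=(16, 8))
W_hh = rng.normal(size=(16, 16))
w_out = rng.normal(size=16)

def classify(sentence):
    """Run the sentence through a simple recurrent cell and score the
    final hidden state for the binary label Y (related) / O (not)."""
    h = np.zeros(16)
    for token in sentence.lower().split():
        x = EMB.get(token, np.zeros(8))    # unknown tokens -> zero vector
        h = np.tanh(W_xh @ x + W_hh @ h)   # recurrent update
    score = 1 / (1 + np.exp(-w_out @ h))   # sigmoid over final state
    return "Y" if score >= 0.5 else "O"
```

With trained weights, the same forward pass scores each sentence in which a variant and a disease co-occur.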



On variant2literature, the user inputs either a gene or a variant. variant2literature normalizes the input, searches for it in the indexed papers, then outputs the related papers with the relevant genes, variants, and diseases labeled. If a disease and a gene variant appear within the same sentence, variant2literature determines the correlation between the two based on the surrounding context.

Through AI-assisted analysis, variant2literature automatically determines the correlation between a disease and a gene variant, greatly reducing the time spent comparing data, reducing the cost of gene analysis, allowing medical professionals to efficiently detect underlying diseases, and setting a new milestone for disease-related genetic testing in Taiwan.

Open source blockchain for AI Data Justice

[ Link to Chinese-language coverage by Crossing (換日線) ]

The beginning of the “Data Justice” movement

By collaborating with online citizens and social science workers in Taiwan, Taiwan AILabs promotes “Data Justice” under the following principles:

  1. Prioritize privacy and integrity, with goodwill for applications, before data collection
    • In addition to privacy protection acts, review tech giants for potential abuse of a monopoly position to force users to give up their privacy, or for misuse of user content and data for different purposes. In particular, organizations that have become market monopolies should be reviewed regularly by the local administration to determine whether data are abused when users unwillingly give up their privacy.
  2. Users’ data and activities belong to users
    • The platform should remain neutral to avoid the misuse of user data and its creation.
  3. Public data collected should be open for public research
    • The government organization holding the data is responsible for its openness while privacy and integrity are secured. Examples include health insurance data for public health and smart city data for traffic research.
  4. Regulate mandatory data openness
    • For data critical to major public welfare but controlled by a monopoly private agency, the administration should be equipped with the power to mandate data openness.
    • For example, Taipower electric power usage data in Taiwan.

Monopoly now is worse than the oil monopoly

In 1882, the American oil giant John D. Rockefeller founded the Standard Oil Trust and united about 40 oil-related companies to achieve price control. In 1890, the U.S. government sued the Standard Oil Trust to prevent unfair monopoly. Antitrust laws were formulated to ensure fair trade and fair competition and to prevent price manipulation, and the governments of various countries followed by establishing their own anti-monopoly laws. In 1984, AT&T, a telecom giant, was split into several companies under antitrust law. Microsoft was sued for bundling Internet Explorer with its operating systems, with the case concluding in 2001.

In 2003, the network neutrality principle mandated that ISPs (Internet Service Providers) treat all data on the Internet the same. The FCC (Federal Communications Commission) successfully stopped Comcast, AT&T, Verizon, and other giants from slowing down or throttling traffic at the application or domain level. Apple FaceTime, Google YouTube, and Netflix all benefited from the principle. Ten years later, the oil and ISP companies are no longer among the top 10 most valuable companies in the world; instead, the Internet companies protected by network neutrality a decade ago have become the new giants. In the US market, the world’s most valuable companies dominate market share in many areas: as of February 2018, Apple held 50% of the smartphone market, Google handled more than 60% of search traffic, and Facebook controlled nearly 70% of social traffic. Together, Facebook and Google control 73% of the online ads market, and Amazon is on the path to grabbing 50% of online shopping revenue. In China, the situation is even worse: AliPay is owned by Alibaba and WeChat Pay by Tencent, and together the two companies account for 90% of China’s payment market.

When data became weapons, innovators and users became meat on the chopping block

After a series of AI breakthroughs in the 2010s, big data became as important as crude oil. In the Internet era, users grant Internet companies permission to collect their personal data in exchange for the convenience of connecting with credible users and content. For example, a magazine publishes articles on Facebook because Facebook lets users subscribe to its articles, while the publisher can manage its subscriber relationships through the messenger system. The recommendation system ranks users and their published content. All the free services are sponsored by advertisements, which pay the cost of Internet space and traffic. This model encouraged more users to join the platform, and the users and content accumulated on the platform attracted still more users to participate. In the 4G mobile era, users are always online, pushing data aggregation to a whole new level. After mergers and acquisitions between Internet companies, a few companies now stand out, dominating users’ daily data. New initiatives can no longer reach users easily by launching a new website or an app. On the other hand, Internet giants can easily issue a copycat of an innovation and leverage their traffic, funding, and data resources to take over its territory. Startups have little choice but to be acquired or burn out through unfair competition. There are fewer and fewer stories about innovation from garages, and more and more stories about tech giants copying startup ideas before they even take shape. As a well-quoted statement in China puts it: “Be acquired or die; no new start-up will ever bypass today’s giants.” The phenomenon of monopoly also limits users’ choices: if a user does not consent to the data collection policy, there is usually no alternative platform.

Net Neutrality repealed, giants eat the world

Nasim Aghdam’s anger at YouTube casts a nightmarish shadow over how it deals with creators and advertisers: she shot at the YouTube headquarters, injuring three people, and then killed herself. At the beginning of the Internet era, innovative content creators could be reasonably rewarded for their own creations. However, after the platforms became monopolies, content providers found that their creations were ranked by opaque algorithms that placed their content farther and farther away from their loyal subscribers. Before subscribers can reach that content, poor advertising and fake news stand in the way. If a publisher wants to retain its original popularity, the creator must also pay for advertisement; suddenly, reputable content providers are being charged for reaching their own loyal subscribers. Even worse, their subscribers’ information and user behavior are consumed by the platform’s machine learning algorithms to serve targeted ads. At the same time, the platform does not really screen advertisers effectively, so low-quality fake news and fake ads are served, enabling scams and election manipulation. After the Facebook scandal, users discovered that their own private data had been used through analysis tools to target their minds. Yet during the #deletefacebook movement, users found no alternative platform due to the monopoly of the technical giants: their friends are all on the platform.

In December 2017, the FCC voted to repeal the net neutrality principle on the grounds that the US had failed to achieve net neutrality, and that the ISPs were not the ones to blame. A decade on, the Internet companies that benefited from net neutrality are now the monopoly giants, and net neutrality was never applied to their private ranking and censorship algorithms. Facebook, for example, offers mobile access to selected sites on its platform at different data service charges, a practice widely panned for violating net neutrality principles, yet it is still active in 63 other countries around the world. The situation is getting worse in the era of AI. Tech giants have leveraged their data power and stepped into the automotive, medical, home, manufacturing, retail, and financial sectors. Through acquisitions, the giants are rapidly accumulating new types of vertical data and forcing the traditional industries to give up their data ownership. The traditional industries are facing an even larger and smarter technology monopoly than the ISP or oil companies of a decade ago.

Taiwan’s experience may mitigate the global data monopoly

Starting from the root cause, from a vertical point of view: the user who contributed the data was motivated by trust in their friends or in a reputable content provider. For convenience and better service, the user consents to the collection of their private data and grants the platform the right to analyze it further. The user who contributed the content consents to publishing their creations on the platform because the audience is already there. The platform thus owns the power of data and content that should originally belong to the users and publishers. For privacy, safety and convenience purposes, the platform then prevents other platforms or users from consuming the data. Repeated over and over, this results in an exclusive platform for users and content providers.

From a horizontal point of view: to reach users, and for data and traffic, startups sign unfair agreements with the platform. In the end, good innovations are usually swallowed by the platform, because the platform also owns the data and traffic those innovations depend on. The platform thus grows larger and larger, by either merging with or copying each good innovation.

To break this vicious cycle and create a fair competitive environment for AI research, Taiwan AILabs shared Taiwan’s unique experience of Data Justice at the Taipei Global Smart City Expo on March 27, 2018, and at a panel of the Taiwan-German 2018 Global Solution Workshop on March 28, with visiting experts and scholars on data policy making. In those discussions, we identified opportunities that can potentially break the cycle.

These opportunities come from the following observations about Taiwan. The world’s mainstream online social network platforms are run by private companies optimized for advertising revenue. Taiwan, by contrast, has a mature network of users, open source contributors and open data campaigns; “Internet users” in Taiwan are closer to “online citizens”. The Taiwanese Internet platform PTT, for example, is not run for profit: its users elect the board managers directly. Over the years this culture has not cooled down, and PTT still dominates. Because it gives every voice equal weight, it is difficult to manipulate through advertising money, and fake news and fraud are quickly exposed by the evidence its users surface. In Taiwan, PTT is more of a major platform for public opinion than Facebook. Through a collaboration between PTT and Taiwan AILabs, PTT now has an AI news writer that reports news from its users’ activities; an AI-based news writer can minimize editorial bias. Another non-profit community in Taiwan focuses on citizen science and technology, promoting the transparency and openness of government organizations through hackathons. It has collaborated with the government, academia, non-governmental organizations, and international organizations to open up public data through open source collaboration in various fields.

Introducing a project: using blockchain for “Data Justice” in the AI era

PTT is Taiwan’s most impactful online platform, running for 23 years. It has its own digital currency (the P coin), instant messaging, e-mail, users, elections, and administrators elected by its users. However, the services hosting the platform are still relatively centralized.

In the past, users chose a trusted platform for trusted information, and for convenience and Internet storage space, users and content providers consented to unfair data collection. To avoid centralized data storage, blockchain technology offers new directions. A blockchain can certify users and content through its chain of trust: the credit system is not built on top of a single owner, and the content storage system is likewise built on top of the chain, which prevents any single organization from becoming a super power by controlling it. A new research project starts by learning from PTT’s data economy, combining it with the latest blockchain and encryption technology and implementing it in a decentralized approach.

The mainstream social network platforms in China and the United States created a new super power of data out of the creations of users and their friends, and they will continue to collect more information by horizontally merging with industries of unequal data power. The launch of this project rethinks data ownership from a different direction. We hope to study how to upgrade PTT for the era of AI, and to use the platform as a basis for enabling more industries to cooperate with data platforms. It gives data control back to users and mitigates the data monopoly now forming. The project will also collaborate with leading players in the automotive, medical, smart home, manufacturing, retail, and financial sectors who are interested in creating open community platforms.

Currently, the technology experiment has started on an independent platform; it does not yet involve the operation or migration of the current PTT. Please follow the project for the latest news.


[2018/10/24 Updates]:

The open source project is on GitHub now:

[2019/4/2 Updates]:

More open source projects are on GitHub now:



Humanity with Privacy and Integrity is the Taiwan AI Mindset

The 2018 Smart City Summit & Expo (SCSE), along with three sub-expos, took place at the Taipei Nangang Exhibition Center on March 27th, with 210 exhibitors from around the world exhibiting a diversity of innovative applications and solutions for building smart cities. Taiwan is known for its friendly and healthy business environment, ranked 11th by the World Bank. With 40+ years of ICT manufacturing and top-level embedded systems expertise, companies form a vigorous ecosystem in Taiwan. With an openness toward innovation, 17 of Taiwan’s 22 cities have made it onto the top lists of the Intelligent Community Forum (ICF).

Ethan Tu, Taiwan AILabs founder, gave a talk on “AI in Smart Society for City Governance” and laid out Taiwan’s AI position: smart cities are for “humanity with privacy and integrity”, beyond “safety and convenience”. He said, “AI in Taiwan is for humanity. Privacy and integrity will also be protected.” The maturity of crowd participation, transparency and an open data mindset are the key assets driving Taiwan’s smart cities to deliver humanity with privacy and integrity. Taiwan AILabs took its open-source news site, collaboratively edited by society and AI, as an example: city governments now consume the news to detect social events happening in Taiwan in real time, thanks to the AI news writer’s robustness and reliability at scale. AILabs collaborated with Tainan city on an AI drone project to simulate “Beyond Beauty” director Chi Po-lin, who died in a helicopter crash. AILabs also established the “Taipei Traffic Density Network (TTDN)”, supporting real-time traffic detection and prediction for Taipei city with citizens’ privacy secured: no person or car can be identified without necessity.

The Global Solutions (GS) Taipei Workshop 2018, themed “Shaping the Future of an Inclusive Digital Society”, took place at the Ambassador Hotel in Taipei on March 28, 2018. It was co-organized by the Chung-Hua Institute for Economic Research (CIER) and the Kiel Institute for the World Economy. The panel session “Using Big Data to Support Economic and Societal Development” was hosted by Dennis Görlich, Head of the Global Challenges Center at the Kiel Institute for the World Economy. Chien-Chih Liu, founder of the Asia IoT Alliance (AIOTA), Thomas Losse-Müller, Senior Fellow at the Hertie School of Governance, and Reuben Ng, Assistant Professor at the Lee Kuan Yew School of Public Policy, National University of Singapore, all participated in the discussion. Big data has been identified as the oil of AI and economic growth. Ethan shared his vision in the panel: “We don’t have to sacrifice for safety or convenience. On the other hand, the #deletefacebook movement is a good example that tech giants who overlook privacy and integrity will be dumped.”

Ethan explained three key principles from Taiwan’s society on big data collection. These principles already exist, contributed by Taiwan’s mature open Internet communities and movements. AILabs will promote them as fundamental guidance for data collection on medical records, government records, open communities and so on.

1. Data produced by users belongs to the users. Policy makers shall ensure that no single authority, such as a social media platform, is dominant enough to force users into giving up data ownership.

2. Data collected by a public agency belongs to the public. Policy makers shall ensure that any public agency collecting data provides a roadmap for opening it to the general public for research. Non-profit organizations in Taiwan, for example, already drive the open data movement.

3. “Net Neutrality” applies not only to ISPs but also to social media and content hosting services. Some platforms, for example, persist in equality of voice without ads; over time, this equality of voice has overcome fake news, as evidence surfaces and stands out.

“Humanity is the direction for AILabs. Privacy and integrity are what we insist on,” said Ethan.

Smart City workshop with Amsterdam Innovation Exchange Lab from the Netherlands

SITEC from Malaysia visiting


Meet JARVIS – The Engine Behind AILabs

In Taiwan AI Labs, we are constantly teaching computers to see the world, hear the world, and feel the world so that computers can make sense of them and interact with people in exciting new ways. The process requires moving a large amount of data through various training and evaluation stages, wherein each stage consumes a substantial amount of resources to compute. In other words, the computations we perform are both CPU/GPU bound and I/O bound.

This imposes a tremendous challenge in engineering such a computing environment, as conventional systems are either CPU bound or I/O bound, but rarely both.

We recognized this need and crafted our own computing environment from day one. We call it Jarvis internally, named after the system that runs everything for Iron Man. It primarily comprises a frontdoor endpoint that accepts media and control streams from the outside world, a cluster master that manages bare metal resources within the cluster, a set of streaming and routing endpoints that are capable of muxing and demuxing media streams for each computing stage, and a storage system to store and feed data to cluster members.

The core system is written in C++ with a Python adapter layer to integrate with various machine learning libraries.



The design of Jarvis emphasizes realtime processing capability. The core of Jarvis lets data streams flow between computing processors with minimal latency, and each processing stage is engineered to achieve a required throughput per second. For a long, complex procedure, we break the work down into smaller sub-tasks and use Jarvis to form a computing pipeline that achieves the target throughput. We also utilize muxing and demuxing techniques to process portions of the data stream in parallel, further increasing throughput without incurring too much latency. Once the computational tasks are defined, the blueprint is handed over to the cluster master, which allocates the underlying hardware resources and dispatches tasks to run on them. The allocation algorithm has to take special care with GPUs, as they are scarce resources that cannot be virtualized at the moment.
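Jarvis itself is a proprietary C++ system, but the demux/mux dispatch pattern described above can be sketched in a few lines of Python. The stage functions below are hypothetical stand-ins for real processors such as decoders or inference models; this is an illustration of the pattern, not Jarvis code:

```python
import threading
from queue import Queue

def run_pipeline(items, stages, num_workers=4):
    """Run each item through `stages` (functions applied in order),
    spreading work across `num_workers` threads (the "demux") and
    collecting results back in input order (the "mux")."""
    tasks = Queue()
    results = [None] * len(items)

    def worker():
        while True:
            job = tasks.get()
            if job is None:          # poison pill: shut this worker down
                return
            idx, item = job
            for stage in stages:     # sequential processing stages
                item = stage(item)
            results[idx] = item      # index preserves input order

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for job in enumerate(items):
        tasks.put(job)
    for _ in workers:                # one pill per worker
        tasks.put(None)
    for w in workers:
        w.join()
    return results

# Toy stages standing in for e.g. decode -> infer
out = run_pipeline([1, 2, 3, 4], [lambda x: x * 10, lambda x: x + 1])
# out == [11, 21, 31, 41]
```

A real system would replace the in-process queue with streaming endpoints, but the order-preserving index trick is the essence of muxing parallel results back into one stream.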

Altogether, Jarvis is a powerful yet agile platform for performing machine learning tasks. It handles a huge amount of work with minimal overhead. Moreover, Jarvis can be scaled out horizontally with little effort, simply by adding new machines to the cluster. It suits our needs well. We have re-engineered Jarvis several times in the past few months, and will continue to evolve it. Jarvis is our engine for moving fast in this fast-changing AI field.


Featured image by Nathan Rupert / CC BY

AI carries the torch of Malaria diagnosis in Taiwan

To meet the shortage of medical laboratory technologists, our AI is learning from the experience of medical experts at Taiwan CDC to bring expert-level precision and speed to the Malaria diagnostic process in Taiwan.

Taiwan has been on the list of Malaria-eradicated regions since 1965. Since then, there have been around 10–30 malaria cases each year, all of them imported, as reported by the Taiwan CDC (Centers for Disease Control). Due to the declining number of cases, fewer medical laboratory technologists specialize in Malaria diagnosis, while training new experts is becoming increasingly difficult.

It is said that the most experienced medical laboratory technologist at Taiwan CDC, who has been in charge of Taiwan’s Malaria diagnosis for years, is retiring soon. She has never misdiagnosed a single case. She is concerned that her experience and knowledge might not be passed on to future generations.

Thanks to the recent advancement of artificial intelligence, computers now have the potential to learn from her medical expertise and play a pivotal role in the Malaria diagnostic process. We are getting the ball rolling by collaborating with Taiwan CDC on the Malaria Diagnostics Project, leveraging AI to improve the diagnosis of Malaria.


Read more

How AI is Transforming Personalized Treatment: scientists are training AI to classify the effects of genetic mutations

Recently, using personal information to tailor personalized treatment has gained a lot of attention. In the case of cancer, the disease begins when one or more genes in a cell mutate. This makes each tumor distinct even if it originates from the same cell type, so the result of a treatment may vary across patients. By identifying a patient’s genetic mutations, doctors can find the cause of a tumor and deliver accurate treatment.

Identifying genetic mutations is becoming easier, but interpreting them remains difficult. For breast cancers, there are about 180 oncogenic mutations that contribute to tumor growth. Distinguishing them from normal variants requires examining the literature carefully. While thousands of publications study the effects of genetic mutations, this information cannot be used efficiently for lack of well-curated databases. Building such a database is a costly process that requires experts to review clinical literature manually. According to Memorial Sloan Kettering Cancer Center (MSKCC), the annotation committee it organized to review data from different sources spent two months annotating 150 genetic mutations, while 79 million mutations have been identified by the 1000 Genomes Project. Furthermore, the number of publications is growing exponentially, so an automated classification process is in demand. To speed up the curation of mutation databases, we leverage machines’ ability to read and comprehend, efficiently reviewing publications and classifying the effects of genetic mutations.

From reading to comprehending

Our goal is to train a machine that can classify mutations like a human expert. Instead of training a general comprehension model, we want it to imitate the decision-making procedure of experts, since the amount of annotated data is insufficient.

Extract key paragraphs

Reading a 10-page paper from beginning to end is time-consuming. People tend to skim the general description and read the key paragraphs carefully. These paragraphs play an important role, since a sentence mentioning a mutation name may directly conclude the effect of that mutation. To imitate this behavior, we find keywords in the text and extract their context as key paragraphs for further investigation.
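As a rough illustration of the idea (not our production code), keyword-context extraction can be sketched as follows; the keyword list and context window below are illustrative:

```python
import re

def extract_key_paragraphs(text, keywords, window=1):
    """Keep paragraphs that mention any keyword, plus `window`
    surrounding paragraphs of context, in original order."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    keep = set()
    for i, para in enumerate(paragraphs):
        if pattern.search(para):
            lo = max(0, i - window)
            hi = min(len(paragraphs), i + window + 1)
            keep.update(range(lo, hi))
    return [paragraphs[i] for i in sorted(keep)]

paper = ("Introduction to BRCA1 biology.\n\n"
         "The V1736A mutation impairs BRCA1 function.\n\n"
         "Methods and materials.\n\n"
         "Unrelated discussion.")
hits = extract_key_paragraphs(paper, ["V1736A"], window=0)
# hits == ["The V1736A mutation impairs BRCA1 function."]
```

Widening `window` pulls in neighboring paragraphs, mimicking how a reader glances at the sentences around a mutation mention.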

Vector representation

It is very common for words or documents to be encoded into vector space embeddings before being processed by machine learning models, and recent models do a great job of finding representations for words. We use the Word2Vec model to extract vector representations of gene names and mutation names, which are expected to be informative about their effects.
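In practice we rely on an off-the-shelf Word2Vec implementation; purely as an illustration of what such a model trains on, here is a minimal sketch of the skip-gram (center, context) pair generation it performs internally over tokenized text:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs as consumed by a
    skip-gram Word2Vec model: each token predicts every neighbor
    within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# A toy sentence from the domain; gene/mutation tokens co-occur with
# words describing their effect, which is what makes the embeddings useful.
pairs = skipgram_pairs(["BRCA1", "V1736A", "impairs", "function"], window=1)
```

Because mutation names repeatedly co-occur with effect words like “impairs” or “activates” across the corpus, their learned vectors end up encoding information about their effects.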

In preliminary experiments, the model showed positive results, successfully distinguishing oncogenic mutations from normal ones. We then headed to the Kaggle competition “Personalized Medicine: Redefining Cancer Treatment” for a more controlled environment.

Kaggle Competition

In June 2017, MSKCC launched a Kaggle competition named “Personalized Medicine: Redefining Cancer Treatment”. Participants were asked to predict the effect of a genetic mutation given relevant documents. We collaborated with domain experts to participate in this competition.

To get a basic idea of the documents, we use the Doc2Vec model implemented in gensim to obtain a vector space embedding of each document. The embeddings are then projected into two-dimensional space using a PCA transformation. The resulting plot shows that documents from different classes can be roughly separated by their content.


Doc2Vec embeddings with PCA transformation into 2D space


Solution to small-data problem

The dataset is relatively small, with about 3,000 entries; we therefore focus on feature engineering and keep our model simple to avoid overfitting. We extract classical features, such as tf-idf values, along with engineered features based on observation and domain knowledge. Keywords suggested by experts are used to extract key paragraphs, the places where human experts pay the most attention when determining the effect of a mutation.
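We computed tf-idf values with standard tooling; a minimal sketch of the computation (one common weighting variant, not necessarily the exact formula in our feature pipeline) looks like this:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute tf-idf weights for a list of tokenized documents.
    Returns one {term: weight} dict per document, where
    weight = (term frequency) * log(N / document frequency)."""
    n = len(docs)
    df = Counter()                      # in how many docs each term appears
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["mutation", "oncogenic"], ["mutation", "benign"]]
w = tfidf(docs)
# "mutation" appears everywhere, so its weight is 0; rarer terms score higher.
```

Terms that appear in every document (like “mutation” in this corpus) get zero weight, while class-discriminating terms such as “oncogenic” are up-weighted, which is exactly why tf-idf works as a simple classification feature here.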


An example of a key paragraph (marked in yellow)



The powerful XGBoost model is used as the classifier here. XGBoost has won practically every competition in the structured-data category over the last two years. In addition to its strong modeling ability, its regularization is also well-suited to this dataset.
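For a small dataset like this, the regularization knobs are what matter most. The values below are an illustrative configuration in XGBoost’s parameter naming, not our tuned submission parameters:

```python
# Hypothetical XGBoost parameters emphasizing regularization for a
# ~3,000-row, 9-class problem. Values are illustrative only.
xgb_params = {
    "objective": "multi:softprob",  # per-class probabilities for 9 classes
    "num_class": 9,
    "max_depth": 4,                 # shallow trees resist overfitting
    "eta": 0.05,                    # small learning rate, more rounds
    "subsample": 0.8,               # row subsampling per tree
    "colsample_bytree": 0.8,        # feature subsampling per tree
    "reg_alpha": 1.0,               # L1 regularization on leaf weights
    "reg_lambda": 2.0,              # L2 regularization on leaf weights
    "eval_metric": "mlogloss",
}
```

Shallow trees, subsampling, and L1/L2 penalties all trade a little training fit for test-set stability, which is the right trade on a few thousand examples.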


We obtained 75% accuracy among 9 classes on the test set. The competition host also held a second stage with a very different test set. In that stage, the training/testing set mismatch was so significant that it made all participants’ classifiers nearly unusable.


The problem has not been well defined: biases in the annotation process and ambiguity between classes remain unresolved. But the result shows that, with a carefully defined target, data-driven methods can be utilized to classify the effects of genetic mutations.


Featured image by Dave Fayram / CC BY