
ptt.ai, open source blockchain for AI Data Justice

[ Chinese-language report by Crossing (換日線) ]

The beginning of the “Data Justice” movement

By collaborating with online citizens and social science workers in Taiwan, Taiwan AILabs promotes “Data Justice” under the following principles:

  1. Prioritize Privacy and Integrity, reviewing applications in good faith before data collection
    • Beyond existing privacy protection acts, review tech giants for potential abuse of a monopoly position that forces users to give up their privacy, or for misuse of user content and data for purposes other than those consented to. In particular, organizations that have become market monopolies should be reviewed regularly by the local administration to determine whether data is being abused while users give up their privacy unwillingly.
  2. Users’ data and activities belong to users
    • Platforms should remain neutral to prevent the misuse of user data and user-created content.
  3. Publicly collected data should be open for public research
    • The government organization that holds the data is responsible for its openness, provided privacy and integrity are secured. Examples include health insurance data for public health research and smart city data for traffic research.
  4. Regulate mandatory data openness
    • For data critical to major public welfare that is controlled by a private monopoly, the administration should be empowered to mandate data openness.
    • For example, Taipower’s electricity usage data in Taiwan.

“Monopoly now is worse than the oil monopoly”

In 1882, the American oil magnate John D. Rockefeller founded the Standard Oil Trust, uniting some 40 oil-related companies to control prices. In 1890, the U.S. government passed the Sherman Antitrust Act and later sued Standard Oil to break up the monopoly. Antitrust laws were formulated to ensure fair trade and fair competition and to prevent price manipulation, and governments around the world followed with anti-monopoly laws of their own. In 1984, the telecom giant AT&T was split into several companies under antitrust law. Microsoft was sued for bundling Internet Explorer with its operating system, a case settled in 2001.

In 2003, the Network Neutrality principle was proposed to require ISPs (Internet Service Providers) to treat all data on the Internet the same. The FCC (Federal Communications Commission) successfully stopped Comcast, AT&T, Verizon and other giants from slowing down or throttling traffic at the application or domain level. Apple FaceTime, Google’s YouTube and Netflix all benefited from the principle. A decade later, the oil and ISP companies are no longer among the top ten most valuable companies in the world; instead, the Internet companies protected by Network Neutrality have become the new giants. In the US market, these most valuable companies dominate many segments. As of February 2018, Apple held about 50% of the smartphone market, Google handled more than 60% of search traffic, and Facebook controlled nearly 70% of social traffic. Together, Facebook and Google controlled 73% of the online ads market, and Amazon is on its way to capturing 50% of online shopping revenue. In China the concentration is even more extreme: Alipay (owned by Alibaba) and WeChat Pay (owned by Tencent) together account for about 90% of China’s payment market.

When data becomes a weapon, innovators and users become meat on the chopping block

After a series of AI breakthroughs in the 2010s, big data became as important as crude oil. In the Internet era, users grant Internet companies permission to collect their personal data in exchange for the convenience of connecting with credible users and content. For example, a magazine publishes articles on Facebook because Facebook lets users subscribe to them, and the publisher can manage its relationship with subscribers through the messenger system. The recommendation system ranks users and the content they publish. All of these free services are sponsored by advertisements, which cover the cost of Internet hosting and traffic. This model encouraged more users to join the platform, and the users and content accumulated on the platform in turn attracted even more users. In the 4G mobile era, users are always online, pushing data aggregation to a whole new level.

After waves of mergers and acquisitions among Internet companies, a few companies now dominate users’ daily data. New initiatives can no longer reach users simply by launching a new website or an app. Internet giants, on the other hand, can easily ship a copycat of an innovation and leverage their traffic, funding and data to take over the territory. Startups have little choice but to be acquired or to burn out under unfair competition. There are fewer and fewer stories of innovation from garages, and more and more stories of tech giants copying startup ideas before they have even taken shape. A widely quoted statement in China puts it this way: “Be acquired or die; no new startup will ever get past today’s giants.” The monopoly also limits users’ choices: if a user does not consent to a platform’s data collection policy, there is usually no alternative.

Net Neutrality repealed, giants eat the world

Nasim Aghdam’s anger at YouTube casts a nightmarish shadow over how the platform deals with creators and advertisers. She opened fire at YouTube’s headquarters, injuring three people, before killing herself. At the beginning of the Internet era, innovative content creators could be reasonably rewarded for their own creations. After the platforms became monopolies, however, content providers found their creations ranked by opaque algorithms that pushed them farther and farther away from their loyal subscribers. Before subscribers can reach the content, poor-quality advertising and fake news stand in the way. If a publisher wants to retain its original reach, it must also pay for advertising; suddenly, reputable content providers are being charged to reach their own loyal subscribers. Even worse, their subscribers’ information and behavior are consumed by the platform’s machine-learning algorithms to serve targeted ads. At the same time, the platform does not effectively screen advertisers, so low-quality fake news and fake ads are served, and the system has been exploited for scams and election manipulation. After the Facebook scandal, users discovered that their own private data had been used by analysis tools to target their minds. Yet during the #deletefacebook movement, users found no alternative platform because of the giants’ monopoly: their friends and audiences are all on the platform.

In December 2017, the FCC voted to repeal the Net Neutrality rules, reasoning that the US had failed to achieve Net Neutrality and that ISPs were not the only ones to blame. A decade on, the Internet companies that benefited from Net Neutrality are now the monopoly giants, and Net Neutrality was never applied to their private ranking and censorship algorithms. Facebook, for example, offers mobile access to selected sites on its platform at different data charges, a practice widely panned for violating net neutrality principles; it is still active in 63 other countries around the world. The situation is getting worse in the era of AI. Tech giants have leveraged their data power and stepped into the automotive, medical, home, manufacturing, retail, and financial sectors. Through acquisitions, the giants rapidly accumulate new types of vertical data and force traditional industries to give up their data ownership. Within a decade, traditional industries will face an even larger and smarter technology monopoly than the ISPs or oil companies ever were.

Taiwan’s experience may mitigate the global data monopoly

Starting from the root cause, from the vertical point of view, the user who contributes the data is motivated by trust in their friends or in a reputable content provider. In exchange for convenience and better service, the user consents to the collection of their private data and grants the platform the right to analyze it further. The user who contributes the content consents to publishing their creation on the platform because the audience is already there. The platform thus ends up owning the power over data and content that originally belonged to the users and publishers. Citing privacy, safety and convenience, the platform then prevents other platforms or users from consuming the data. Repeated over time, this produces an exclusive platform for users and content providers.

From the horizontal point of view, in order to reach users and obtain data and traffic, startups sign unfair agreements with the platform. In the end, good innovations are usually swallowed by the platform, because the platform also owns the data and traffic those innovations depend on. The platform therefore grows larger and larger, either by acquiring the good innovations or by copying them.

In order to break this vicious cycle and create a fair competitive environment for AI research, Taiwan AILabs presented at the Taipei Global Smart City Expo on March 27, 2018, and joined a panel at the Taiwan-Germany 2018 Global Solutions Workshop on March 28 with visiting experts and scholars on data policy making. Taiwan AILabs shared Taiwan’s unique experience with Data Justice, and the discussions surfaced opportunities that can potentially break the cycle.

The opportunities come from the following observations in Taiwan. The mainstream online social network platforms around the world are provided by private companies and optimized for advertising revenue. Taiwan, by contrast, has a mature network of users, open-source contributors and open-data campaigns; “Internet users” in Taiwan are closer to online citizens. The Taiwanese Internet platform PTT (ptt.cc), for example, is not run for profit, and its users elect the board managers directly. Over the years this culture has not cooled down, and PTT still dominates. Because of its equality of voice, it is difficult to manipulate through advertising money, and fake news and fraud are quickly exposed by online evidence. Compared with Facebook, PTT is more of a major platform for public opinion in Taiwan. Through a collaboration between PTT and Taiwan AILabs, it now has an AI news writer that reports news from its users’ activities; the AI-based news writer can minimize editorial bias.

g0v.tw is another non-profit community in Taiwan focusing on civic science and technology. It promotes the transparency and openness of government organizations through hackathons, and has collaborated with the government, academia, non-governmental organizations, and international organizations on opening public data through open-source collaboration in various fields.

Introducing the ptt.ai project: using blockchain for “Data Justice” in the AI era

PTT is Taiwan’s most influential online platform and has been running for 23 years. It has its own digital currency (the P coin), instant messaging, e-mail, user accounts, elections, and administrators elected by users. However, the services hosting the platform remain relatively centralized.

In the past, users chose a trusted platform to get trusted information, and in exchange for convenience and free hosting, users and content providers consented to unfair data collection. Blockchain technology offers a new direction that avoids centralized data storage. A blockchain can certify users and content through its chain of trust: the credit system is not built on top of a single owner, and the content storage system is built on the chain as well. This avoids handing control to a single organization that then becomes a super power. A toy illustration of the chain-of-trust idea follows.
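The sketch below is only an illustration of the general idea, not the actual ptt.ai implementation: each post is hash-linked to the previous one and signed by its author, so no single operator can silently rewrite the history of content. The HMAC “signature” here is a deliberate simplification standing in for a real public-key signature.

```python
# Minimal sketch (not the ptt.ai implementation): a hash-linked chain of
# author-signed posts, illustrating how content can be certified without
# a central platform holding the only authoritative copy.
import hashlib, hmac, json, time

def sign(secret: bytes, payload: bytes) -> str:
    # Simplification: HMAC stands in for a real public-key signature.
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def make_post(chain, author, secret, content):
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"author": author, "content": content,
            "ts": time.time(), "prev": prev_hash}
    payload = json.dumps(body, sort_keys=True).encode()
    record = {**body,
              "sig": sign(secret, payload),
              "hash": hashlib.sha256(payload).hexdigest()}
    chain.append(record)
    return record

def verify(chain, secrets):
    # Any tampering with content or ordering breaks the hash/signature links.
    prev = "0" * 64
    for rec in chain:
        body = {k: rec[k] for k in ("author", "content", "ts", "prev")}
        payload = json.dumps(body, sort_keys=True).encode()
        ok = (rec["prev"] == prev
              and rec["hash"] == hashlib.sha256(payload).hexdigest()
              and hmac.compare_digest(rec["sig"],
                                      sign(secrets[rec["author"]], payload)))
        if not ok:
            return False
        prev = rec["hash"]
    return True

chain, secrets = [], {"alice": b"alice-key"}
make_post(chain, "alice", secrets["alice"], "hello ptt.ai")
assert verify(chain, secrets)
```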

Ptt.ai is a research project that starts by learning from PTT’s data economy, combines it with the latest blockchain encryption technology, and implements it in a decentralized way.

The mainstream social network platforms in China and the United States have created a new data super power out of what users and their friends create, and they will keep collecting more information by horizontally merging with industries that hold unequal data power. The launch of ptt.ai is an attempt to think about data ownership in a different direction. We hope to study how to upgrade the PTT system for the era of AI, and to use the platform as a basis for enabling more industries to cooperate on data platforms. It gives data control back to users and mitigates the data monopoly that is forming. Ptt.ai will also collaborate with leading players in the automotive, medical, smart home, manufacturing, retail, and financial sectors who are interested in creating open community platforms.

Currently the technology experiments run on an independent platform and do not involve the operation or migration of the current PTT. Please follow the latest news about ptt.ai at http://ptt.ai .

 

[2018/10/24 Updates]:

The open source project is on github now: https://github.com/ailabstw/go-pttai

[2019/4/2 Updates]:

More open source projects are on github now:

 


Humanity with Privacy and Integrity is Taiwan AI Mindset

The 2018 Smart City Summit & Expo (SCSE), along with three sub-expos, opened at the Taipei Nangang Exhibition Center on March 27th with 210 exhibitors from around the world, showing a diversity of innovative applications and solutions for building smart cities. Taiwan is known for its friendly and healthy business environment, ranked 11th by the World Bank. With more than 40 years of ICT manufacturing and top-tier embedded systems, Taiwanese companies form a vigorous ecosystem. With this openness toward innovation, 17 out of 22 Taiwanese cities have reached the top rankings of the Intelligent Community Forum (ICF).

Ethan Tu, Taiwan AILabs founder, gave a talk on “AI in Smart Society for City Governance” and laid out Taiwan’s AI position: smart cities are for “humanity with privacy and integrity,” not only “safety and convenience.” He said, “AI in Taiwan is for humanity. Privacy and integrity will also be protected.” The maturity of crowd participation, transparency and the open-data mindset are the key assets that let Taiwan’s smart cities deliver humanity with privacy and integrity. Taiwan AILabs pointed to its socially participated, AI-collaborated, open-source news site http://news.ptt.cc as an example; city governments now consume the news to detect social events happening in Taiwan in real time, relying on the AI news service’s robustness and reliability at scale. AILabs collaborated with Tainan City on an AI drone project to emulate “Beyond Beauty” director Chi Po-lin, who died in a helicopter crash. AILabs also established the “Taipei Traffic Density Network (TTDN),” which supports real-time traffic detection and prediction for Taipei City while keeping citizens’ privacy secure: no person or car can be identified without necessity.

The Global Solutions (GS) Taipei Workshop 2018, themed “Shaping the Future of an Inclusive Digital Society,” took place at the Ambassador Hotel in Taipei on March 28, 2018. It was co-organized by the Chung-Hua Institution for Economic Research (CIER) and the Kiel Institute for the World Economy. The “Using Big Data to Support Economic and Societal Development” panel was hosted by Dennis Görlich, Head of the Global Challenges Center at the Kiel Institute for the World Economy. Chien-Chih Liu, founder of the Asia IoT Alliance (AIOTA); Thomas Losse-Müller, Senior Fellow at the Hertie School of Governance; and Reuben Ng, Assistant Professor at the Lee Kuan Yew School of Public Policy, National University of Singapore, joined the discussion. Big data was identified as the oil for AI and economic growth. Ethan shared his vision in the panel: “We don’t have to sacrifice privacy for safety or convenience. On the contrary, the Facebook movement is a good example that tech giants who overlook privacy and integrity will be dumped.”

Ethan explained three key principles from Taiwanese society on big data collection. These principles already exist, contributed by Taiwan’s mature open Internet communities and movements, and AILabs will promote them as fundamental guidance for data collection on medical records, government records, open communities and so on.

1. Data produced by users belongs to users. Policy makers shall ensure that no single authority, such as a social media platform, is so dominant over users that it can force them to give up data ownership.

2. Data collected by public agencies belongs to the public. Policy makers shall ensure that data collected by public agencies comes with a roadmap for opening it to the general public for research. g0v.tw, for example, is an NPO driving the open-data movement.

3. “Net Neutrality” applies not only to ISPs but also to social media and content hosting services. Ptt.cc, for example, persists in equality of voice without ads; over time, that equality of voice has overcome fake news because the evidence stands out.

“Humanity is the direction for AILabs. Privacy and integrity are what we insist on,” said Ethan.

Smart City workshop with Amsterdam Innovation Exchange Lab from the Netherlands

SITEC from Malaysia visiting AILabs.tw


Meet JARVIS – The Engine Behind AILabs

In Taiwan AI Labs, we are constantly teaching computers to see the world, hear the world, and feel the world so that computers can make sense of them and interact with people in exciting new ways. The process requires moving a large amount of data through various training and evaluation stages, wherein each stage consumes a substantial amount of resources to compute. In other words, the computations we perform are both CPU/GPU bound and I/O bound.

This imposes a tremendous challenge in engineering such a computing environment, as conventional systems are either CPU bound or I/O bound, but rarely both.

We recognized this need and crafted our own computing environment from day one. We call it Jarvis internally, named after the system that runs everything for Iron Man. It primarily comprises a frontdoor endpoint that accepts media and control streams from the outside world, a cluster master that manages bare metal resources within the cluster, a set of streaming and routing endpoints that are capable of muxing and demuxing media streams for each computing stage, and a storage system to store and feed data to cluster members.

The core system is written in C++ with a Python adapter layer to integrate with various machine learning libraries.

 

 

The design of Jarvis emphasizes real-time processing capability. The core of Jarvis lets data streams flow between computing processors with minimal latency, and each processing stage is engineered to achieve a required throughput per second. For a long, complex procedure, we break it down into smaller sub-tasks and use Jarvis to form a computing pipeline that achieves the target throughput. We also use muxing and demuxing techniques to process portions of a data stream in parallel, further increasing throughput without incurring too much latency. Once the computational tasks are defined, the blueprint is handed over to the cluster master, which allocates the underlying hardware resources and dispatches tasks to run on them. The allocation algorithm takes special care with GPUs, as they are scarce resources that cannot be virtualized at the moment. A simplified sketch of the pipelining idea appears below.
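Jarvis itself is written in C++ with a Python adapter layer; the snippet below is only a toy Python sketch of the pipelining idea described above, with illustrative stage names and functions: stages connected by queues, and one stage demuxed across several workers to raise throughput.

```python
# Toy sketch of a staged pipeline (not Jarvis itself): each stage pulls from
# an input queue, applies its function, and pushes downstream. The inference
# stage is demuxed across 4 workers; a real muxer would also restore ordering.
import queue
import threading

def stage(fn, inq, outq, workers=1):
    def loop():
        while True:
            outq.put(fn(inq.get()))
    for _ in range(workers):
        threading.Thread(target=loop, daemon=True).start()

decode_q, infer_q, result_q = queue.Queue(), queue.Queue(), queue.Queue()
stage(lambda frame: frame.lower(), decode_q, infer_q)                  # decode stage
stage(lambda frame: f"label({frame})", infer_q, result_q, workers=4)   # demuxed inference stage

frames = ["Frame-1", "Frame-2", "Frame-3"]
for f in frames:
    decode_q.put(f)
for _ in frames:            # collect one result per input frame
    print(result_q.get())
```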

Altogether, Jarvis is a powerful yet agile platform for machine learning tasks. It handles a huge amount of work with minimal overhead. Moreover, Jarvis can be scaled out horizontally with little effort, simply by adding new machines to the cluster. It suits our needs well. We have re-engineered Jarvis several times in the past few months and will continue to evolve it. Jarvis is our engine for moving fast in this fast-changing AI field.

 

Featured image by Nathan Rupert / CC BY

AI carries the torch of Malaria diagnosis in Taiwan

Facing a shortage of medical laboratory technologists, our AI is learning from the experience of medical experts at Taiwan CDC to bring expert-level precision and speed to the malaria diagnostic process in Taiwan.

Taiwan has been on the list of malaria-eradicated regions since 1965. Since then there have been around 10 to 30 malaria cases each year, all of them imported, as reported by the Taiwan CDC (Centers for Disease Control). With the declining number of cases, fewer medical laboratory technologists specialize in malaria diagnosis, and training new experts is becoming increasingly difficult.

The most experienced medical laboratory technologist at Taiwan CDC, who has been in charge of Taiwan’s malaria diagnosis for years, is said to be retiring soon. She has never had a single misdiagnosed case. She is concerned that her experience and knowledge may not be passed on to future generations.

Thanks to recent advances in artificial intelligence, computers now have the potential to learn from her medical expertise and play a pivotal role in the malaria diagnostic process. We are getting the ball rolling by collaborating with Taiwan CDC on the Malaria Diagnostics Project, leveraging AI to improve the diagnosis of malaria.

 


How AI is Transforming Personalized Treatment: scientists are training AI to classify the effects of genetic mutations

Recently, utilizing personal information to tailor personalized treatment has gained a lot of attention. In the case of cancer, the disease begins when one or more genes in a cell are mutated. This makes each tumor distinct even if it arises from the same cell type, so the result of a treatment may vary across patients. By identifying the genetic mutations in a patient, doctors can find out what causes a tumor and give accurate treatment.

Identifying genetic mutations is becoming easier, but interpreting them remains difficult. For breast cancers, there are about 180 oncogenic mutations that contribute to tumor growth, and distinguishing them from normal variants requires examining the literature carefully. While thousands of publications study the effects of genetic mutations, this information cannot be used efficiently due to the lack of well-curated databases. Building such a database is a costly process that requires experts to review clinical literature manually. According to Memorial Sloan Kettering Cancer Center (MSKCC), its annotation committee reviewed data from different sources and spent two months annotating 150 genetic mutations, while 79 million mutations have been identified by the 1000 Genomes Project. Furthermore, the number of publications is growing exponentially, so an automated classification process is in demand. To speed up the curation of mutation databases, we utilize a machine’s ability to read and comprehend, which can efficiently review publications and classify the effect of genetic mutations.

From reading to comprehending

Our goal is to train a machine that can classify mutations the way human experts do. Instead of training a general comprehension model, we want it to imitate the decision-making procedure of experts, since the amount of annotated data is insufficient.

Extract key paragraphs

Reading a 10-page paper from beginning to end is time-consuming. People tend to skim the general description and read the key paragraphs carefully. These paragraphs play an important role, since a sentence that mentions a mutation name may directly state the effect of the mutation. To imitate this behavior, we find keywords in the text and extract their surrounding context as key paragraphs for further investigation, as sketched below.
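A simplified sketch of this step, assuming paragraphs are separated by blank lines; the keyword list here is illustrative, whereas the real list was curated with domain experts.

```python
import re

def extract_key_paragraphs(text, keywords, window=1):
    """Return paragraphs that mention a keyword, plus `window` paragraphs of
    surrounding context, mimicking how an expert skims to the key passages."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    pattern = re.compile("|".join(map(re.escape, keywords)), re.IGNORECASE)
    keep = set()
    for i, para in enumerate(paragraphs):
        if pattern.search(para):
            keep.update(range(max(0, i - window),
                              min(len(paragraphs), i + window + 1)))
    return [paragraphs[i] for i in sorted(keep)]

# Illustrative document and keywords only.
doc = ("Background on BRCA1.\n\n"
       "The V600E mutation activates the MAPK pathway.\n\n"
       "Methods and materials.")
print(extract_key_paragraphs(doc, ["V600E", "mutation"]))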

Vector representation

It is very common for words or documents to be encoded into vector-space embeddings before being processed by machine learning models. Recent models do a great job of finding representations for words. We use the Word2Vec model to extract vector representations of gene names and mutation names, which are expected to be informative about their effects; a small sketch follows.
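A minimal sketch of this step using gensim’s Word2Vec (gensim 4.x API). The toy corpus and hyperparameters are placeholders, not the settings we actually used; in practice the sentences come from the extracted key paragraphs, with gene and mutation names kept as single tokens.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus standing in for key paragraphs from the literature.
sentences = [
    ["brca1", "v600e", "activates", "tumor", "growth"],
    ["v600e", "is", "an", "oncogenic", "mutation"],
    ["benign", "variant", "shows", "no", "functional", "effect"],
]
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=20)

vec = model.wv["v600e"]          # vector representation of a mutation name
print(vec.shape)
print(model.wv.most_similar("v600e", topn=2))
```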

In preliminary experiments, the model showed positive results: it can successfully distinguish oncogenic mutations from normal ones. We then headed to the Kaggle competition [Personalized Medicine: Redefining Cancer Treatment] for a more controlled environment.

Kaggle Competition

In June 2017, MSKCC launched a Kaggle competition named “Personalized Medicine: Redefining Cancer Treatment”. The participants were asked to predict the effect of a genetic mutation given relevant documents. We collaborated with domain experts to participate in this competition.

To get a basic idea of the documents, we use the Doc2Vec model implemented in gensim to obtain a vector-space embedding of each document. The embeddings are then projected into two-dimensional space using a PCA transformation, as sketched below. The resulting plot shows that documents from different classes can be roughly separated by their content.
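A minimal sketch of this step, combining gensim’s Doc2Vec with scikit-learn’s PCA; the toy documents and parameters are placeholders for the competition’s clinical texts.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.decomposition import PCA

# Toy documents standing in for the competition's clinical literature.
docs = [
    TaggedDocument(words=["v600e", "activates", "kinase", "signaling"], tags=[0]),
    TaggedDocument(words=["benign", "variant", "no", "functional", "change"], tags=[1]),
    TaggedDocument(words=["truncating", "mutation", "loss", "of", "function"], tags=[2]),
]
d2v = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)
embeddings = [d2v.dv[i] for i in range(len(docs))]

# Project the document embeddings to 2D for visualization, as in the plot below.
coords = PCA(n_components=2).fit_transform(embeddings)
print(coords)
```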

 

Doc2Vec embeddings with PCA transformation into 2D space

 

Solution to small-data problem

The dataset is relatively small, with about 3000 entries; we therefore focus on feature engineering and keep our model simple to avoid overfitting. We extract classical features, such as tf-idf values, along with engineered features based on observation and domain knowledge. Keywords suggested by experts are used to extract key paragraphs, the places where human experts pay the most attention when determining the effect of mutations. A sketch of the feature construction follows.
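A simplified sketch of the feature construction, combining tf-idf features with keyword-count features; the keyword list and parameters are illustrative, not the exact features used in the competition.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["v600e activates kinase signaling in tumors",
         "benign variant with no functional change"]
# Illustrative keywords only; the real list came from domain experts.
expert_keywords = ["activates", "loss of function", "oncogenic"]

# Classical tf-idf features over the documents.
tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
X_tfidf = tfidf.fit_transform(texts)

# Simple engineered features: counts of expert keywords per document.
X_kw = np.array([[t.lower().count(k) for k in expert_keywords] for t in texts])

# Combined feature matrix fed to the classifier.
X = sp.hstack([X_tfidf, sp.csr_matrix(X_kw)])
print(X.shape)
```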

 

An example of a key paragraph (marked yellow)

 

 

A powerful model, XGBoost, is used as the classifier. XGBoost has won practically every competition in the structured-data category over the last two years. In addition to its strong modeling ability, its regularization techniques are well suited to this dataset. A minimal sketch of the setup is shown below.
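A minimal sketch using XGBoost’s scikit-learn API, trained here on synthetic stand-in features with 9 classes; the hyperparameters are illustrative, not our tuned configuration.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix and the 9 effect classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 50))
y = rng.integers(0, 9, size=3000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = xgb.XGBClassifier(
    n_estimators=200, max_depth=4, learning_rate=0.1,
    subsample=0.8, colsample_bytree=0.8,   # row/column subsampling as regularization
)
clf.fit(X_tr, y_tr)
print("accuracy:", (clf.predict(X_te) == y_te).mean())  # ~chance level on random data
```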

Result

We obtained 75% accuracy among 9 classes on the test set. The competition host also held a second stage with a very different test set; in that stage, the mismatch between the training and test sets was so significant that every participant’s classifier was nearly unusable.

 

The problem itself has not been well defined: biases in the annotation process and the ambiguity between classes have not been resolved. But the result shows that, with a carefully defined target, data-driven methods can be utilized to classify the effects of genetic mutations.

 

Featured image by Dave Fayram / CC BY