In Taiwan AI Labs, we are constantly teaching computers to see the world, hear the world, and feel the world so that computers can make sense of them and interact with people in exciting new ways. The process requires moving a large amount of data through various training and evaluation stages, wherein each stage consumes a substantial amount of resources to compute. In other words, the computations we perform are both CPU/GPU bound and I/O bound.
This impose a tremendous challenge in engineering such a computing environment, as conventional systems are either CPU bound or I/O bound, but rarely both.
We recognized this need and crafted our own computing environment from day one. We call it Jarvis internally, named after the system that runs everything for Iron Man. It primarily comprises a frontdoor endpoint that accepts media and control streams from the outside world, a cluster master that manages bare metal resources within the cluster, a set of streaming and routing endpoints that are capable of muxing and demuxing media streams for each computing stage, and a storage system to store and feed data to cluster members.
The core system is written in C++ with a Python adapter layer to integrate with various machine learning libraries.
The design of Jarvis emphasizes realtime processing capability. The core of Jarvis enables data streams flow between computing processors to have minimal latency, and each processing stage is engineered to achieve a required throughput per second. For a long complex procedure, we break it down into smaller sub-tasks and use Jarvis to form a computing pipeline to achieve the target throughput. We also utilize muxing and demuxing techniques to process portions of the data stream in parallel to further increase throughput without incurring too much latency. Once the computational tasks are defined, the blue-print is then handed over to cluster master to allocate underlying hardware resources and dispatch tasks to run on them. The allocation algorithm has to take special care about GPUs, as they are scarce resources that cannot be virtualized at the moment.
Altogether, Jarvis becomes a powerful yet agile platform to perform machine learning tasks. It handles huge amount of work with minimum overhead. Moreover, Jarvis can be scaled up horizontally with little effort by just adding new machines to the cluster. It suits our needs pretty well. We have re-engineered Jarvis several times in the past few months, and will continue to evolve it. Jarvis is our engine to move fast in this fast-changing AI field.