From logistic regression to self-driving cars: Chances and challenges for machine learning in highly automated driving (part one)
SORIN MIHAI GRIGORESCU, MARKUS GLAAB, AND ANDRÉ ROSSBACH, ELEKTROBIT AUTOMOTIVE | JULY 30, 2018
In part one of this three-part series, the authors investigate the drivers behind and potential applications of machine learning technology in highly automated driving scenarios. Part two defines the theoretical background of machine learning technology, as well as the types of neural networks available to automotive developers. Part three evaluates these options in the context of functional safety requirements.
Machine learning has been one of the hottest topics in research and industry over the last couple of years. Renewed attention has resulted from the latest advancements in computational performance and algorithms compared to the advent of machine learning decades ago.
Recent impressive results in artificial intelligence have been facilitated by machine learning, and particularly by deep learning solutions. Applications include natural language processing (NLP), personal assistants, the victory of AlphaGo over a human being, and the achievement of human-level behavior in learning to play Atari games.
Considering that machine learning and deep learning enable such impressive results when tackling extremely complex problems, it is obvious that researchers and engineers have also considered applying them to highly automated driving (HAD) scenarios in self-driving cars. The first promising results have been achieved in this area with NVIDIA's DAVE-2, Comma.ai, Google's self-driving car, and Tesla. Machine learning and deep learning approaches have resulted in initial prototypes, but the industrialization of such functionalities poses additional challenges with regard to, for example, essential functional safety considerations.
This article aims to contribute to ongoing discussions about the role of machine learning in the automotive industry and to highlight the importance of this topic in the context of self-driving cars. In particular, it aims to increase understanding of the capabilities and limitations of machine learning technologies.
First, we discuss the design space and architectural alternatives for machine learning-based highly automated driving in the context of the EB robinos reference architecture. Two selected use cases currently in research and development at Elektrobit are then presented in detail.
The second installment provides theoretical background to machine learning and deep neural networks (DNN) that provide the basis for deriving criteria used to select a machine learning technology according to a given task. Finally, the third installment discusses verification and validation challenges that affect functional safety considerations.
Machine learning and highly automated driving
It is a complex and non-trivial task to develop the highly automated driving functionalities that lead to self-driving cars. Engineers typically tackle such challenges using the principle of divide and conquer. This is for a good reason: A decomposed system with clearly defined interfaces can be tested and verified much more thoroughly than a single black box.
Our approach to highly automated driving is EB robinos, depicted in Figure 1. EB robinos is a functional software architecture with open interfaces and software modules that permits developers to manage the complexity of autonomous driving. The EB robinos reference architecture integrates components following the "Sense, Plan, Act" decomposition paradigm. Moreover, it makes use of machine learning technology within its software modules in order to cope with highly unstructured real-world driving environments. The subsections below contain selected examples of the technologies that are integrated within EB robinos.
Figure 1. Open EB robinos reference architecture.
In contrast, end-to-end deep learning approaches also exist, which span everything from sense to act (Bojarski et al. 2016). However, with respect to the handling and training of corner cases and rare events, and with regard to the exponential amount of training data necessary, a decomposition approach (i.e., semantic abstraction) is considered more reasonable (Shalev-Shwartz et al. 2016).
Nevertheless, even if the decomposition approach is followed, a decision is required about which parts are better tackled in isolation and which in combination with others. It is also necessary to determine whether a machine learning approach can be expected to outperform a traditionally engineered algorithm for the task accomplished by a particular block. Not least, this decision may be influenced by functional safety considerations. Functional safety is a crucial element of autonomous driving, as described later in this series. Traditional software components are written on the basis of concrete requirements and are tested accordingly.
The main issues in the testing and validation of machine learning systems are their black box nature and the stochastic behavior of the learning methods: it is essentially impossible to predict in advance what internal structure the system will learn.
The criteria and theoretical background given above can provide guidance for informed decisions. Elektrobit is currently researching and developing use cases in which machine learning approaches are considered to be promising. Two such use cases are presented next. The first deals with the generation of artificial training samples for machine learning algorithms and their deployment for traffic sign recognition. The second use case describes our approach to self-learning cars. Both examples make use of current cutting-edge deep learning technology.
Use case 1: Artificial sample generation and traffic sign recognition
This project proposes a speed limit and end-of-restriction traffic sign (TS) recognition system in the context of enhancing OpenStreetMap (OSM) data used in entry-level navigation systems. The aim is to run the algorithm on a standard smartphone that can be mounted on the windshield of a car. The system detects traffic signs along with their GPS position and uploads the collected data to backend servers via the mobile data connection of the phone. The approach is divided into two main stages: detection and recognition. Detection is achieved through a boosting classifier. Recognition is performed through a probabilistic Bayesian inference framework that fuses information delivered by a collection of visual probabilistic filters. The next installment of this article contains a description of the theoretical background behind the algorithms used. Figure 2 depicts the block diagram of the traffic sign recognition (TSR) algorithm.
Figure 2: Block diagram of the smartphone-based TSR system
The color image obtained is passed to the detector in 24-bit RGB format. The detection process is carried out by evaluating the response of a cascade classifier calculated through a detection window.
This detection window is shifted across the image at different scales. The probable traffic sign regions of interest (RoI) are collected as a set of object hypotheses. From the point of view of feature extraction, the classification cascade is trained with extended local binary patterns (eLBP). Each element in the hypotheses vector is classified into a traffic sign by a support vector machine (SVM) learning algorithm.
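The sliding-window stage described above can be sketched in a few lines. The snippet below is a simplified illustration, not Elektrobit's implementation: the real detector evaluates a boosted cascade over eLBP features, whereas `score_fn` here is a trivial stand-in (mean intensity), and the window size, stride, and scales are assumed values.

```python
import numpy as np

def sliding_window_rois(image, win=24, stride=8, scales=(1.0, 0.75, 0.5)):
    """Collect candidate regions of interest (RoIs) by shifting a
    detection window across the image at several scales."""
    rois = []
    h, w = image.shape[:2]
    for s in scales:
        step = max(1, int(stride * s))
        size = int(win * s)
        for y in range(0, h - size + 1, step):
            for x in range(0, w - size + 1, step):
                rois.append((x, y, size, size))
    return rois

def score_fn(patch):
    # Stand-in for the boosted cascade response; here simply mean intensity.
    return float(patch.mean())

img = np.zeros((64, 64), dtype=np.uint8)
img[20:40, 20:40] = 255  # bright square standing in for a sign
candidates = sliding_window_rois(img)
best = max(candidates,
           key=lambda r: score_fn(img[r[1]:r[1]+r[3], r[0]:r[0]+r[2]]))
```

In a full pipeline the highest-scoring windows would then be passed on as the hypotheses vector for SVM classification.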
Traffic sign recognition methods rely on manually labelled traffic signs, which are used to train both the detection and the recognition classifiers. The labelling process is tedious and prone to error due to the variety of traffic sign templates used in different countries. Figure 3 shows the differences in some speed limit signs.
Figure 3. Traffic sign templates for different countries [source: www.wikipedia.org]. (a) Vienna convention. (b) United Kingdom. (c) Alternative Vienna convention. (d) Ireland. (e) Japan. (f) Samoa. (g) United Arab Emirates and Saudi Arabia. (h) Canada. (i) United States. (j) United States (Oregon variant).
Specific training data for each country is required for the traffic sign recognition method to perform well. It is time-consuming to create enough manually labelled traffic signs because position, illumination, and weather conditions have to be taken into account.
Elektrobit therefore has created an algorithm that generates training data automatically from a single artificial template image to overcome the challenge of manually annotating large numbers of training samples. Figure 4 shows the structure of the algorithm.
Figure 4. Block diagram of the artificial samples generation algorithm for machine learning-based recognition systems.
This approach provides a method for generating artificial data that is used in the training stages of machine learning algorithms. The method uses the reduced dataset of real and generic traffic sign image templates for each country to output a collection of images.
The features of these images are artificially defined by a sequence of image template deformation algorithms. The artificial images thus obtained are evaluated against a reduced set of real-world images using kernel principal components analysis (KPCA). The artificial data set is suitable for the training of machine learning systems, in this particular case for traffic sign recognition, when the characteristics of the generated images correspond to those of the real images.
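The deformation chain can be illustrated with a minimal sketch. This is not the actual Elektrobit algorithm: the real pipeline applies a richer sequence of deformations (perspective, blur, weather effects) and validates the output against real images with KPCA, while the snippet below only varies brightness, noise, and translation, with all parameter ranges assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def deform(template, brightness=0.0, noise_std=0.0, shift=(0, 0)):
    """Apply a simple chain of deformations to a template image:
    brightness change, additive noise, and translation."""
    img = template.astype(np.float32) + brightness
    img = img + rng.normal(0.0, noise_std, template.shape)
    img = np.roll(img, shift, axis=(0, 1))
    return np.clip(img, 0, 255).astype(np.uint8)

def generate_samples(template, n=10):
    """Generate n artificial training samples from one template."""
    return [
        deform(
            template,
            brightness=float(rng.uniform(-40, 40)),
            noise_std=float(rng.uniform(0, 10)),
            shift=(int(rng.integers(-2, 3)), int(rng.integers(-2, 3))),
        )
        for _ in range(n)
    ]

template = np.full((32, 32), 128, dtype=np.uint8)  # stand-in sign template
training_set = generate_samples(template, n=20)
```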
Elektrobit replaced the boosting and SVM classifiers with a deep region-based detection and recognition convolutional neural network to improve the precision of the original traffic sign recognition system. The network is deployed using Caffe (Jia et al. 2014), a deep neural network library developed at Berkeley and supported by NVIDIA. Caffe is a pure C++/CUDA library with Python and Matlab interfaces. In addition to its core deep learning functionalities, Caffe also provides reference deep learning models that can be used directly in machine learning applications. Figure 5 shows the Caffe net structure used for traffic sign detection and recognition. The colored blocks represent convolution (red), pooling (yellow), activation (green), and fully connected network layers (purple).
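To make the four layer types concrete, the toy forward pass below chains them on a tiny input. This is only a NumPy illustration of what each block in the diagram computes, not the actual Caffe network; the input, kernel, and weights are arbitrary assumed values, and the "convolution" is cross-correlation, as in most DNN libraries.

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D cross-correlation of input x with kernel k."""
    kh, kw = k.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling; trailing rows/cols are cropped."""
    h, w = x.shape
    return x[:h - h % size, :w - w % size].reshape(
        h // size, size, w // size, size).max(axis=(1, 3))

relu = lambda x: np.maximum(x, 0)  # activation layer

x = np.arange(36, dtype=np.float32).reshape(6, 6)  # toy input "image"
k = np.array([[1., 0.], [0., 1.]])                 # toy convolution kernel
w_fc = np.ones(4)                                  # toy fully connected weights

features = relu(max_pool(conv2d(x, k)))  # conv -> pool -> activation
score = float(features.flatten() @ w_fc) # fully connected output
```

A real network stacks many such layers with learned kernels and weights; the principle per layer is the same.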
Figure 5. Deep region-based detection and recognition convolutional neural network in Caffe.
Use case 2: Learning how to drive
The revolution in deep learning has recently increased attention on another paradigm, referred to as reinforcement learning (RL). In RL, an agent learns on its own how to perform certain tasks by means of a reward system. The methodology falls into the category of semi-supervised learning: although, in contrast with supervised learning, no labeling of the input data is required, designing the reward system does require domain-specific knowledge. The recent interest in RL is due mainly to the seminal work of the DeepMind team, which managed to combine RL with a deep neural network capable of learning the action value function (Mnih et al. 2016). Their system was able to learn to play several Atari games at human-level capacity.
We constructed the deep reinforcement learning system, shown in Figure 6, in order to experiment safely with autonomous driving learning. This system uses the TORCS open-source race simulator (Wymann et al. 2014). TORCS is widely used in the scientific community as a highly portable multi-platform car-racing simulator. It runs on Linux (all architectures, 32- and 64-bit, little and big endian), FreeBSD, OpenSolaris, MacOSX, and Windows (32- and 64-bit). It features many different cars, tracks, and opponents to race against. We can collect images for object detection as well as critical driving indicators from the game engine. These indicators include the speed of the car, the relative position of the ego-car to the center line of the road, and the distances to the cars in front.
Figure 6. Deep reinforcement learning architecture for learning how to drive in a simulator.
The goal of the algorithm is to self-learn driving commands by interacting with the virtual environment. A deep reinforcement learning paradigm was used for this purpose, in which a deep convolutional neural network (DNN) is trained by reinforcing actions a that provide a positive reward signal r(s', a). The state s is represented by the current game image as seen in the simulator window. There are four possible actions: accelerate, decelerate, turn left, and turn right.
The DNN computes a so-called Q-function, which predicts the optimal action a to be executed for a specific state s. In other words, the DNN calculates a Q-value for each state-action pair. The action with the highest Q-value will be executed, which moves the simulator environment to the next state, s'. In this state, the executed action is evaluated by means of the reward signal r(s', a).
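The select-act-reward loop can be sketched with a tabular stand-in for the Q-function. In the actual system a DNN approximates Q from raw game images; the snippet below uses a small table instead, and the learning rate, discount factor, state count, and reward are assumed values chosen purely for illustration.

```python
import numpy as np

# Tabular stand-in for the DNN's Q-function: one row per state,
# one column per action. Actions follow the text.
ACTIONS = ["accelerate", "decelerate", "turn_left", "turn_right"]
n_states = 5
Q = np.zeros((n_states, len(ACTIONS)))

alpha, gamma = 0.5, 0.9  # learning rate and discount factor (assumed)

def best_action(s):
    """The action with the highest Q-value for state s is executed."""
    return ACTIONS[int(np.argmax(Q[s]))]

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward the reward observed
    after executing action a, plus the discounted best future value."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# Example: accelerating (action 0) in state 0 led to state 1
# without a collision, so the action receives a positive reward.
q_update(s=0, a=0, r=1.0, s_next=1)
```

After the update, "accelerate" has the highest Q-value in state 0, so it would be chosen again; a collision would instead yield a negative reward and discourage the action, mirroring the reinforcement described next.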
For example, if the car was able to accelerate without a collision, the related action that made this possible will be reinforced in the DNN; otherwise, it will be discouraged. The reinforcement is performed in the framework by retraining the DNN with the state-reward signals. Figure 7 shows the Caffe implementation of the deep reinforcement learning algorithm. The network layers have the same color-coding as in Figure 5.
Figure 7. A Caffe-based deep convolutional neural network structure used for deep reinforcement learning.