Object grasping is a major field, particularly in applied robotics, and it comes with various difficulties. Multiple objects with simple or complex shapes, as well as different colors, can be scattered on a surface in random positions and orientations. However, given the correct object positions, a robot has a high chance of grasping them. Object detection systems can determine bounding boxes of objects in images, which can help to calculate the correct object positions. Current state-of-the-art object detection systems, such as the popular Faster R-CNN and Mask R-CNN, often use multi-stage architectures. Both models utilize a region proposal network to obtain regions which are likely to contain objects. This thesis introduces and evaluates multiple architecture variations of single-stage and two-stage models. These variations also include a region proposal network, but apply it in the setting of grasping experiments. Usually, these models are trained in a supervised manner, which requires large amounts of data with ground truth information. Generating such data in a real-world environment is expensive, whereas generating it in a simulated environment is cost-efficient. Therefore, this thesis introduces a framework to generate artificial data in a simulated grasping experiment environment. This framework implements several domain randomization techniques in order to randomize the simulation environment. The training data contains only artificial images with objects of simple geometry. The results show that models trained only on these artificial images can still generalize well to images of a real environment. Furthermore, they generalize equally well to images which contain objects of complex geometry. This thesis performs ablation studies on the employed domain randomization techniques, which reveal that some techniques improve the results while others degrade them.
Benchmarks on the model variations show that significantly faster inference is possible compared to the original Faster R-CNN and Mask R-CNN, while still achieving good prediction results. This was made possible by using different configurations for the region proposal network and by introducing faster feature extraction backbone architectures.
Speaker: Ruben Bauer