Tactile perception combined with vision for successful robotic housekeeping. Part 2

Part 1. https://habr.com/ru/articles/848264/

Tactile-visual fusion architecture and stable grasping strategy

Although the proposed multimodal tactile sensor provides robots with fast and rich tactile perception, tactile perception alone is not sufficient for robots operating in complex scenarios. We therefore fuse tactile and visual perception and introduce a hybrid tactile-visual fusion architecture. This architecture integrates tactile and visual information at the data, function and decision levels, enabling robots to interact effectively with complex environments. The overall architecture of the robot is shown in Fig. 4a. We divide it into layers, starting from the bottom: the signal layer, the perception layer, the decision layer and, finally, the system layer. At the signal level, a binocular depth camera captures visual signals, and the multimodal tactile sensors described above collect interface, slip, pressure and temperature signals. At the perception level, the computer converts sensor signals into the corresponding cognitive functions: visual signals enable recognition and localization of objects, while tactile signals enable perception of an object's temperature, thermal conductivity, contact pressure, texture and sliding state upon contact. Based on this multimodal perception, the robot makes an appropriate decision and sends tasks to the actuators (a robotic arm and an automated guided vehicle (AGV)). The actuators then perform actions such as driving the AGV, approaching an object, and picking up and sorting objects with the robotic arm. Combining all these levels yields the complete tactile-visual architecture of the robotic system (the system level). Moreover, extended with additional sensors and actuators, this architecture can give robots even greater sensing and execution capabilities, allowing them to perform more complex tasks.
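
For illustration, here is a minimal Python sketch of how the four levels described above could be organized in software. All class names, fields and methods are hypothetical placeholders, not the interfaces used in the original work.

```python
from dataclasses import dataclass
import numpy as np

# --- Signal level: raw readings from the depth camera and the multimodal tactile sensor ---
@dataclass
class TactileSignal:
    interface: float      # interface (contact) channel
    slip_rate: float      # estimated sliding speed, mm/s
    pressure: float       # contact pressure
    temperature: float    # temperature at the contact point

@dataclass
class VisualSignal:
    rgb: np.ndarray       # color image from the binocular depth camera
    depth: np.ndarray     # depth map

# --- Perception level: convert signals into cognitive functions ---
class PerceptionLevel:
    def localize_objects(self, vision: VisualSignal) -> list:
        """Recognize objects and estimate their positions from RGB-D data."""
        raise NotImplementedError

    def characterize_contact(self, touch: TactileSignal) -> dict:
        """Estimate thermal conductivity, texture and slip state on contact."""
        raise NotImplementedError

# --- Decision level: choose actions and dispatch them to the actuators ---
class DecisionLevel:
    def plan(self, objects: list, contact: dict) -> list:
        """Return a task list for the AGV (approach) and the arm (grasp, sort)."""
        raise NotImplementedError
```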


Fig. 4: Architecture of the tactile-visual robot.

a Robot architecture with tactile-visual fusion, including the signal level, perception level, decision level and system level. b Tactile-visual fusion strategy for stable grasping: vision provides the grasp position, and haptic feedback control adjusts the grip force by detecting slip in real time. c Photographs of grasping a paper cup and adding water with haptic feedback. d Tactile signals during the grasping and water-adding process. First, the robot hand stably grasps the empty cup. Water is then added to the cup with haptic feedback, and the robot hand maintains a stable grip. e Photographs of grasping a paper cup and adding water without haptic feedback. f Tactile signals during the grasping and water-adding process. First, the robot hand stably grasps the empty cup. Water is then added to the cup without haptic feedback, and the cup finally slips out due to the increased weight.

Within this architecture, we also propose a tactile-visual fusion strategy that helps the robot achieve stable grasping of various objects. Because objects vary widely in shape and size, a stable grasp requires a grasping strategy tailored to the object's characteristics. Common grasping strategies are mainly divided into model-based and model-free methods. Model-based methods typically formulate the grasping strategy using pre-trained models, but incur relatively high training costs. Model-free methods require no information about the object type and determine the grasp directly from camera observations, for example the commonly used five-dimensional grasping method. However, such methods lack detailed tactile information about the object and therefore cannot perform precise manipulation. Here we propose a tactile-visual fusion grasping strategy (Fig. 4b). First, the grip position and posture of the robotic hand are determined from the contour, size and depth of the object obtained by vision. When the robot picks up the object, it first applies a light grip and uses tactile sensing to detect slip in real time. When slippage is detected, the robot hand gradually increases its gripping force until a stable hold is achieved; when no slip is detected, it maintains the current grip. With this haptic feedback control, the grip force applied by the robotic hand is kept at the minimum level at which no slipping occurs, which is especially important when handling delicate or fragile objects.
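
Below is a minimal sketch of this slip-feedback grasping loop in Python. The gripper and sensor interfaces (set_grip_force, read_slip_rate), the force step and the slip threshold are assumptions for illustration, not the authors' controller.

```python
import time

SLIP_THRESHOLD = 0.05    # mm/s, assumed threshold matching the sensor's slip sensitivity
FORCE_STEP = 0.05        # N, assumed force increment per control cycle
CONTROL_PERIOD = 0.004   # s, assumed 4 ms control cycle

def grasp_with_slip_feedback(gripper, tactile, initial_force, max_force, duration):
    """Start with a light grip and raise the force only while slip is detected."""
    force = initial_force
    gripper.set_grip_force(force)
    t_end = time.time() + duration
    while time.time() < t_end:
        slip_rate = tactile.read_slip_rate()          # mm/s from the slip channel
        if slip_rate > SLIP_THRESHOLD and force < max_force:
            force = min(force + FORCE_STEP, max_force)
            gripper.set_grip_force(force)             # tighten just enough to stop sliding
        # if no slip is detected, keep the current (minimal) grip force
        time.sleep(CONTROL_PERIOD)
    return force
```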

To demonstrate that our grasping strategy can be applied to slippery or fragile objects, we use a robotic arm equipped with tactile sensors on its fingers to grasp a paper cup that is gradually filled with water. As shown in Fig. 4c, d, in the initial state (0-t1) the robot hand is not grasping anything, so the pressure and interface signals from the tactile sensor are zero. From t1 to t2 the hand begins to squeeze the empty cup (weighing about 6.8 g), and the grip force gradually increases. After the grasping action is completed, the pressure and interface signals remain essentially unchanged (t2-t3), indicating a stable grip with no slipping. Water is then poured into the cup at t3, and due to the increased weight a slip occurs between the cup and the robot hand, which is quickly detected by the tactile sensors. Under real-time feedback control, the hand quickly responds by increasing the grip force until slippage is no longer detected, resulting in a stable grip (t3-t4). After that, the pressure and interface signals remain unchanged, indicating that the water supply has stopped at t4 and the grip remains stable thereafter. The cup now weighs ~100 g, about 15 times its original weight, yet the paper cup is held stably without deformation. Note that an excessive gripping force could crush the paper cup and spill the water. For comparison, Fig. 4e, f shows the results without closed-loop slip control. When water is poured into the cup, the cup slips because the robot hand is unaware of the sliding and therefore cannot adapt its gripping force accordingly. This comparison demonstrates that real-time slip feedback control achieves stable grasping with the minimal gripping force needed to avoid crushing fragile objects, which is essential for delicate robotic manipulation. Importantly, the slip detection must be highly sensitive and fast (in this work it achieves a sensitivity of 0.05 mm/s and a response time of 4 ms) to ensure a stable grip.

Tactile-visual fusion recognition strategy

In addition to stable grasping, accurate object recognition is an important function of robots. For example, when a robot assists with household tasks such as serving drinks, it usually needs to identify the cup, determine whether there is anything inside, and roughly estimate the composition of the contents for subsequent precise manipulation. In daily life, people usually identify objects using vision. However, a robot's vision is limited when recognizing objects in the home environment because of interference from ambient light, occlusion, and confusion between similarly shaped objects, as mentioned earlier. Everyday objects are made from a variety of materials, and many have similar shapes and colors; vision alone struggles to distinguish objects of the same shape, such as crumpled paper, plastic bags and napkins. For objects that cannot be recognized by vision alone, people use tactile perception to make accurate judgments based on the object's characteristics: temperature, pressure, thermal conductivity, texture, and so on. Inspired by this, we propose a cascaded tactile-visual fusion strategy for object recognition that synthesizes multimodal sensory information for accurate identification (Fig. 5a). First, visual information is fed into a YOLOv3 model that recognizes objects by shape, size, color, etc., yielding distinguishable categories such as spherical, bottle-shaped, cup-shaped and shapeless. Then, for visually similar objects within the same category, tactile perception is used for finer discrimination. Shapeless objects can be divided into types such as plastic bag, wrapping paper, napkin and cloth by feeding the object's thermal conductivity, pressure and temperature into a neural network (SNN). Cloth can be further classified into fleece, denim, nylon, etc. with a bagged-tree classifier based on the material's thermal conductivity and texture. For cup-shaped objects, visual perception alone cannot determine whether an opaque cup contains anything. We can use the slip detection function to determine whether there is weight inside the cup, and additionally use thermal conductivity and temperature to estimate the composition of the contents. Following this approach, we effectively integrate multiple sensory inputs to achieve accurate object identification. Recognition with the tactile-visual fusion strategy takes about 80 ms. Moreover, as more sensory information is accumulated, the strategy can be extended to accurately recognize many more everyday objects.
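
The cascade can be pictured as a small dispatch function: vision yields a coarse category, and category-specific tactile classifiers refine it. The detector and classifier objects and their methods below are hypothetical placeholders following the categories named in the text.

```python
def recognize(image, touch, vision_detector, tactile_classifiers, slip_threshold=0.05):
    """Coarse category from vision, then a category-specific tactile refinement."""
    # Stage 1: vision (e.g. a YOLOv3 detector) yields a coarse category
    category = vision_detector.detect(image)      # 'spherical', 'bottle', 'cup', 'shapeless', ...

    if category == "shapeless":
        # Stage 2a: thermal conductivity, pressure and temperature separate
        # plastic bag / wrapping paper / napkin / cloth
        features = [touch.thermal_conductivity, touch.pressure, touch.temperature]
        return tactile_classifiers["shapeless"].predict(features)

    if category == "cup":
        # Stage 2b: slip under a light grip reveals whether the cup has contents (weight);
        # thermal conductivity and temperature then hint at what the contents are
        if touch.slip_rate > slip_threshold:
            contents = tactile_classifiers["contents"].predict(
                [touch.thermal_conductivity, touch.temperature])
            return f"cup of {contents}"
        return "empty cup"

    # Categories that vision can already resolve need no tactile stage
    return category
```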


Fig. 5: Tactile-visual fusion recognition strategy for object sorting and table cleaning.

a Tactile-visual fusion recognition strategy, where Matter TC, Press and Temp refer to the thermal conductivity of the material, pressure and temperature, respectively. b Confusion matrix for object recognition using vision only; the overall recognition accuracy is only 59%. c Confusion matrix for object recognition using tactile sensing only; the overall recognition accuracy is 92%. d Confusion matrix for object recognition using the tactile-visual fusion strategy; the recognition accuracy reaches 96.5%. A = crumpled paper, B = cleaning rag, C = napkin, D = plastic bag, E = plastic bottle, F = orange peel, G = cup of cold water, H = cup of alcohol, I = cup of hot water, J = empty cup. e The tactile-visual robot helps with table cleaning. (I) Vision-based object localization. (II) Stable grasping and tactile-based object recognition. (III) Intelligent sorting and collection.

To demonstrate the advantage of the tactile-visual fusion recognition strategy, we use it to identify 10 everyday objects: crumpled paper, a cleaning rag, a napkin, a plastic bag, a plastic bottle, orange peel, a cup of cold water, a cup of alcohol, a cup of hot water, and an empty cup. For each object we collect 70 samples and randomly divide the collected data into a training set, a validation set and a test set (4:1:2 ratio). Training the model takes about 0.33 s. We also compare the results with vision-only and tactile-only recognition. The results shown in Fig. 5b-d are obtained from independent experiments using the corresponding recognition methods. The confusion matrix for vision-only recognition is shown in Fig. 5b; the overall recognition accuracy is only 59%. Misrecognition mainly occurs for shapeless and cup-shaped objects. Shapeless objects (for example, crumpled writing paper, a napkin or a plastic bag) have no fixed shape and have similar colors, which makes them easy to confuse with each other by vision. For cup-shaped objects, it is difficult to determine the liquid contents by vision because the line of sight is blocked and the liquid is transparent. Using only tactile perception to identify the same objects, the confusion matrix is shown in Fig. 5c; the overall recognition accuracy reaches 92%. Tactile perception achieves high accuracy for most objects, but it is difficult for the tactile sensor to distinguish objects with complex shapes, such as orange peel (75%). In contrast, the proposed tactile-visual recognition strategy, which combines the advantages of both tactile and visual perception, achieves the highest recognition accuracy of 96.5% (Fig. 5d). In the fusion strategy, vision also helps determine the object's position and posture for precise grasping.
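
As a sketch of the evaluation protocol described above (70 samples per class, a 4:1:2 train/validation/test split, and confusion matrices), the following Python code uses scikit-learn; the feature vectors and the model itself are placeholders.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

def split_4_1_2(X, y, seed=0):
    """Split data in a 4:1:2 ratio: test = 2/7 of all data, validation = 1/5 of the rest."""
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=2 / 7, stratify=y, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=1 / 5, stratify=y_rest, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test

def evaluate(model, X_test, y_test):
    """Overall accuracy and confusion matrix (rows: true class, columns: predicted)."""
    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred), confusion_matrix(y_test, y_pred)
```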

Robotic table-cleaning task for household assistance

We further apply the proposed tactile-visual robot in a real-life scenario in which it autonomously performs a table-cleaning task. In this task the robot coordinates all of its components (robotic arm, AGV, camera and tactile sensors) based on the tactile-visual fusion architecture shown in Fig. 4a to perform various actions and, finally, clear the objects from the table, as shown in Fig. 5e. First, the robot enters the room, uses its camera to scan and locate objects on the table, and moves close to them using the AGV. The robot then applies the tactile-visual fusion grasping strategy to grasp objects stably. At the same time, it identifies the object types using the tactile-visual fusion recognition strategy and places the objects into sorting bins according to their categories. Notably, when handling a cup containing liquid, the robot detects the liquid through its tactile grasp, pours the liquid into the water tank, and finally places the empty cup into the recycling bin. Some objects, such as a pen, a sheet of paper or a book, are difficult to grasp directly; the tactile-visual robot handles them intelligently by pushing them to the edge of the table and then grasping them dexterously, as humans do.
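
For clarity, the table-cleaning sequence can be summarized in Python; the robot, AGV and bin interfaces here are hypothetical and only illustrate the order of operations described above.

```python
def clean_table(robot):
    """Locate, grasp, recognize and sort every object found on the table."""
    objects = robot.camera.scan_table()                 # vision: detect and localize objects
    for obj in objects:
        robot.agv.move_near(obj.position)               # AGV: approach the object
        if not robot.arm.is_graspable(obj):             # e.g. a pen, a sheet of paper or a book
            robot.arm.push_to_table_edge(obj)           # make a flat object graspable
        robot.arm.grasp_with_slip_feedback(obj)         # stable grip with minimal force
        label = robot.recognize(obj)                    # cascaded tactile-visual recognition
        if label.startswith("cup of"):                  # cup with liquid detected via the grasp
            robot.arm.pour_into(robot.water_tank)
            label = "empty cup"
        robot.arm.place_in(robot.bins[label])           # sort into the matching bin
```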

Visual recognition can identify objects that differ greatly in appearance, but visually similar objects, such as a napkin and a cleaning rag, are difficult to distinguish. Moreover, vision cannot recognize a transparent liquid inside a cup. Tactile recognition, on the other hand, distinguishes materials well, but its accuracy for objects with complex shapes, such as orange peel, leaves much to be desired. In addition, without visual guidance a robot with only tactile perception cannot perform tasks such as locating an object, which makes it difficult to apply in real-world scenarios. To handle everyday objects dexterously, a robot must integrate tactile and visual sensing and coordinate them effectively to perform perception and cognition, make strategic decisions, and control the system. We therefore propose a tactile-visual robot architecture that integrates tactile and visual information at the signal, perception and decision levels, endowing the robot with robust sensing capabilities and execution proficiency. On this basis, we design tactile-visual fusion strategies for object grasping and recognition. The grasping strategy uses fast and sensitive slip feedback to achieve stable grasping with minimal gripping force, and the recognition strategy uses a cascaded hybrid approach to accurately recognize various everyday objects, including the liquid contents of a cup. Applying the proposed recognition strategy to common everyday objects, we achieve a recognition accuracy of 96.5%, significantly higher than vision-only (59%) or tactile-only (92%) recognition. Moreover, using the proposed tactile-visual fusion architecture and the grasping and recognition strategies, the robot autonomously performs a table-cleaning task. These results demonstrate the promising potential of intelligent robots with tactile-visual integration for housekeeping, significantly reducing the need for manual labor. The developed multimodal tactile sensors and the proposed tactile-visual fusion architecture endow the robot with superior perceptual and executive capabilities, facilitating flexible and reliable interaction with humans and assisting people in daily life.
