Collaboration between humans and robots is on

The digital revolution is reaching many parts of human life, especially industrial settings, where it aims to create the best possible cooperation between robots and humans. This type of collaboration is on the rise, and demand for it is high. Getting two different worlds to cooperate and collaborate is quite challenging, hence a great deal of data gathering, innovation, analysis, and system/software development is taking place under the umbrella of the CoRoSect project.

In the CoRoSect project, the emphasis is on how workers’ concentration levels change during the execution of specific human-robot collaboration tasks. The gathered data is used to train deep learning models for an industrial environment, which is recreated using a mobile augmented reality device such as Microsoft’s HoloLens 2. This has shown great results, and significant progress has been achieved in establishing the simulated environment.

But CoRoSect teams do not stop here!

They have experimented with different algorithms for detecting, tracking, and measuring the distance of people and objects from fixed cameras in order to find the most suitable one. The final decision fell on Detic, a method for detecting twenty-thousand classes using image-level supervision, combined with DeepSORT, a tracking-by-detection algorithm. Furthermore, the message bus and the messages to be sent to the rest of the components have been determined.
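
As a rough illustration of the tracking-by-detection pattern behind this choice, the sketch below matches per-frame detections to existing tracks by bounding-box overlap. The `detect` function is a placeholder standing in for the Detic model, and the greedy IoU matching is a deliberately simplified stand-in for DeepSORT's Kalman-filter and appearance-based association.

```python
# Minimal tracking-by-detection sketch (illustration only).
# `detect(frame)` is a placeholder for the Detic detector, and the greedy IoU
# matching below is a simplified stand-in for DeepSORT's Kalman-filter and
# appearance-based association (no track deletion, for brevity).
from itertools import count

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def update_tracks(tracks, detections, next_id, iou_thr=0.3):
    """Greedily match detections to existing tracks; unmatched ones open new tracks."""
    unmatched = list(detections)
    for track in tracks:
        if not unmatched:
            break
        best = max(unmatched, key=lambda box: iou(track["box"], box))
        if iou(track["box"], best) >= iou_thr:
            track["box"] = best              # update the track with the matched detection
            unmatched.remove(best)
    for box in unmatched:                    # detections that matched no existing track
        tracks.append({"id": next(next_id), "box": box})
    return tracks

def track_video(frames, detect):
    """Yield (track_id, box) pairs for every frame of a video stream."""
    tracks, next_id = [], count()
    for frame in frames:
        boxes = detect(frame)                # e.g. person/object boxes from Detic
        tracks = update_tracks(tracks, boxes, next_id)
        yield [(t["id"], t["box"]) for t in tracks]
```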

In addition, CoRoSect teams have been working on enhancing the annotation process. They have tested a superpixel algorithm called Simple Linear Iterative Clustering (SLIC), and an attempt was made to split single frames into subframes, using high-resolution frames of actual insects provided by the farms.
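
SLIC itself is readily available in scikit-image; a minimal sketch of how such a superpixel segmentation might be produced (file names and parameter values below are placeholders, not project settings) looks like this:

```python
# Minimal SLIC superpixel sketch using scikit-image.
# The file name and parameter values are placeholders, not project settings.
from skimage import io
from skimage.segmentation import slic, mark_boundaries

frame = io.imread("insect_frame.png")            # high-resolution frame from a farm

# Partition the frame into roughly uniform superpixels; `n_segments` controls how
# many regions are produced and `compactness` trades color similarity for shape.
segments = slic(frame, n_segments=500, compactness=10, start_label=1)

# Overlay the superpixel boundaries for visual inspection of the annotation aid.
overlay = mark_boundaries(frame, segments)
io.imsave("insect_frame_superpixels.png", (overlay * 255).astype("uint8"))
```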

While the robotic environment was developed and tested in pick-and-place scenarios, the problem of image-to-image translation for domain adaptation was tackled. To translate scenes between the human and robotic domains, we used Cycle-consistent Generative Adversarial Networks (CycleGANs). A human performed movements in front of a green screen and some objects, while a robot performed random movements in a similar scene within the simulator. The approach was then used to train a model that can perform both forward and backward translations.
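
The core idea of a CycleGAN is a pair of generators trained with adversarial and cycle-consistency losses, so that translating a human-domain image to the robotic domain and back again reproduces the original. The sketch below shows only the generator-side objective in PyTorch, with toy networks as placeholders; it is not the architecture or training code used in the project.

```python
# Sketch of the CycleGAN generator-side objective (toy networks, losses only).
# G translates human -> robot, F translates robot -> human; D_robot and D_human
# are the corresponding discriminators. Not the architecture used in the project.
import torch
import torch.nn as nn

def tiny_cnn(out_channels=3):
    """Placeholder network; real CycleGANs use ResNet generators and PatchGAN discriminators."""
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, out_channels, 3, padding=1))

G, F = tiny_cnn(), tiny_cnn()                    # generators for both directions
D_robot, D_human = tiny_cnn(1), tiny_cnn(1)      # discriminators (real/fake score maps)
gan_loss, cycle_loss = nn.MSELoss(), nn.L1Loss()

def generator_objective(human, robot, lambda_cyc=10.0):
    fake_robot = G(human)                        # forward translation
    fake_human = F(robot)                        # backward translation

    # Adversarial terms: the generators try to make the discriminators predict "real" (1).
    pred_robot, pred_human = D_robot(fake_robot), D_human(fake_human)
    loss_gan = (gan_loss(pred_robot, torch.ones_like(pred_robot)) +
                gan_loss(pred_human, torch.ones_like(pred_human)))

    # Cycle consistency: translating there and back should recover the original image.
    loss_cyc = cycle_loss(F(fake_robot), human) + cycle_loss(G(fake_human), robot)
    return loss_gan + lambda_cyc * loss_cyc

# One example pass on random tensors standing in for image batches.
loss = generator_objective(torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64))
loss.backward()
```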

The following figures show an example of the human scene setting and of the forward-translated context to the robotic domain obtained from the trained CycleGAN model. The next steps will include the acquisition of more images to perform the context translation more accurately and jointly with the targeted tasks. The second part of the work involved experimentation with the pick-and-place environment, where, after defining the environment, we collected data corresponding to the task of picking up an object and placing it at a specific location on the table.

The following figure shows an example of the picking environment.

Example of the image-to-image translation between the human (a) and the robotic domain (b), and definition of the picking environment (c). The context translation between a human performing a move and a robot performing the equivalent move was achieved by applying the CycleGAN method [1]. An example of the pick-and-place environment is given in the last inset.

Detecting moving and static obstacles and people in the factory space, in order to avoid potential collisions between people and robots, is a major issue to tackle.

The planned solution for the detection of people is to use a combination of:

    – Fixed monocular cameras: these should be placed high up in order to cover the entire room without blind spots (see an example in Figure 1).

    – Software analyzing the footage of these cameras in near real time to detect people and objects, whether static or moving. In the case of movement, we want to predict the next movements (an illustrative sketch follows this list).
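
How these next movements are predicted is not detailed here; as one illustrative assumption, a simple constant-velocity extrapolation over the most recent tracked positions could look like this:

```python
# Illustrative constant-velocity prediction of a tracked person's next positions.
# This is an assumption for illustration only; the project's actual predictor may
# differ (e.g. the Kalman filter already embedded in DeepSORT).
import numpy as np

def predict_next_positions(track, n_future=5, dt=1.0):
    """track: array of shape (T, 2) with recent (x, y) positions on the ground plane."""
    track = np.asarray(track, dtype=float)
    velocity = (track[-1] - track[-2]) / dt          # most recent displacement per step
    steps = np.arange(1, n_future + 1).reshape(-1, 1)
    return track[-1] + steps * velocity * dt         # extrapolate n_future positions

recent = [(0.0, 0.0), (0.2, 0.1), (0.4, 0.2), (0.6, 0.3)]
print(predict_next_positions(recent, n_future=3))    # continues along the same line
```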

[1] Zhu, Jun-Yan, et al. “Unpaired image-to-image translation using cycle-consistent adversarial networks.” Proceedings of the IEEE international conference on computer vision. 2017.

Camera placement

In terms of software, the main highlights are:

  • Experimentation with different algorithms for detecting, tracking, and measuring the distance of people and objects from fixed cameras
  • Final decision to use Detic with DeepSORT
    – Setting up methods for camera and algorithm calibration
    – Smoothing of trajectories
  • Determination of the message bus and the messages to be sent to the rest of the components (trajectory smoothing and message formatting are illustrated in the sketch below)
Detection and tracking of objects (here suitcases)
Objects in the factory space in 3D
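
As an illustration of the last two highlights, the sketch below smooths a noisy image-plane trajectory with a moving average and packages the latest position as a JSON message; the smoothing window and the message fields are placeholder assumptions rather than the schema actually determined in the project.

```python
# Illustrative trajectory smoothing and message formatting.
# The moving-average window and the JSON field names are placeholder assumptions,
# not the schema or method actually chosen for the CoRoSect message bus.
import json
import numpy as np

def smooth_trajectory(points, window=5):
    """Moving-average smoothing of a (T, 2) array of (x, y) positions."""
    points = np.asarray(points, dtype=float)
    kernel = np.ones(window) / window
    x = np.convolve(points[:, 0], kernel, mode="valid")
    y = np.convolve(points[:, 1], kernel, mode="valid")
    return np.stack([x, y], axis=1)

def to_message(track_id, label, smoothed, timestamp):
    """Serialize the latest smoothed position as a message for the bus."""
    x, y = smoothed[-1]
    return json.dumps({
        "track_id": track_id,
        "label": label,               # e.g. "person" or "object"
        "position": {"x": float(x), "y": float(y)},
        "timestamp": timestamp,
    })

raw = [(0, 0), (1.1, 0.9), (1.9, 2.2), (3.2, 2.8), (4.0, 4.1), (5.1, 4.9)]
smoothed = smooth_trajectory(raw, window=3)
print(to_message(track_id=7, label="person", smoothed=smoothed,
                 timestamp="2022-06-01T10:00:00Z"))
```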

In regard to Simple Linear Iterative Clustering (SLIC), an effort was made to reduce the disparity between the resolution of the frames and the size of the insects they contain, while keeping the image size at an exploitable level: each high-resolution frame was split into 36 subframes. The results were encouraging, indicating that the difficulty of detecting tiny objects can most likely be addressed as an image-resolution constraint.
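
Splitting each frame into a regular 6 × 6 grid of 36 subframes can be done with plain array slicing; a minimal sketch is given below (the frame dimensions are placeholders, and border pixels left over by integer division are ignored).

```python
# Split a high-resolution frame into 36 subframes (a 6 x 6 grid) so that
# tiny insects occupy a larger fraction of each processed image.
import numpy as np

def split_into_subframes(frame, rows=6, cols=6):
    """Return a list of (row, col, subframe) crops covering the input frame."""
    h, w = frame.shape[:2]
    step_h, step_w = h // rows, w // cols
    subframes = []
    for r in range(rows):
        for c in range(cols):
            crop = frame[r * step_h:(r + 1) * step_h, c * step_w:(c + 1) * step_w]
            subframes.append((r, c, crop))
    return subframes

frame = np.zeros((3000, 4000, 3), dtype=np.uint8)   # stand-in for a farm frame
crops = split_into_subframes(frame)
print(len(crops), crops[0][2].shape)                # -> 36 (500, 666, 3)
```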

Grasping and manipulating objects in a virtual environment is a relatively new research area. Most of the research has focused on grasping rigid objects, but handling non-rigid (deformable) objects in a virtual environment is more challenging, and very little progress has been made.

Grasping of a deformable object

CoRoSect will use specific hardware (VR Sense Gloves) and software (a Unity Virtual Environment) to realistically grasp and manipulate non-rigid (deformable) objects and to provide force feedback. The main goal is to train a deep learning model that takes grasping information as input (hand joint positions, hand rotation, etc.) and predicts the forces applied to the fingers of the virtual hand (force feedback).

To train this model, we will use two different datasets focusing on deformable (non-rigid) objects. To boost our results, we are currently conducting experiments to generate a second dataset using our own equipment.
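
Since the model itself is still being designed, the sketch below is only a hedged illustration: a small fully connected regression network mapping hand-pose features to per-finger forces, with assumed input and output dimensions and random tensors standing in for a real grasping dataset.

```python
# Illustrative regression model: hand-pose features -> per-finger forces.
# The input size (e.g. 25 joints x 3 coordinates + 4 rotation values = 79) and the
# 5 output forces are assumptions for this sketch, not the project's final design.
import torch
import torch.nn as nn

class ForceFeedbackNet(nn.Module):
    def __init__(self, in_features=79, n_fingers=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_fingers),     # one predicted force per finger
        )

    def forward(self, x):
        return self.net(x)

model = ForceFeedbackNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One training step on random tensors standing in for a grasping dataset batch.
features = torch.randn(32, 79)            # hand joint positions, rotations, ...
target_forces = torch.rand(32, 5)         # measured fingertip forces
optimizer.zero_grad()
loss = loss_fn(model(features), target_forces)
loss.backward()
optimizer.step()
```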

The next step is to extend the existing dataset in order to train a better model, which should give us more realistic force feedback.

Proposed Technique