---
layout: post
title: Learning Tool Affordances without Labels
date: 2021-06-15 11:12:00-0400
description: "GIFT: Generalizable Interaction-aware Functional Tool representations (at RSS 2021)"
source: /_bibliography/
bibliography: giftturpin.bib
bibliography_template: bib
---


## **Generalizable Interaction-aware Functional Tool Affordances without Labels**
[Dylan Turpin](http://www.cs.toronto.edu/~dylanturpin/), [Liquan Wang](https://www.linkedin.com/in/liquan-wang-a37634196/?originalSubdomain=ca), [Stavros Tsogkas](https://tsogkas.github.io/), [Sven Dickinson](https://www.cs.toronto.edu/~sven/), [Animesh Garg](https://animesh.garg.tech/)
*[Robotics: Science and Systems (RSS)](http://www.roboticsproceedings.org/rss17/p060.html), 2021.*
[paper](https://arxiv.org/abs/2106.14973), [video](https://youtu.be/7N1XiIzu9v4)


{: width="120%"}

*Figure 1: Rather than relying on human labels, the GIFT framework discovers affordances from goal-directed interaction with a set of procedurally generated tools.*


### Motivation
**1. We should represent tools by what we can do with them.**

When it comes to tools, "What can I do with this?" is the key question, and there usually isn't just one answer.
There are many basic ways of using a hammer (to strike, to pry, to reach),
each of which contains many finer-grained possibilities (strike with
a high grip for precision, or a low grip for power).

We learn a representation that captures these possibilities.
Specifically, we represent an action possibility (i.e., an affordance) as a tuple
(task ID, grasp keypoint, interaction keypoint).
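
As a minimal sketch, such a tuple might be stored as a small record like the one below. The field names and example coordinates are ours for illustration, not from the paper's code.

```python
from dataclasses import dataclass

import numpy as np

# Hypothetical container for one affordance:
# a task, a point to grasp, and a point to interact with.
@dataclass
class Affordance:
    task_id: int             # e.g. 0 = hammer, 1 = hook, 2 = reach
    grasp_kp: np.ndarray     # 3D grasp keypoint on the tool surface
    interact_kp: np.ndarray  # 3D interaction keypoint on the tool surface

aff = Affordance(task_id=0,
                 grasp_kp=np.array([0.0, -0.15, 0.02]),
                 interact_kp=np.array([0.0, 0.10, 0.02]))
```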

We build on a line of work that investigates behaviour-grounded object representations, especially KETO ([*Fang et al. 2019*](https://arxiv.org/abs/1910.11977)) and kPAM ([*Manuelli et al. 2019*](https://arxiv.org/abs/1903.06684)).
In contrast to these, our learned representations do not rely on human labels (as in kPAM) or a predefined manipulation strategy (as in KETO).

**2. Behaviour-grounded predictions are testable predictions.**

Because our predicted tool representations are *behaviour-grounded* (i.e., they correspond to possible actions),
we can test them against reality by executing the corresponding actions and checking
the result.
This "predict, act, check" routine gives us a *self-supervised* training loop that
does not rely on human labels.
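
The loop can be sketched as below. The `predict`, `execute` and `reward` stubs are stand-ins for the actual networks, motion planner and simulator; only the control flow is meant to be faithful.

```python
# Stand-in for the keypoint-selection networks.
def predict(observation):
    return {"grasp_kp": 0, "interact_kp": 1}  # keypoint indices

# Stand-in for motion planning + simulation.
def execute(affordance):
    return {"first_contact_kp": 1, "success": True}

# Stand-in for the task reward.
def reward(outcome):
    return 1.0 if outcome["success"] else 0.0

def self_supervised_step(observation):
    affordance = predict(observation)  # predict
    outcome = execute(affordance)      # act
    r = reward(outcome)                # check
    # The (affordance, outcome, r) triple is a training example,
    # obtained without any human labels.
    return affordance, outcome, r
```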

To close the loop, we need a way of translating predicted representations
into executable motions. This is a challenge because the space of possible actions is large.
Prior works rely on workarounds like additional human supervision
or constraining the action space,
but these simplifications come with serious drawbacks.

*Human supervision* (e.g., with keypoint labels) is expensive and introduces human bias.
We want our representations
to be discovered only from the constraints of the manipulation tasks.

*Constraining the action space* (e.g., to pre-defined motion primitives)
means some action possibilities will never
be explored.


**3. Constraining behaviour limits affordance discovery.**

If we constrain behaviour by limiting the action space
or using a pre-defined manipulation strategy,
we will never discover affordances corresponding to excluded behaviours.

We generate our trajectories with a simple sampling-based motion planner
that is conditioned on the predicted keypoints through a reward function
with one term encoding task success and another encouraging use of the selected keypoints.

Actions are sampled from the full action space,
so motion generation is free to use tools in unexpected ways,
discovering new possibilities that could not have been explored if we
were tied to a limited set of motion primitives
or a pre-defined manipulation strategy.
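
A two-term reward of this shape could look like the sketch below. The weighting and distance scale are illustrative, not the paper's values; `target_dist` stands for whatever task-progress measure the task defines.

```python
import numpy as np

# Sketch of the planner's two-term reward: one term for task progress
# (closer to the goal is better) and one for making contact near the
# selected interaction keypoint. w_kp and scale are assumed values.
def planning_reward(target_dist, contact_pos, interact_kp,
                    w_kp=0.5, scale=0.05):
    task_term = -target_dist
    kp_term = np.exp(-np.linalg.norm(contact_pos - interact_kp) / scale)
    return task_term + w_kp * kp_term
```

Because the keypoint term only shapes the reward, the planner is still free to discover contacts elsewhere when that serves the task better.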


### Method overview

{: width="120%"}

*Figure 2: The training pipeline. Our framework learns affordance models for hooking, reaching and hammering by interacting with a set of tools.
All three models share a task-independent keypoint detector, which takes an RGBD image of a tool and predicts a set of keypoints representing a
tool's geometry and providing possible choices of grasp and interaction regions. The task-conditional portion of each model, which is trained
on the outcome of trajectories collected from motion planning, selects two keypoints which become our functional tool representation.*

**Collect experience in the full action space.**

We begin by generating a set of tools by concatenating pairs of
convex meshes into T, X and L shapes.
We sample a tool from our training set, place it on a
table and capture an RGBD observation of the tool from above.
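
To make the procedural generation concrete, here is one way two box-shaped convex parts could be arranged into the three shapes. The dimensions and offsets are purely illustrative, not the generator actually used.

```python
import numpy as np

# Sketch of procedural tool generation: each tool is two box-shaped
# convex parts (a handle and a head); the offset and rotation of the
# head turn the pair into a T, X or L shape.
def make_tool(shape, handle=(0.30, 0.03, 0.02), head=(0.12, 0.03, 0.02)):
    handle_part = {"extents": np.array(handle),
                   "offset": np.zeros(3), "yaw": 0.0}
    if shape == "T":    # head centred on the handle's end, perpendicular
        head_part = {"offset": np.array([0.15, 0.0, 0.0]), "yaw": np.pi / 2}
    elif shape == "L":  # head hanging off one side of the handle's end
        head_part = {"offset": np.array([0.15, 0.045, 0.0]), "yaw": np.pi / 2}
    elif shape == "X":  # head crossing the handle's midpoint
        head_part = {"offset": np.zeros(3), "yaw": np.pi / 2}
    else:
        raise ValueError(f"unknown shape {shape!r}")
    head_part["extents"] = np.array(head)
    return [handle_part, head_part]
```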

From this observation, our SparseKP network infers a set of keypoints
representing the tool's geometry and providing possible choices of regions
to grasp and interact with.
This module is pre-trained using unsupervised keypoint losses and is shared across tasks.

A grasp keypoint and a provisional interaction keypoint are uniformly
sampled from this sparse set.
We take a crop of the RGBD observation around the grasp keypoint
and pass it to a grasping network to infer a nearby stable grasp.
This grasp is executed, and we use MPPI to generate the rest of the trajectory.
MPPI iteratively samples action sequences from the full action space and takes their average weighted by reward.

Part of this reward encodes task success and part encourages
use of the provisional interaction keypoint.
In this way we are able to sample from the full action space
while still conditioning on the sparse selection of keypoints
on the tool's surface.
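
A minimal version of that MPPI update is sketched below: perturb the current plan, score each rollout, and return the reward-weighted average as the new plan. The noise scale, temperature and `rollout_reward` are placeholders for the real planner's settings.

```python
import numpy as np

# One MPPI iteration: sample action sequences around the current mean,
# score them, and take their softmax(reward)-weighted average.
def mppi_update(mean_actions, rollout_reward, n_samples=64,
                noise_std=0.1, temperature=1.0, seed=0):
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, noise_std,
                       size=(n_samples,) + mean_actions.shape)
    samples = mean_actions + noise  # unconstrained: full action space
    rewards = np.array([rollout_reward(a) for a in samples])
    weights = np.exp((rewards - rewards.max()) / temperature)
    weights /= weights.sum()
    # The reward-weighted average becomes the new mean action sequence.
    return np.tensordot(weights, samples, axes=1)
```

With a toy reward like `lambda a: -np.abs(a).sum()`, repeated updates pull the action sequence toward zero; in our setting the reward is the task-plus-keypoint term described above.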

**Extract training examples from trajectory contact data.**

Once we have a trajectory, we extract a training example.
This consists of a reward, grasp KP, interaction KP and the full set of sparse keypoints.
We replace the provisional choice of interaction KP with the keypoint closest
to the actual first contact between the tool and the target object.

The provisional interaction KP is sampled, and conditioned on,
in order to encourage exploration of different manipulation strategies.
But in the end, we want to use whichever interaction keypoint
works best for the task.
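
The relabeling step is a nearest-neighbour lookup; a sketch, with made-up coordinates:

```python
import numpy as np

# Swap the provisional interaction KP for whichever sparse keypoint
# lies closest to the first tool-target contact in the trajectory.
def relabel_interaction_kp(sparse_kps, first_contact):
    dists = np.linalg.norm(sparse_kps - first_contact, axis=1)
    return int(np.argmin(dists))  # index of the relabeled interaction KP

kps = np.array([[0.0, 0.0, 0.0],
                [0.1, 0.0, 0.0],
                [0.2, 0.0, 0.0]])
relabel_interaction_kp(kps, np.array([0.19, 0.01, 0.0]))  # → 2
```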

**Train keypoint selection to maximize task reward.**

We sample a training example and build a graph out of its sparse keypoints.
This graph is fed to a task-specific GNN, which predicts a distribution over pairs of keypoint indices (i.e., the joint distribution over grasp and interaction keypoints).
Finally, we update the GNN weights using REINFORCE.
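
The REINFORCE update itself can be shown with a plain score table standing in for the GNN: `logits[i, j]` scores the pair (grasp KP `i`, interaction KP `j`), and the gradient is the standard score-function estimator, `(reward - baseline) * ∇ log π`. The learning rate and baseline are illustrative.

```python
import numpy as np

# One REINFORCE step on a softmax policy over keypoint pairs.
# In the paper a task-specific GNN produces these scores; a raw
# logit table stands in for it here.
def reinforce_step(logits, pair, reward, baseline=0.0, lr=0.1):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad_logp = -probs      # d log softmax / d logits for all entries...
    grad_logp[pair] += 1.0  # ...plus 1 at the chosen (grasp, interact) pair
    return logits + lr * (reward - baseline) * grad_logp
```

After an update with positive reward, the chosen pair's probability increases, so keypoint pairs that earn reward get selected more often.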

At test time, we sample a tool from the holdout test set.
The grasp and interaction keypoints are selected
based on the predicted distribution, rather than uniformly, and we
enforce that the selected interaction point be used, rather than treating it as provisional.

### Results

So, how well does it work in practice?

Quantitatively, GIFT beats baseline methods on all three tasks, and qualitatively,
the choices of grasp and interaction points usually match task semantics and agree with
the choices of a human oracle.

{: width="70%"}

*Table 1: GIFT outperforms baselines on all tasks and matches a human oracle on two of three tasks using novel tools. Reward is normalized with respect to the human oracle.*

<div style="width: 100%; height: 0px; position: relative; padding-bottom: 56.250%;"><iframe src="https://streamable.com/e/6l3jwp" frameborder="0" width="100%" height="100%" allowfullscreen style="width: 100%; height: 100%; position: absolute;"></iframe></div>

*Video 1: Hammering a peg with tools from the holdout set.
Low grasp points increase leverage (providing greater
strike-point velocity and peg impulse for a given joint velocity).
Strike points on the hard metallic head allow accumulated
kinetic energy to be rapidly transferred to the peg for
maximum impulse.*

<div style="width: 100%; height: 0px; position: relative; padding-bottom: 56.250%;"><iframe src="https://streamable.com/e/sehup0" frameborder="0" width="100%" height="100%" allowfullscreen style="width: 100%; height: 100%; position: absolute;"></iframe></div>

*Video 2: Hooking a thermos with tools from the holdout set.
Our original intention was to constrain the task such that only the
right angle between head and handle could be used to hook.
In fact, the motion planner finds other solutions.*

<div style="width: 100%; height: 0px; position: relative; padding-bottom: 56.250%;"><iframe src="https://streamable.com/e/ciwknw" frameborder="0" width="100%" height="100%" allowfullscreen style="width: 100%; height: 100%; position: absolute;"></iframe></div>

*Video 3: Constrained reaching with tools from the holdout set.
To manipulate an object on the other side of a wall,
the tool must fit through the hole and reach the
target object.*

### Future directions

By grounding our affordance representation in contact data
and sampling trajectories from the full action space,
we are able to discover unbiased affordances for each task
without human labels.
There are, however, important limitations to our method that we think future work could address.

Our sparse keypoints are stored as raw locations, without any additional property encodings.
A natural extension to our method would encode additional local information
at each keypoint.
This could capture local geometric or material properties and allow
for more robust reasoning about the relationship between materials,
fine-grained geometry
and task requirements.

Our motion-sampling routine depends on access to simulatable dynamics.
This makes it non-trivial to transfer our results to real robots.
We plan to experiment with recovering 3D models from visual observations, so we can leverage our learned representation to plan in simulation and execute on a real robot.

For details on the background, implementation and results, read the full paper [here](https://arxiv.org/abs/2106.14973) and watch the [presentation](https://streamable.com/eylzdj).