---
layout: post
title: Learning Tool Affordances without Labels
date: 2021-06-15 11:12:00-0400
description: "GIFT: Generalizable Interaction-aware Functional Tool representations (at RSS 2021)"
source: /_bibliography/
bibliography: giftturpin.bib
bibliography_template: bib
---

## **Generalizable Interaction-aware Functional Tool Affordances without Labels**

[Dylan Turpin](http://www.cs.toronto.edu/~dylanturpin/), [Liquan Wang](https://www.linkedin.com/in/liquan-wang-a37634196/?originalSubdomain=ca), [Stavros Tsogkas](https://tsogkas.github.io/), [Sven Dickinson](https://www.cs.toronto.edu/~sven/), [Animesh Garg](https://animesh.garg.tech/)

*[Robotics: Science and Systems](http://www.roboticsproceedings.org/rss17/p060.html), 2021.*

[paper](https://arxiv.org/abs/2106.14973), [video](https://youtu.be/7N1XiIzu9v4)

![Teaser image for GIFT. Discover tool affordances by interacting with procedurally-generated tools across three manipulation tasks: hooking, reaching and hammering. Train an affordance model to detect sparse keypoints representing tool geometry and predict distributions over pairs of keypoints to grasp and interact with for each task by learning from the contact data of sampled trajectories. Affordance predictions from RGBD observations of unknown objects match expected task semantics across hooking, reaching and hammering and are similar to those of a human labeller, e.g. for hammering.]({{ 'gift-teaser.svg' | prepend: '/assets/img/blog/gift-jul21/' | prepend: site.baseurl | prepend: site.url }}){: width="120%"}

*Figure 1: Rather than relying on human labels, the GIFT framework discovers affordances from goal-directed interaction with a set of procedurally-generated tools.*

### Motivation

**1. We should represent tools by what we can do with them.**

When it comes to tools, "What can I do with this?" is the key question, and there usually isn't just one answer. There are many basic ways of using a hammer (to strike, to pry, to reach), each of which contains many finer-grained possibilities (strike with a high grip for precision, or a low grip for power).

We learn a representation that captures these possibilities. Specifically, we represent an action possibility (i.e., an affordance) as a tuple (task ID, grasp keypoint, interaction keypoint).
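
In code, one such affordance could be sketched as a small record. This is purely illustrative; the field names, types and example coordinates are ours, not taken from the paper's implementation:

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical sketch of the affordance tuple described above.
# Field names, types and coordinates are illustrative, not the paper's code.
@dataclass(frozen=True)
class Affordance:
    task_id: int                                 # e.g. 0 = hooking, 1 = reaching, 2 = hammering
    grasp_kp: Tuple[float, float, float]         # 3D location of the grasp keypoint
    interaction_kp: Tuple[float, float, float]   # 3D location of the interaction keypoint

# One possible affordance of a hammer-like tool: grasp low on the
# handle, interact (strike) with the head.
hammering = Affordance(task_id=2,
                       grasp_kp=(0.0, -0.15, 0.02),
                       interaction_kp=(0.0, 0.12, 0.03))
```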

We build on a line of work that investigates behaviour-grounded object representations, especially KETO ([*Fang et al. 2019*](https://arxiv.org/abs/1910.11977)) and kPAM ([*Manuelli et al. 2019*](https://arxiv.org/abs/1903.06684)). In contrast to these, our learned representations do not rely on human labels (as in kPAM) or a predefined manipulation strategy (as in KETO).

**2. Behaviour-grounded predictions are testable predictions.**

Because our predicted tool representations are *behaviour-grounded* (i.e., they correspond to possible actions), we can test them against reality by executing the corresponding actions and checking the result. This "predict, act, check" routine gives us a *self-supervised* training loop that does not rely on human labels.

To close the loop, we need a way of translating predicted representations into executable motions, which is a challenge because the space of possible actions is large. Prior works rely on workarounds like additional human supervision or constraining the action space, but these simplifications come with serious drawbacks.

*Human supervision* (e.g., with keypoint labels) is expensive and introduces human bias. We want our representations to be discovered only from the constraints of the manipulation tasks.

*Constraining the action space* (e.g., to pre-defined motion primitives) means some action possibilities will never be explored.

**3. Constraining behaviour limits affordance discovery.**

If we constrain behaviour by limiting the action space or using a pre-defined manipulation strategy, we will never discover affordances corresponding to excluded behaviours.

We generate our trajectories with a simple sampling-based motion planner that is conditioned on the predicted keypoints through a reward function with one term encoding task success and another encouraging use of the selected keypoints.

Actions are sampled from the full action space, so motion generation is free to use tools in unexpected ways, discovering new possibilities that could not have been explored if we were tied to a limited set of motion primitives or a pre-defined manipulation strategy.
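
A two-term reward of this kind could be sketched as below. This is a minimal sketch under our own assumptions: the task-success term is passed in as a callable, and the exponential distance shaping and weighting are illustrative, not the paper's exact formulation:

```python
import numpy as np

def planner_reward(state, tool_contact_point, interaction_kp,
                   task_success_reward, kp_weight=1.0):
    """Two-term planner reward: one term for task success, one
    encouraging contact near the selected interaction keypoint.
    The shaping function and weight are illustrative assumptions."""
    # Bonus approaches 1 as the tool's contact point nears the keypoint.
    kp_bonus = np.exp(-np.linalg.norm(
        np.asarray(tool_contact_point) - np.asarray(interaction_kp)))
    return task_success_reward(state) + kp_weight * kp_bonus
```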

### Method overview

![The training pipeline.]({{ 'gift-pipeline.svg' | prepend: '/assets/img/blog/gift-jul21/' | prepend: site.baseurl | prepend: site.url }}){: width="120%"}

*Figure 2: The training pipeline. Our framework learns affordance models for hooking, reaching and hammering by interacting with a set of tools. All three models share a task-independent keypoint detector, which takes an RGBD image of a tool and predicts a set of keypoints representing a tool’s geometry and providing possible choices of grasp and interaction regions. The task-conditional portion of each model, which is trained on the outcome of trajectories collected from motion planning, selects two keypoints which become our functional tool representation.*

**Collect experience in the full action space.**

We begin by generating a set of tools as concatenations of pairs of convex meshes into T, X and L shapes. We sample a tool from our training set, place it on a table and capture an RGBD observation of the tool from above.

From this observation, our SparseKP network infers a set of keypoints representing the tool's geometry and providing possible choices of regions to grasp and interact with. This module is pre-trained using unsupervised keypoint losses and is shared across tasks.

A grasp keypoint and a provisional interaction keypoint are uniformly sampled from the sparse keypoints. We take a crop of the RGBD observation around the grasp keypoint and pass it to a grasping network to infer a nearby stable grasp. This grasp is executed, and we use MPPI to generate the rest of the trajectory. MPPI iteratively samples action sequences from the full action space and takes their average weighted by reward.

Part of this reward encodes task success and part encourages use of the provisional interaction keypoint. In this way we are able to sample from the full action space while still conditioning on the sparse selection of keypoints on the tool's surface.
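
A single MPPI iteration can be sketched generically as follows. This is a minimal sketch, not the paper's implementation: the rollout-reward callable stands in for simulator rollouts, and the sample count, noise scale and temperature are our assumptions:

```python
import numpy as np

def mppi_step(nominal, rollout_reward, n_samples=64, noise_std=0.5,
              temperature=1.0, rng=None):
    """One MPPI iteration: perturb the nominal action sequence with
    Gaussian noise, score each perturbed sequence by rolling it out,
    and return the reward-weighted average of the samples.
    Hyperparameters are illustrative."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_steps, action_dim = nominal.shape
    noise = rng.normal(0.0, noise_std, size=(n_samples, n_steps, action_dim))
    candidates = nominal[None] + noise
    rewards = np.array([rollout_reward(seq) for seq in candidates])
    # Softmax over rewards: high-reward samples dominate the update.
    weights = np.exp((rewards - rewards.max()) / temperature)
    weights /= weights.sum()
    return (weights[:, None, None] * candidates).sum(axis=0)
```

Iterating this update concentrates the sampled trajectories on high-reward behaviour while still drawing every candidate from the full action space.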

**Extract training examples from trajectory contact data.**

Once we have a trajectory, we extract a training example. This consists of a reward, a grasp KP, an interaction KP and the full set of sparse keypoints. We replace the provisional choice of interaction KP with the keypoint closest to the actual first contact between the tool and the target object.

The provisional interaction KP is sampled, and conditioned on, in order to encourage exploration of different manipulation strategies. But in the end, we want to use whichever interaction keypoint works best for the task.
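
The relabelling step is essentially a nearest-neighbour lookup. A sketch, with function and argument names of our choosing:

```python
import numpy as np

def relabel_interaction_kp(keypoints, first_contact):
    """Hindsight relabelling sketch: return the index of the sparse
    keypoint closest to the first tool-target contact observed in the
    executed trajectory. `keypoints` is (N, 3), `first_contact` is (3,)."""
    dists = np.linalg.norm(keypoints - np.asarray(first_contact)[None], axis=1)
    return int(np.argmin(dists))
```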

**Train keypoint selection to maximize task reward.**

We sample a training example and build a graph out of its sparse keypoints. This graph is fed to a task-specific GNN, which predicts a distribution over pairs of keypoint indices (i.e., the joint distribution over grasp and interaction keypoints). Finally, we update the GNN weights using REINFORCE.

At test time, we sample a tool from the holdout test set. The grasp and interaction keypoints are selected based on the predicted distribution, rather than uniformly, and we enforce that the selected interaction keypoint is used, rather than treating it as provisional.
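
For intuition, here is the REINFORCE update on a flat categorical distribution over keypoint pairs, standing in for the GNN head. The logits array and learning rate are illustrative; only the score-function gradient itself is the standard REINFORCE rule:

```python
import numpy as np

def reinforce_update(logits, pair_index, reward, lr=0.1):
    """One REINFORCE step on a categorical over keypoint pairs.
    The gradient of reward * log p(pair) w.r.t. the logits is
    reward * (one_hot(pair) - softmax(logits)); we ascend it."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = -probs * reward          # -reward * softmax(logits)
    grad[pair_index] += reward      # +reward at the chosen pair
    return logits + lr * grad
```

Repeatedly rewarding one pair shifts probability mass toward it, which is exactly the behaviour we want from the keypoint-selection head.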

### Results

So, how well does it work in practice?

Quantitatively, GIFT beats baseline methods on all three tasks, and qualitatively the choices of grasp and interaction points usually match task semantics and agree with the choices of a human oracle.

![Quantitative results.]({{ 'gift-table.svg' | prepend: '/assets/img/blog/gift-jul21/' | prepend: site.baseurl | prepend: site.url }}){: width="70%"}

*Table 1: GIFT outperforms baselines on all tasks and matches a human oracle on two of three tasks using novel tools. Reward is normalized with respect to the human oracle.*

<div style="width: 100%; height: 0px; position: relative; padding-bottom: 56.250%;"><iframe src="https://streamable.com/e/6l3jwp" frameborder="0" width="100%" height="100%" allowfullscreen style="width: 100%; height: 100%; position: absolute;"></iframe></div>

*Video 1: Hammering a peg with tools from the holdout set. Low grasp points increase leverage (providing greater strike point velocity and peg impulse for a given joint velocity). Strike points on the hard metallic head allow accumulated kinetic energy to be rapidly transferred to the peg for maximum impulse.*

<div style="width: 100%; height: 0px; position: relative; padding-bottom: 56.250%;"><iframe src="https://streamable.com/e/sehup0" frameborder="0" width="100%" height="100%" allowfullscreen style="width: 100%; height: 100%; position: absolute;"></iframe></div>

*Video 2: Hooking a thermos with tools from the holdout set. Our original intention was to constrain the task such that only the right angle between head and handle could be used to hook. In fact, the motion planner finds other solutions.*

<div style="width: 100%; height: 0px; position: relative; padding-bottom: 56.250%;"><iframe src="https://streamable.com/e/ciwknw" frameborder="0" width="100%" height="100%" allowfullscreen style="width: 100%; height: 100%; position: absolute;"></iframe></div>

*Video 3: Constrained reaching with tools from the holdout set. To manipulate an object on the other side of a wall, the tool must fit through the hole and reach the target object.*
### Future directions

By grounding our affordance representation in contact data and sampling trajectories from the full action space, we are able to discover unbiased affordances for each task without human labels. There are, however, important limitations to our method that we think future work could address.

Our sparse keypoints are stored as raw locations, without any additional property encodings. A natural extension to our method would encode additional local information at each keypoint. This could capture local geometric or material properties and allow for more robust reasoning about the relationship between materials, fine-grained geometry and task requirements.

Our motion sampling routine depends on access to simulatable dynamics. This makes it non-trivial to transfer our results to real robots. We plan to experiment with recovering 3D models from visual observations, so we can leverage our learned representation to plan in simulation and execute on a real robot.

For details on the background, implementation and results, read the full paper [here](https://arxiv.org/abs/2106.14973) and watch the [presentation](https://streamable.com/eylzdj).
