SPRINT: Scalable Policy Pre-Training via Language Instruction Relabeling

1University of Southern California, 2KAIST


Abstract

We propose SPRINT, a scalable offline policy pre-training approach that substantially reduces the human effort needed to pre-train a diverse set of skills, by relabeling instructions with large language models (LLMs) and chaining skills with an offline RL objective.



SPRINT:
Scalable Pre-training via Relabeling Language INsTructions

SPRINT equips policies with a diverse repertoire of skills via language-instruction-conditioned offline RL: given a natural language task description z, the policy π(a|s, z) is rewarded for successfully executing the instruction.
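As a concrete illustration, here is a minimal sketch of the instruction-conditioned policy interface in Python; the network sizes, the instruction embedding, and all names are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class InstructionConditionedPolicy(nn.Module):
    """pi(a | s, z): a policy conditioned on a language instruction embedding."""

    def __init__(self, state_dim: int, action_dim: int, lang_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + lang_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state: torch.Tensor, instruction_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate the state with a fixed-size embedding of the instruction z
        # (e.g., from a frozen sentence encoder) and predict the action output.
        return self.net(torch.cat([state, instruction_emb], dim=-1))

In the same spirit, the Q-function used for offline RL would also condition on the instruction embedding.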

Method

(Method overview figure.)

SPRINT introduces two approaches for increasing the scale and diversity of the pre-training task instructions without requiring additional costly human inputs. SPRINT pre-trains policies on the combined set of tasks and thereby equips them with a richer skill repertoire.

1. Hindsight Language Labels (left)

We assume the dataset provides a base set of language-annotated sub-trajectories which we can directly train on with offline RL and a sparse goal-reaching reward.
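A minimal sketch of this relabeling step follows; the transition fields and function name are assumptions for illustration, not the dataset's actual schema.

def relabel_subtrajectory(subtraj, instruction):
    """Attach an instruction and a sparse goal-reaching reward to a sub-trajectory.

    `subtraj` is assumed to hold `states` (length T + 1) and `actions` (length T).
    The reward is 1 only on the final transition, where the instruction is achieved.
    """
    transitions = []
    num_steps = len(subtraj["actions"])
    for t in range(num_steps):
        transitions.append({
            "state": subtraj["states"][t],
            "action": subtraj["actions"][t],
            "next_state": subtraj["states"][t + 1],
            "instruction": instruction,
            "reward": 1.0 if t == num_steps - 1 else 0.0,
            "done": t == num_steps - 1,
        })
    return transitions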

2. In-Trajectory Skill Aggregation (middle)

SPRINT leverages pre-trained, large language models to aggregate consecutive instructions into new tasks. These new, longer-horizon trajectories are also relabeled with sparse goal-reaching rewards.
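Below is a sketch of this aggregation step; the prompt wording and the generic `llm` callable are placeholders rather than the exact prompt used in the paper.

from typing import Callable, List

def aggregate_instructions(instructions: List[str], llm: Callable[[str], str]) -> str:
    """Ask an LLM to summarize consecutive sub-task instructions into one task."""
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(instructions))
    prompt = (
        "Summarize the following sequence of instructions into one short, "
        "high-level task description:\n" + numbered + "\nSummary:"
    )
    return llm(prompt).strip()

The aggregated instruction then labels the concatenated sub-trajectories, again with a sparse reward at the final transition.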

3. Cross-Trajectory Skill Chaining (right)

SPRINT introduces an offline RL skill-chaining objective that generates novel instruction chains across different trajectories. We label the rewards of these new trajectories using a combination of the concurrently trained Q-function and sparse goal-reaching rewards.
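A rough sketch of how such a chained trajectory could be labeled, reusing `relabel_subtrajectory` from above; the exact objective in the paper differs, and the Q-function and policy interfaces here are assumptions.

import torch

def label_chained_segments(seg_a, seg_b, chained_instruction,
                           q_function, policy, gamma=0.99):
    """Chain segment A (from one trajectory) with segment B (from another).

    Segment B keeps the sparse goal-reaching reward. The last transition of
    segment A is instead labeled with a value bootstrapped from the concurrently
    trained Q-function at segment B's first state, so value can propagate across
    the artificial seam between the two trajectories.
    """
    seg_a_transitions = relabel_subtrajectory(seg_a, chained_instruction)
    seg_b_transitions = relabel_subtrajectory(seg_b, chained_instruction)

    with torch.no_grad():
        s_b0 = seg_b["states"][0]
        a_b0 = policy(s_b0, chained_instruction)            # assumed interface
        bootstrap_value = q_function(s_b0, a_b0, chained_instruction)

    seg_a_transitions[-1]["reward"] = gamma * float(bootstrap_value)
    seg_a_transitions[-1]["done"] = False  # the episode continues into segment B
    return seg_a_transitions + seg_b_transitions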



Environments

(Environments overview figure.)

We evaluate SPRINT in two domains that require learning complex, long-horizon behaviors from sparse rewards. For both environments, we have a large-scale dataset of task and primitive-skill instructions with demonstrations.
(a) ALFRED: ALFRED provides a rich set of long-horizon, meaningful tasks and a dataset of 6.6k language-annotated demonstrations across a variety of realistic household floor plans.
(b) Real World Kitchen Manipulation: We also evaluate on a real-world tabletop manipulation environment with a Jaco robot arm. This setup resembles a household kitchen with a variety of objects and realistic tasks to accomplish.


Results

ALFRED

Zero-shot Evaluation

We evaluate the pre-trained policies zero-shot on seen tasks.

Episodic Transformer

Completes 6/8 sub-tasks.

Actionable Models

Completes 0/8 sub-tasks.

SPRINT

Completes 8/8 sub-tasks.

Task: "Throw away a microwaved slice of potato."


SPRINT outperforms the baselines on zero-shot evaluation. SPRINT completes all 8 sub-tasks, while Episodic Transformer, the state-of-the-art imitation learning baseline, completes only 6. The offline RL baseline, Actionable Models, does not complete any sub-tasks.




Online RL Finetuning

We fine-tune the pre-trained policies with online RL for 50,000 timesteps.


Episodic Transformer

Completes 1/3 sub-tasks.

Actionable Models

Completes 0/3 sub-tasks.

SPRINT

Completes 2/3 sub-tasks.

Task: "Put the chilled lettuce on the counter."


SPRINT also benefits from online RL fine-tuning on unseen tasks: it finishes 2/3 sub-tasks, while Episodic Transformer finishes only 1 sub-task and Actionable Models does not finish any.




Real World Kitchen Manipulation

Offline Finetuning

We perform offline fine-tuning on 25 demonstrations for each task after pre-training.

L-BC Composite

Completes 4/8 sub-tasks.

SPRINT

Completes 8/8 sub-tasks.

Task: "Serve milk in the bowl and butter and baked bread in the plate."


We compare SPRINT with L-BC Composite, the best-performing baseline. SPRINT completes all 8 sub-tasks, generalizing to this unseen task after fine-tuning on a limited number of demonstrations. L-BC Composite performs well on the first 4 sub-tasks but fails on the longer-horizon ones.



BibTeX

@misc{zhang2023sprint,
      title={SPRINT: Scalable Policy Pre-Training via Language Instruction Relabeling}, 
      author={Jesse Zhang and Karl Pertsch and Jiahui Zhang and Joseph J. Lim},
      year={2023},
      eprint={2306.11886},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
}