
LH-MTLC: Hierarchical Language Control

Updated 30 January 2026
  • LH-MTLC is a framework where agents interpret natural language to accomplish long-horizon, multi-step tasks by decomposing instructions into modular sub-goals.
  • It integrates hierarchical policies using CNN-based state encoders and LSTM decoders to generate language-based sub-goals, achieving high task completion rates with limited demonstrations.
  • The low-level controller leverages reinforcement learning with intrinsic rewards for efficient skill execution, enabling safe operations with human-in-the-loop oversight in sparse-reward environments.

Long-Horizon Multi-Task Language Control (LH-MTLC) is the problem of training agents—typically robotic or embodied artificial agents—to interpret natural language instructions specifying complex, multi-step tasks and to execute these instructions as long-horizon action sequences, often spanning dozens or more temporally and semantically distinct sub-goals. This area couples the challenge of language grounding with long-horizon planning and execution, requiring architectural solutions that integrate hierarchical reasoning, modular skill composition, and sample efficiency under sparse reward regimes. LH-MTLC benchmarks comprise environments with both long episode horizons and multi-task requirements, where agent performance is measured by its ability to execute end-to-end tasks as specified by high-level language, often with minimal human intervention or corrective supervision.

1. Formal Problem Definition and Hierarchical Decomposition

LH-MTLC is formally modeled as a Markov Decision Process (MDP) with an expansive state space $S$ (e.g., gridworld observations or raw robot sensor streams), a primitive action space $A$, and a time horizon $T \gg 10$, encompassing many compositional sub-goals before the agent receives any non-zero external reward. Rewards are typically extremely sparse, provided only upon complete success (e.g., the agent exits a maze, or the task is finished). To make such problems tractable, the task is decomposed hierarchically:

  • Sub-task Set $G$: A fixed discrete set of language-expressible sub-tasks, e.g., $G = \{\text{``open yellow door''}, \text{``pick up blue key''}, \ldots\}$.
  • Hierarchical Policy Pair $(\pi_H, \pi_L)$: The high-level policy $\pi_H: S \rightarrow \Delta(G)$ produces a sub-goal $g \in G$ every $H_l$ steps, while the low-level controller $\pi_L: S \times G \rightarrow \Delta(A)$ executes the sub-task $g$ for $H_l$ environment interactions.
  • Return Objective: For a trajectory $\tau = (s_0, g_0, a_0, \ldots, s_t, g_t, a_t, \ldots)$ and discount $\gamma$, maximize

J(πH,πL)=Eτ(πH,πL)[t=0Tγtr(st,at)],J(\pi_H, \pi_L) = \mathbb{E}_{\tau \sim (\pi_H, \pi_L)} \left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right],

with r(s,a)=1r(s, a) = 1 only upon final task completion (Prakash et al., 2021).

This factorization exploits the compositional structure of real-world tasks, leverages the interpretability and modularity of natural-language sub-goals, and enables more targeted credit assignment both at the sub-task and end-to-end levels.
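The factorization above can be sketched as a single control loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the stub `pi_high` and `pi_low` functions and the `env_step` interface are hypothetical stand-ins for the trained CNN/LSTM policies and the real environment.

```python
import random

# Hypothetical sub-goal vocabulary and horizon; the paper's G and H_l
# are environment-specific.
SUB_TASKS = ["open yellow door", "pick up blue key", "go to exit"]
H_L = 10       # fixed low-level horizon H_l
GAMMA = 0.99   # discount gamma

def pi_high(state):
    """High-level policy stub: map a state to a language sub-goal."""
    return random.choice(SUB_TASKS)

def pi_low(state, goal):
    """Low-level policy stub: map (state, sub-goal) to a primitive action."""
    return random.randrange(4)  # e.g. {left, right, forward, toggle}

def rollout(env_step, s0, T=100):
    """Run the (pi_H, pi_L) pair and accumulate the discounted return.

    env_step(s, a) -> (next_state, reward, done); the reward is sparse,
    i.e. r = 1 only on final task completion, matching the objective J.
    """
    s, ret = s0, 0.0
    for t in range(T):
        if t % H_L == 0:            # re-plan: sample a new sub-goal every H_l steps
            g = pi_high(s)
        a = pi_low(s, g)
        s, r, done = env_step(s, a)
        ret += (GAMMA ** t) * r     # contributes only at the sparse success step
        if done:
            break
    return ret
```

The loop makes the temporal abstraction concrete: the high-level policy is queried once per `H_L` low-level interactions, and the return depends on a single terminal reward.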

2. High-Level Policy Architecture and Language Sub-Goal Generation

The high-level policy is responsible for mapping the raw observation to interpretable sub-goals at an appropriate temporal abstraction. Architecturally, this is implemented via:

  • State Encoder: A deep convolutional encoder (e.g., 3-layer CNN) maps the current observation to a compact embedding $e_s$ (of dimension 512).
  • Instruction Decoder: An LSTM-based decoder (input 512, hidden 1024) generates a sub-goal token sequence within a tightly specified grammar.
  • Supervised Training Procedure: The policy is learned from $N$ demonstration pairs $\{(s_i, \ell_i)\}$, with $\ell_i$ the ground-truth language instruction, optimizing the cross-entropy loss

L_H = -\sum_{i=1}^N \log p_H(\ell_i \mid s_i; \theta_H).

  • Hierarchical Sampling: Every $H_l$ steps, $\pi_H$ samples a new sub-goal $g \sim p_H(g \mid s)$.

This layer is sample-efficient, mapping a relatively small demonstration set to high-reward hierarchical plans: moderate demonstration data (e.g., 500–1000 demos) is enough for the task completion rate (TC%) to reach up to 95% in domains such as the MiniGrid “4-Rooms” and “6-Rooms” navigation tasks (Prakash et al., 2021).
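The supervised objective $L_H$ above can be illustrated with a toy classifier. This is a deliberately simplified sketch: a tabular softmax over a small hypothetical sub-goal vocabulary stands in for the paper's CNN encoder and LSTM decoder, and `logit_table` is an assumed stand-in for the network's output.

```python
import math

# Hypothetical sub-goal vocabulary (the real grammar is environment-specific).
VOCAB = ["open yellow door", "pick up blue key", "go to exit"]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll_loss(logit_table, demos):
    """Cross-entropy loss L_H = -sum_i log p_H(ell_i | s_i).

    logit_table: state -> list of logits over VOCAB (stand-in for the
    encoder/decoder network); demos: list of (state, instruction) pairs.
    """
    loss = 0.0
    for s, ell in demos:
        p = softmax(logit_table[s])
        loss -= math.log(p[VOCAB.index(ell)])
    return loss
```

Minimizing this loss over the demonstration pairs is exactly the supervised training step described above; in the real system the gradient flows into the CNN/LSTM parameters $\theta_H$ rather than a lookup table.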

3. Low-Level Controller Design and Learning

The low-level controller executes primitive actions to achieve the language-conditioned sub-goal within a fixed horizon $H_l$:

  • Low-Level Network: Combines a visual encoder (512-dim) with a language-encoded sub-goal (via LSTM, 512-dim), concatenated and processed through two fully connected layers to output logits over primitive actions.
  • Reinforcement Learning Algorithm: Trained using Proximal Policy Optimization (PPO) with sub-task-specific intrinsic rewards: $r_g(s, a) = +1$ if the sub-goal $g$ is completed within $H_l$ steps, $0$ otherwise.
  • Multi-Task Pre-Training: The controller is exposed to all primitive sub-goals in isolation, learning to solve each to completion, thus creating a library of reusable skills.

This design ensures that the low-level policy generalizes across sub-goals with shared visual and linguistic structure and aligns its reward structure with hierarchical credit signals.
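The intrinsic reward $r_g$ is simple enough to state directly in code. A minimal sketch, assuming a hypothetical environment predicate `check_goal(state, goal)` that reports sub-goal completion:

```python
H_L = 10  # fixed low-level horizon H_l

def intrinsic_reward(check_goal, trajectory, goal):
    """Sub-task intrinsic reward: +1 iff the sub-goal is achieved
    within the first H_L steps of the trajectory, 0 otherwise.

    trajectory: list of (state, action) pairs produced by pi_L;
    check_goal: hypothetical predicate (state, goal) -> bool.
    """
    for state, action in trajectory[:H_L]:   # only the H_l-step window counts
        if check_goal(state, goal):
            return 1.0
    return 0.0
```

During multi-task pre-training, each sub-goal in $G$ is sampled in isolation and this reward drives PPO updates, so the resulting controller forms a library of reusable, language-indexed skills.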

4. Training Pipeline and Human-in-the-Loop Supervision

The training pipeline is staged for maximal sample efficiency and safety:

  1. Low-Level Pre-Training: The environment is instrumented so that each sub-goal can be sampled independently, and the low-level policy is trained with PPO for approximately 3 million steps.
  2. High-Level Supervised Training: A demonstration set $D_{\mathrm{demo}} = \{(s_i, \ell_i)\}$ (usually 50–1000 labeled examples) is collected, and the high-level planner is trained to convergence.
  3. Hierarchical Integration: At runtime, the full system cycles every $H_l$ steps: $\pi_H$ issues a new language sub-goal, and $\pi_L$ executes until the sub-goal is completed or the horizon is exhausted.
  4. Optional Human Oversight: A human expert can intercede to replace erroneous high-level sub-goals. The modular structure allows such intervention at the language-command level, providing strong safety and debugging affordances.

Credit assignment is disentangled between the high-level planner (via sub-task completion) and the low-level controller (via the within-$H_l$ reward), while both losses ($L_H$, $L_L$) are optimized independently, with no end-to-end RL necessary.
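Steps 3 and 4 of the pipeline can be sketched as a runtime loop with an optional human override of the proposed sub-goal. This is a schematic under assumed interfaces: `env`, `pi_high`, `pi_low`, and the `human` callback are hypothetical stand-ins for the trained modules and the oversight channel.

```python
H_L = 10  # fixed low-level horizon H_l

def run_episode(env, pi_high, pi_low, human=None, max_steps=200):
    """Cycle the hierarchy: every H_L steps pi_high proposes a language
    sub-goal; if a human overseer is present, they may replace it before
    pi_low executes. Returns the number of human interventions (the
    'HI' metric counts these per episode).
    """
    s, interventions = env.reset(), 0
    for t in range(max_steps):
        if t % H_L == 0:
            g = pi_high(s)
            if human is not None:
                corrected = human(s, g)   # human may substitute a sub-goal
                if corrected != g:
                    interventions += 1
                    g = corrected
        s, done = env.step(pi_low(s, g))  # low-level primitive action
        if done:
            break
    return interventions
```

Because the override happens at the language-command boundary, a corrected episode requires no retraining: the low-level skill library executes the substituted sub-goal exactly as it would a planner-issued one.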

5. Task Suites, Metrics, and Baseline Performance

Evaluation is standardized on structured long-horizon, sparse-reward benchmarks such as MiniGrid:

  • Environments: “4-Rooms” (sequential object/key/door interactions before reaching exit), “6-Rooms” (increased complexity, more objects/sub-rooms).
  • Key Metrics:
    • TC%: Percentage of episodes fully completed without any human help.
    • Avg. HI: Average number of human interventions required per episode.
  • Result Table:
| Demos | 4-Rooms TC% | 4-Rooms HI | 6-Rooms TC% | 6-Rooms HI |
|-------|-------------|------------|-------------|------------|
| 50    | 30%         | 5.9        | 15%         | 7.7        |
| 100   | 55%         | 4.85       | 30%         | 6.1        |
| 500   | 90%         | 1.05       | 75%         | 3.85       |
| 1000  | 95%         | 0.5        | 90%         | 1.26       |

Comparison to flat RL baselines highlights the gains from hierarchical + language decomposition:

| Method                 | 4-Rooms TC% | 6-Rooms TC% |
|------------------------|-------------|-------------|
| Flat RL (sparse)       | 30%         | 15%         |
| Flat RL (dense)        | 70%         | 53%         |
| Hierarchy (500 demos)  | 90%         | 75%         |
| Hierarchy (1000 demos) | 95%         | 90%         |

With fewer than 250 demonstrations, performance drops sharply; above 500, the high-level planner exceeds 90% accuracy (Prakash et al., 2021).
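The two metrics are straightforward to compute from episode logs. A small sketch, assuming a hypothetical log format of `(completed, n_interventions)` pairs and taking TC% as the fraction of episodes completed without any human help, per the definition above:

```python
def evaluate(episodes):
    """Compute (TC%, Avg. HI) from episode logs.

    episodes: list of (completed, n_interventions) tuples, where
    `completed` is True iff the full task succeeded end-to-end.
    TC% counts only episodes that succeeded with zero interventions;
    Avg. HI averages interventions over all episodes.
    """
    n = len(episodes)
    tc = 100.0 * sum(1 for done, hi in episodes if done and hi == 0) / n
    avg_hi = sum(hi for _, hi in episodes) / n
    return tc, avg_hi
```

For example, a log of four episodes where two succeed unaided, one succeeds after two corrections, and one fails after one correction yields TC% = 50 and Avg. HI = 0.75.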

6. Benefits, Limitations, and Design Insights

Benefits:

  • Sample efficiency: Learning reusable sub-goal policies enables rapid composition of new long-horizon plans.
  • Interpretability: All sub-goals are output as explicit language utterances, facilitating logging, debugging, and human intervention.
  • Modularity: The set $G$ of sub-goals can be extended or reconfigured to handle new environments or object colorings without retraining the full hierarchy.

Limitations:

  • Fixed low-level horizon $H_l = 10$: The approach does not learn when to terminate sub-goals or adaptively choose horizons.
  • Grammar-constrained sub-goals: Cannot handle free-form or open-domain language at the high level; requires a manually specified sub-goal vocabulary.
  • Training is not end-to-end: Each module is trained on its own loss; improvements in one may not propagate to the other when tasks shift.

Human-in-the-loop ablation: A human overseer can reduce failures to near zero, requiring only 1–2 interventions per episode when ≥500 training demonstrations are supplied, demonstrating strong safety margins for real-world operation.

7. Connections to the Broader LH-MTLC Landscape

This hierarchical paradigm exemplifies core principles of LH-MTLC with modular separation between language-based planning and low-level motor execution, each grounded in distinct learning regimes (supervised imitation for high level, RL for low level). It anticipates the scaling and interpretability challenges manifest in more recent large-scale environments (e.g., CALVIN (Mees et al., 2021), LHManip (Ceola et al., 2023)) and multi-agent extensions (LaMMA-P (Zhang et al., 2024)), and provides foundational mechanisms (language-based sub-goaling, human correction interface, modular value assignment) which recur in state-of-the-art LH-MTLC systems. The explicit structuring of credit, minimal demonstration requirements for high-level guidance, and simple integration of human feedback continue to be key advantages and distinguishing features of this class of systems (Prakash et al., 2021).
