THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation
|
Statement
Drawing on insights from the paper, the task of evaluating generalization in robotic manipulation directly impacts future robot applications and research by emphasizing the critical need for robots to adapt to dynamic, real-world environments. The findings underscore the importance of developing robotic systems that perform reliably across a wide range of environmental conditions and tasks, which is essential for practical deployment and should precede attempts to scale up robot learning. Therefore, the future direction for skill development in robotics should extend beyond merely accumulating or developing new skills to critically assessing the generalization and robustness of the skills already acquired.
|
KinScene: Model-Based Mobile Manipulation of Articulated Scenes
|
Statement
In long-term manipulation tasks involving articulated objects, the robot must reason about the resultant motion of each part and anticipate its impact on future actions. We highlight three key challenges:
1. Exploration: To understand the objects' articulation properties, the robot must be able to self-explore and collect the clues necessary for inference.
2. Scene-level Articulation Mapping: The map should encompass essential information, including 3D models and the objects' articulation properties, to facilitate long-horizon planning for sequential tasks.
3. Manipulation and Planning: To accommodate various objects of different sizes and positions, the robot must strategically plan the actions of both its mobile base and arm to execute sequential tasks while considering the kinematic constraints.
Each task represents a distinct research challenge, driving current robotics research to focus on individual components within the robotic pipeline. While progress has been made in specific aspects of articulation manipulation, there is a notable gap in research that integrates these capabilities for real-world evaluation. Our proposed framework fills this gap by providing a generalized approach to sequential articulation manipulation through autonomous exploration, scene mapping, object articulation detection, and sequential object manipulation planning.
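To make the interplay among the three stages concrete, below is a minimal Python sketch of how exploration, scene-level articulation mapping, and base-plus-arm planning might compose into one sequential pipeline. All class and function names are illustrative placeholders, not KinScene's actual interfaces.

```python
from dataclasses import dataclass, field

# Illustrative placeholders only; names and structure are assumptions,
# not KinScene's actual interfaces.

@dataclass
class ArticulatedObject:
    name: str
    joint_type: str           # e.g. "revolute" or "prismatic"
    joint_axis: tuple         # axis inferred from self-exploration
    joint_state: float = 0.0  # current opening angle or displacement

@dataclass
class SceneMap:
    objects: list = field(default_factory=list)  # 3D models + articulation parameters

def explore_and_map(scene: SceneMap) -> SceneMap:
    """Stages 1-2: interact with parts, collect motion cues, store inferred articulations."""
    scene.objects.append(ArticulatedObject("cabinet_door", "revolute", (0.0, 0.0, 1.0)))
    return scene

def plan_base_and_arm(obj: ArticulatedObject, target_state: float) -> list:
    """Stage 3: plan mobile-base and arm actions subject to the object's kinematic constraint."""
    return [("move_base_near", obj.name),
            ("grasp_handle", obj.name),
            ("follow_articulation", obj.joint_type, target_state)]

# Sequential long-horizon task: open every mapped articulated object.
scene = explore_and_map(SceneMap())
for obj in scene.objects:
    print(obj.name, plan_base_and_arm(obj, target_state=1.2))
```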
|
RoboPack: Learning Tactile-Informed Dynamics Models for Dense Packing
|
Statement
In our research, we focus on utilizing non-prehensile box pushing and dense object packing tasks as exemplars to illustrate how robots can adeptly manipulate objects with uncertain physical properties, employing both visual and tactile sensing modalities. These tasks pose significant challenges as the robot must accurately estimate the physical attributes of objects, such as mass distribution and deformability, to accomplish the desired objectives. To address these challenges, we propose a novel approach for multimodal fusion aimed at enhancing world model learning. By pushing the boundaries of what tasks robots can accomplish, our work signifies a crucial advancement in the field of robotic manipulation.
Looking ahead, we believe that the integration of structured world model learning with model-based planning, alongside the strategic utilization of multimodal sensing, holds immense promise for future robot manipulation research. The integration of structured world model learning and model-based planning provides a foundation for creating interpretable and generalizable learning systems, essential for navigating the complexities of real-world manipulation tasks. Similarly, leveraging multimodal sensing enables robots to tackle challenging manipulation tasks that are beyond the scope of single-modal sensing, particularly when visual observations are impeded by occlusions or environmental factors.
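As a concrete illustration of coupling a learned world model with model-based planning, here is a hedged random-shooting MPC sketch in Python. The `dynamics` function stands in for a world model trained on fused visual and tactile observations; the toy linear update, cost function, and hyperparameters are assumptions for illustration and are not RoboPack's actual model or planner.

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamics(state, action):
    # Placeholder for a learned world model trained on visual + tactile data.
    return state + 0.1 * action

def cost(state, goal):
    return np.linalg.norm(state - goal)

def plan(state, goal, horizon=10, n_samples=256):
    """Sample action sequences, roll them out through the model, pick the best."""
    best_seq, best_cost = None, np.inf
    for _ in range(n_samples):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, state.shape[0]))
        s, c = state.copy(), 0.0
        for a in seq:
            s = dynamics(s, a)
            c += cost(s, goal)
        if c < best_cost:
            best_seq, best_cost = seq, c
    return best_seq[0]  # execute only the first action (receding horizon)

state, goal = np.zeros(3), np.array([0.5, -0.2, 0.3])
for t in range(20):
    state = dynamics(state, plan(state, goal))
print("final distance to goal:", cost(state, goal))
```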
Our proposed framework serves as a tangible demonstration of the synergistic benefits of integrating structured world model learning, model-based planning, and multimodal sensing. By showcasing its effectiveness in addressing demanding manipulation tasks, such as non-prehensile manipulation and manipulation in cluttered environments, we offer a roadmap for future research endeavors. This roadmap emphasizes the importance of expanding robots' capabilities through the harmonious integration of structured world model learning and multimodal sensing, thereby paving the way for the development of more versatile and adept manipulation systems.
|
Bilateral Control-Based Imitation Learning via Action Chunking with Transformer
|
Statement
In this project, Bilateral Control-Based Imitation Learning via Action Chunking with Transformer (Bi-ACT), we utilized a real-world 'Pick-and-Place' task on various objects to evaluate the robot's adaptability and precision. The robot was tasked with manipulating objects with diverse sizes, shapes, consistencies, weight distributions, and hardness levels. This approach allowed us to observe how the robot applied learning from a limited set of training data to manage untrained objects adaptively. Successful manipulation of both trained and untrained objects, as seen across interpolation and extrapolation datasets, demonstrates the robot's capability to generalize its learning. This indicates that the Bi-ACT model could be extended to a broader range of tasks beyond its initial training scope.
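Since the project centers on action chunking, the sketch below illustrates the generic execution pattern: the policy predicts a short chunk of future actions from the current observation, and the robot executes the whole chunk before re-querying. The random policy, chunk size, and dimensions are placeholders, not Bi-ACT's trained transformer or its bilateral control signals.

```python
import numpy as np

CHUNK_SIZE = 8   # number of future actions predicted per query (assumed value)
ACTION_DIM = 7   # e.g. joint commands for one arm (assumed value)

def policy(observation: np.ndarray) -> np.ndarray:
    """Stand-in for the trained transformer: maps an observation to a chunk of actions."""
    rng = np.random.default_rng(abs(int(observation.sum() * 1e3)) % (2**32))
    return rng.normal(size=(CHUNK_SIZE, ACTION_DIM))

observation = np.zeros(16)
for cycle in range(3):                       # a few control cycles
    chunk = policy(observation)
    for action in chunk:                     # execute the chunk open-loop
        observation = observation + 0.01     # placeholder for stepping the real robot
    print(f"cycle {cycle}: executed {len(chunk)} actions")
```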
Based on the insights gleaned from the Bi-ACT project, our roadmap for future robot manipulation research focuses on enhancing the robot's adaptability and precision. In terms of adaptability, similar to the current 'Pick-and-Place' task, we aim to enable robots to generalize simple tasks, extending their capabilities from varying the objects to modifying the placement locations. Since complex tasks are amalgamations of simpler ones, perfecting these fundamental tasks is crucial for the successful completion of more complex and intricate tasks. Selecting appropriate training data will allow the robot to tackle tasks in a generalized manner and adapt seamlessly to different environments. To enhance the robot's precision, we are committed to refining the model and its preprocessing mechanisms, with a specific emphasis on image data. Tackling challenges such as object recognition will enable the robot to accurately identify and concentrate on the most salient features within the expansive datasets available. Moreover, considering the dynamic nature of real-world environments and the inevitability of errors, it is essential for the robot to have the capability to assess the current state of a task and, if necessary, reinitiate the process to achieve the desired outcome. This strategic direction will guide our research towards robots that can understand the surrounding environments and act accordingly, leading to enhanced precision and contributing significantly to the advancement of robot manipulation research.
|
DITTO: Demonstration Imitation by Trajectory Transformation
|
Statement
One-shot learning from human demonstrations is an important task for robotic manipulation, as, compared to robotic demonstrations, it presents a pathway to utilizing large and diverse datasets of human videos.
I would like to distinguish between algorithmic taskonomy, describing the various pre-training and auxiliary losses, and functional taskonomy (for lack of a better word), describing the variety of robotic tasks that are addressed by the algorithm and tested during the experimental evaluation.
Our algorithmic roadmap is to learn object-centric trajectories from more unstructured, in-the-wild videos. In terms of functional tasks, the one-shot imitation approach gives us a lot of flexibility to address a wide range of tasks; however, we would like to look at improving the manipulation of articulated objects. For this, we are looking at making execution more robust.
|
ScrewMimic: Bimanual Imitation from Human Videos with Screw Space Projection (Remote)
|
Statement
Manipulation in human environments often involves tasks that require coordinating the motion of two arms, e.g., opening a bottle, cutting a block in two pieces, or stirring a pot. Given that the general-purpose robots that we want are going to operate in an environment that has been designed for humans and with objects that are being used by humans, endowing robots with such bimanual manipulation capabilities is crucial. Our work is a promising step towards enabling robots to efficiently learn such complex bimanual manipulation tasks by watching human demonstrations and fine-tuning the actions through interaction.
To build a taskonomy roadmap, future robot manipulation research should study which tasks humans perform most frequently and which matter most. One way to do that is to analyze large-scale human video datasets collected in different environments to understand the distribution of tasks. Furthermore, a lot of complex manipulation today still takes place in relatively constrained environments. In my mind, we are slowly moving towards truly in-the-wild pick-and-place for robots; for other complex tasks, in-the-wild capabilities are still much farther away, and future research should study how to remove these constraints to enable truly in-the-wild manipulation.
|
GILD: Generalizable Imitation Learning with 3D Semantic Fields (Remote)
|
Statement
Imitation learning has shown remarkable capability in executing complex robotic manipulation tasks. However, existing frameworks often fall short in structured modeling of the environment, lacking explicit characterization of geometry and semantics, which limits their ability to generalize to unseen objects and layouts. To enhance the generalization capabilities of imitation learning agents, we introduce a novel framework in this work, incorporating explicit spatial and semantic information via 3D semantic fields. We begin by generating 3D descriptor fields from multi-view RGBD observations with the help of large foundational vision models. These high-dimensional descriptor fields are then converted into low-dimensional semantic fields, which aids in the efficient training of a diffusion-based imitation learning policy. The proposed method offers explicit consideration of geometry and semantics, enabling strong generalization capabilities in tasks that require category-level generalization, resolving geometric ambiguities, and attention to subtle geometric details. We evaluate our method across eight tasks involving articulated objects and instances with varying shapes and textures from multiple object categories. Our method proves its effectiveness by outperforming state-of-the-art imitation learning baselines on unseen testing instances by 57%. Additionally, we provide a detailed analysis and visualization to interpret the sources of performance gain and explain how our method can generalize to novel instances.
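The pipeline above goes from high-dimensional descriptor fields to low-dimensional semantic fields. The sketch below illustrates one way that reduction step could look, using PCA over per-point descriptors; the dimensions, the random placeholder descriptors, and the choice of PCA are assumptions for illustration and may not match the paper's actual reduction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, descriptor_dim, semantic_dim = 2048, 768, 8

# Placeholder for per-point descriptors fused from multi-view RGBD observations
# and a vision foundation model (random here, purely for illustration).
descriptors = rng.normal(size=(n_points, descriptor_dim))

# PCA via SVD: project each point's descriptor onto the top principal directions.
centered = descriptors - descriptors.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
semantic_field = centered @ vt[:semantic_dim].T   # (n_points, semantic_dim)

print(semantic_field.shape)  # low-dimensional features fed to the diffusion policy
```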
|
D3Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation (Remote)
|
Statement
We propose a representation using foundation vision models to facilitate zero-shot manipulation tasks.
|
Robot Air Hockey: A Manipulation Testbed for Robot Learning with Reinforcement Learning
|
Statement
Robot manipulation has made tremendous progress in recent years. By creating Robot Air Hockey, we aim to push the robot manipulation research frontier in two major ways.
First, we explore the setting of dynamic object manipulation, which is understudied in the real world relative to quasistatic manipulation because of its many challenges. Compared to other dynamic robotics tasks (e.g., learning quadrupedal policies), data for dynamic manipulation is difficult to collect: the low-latency requirements and the frequent (often human) resets these tasks require bottleneck both the quality and the quantity of data. We hope that our system serves as a testbed and provides insights for future work addressing some of these challenges. Two insights we observed from building a dynamic manipulation system are: 1) maintaining low latency is challenging, and 2) even at a relatively high control rate (20 Hz, i.e., a 50 ms period), humans can struggle to play. We hope to follow up with experiments investigating human capabilities across different modalities of teleoperation. Furthermore, one of the most significant hurdles is the robot emergency-stopping, either because a learned policy takes cyclic actions that excite damaging resonant behavior, or because the robot jerks too quickly when changing direction and exceeds its acceleration limits. Further investigation into action smoothing could yield higher-quality human demonstration data and offline policies.
Furthermore, recent work in robot learning has mostly focused on learning from demonstrations, where the data provided are usually optimal trajectories from humans or internet-scale data used to train representations. This optimality assumption is safe in quasistatic or low interaction environments, but in dynamic object manipulation, human teleoperators can lack the skill necessary to provide high-quality demonstrations. Robot air hockey considers one type of data that is relatively understudied: in-domain low-reward interaction data. Currently, we have collected robot data via teleoperation, and we are also planning to collect robot interaction data autonomously soon. As a result, this will open up avenues for many algorithms that are not limited to learning from demonstrations. For instance, our system allows us to assess RL algorithms and offers the opportunity to assess many forms of RL, such as goal-conditioned, offline or sim-to-real methods, to name a few. Because of the suboptimality of demonstration data, RL is an ideal tool for this setting, as suggested by the results in this work where offline RL outperforms other learning methods such as behavior cloning. Since related work assessing RL directly on a physical robot, especially in a dynamic manipulation setting, is limited, this work offers offline RL assessment in a dynamic, interactive real-world set of tasks.
Our roadmap for future robot manipulation research is to learn from suboptimal interaction data and solve dynamic manipulation tasks in the real world. Current approaches have achieved impressive performance on quasistatic manipulation tasks, thanks to their ability to harness optimal demonstration data and understand the task dynamics. Doing the same for dynamic manipulation is much more difficult: even as models and datasets continue to scale, in-domain data for dynamic tasks remains hard to collect, and even when it can be collected, it is rarely optimal, let alone accompanied by an accurate dynamics model of the task. Therefore, we believe this system is a powerful testbed for applying RL in the real world, where we envision achieving super-human performance on challenging dynamic manipulation tasks.
|
LocoMan: Advancing Versatile Quadrupedal Dexterity with Lightweight Loco-Manipulators
|
Statement
In this work, we introduce LocoMan, an innovative quadrupedal robot equipped with a dexterous manipulator, designed for performing versatile tasks in a variety of constrained environments. Our design includes a lightweight 3-DoF (Degrees of Freedom) manipulator that is mounted on the robot's leg, coupled with a comprehensive whole-body controller that accommodates all five operating modes. Through experiments, we demonstrate the potential of LocoMan in performing real-life tasks. Specifically, LocoMan achieves a diverse set of challenging dexterous loco-manipulation tasks in confined spaces, such as opening doors, plugging into sockets, picking objects in narrow and low-lying spaces, and bimanual manipulation. Looking ahead, we aim to enhance LocoMan's autonomy by incorporating learning-based approaches.
|
Physics-informed Neural Motion Planning on Constraint Manifolds
|
Statement
Constrained motion planning plays an important role in robot manipulation tasks. For example, opening doors and carrying a glass filled with water both require robots moving on a zero-volume constraint manifold in the configuration space. We propose to use neural networks to solve a partial differential equation to find the geodesic path on the constraint manifold. This approach provides an efficient motion planning strategy and it does not require demonstration paths for learning, thereby enhancing the robot's capacity for manipulation.
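For readers unfamiliar with this family of methods, the physics-informed formulation is typically an Eikonal-type PDE whose solution encodes geodesic arrival times; the notation below is an illustrative sketch under that assumption and may differ from the paper's exact formulation. Here q_s is the start configuration, f(q) = 0 defines the constraint manifold, and J_f is its Jacobian.

```latex
% Illustrative Eikonal-type formulation (notation assumed, not copied from the paper):
% a network T_\theta approximates geodesic arrival time on the constraint manifold,
% and a path is recovered by following its gradient projected onto the tangent space.
\[
  \big\lVert \nabla_q T_\theta(q_s, q) \big\rVert \, S(q) = 1,
  \qquad q \in \mathcal{M} = \{\, q : f(q) = 0 \,\},
  \qquad T_\theta(q_s, q_s) = 0,
\]
\[
  \dot{q} \;\propto\; -\, P_{T_q\mathcal{M}} \,\nabla_q T_\theta(q_s, q),
  \qquad
  P_{T_q\mathcal{M}} = I - J_f^{\top}\!\left(J_f J_f^{\top}\right)^{-1}\! J_f .
\]
```

Here S(q) is a (possibly constant) speed model, and the projection P keeps the gradient-following path on the zero-volume constraint manifold.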
|
Towards Diverse Behaviors: A Benchmark for Imitation Learning with Human Demonstrations
|
Statement
This paper, which focuses on imitation learning with diverse human demonstrations, carries significant implications for future robot applications and research in several ways:
Enhanced Adaptability: By addressing the challenge of multi-modal data distributions, the proposed approach enables robots to adapt more effectively to diverse human behaviors. This enhanced adaptability is crucial for real-world applications where robots need to interact with a wide range of users with varying behaviors and preferences.
Robustness and Versatility: The introduction of tractable metrics to quantify diversity provides a practical means to assess the robustness and versatility of imitation learning algorithms. This enables researchers to develop more reliable and adaptable robotic systems capable of handling diverse scenarios and environments.
Benchmarking and Evaluation: The creation of simulation benchmark environments and datasets specifically designed to evaluate a model's ability to learn multi-modal behavior serves as a valuable benchmark for assessing the performance of state-of-the-art methods. This facilitates comparative evaluations and enables researchers to identify areas for improvement in existing algorithms.
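To make the idea of tractable diversity metrics concrete, here is a hedged sketch of one simple candidate: the entropy of a policy's distribution over discrete behavior outcomes across repeated rollouts. This is only an illustration of the concept, not necessarily the metric defined in the benchmark.

```python
import numpy as np

def behavior_entropy(behavior_labels):
    """Entropy of the empirical distribution over discrete behavior outcomes."""
    _, counts = np.unique(behavior_labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

# A policy that always reproduces one demonstrated solution scores 0;
# a policy covering two demonstrated solutions equally scores log(2).
print(behavior_entropy(["left", "left", "left", "left"]))
print(behavior_entropy(["left", "right", "left", "right"]))
```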
|
Movement Primitive Diffusion: Learning Gentle Robotic Manipulation of Deformable Objects
|
Statement
Impact on Future Robot Applications and Research
Implementing Movement Primitive Diffusion (MPD) in robot-assisted surgery (RAS) marks a significant advancement in robotic manipulation, particularly for sensitive tasks like surgery. MPD, which combines diffusion-based imitation learning with probabilistic dynamic movement primitives for refined handling, will profoundly influence robotics in several ways:
1. Enhanced Motion Quality in Robotic Surgery: By focusing on gentle manipulation of deformable objects, MPD addresses a critical need in RAS for methods that are both data-efficient and capable of generating high-quality motion. This directly translates to improved outcomes in surgical procedures by reducing the risk of injury to surrounding tissues and increasing the quality of robotic movements.
2. Broadened Application in Robotic Handling of Deformable Objects: The methodology's success in RAS hints at its broader applicability across other sectors that require the manipulation of deformable or delicate objects, such as in cooking or laundry. MPD's versatility and efficiency can drive advancements in these areas by enabling robots to perform tasks previously considered too complex or delicate.
3. Advancement of Imitation Learning Techniques: MPD contributes to the evolution of imitation learning by demonstrating how combining diffusion-based methods with probabilistic dynamic movement primitives can lead to superior performance in terms of motion quality and data efficiency. This advancement enriches the toolkit available to researchers and developers in robotics, fostering further innovation in how robots learn and adapt to complex tasks.
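As background for the combination described above, the sketch below rolls out a basic discrete dynamic movement primitive, the building block that probabilistic dynamic movement primitives extend. The gains, canonical system, and zero forcing term are illustrative assumptions and do not reproduce MPD's actual parameterization.

```python
import numpy as np

def dmp_rollout(y0, goal, forcing, tau=1.0, alpha=25.0, beta=6.25,
                alpha_x=3.0, dt=0.01, steps=500):
    """Roll out a 1-D discrete DMP: tau*dz = alpha*(beta*(g - y) - z) + f(x), tau*dy = z."""
    y, z, x = y0, 0.0, 1.0
    traj = []
    for _ in range(steps):
        f = forcing(x) * x * (goal - y0)        # phase-gated forcing term
        z += dt * (alpha * (beta * (goal - y) - z) + f) / tau
        y += dt * z / tau
        x += dt * (-alpha_x * x) / tau          # canonical system decays the phase
        traj.append(y)
    return np.array(traj)

# With zero forcing the primitive is a smooth point-to-point motion to the goal;
# a learned (or, in MPD's case, diffused) forcing term shapes the path in between.
trajectory = dmp_rollout(y0=0.0, goal=1.0, forcing=lambda x: 0.0)
print(round(float(trajectory[-1]), 3))          # converges near 1.0
```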
Taskonomy Roadmap for Future Robot Manipulation Research
Building on the foundation laid by MPD, we see the following research opportunities:
1. Human-Robot Collaboration: Exploring the potential for MPD to enhance human-robot collaboration, where the robot could assist human operators by performing specific tasks under their guidance. This research would involve developing models that can understand and predict human intentions and actions to work seamlessly alongside human operators.
2. Exploration of non-RAS focused Tasks: The capabilities of MPD to interact with and manipulate deformable tissues should transfer to tasks outside of RAS. This research therefore would explore the potential of MPD in more everyday tasks like cooking and laundry.
3. Scalability to Complex and Multi-Step Manipulation Tasks: Expanding the scope of MPD to handle more complex and multi-step manipulation tasks is a natural progression. Research could focus on how MPD can be adapted or extended to perform tasks that involve multiple sequential manipulations or interactions with multiple objects.
|
Few-Shot Learning of Force-Based Motions From Demonstration Through Pre-training of Haptic Representation
|
Statement
Impact on Future Robot Applications and Research: Our proposed semi-supervised Learning from Demonstration (LfD) approach addresses the challenge of learning contact-rich deformable manipulation tasks from a limited number of demonstrations. We evaluated our approach on a wiping task, which serves as a force-based manipulation task where the motion depends on the physical properties of the wiping tool and the object being wiped. Unlike previous approaches that relied heavily on large amounts of demonstration data, our method decouples the learnt model into a haptic representation encoder and a motion generation decoder. This allows us to pre-train the encoder using unsupervised data and then use few-shot LfD to train the decoder, leveraging human expertise. Our work demonstrates that pre-training significantly improves the ability of the LfD model to recognise physical properties and generate desired wiping motions for unseen sponges, outperforming the LfD method without pre-training on the physical robot hardware using the KUKA iiwa robot arm. This work has significant implications for real-world applications such as cleaning and cooking tasks, as it offers a scalable solution for learning generalisable deformable manipulation skills with minimal human effort.
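The decoupling described above (a pre-trained haptic encoder plus a few-shot motion decoder) can be sketched as follows in PyTorch. The random tensors stand in for unsupervised haptic data and for the small set of demonstrations; the autoencoder-style pre-training objective, layer sizes, and optimizers are assumptions for illustration, not the paper's exact training setup.

```python
import torch
import torch.nn as nn

haptic_dim, latent_dim, motion_dim = 64, 16, 7

encoder = nn.Sequential(nn.Linear(haptic_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
recon_head = nn.Linear(latent_dim, haptic_dim)          # used only for pre-training
decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, motion_dim))

# Stage 1: unsupervised pre-training of the haptic representation.
opt = torch.optim.Adam(list(encoder.parameters()) + list(recon_head.parameters()), lr=1e-3)
for _ in range(200):
    haptics = torch.randn(32, haptic_dim)                # placeholder unsupervised data
    loss = nn.functional.mse_loss(recon_head(encoder(haptics)), haptics)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: few-shot LfD for the motion decoder, with the encoder frozen.
for p in encoder.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
demo_haptics = torch.randn(8, haptic_dim)                # a handful of demonstrations
demo_motions = torch.randn(8, motion_dim)
for _ in range(200):
    loss = nn.functional.mse_loss(decoder(encoder(demo_haptics)), demo_motions)
    opt.zero_grad(); loss.backward(); opt.step()
```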
Taskonomy Roadmap for Future Robot Manipulation Research: Looking ahead, our taskonomy roadmap for future robot manipulation research consists of three key directions. Firstly, we propose extending our current work to incorporate multi-modal sensing, including vision, tactile, force, and proximity sensing. This expansion will enable robots to perform a greater variety of tasks requiring adaptation of motions based on the varying properties of manipulated objects. Secondly, while our approach considers generalisation of manipulation tasks to various objects, we aim to explore generalisation along two axes: generalisation of a skill model to handle various objects, and generalisation of an object model to be shared across different tasks. This broader perspective will contribute to the development of more generalisable manipulation systems. Lastly, we intend to evaluate the influence of data informativeness and quantity on learning generalisable manipulation skills. In our approach, data informativeness increases from unsupervised simulated data to unsupervised real data to demonstration data, with a corresponding increase in the cost of data collection. Unsupervised real data eliminates the issue of the sim2real gap; however, it is still more expensive to collect than unsupervised simulated data. It is interesting to evaluate how the informativeness and quantity of each of these three types of data influence the learning of generalisable manipulation skills, and how the characteristics of each affect the robot's performance, in order to develop data collection strategies for more robust and adaptable manipulation systems.
|
Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals
|
Statement
The presented work contributes to the workshop's general goal of advancing sensorimotor skill learning for robot manipulation, particularly by addressing the critical challenge of scalability in imitation learning through the lens of partially labeled datasets. We believe that the future of skill learning requires more, and more diverse, training data. While language offers an intuitive interface for goal guidance, manually labeling all robot demonstrations is time-consuming and labor-intensive. Thus, we are interested in methods that improve learning from partially labeled datasets for better language-conditioned policy learning. We therefore collected uncurated robot play data, as it is cheap and easy to collect. We then only need to label a few sequences to learn a versatile language-conditioned policy effectively. We believe this approach is an effective way to scale our policies towards more versatile agents, given enough data.
By emphasizing the practical significance of learning from uncurated datasets with sparse language annotations, the paper proposes two self-supervised objectives for training policies that learn effectively from few language labels. Further, we introduce a novel imitation learning policy, the Multimodal Diffusion Transformer (MDT), that improves upon prior diffusion policy architectures and handles different goal modalities effectively. By demonstrating MDT's proficiency in learning versatile behavior from multimodal goals, including language and images, the work underscores the importance of efficiently utilizing available data resources to build more adaptable and robust manipulation robots. Through its empirical validation and comprehensive evaluation across popular imitation learning benchmarks such as LIBERO and CALVIN, the paper presents new insights that we believe are relevant for the workshop.
|
Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
|
Statement
The presented work directly contributes to the central theme of the workshop by addressing key challenges in efficient visuomotor skill learning for robot manipulation. Diffusion policies have recently gained widespread adoption as a policy representation for imitation learning. In contrast to other representations, they have shown a strong ability to generalize given enough training data, as seen in Octo. The current trend of training large policies on diverse data such as OXE demands a significant increase in computation, so we believe there is a need to explore efficient architectures. Related to this, one of the major drawbacks of diffusion policies remains the high computational cost of generating new actions over several denoising steps, and current diffusion policies typically contain several hundred million parameters. Our work explores novel architectures that make diffusion policies more computationally efficient to improve downstream skill learning. We therefore explore a Mixture-of-Experts architecture for diffusion policies, in which only a subset of the parameters is required during each forward pass. This allows us to train bigger architectures on large-scale datasets while keeping the policies efficient. We believe that efficiency is an important building block for scaling current policies towards more diverse skill learning, and our work contains insights that are interesting for the audience of this workshop.
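To illustrate the "only a subset of parameters per forward pass" idea, here is a hedged top-k Mixture-of-Experts layer in PyTorch; the sizes, the routing scheme, and the absence of load-balancing losses are simplifications that do not reproduce the paper's denoiser architecture.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Minimal top-k MoE layer: a router selects k experts per sample."""
    def __init__(self, dim=256, n_experts=8, k=2, hidden=512):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        ])

    def forward(self, x):                       # x: (batch, dim)
        weights, idx = torch.topk(torch.softmax(self.router(x), dim=-1), self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):              # only k experts run per sample
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoE()
print(layer(torch.randn(4, 256)).shape)         # torch.Size([4, 256])
```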
|
Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models
|
Statement
Language is frequently used to instruct robots for manipulation tasks. A large amount of diverse language-annotated data is required for policies to exhibit strong generalization capabilities. However, generating language annotations for demonstration datasets is expensive and does not scale. To address this, we present a framework that labels long-horizon demonstrations with language in a zero-shot manner, without human intervention, by leveraging current state-of-the-art foundation models.
|
Symmetry-aware Learning for Contact-rich Manipulation under Partial Observability
|
Statement
This study addresses learning peg-insertion tasks using a soft robot that can operate more safely and tolerate lower-frequency control signals than a rigid one. We consider a partially observable formulation and deep reinforcement learning in which the peg-to-hole pose cannot be measured directly.
Soft robots and learning under partial observability will shape future research on robots that perform contact-rich manipulation in unstructured environments. Our roadmap for future robot manipulation research is to gradually make robots, across a variety of contact-rich tasks, more robust against the more severe uncertainty that can arise in daily life.
|
Learning Visuotactile Skills with Two Multifingered Hands
|
Statement
The bimanual dexterous tasks we study, such as steak serving and wine pouring, represent common yet intricate challenges that existing robotic systems still struggle to accomplish effectively. By developing a novel system capable of handling these complex manipulation tasks through end-to-end learning approaches, our research aims to push the boundaries of current robot capabilities. Successful demonstrations in this domain would increase confidence in the feasibility of deploying advanced robots for similar real-world applications and inspire further exploration of learning techniques.
One of the most significant roadblocks impeding rapid progress in robot manipulation research is the lack of robust data collection infrastructure and large-scale datasets. Our work directly confronts this obstacle by proposing an integrated hardware and software solution that enables efficient large-scale data collection for bimanual dexterous robotic hands. This data pipeline also enables us to build multimodal datasets that cover rich sensory inputs and diverse action spaces specific to bimanual dexterous manipulation tasks.
Looking ahead, our roadmap for future robot manipulation research prioritizes expanding the scope and complexity of tasks under investigation. For example, we aim to gradually venture into longer-horizon tasks that involve tool use, dynamic tasks with moving objects, and tasks requiring high-level reasoning or multi-modal inputs. In parallel, we will continue refining our end-to-end learning frameworks to enhance their ability to generalize across tasks, environments, and hardware platforms. This entails exploring novel neural network architectures, multi-task learning strategies, and techniques for transferring knowledge between distinct but related manipulation skills.
|
Learning Goal-Conditioned Diffusion Policy for Contact-Rich Bimanual Manipulation through Planning-Guided Data Synthesis
|
Statement
The success of behavior cloning (BC) in dexterous manipulation has largely been limited to tasks that use only the robot's end effectors rather than its full arm or body. This is not surprising, as such skills are challenging to demonstrate through human teleoperation. To scale robot learning, we highlight the need for 1) leveraging simulation data and 2) extending manipulation learning beyond (jaw- or claw-like) end-effector-based skills. In this work, we present a contact-rich bimanual manipulation task that is hard to teleoperate, and show that we can learn a visuomotor policy through planning-guided data synthesis in simulation and zero-shot transfer the learned policy to hardware.
|
From Simple to Complex Skills: The Case of In-Hand Object Reorientation
|
Statement
This work shows the benefits of reusing pretrained motor skills for more complex tasks. It is motivated by the recent success of sim-to-real for dexterous manipulation, which shows promising results in various applications. However, tuning rewards and performing system identification are notoriously hard and time-consuming. We provide another way: reusing previous skills and composing them for future tasks.
I believe that in future robot learning, more and more mature motor skills will be available, produced by either the research community or industry. When we face a new task, we should build a hierarchy of existing controllers instead of training another one from scratch.
|
A Robotic Skill Learning System Built Upon Diffusion Policies and Foundation Models
|
Statement
Our recent work "A Robotic Skill Learning System Built Upon Diffusion Policies and Foundation Models" (currently under review and available on arXiv) shows how to leverage Diffusion Policies to perform complex tasks such as contact-rich bottle opening and granular material scooping while using large pre-trained foundational models to select the right policy given the users request as well as checking if defined preconditions for the tasks are fulfilled before execution. With this, it shows a potential approach to the problem of the wide variety of tasks robots will encounter in unstructured environments and how giving the user the power to demonstrate specific tasks to build an ever-growing library of tasks that are specific to the user needs. This sidesteps the problem of having to develop universal skills that, while more robust than before, are still in high danger of breaking if they are deployed outside the learned distribution by having the user perform the demonstrations in their specific environment. We expect that these skill-based systems will be very relevant for the future of robotic manipulation as their capability will be expanded to more and more varieties and environments.
|
Online Estimation of Articulated Objects with Factor Graphs using Vision and Proprioceptive Sensing
|
Statement
The main impact we hope to have with this work is to bring to the community's attention the benefits of incorporating proprioceptive sensing into manipulation tasks. In particular, we are interested in manipulating articulated objects. We show in our paper that we can achieve impressive estimation accuracy for unknown articulations from very small motions: after only 0.5 degrees of rotation when opening a door, our estimator reaches an average accuracy of 90%, and after 1.0 degree of rotation this improves to 97%. This is possible because we fuse learned priors based on visual information with kinematic sensing from the robot's joint encoders to estimate the articulation.
In our work we tackle the problem of estimating an unknown articulated object that has no visual cues indicating its articulation. It is therefore not possible to open it using vision alone, and any initial guess of the articulation parameters must be updated as new information is received. We believe there is currently a gap in the literature where vision and proprioceptive sensing should be used together to estimate object properties. This paper addresses that gap for manipulating articulated objects and presents experiments on objects that most current methods would struggle to open.
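As a toy illustration of why fusing a visual prior with proprioceptive sensing can recover an articulation from tiny motions, the sketch below estimates a door hinge from gripper positions swept through roughly one degree of rotation, regularized toward a noisy visual prior. It uses a plain linear least-squares circle fit in NumPy; the paper's factor-graph formulation over full articulation models is richer, and all numbers here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
true_hinge = np.array([1.0, 0.5])                    # ground-truth hinge position (x, y)
visual_prior = true_hinge + rng.normal(0, 0.05, 2)   # coarse learned prior from vision

# Gripper positions from the joint encoders as the door rotates by ~1 degree.
angles = np.deg2rad(np.linspace(0.0, 1.0, 20))
radius = 0.6                                         # hinge-to-handle distance
grip = true_hinge + radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
grip += rng.normal(0, 0.001, grip.shape)             # encoder-level noise

# Linearized circle fit: the hinge is equidistant from all gripper samples,
# with an extra "prior factor" pulling the estimate toward the visual prior.
A = 2 * (grip - grip.mean(axis=0))
b = (grip ** 2).sum(axis=1) - (grip ** 2).sum(axis=1).mean()
A = np.vstack([A, 10.0 * np.eye(2)])                 # weighted prior rows
b = np.concatenate([b, 10.0 * visual_prior])
hinge_est, *_ = np.linalg.lstsq(A, b, rcond=None)
print("hinge error [m]:", float(np.linalg.norm(hinge_est - true_hinge)))
```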
|
Generative Factor Chaining: Coordinated Manipulation with Diffusion-based Factor Graph
|
Statement
My study revolves around solving complicated long-horizon planning tasks using short-horizon knowledge. Such tasks necessitate understanding the long-horizon action dependencies required to achieve the objectives. My research simplifies this problem using the options framework, defining tasks with skills or parameterized high-level actions (such as "pick," "place," and "move"). Given these skills (i.e., proficiency in solving short-horizon tasks), can we devise an intuitive planning strategy to directly chain them during inference for any arbitrary task? This line of study is crucial for future applications, as real-world tasks typically extend beyond short horizons, and collecting data for all possible scenarios is impractical. For instance, if we gather sufficient data on tasks like "picking up a hammer" and "picking up a nail" (individually from various tasks) and "moving a grasped object," can we infer solutions for unseen problems like hammering a nail directly at inference? Another example involves determining how to grasp a cup when the ultimate goal (after several steps of long-horizon reasoning) is either (a) placing it in a microwave or (b) placing it in a box, with the former requiring grasping by the handle and the latter by the rim.
|
Avoid Everything: Model-Free Collision Avoidance with Expert-Guided Fine-Tuning
|
Statement
Real-world robotic systems today largely exist in industrial spaces. While these spaces may change, the roboticists and operators can often work in tandem to design the spaces, tasks, and algorithms to be successful. These settings can be considered "technocratic" because they can be designed, dictated, and run by a highly trained technical team. I believe that robotic systems will have the greatest societal value by moving into "democratic" spaces, i.e. those that are defined by general population. However, these environments can be messy or chaotic. For example, typical homes have a long-tail of feasible environment configurations, and robots designed to interact with these settings will require careful consideration of safety. Traditional techniques in robotics make strong assumptions on scene observability to guarantee safety, yet many home environments cannot be instrumented in such a way as to ensure accurate scene models.
In the last few years, our community has produced a lot of stellar research focused on learning skills for manipulation. In particular, much of this work has focused on learning an end-to-end policy that can mimic an expert performing complex manipulation tasks. However, these works have focused primarily on tasks in uncluttered settings where environmental collisions are not a major concern. In such settings, methods such as inverse kinematic following or motion planning with a low-fidelity collision model are appropriate. In this research, we have focused on building tools from another perspective: thinking explicitly about end-to-end safety as a building block of a robust end-to-end system.
Moving forward, we hope to see systems build on our safety-aware techniques so that they can perform dexterous manipulation tasks in complex or partially observed environments. While this may mean directly stitching our method into a pipeline that reasons at a task level about end-effector motion, it may also lead to fully end-to-end systems that reason about the desired task in tandem with the relative safety of the actions. In "democratic" spaces, it is often impossible to adequately perform a task while satisfying hard collision-avoidance constraints. We believe that future end-to-end solutions will prove fruitful when there is a trade-off between physical safety and task completion. Our submission to this workshop demonstrates a method for trading off between task completion and safety, and we hope this idea will help our community explore how to create generalized manipulation behaviors in democratic spaces.
|