Learning What To Say And What To Do: A Model For Grounding Language And Actions
MetadataShow full item record
Automation is becoming increasingly important in nowadays society, with robots performing a lot of repetitive tasks in industry and even entering our households in the form of vacuum cleaners and lawn mowers. When considering regular tasks outside of the controlled environments of industry, robots tend to perform poorly. In particular, in situations where robots have to interact with humans, a problem arises: how can a robot understand what the human means? While a lot of work has been made in the past towards visual perception and classification of objects, but understanding what action a verb translates into has still been an unexplored area. In solving this challenge, we would enable robots to execute commands given in natural language, and also to verbalise what actions they are performing when prompted.
This work studies how a robot can learn the meaning behind the sentences humans use, how it translates into its perception and the real world, but also how to translate its actions into sentences humans understand. To achieve this we propose a novel Bidirectional machine learning model, along with a data collection module that can be used by non-technical users. The main idea behind this model is the ability to generalise to novel concepts, being able to compose new sentences and actions from what it learned previously. Humans show this ability to generalise from a young age, and it is a desirable feature for this model. By using humans natural teaching instincts to teach the robot together with this generalisation ability we hope to obtain a model that allows people everywhere to teach the robot to perform the actions we desire. We validate the model in a number of tasks, using an iCub and Pepper robots physically interacting with objects in order to complete a natural language command. We test different actions, including motor actions and emotional displays, while using both transitive and intransitive verbs in the natural language commands.
The main contribution of this thesis is the development of a Bidirectional Learning Algorithm, applied to a Multiple Timescale Recurrent Neural Network enabling these models to link action and language in a bidirectional way. A second contribution sees the extension of Multiple Timescale architectures to Long Short-Term Memory models, increasing the capabilities of these models. Finally the third contribution is in the form of data collection modules, with the development of an easy-to-use module based on physical interaction and speech to provide the iCub and Pepper robots with the data to be learned.