|
For Trevor Taylor Candidate for Doctor of Philosophy Faculty of Information Technology Queensland University of Technology July 2002 Copyright © 2002 by Trevor Taylor Table of Contents
1 TITLEProposed title of the thesis is: 2 OBJECTIVESIn its simplest terms, the field of study is Computer Vision for Autonomous Robots. The ultimate goal of this research is to develop a vision module for controlling autonomous robots that allows them to collaborate with each other to build maps based on visual landmarks. (This is an extension of the Stage 1 proposal in that it adds collaboration, although it reduces the emphasis on “cheap vision” and hardware implementations.) The objectives of the research are layered:
Note that this approach inherently uses active vision because the camera will be mounted on the robot, which is free to move around. One possible application for these robots would be locating objects in hazardous environments. 2.1 Significance and InnovationOne significant area of this work is navigation by vision alone. Most researchers have used robots with an array of different sensors, which often did not include a camera, e.g. light detectors, ultrasonic sonar, IR (infrared), contact or “bump” sensors, etc. Brooks (Brooks, 1999) provides a good review of robots up to the early 1990s. One of Brooks’ students developed a robot called Polly (Horswill, 1993) that navigated using vision. Polly relied on local knowledge and had a built-in map. In contrast, this proposal is for robots that will start out with no knowledge of their environment and gradually accumulate information, getting smarter on subsequent searches and through the exchange of maps. Note that in this approach the initial conditions are important because the robot must establish its position in relation to previously generated maps. Vision-based navigation with no prior knowledge is still an open problem and is considered by some people to be the “holy grail” of computer vision. Although it is not expected that this problem will be solved by this research, the work will contribute to progress towards this goal. There is a reasonable body of literature on exploration and the use of landmarks, however it remains a problem that has not been thoroughly solved. See DeSouza and Kak (DeSouza & Kak, 2002) for a recent survey. In most reported cases, exploration and mapping has been performed by a single robot. Collaborative mapping is an area that is still in its infancy and involves aspects of AI (Artificial Intelligence) as well as computer vision. Work done to date has used a diverse range of methods and has tended to be theoretical. In some cases it has consisted solely of simulations. This research will investigate emergent collaborative behaviour and provide a practical example. Wireless communication between robots using Bluetooth will be investigated. This is a new standard for short-range communication that has not yet been widely adopted, so work in this area could be ground-breaking. 3 OUTLINE3.1 Robot VisionRobots have become popular over recent years to the extent that mass-market robots are now available. For instance, the Sony AIBO robot dog (http://www.aibo.com) is now in its second generation. At over A$3,000 this “entertainment robot” is hardly a toy. However, this is an industry with strong growth potential and therefore large research and marketing budgets. Lower down on the price scale, and appealing to a different audience, is the LEGO Mindstorms (http://mindstorms.lego.com). This system, also in its second generation, is so popular with hobbyists that several books have been published on how to build and program robots using it, e.g. (Baum, Gasperi, Hempel, & Villa, 2000). A key element that is missing from these robots, however, is vision. The AIBO has an on-board camera, but it is used only to take snapshots and to locate a bright pink ball that is supplied with the robot. Sony has recently released information on the programming interfaces for AIBO which should inspire further development. LEGO does not sell a camera for Mindstorms, and there is no facility to interface one, making these robots effectively blind. The first objective of this research is to develop a system that will allow a robot to navigate using vision alone. This has potential for commercial applications, depending on how successful it is. It should also be an original contribution to the body of knowledge on computer vision. In order to implement this system, Optic Flow will be a key algorithm. An extensive amount of work has been done on Optic Flow (Pham & Wardhani, 2002) and it should be possible to build on this. In the 1980s, Rodney Brooks at MIT broke with the traditional Sense-Plan-Act (SPA) approach to robotics and introduced his Subsumption Architecture (Brooks, 1986). This was fairly heretical at the time because the existing AI community believed that the only way for a robot to successfully navigate and accomplish tasks was if it had a complete representation of its real world environment. A great deal of effort had been expended on methods of representing the world symbolically. Brooks however argued that “the world is its own best model”. The Subsumption Architecture relies on layers of simple behaviours that can inhibit or augment lower layers. Robots using this type of architecture exhibit “emergent intelligence”, i.e. they appeared to behave intelligently without having a full understanding of the world around them. To some extent the Subsumption Architecture might be paraphrased as “walk before you run”. Basically it involves developing a series of simple behaviours such as obstacle avoidance, wall/corridor following, chasing a moving object, etc. These behaviours are layered, with the higher layers providing input to or inhibiting the lower layers. Note that none of these behaviours require the robot to retain information about its past motions, i.e. it does not need to build a map. However, it does operate autonomously in the sense that it is free to roam wherever it chooses. Because the robot has no state information, it cannot make intelligent decisions in the face of bad sensor data or extrapolate information already gained, e.g. a wall is likely to continue, even if the sensors say that it has suddenly disappeared. (A similar problem plagued the QUT Robot Soccer team where the ball would mysteriously “disappear” as far as the vision system was concerned, and then reappear later, sometimes in a different place. This confused the control algorithms, resulting in the robots doing peculiar “dances” on the field.) Current thinking is based on a Three Layer Architecture (Gat, 1998). This behaviour-based approach to robotics (Arkin, 1998) is a further outgrowth of Subsumption in the sense that it adds extra layers over the top of the reactive Subsumption layer. The additional two layers contain memories and state information about the past, and state information that is used to predict the future. The three levels are known by various names, but common terminology is the Controller, Sequencer and Deliberator. Regardless of the architecture, the final system must be immune to noise and very robust. 3.2
Any approach to map building will require a “fuzzy” map because the robot will not have accurate odometry information. There are many reasons why odometry measurements cannot be trusted, and accumulated errors will result in the robot becoming lost (Dellaert, Burgard, Fox, & Thrun, 1999). In any case, the robot will not have appropriate sensors for odometry. Constructing a map in memory will involve developing an architecture for map storage and traversal. These maps will of necessity be “fuzzy” in nature. Some work has focussed on using Visual Beacons (Clarke, 1998). These are artificial landmarks created to assist with navigation. It is not intended to rely on such beacons in this work, although their use is not precluded. Landmark recognition is a multi-stage process. Firstly, features need to be extracted from an image. This is necessary to reduce the dimensionality of the data to something manageable. (Storage of large numbers of images in the robot’s on-board memory is not feasible, whereas sets of features will take up considerably less memory.) Next it will be necessary to decide on which features are significant enough to represent a landmark. This is an unknown at present, and the algorithm to select appropriate features will be an outcome of the research. For example, it is possible to extract the vertical edges in an image and treat them as a form of barcode. In this way each different scene should generate a unique code. Ideally the landmark features should be represented in such a way that they are invariant under translation (in the horizontal plane at least). The barcode approach would rely on the relative spacing and length of the bars, and therefore be independent of the distance from which the landmark was viewed. Rotational invariance should not be relevant because the camera will not be rotated about the horizontal axis. There are several challenges in determining landmarks. Preliminary experiments in moving the camera around a typical office have shown that simple objects such as chairs can present quite different “signatures” depending on the viewing angle. This problem arises because the camera’s field of view is small and nearby objects therefore take up a substantial portion of the image frame. (See the images below under Preliminary Experiments.) It is also important to be able to handle overlapping images as the robot moves or turns and stitch them together to recognise landmarks as they transit across image boundaries. Finally, the landmark must be stored so that it can be recognised again later using a pattern matching method (which will also need to be developed). The landmark recognition algorithm must be robust and reproducible with an accuracy of better than 50%, i.e. it must correctly identify landmarks most of the time. The robot should be able to take advantage of its internally constructed map and the relative positions of landmarks to help resolve ambiguities when attempting to recognise a landmark. Humans prefer to build objects with right-angles. This means that in a man-made environment such as an office there is a large number of horizontal and vertical edges. An interesting corollary is that many moving objects have curves, e.g. cars are streamlined and have round wheels, or are irregular in shape, especially during motion, e.g. animals or humans. The robot is likely to encounter humans while exploring and it must understand that they are not suitable as landmarks, or be programmed to ignore them (apart from avoiding collisions). It is easy to over-generalise and there are exceptions to these “rules”. For instance, classical architecture makes extensive use of arches, circular or semi-circular windows, spires, flying buttresses, etc. However, the environment for this research will be a conventional office that consists of walls, corridors, doors, filing cabinets, desks, chairs, etc. The robots will be restricted to a single floor of an office building because they will not be capable of negotiating stairs or using an elevator. In fact, they will have limited opportunities to enter rooms and might spend much of their time in corridors. 3.3 Collaborative ExplorationThis work will focus on the one-to-one communication of mapping information between robots. The information will not be centrally collated. Swarms, or herds, of robots will not be considered either. Obviously, the robots will require some form of wireless communication. The communications protocol must be efficient in its use of bandwidth. Bluetooth has a nominal bandwidth of 1 Megabit, but the effective throughput is closer to 500-700 Kilobits. It also has limited range. Wireless LANs based on IEEE 802.11b have a bandwidth of 11 Megabits and much greater range, but at the expense of higher power consumption. Note that both of these operate in the 2.4 GHz range which is subject to interference from cordless phones, microwave ovens, etc. This could have the effect of reducing the throughput, although the communications should still be reliable. It is not reasonable to rely on sending entire images between robots because of the bandwidth required. Consider an image of 640 x 480 pixels in 24-bit colour (RGB). This is a total of 921,600 bytes of information, or roughly 1 Megabyte. Transferring 20-30 frames per second (a standard rate for full-motion video) is clearly not feasible even at a transfer rate of 11 Megabits per second. For video conferencing at a resolution of 352 x 288 and 8-bit colour, the frame size is 101,376 bytes or nearly 1 Megabit per image (including protocol overhead). This is the primary reason for developing a Landmark architecture. Landmarks should dramatically reduce the amount of information required to describe locations compared to raw images. In addition to communications, the robots will require a reasonable level of on-board intelligence. Apart from simply transferring all of the Landmark data from one robot to another, it is also necessary to merge the maps together in a coherent fashion. If the robots were to exchange information only when they see each other, then they would have a common point of reference. However, wireless communication means that they could talk to each other from different rooms, and it might be difficult to establish a common point in their maps. Therefore it will be important to establish initial conditions for the robots – location, orientation – so that they are “aware” of It is also desirable for the robots to execute a search pattern that minimises overlap. This is a non-trivial task in its own right, and might not be addressed in this study, i.e. the robots could simple roam at random trying to extend their knowledge, e.g. they could be given an imperative to move into unexplored territory. The fact that they are sharing their knowledge might lead to emergent collaborative behaviours. 3.4 LimitationsIt is important to note that there are some implicit limitations on the system to be constructed. In particular, it will be designed for indoor use – a typical office environment. This has implications for the lighting, nature of the obstacles, flooring, etc. In their survey of vision systems, DeSouza and Kak (DeSouza & Kak, 2002) make a clear distinction between robots designed for indoor and outdoor use. They also differentiate between structured and unstructured environments. Because operation in an unstructured environment is more useful, the decision has been made not to structure the environment in any way. Motion is assumed to be over a smooth horizontal surface. Most robots cannot handle discontinuities in the surface, e.g. steps or even rugs placed on the floor. It is further assumed that the robots will only be exploring a single floor of a typical office building. This has the effect of limiting the area to be explored to several hundred square meters. No mention has been made of Pan and Tilt capabilities for the camera. This adds to the complexity, e.g. moving the robot while the camera is panned to the side presents a much different view to when the camera points straight ahead. Also, it has been decided that there will be a single camera of the type that is readily available in cheap video cameras. These are very light (usually a single chip) and can typically operate off a battery. Black and White video cameras can be purchased for under $100, with colour a little more expensive. Having a single camera introduces the limitation that it is very difficult to extract depth information from the images. In fact, Horswill (Horswill, 1993) states that “computing depth is a notoriously difficult problem”. One way around this problem is to use stereo vision. Work has been done on so-called “stereo heads” for binocular vision, but this presents its own set of challenges and will not be addressed here. Given the limitation of a single (cheap) camera, approaches will be developed to account for distance using the horizon. For instance, the image could be divided horizontally into three areas: foreground, middle distance or horizon, and the far distance. Objects in the foreground (detected by means of edges or other methods) represent collision hazards. Objects in the far distance might be the most appropriate choices for landmarks. One of the constraints on the development of computer vision systems in the past has been limited computational resources. This is arguably no longer the case with today’s technology, but will certainly not be true within the few years required to complete this research. The power (CPU speed, memory size) of cheap, embedded systems will soon reach the point of processing several frames per second using computationally intensive algorithms. In particular, a Personal Digital Assistant (PDA) typically has a 204 MHz RISC processor with 64 MB (or more) of RAM memory. One of these could possibly be used to provide the on-board intelligence, with the advantage of providing a colour display screen as well. 4 RESEARCH METHODS AND PLANThe research will be broken into three phases, with a goal to write a paper for publication at the end of each phase. These phases are detailed below. As part of the research, algorithms will be devised for computer vision, navigation, etc. These will be verified experimentally. The algorithms will be refined through a cycle of devise, implement, test and revise. The work will include the development of the necessary hardware and software. There will be no collection of data and subsequent analysis as in many of the other sciences. Most of the results will be qualitative rather than quantitative. 4.1.1 Phase 1 – Robot Vision and Simple Navigation BehavioursDuring this phase, which will last about 12 months, a vision system will be developed for basic robot navigation. Note that this phase might overlap with Phase 2. This phase will involve a remote-controlled robot and a wireless camera, with the actual processing and control taking place on a PC. Processing will be performed using custom-written software supplemented by either Matlab or any of the many image processing packages available on the Internet or in textbooks. A Korean Soccer robot will be suitable for this phase. One approach might be to have a human operator guide the robot and train it until it learns how to navigate. In artificial intelligence terms this is supervised learning. Basic behaviours to be developed include:
In this phase, the robot will simply wander around. It does not need to retain a memory of where it has been or identify objects. 4.1.2
Phase 2 –
| ||||||||
|
|
|
|
Figure 2 – Office Environment - (a) Door and Bookcase, (b) Chair and Filing Cabinet |
|
As examples of the types of image processing that might be used, the first picture (Figure 2a) has been processed using some of the well-known edge detection methods called Sobel, Canny and Laplacian of Gaussian (LOG). These are shown in Figure 3a, 3b and 3c. No fine tuning has been performed on the images. They are simply provided as examples.
|
|
|
|
|
|
|
Figure 3 - Edge Detection - (a) Canny, (b) Sobel, (c) Laplacian of Gaussian |
Notice the effect of the carpet texture, which is particularly pronounced in the Canny and LOG images. An algorithm will be required to remove this clutter. Also (without refinement) the Canny method detects much more information than the Sobel as far as the door and bookcase are concerned. Further research will be required to determine the best methods to use and the parameters of each of these methods.
During the early stages of development, images such as those shown above will be viewed on the PC controlling the robot. The software will be able to superimpose its own interpretation of the scene to demonstrate graphically the reasons for navigation decisions.
In the final version of the system, however, it might not be possible to view what the robots “see”. It will be possible to interrogate the robots and obtain landmark information using a PC with a wireless LAN connection because it will be able to pretend that it is another robot and exchange map information. There might be a possibility of obtaining some of the processed images from the robots over the wireless LAN because they will be significantly smaller than full colour images. However, for reasons outlined above, it is unlikely that full video will be transmitted by the robots.
If a PDA is used to provide the on-board “intelligence”, it will have a colour screen and will therefore be able to display images. In this case, it would be necessary to follow the robot around to watch what is appearing on the screen. Although this is not an ideal situation, it does resolve the problem of trying to understand why the robots make certain decisions.
A decision has not been made on how to provide the on-board intelligence. There are several approaches, and these will be examined as part of the research:
Although FPGAs have a significant speed advantage for algorithms that are amenable to parallelization, they are of little benefit for algorithms that require recursion or extensive sequential processing. Because part of the research involves evaluating image processing algorithms for speed and utility, it is unclear at this stage whether or not the algorithms that are finally selected will benefit from parallel operations. Programming of FPGAs is at a very low level (at the level of individual logic gates, which is even lower than assembly language) requiring expertise that the candidate does not currently have. The propensity for Intellectual Property developers to want to sell their code has had the effect of holding back widespread adoption of FPGAs because subroutine libraries are not readily available.
An initial literature search was performed to determine the availability of programs for FPGAs and the feasibility of using them for image processing. Draper et al. (Draper et al., 2000) implemented various image processing algorithms, and they indicate that a simple 3x3 Convolution takes approximately 1,333 CLBs (Customisable Logic Blocks). With current technology, this is a mid-sized FPGA.
Assuming that several algorithms might be required in the final implementation, it might be necessary to use several FPGAs. Reprogramming them “on the fly” is possible, but this would significantly slow down the processing. However, this is the approach taken when using PCI-based FPGA boards in PCs as an adjunct to programs such as Matlab. The time required to constantly reprogram an FGPA might prevent the robot from operating in real time. Although it is not important that the robot move quickly, it should move at a reasonable speed, e.g. 0.1 to 1 metres per second.
In any case, an FPGA by itself would not be used to provide higher-level functions such as map storage, decision making and overall control. It is therefore expected that another processor will be required, and an FGPA would simply act as a co-processor.
Conventional processors have the advantage that they can run operating systems such as uClinux (http://www.uclinux.org). This provides a familiar programming environment, with a wealth of available software for development. Programming is performed at a much higher level of abstraction than for an FPGA. Many of the popular robots, e.g. Koala, have an on-board CISC processor.
A RISC processor could be added to a robot in two ways:
1. Take a lateral-thinking approach and interface a PDA to the robot.
2. Build a custom processor board containing the CPU, memory, etc.
Compaq and Toshiba have both released PDAs with built-in Bluetooth interfaces for wireless networking. An interface could be built to connect the robot / camera to one of these PDAs. It would then be possible to take advantage of the wireless communications as well as the colour display on the PDA itself. Development would be performed in the Windows CE (Pocket PC) environment, i.e. a high-level environment with extensive support, on the candidate’s office PC.
The candidate is currently supervising a project involving two undergraduate students who are developing an application on a Bluetooth device called a BlueMod. These devices run uClinux on a Motorola ColdFire processor. A BlueMod could provide a suitable interface to a PC for communication with a PDA. A BlueMod could also be used as an on-board processor for a robot because it contains a RISC CPU, although this processor only runs at 48 MHz and there is limited memory (8MB). Some hardware development would be required to interface the BlueMod to a robot. Software development could be done on a Linux platform, which might possibly be installed in a dual-boot configuration on the candidate’s office PC.
A preliminary literature review has been undertaken in several areas.
Computer Vision, and especially Image Processing, are not new areas of research. The Computer Science fraternity has been examining these areas for over 30 years.
For instance, one classic text by Gonzalez and Wintz (Gonzalez & Wintz, 1977 - Revised 1987), “Digital Image Processing”, was first published in 1977 and has since been revised and re-published twice, the last time with a change of authors (Gonzalez & Woods, 2002). Another well-known text on the subject even has the same title (Castleman, 1996), while texts commonly used for teaching at QUT also have similar titles (Jain, 1988; Russ, 1999).
Several texts include image processing software, the most up to date being the CVIP tools (Umbaugh, 1998). This software can be downloaded from the Internet. (There are others in the QUT Library, but they are so far out of date that they are not worth investigating.) Other free software includes:
This list of software is far from exhaustive, but it does give some indication that image processing and computer vision software is readily available on the Internet. However, this software will invariably have to be modified to suit an autonomous robot because robots do not have all the peripherals of a computer workstation such as display screens, mice, etc.
Computer Vision has been addressed by numerous textbooks (Ballard & Brown, 1982; Haralick & Shapiro, 1992; Marr, 1982; Vernon, 1991). These texts cover all of the basic aspects of vision, but tend to assume static images captured by perfect sensors.
Treatment of “noise removal” generally deals with artificially added noise, which is not what happens in reality. For instance, during preliminary experimentation it was found that entire video frames could be corrupted to the point of being useless. Also, the nature of noise is often not the “snow” portrayed in the textbooks, but a pattern of lines induced by a conflicting signal or interference. Brooks (Brooks, 1991) suggests “using redundancy over several images … in contrast to the approach in traditional computer vision research of trying to extract the maximal amount of information from a single image”.
A large amount of work on Computer Vision has been devoted to stereo heads, such as the Small Vision Module developed at the Stanford Research Institute (Konolige, 1997). However, this work will use a single camera. This might make it difficult to determine distances and depth, but it should still be possible to navigate. Active vision can assist in determining distances and the fact that the robot is moving can be used to advantage.
Field-Programmable Gate Arrays (FPGAs) are programmed (as their name suggests) at the level of the individual logic gate. This is an even lower level than assembly language (also called machine language).
In his review of the state of the art in image processing, Crookes (Crookes, 1999) points out that “Possibly the biggest problem for FPGA technology in accelerating image processing applications … is the extremely low level programming model”. He goes on to say “It is likely – indeed, essential – that this problem will become a major area of research in the future.” However, he still has a positive outlook on the future of FPGAs.
Much of the work with FPGAs has been done using plug-in boards for PCs, e.g. (Soldek & Mantiuk, 1999). While this provides a high-level development environment, it sidesteps the technical issues of building an on-board system for a robot.
2-D convolution is a fundamental operation in image processing, and it has been shown that this can be performed on an FPGA (Boussakta, 1999). Amtel have an Application Note for a 3x3 Convolver for their AT6000 FPGAs (Amtel).
Various different image processing operations are considered in (Demigny, Kessal, Bourguiba, & Boudouani, 2000). Edge and corner detection is addressed in (Torres-Huitzil & Arias-Estrada, 2000). Detecting edges is fundamental to most pattern recognition, and will be important for defining landmarks.
It is unclear at this stage whether or not FPGAs will be incorporated into this research.
Visual Beacons are artificial Landmarks that are created to assist robots to navigate. There have been several attempts at creating unique beacons (Clarke, 1998; Clarke & Nunes, 1998) that are easy to recognise and remain invariant under translation and rotation.
Using natural landmarks is highly desirable because it does not require the placement of beacons, i.e. the robot can operate in a new environment with no requirement to first prepare the environment. Work has been done on algorithms for selecting and identifying natural landmarks, e.g. (Sim & Dudek, 1999; Thacker & Harris, 1998).
Building a map is a complex task, especially when there is no accurate spatial information. Several approaches have been taken, some of them probabilistic in nature, e.g. Maximum Likelihood (Simmons et al., 2000) and Bayesian Filtering (Dellaert et al., 1999). Hidden Markov Models have also been used (Filliat & Meyer, 2000, 2001) in an attempt to overcome the inherent uncertainty in the robot’s location (also referred to as the pose in the literature).
As part of this research, new algorithms for selecting Landmarks are expected to be developed.
Collaborative exploration has received quite a bit of attention from a certain group of researchers headed by Dudek (Dudek, Jenkin, Milios, & Wilkes, 1993, 1996; Rekleitis, Dudek, & Milios, 2000; Rekleitis, Sim, Dudek, & Milios, 2001a, 2001b). One approach that they have explored is using one robot as a marker (it remains stationary) while the other one explores. The marker then moves some distance, and the two robots “see-saw” through the terrain in this way. A lot of the work was devoted to determining the optimal tiling pattern and proving that the robots would explore the entire environment.
Some of the work reported in the literature has been done in highly structured environments (Billard, Ijspeert, & Martinoli, 1999a, 1999b), or are simulations (Hayes, Martinoli, & Goodman, 2000). Although this does not invalidate the results, it does limit their applicability. As has already been pointed out, the objective of this research is to explore unstructured environments (although constrained in the sense that they will be office environments).
It appears, from the limited review to date, that further work can be done on exchanging maps between robots. This could be an area of significant new research.
Robotics per se is not a key area for this research. However, some appreciation of robotics is required because a robot will be the platform for the vision system. In addition, information on vision systems often appears in textbooks devoted to robotics.
One of the earliest autonomous robots was built by W. Grey
Walter (Walter, 1953 -
Reprinted 1963)
and is referred to as a Tortoise. (Some citations call it a Turtle, but Walter
named it a Tortoise after the one in “
The main point to note is that this early robot did not use vision because the technology was not sufficiently developed at the time. (The robot actually used vacuum tubes and carried a lead-acid battery!) However, it did use a light sensor, and it exhibited rudimentary intelligence.
No review of Robotics would be complete without mentioning the text by Braitenberg (Braitenberg, 1987) entitled “Vehicles”. This book discusses several simple robots of increasing complexity that demonstrate emergent intelligent behaviour. (Some of Braitenberg’s vehicles can also be built using LEGO Mindstorms, and there are web sites devoted to this.) His coverage of biological sensors is also fascinating. However, the main point to note is that very simple algorithms can result in behaviour that is apparently quite intelligent.
Brooks