News and opinions about the world of visual computing
Categories: "Companies" or "AMD" or "ARM" or "Adasworks" or "Apple" or "CEVA" or "Epic Games" or "Google" or "Graphcore" or "Hypereal" or "Imagination Technologies" or "Intel" or "Inuitive" or "LG" or "LunarG" or "Movidius" or "Nvidia" or "Oculus" or "Qualcomm" or "Razer" or "Renesas" or "Samsung" or "Sensics" or "Socionext" or "Texas Instruments" or "Tobii" or "Unity" or "Valve" or "Verisilicon" or "Wave Computing" or "Zspace"
Machine learning is an umbrella term for a group of related technologies which are critical to the development of visually aware systems. There has been an explosion of interest in these technologies, driven by their astonishing successes, and a consequent proliferation in the range and diversity of tools, formats and platforms available for developers to work with.
Neural networks in particular are proving successful at pattern matching and classification tasks, and their new popularity shows in the number of incompatible development and deployment tools that are emerging. Now The Khronos™ Group (www.khronos.org) is moving to reduce confusion and enable interoperability by announcing two new standardisation initiatives in the area.
A news release from Khronos today states that a new working group has been set up to define an “… API independent standard file format for exchanging deep learning data between training systems and inference engines.” and “… the OpenVX™ working group has released an extension to enable Convolutional Neural Network topologies to be represented as OpenVX graphs and mixed with traditional vision functions.” (The full press release is available here).
The first move will establish a standard way to move defined networks from the training phase, which is typically done offline using a wide range of rapidly developing tools and techniques, to the inference phase, where the network is actually used in an application. This phase, especially in embedded systems, is usually mapped onto highly optimised and constrained platforms, so by providing a standard exchange format, Khronos’ Neural Network Exchange Format (NNEF) will eliminate much of the need for development and deployment platforms to share implementation details.
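To make the idea concrete, here is a minimal sketch of the kind of hand-off NNEF is intended to standardise. The format shown is entirely hypothetical – the working group has not yet published anything – and the point is only that topology and weights get serialised to a neutral file which an inference engine can load without knowing anything about the training toolchain.

```cpp
// Hypothetical illustration only: NNEF is not yet defined, so this invents a
// trivial neutral format to show the division of labour the standard targets.
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// A layer as a training tool might export it: an operation name, its
// parameters and the trained weights, with no training-framework types.
struct Layer {
    std::string op;                // e.g. "conv2d", "relu", "maxpool"
    std::vector<int32_t> params;   // e.g. kernel size, stride, padding
    std::vector<float> weights;    // trained coefficients, empty if none
};

// Training side: write the network to a neutral file.
void export_network(const std::vector<Layer>& net, const std::string& path) {
    std::ofstream out(path, std::ios::binary);
    uint32_t n = static_cast<uint32_t>(net.size());
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    for (const Layer& l : net) {
        uint32_t op_len = static_cast<uint32_t>(l.op.size());
        uint32_t np = static_cast<uint32_t>(l.params.size());
        uint32_t nw = static_cast<uint32_t>(l.weights.size());
        out.write(reinterpret_cast<const char*>(&op_len), sizeof op_len);
        out.write(l.op.data(), op_len);
        out.write(reinterpret_cast<const char*>(&np), sizeof np);
        out.write(reinterpret_cast<const char*>(l.params.data()), np * sizeof(int32_t));
        out.write(reinterpret_cast<const char*>(&nw), sizeof nw);
        out.write(reinterpret_cast<const char*>(l.weights.data()), nw * sizeof(float));
    }
}

// Inference side: an embedded engine reads the same file back and maps each
// op onto whatever hardware it has, never touching the training toolchain.
std::vector<Layer> import_network(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    uint32_t n = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    std::vector<Layer> net(n);
    for (Layer& l : net) {
        uint32_t op_len = 0, np = 0, nw = 0;
        in.read(reinterpret_cast<char*>(&op_len), sizeof op_len);
        l.op.resize(op_len);
        in.read(&l.op[0], op_len);
        in.read(reinterpret_cast<char*>(&np), sizeof np);
        l.params.resize(np);
        in.read(reinterpret_cast<char*>(l.params.data()), np * sizeof(int32_t));
        in.read(reinterpret_cast<char*>(&nw), sizeof nw);
        l.weights.resize(nw);
        in.read(reinterpret_cast<char*>(l.weights.data()), nw * sizeof(float));
    }
    return net;
}
```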
By concentrating on a flexible format to exchange data, rather than on the structure of the actual networks being deployed, Khronos aims to avoid stifling innovation in a fast-moving technology while still achieving the objective of reducing deployment friction. According to Khronos President Neil Trevett: “So many companies are actively developing mobile and embedded processor architectures the market is in danger of fragmenting, creating barriers for developers seeking to configure and accelerate inferencing engines across multiple platforms.”
With the related move by the OpenVX working group to provide an API to a set of standardised neural network topologies as well as, crucially, to allow the import of non-standard nets developed elsewhere into the OpenVX graph structure, Khronos is again smoothing the path for deployment of machine learning without restricting innovation.
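For readers who have not used OpenVX, the programming model is a graph of vision nodes which the implementation verifies and then optimises as a whole; the new extension makes CNN layers available as just more nodes in that graph. A minimal sketch of the existing graph API (the extension’s own interfaces are not shown here, and error checking is omitted) looks like this:

```cpp
// Minimal OpenVX 1.x graph: the CNN extension adds neural-network nodes to
// exactly this structure.
#include <VX/vx.h>

int main(void) {
    vx_context context = vxCreateContext();
    vx_graph graph = vxCreateGraph(context);

    // Input image, plus a virtual intermediate whose storage the runtime manages.
    vx_image input   = vxCreateImage(context, 640, 480, VX_DF_IMAGE_U8);
    vx_image blurred = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_U8);
    vx_image grad_x  = vxCreateImage(context, 640, 480, VX_DF_IMAGE_S16);
    vx_image grad_y  = vxCreateImage(context, 640, 480, VX_DF_IMAGE_S16);

    // Two traditional vision nodes; under the new extension, an imported CNN
    // would appear as further nodes in the same graph.
    vxGaussian3x3Node(graph, input, blurred);
    vxSobel3x3Node(graph, blurred, grad_x, grad_y);

    vxVerifyGraph(graph);   // the implementation checks and optimises here
    vxProcessGraph(graph);  // then executes the whole pipeline

    vxReleaseContext(&context);
    return 0;
}
```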
This move by Khronos is very timely: many companies are looking for a viable deployment path into end-user equipment while, at the same time, researchers in both academia and industry continue to come up with novel ideas almost weekly.
Much of the impetus for the new working group came from Adasworks, one of the frontrunners in promoting machine learning software. Laszlo Kishonti, Adasworks’ CEO, commented that: “We see the growing need for platform-independent neural network-based software solutions in the autonomous driving space. We cooperate closely with chip companies to help them build low-power, high-performance neural network hardware and believe firmly that an industry standard, which works across multiple platforms, will be beneficial for the whole market.”
It is unusual to see a standards body get ahead of the market in this fashion, and it is to the credit of Khronos that it is doing so in a way that is sensitive to the needs of the technology. The new working group is only just starting up and there is no word on when a ratified standard will emerge, but given the recent activity from other Khronos groups, which has been very rapid for a body like this one, perhaps we will see something within the next year.
CEVA announced its new XM-6 vision processor yesterday (27th Sept) and will reveal details at today’s Linley Microprocessor Conference in Santa Clara.
As one of the companies leading the race to provide power-efficient vision processing aimed specifically at the mobile and embedded markets (as opposed to repurposing power-hungry GPUs for workstation and datacenter usage), CEVA’s designs, and the traction they get in the market, tell us a lot about the development and maturity of vision-based equipment including AR headsets, drones and more.
CEVA claims a 3x performance improvement on vector-heavy code for the new core, with a 2x boost for ‘average’ code, versus the previous-generation chip, and also reveals that it has for the first time included hardware dedicated to specific functions.
On the performance front, CEVA makes no mention of an increased clock rate, and the total compute resource (vector and scalar integer and floating point units) has increased only moderately over the XM-4, leading to the conclusion that much of the improvement is due to detailed architecture changes introduced with this generation.
On the basics, the XM-6 is similar to the XM-4: four 32-bit scalar units linked up VLIW-style to 128 16-bit integer MACs are included in the base version of the core, with a 32x16-bit floating point unit available as an option. In detail, however, there are some differences: the 128 MACs have been upgraded so that they now provide full 16x16 functionality (with 8x16 going up to 256 wide), and where they were previously divided into two 32/64-wide SIMD vectors, the new core splits them differently, with one 64/128-wide SIMD and two 32/64-wide units.
At the most basic level, this allows for a convenient substitution of the optional FP units without disturbing the rest of the architecture, but it also allows CEVA to tune the vector resources more closely to the workloads and increase the amount of available parallelism; a good proportion of the performance increase is likely to be due to that change alone.
It is also interesting that the optional floating point has now been downgraded to half precision (midP in GPU terms). This seems to represent a growing confidence among VPU designers that 32 bits is overkill for the vectorised portions of their workloads, so they are optimising for 16 bits. The fact that this functionality is still optional is another strong indicator that having floating point at all is a luxury which many applications cannot afford, something which continues to drive the widening split between dedicated VPUs and repurposed GPUs.
Just as interesting is the fact that CEVA credits much of the performance improvement not to gains in computational capacity but to improvements in efficiency from its proprietary data handling schemes. A sophisticated buffer handling unit which offloads management of large image datasets is teamed with two specific mechanisms for improving the vectorisation of operands and hence the overall utilisation of the vector units. Improvements to both of these, in particular the scatter-gather mechanism whereby the load/store unit can assemble a 1D data vector from a 2D array in a single cycle, represent a large part of the reason for this new release.
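For a sense of what that mechanism replaces, here is the same gather written as the scalar loop a conventional load/store unit would have to perform, one element at a time. This is purely illustrative plain C++, not CEVA’s interface:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: the operation CEVA claims to do in a single cycle,
// written as the per-element loop a conventional core would need.
// Gather 'count' 16-bit pixels from arbitrary (x, y) positions in a 2D image
// into a contiguous 1D vector ready for a SIMD MAC array.
std::vector<int16_t> gather(const int16_t* image, size_t stride,
                            const size_t* xs, const size_t* ys, size_t count) {
    std::vector<int16_t> vec(count);
    for (size_t i = 0; i < count; ++i) {
        vec[i] = image[ys[i] * stride + xs[i]];  // one load per element
    }
    return vec;
}
```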
It’s certainly true that a large part of wringing maximum performance out of a massively parallel system is solving the data/compute mismatch problem, and this is CEVA’s answer to it. It will be interesting to see how it plays out against the various multithreaded and hybrid approaches out there.
*Note: this story has been edited to clarify the number and type of MAC ALUs in the XM-6.
First things first: Pokemon Go is not an augmented reality app. Yes, it has the little monsters superimposed on your phone camera feed and you can interact with them but the physical world seen through the camera has no role to play in the game. You can’t interact with it and it has no effect on anything you do. The much-publicised unintended outcomes where players have injured themselves or others simply serve to illustrate this point: the physical world is an impediment to the game, not part of it.
The only purpose of the phone in the game is to tell you that you have arrived in the right location and are oriented correctly to see the monster. The game uses GPS, compass and screen, and that’s all: the camera is superfluous, and the screen could just as easily display anything or nothing except the monsters without affecting the gameplay at all.
So what we have is an amazingly successful treasure hunt/geocaching game with almost none of the attributes of augmented reality. Even so, it still has important lessons for how and why successful AR applications will make the grade, and for why VR will always be a niche application.
The key feature of Pokemon Go is that it is highly social. It’s hardly a secret that the most successful online apps by far are social apps, and the more immediate an app can be, the more successful it will be. This is immediacy in the sense of responsiveness: real time, lightweight and disposable. The more an app is like chatting to real people in real life, the more appeal it has; snail mail is trumped by fax, which is trumped by email, and so on through text, Snapchat and the rest.
We see it also in games, where Quake Arena is better than Quake on its own. People want to be with people, and Pokemon Go gives anyone who already plays the game an excuse to go out and enjoy it in the company of others. Of course, it also helps that Pokemon is an established game in its own right and the new version was made available on a ubiquitous platform, so that more or less anyone could join in with no waiting, no learning curve and no real effort. Download, log in and go. The trick now will be to keep them playing, but that is another story.
So, out of all of this, we can discern the outlines of what can lead to a successful AR application: it must enable some existing activity in an improved way, it must involve little to no effort to take part and it must bring people together to share the experience.
These last two are the requirements that will sink VR. While there is certainly a place for immersion, it isn’t enough on its own, and its corollary, isolation, will be enough to restrict VR to a niche. Apple’s Tim Cook alluded to this recently when explaining why Apple has not yet taken any public steps into VR: “Virtual reality sort of encloses ... the person into an experience that ... has a lower commercial interest over time”, and he’s right; that’s not compelling for most people.
If the requirements set out above seem daunting – how will it be possible to launch an AR app on a ubiquitous platform when that platform doesn’t exist yet? – Pokemon Go has another lesson for us: baby steps are acceptable. Simply adding SLAM to a phone and using that as the basis for real-world interaction will take us to the next level, and thanks to Tango, we will be getting that within the year.
Advances in display technology – Epson’s Moverio, perhaps – will make the headset experience better, but until then we have evidence that head-mounted displays are not necessary. AR doesn’t have to be immersive, which means it doesn’t have to be hands-free, leaving developers the freedom to deliver a much wider range of AR experiences without needing to address the headset problems which plague VR.
So, while Pokemon Go has caused much eye-rolling in some quarters for stealing the AR limelight while not actually being AR, it has turned out to be a very positive thing. Simply by being different, it has broken down preconceived notions of what AR is, pointed out once again what is really important and shown us a way forward based on users, rather than technology.
Vision systems are making their way into all sorts of mobile and portable equipment these days. It’s a natural consequence of the reduction in cost and power draw as silicon technology continues down the geometry curve, and of the algorithms now appearing that take advantage of this newly available compute resource.
The autonomous car is the poster child for this process, but given the drone’s similar need for autonomy as well as its strong focus on visuals, it should be no surprise that the same thing is happening in the drone industry. The surprise is that it has taken this long to become significant, and the fact that it is happening now must be down to some major changes taking place in the industry.
After being the preserve of hobbyists and commerce – mainly survey and surveillance – for many years, drones have finally broken out as a major consumer item over the last few years, and as usual, the introduction of consumers to the mix has driven both sales volume and innovation. As we all know, consumers have unreasonable expectations of everything, and there is nothing the semiconductor industry does better than making the unreasonable possible.
There have been radical improvements in algorithm design over the last few years. The standout examples are SLAM (simultaneous localisation and mapping), which enables short-range 3D modelling of environments in real time, and HOG (histogram of oriented gradients), for detection and tracking of non-rigid objects (read: people). The addition of these capabilities to the repertoire represents a major functional upgrade which is immediately useful to the user, and drones from all the major OEMs are beginning to show up that base enhanced ‘follow me’ and other navigation features on them.
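Both algorithms are available off the shelf. As a concrete example, OpenCV ships a HOG-based pedestrian detector with pre-trained SVM coefficients; a minimal use of it, of the kind a ‘follow me’ feature might build on, looks like this (the input filename is just a placeholder):

```cpp
#include <opencv2/imgcodecs.hpp>
#include <opencv2/objdetect.hpp>
#include <vector>

int main() {
    // HOG descriptor loaded with OpenCV's pre-trained pedestrian SVM.
    cv::HOGDescriptor hog;
    hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());

    // One frame from the camera (placeholder filename).
    cv::Mat frame = cv::imread("frame.png");

    // Slide the detection window over the image at multiple scales;
    // each returned rectangle is a candidate person.
    std::vector<cv::Rect> people;
    hog.detectMultiScale(frame, people, 0 /* hit threshold */,
                         cv::Size(8, 8) /* window stride */);

    // 'people' now holds the boxes a follow-me controller would track.
    return 0;
}
```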
The prospect of new FAA rules permitting beyond-line-of-sight operation and single-operator control of multiple drones is another driving factor. Scheduled to arrive later this year, these rules are sure to carry increased safety requirements with them and are critical to the feasibility of services like Amazon’s drone delivery program. In fact, vision seems to be a major part of the Amazon system, as evidenced by the promo videos showing the drone using a target image to locate the landing zone.
Products on show at last week’s Interdrone 2016 conference highlighted this as a significant trend, and the presence of vision experts from three major semiconductor companies confirmed it. Representatives of Nvidia, Intel and Qualcomm were at pains to showcase their expertise in vision and to demonstrate their commitment to the drone industry, and all three companies are backing that up with investment.
Intel was showing the results of its collaboration with Yuneec in the form of the Typhoon H quadcopter, which integrates an Intel RealSense R200 camera system to implement 3D mapping based on Intel’s ‘active stereo’ technology. Intel has also announced its intention to acquire Movidius, whose Myriad 2 VPU is integrated into DJI’s Phantom 4 drone.
While DJI was not exhibiting at the conference, it made its presence felt by announcing jointly with Epson that it will offer the Moverio AR glasses as an accessory in its store – another way that FAA rules are shaping this industry, given the current mandate that pilots keep visual contact with the drone.
For its part, Nvidia was represented in several products, with the Tegra K1 being used, in the case of Parrot, to implement GPU-accelerated SLAM and in other cases as an object recognition engine. Qualcomm, of course, is active in vision for the mobile handset market and has been using visual navigation of a drone as a technology demonstrator for several years.
As well as these high-profile examples, there is a slew of lesser-known companies entering the fray, most of them aiming for sensor fusion with vision to achieve a level of responsiveness and robustness that has so far been lacking.
With all this investment and new, clearer, operator rules promised by the FAA, these are exciting times for the drone industry.
Augmented reality is usually seen as a successor technology to virtual reality, a different application with a different (and harder) set of problems to solve which will come into its own in a few years.
VR is here today, the problems are being rapidly knocked down, billions of dollars are being invested and there is a huge user base just waiting for the hardware to cross that availability threshold and Bam! the whole thing is going to explode.
That’s the buzz, anyway, but a number of measured arguments are being made which cast serious doubt on the viability of VR as a mass consumer technology today, and there are good reasons to expect that the problems with VR can only be solved by merging it with AR.
Most of the arguments predicting limited success for the current crop of VR implementations are non-technical, which makes sense now that many of the technical barriers which doomed previous generations of VR are being rapidly demolished.
We have now had plenty of opportunity to try out a whole range of headsets, and it is clear that the classic problems – field of view, lens correction, responsiveness – are either solved or well on the way.
Issues like user isolation, the difficulty of mapping established interactive paradigms onto VR and the problem of providing an equivalent visual experience have not gone away, or even really been addressed in a meaningful way, and they are the focus of these arguments.
And then there’s the headset.
Those last two issues – a visual experience equivalent to the best of what is already available, and the need for a headset – were the twin nemeses of 3D TV. VR boosters bristle at the mention of that epic failure, arguing that these are two different things, but that is not the point. People are the same, which is what matters, and visual quality is important to them, so we do need to pay attention to the lessons learned from that failure.
3D TV has the problem that, because of the need to deliver a different image to each eye, visual quality is degraded by comparison with standard video. This in itself has been enough to dissuade many viewers. Gamers are more resilient in that regard than general consumers, so why is it still a problem for VR? The answer lies in the extreme nature of the requirements and the gap between where graphics hardware is today and where it needs to be to deliver a true, no-compromise experience.
The driving constraint of VR is latency, what is referred to as motion-to-photon delay, so in addition to delivering a left/right pair, the system must also meet a very tight deadline for getting the pixels on the screen. From sensing a motion of the head, modern systems aim to get the resulting image on screen within a frame time, and they aim to do it consistently, without the variations which are acceptable in a standard gaming system.
Standard game systems observe the twitch limit, the tenth of a second or so which represents the limit of human reaction time. This means that the various stages of the game can be pipelined: each stage must stay within the target frame rate but it’s OK to have the final displayed image several frames behind, as long as the overall delay stays within the twitch limit.
VR systems don’t have that luxury. Everything, including responding to the sensor, animating the scene and then rendering it, must happen an order of magnitude faster, with a corresponding reduction in the sophistication of the models and the quality of the rendering. This is not a trivial point: it means that GPU performance must increase by 10x before the visual quality of today’s PC is available on a VR system, and according to one study the last order-of-magnitude increase took eight years to happen. There’s a reason the best-looking VR games are mostly static, and this is it.
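The arithmetic is stark. Assuming a 90 Hz display, typical of the current headsets, the motion-to-photon budget is a single frame time, while a conventional pipelined game only has to land inside the twitch limit:

\[
t_{\mathrm{VR}} \approx \frac{1}{90\,\mathrm{Hz}} \approx 11\,\mathrm{ms},
\qquad
t_{\mathrm{twitch}} \approx 100\,\mathrm{ms},
\qquad
\frac{t_{\mathrm{twitch}}}{t_{\mathrm{VR}}} \approx 9
\]

which is the order-of-magnitude gap referred to above.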
VR has to offer a compelling reason, even to gamers, if they are to adopt it. Immersion is not enough: gameplay and realism are at least as important, and if these are substandard then you are just immersed in a crappy game. Fortunately, VR has much more to offer than visual immersion, so we are seeing games like “The Climb” from Crytek, which tries to leverage that but has in fact been more useful in exposing more of the shortcomings of current equipment. Which brings us back to the headset.
If polarized glasses were an issue for 3D TV, the isolation and discomfort of a headset is the same issue in spades for VR.
I believe the hardware manufacturers when they say that the enormous sums being spent on the design of lighter and more ergonomic headsets will result in devices no more inconvenient than a pair of sunglasses. But that only gets us to the same status as 3D glasses, which is still a problem, and it overlooks the real issue, which is isolation: isolation from the real environment and isolation from real people.
Once in a VR world, the user is blind to the real world, and now that we are starting to see apps and demos which encourage moving around, the disadvantages of that are starting to become obvious.
A world which gives no indication of its extent is not a comfortable place to be. Add a second VR user to the same space and the problem gets even worse; as long as the headsets are unaware of each other’s positions, there is the potential for collisions. This is a very poor user experience, no matter how compelling the immersion feels otherwise.
Without a solution to these problems, VR remains a solo activity conducted while seated or constrained to very limited movement (Google’s Tilt Brush app works nicely within those constraints but is hardly the basis for mass adoption of the technology). As long as that is the case, its consumer appeal will never cross over from enthusiast to mainstream.
Now, I am not going to argue that VR will fail completely but I am convinced that mass deployment is much further away than a lot of the hype would suggest. This is because the required technologies are coming from AR and are in a much earlier phase of development, which puts back the adoption curve by several years.
To be fair to the VR hardware companies, they are aware of the need for positioning and have come out with systems like the Lighthouse tracking used by the HTC Vive to address it. The problem is that these are partial solutions, and positioning is not the only problem.
Vision-based systems can provide robust solutions to many of the problems outlined above, and we are starting to see VR headsets which include forward-facing cameras (as well as internal eye-tracking cameras, one approach to solving the GPU performance problem via foveated rendering).
With a camera the headset can autonomously map its environment, including stationary and moving objects such as other users, and can also eliminate the need for handheld controllers via gesture control.
More importantly, vision can convincingly locate the user within the virtual world in a way that is not possible today. The spooky disembodied hands in “The Climb” can be correctly connected to the user’s body, giving a feeling of immersion that goes well beyond the visual.
The point at which that becomes possible is the point at which a compelling user experience has been created. What makes it possible is vision, until now seen as the domain of augmented reality.
The downsides of including vision are that it adds expense, for the sensors plus a new vision-specific processing unit, as well as delay. Google’s Tango project and recent events in the auto industry have shown us how much more there is to do on the software side, but my opinion is that the benefits outweigh the costs, especially since I am convinced that the current generation of ‘blind’ headsets will soon be regarded as incomplete products.
So my conclusion is that we are not there yet for VR. It will take another hardware cycle to get the equipment right, and the successful products will leverage a whole new set of skills coming from a different set of companies.