We've reached a significant milestone here at Visualise The World with the launch of the inaugural issue of The VPU Report!
What is it? It's a to-the-point, detailed analysis of vision processors, whether offered as SoCs or as IP, from four of the leading companies in the field. Future quarterly issues will present up-to-the-minute information on VPUs from a different selection of companies, building into a detailed resource for marketing and management professionals. The current issue starts the ball rolling with analysis of Movidius, Intel, Ceva and Inuitive: four of the most interesting companies operating in the consumer edge-device category.
At just under fifty pages, packed with more than twenty detailed diagrams plus feature and performance tables and concise descriptions, the report strips things down to basics and presents just the facts, so that you can quickly absorb technical details in a convenient, easy-to-read style.
Along with that, you will find a quick summary of each company reviewed (useful when so many entrants in this field are startups) and commentary shedding light on the potential and pitfalls of VPU design.
If you are interested or involved in the coming revolution in vision-enabled devices, or if you need to know about the chips and IP that will run the vision and neural-network algorithms powering them, you will need this report. And occasionally there will be a scoop, such as, in this issue, the first public description of an as-yet-unreleased device!
The report speaks for itself, so I've attached a PDF of the contents and introduction. Take a look, then click the link above and I'll get in touch with rates and subscription options.
CEVA announced its new XM-6 vision processor yesterday (27th September) and will reveal details at today's Linley Processor Conference in Santa Clara.
As one of the companies leading the race to provide power-efficient vision processing aimed specifically at the mobile and embedded markets (as opposed to repurposing power-hungry GPUs built for workstation and datacenter use), CEVA's designs, and the traction they gain in the market, tell us a lot about the development and maturity of vision-based equipment, including AR headsets, drones and more.
CEVA claims a 3x performance improvement on vector-heavy code for the new core, and a 2x boost for 'average' code, versus the previous-generation chip. It also reveals that, for the first time, it has included hardware dedicated to specific functions.
On the performance front, CEVA makes no mention of an increased clock rate, and the total compute resource (vector and scalar integer and floating-point units) has increased only moderately over the XM-4, leading to the conclusion that much of the improvement is due to detailed architectural changes introduced with this generation.
On the basics, the XM-6 is similar to the XM-4: the base version of the core includes four 32-bit scalar units linked VLIW-style to 128 16-bit integer MACs, with a 32-way 16-bit floating-point unit available as an option. In detail, however, there are some differences: the 128 MACs have been upgraded to provide full 16x16 functionality (with 8x16 going up to 256 wide), and where they were previously divided into two 32/64-wide SIMD vectors, the new core splits them differently, into one 64/128-wide SIMD unit and two 32/64-wide units.
At the most basic level, this allows the optional FP units to be substituted conveniently without disturbing the rest of the architecture, but it also allows CEVA to tune the vector resources more closely to the workloads and increase the amount of available parallelism; a good proportion of the performance increase is likely due to that change alone.
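To put those MAC counts in context, here is a back-of-the-envelope sketch of peak throughput in the two precision modes described above. The clock rate is a placeholder assumption of my own; CEVA has not announced one.

```python
# Rough peak-throughput arithmetic for the XM-6 vector units.
# CLOCK_MHZ is a hypothetical figure, not a CEVA specification.
CLOCK_MHZ = 1000

def peak_gmacs(macs_per_cycle: int, clock_mhz: int = CLOCK_MHZ) -> float:
    """Peak MAC throughput in giga-MACs per second."""
    return macs_per_cycle * clock_mhz * 1e6 / 1e9

print(peak_gmacs(128))   # 16x16-bit mode: 128 MACs/cycle -> 128.0 GMAC/s
print(peak_gmacs(256))   # 8x16-bit mode: 256 MACs/cycle -> 256.0 GMAC/s
```

At a nominal 1 GHz, the 8x16 mode simply doubles the headline number, which is why reduced-precision modes feature so prominently in vendor performance claims.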
It is also interesting that the optional floating point has been downgraded to half precision (mediump in GPU terms). This seems to represent a growing confidence among VPU designers that 32 bits is overkill for the vectorised portions of their workloads, so they are optimising for 16 bits. The fact that this functionality is still optional is another strong indicator that having floating point at all is a luxury many applications cannot afford, something which continues to drive the widening split between dedicated VPUs and repurposed GPUs.
Just as interesting is that CEVA credits much of the performance improvement not to gains in computational capacity but to efficiency improvements from its proprietary data-handling schemes. A sophisticated buffer-handling unit, which offloads management of large image datasets, is teamed with two specific mechanisms for improving the vectorisation of operands and hence the overall utilisation of the vector units. Improvements to both, in particular the scatter-gather mechanism whereby the load/store unit can assemble a 1D data vector from a 2D array in a single cycle, account for a large part of the rationale behind this new release.
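What the scatter-gather unit does in one cycle can be sketched in software: pulling a 1D operand vector out of scattered 2D image coordinates. The example below is purely illustrative (the image and coordinates are made up); in software each gathered element costs a separate memory access, which is exactly the overhead the hardware removes.

```python
# A toy 8x8 "image" where each pixel's value encodes its position.
width, height = 8, 8
image = [[y * width + x for x in range(width)] for y in range(height)]

# Gather the 3x3 neighbourhood around pixel (4, 4) as (x, y) pairs --
# a typical access pattern for a convolution or filter kernel.
coords = [(x, y) for y in range(3, 6) for x in range(3, 6)]

# The "gathered" 1D vector, ready to feed a SIMD unit in one go.
vector = [image[y][x] for x, y in coords]
print(vector)   # -> [27, 28, 29, 35, 36, 37, 43, 44, 45]
```

Keeping the vector units fed with operands laid out like this, rather than stalling on strided loads, is precisely the utilisation problem the paragraph above describes.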
It's certainly true that a large part of wringing maximum performance out of a massively parallel system lies in solving the data/compute mismatch problem, and this is CEVA's answer to it. It will be interesting to see how it plays out against the various multithreaded and hybrid approaches out there.
*Note: this story has been edited to clarify the number and type of MAC ALUs in the XM-6.