High efficiency and low power consumption: CEVA's newly announced DSP leads the way

Introduction: CEVA today announced the CEVA-XM6, a fifth-generation image and computer vision DSP offering higher performance, more compute, and lower power consumption. Deep learning, neural networks, and image/vision processing are already among the most important areas of computer science, but many of the tools they rely on are still in their early stages. Machine learning systems that process data in real time and with high accuracy also tend to be expensive.

Note: This article was originally published by AnandTech (author: Ian Cutress) and compiled by Lei Fengnet (search for the "Lei Feng Net" public account); it may not be reproduced without permission.


Deep learning, neural networks, and image/vision processing have become a major field, yet many of the applications that rely on them are still in their early stages. Automobiles are the most typical example. Solving the problems cars face requires developing hardware and software in tandem, processing data in real time with high precision, and paving the way for other machine learning workloads. The problem is cost and power consumption. The CEVA-XM4 was billed as the first programmable DSP to support deep learning, and today CEVA introduced the new XM6 IP together with a software ecosystem, offering greater efficiency, more computing power, and newly patented power-saving features.

Playing the IP game

When CEVA announced that its XM4 DSP could run inference with a fixed-point algorithm at essentially the same accuracy as the full pre-trained floating-point model, with an error of under 1%, it won over many analysts in the field. CEVA stated that high performance and power efficiency make it competitive, and the early progress of its software framework stands out. The IP was announced in Q1 2015 and licensed the following year; the first silicon built on the IP is due this year. Since then, CEVA has released its CDNN2 platform, a one-click compilation tool that takes a trained network and translates it into code suited to CEVA's XM IP. The new-generation XM6 builds on the XM4's features with an improved configuration, access to hardware accelerators, and a new hardware accelerator of its own, while retaining compatibility with the CDNN2 platform: code written for the XM4 also runs on the XM6, at higher performance.
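To see why a fixed-point network can track its floating-point original to within 1%, consider how quantization error scales. The sketch below is a toy illustration (not CEVA's actual quantization scheme): it rounds a set of synthetic "weights" to a hypothetical 16-bit Q3.12 fixed-point format and measures the mean relative error, which lands far below the 1% threshold.

```python
import numpy as np

def to_fixed_point(x, frac_bits=12):
    """Quantize floats to a 16-bit fixed-point format (Q3.12 here) and back."""
    scale = 1 << frac_bits
    q = np.clip(np.round(x * scale), -(1 << 15), (1 << 15) - 1).astype(np.int16)
    return q.astype(np.float64) / scale

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.5, size=10_000)      # toy stand-in for pre-trained weights
approx = to_fixed_point(weights)
rel_err = np.abs(approx - weights).mean() / np.abs(weights).mean()
print(f"mean relative error: {rel_err:.4%}")   # well under 1% at 12 fractional bits
```

The rounding error per value is bounded by half the quantization step (2^-13 here), which is tiny relative to typical weight magnitudes; the 1% figure CEVA quotes is for end-to-end network accuracy, a stricter but related measure.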

CEVA is an IP business like ARM: it licenses designs to semiconductor companies, which in turn sell chips to OEMs. Pushing a new product from conception to market therefore usually takes a long time, especially while the security and automotive industries are developing so rapidly. CEVA has built the XM6 as a scalable, programmable DSP that can span these markets with a single code base, while using additional features to improve power, performance, and cost.

Today's announcement includes the new XM6 DSP, a new series of CEVA image and vision software libraries, a new set of hardware accelerators, and integration into the CDNN2 ecosystem. CDNN2 is a one-click compilation tool that detects convolutions and maps the data onto the logic blocks and accelerators in the most efficient way.

The XM6 will support OpenCL and C++ development tools, as well as software components including CEVA's computer vision, neural network, and vision processing libraries, alongside third-party tools. The hardware implements AXI interconnects so the processing parts of the standard XM6 core can communicate with the accelerators and memory. The XM6 IP includes a convolutional hardware accelerator, the CDNN assistant, which lets low-power fixed-function hardware handle the demanding parts of a neural network such as GoogLeNet. It can also correct fish-eye images or lens warping: since the image distortion is known in advance, the transform is a fixed function well suited to dedicated hardware. Other third-party hardware accelerators can be attached as well.

Two new hardware features in the XM6 will help most image processing and machine learning algorithms. The first is scatter-gather: the ability to read values from 32 addresses in the L1 cache into a vector register in a single cycle. The CDNN2 compiler tool recognizes serial loads in code and vectorizes them to exploit this feature. Scatter-gather improves data load times when the required data is spread across the memory structure. Since the XM6 is configurable IP, the size and configuration of the L1 data store are adjustable at the silicon design level, and CEVA says the feature works with any L1 size. The vector register used for this processing is part of a VLIW implementation with a width of eight, in order to meet this requirement.

The second feature is called "sliding-window" data processing, a vision-processing technique CEVA has patented. There are many ways to process an image for filtering or intelligence. Typically an algorithm works on one block, or a large set, of pixels at a time. For the intelligence parts, these blocks overlap, so different compute regions end up reusing the same areas of the image. CEVA's approach is to retain that data, so less information needs to be fetched for the next analysis. It sounds very simple; I did something similar in 2009 for three-dimensional differential equation analysis, and frankly I was surprised it had not been applied to vision/image processing before. If you have local storage, reusing raw data saves both time and energy.
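The saving is easy to quantify in one dimension. The sketch below (my illustration, not CEVA's patented implementation) compares reloading every pixel of each window against sliding the window and loading only the one new pixel per step:

```python
import numpy as np

def windows_naive(row, width):
    """Reload every pixel of every window from scratch."""
    return [row[i:i + width].copy() for i in range(len(row) - width + 1)]

def windows_sliding(row, width):
    """Keep the overlap: each step reuses width-1 pixels already loaded."""
    out = []
    window = list(row[:width])        # initial load: `width` pixels
    loads = width
    out.append(list(window))
    for i in range(width, len(row)):
        window.pop(0)
        window.append(row[i])         # only one new pixel loaded per step
        loads += 1
        out.append(list(window))
    return out, loads

row = np.arange(16)
naive = windows_naive(row, 3)
slid, loads = windows_sliding(row, 3)
assert all(np.array_equal(a, b) for a, b in zip(naive, slid))
print(f"naive loads: {len(naive) * 3}, sliding loads: {loads}")  # 42 vs 16
```

For a width-3 window over 16 pixels, the naive approach performs 42 loads versus 16 with reuse; in two dimensions, with larger convolution windows, the ratio grows accordingly.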

CEVA claims the XM6 delivers 3x the performance of the XM4 on heavy vector workloads, and an average of 2x on ported kernels. The XM6 is also easier to configure in code than the XM4, providing "50% more control."

Paired with the dedicated CDNN hardware accelerator (HWA), CEVA points out that the convolutional layers in networks such as GoogLeNet consume most of the cycles. The CDNN HWA takes this code and implements it in fixed hardware using 512 MACs with 16-bit support, achieving an 8x performance gain at 95% utilization. CEVA mentioned that a 12-bit approach would save die area and cost while minimizing the loss of precision, but some developers required a full 16-bit approach to support future projects, so 16 bits was the final choice.

In automotive image/video processing, CEVA has two major competitors: Mobileye, and NVIDIA, whose TX1 targets neural network training and inference. Comparing against the TX1 (built on TSMC's 20nm planar process) at 690 MHz, CEVA says its internal simulations show a single XM6 platform being 25 times more power-efficient and four times faster on AlexNet and GoogLeNet. These figures assume the XM6 is also built at 20nm, although it can be implemented on 28nm or 16nm FinFET processes as well. Against NVIDIA's published single-batch TX1 data, the XM6 running AlexNet at FP16 achieves 268 frames per second versus 67, at only 800 mW versus 5.1 W. At 16FF the power figure would likely be lower still; CEVA told us its internal metrics were originally run at 28nm/16FF but were re-measured at 20nm to match the TX1. It should be noted that NVIDIA's multi-batch TX1 numbers show better efficiency than single-batch, but no further comparison values were provided. CEVA also implements power gating alongside a DVFS scheme, which reduces power when parts of the DSP or the accelerator are idle.
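The quoted figures are internally consistent, which is worth a quick check. Using only the numbers above, frames per second per watt gives the efficiency ratio:

```python
# Sanity-check the quoted AlexNet single-batch figures: throughput and perf/W.
xm6_fps, xm6_w = 268, 0.8    # XM6 at 20nm, FP16, per CEVA's simulation
tx1_fps, tx1_w = 67, 5.1     # TX1 single-batch, per NVIDIA's published data

speedup = xm6_fps / tx1_fps                                 # throughput ratio
efficiency_ratio = (xm6_fps / xm6_w) / (tx1_fps / tx1_w)    # perf-per-watt ratio
print(f"speedup: {speedup:.1f}x, perf/W ratio: {efficiency_ratio:.1f}x")
```

This reproduces the claimed 4x speed and ~25x efficiency advantage, with the caveat that both rest on a single-batch TX1 comparison point.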

Obviously, NVIDIA's strengths are the availability of its solutions and its CUDA/OpenCL software development ecosystem, both of which CEVA aims to counter with a one-click software platform like CDNN2 and improved hardware like the XM6. It remains to be seen which semiconductor partners and future implementations will combine this image processing with machine learning. CEVA points to smartphones, automobiles, security, and commercial applications such as drones and automation as the main targets.

Via: AnandTech
