Edge AI challenges memory technology

With the rise of AI at the edge comes a whole host of new requirements for memory systems. Can today’s memory technologies live up to the stringent demands of this challenging new application, and what do emerging memory technologies promise for edge AI in the long term?

The first thing to realize is that there is no standard “edge AI” application; the edge in its broadest interpretation covers all AI-enabled electronic systems outside the cloud. That might include “near edge,” which generally covers enterprise data centers and on-premise servers.

Further out are applications like computer vision for autonomous driving. Gateway equipment for manufacturing performs AI inference to check for flaws in products on the production line. 5G “edge boxes” on utility poles analyse video streams for smart city applications like traffic management. And 5G infrastructure uses AI at the edge for complex but efficient beam-forming algorithms.

At the “far edge,” AI is supported in devices such as mobile phones (think Snapchat filters), voice-controlled appliances and IoT sensor nodes in factories that perform sensor fusion before sending the results to another gateway device.

The role of memory in edge AI systems (storing neural network weights, model code, input data and intermediate activations) is the same for most AI applications. Workloads must be accelerated to maximize AI computing capacity in order to remain efficient, so demands on capacity and bandwidth are generally high. However, application-specific demands are many and varied, and may include size, power consumption, low voltage operation, reliability, thermal/cooling considerations and cost.

Edge data centers

Edge data centers are a key edge market. Use cases include medical imaging, research and complex financial algorithms, where privacy prevents uploading to the cloud. Another is self-driving vehicles, where latency prevents it.

These systems use the same memories found in servers in other applications.

“It is important to use low latency DRAM for fast, byte-level main memory in applications where AI algorithms are being developed and trained,” said Pekon Gupta, solutions architect at Smart Modular Technologies, a designer and developer of memory products. “High capacity RDIMMs or LRDIMMs are needed for large data sets. NVDIMMs are needed for system acceleration — we use them for write caching and checkpointing instead of slower SSDs.”

Locating computing nodes close to end users is the approach taken by telecommunications carriers.

“We’re seeing a trend to make these [telco] edge servers more capable of running complex algorithms,” Gupta said. Hence, “service providers are adding more memory and processing power to these edge servers using devices like RDIMM, LRDIMM and high-available persistent memory like NVDIMM.”

Gupta sees Intel Optane, the company’s 3D XPoint non-volatile memory whose properties are between DRAM and flash, as a good solution for server AI applications.

“Both Optane DIMMs and NVDIMMs are being used as AI accelerators,” he said. “NVDIMMs provide very low latency tiering, caching, write buffering and metadata storage capabilities for AI application acceleration. Optane data center DIMMs are used for in-memory database acceleration where hundreds of gigabytes to terabytes of persistent memory are used in combination with DRAM. Although these are both persistent memory solutions for AI/ML acceleration applications, they have different and separate use cases.”

Kristie Mann, Intel’s director of product marketing for Optane, told EE Times that Optane is gaining applications in the server AI segment.

“Our customers are already using Optane persistent memory to power their AI applications today,” she said. “They are powering e-commerce, video recommendation engines and real-time financial analysis usages successfully. We are seeing a shift to in-memory applications because of the increased capacity available.”

DRAM’s high prices are increasingly making Optane an attractive alternative. A server with two Intel Xeon Scalable processors plus Optane persistent memory can hold up to 6 terabytes of memory for data-hungry applications.

“DRAM is still the most popular, but it has its limitations from a cost and capacity perspective,” said Mann. “New memory and storage technologies like Optane persistent memory and Optane SSD are [emerging] as an alternative to DRAM due to their cost, capacity and performance advantage. Optane SSDs are particularly powerful caching HDD and NAND SSD data to continuously feed AI applications data.”

Optane also compares favourably to other emerging memories which are not fully mature or scalable today, she added.

An Intel Optane 200 Series module. Intel says Optane is already used to power AI applications today. (Source: Intel)

GPU acceleration

For high-end edge data center and edge server applications, AI compute accelerators like GPUs are gaining traction. As well as DRAM, the memory choices here include GDDR, a special DDR SDRAM designed to feed high-bandwidth GPUs, and HBM, a relatively new die-stacking technology which places multiple memory die in the same package as the GPU itself.

Both are designed for the extremely high memory bandwidth required by AI applications.

For the most demanding AI model training, HBM2E offers 3.6 Gbps per pin and provides a memory bandwidth of 460 GB/s per stack (two HBM2E stacks provide close to 1 TB/s). That’s among the highest performance memory available, in the smallest area with the lowest power consumption. HBM is used by GPU leader Nvidia in all its data center products.

GDDR6 is also used for AI inference applications at the edge, said Frank Ferro, senior director of product marketing for IP Cores at Rambus. Ferro said GDDR6 can meet the speed, cost and power requirements of edge AI inference systems. For instance, GDDR6 can deliver 18 Gbps per pin, which works out to 72 GB/s per device. Having four GDDR6 DRAMs provides close to 300 GB/s of memory bandwidth.
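The bandwidth figures quoted for HBM2E and GDDR6 follow from simple arithmetic: per-pin data rate multiplied by interface width, divided by eight bits per byte. The short Python sketch below reproduces them. The interface widths (1,024 bits per HBM2E stack, 32 bits per GDDR6 device) are assumptions drawn from the standard configurations of those memories, not figures stated in the article, and the function name is purely illustrative.

```python
def peak_bandwidth_gb_s(pin_rate_gbps: float, bus_width_bits: int, devices: int = 1) -> float:
    """Peak bandwidth in GB/s: per-pin rate (Gb/s) x bus width (bits) / 8 bits per byte."""
    return pin_rate_gbps * bus_width_bits / 8 * devices

# Assumed interface widths (not given in the article).
HBM2E_STACK_WIDTH_BITS = 1024   # one HBM2E stack
GDDR6_DEVICE_WIDTH_BITS = 32    # one GDDR6 device

hbm2e_one_stack = peak_bandwidth_gb_s(3.6, HBM2E_STACK_WIDTH_BITS)
hbm2e_two_stacks = peak_bandwidth_gb_s(3.6, HBM2E_STACK_WIDTH_BITS, devices=2)
gddr6_one_device = peak_bandwidth_gb_s(18, GDDR6_DEVICE_WIDTH_BITS)
gddr6_four_devices = peak_bandwidth_gb_s(18, GDDR6_DEVICE_WIDTH_BITS, devices=4)

print(f"HBM2E, one stack:    {hbm2e_one_stack:.1f} GB/s")
print(f"HBM2E, two stacks:   {hbm2e_two_stacks:.1f} GB/s")
print(f"GDDR6, one device:   {gddr6_one_device:.1f} GB/s")
print(f"GDDR6, four devices: {gddr6_four_devices:.1f} GB/s")
```

These are theoretical peaks; sustained bandwidth in a real system will be lower and depends on the access pattern and the memory controller.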

“GDDR6 is used for AI inference and ADAS applications,” Ferro added.

Comparing GDDR6 with LPDDR, which Nvidia uses for most of its non-data-center edge products from the Jetson AGX Xavier to the Jetson Nano, Ferro acknowledged that LPDDR is suited to low-cost AI inference at the edge or endpoint.

“The bandwidth of LPDDR is limited to 4.2 Gbps for LPDDR4 and 6.4 Gbps for LPDDR5,” he said. “As the memory bandwidth demands go up, we will see an increasing number of designs using GDDR6. This memory bandwidth gap is helping to drive demand for GDDR6.”

Although GDDR was designed to sit alongside GPUs, other processing accelerators can take advantage of its bandwidth. Ferro highlighted the Achronix Speedster7t, an FPGA-based AI accelerator used for inference and some low-end training.

“There is room for both HBM and GDDR memories in edge AI applications,” said Ferro. HBM “will continue to be used in edge applications. For all of the advantages of HBM, the cost is still high due to the 3D technology and 2.5D manufacturing. Given this, GDDR6 is a good trade-off between cost and performance, especially for AI Inference in the network.”

HBM is used in high-performance data center AI ASICs like the Graphcore IPU. While it offers stellar performance, its price tag can be steep for some applications.

Qualcomm is among those opting for standard DRAM instead. Its Cloud AI 100 targets AI inference acceleration in edge data centers, 5G “edge boxes,” ADAS/autonomous driving and 5G infrastructure.

“It was important for us to use standard DRAM as opposed to something like HBM, because we want to keep the bill of materials down,” said Keith Kressin, general manager of Qualcomm’s Computing and Edge Cloud unit. “We wanted to use standard components that you can buy from multiple suppliers. We have customers who want to do everything on-chip, and we have customers that want to go cross-card. But they all wanted to keep the cost reasonable, and not go for HBM or even a more exotic memory.

“In training,” he continued, “you have really big models that would go across [multiple chips], but for inference [the Cloud AI 100’s market], a lot of the models are more localized.”

The far edge

Outside the data center, edge AI systems generally focus on inference, with a few notable exceptions such as federated learning and other incremental training techniques.

Some AI accelerators for power-sensitive applications use memory for AI processing. Inference, which is based on multi-dimensional matrix multiplication, lends itself to analog compute techniques with an array of memory cells used to perform calculations. Using this technique, Syntiant’s devices are designed for voice control of consumer electronics, and Gyrfalcon’s devices have been designed into a smartphone where they handle inference for camera effects.

In another example, intelligent processing unit specialist Mythic uses analog operation of flash memory cells to store an 8-bit integer value (one weight parameter) on a single flash transistor, making it much denser than other compute-in-memory technologies. The programmed flash transistor functions as a variable resistor; inputs are supplied as voltages and outputs collected as currents. Combined with ADCs and DACs, the result is an efficient matrix-multiply engine.

Mythic’s IP resides in the compensation and calibration techniques that cancel out noise and allow reliable 8-bit computation.

Mythic uses an array of Flash memory transistors to make dense multiply-accumulate engines (Source: Mythic)
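The principle behind this kind of analog matrix multiplication can be sketched in a few lines of Python. This is an idealized illustration only, not Mythic's implementation: it ignores noise, so the compensation and calibration techniques described above are not modeled, weights are treated as unsigned for simplicity, and the unit sizes `V_STEP` and `G_STEP` are hypothetical values chosen for readability. Each weight becomes a conductance, each input becomes a voltage, Ohm's law gives the current through each cell, and the currents on a shared column wire add up to a dot product.

```python
# Hypothetical unit sizes for the converters and memory cells (illustrative only).
V_STEP = 1e-3   # volts per DAC input code
G_STEP = 1e-6   # siemens per weight code

def dac(codes):
    """Convert integer input codes to voltages."""
    return [c * V_STEP for c in codes]

def program_cells(weight_codes):
    """Store each 8-bit weight (0..255) as the conductance of a single cell."""
    for row in weight_codes:
        for w in row:
            if not 0 <= w <= 255:
                raise ValueError("weights must fit in 8 bits")
    return [[w * G_STEP for w in row] for row in weight_codes]

def column_currents(conductances, voltages):
    """Ohm's law per cell (I = V * G); currents sharing a column wire sum together."""
    n_cols = len(conductances[0])
    return [sum(v * row[j] for v, row in zip(voltages, conductances))
            for j in range(n_cols)]

def adc(currents):
    """Convert column currents back to integer results."""
    return [round(i / (V_STEP * G_STEP)) for i in currents]

# Three inputs feeding two output columns.
weights = [[10, 200],
           [255, 0],
           [3, 17]]
inputs = [1, 2, 3]

analog_result = adc(column_currents(program_cells(weights), dac(inputs)))
digital_result = [sum(x * row[j] for x, row in zip(inputs, weights)) for j in range(2)]
print(analog_result, digital_result)
```

In this noise-free model the analog path returns the same integers as the digital reference. In real silicon, device variation and noise perturb the currents, which is why calibration matters.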

Aside from compute-in-memory devices, ASICs are popular for specific edge niches, particularly for low- and ultra-low power systems. Memory systems for ASICs use a combination of several memory types. Distributed local SRAM is the fastest and most power-efficient option, but it is not very area-efficient. Having a single bulk SRAM on the chip is more area-efficient but introduces performance bottlenecks. Off-chip DRAM is cheaper but uses much more power.

Geoff Tate, CEO of Flex Logix, said finding the right balance between distributed SRAM, bulk SRAM and off-chip DRAM for its InferX X1 required a range of performance simulations. The aim was to maximize inference throughput per dollar — a function of die size, package cost and number of DRAMs used.

“The optimal point was a single x32 LPDDR4 DRAM; 4K MACs (7.5 TOPS at 933MHz); and around 10MB SRAM,” he said. “SRAM is fast, but it is expensive versus DRAM. Using TSMC’s 16-nm process technology, 1MB of SRAM takes about 1.1mm². Our InferX X1 is just 54mm² and due to our architecture, DRAM accesses are largely overlapped with computation so there is no performance compromise. For large models having a single DRAM is the right trade-off, at least with our architecture,” Tate said.
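Tate's figures can be sanity-checked with some back-of-envelope arithmetic. The Python sketch below is a rough estimate under stated assumptions, not Flex Logix's own calculation: it reads “4K MACs” as 4,096, counts each multiply-accumulate as two operations (a common convention), and uses the 4.2 Gbps LPDDR4 per-pin rate quoted by Ferro earlier, since the article does not say what speed the InferX X1's DRAM runs at.

```python
# Figures quoted by Tate.
CLOCK_HZ = 933e6          # 933 MHz
SRAM_MB = 10              # "around 10MB SRAM"
SRAM_MM2_PER_MB = 1.1     # TSMC 16 nm, per Tate
DIE_MM2 = 54              # InferX X1 die size
DRAM_WIDTH_BITS = 32      # single x32 LPDDR4

# Assumptions (not stated in the article).
MACS = 4096               # "4K MACs" read as 4,096
OPS_PER_MAC = 2           # one multiply plus one add per MAC
LPDDR4_PIN_RATE_GBPS = 4.2  # per-pin ceiling quoted by Ferro

tops = MACS * OPS_PER_MAC * CLOCK_HZ / 1e12
sram_area_mm2 = SRAM_MB * SRAM_MM2_PER_MB
sram_fraction = sram_area_mm2 / DIE_MM2
dram_gb_s = LPDDR4_PIN_RATE_GBPS * DRAM_WIDTH_BITS / 8

print(f"Peak compute:        {tops:.2f} TOPS")
print(f"SRAM area:           {sram_area_mm2:.1f} mm^2 ({sram_fraction:.0%} of the die)")
print(f"Peak DRAM bandwidth: {dram_gb_s:.1f} GB/s")
```

Under these assumptions the peak compute comes out at about 7.6 TOPS, in line with the quoted 7.5 TOPS, and the on-chip SRAM accounts for roughly a fifth of the 54mm² die, which illustrates why adding more SRAM quickly becomes expensive.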

The Flex Logix chip will be used in edge AI inference applications that require real-time operation, including analyzing streaming video with low latency. This includes ADAS systems, analysis of security footage, medical imaging and quality assurance/inspection applications.

What kind of DRAM will go in alongside the InferX X1 in these applications?

“We think LPDDR will be the most popular: a single DRAM gives more than 10GB/sec of bandwidth… yet has enough bits to store the weights/intermediate activations,” said Tate. “Any other DRAM would require more chips and interfaces and more bits would need to be bought that aren’t used.”

Is there room for any emerging memory technologies here?

“The wafer cost goes up dramatically when using any emerging memory, whereas SRAM is ‘free,’ except for silicon area,” he added. “As economics change, the tipping point could change too, but it will be further down the road.”

Emerging memories

Despite the economies of scale, other memory types hold future possibilities for AI applications.

MRAM (magneto-resistive RAM) stores each bit of data via the orientation of magnets controlled by an applied electrical voltage. If the voltage is lower than required to flip the bit, there is only a probability a bit will flip. This randomness is unwanted, so MRAM is driven with higher voltages to prevent it. Still, some AI applications can take advantage of this inherent stochasticity (which can be thought of as the process of randomly selecting or generating data).

Experiments have applied MRAM’s stochasticity to Gyrfalcon’s devices using binarization, a technique whereby the precision of all the weights and activations is reduced to 1 bit. This is used to dramatically reduce compute and power requirements for far-edge applications. Trade-offs with accuracy are likely, depending on how the network is re-trained. In general, neural networks can be made to function reliably despite the reduced precision.

“Binarized neural networks are unique in that they can function reliably even as the certainty of a number being -1 or +1 is reduced,” said Andy Walker, product vice president at Spin Memory. “We have found that such BNNs can still function with high levels of accuracy as this certainty is reduced [by] introducing what is called ‘bit error rate’ of the memory bits being written incorrectly.”

MRAM can naturally introduce bit error rates in a controlled manner at low voltage levels, maintaining accuracy while lowering power requirements even further. The key is determining the optimum accuracy at the lowest voltage and shortest time. That translates into the highest energy efficiency, Walker said.
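A toy Python experiment gives a feel for why binarized networks tolerate a modest bit error rate. It is a minimal sketch, not Spin Memory's method: it models a single binarized neuron with random weights and inputs of -1 or +1, flips each stored weight with a given probability to mimic write errors, and counts how often the neuron's output sign is unchanged. The function names, the 256-input neuron size and the trial count are all hypothetical choices made for illustration.

```python
import random

def sign(x: int) -> int:
    """Binarize a pre-activation value to -1 or +1."""
    return 1 if x >= 0 else -1

def binary_neuron(weights, inputs) -> int:
    """Output of one binarized neuron: sign of the dot product of +/-1 values."""
    return sign(sum(w * x for w, x in zip(weights, inputs)))

def write_with_errors(weights, ber, rng):
    """Mimic a low-voltage write: each weight bit is stored inverted with probability `ber`."""
    return [-w if rng.random() < ber else w for w in weights]

def sign_agreement(ber, n_inputs=256, trials=1000, seed=0):
    """Fraction of trials in which bit errors leave the neuron's output unchanged."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(trials):
        weights = [rng.choice((-1, 1)) for _ in range(n_inputs)]
        inputs = [rng.choice((-1, 1)) for _ in range(n_inputs)]
        noisy = write_with_errors(weights, ber, rng)
        if binary_neuron(weights, inputs) == binary_neuron(noisy, inputs):
            agree += 1
    return agree / trials

for ber in (0.0, 0.01, 0.3):
    print(f"BER {ber:.2f}: output sign preserved in {sign_agreement(ber):.1%} of trials")
```

Because each output depends on the sum of many 1-bit products, a few flipped weights rarely change its sign, while a very high error rate does. Random inputs are a pessimistic case; a trained network would typically have larger margins, and its end-to-end accuracy would need to be measured on real data.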

While this technique also applies to higher precision neural networks, it’s especially suited to BNNs because the MRAM cell has two states, which matches the binary states in a BNN.

Using MRAM at the edge is another potential application, according to Walker.

“For edge AI, MRAM has the ability to run at lower voltages in applications where high-performance accuracy isn’t a requirement, but improvements in energy efficiency and memory endurance are very important,” he said. “In addition, MRAM’s inherent nonvolatility allows for data conservation without power.”

One application is as a so-called unified memory “where this emerging memory can act as both an embedded flash and SRAM replacement, saving area on the die and avoiding the static power dissipation inherent in SRAM.”

While Spin Memory’s MRAM is on the verge of commercial adoption, specific implementation of the BNN would work best on a variant of the basic MRAM cell. Hence, it remains at the research stage.

Neuromorphic ReRAM

Another emerging memory for edge AI applications is ReRAM. Recent research by Politecnico di Milano using Weebit Nano’s silicon oxide (SiOx) ReRAM technology showed promise for neuromorphic computing. ReRAM added a dimension of plasticity to neural network hardware; that is, it could evolve as conditions change, a useful quality in neuromorphic computing.

Current neural networks can’t learn without forgetting tasks they’ve been trained on, while the brain can do this quite easily. In AI terms, this is “unsupervised learning,” where the algorithm performs inference on datasets without labels, looking for its own patterns in data. The eventual result could be ReRAM-enabled edge AI systems that can learn new tasks in-situ and adapt to the environment around them.

Overall, memory makers are introducing technologies offering the speed and bandwidth required for AI applications. Various memories, whether on the same chip as the AI compute, in the same package, or on separate modules, are available to suit many edge AI applications.

While the exact nature of memory systems for edge AI depends on the application, GDDR, HBM and Optane are proving popular for data centers, while LPDDR competes with on-chip SRAM for endpoint applications.

Emerging memories are lending their novel properties to research designed to advance neural networks beyond the capabilities of today’s hardware to enable future power-efficient, brain-inspired systems.

>> This article was originally published on our sister site, EE Times.

