Six Ways To Advance Modern Architecture For AI Systems

These days, many engineering teams are running into a common problem: put simply, the models are too big. The problem shows up in various forms, but there’s often a connecting thread to the challenges.
Projects are running up against memory constraints. As parameter counts climb into the billions and trillions, data centers have to keep up. Stakeholders have to watch for thresholds in vendor services. And cost is almost always an issue.
However, there are new technologies on the horizon that can shrink that memory footprint and compute burden down to something more manageable.
How are today’s innovators doing this?
Let’s take a look.
Input and Data Compression
First of all, there is the compression of inputs.
You can apply a lossy compression algorithm to the model itself, and even run the compressed model against the full one to compare results; compression methods are saving real space in specialized neural network workloads.
Here’s a snippet from a paper posted at Apple’s Machine Learning Research resource:
“Recently, several works have shown significant success in training-free and data-free compression (pruning and quantization) of LLMs achieving 50-60% sparsity and reducing the bit-width down to 3 or 4 bits per weight, with negligible perplexity degradation over the uncompressed baseline.”
That’s one example of how this can work.
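To make the quantization side of this concrete, here is a minimal sketch of post-training, round-to-nearest quantization of a weight matrix down to 16 levels (4 bits), using plain numpy. It is illustrative only; production methods like those described in the quote above use calibration, per-channel scales, and careful bit-packing.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray):
    """Toy per-tensor 4-bit quantization: map float weights to 16 integer levels."""
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / 15.0          # 16 levels -> codes 0..15
    q = np.round((weights - w_min) / scale).astype(np.uint8)
    return q, scale, w_min

def dequantize_4bit(q, scale, w_min):
    """Reconstruct approximate float weights from the 4-bit codes."""
    return q.astype(np.float32) * scale + w_min

# Usage: compress a random "layer" and measure reconstruction error.
w = np.random.randn(256, 256).astype(np.float32)
q, scale, w_min = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale, w_min)
print("mean abs error:", np.abs(w - w_hat).mean())
```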
This Microsoft document looks at prompt compression, another angle on shrinking or reducing the data these systems have to handle.
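As a rough illustration of the idea behind prompt compression, here is a toy sketch that drops the most repetitive (least informative) words from a prompt until a target fraction remains. Real systems, such as the approach the Microsoft document describes, score tokens with a small language model rather than raw word frequency; the function name and heuristic here are assumptions for illustration only.

```python
import re
from collections import Counter

def compress_prompt(prompt: str, keep_ratio: float = 0.6) -> str:
    """Toy prompt compression: drop the most common (lowest-information) words
    until roughly keep_ratio of the tokens remain."""
    tokens = re.findall(r"\S+", prompt)
    counts = Counter(t.lower() for t in tokens)
    # Rank token positions: rarer words are treated as more informative.
    ranked = sorted(range(len(tokens)), key=lambda i: counts[tokens[i].lower()])
    keep = set(ranked[: int(len(tokens) * keep_ratio)])
    return " ".join(t for i, t in enumerate(tokens) if i in keep)

print(compress_prompt(
    "Please summarize the quarterly report and the quarterly figures for the board", 0.6))
```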
The Sparsity Approach: Focus and Variation
Sometimes you can carve away part of the system design, in order to save resources.
Think about a model where all of the attention areas are treated the same way. But maybe part of the input is basically white space, while the rest is complex and relevant. Should the model’s coverage really be homogenous, one-size-fits-all? If it is, you’re spending the same amount of compute on high-attention and low-attention areas.
Alternatively, the people engineering these systems can prune the tokens that don’t receive much attention, keeping what’s important and dropping what isn’t.
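Here is a toy sketch of that kind of token pruning: score each token by how much attention it receives and keep only the top fraction. The tensors are random stand-ins, and real sparse-attention and token-dropping methods are more involved; this only shows the shape of the idea.

```python
import numpy as np

def prune_tokens(hidden: np.ndarray, attn: np.ndarray, keep_ratio: float = 0.5):
    """Toy token pruning: keep the tokens that receive the most total attention.
    hidden: (seq_len, dim) token representations
    attn:   (seq_len, seq_len) attention weights (rows attend to columns)
    """
    received = attn.sum(axis=0)                  # attention each token receives
    k = max(1, int(len(received) * keep_ratio))
    keep = np.sort(np.argsort(received)[-k:])    # top-k tokens, original order
    return hidden[keep], keep

# Usage with random stand-in tensors.
seq_len, dim = 8, 16
hidden = np.random.randn(seq_len, dim)
attn = np.random.rand(seq_len, seq_len)
attn /= attn.sum(axis=1, keepdims=True)          # normalize rows, like softmax output
pruned, kept = prune_tokens(hidden, attn, keep_ratio=0.5)
print("kept token indices:", kept)
```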
In this part of the effort, you’re seeing hardware advances as well. More specialized GPUs and multicore processors can have an advantage with this kind of differentiation, so keep an eye on what chipmakers are doing to usher in a whole new class of GPU gear.
Changing Context Strings
Another major contributor to system size is the context window.
A typical large language model operates on a sequence, and the length of that sequence matters. More context enables certain kinds of functionality, but it also requires more resources.
By changing the context, you change the ‘appetite’ of the system. Here’s a bit from the above resource on prompt compression:
“While longer prompts hold considerable potential, they also introduce a host of issues, such as the need to exceed the chat window’s maximum limit, a reduced capacity for retaining contextual information, and an increase in API costs, both in monetary terms and computational resources.”
Directly after that, the authors go into solutions that could, in theory, apply broadly to different kinds of systems.
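One simple way to keep that “appetite” in check is a sliding-window policy: always keep the instruction, then fill the remaining budget with the most recent turns. The sketch below uses whitespace word counts as a stand-in for real tokenization, and the function name is made up for illustration.

```python
def fit_context(system_prompt: str, turns: list[str], max_tokens: int = 512) -> str:
    """Toy sliding-window context management: always keep the system prompt,
    then add the most recent conversation turns until the budget is spent.
    Token counts are approximated with whitespace splitting."""
    budget = max_tokens - len(system_prompt.split())
    kept: list[str] = []
    for turn in reversed(turns):                 # newest turns first
        cost = len(turn.split())
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    return "\n".join([system_prompt] + list(reversed(kept)))

history = [f"turn {i}: " + "word " * 50 for i in range(30)]
print(fit_context("You are a helpful assistant.", history, max_tokens=512)[:200])
```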
Dynamic Models and Strong Inference
Here are two more big trends right now. One is the emergence of strong inference systems, where the machine refines what it does over time based on its past experience. The other is dynamic models, where the weights and other internals change over time rather than staying fixed.
Both of these show promise for matching the design and engineering needs that people have when they’re building these systems.
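One way the “dynamic” idea shows up in practice is early exiting, where easy inputs stop at a shallow layer and hard ones use the full depth. The sketch below is a toy numpy version with random stand-in weights; it is one possible mechanism sketched under those assumptions, not a description of any specific product.

```python
import numpy as np

def early_exit_forward(x: np.ndarray, layers, confidence: float = 0.9):
    """Toy dynamic-computation sketch: run layers one at a time and stop as soon
    as an intermediate classifier is confident, spending less compute on easy inputs."""
    for depth, (weight, classifier) in enumerate(layers, start=1):
        x = np.tanh(x @ weight)                     # one "layer" of the network
        logits = x @ classifier
        probs = np.exp(logits) / np.exp(logits).sum()
        if probs.max() >= confidence:               # confident enough -> exit early
            return probs, depth
    return probs, depth                             # fell through: used every layer

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(32, 32)), rng.normal(size=(32, 4))) for _ in range(6)]
probs, used = early_exit_forward(rng.normal(size=32), layers)
print(f"exited after {used} of {len(layers)} layers")
```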
There’s also the diffusion model where you add noise, analyze, and remove that noise to come up with a new generative result. We talked about this last week in a post about the best ways to pursue AI.
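For the diffusion loop itself, here is a toy one-dimensional version of the add-noise/remove-noise cycle. A real diffusion model learns the denoiser as a neural network; a simple moving-average filter stands in for it here, purely to show the iterative structure.

```python
import numpy as np

rng = np.random.default_rng(42)
signal = np.sin(np.linspace(0, 2 * np.pi, 64))       # stand-in for "clean data"

# Forward process: add Gaussian noise.
noisy = signal + rng.normal(scale=1.0, size=signal.shape)

# Reverse process: a learned denoiser in a real model; a smoothing filter here.
def denoise_step(x: np.ndarray, strength: float = 0.5) -> np.ndarray:
    smoothed = np.convolve(x, np.ones(5) / 5, mode="same")
    return (1 - strength) * x + strength * smoothed

x = noisy
for _ in range(20):                                   # iteratively remove noise
    x = denoise_step(x)

print("error before:", np.abs(noisy - signal).mean())
print("error after: ", np.abs(x - signal).mean())
```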
Last, but not least, we can re-evaluate established approaches such as digital twinning. Twinning is great for precise simulation, but it takes a lot of resources; if there’s a lighter-weight way to do the same job, you might be able to save a lot of compute.
These are just some of the solutions we’ve been hearing about, and they dovetail with the idea of edge computing, where more of the work happens on an endpoint device at the edge of the network. Microcontrollers and other small components offer a new way to crunch data without sending it through the cloud to some centralized location.
Keep all of these advances in mind as we watch what people are doing with AI these days.