Perhaps in the earlier days of AI/ML, you were a little curious about what the limiting factors would be in these new technologies.
One potential limit was cost, but the price of compute has moved consistently lower as powerful new LLMs come along. Another would be data center capacity, but in the U.S. and elsewhere, we’re building data centers like there’s no tomorrow.
Yes, but what about the core asset of these systems? What about the data?
The concept of data scarcity is not new to many engineers. It’s the idea that you simply don’t have enough high-quality data for systems to operate knowledgeably. In other words, the AI is flying blind, because it doesn’t have enough data points to work at a granular level.
Here’s how some experts characterize data scarcity, from Midhat Tilawat at All About AI:
“Data scarcity in AI refers to the insufficient availability of high-quality training data, hindering the development of effective machine learning models and leading to reduced AI performance.”
In machine learning, practitioners have long talked about the “curse of dimensionality” and the problems of underfitting and overfitting, concerns that come at this same issue from a different angle.
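To make that connection concrete, here is a minimal sketch (my own illustration, not anything from the sources above) of how scarce data shows up as overfitting: a flexible model fit on a handful of points memorizes noise and generalizes poorly, while the same model trained on more data does not.

```python
# A minimal sketch of data scarcity as overfitting, assuming numpy
# and scikit-learn are installed. The degree-15 polynomial is a
# deliberately flexible model; the data and settings are toy choices.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def noisy_samples(n):
    # Noisy observations of a sine wave stand in for "real" data.
    x = rng.uniform(0, 1, size=(n, 1))
    y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, size=n)
    return x, y

x_test, y_test = noisy_samples(500)  # plenty of held-out data

for n_train in (10, 200):
    x_train, y_train = noisy_samples(n_train)
    model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
    model.fit(x_train, y_train)
    print(
        f"n={n_train:4d}  train MSE={mean_squared_error(y_train, model.predict(x_train)):.3f}"
        f"  test MSE={mean_squared_error(y_test, model.predict(x_test)):.3f}"
    )
```

With only 10 training points, the train error is near zero while the test error blows up; with 200 points, the gap closes. That gap is the practical face of “not enough data.”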
Data Scarcity in the AI Community
Now: are we running out of data? You can see people arguing about this on the Internet. There may be a lack of quality data in some domains, but does that mean that the data doesn’t exist, or that it’s simply not accessible?
To some, the lack of data means that we’ve drained the well.
“The Internet is a vast ocean of human knowledge, but it isn’t infinite,” wrote Nicola Jones for Nature at the end of last year. “And artificial intelligence (AI) researchers have nearly sucked it dry.”
You can see notable figures echoing this sentiment. Copilot cites this Opentools article, suggesting that Dario Amodei has “expressed concerns” about this eventuality, while the news on Sam Altman indicates he might be more worried about compute.
In any case, others believe that we are nowhere near running out of data, and that we just have to make better use of the data we already have.
Open and Closed Systems, and Proprietary Data
A recent segment from Imagination in Action at our Stanford conference in September took on these questions and more, with Marcie Vu of Greycroft and Ari Morcos of Datology talking to Julie Choi of Cerebras about the logistics of enterprise AI.
In the intro, Vu talked about moving the bar for founders higher, leveraging distilled models for collaboration, and deciding whether to build your own model, or use one from a vendor.
The conversation later moved toward the marginal cost of compute, and eventually, to the idea of closed versus open models.
“Two years ago, I think there was this very widely held belief that the closed source models were going to be just so much better than any of the open source models, that there was no chance to compete,” Morcos said. “And I think related to that, there was this commonly held belief that the cost of training models to the frontier was going to get higher and higher and higher with every successive model.”
Is Open Source Competitive Now?
Morcos suggested that now, we’ve seen that open source is competitive, and that the billion-dollar closed models that people had predicted in the last decade haven’t turned out to be dominant.
However, he talked about a “frontier research problem” involving object storage as something that you don’t want a few companies to be in charge of.
“When you think about training a model, typically, the way people come into this is, they have a certain amount of budget,” he said. “I have (for example) $10 million in compute that I can spend on this model. I’m going to show it whatever data I can for that. I’m going to get performance out of that. Well, in between, you have a bunch of data sitting in storage on S3, too. I’m going to feed it into a model through a data loader.”
Here, he said, it’s important to think about how engineers and the people in charge would drive these processes, in relatively uncharted waters.
“There are hundreds of choices you would make,” he said. “These choices are around which data do you want to show to the model? Do you want to show all of it? Do you want to show some subset of it? How do you want to sequence that data? The order might matter.”
That, he added, will determine some fundamental things about how your model works.
“All these choices that you would make … have a dramatic orders-of-magnitude impact on how quickly your model learns, to what performance it will learn, and how big of a model you can train to that performance,” he said.
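As a rough illustration of the kinds of knobs Morcos is describing, which data to keep and in what order to show it, here is a toy sketch in Python. The quality and difficulty scores, the threshold, and the “easy-first” ordering are illustrative assumptions on my part, not Datology’s pipeline.

```python
# A toy data-curation loop: filter a pool of examples by a quality
# score, order the survivors easy-to-hard (a simple curriculum), and
# hand them to a batching loop that stands in for the data loader.
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class Example:
    text: str
    quality: float     # e.g. from a learned filter or heuristic (assumed)
    difficulty: float  # e.g. from a reference model's loss (assumed)

def select_and_order(pool: List[Example],
                     min_quality: float = 0.5,
                     curriculum: bool = True) -> List[Example]:
    """Keep only examples above a quality threshold, then optionally
    sort them easy-to-hard."""
    kept = [ex for ex in pool if ex.quality >= min_quality]
    if curriculum:
        kept.sort(key=lambda ex: ex.difficulty)
    return kept

def batches(examples: List[Example], batch_size: int) -> Iterator[List[Example]]:
    """A toy stand-in for the data loader feeding the trainer."""
    for i in range(0, len(examples), batch_size):
        yield examples[i:i + batch_size]

# The same raw pool can produce very different training runs depending
# on these knobs, which is the point of the quote above.
pool = [Example(f"doc {i}", quality=(i % 10) / 10, difficulty=i % 7) for i in range(100)]
for batch in batches(select_and_order(pool), batch_size=16):
    pass  # trainer.step(batch) would go here
```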
As for data, both panelists talked about data beyond the Internet, proprietary data, and how to avoid hitting a data wall.
Morcos said there’s a lot of “juice” you can get out of existing data, just by working with it in different ways, and then there’s synthetic data as well.
“I really believe we’re going to be in a world where every enterprise will be able to train its own models for a million dollars, which is really not very much, and be able to access its proprietary data in this really critical and important way,” he said. “You hear the story a lot. ‘We’ve run out of data. We mined the Internet. The Internet is done.’ Well, first of all, the internet is a very, very small minority of the total data that exists in the world.”
His company, he noted, is finding ways to assist clients in this sort of process.
“The vast majority of data in the world is proprietary, and sitting on company servers, and we want to help companies actually unlock the ability to access that data and get value out of it,” he added. “(Also,) the data wall really only matters if we’re using our existing data sets to maximum utility, and we are very, very, very far from that.”
Vu agreed, and talked about a model that her own company is pursuing.
“We’re very early days in terms of being able to unlock the data that we have, particularly within the enterprise,” she said, sharing a strategy that involves casting a wider net. “We do actually spend time investing in businesses that maybe are not AI-first or AI-native, but we call it ‘AI-accelerated.’”
AI Jargon
I also learned some terms here, listening to these two talk about data use. “Benchmaxing,” for one, is when systems do well on benchmarks, but not in the real world. Morcos suggested this can be due to too much synthetic data.
“You ask a model, ‘hey, produce some data points about biology’ or whatever, and you ask it to produce this,” he said. “In this case, all the information is coming from the model itself. And this means that you can only ever teach a model something that the synthetic data generating model already understood. So in this way, you view this form of synthetic data as model distillation in disguise, via that synthetic data. This is what we see most commonly used.”
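In code terms, the point is that every answer in such a synthetic dataset comes from the generating model itself. Here is a minimal sketch with a toy stand-in for the teacher; nothing below is a real API, just an illustration of the distillation-in-disguise idea.

```python
# Build (prompt, answer) training pairs entirely from a "teacher" model.
# Whatever student is later trained on these pairs is bounded by the
# teacher's knowledge: the dataset contains no information beyond what
# the generating model already encoded.
from typing import Callable, List, Tuple

def make_synthetic_pairs(prompts: List[str],
                         generate_answer: Callable[[str], str]) -> List[Tuple[str, str]]:
    return [(p, generate_answer(p)) for p in prompts]

def toy_teacher(prompt: str) -> str:
    # Stands in for an LLM; this one "knows" only two facts.
    facts = {"capital of France?": "Paris", "2+2?": "4"}
    return facts.get(prompt, "I don't know")

pairs = make_synthetic_pairs(
    ["capital of France?", "2+2?", "capital of Peru?"], toy_teacher
)
print(pairs)  # the student never gets "Lima": the teacher's gaps become the student's gaps
```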
Then there’s something called “rephrasing,” where companies take existing data and put it into new formats in order to feed AI better. Discussing this in detail, Morcos talked about how companies work on rephrasing, and what’s important as you move through this process.
First, he said, you have to identify the data.
Then small models can go to work manipulating these data points for a fresh approach.
“We built a system that can now be applied to companies’ proprietary data,” he explained.
“So rather than that just being a synthetic data set, you can now, as a company, feed in your own data and have that be augmented and rephrased at scale, in a really effective way. And we can do that for pretty low cost.”
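For a sense of what a rephrasing pipeline could look like, here is a minimal sketch in Python. The output styles, the prompt wording, and small_model() are placeholders I am assuming for illustration, not the company’s actual system.

```python
# Augment existing documents by rewriting them into new formats with a
# small model. The facts come from the original data; only the form
# changes, which is what distinguishes rephrasing from purely
# model-generated synthetic data.
from typing import Callable, List

STYLES = ["as a question-and-answer pair",
          "as a concise summary",
          "as step-by-step instructions"]

def rephrase_corpus(docs: List[str],
                    small_model: Callable[[str], str]) -> List[str]:
    """Each source document yields the original plus several rephrased variants."""
    augmented = []
    for doc in docs:
        augmented.append(doc)  # keep the original
        for style in STYLES:
            augmented.append(small_model(f"Rewrite the following {style}:\n{doc}"))
    return augmented

# Toy usage with a dummy "model" that just echoes the prompt; in
# practice this would be a call to a small LLM.
docs = ["Quarterly revenue rose 12% on higher cloud demand."]
print(rephrase_corpus(docs, small_model=lambda prompt: f"[rephrased] {prompt}"))
```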
Into the Future
Choi asked the panelists what they see in the future of AI.
Vu mentioned robotics and applied intelligence.
Morcos mentioned a sort of AI dynamism that’s going to move the goalposts.
“We’re going to move towards a world where all models are constantly being fine-tuned on incoming data,” he said.
The Predictions of Data Use
I thought this panel was quite good for helping us think through data limitations. If this analysis is correct, we will not hit a data wall anytime soon. We’ll figure out how to work with existing data points, as well as synthetic data, and expand the playground for AI beyond that potentially limiting factor.
And once again it will feel like the sky is the limit in AI applications and use cases.
As for open source and closed source models, we’ll have to watch whether companies embrace the vendor lock-in that comes with a closed system, or the community access that comes with open source designs.
Data privacy and security will be paramount.
So keep this in mind as exciting new AI systems roll out.