Why synthetic data is reshaping the future of AI training

Overview

As the demand for high-quality training data continues to surge, synthetic data is emerging as a game-changing tool in the world of AI development. But is it the silver bullet enterprises need, or a potential minefield of risks?

In this episode of Today in Tech, host Keith Shaw sits down with Alexius Wronka, CTO of Data and Growth at Invisible Technologies, to explore the advantages, limitations, and ethical challenges of using synthetic data to train large language models (LLMs) and enterprise AI systems.

Topics Covered:
* What exactly is synthetic data?
* Key benefits vs. human-generated data
* Use cases in healthcare, autonomous vehicles, and enterprise AI
* Dangers of model overfitting and data hallucination
* Synthetic content, explainability, and detection tools
* The Matrix analogy: Are we training AI inside simulations?

Whether you're a data scientist, CTO, or just curious about how AI models learn, this episode offers a deep dive into one of the most critical and misunderstood trends in machine learning.

Don't forget to like, comment, and subscribe for more episodes of Today in Tech!

#SyntheticData #AITraining #InvisibleTechnologies #AlexiusWronka #TodayInTech #KeithShaw #EnterpriseAI #GenerativeAI #TechPodcast

Transcript

Keith Shaw: As the need for data for AI training continues to increase, many companies are turning to synthetic data to build their projects. But will this create new problems down the road?

We're going to discuss the pros and cons of using synthetic data on this episode of Today in Tech. Hi everybody. Welcome to Today in Tech. I'm Keith Shaw. Joining me on the show today is Alexius Wronka. He is the CTO of Data and Growth at Invisible Technologies.

Welcome to the show, Alexius.

Alexius Wronka: Thanks, Keith. Happy to be here.

Keith: I love the company name, Invisible Technologies. It's just got this cool factor to it. You must be thrilled working for them.

Alexius: It's great. It's the best word.

Keith: It really is a cool name. Before we begin, I want to establish what the general consensus is around the definition of synthetic data. As we get into the topic, I think some things I consider synthetic data might actually be synthetic content or output, especially with generative AI.

Anything from a generative AI prompt could be considered synthetic content. But I want to hone in on what we mean by synthetic data, or how you and your team generally define it.

Alexius: I think the key here is that virtually any content or output, like you described, could be classified as data. In the context of the work Invisible does with AI training, we're thinking about data used for training and improving large language models.

So if there's human-generated data and then computer-generated data, synthetic data sits in that second bucket: data created by computers in some capacity.

Keith: So when we talk about it, why are companies looking to use synthetic data? What are the benefits? And is it okay to call it synthetic data versus "real" data, or should we just say human data?

Alexius: I'd call it human-created. It's still real data. The idea is, as model trainers reach the limit of publicly available data sources, and given how slow it is to create human data, we're trying to figure out how to create larger datasets faster.

That allows us to do different things with model training. Human data takes a long time to build. It's prone to errors. Synthetic data lets you create thousands, even millions, of data points in minutes. That's one of the major advantages.

Keith: When I think of synthetic data, I start imagining Excel spreadsheets, database lists: customer data, for example. We talk to companies that use synthetic data to create training data not based on real customers.

Alexius: You're thinking of dummy customers to build a test environment or new application without violating GDPR. That's a very naive use case for synthetic data.

The more sophisticated application is: how do we take known human data and modify it to help fine-tune models in a direction we want, especially when historical datasets don't exist?
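As a concrete version of that "dummy customer" case, here is a minimal sketch using the third-party Faker library (pip install faker); the field names are illustrative assumptions, not a standard schema.

```python
from faker import Faker

fake = Faker()

def synthetic_customer() -> dict:
    """Return one fabricated customer record; no real person is referenced."""
    return {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address(),
        "signup_date": fake.date_this_decade().isoformat(),
    }

if __name__ == "__main__":
    # Thousands of GDPR-safe records in seconds, which is the speed
    # advantage discussed above.
    for customer in (synthetic_customer() for _ in range(5)):
        print(customer)
```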

Keith: Why else are companies turning to synthetic data? Some stories I've read say there just isn't enough human-generated data to properly train AI. I covered autonomous vehicles a lot, and they talked about using simulated data to train the systems because they lacked enough real-world sensor or LiDAR data.

Alexius: It really comes down to the fact that there just isn't enough data. That's the main reason for synthetic data. When they built AlphaGo, they used synthetic data because not enough games had been played.

They built two models to go against each other in an adversarial system to generate synthetic datasets. What's interesting about synthetic data in adversarial systems is that they can generate novel datasets that help fine-tune and improve your model over time.
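A toy version of that self-play idea, for illustration: two random-policy agents play tic-tac-toe against each other, and every finished game is logged as synthetic training records. The game, policies, and record format are stand-ins, not AlphaGo's actual pipeline.

```python
import random

WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
        (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in WINS:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def self_play_game():
    board, player, moves = ["."] * 9, "X", []
    while winner(board) is None and "." in board:
        move = random.choice([i for i, s in enumerate(board) if s == "."])
        moves.append(("".join(board), player, move))
        board[move] = player
        player = "O" if player == "X" else "X"
    outcome = winner(board) or "draw"
    # Each (state, player, move, outcome) tuple is one synthetic data point.
    return [(state, p, m, outcome) for state, p, m in moves]

dataset = [rec for _ in range(1000) for rec in self_play_game()]
print(f"{len(dataset)} synthetic positions generated in one pass")
```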

Keith: So the benefit is that you get the data needed to train the models, right?

Alexius: Exactly, and you get it much faster. You can iterate more quickly. Creating human datasets might take days, weeks, or even months to get a large enough corpus. Then it goes into the model, and everyone crosses their fingers hoping it improves.

Keith: And synthetic datasets are probably cleaner than human-generated ones, right? No bad, unclear, or unstructured data; you can structure it right from the start.

Alexius: Yeah. One of the biggest challenges in creating human datasets is orchestrating the effort across many people to complete the same task with the same output. It's full of errors. People have to create and QA the right thing. With synthetic data, it just follows a process.

It algorithmically gets there, and the output is consistent 99.9% of the time.
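As a small illustration of that "follows a process" point: every example below comes from one fixed template and is checked against a mechanical rule, so the format is consistent by construction. The template, task label, and validation check are all hypothetical.

```python
import random

TEMPLATE = "Convert {amount} {src} to {dst}."
CURRENCIES = ["USD", "EUR", "JPY", "GBP"]

def make_example(rng: random.Random) -> dict:
    src, dst = rng.sample(CURRENCIES, 2)
    prompt = TEMPLATE.format(amount=rng.randint(1, 999), src=src, dst=dst)
    return {"prompt": prompt, "task": "currency_conversion"}

def valid(example: dict) -> bool:
    # The automated stand-in for the QA pass humans would otherwise do.
    return set(example) == {"prompt", "task"} and example["prompt"].endswith(".")

rng = random.Random(0)
batch = [make_example(rng) for _ in range(10_000)]
print(f"{sum(valid(e) for e in batch) / len(batch):.1%} of examples pass QA")
```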

Keith: On the other end of the scale, since these things aren't perfect, what concerns do companies raise with you about synthetic data? What could be troublesome?

Alexius: From a model training perspective, the biggest challenge with synthetic data is: does it give us anything new or interesting that actually improves the model? If it's built from data that already exists in the model, the concern is you're just overfitting the model to the synthetic data.

That's why adversarial systems are useful: they help generate novel datasets. A lot of companies now ask: what's the right mix of human and synthetic data?

How much human data do I need to create novel situations that guide models in different directions, while also having a large enough dataset to train the model?

The way I talk about it is: when building a new or state-of-the-art model, you want to get it "up to the bar." Synthetic data gets you there quickly. Think about the distillation techniques DeepSeek used: they took data from other models and used those high-quality outputs to reach a strong baseline.

But to get "above the bar," you have to return to human data, creating datasets that get you out of what I'd call the synthetic data moat.
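One way to picture that "right mix" question is a dataset builder that keeps all the human data and caps the synthetic share at a configurable fraction of the final set. A minimal sketch; the 30% default is an arbitrary illustration, not a figure from the episode.

```python
import random

def build_training_set(human, synthetic, synthetic_fraction=0.3, seed=0):
    rng = random.Random(seed)
    # Solve s / (len(human) + s) = synthetic_fraction for s.
    n_synth = int(len(human) * synthetic_fraction / (1 - synthetic_fraction))
    mixed = list(human) + rng.sample(list(synthetic), min(n_synth, len(synthetic)))
    rng.shuffle(mixed)
    return mixed

corpus = build_training_set(human=["h1", "h2", "h3", "h4", "h5", "h6", "h7"],
                            synthetic=["s1", "s2", "s3", "s4", "s5"])
print(corpus)  # all human examples plus a capped synthetic share
```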

Keith: When I think of synthetic and human data, I start picturing a baking analogy, like oil and water. Can you mix them to create something new? Or do they always separate? Or is that the worst analogy ever?

Alexius: If you took one piece of synthetic data and one piece of human-created data, the untrained eye wouldn't know the difference.

There are tools online where you can input a model-generated output and ask, "Did AI create this?" Sometimes they're accurate; sometimes they misidentify human content as AI-generated, and vice versa. So I'd say: it's all water.

Keith: So not oil and water, but more like adding Kool-Aid to water? You get different water?

Alexius: I'd say it's more like using two types of milk, dairy milk and oat milk, and seeing how each changes a recipe. If you're making pancakes, cow's milk gives one flavor, oat milk another. Both act as liquids and binding agents; either way, it's still milk.

Keith: I love that we're using baking analogies. One thing I'm seeing is companies using AI models with internal data for more accurate results: Retrieval-Augmented Generation, or RAG.

Do they still have use cases for synthetic data in these situations, or are they moving from a big-picture model to a narrower focus?

Alexius: That's a great question. A lot of model builders are moving away from common public languages like Python or JavaScript, where there's lots of data. Now they're working on enterprise apps built in Java, C++, or other lower-level languages, where there's less data.

They might only have two or three large internal codebases. So how do they get more examples? That's a great opportunity to use synthetic data: create versions of that code to build more instances.

Then you might have human annotators label that data and feed it back into the model, so it better understands your specific codebase.
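One mechanical way to "create versions of that code" is to rewrite identifiers with an AST transform, so each variant is the same program in a different surface form. A minimal Python sketch; the snippet and renaming scheme are illustrative, and real pipelines use far richer transformations.

```python
import ast
import builtins

class RenameVariables(ast.NodeTransformer):
    """Rename every non-builtin identifier to produce a surface variant."""
    def __init__(self, suffix: str):
        self.suffix = suffix

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id not in dir(builtins):  # leave print, len, etc. intact
            node.id = f"{node.id}_{self.suffix}"
        return node

SOURCE = "total = price * qty\nprint(total)"

def make_variant(source: str, suffix: str) -> str:
    tree = ast.parse(source)
    tree = RenameVariables(suffix).visit(tree)
    return ast.unparse(tree)  # ast.unparse needs Python 3.9+

for i in range(3):
    print(make_variant(SOURCE, f"v{i}"), end="\n---\n")
```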

Keith: I missed this earlier: does the use of synthetic data happen more in specific industries or company departments? Where is it used most?

Alexius: We touched on this. Self-driving is a big one; there just isn't enough data. Healthcare is another. These are industries very focused on safety or heavy regulation. They need to work out all the bugs before deploying AI in production.

You can't put a car on the road unless you're 99% sure it won't crash.

Keith: In healthcare, is this where they use AI to detect tumors early, where there's not enough real data?

Alexius: Maybe not tumors just yet. But think about care plans. You can create different patient scenarios with various diseases or symptoms, then fine-tune a model on that data to generate care plans. That's a great use case. There are HIPAA laws, so using real data is tough.

Synthetic data can create realistic scenarios for model training.

Keith: Got it.
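A minimal sketch of that care-plan idea, assuming a placeholder table of conditions and a JSONL output file: every record is fabricated, so no real patient (and no HIPAA-protected data) is involved, and the empty care_plan field would be filled downstream, for example under clinician review.

```python
import json
import random

CONDITIONS = {
    "type 2 diabetes": ["fatigue", "increased thirst"],
    "hypertension": ["headache", "dizziness"],
}

def synthetic_patient(rng: random.Random) -> dict:
    """Fabricate one patient scenario; no real PHI is referenced."""
    condition, symptoms = rng.choice(list(CONDITIONS.items()))
    return {
        "age": rng.randint(18, 90),
        "condition": condition,
        "symptoms": rng.sample(symptoms, k=rng.randint(1, len(symptoms))),
        "care_plan": None,  # filled in downstream, e.g. by clinician review
    }

rng = random.Random(42)
with open("synthetic_patients.jsonl", "w") as f:
    for _ in range(1000):
        f.write(json.dumps(synthetic_patient(rng)) + "\n")
```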

One fear I hear, and this might be more about synthetic content, is that results might be inaccurate. Is it like the "copy of a copy" problem, where synthetic data gets tweaked and re-fed into the system and degrades over time?

Alexius: That's a real concern: reinforced hallucinations. A hallucinated output gets into a core dataset, and if you retrain on that, it becomes baked into the system. That's why you need human annotators to look at the data, create new datasets, and catch those anomalies early.
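A sketch of that human review gate: synthetic records stay out of the retraining corpus until an annotator approves them, so a hallucinated output cannot silently get baked into the system. The record format and review callback here are hypothetical.

```python
def gate_for_retraining(records, approved_by_human):
    """Split candidate records into retrain-ready and quarantined sets."""
    approved, quarantined = [], []
    for record in records:
        (approved if approved_by_human(record) else quarantined).append(record)
    return approved, quarantined

# Example: a stand-in reviewer that rejects records flagged as anomalous.
candidates = [{"text": "plausible fact", "anomaly": False},
              {"text": "confident nonsense", "anomaly": True}]
ok, held = gate_for_retraining(candidates, lambda r: not r["anomaly"])
print(len(ok), "approved;", len(held), "sent back to annotators")
```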

Keith: This is slightly off-topic, but in a prep call with your colleague Adam, we talked about a story where linguists noticed AI using the word "delve" more and more, and then humans started using it because they saw it in AI outputs.

If synthetic data creates something not based on human content, but we start believing it, is that dangerous? Or am I going down a Matrix rabbit hole?

Alexius: It's a fascinating question. As humans, as biological machines, we adapt. When we see things repeatedly in our environment, we adjust and assume it's normal. People use ChatGPT to write emails, respond to Slack messages. Over time, their own language starts mirroring the AI's style.

Keith: Right, we read AI-generated outputs without realizing it, and start adopting the same words. "Delve" is the one they highlighted. I first thought it was "realm," but it's "delve." There's an NPR article about it.

Alexius: That's fascinating. Language evolves naturally. Think about "who" versus "whom." Hardly anyone uses "whom" correctly anymore. Language just moves on.

Keith: I started as a newspaper writer. Now people are using acronyms in texts. It feels like language is devolving, almost like we're back to hieroglyphics with emojis.

Alexius: How many people have you heard actually pronounce "LOL"?

Keith: Nobody does. Wait, people do?

Alexius: Yeah. I hang out with people who actually say "LOL" out loud.

Keith: I've heard both "LOL" and "L-O-L" instead of laughing. Go back 25 years, before texting, and nobody was doing that. But as things enter the environment, we adapt.

Keith: Let's jump back to my regularly scheduled questions. I think you mentioned this earlier: when it comes to synthetic data used in training, should it be identified? In the ChatGPT world, or with AI-generated images, it's considered good practice to disclose that they were created by AI.

Is there a similar conversation happening for synthetic training data? Or does it not matter as much?

Alexius: Anyone using data today wants to know its point of origin. If it's synthetic, they want to know that. As we said earlier, synthetic data has specific use cases, like improving models and getting them to baseline quality.

But if someone's not getting the results they expect, it's important to know whether the training data was synthetic or human. Transparency is incredibly important. And from my background, what matters most in data is where it comes from and how transparent the pipeline is between its origin and the user.

If there's a black box in the middle, that's when things fall apart.
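A sketch of that transparency point: every record carries its origin and the pipeline steps it passed through, so there is no black box between source and user. The field names and step labels are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class TrainingRecord:
    text: str
    origin: str                        # "human" or "synthetic"
    pipeline: list = field(default_factory=list)

    def stamp(self, step: str) -> "TrainingRecord":
        """Append a processing step so the record's lineage stays auditable."""
        self.pipeline.append(step)
        return self

rec = TrainingRecord("example prompt", origin="synthetic")
rec.stamp("generated:teacher-model").stamp("qa:human-reviewed")
print(rec)  # origin and lineage travel with the data point
```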

Keith: That leads me to explainability. It feels like the black box is a big issue. Ask a data scientist, "Why did the model make this decision?" and they usually say, "Well..."

Alexius: Exactly. It gets into the whole concept of chain-of-thought reasoning. Explainability is still incredibly hard. You can trace it back to model weights and say, "this is kind of why," but to truly pin down the probabilistic reason one word followed another, we don't have that capability yet.

You'd need to tie model weights and training directly to output, and that would be close to machine code, which most people can't read.

Keith: Does it become a chicken-or-egg scenario? Or am I just making weird logic leaps?

Alexius: No, I get what you're saying. We don't know why.

No matter how much we try to explain it, any explanation from the model is still the product of a probabilistic system we don't fully understand.

Even if a model gives reasoning for its answer, that reasoning is itself generated by another probabilistic process. To really explain it, you'd have to connect the training data and weights directly to outputs, and that would likely require understanding near-machine-level code.

Keith: With synthetic and human data, are there tools that detect if something was produced synthetically? I know you talked about labeling it, but can systems also detect it automatically?

Alexius: Absolutely.

Detection matters if we want students and learners to practice honestly rather than just cheat. It's also a huge part of model evaluation: understanding not just performance, but what kind of output a model will produce.

You want to be able to say: "Was this generated by a machine or a human?" In fraud detection and other enterprise use cases, this is a major investment. Enterprises need to know what's real and what's synthetic.
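As a toy version of those detectors, here is a tiny classifier that labels text as human- or machine-written. Real detectors are far more sophisticated and, as Alexius notes, still misfire in both directions; this requires scikit-learn, and the training texts and labels below are stand-ins, not real ground truth.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["ugh meeting ran long, grabbing lunch late",
         "In conclusion, it is important to delve into these key factors.",
         "lol did you see that game last night",
         "Furthermore, leveraging synergies remains a crucial consideration."]
labels = ["human", "ai", "human", "ai"]  # toy labels for illustration only

# Character of the text (word and bigram frequencies) drives the prediction.
detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(texts, labels)
print(detector.predict(["It is important to delve into this topic."]))
```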

Keith: Is this mostly something only data scientists need to care about, or should the entire enterprise understand synthetic data?

Alexius: I don't think we're quite there yet, but we're heading in that direction. Just like you train your workforce around phishing and cybersecurity, we'll need similar education around synthetic data.

You'll get emails from "the CEO" that look completely legit, maybe even from their email address, and you'll need tools to protect against that.

Keith: But how does that tie into synthetic data?

Alexius: That is synthetic data: malicious synthetic data. Bad actors are going to use these models. We need systems that can detect when data didn't come from a real human.

Keith: Have best practices been established yet for handling synthetic data?

Alexius: Still very early; very much the Wild West. Some regulated industries, like full self-driving or healthcare, are starting to build frameworks. But for most enterprises, this is still early days. Although I'd say it's less Wild West and more risk-averse. There's more caution than recklessness.

Companies aren't rushing tools to market without thinking through the risks.

Keith: One last analogy, Alexius: was The Matrix the ultimate example of synthetic data? A simulated 1990s Earth?

Alexius: Remember: nobody knows what chicken tastes like.

Keith: So if I wanted to explain synthetic data to someone, could I say, "It's like the world created by the machines in The Matrix"?

Alexius: Yeah, I think that's a great example. It's virtual reality.

Keith: Alexius, thanks again for chatting with us about this great topic.

Alexius: Thanks, Keith.

Keith: That's going to do it for this week's show. Be sure to like the video, subscribe to the channel, and leave your thoughts below if you're watching on YouTube.

Join us every week for new episodes of Today in Tech. I'm Keith Shaw. Thanks for watching.