Why synthetic data is reshaping the future of AI training

Overview

As the demand for high-quality training data continues to surge, synthetic data is emerging as a game-changing tool in the world of AI development. But is it the silver bullet enterprises need, or a potential minefield of risks?

In this episode of Today in Tech, host Keith Shaw sits down with Alexius Wronka, CTO of Data and Growth at Invisible Technologies, to explore the advantages, limitations, and ethical challenges of using synthetic data to train large language models (LLMs) and enterprise AI systems.

Topics Covered:
* What exactly is synthetic data?
* Key benefits vs. human-generated data
* Use cases in healthcare, autonomous vehicles, and enterprise AI
* Dangers of model overfitting and data hallucination
* Synthetic content, explainability, and detection tools
* The Matrix analogy: Are we training AI inside simulations?

Whether you're a data scientist, CTO, or just curious about how AI models learn, this episode offers a deep dive into one of the most critical and misunderstood trends in machine learning.

Don't forget to like, comment, and subscribe for more episodes of Today in Tech!

#SyntheticData #AITraining #InvisibleTechnologies #AlexiusWronka #TodayInTech #KeithShaw #EnterpriseAI #GenerativeAI #TechPodcast

Transcript

Keith Shaw: As the need for data for AI training continues to increase, many companies are turning to synthetic data to build their projects. But will this create new problems down the road?

We're going to discuss the pros and cons of using synthetic data on this episode of Today in Tech. Hi everybody. Welcome to Today in Tech. I'm Keith Shaw. Joining me on the show today is Alexius Wronka. He is the CTO of Data and Growth at Invisible Technologies.

Welcome to the show, Alexius.

Alexius Wronka: Thanks, Keith. Happy to be here.

Keith: I love the company name, Invisible Technologies. It's just got this cool factor to it. You must be thrilled working for them.

Alexius: It's great. It's the best word.

Keith: It really is a cool name. Before we begin, I want to establish what the general consensus is around the definition of synthetic data. As we get into the topic, I think some things I consider synthetic data might actually be synthetic content or output, especially with generative AI.

Anything from a generative AI prompt could be considered synthetic content. But I want to hone in on what we mean by synthetic data, or how you and your team generally define it.

Alexius: I think the key here is that virtually any content or output, like you described, could be classified as data. In the context of the work Invisible does with AI training, we're thinking about data used for training and improving large language models.

So if there's human-generated data and then computer-generated data, synthetic data sits in that second bucket: data created by computers in some capacity.

Keith: So when we talk about it, why are companies looking to use synthetic data? What are the benefits? And is it okay to call it synthetic data versus "real" data, or should we just say human data?

Alexius: I'd call it human-created. It's still real data. The idea is, as model trainers reach the limit of publicly available data sources, and given how slow it is to create human data, we're trying to figure out how to create larger datasets faster.

That allows us to do different things with model training. Human data takes a long time to build. It's prone to errors. Synthetic data lets you create thousands, even millions, of data points in minutes. That's one of the major advantages.

Keith: When I think of synthetic data, I start imagining Excel spreadsheets, database lists: customer data, for example. We talk to companies that use synthetic data to create training data not based on real customers.

Alexius: You're thinking of dummy customers to build a test environment or new application without violating GDPR. That's a very naive use case for synthetic data.

The more sophisticated application is: how do we take known human data and modify it to help fine-tune models in a direction we want, especially when historical datasets don't exist?
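As a concrete version of that "dummy customer" case, here is a minimal sketch using the third-party Faker library (pip install faker); the field names are illustrative assumptions, not a standard schema.

```python
from faker import Faker

fake = Faker()

def synthetic_customer() -> dict:
    """Return one fabricated customer record; no real person is referenced."""
    return {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address(),
        "signup_date": fake.date_this_decade().isoformat(),
    }

if __name__ == "__main__":
    # Thousands of GDPR-safe records in seconds, which is the speed
    # advantage discussed above.
    for customer in (synthetic_customer() for _ in range(5)):
        print(customer)
```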

Keith: Why else are companies turning to synthetic data? Some stories I've read say there just isn't enough human-generated data to properly train AI. I covered autonomous vehicles a lot, and they talked about using simulated data to train the systems because they lacked enough real-world sensor or LiDAR data.

Alexius: It really comes down to the fact that there just isn't enough data. That's the main reason for synthetic data. When they built AlphaGo, they used synthetic data because not enough games had been played.

They built two models to go against each other in an adversarial system to generate synthetic datasets. What's interesting about synthetic data in adversarial systems is that they can generate novel datasets that help fine-tune and improve your model over time.
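A toy version of that self-play idea, for illustration: two random-policy agents play tic-tac-toe against each other, and every finished game is logged as synthetic training records. The game, policies, and record format are stand-ins, not AlphaGo's actual pipeline.

```python
import random

WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
        (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in WINS:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def self_play_game():
    board, player, moves = ["."] * 9, "X", []
    while winner(board) is None and "." in board:
        move = random.choice([i for i, s in enumerate(board) if s == "."])
        moves.append(("".join(board), player, move))
        board[move] = player
        player = "O" if player == "X" else "X"
    outcome = winner(board) or "draw"
    # Each (state, player, move, outcome) tuple is one synthetic data point.
    return [(state, p, m, outcome) for state, p, m in moves]

dataset = [rec for _ in range(1000) for rec in self_play_game()]
print(f"{len(dataset)} synthetic positions generated in one pass")
```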

Keith: So the benefit is that you get the data needed to train the models, right?

Alexius: Exactly, and you get it much faster. You can iterate more quickly. Creating human datasets might take days, weeks, or even months to get a large enough corpus. Then it goes into the model, and everyone crosses their fingers hoping it improves.

Keith: And synthetic datasets are probably cleaner than human-generated ones, right? No bad, unclear, or unstructured data; you can structure it right from the start.

Alexius: Yeah. One of the biggest challenges in creating human datasets is orchestrating the effort across many people to complete the same task with the same output. It's full of errors. People have to create and QA the right thing. With synthetic data, it just follows a process.

It algorithmically gets there, and the output is consistent 99.9% of the time.
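As a small illustration of that "follows a process" point: every example below comes from one fixed template and is checked against a mechanical rule, so the format is consistent by construction. The template, task label, and validation check are all hypothetical.

```python
import random

TEMPLATE = "Convert {amount} {src} to {dst}."
CURRENCIES = ["USD", "EUR", "JPY", "GBP"]

def make_example(rng: random.Random) -> dict:
    src, dst = rng.sample(CURRENCIES, 2)
    prompt = TEMPLATE.format(amount=rng.randint(1, 999), src=src, dst=dst)
    return {"prompt": prompt, "task": "currency_conversion"}

def valid(example: dict) -> bool:
    # The automated stand-in for the QA pass humans would otherwise do.
    return set(example) == {"prompt", "task"} and example["prompt"].endswith(".")

rng = random.Random(0)
batch = [make_example(rng) for _ in range(10_000)]
print(f"{sum(valid(e) for e in batch) / len(batch):.1%} of examples pass QA")
```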

Keith: On the other end of the scale, since these things aren't perfect, what concerns do companies raise with you about synthetic data? What could be troublesome?

Alexius: From a model training perspective, the biggest challenge with synthetic data is: does it give us anything new or interesting that actually improves the model? If it's built from data that already exists in the model, the concern is you're just overfitting the model to the synthetic data.

That's why adversarial systems are useful: they help generate novel datasets. A lot of companies now ask: what's the right mix of human and synthetic data?

How much human data do I need to create novel situations that guide models in different directions, while also having a large enough dataset to train the model?

The way I talk about it is: when building a new or state-of-the-art model, you want to get it "up to the bar." Synthetic data gets you there quickly. Think about the distillation techniques DeepSeek used: they took data from other models and used those high-quality outputs to reach a strong baseline.

But to get "above the bar," you have to return to human data, creating datasets that get you out of what I'd call the synthetic data moat.
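One way to picture that "right mix" question is a dataset builder that keeps all the human data and caps the synthetic share at a configurable fraction of the final set. A minimal sketch; the 30% default is an arbitrary illustration, not a figure from the episode.

```python
import random

def build_training_set(human, synthetic, synthetic_fraction=0.3, seed=0):
    rng = random.Random(seed)
    # Solve s / (len(human) + s) = synthetic_fraction for s.
    n_synth = int(len(human) * synthetic_fraction / (1 - synthetic_fraction))
    mixed = list(human) + rng.sample(list(synthetic), min(n_synth, len(synthetic)))
    rng.shuffle(mixed)
    return mixed

corpus = build_training_set(human=["h1", "h2", "h3", "h4", "h5", "h6", "h7"],
                            synthetic=["s1", "s2", "s3", "s4", "s5"])
print(corpus)  # all human examples plus a capped synthetic share
```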

Keith: When I think of synthetic and human data, I start picturing a baking analogy, like oil and water. Can you mix them to create something new? Or do they always separate? Or is that the worst analogy ever?

Alexius: If you took one piece of synthetic data and one piece of human-created data, the untrained eye wouldn't know the difference.

There are tools online where you can input a model-generated output and ask, "Did AI create this?" Sometimes they're accurate; sometimes they misidentify human content as AI-generated, and vice versa. So I'd say: it's all water.

Keith: So not oil and water, but more like adding Kool-Aid to water? You get different water?

Alexius: I'd say it's more like using two types of milk, dairy milk and oat milk, and seeing how each changes a recipe. If you're making pancakes, cow's milk gives one flavor, oat milk another. Both act as liquids and binding agents; either way, it's still milk.

Keith: I love that we're using baking analogies. One thing I'm seeing is companies using AI models with internal data for more accurate results: Retrieval-Augmented Generation, or RAG.

Do they still have use cases for synthetic data in these situations, or are they moving from a big-picture model to a narrower focus?

Alexius: That's a great question. A lot of model builders are moving away from common public languages like Python or JavaScript, where there's lots of data. Now they're working on enterprise apps built in Java, C++, or other lower-level languages, where there's less data.

They might only have two or three large internal codebases. So how do they get more examples? That's a great opportunity to use synthetic data: create versions of that code to build more instances.

Then you might have human annotators label that data and feed it back into the model, so it better understands your specific codebase.
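One mechanical way to "create versions of that code" is to rewrite identifiers with an AST transform, so each variant is the same program in a different surface form. A minimal Python sketch; the snippet and renaming scheme are illustrative, and real pipelines use far richer transformations.

```python
import ast
import builtins

class RenameVariables(ast.NodeTransformer):
    """Rename every non-builtin identifier to produce a surface variant."""
    def __init__(self, suffix: str):
        self.suffix = suffix

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id not in dir(builtins):  # leave print, len, etc. intact
            node.id = f"{node.id}_{self.suffix}"
        return node

SOURCE = "total = price * qty\nprint(total)"

def make_variant(source: str, suffix: str) -> str:
    tree = ast.parse(source)
    tree = RenameVariables(suffix).visit(tree)
    return ast.unparse(tree)  # ast.unparse needs Python 3.9+

for i in range(3):
    print(make_variant(SOURCE, f"v{i}"), end="\n---\n")
```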

Keith: I missed this earlier: does the use of synthetic data happen more in specific industries or company departments? Where is it used most?

Alexius: We touched on this. Self-driving is a big one; there just isn't enough data. Healthcare is another. These are industries very focused on safety or heavy regulation. They need to work out all the bugs before deploying AI in production.

You can't put a car on the road unless you're 99% sure it won't crash.

Keith: In healthcare, is this where they use AI to detect tumors early, where there's not enough real data?

Alexius: Maybe not tumors just yet. But think about care plans. You can create different patient scenarios with various diseases or symptoms, then fine-tune a model on that data to generate care plans. That's a great use case. There are HIPAA laws, so using real data is tough.

Synthetic data can create realistic scenarios for model training.

Keith: Got it.
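A minimal sketch of that care-plan idea, assuming a placeholder table of conditions and a JSONL output file: every record is fabricated, so no real patient (and no HIPAA-protected data) is involved, and the empty care_plan field would be filled downstream, for example under clinician review.

```python
import json
import random

CONDITIONS = {
    "type 2 diabetes": ["fatigue", "increased thirst"],
    "hypertension": ["headache", "dizziness"],
}

def synthetic_patient(rng: random.Random) -> dict:
    """Fabricate one patient scenario; no real PHI is referenced."""
    condition, symptoms = rng.choice(list(CONDITIONS.items()))
    return {
        "age": rng.randint(18, 90),
        "condition": condition,
        "symptoms": rng.sample(symptoms, k=rng.randint(1, len(symptoms))),
        "care_plan": None,  # filled in downstream, e.g. by clinician review
    }

rng = random.Random(42)
with open("synthetic_patients.jsonl", "w") as f:
    for _ in range(1000):
        f.write(json.dumps(synthetic_patient(rng)) + "\n")
```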

One fear I hear, and this might be more about synthetic content, is that results might be inaccurate. Is it like the "copy of a copy" problem, where synthetic data gets tweaked and re-fed into the system and degrades over time?

Alexius: That's a real concern: reinforced hallucinations. A hallucinated output gets into a core dataset, and if you retrain on that, it becomes baked into the system. That's why you need human annotators to look at the data, create new datasets, and catch those anomalies early.
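A sketch of that human review gate: synthetic records stay out of the retraining corpus until an annotator approves them, so a hallucinated output cannot silently get baked into the system. The record format and review callback here are hypothetical.

```python
def gate_for_retraining(records, approved_by_human):
    """Split candidate records into retrain-ready and quarantined sets."""
    approved, quarantined = [], []
    for record in records:
        (approved if approved_by_human(record) else quarantined).append(record)
    return approved, quarantined

# Example: a stand-in reviewer that rejects records flagged as anomalous.
candidates = [{"text": "plausible fact", "anomaly": False},
              {"text": "confident nonsense", "anomaly": True}]
ok, held = gate_for_retraining(candidates, lambda r: not r["anomaly"])
print(len(ok), "approved;", len(held), "sent back to annotators")
```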

Keith: This is slightly off-topic, but in a prep call with your colleague Adam, we talked about a story where linguists noticed AI using the word "delve" more and more, and then humans started using it because they saw it in AI outputs.

If synthetic data creates something not based on human content, but we start believing it, is that dangerous? Or am I going down a Matrix rabbit hole?

Alexius: It's a fascinating question. As humans, as biological machines, we adapt. When we see things repeatedly in our environment, we adjust and assume it's normal. People use ChatGPT to write emails, respond to Slack messages. Over time, their own language starts mirroring the AI's style.

Keith: Right, we read AI-generated outputs without realizing it, and start adopting the same words. "Delve" is the one they highlighted. I first thought it was "realm," but it's "delve." There's an NPR article about it.

Alexius: That's fascinating. Language evolves naturally. Think about "who" versus "whom." Hardly anyone uses "whom" correctly anymore. Language just moves on.

Keith: I started as a newspaper writer. Now people are using acronyms in texts. It feels like language is devolving, almost like we're back to hieroglyphics with emojis.

Alexius: How many people have you heard actually pronounce "LOL"?

Keith: Nobody does. Wait, people do?

Alexius: Yeah. I hang out with people who actually say "LOL" out loud.

Keith: I've heard both "LOL" and "L-O-L" instead of laughing. Go back 25 years, before texting, and nobody was doing that. But as things enter the environment, we adapt.

Keith: Let's jump back to my regularly scheduled questions. I think you mentioned this earlier: when it comes to synthetic data used in training, should it be identified? In the ChatGPT world, or with AI-generated images, it's considered good practice to disclose that they were created by AI.

Is there a similar conversation happening for synthetic training data? Or does it not matter as much?

Alexius: Anyone using data today wants to know its point of origin. If it's synthetic, they want to know that. As we said earlier, synthetic data has specific use cases, like improving models and getting them to baseline quality.

But if someone's not getting the results they expect, it's important to know whether the training data was synthetic or human. Transparency is incredibly important. And from my background, what matters most in data is where it comes from and how transparent the pipeline is between its origin and the user.

If there's a black box in the middle, that's when things fall apart.
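A sketch of that transparency point: every record carries its origin and the pipeline steps it passed through, so there is no black box between source and user. The field names and step labels are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class TrainingRecord:
    text: str
    origin: str                        # "human" or "synthetic"
    pipeline: list = field(default_factory=list)

    def stamp(self, step: str) -> "TrainingRecord":
        """Append a processing step so the record's lineage stays auditable."""
        self.pipeline.append(step)
        return self

rec = TrainingRecord("example prompt", origin="synthetic")
rec.stamp("generated:teacher-model").stamp("qa:human-reviewed")
print(rec)  # origin and lineage travel with the data point
```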

Keith: That leads me to explainability. It feels like the black box is a big issue. Ask a data scientist, "Why did the model make this decision?" and they usually say, "Well..."

Alexius: Exactly. It gets into the whole concept of chain-of-thought reasoning. Explainability is still incredibly hard. You can trace it back to model weights and say, "this is kind of why," but to truly pin down the probabilistic reason one word followed another, we don't have that capability yet.

You'd need to tie model weights and training directly to output, and that would be close to machine code, which most people can't read.

Keith: Does it become a chicken-or-egg scenario? Or am I just making weird logic leaps?

Alexius: No, I get what you're saying. We don't know why.

No matter how much we try to explain it, any explanation from the model is still the product of a probabilistic system we don't fully understand.

Even if a model gives reasoning for its answer, that reasoning is itself generated by another probabilistic process. To really explain it, you'd have to connect the training data and weights directly to outputs, and that would likely require understanding near-machine-level code.

Keith: With synthetic and human data, are there tools that detect if something was produced synthetically? I know you talked about labeling it, but can systems also detect it automatically?

Alexius: Absolutely.

Detection matters if we want students and learners to practice honestly rather than just cheat. It's also a huge part of model evaluation: understanding not just performance, but what kind of output a model will produce.

You want to be able to say: "Was this generated by a machine or a human?" In fraud detection and other enterprise use cases, this is a major investment. Enterprises need to know what's real and what's synthetic.
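As a toy version of those detectors, here is a tiny classifier that labels text as human- or machine-written. Real detectors are far more sophisticated and, as Alexius notes, still misfire in both directions; this requires scikit-learn, and the training texts and labels below are stand-ins, not real ground truth.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["ugh meeting ran long, grabbing lunch late",
         "In conclusion, it is important to delve into these key factors.",
         "lol did you see that game last night",
         "Furthermore, leveraging synergies remains a crucial consideration."]
labels = ["human", "ai", "human", "ai"]  # toy labels for illustration only

# Character of the text (word and bigram frequencies) drives the prediction.
detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(texts, labels)
print(detector.predict(["It is important to delve into this topic."]))
```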

Keith: Is this mostly something only data scientists need to care about, or should the entire enterprise understand synthetic data?

Alexius: I don't think we're quite there yet, but we're heading in that direction. Just like you train your workforce around phishing and cybersecurity, we'll need similar education around synthetic data.

You'll get emails from "the CEO" that look completely legit, maybe even from their email address, and you'll need tools to protect against that.

Keith: But how does that tie into synthetic data?

Alexius: That is synthetic data: malicious synthetic data. Bad actors are going to use these models. We need systems that can detect when data didn't come from a real human.

Keith: Have best practices been established yet for handling synthetic data?

Alexius: Still very early; very much the Wild West. Some regulated industries, like full self-driving or healthcare, are starting to build frameworks. But for most enterprises, this is still early days. Although I'd say it's less Wild West and more risk-averse. There's more caution than recklessness.

Companies aren't rushing tools to market without thinking through the risks.

Keith: One last analogy, Alexius: was The Matrix the ultimate example of synthetic data? A simulated 1990s Earth?

Alexius: Remember: nobody knows what chicken tastes like.

Keith: So if I wanted to explain synthetic data to someone, could I say, "It's like the world created by the machines in The Matrix"?

Alexius: Yeah, I think that's a great example. It's virtual reality.

Keith: Alexius, thanks again for chatting with us about this great topic.

Alexius: Thanks, Keith.

Keith: That's going to do it for this week's show. Be sure to like the video, subscribe to the channel, and leave your thoughts below if you're watching on YouTube.

Join us every week for new episodes of Today in Tech. I'm Keith Shaw. Thanks for watching.