As the demand for high-quality training data continues to surge, synthetic data is emerging as a game-changing tool in AI development. But is it the silver bullet enterprises need, or a potential minefield of risks? In this episode of Today in Tech, host Keith Shaw sits down with Alexius Wronka, CTO of Data and Growth at Invisible Technologies, to explore the advantages, limitations, and ethical challenges of using synthetic data to train large language models (LLMs) and enterprise AI systems.

Topics covered:
* What exactly is synthetic data?
* Key benefits vs. human-generated data
* Use cases in healthcare, autonomous vehicles, and enterprise AI
* Dangers of model overfitting and data hallucination
* Synthetic content, explainability, and detection tools
* The Matrix analogy: Are we training AI inside simulations?

Whether you're a data scientist, a CTO, or just curious about how AI models learn, this episode offers a deep dive into one of the most critical and misunderstood trends in machine learning. Don't forget to like, comment, and subscribe for more episodes of Today in Tech!

#SyntheticData #AITraining #InvisibleTechnologies #AlexiusWronka #TodayInTech #KeithShaw #EnterpriseAI #GenerativeAI #TechPodcast
Keith Shaw: As the need for data for AI training continues to increase, many companies are turning to synthetic data to build their projects. But will this create new problems down the road?
We're going to discuss the pros and cons of using synthetic data on this episode of Today in Tech. Hi everybody. Welcome to Today in Tech. I'm Keith Shaw. Joining me on the show today is Alexius Wronka. He is the CTO of Data and Growth at Invisible Technologies.
Welcome to the show, Alexius.

Alexius Wronka: Thanks, Keith. Happy to be here.

Keith: I love the company name, Invisible Technologies. It's just got this cool factor to it. You must be thrilled working for them.

Alexius: It's great. It's the best word.
Keith: It really is a cool name. Before we begin, I want to establish what the general consensus is around the definition of synthetic data. As we get into the topic, I think some things I consider synthetic data might actually be synthetic content or output, especially with generative AI.
Anything from a generative AI prompt could be considered synthetic content. But I want to home in on what we mean by synthetic data, or at least on how you and your team generally define it.
Alexius: I think the key here is that virtually any content or output, like you described, could be classified as data. In the context of the work Invisible does with AI training, we're thinking about data used for training and improving large language models.

So if there's human-generated data and then computer-generated data, synthetic data sits in that second bucket: data created by computers in some capacity.
Keith: So when we talk about it, why are companies looking to use synthetic data? What are the benefits? And is it okay to call it synthetic data versus "real" data, or should we just say human data?

Alexius: I'd call it human-created. It's still real data. The idea is that as model trainers reach the limit of publicly available data sources, and given how slow it is to create human data, we're trying to figure out how to create larger datasets faster.

That allows us to do different things with model training. Human data takes a long time to build. It's prone to errors. Synthetic data lets you create thousands, even millions, of data points in minutes. That's one of the major advantages.
Keith: When I think of synthetic data, I start imagining Excel spreadsheets, database lists, like customer data, for example. We talk to companies that use synthetic data to create training data not based on real customers.

Alexius: You're thinking of dummy customers to build a test environment or new application without violating GDPR. That's a very naive use case for synthetic data.

The more sophisticated application is: how do we take known human data and modify it to help fine-tune models in a direction we want, especially when historical datasets don't exist?
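The dummy-customer idea Alexius describes can be sketched in a few lines. This is a minimal illustration using only Python's standard library; the field names and value pools below are hypothetical, and a production pipeline would instead sample from distributions fitted to real, anonymized data.

```python
import random
import string

# Hypothetical value pools, for illustration only.
FIRST_NAMES = ["Ana", "Bo", "Chen", "Dev", "Eli"]
PLANS = ["free", "pro", "enterprise"]

def synthetic_customer(rng: random.Random) -> dict:
    """Generate one fake customer record with no link to any real person."""
    name = rng.choice(FIRST_NAMES)
    domain = "".join(rng.choices(string.ascii_lowercase, k=8))
    return {
        "name": name,
        # ".example" is a reserved TLD, so these addresses can never be real.
        "email": f"{name.lower()}@{domain}.example",
        "plan": rng.choice(PLANS),
        "monthly_spend": round(rng.uniform(0, 500), 2),
    }

rng = random.Random(42)
customers = [synthetic_customer(rng) for _ in range(1000)]
```

A thousand records like these are enough to exercise a test environment without any GDPR exposure, which is exactly the basic use case mentioned above.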
Keith: Why else are companies turning to synthetic data? Some stories I've read say there just isn't enough human-generated data to properly train AI. I covered autonomous vehicles a lot, and they talked about using simulated data to train the systems because they lacked enough real-world sensor or LiDAR data.

Alexius: It really comes down to the fact that there just isn't enough data. That's the main reason for synthetic data. When they built AlphaGo, they used synthetic data because not enough games had been played.

They built two models to go against each other in an adversarial system to generate synthetic datasets. What's interesting about synthetic data in adversarial systems is that they can generate novel datasets that help fine-tune and improve your model over time.
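As a toy illustration of that self-play idea: two random policies play the simple game of Nim against each other, and every finished game becomes a synthetic training example. (AlphaGo's real pipeline combined policy and value networks with tree search; this sketch only shows how self-play can mass-produce game data.)

```python
import random

def self_play_nim(rng, heap=15, max_take=3):
    """Play one game of Nim with two random policies.

    Returns (move_log, winner): the log holds (player, heap_before, take)
    tuples, i.e. state-action pairs usable as training data.
    """
    log, player = [], 0
    while True:
        take = rng.randint(1, min(max_take, heap))
        log.append((player, heap, take))
        heap -= take
        if heap == 0:
            return log, player  # whoever takes the last object wins
        player = 1 - player

rng = random.Random(0)
# Each game yields a fresh trajectory; thousands can be generated in seconds,
# which is the appeal of self-play as a synthetic data source.
dataset = [self_play_nim(rng) for _ in range(5000)]
```

In a real adversarial setup the policies would improve between games, so later trajectories cover situations neither the original data nor earlier generations contained.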
Keith: So the benefit is that you get the data needed to train the models, right?
Alexius: Exactly, and you get it much faster. You can iterate more quickly. Creating human datasets might take days, weeks, or even months to get a large enough corpus. Then it goes into the model, and everyone crosses their fingers hoping it improves.

Keith: And synthetic datasets are probably cleaner than human-generated ones, right? No bad, unclear, or unstructured data; you can structure it right from the start.

Alexius: Yeah. One of the biggest challenges in creating human datasets is orchestrating the effort across many people to complete the same task with the same output. It's full of errors. People have to create and QA the right thing. With synthetic data, it just follows a process. It algorithmically gets there, and the output is consistent 99.9% of the time.
Keith: On the other end of the scale, since these things aren't perfect, what concerns do companies raise with you about synthetic data? What could be troublesome?

Alexius: From a model training perspective, the biggest challenge with synthetic data is: does it give us anything new or interesting that actually improves the model? If it's built from data that already exists in the model, the concern is you're just overfitting the model to the synthetic data.

That's why adversarial systems are useful; they help generate novel datasets. A lot of companies now ask: what's the right mix of human and synthetic data?
How much human data do I need to create novel situations that guide models in different directions, while also having a large enough dataset to train the model?
The way I talk about it is: when building a new or state-of-the-art model, you want to get it "up to the bar." Synthetic data gets you there quickly. Think about the distillation techniques DeepSeek used; they took data from other models and used high-quality outputs to reach a strong baseline.

But to get "above the bar," you have to return to human data, creating datasets that get you out of what I'd call the synthetic data moat.
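The mixing question can be made concrete with a small sketch. The 20% human fraction below is an arbitrary illustration, not a recommendation; finding the right ratio is exactly the open question described above.

```python
import random

def mix_datasets(human, synthetic, human_fraction, size, seed=0):
    """Build a training set with a chosen human/synthetic ratio.

    Samples with replacement, so either pool can be smaller than its share,
    which mirrors the common case of scarce human data.
    """
    rng = random.Random(seed)
    n_human = round(size * human_fraction)
    batch = ([rng.choice(human) for _ in range(n_human)] +
             [rng.choice(synthetic) for _ in range(size - n_human)])
    rng.shuffle(batch)
    return batch

# Toy corpora tagged by origin: a small human pool, a large synthetic one.
human = [("human", i) for i in range(100)]
synthetic = [("synthetic", i) for i in range(10_000)]
train = mix_datasets(human, synthetic, human_fraction=0.2, size=1_000)
```

Sweeping `human_fraction` and evaluating the resulting model is one straightforward way teams probe for the balance between novelty and volume.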
Keith: When I think of synthetic and human data, I start picturing a baking analogy, like oil and water. Can you mix them to create something new? Or do they always separate? Or is that the worst analogy ever?

Alexius: If you took one piece of synthetic data and one piece of human-created data, the untrained eye wouldn't know the difference.

There are tools online where you can input a model-generated output and ask, "Did AI create this?" Sometimes they're accurate; sometimes they misidentify human content as AI-generated, and vice versa. So I'd say: it's all water.

Keith: So not oil and water? More like adding Kool-Aid to water? You get different water?

Alexius: I'd say it's more like using two types of milk, dairy milk and oat milk, and seeing how each changes a recipe. If you're making pancakes, cow's milk gives one flavor, oat milk another. Both act as liquids and binding agents, but they're both still milk.
Keith: I love that we're using baking analogies. One thing I'm seeing is companies using AI models with internal data for more accurate results: Retrieval Augmented Generation, or RAG.
Do they still have use cases for synthetic data in these situations, or are they moving from a big-picture model to a narrower focus?
Alexius: That's a great question. A lot of model builders are moving away from common public languages like Python or JavaScript; there's lots of data there. But now they're working on enterprise apps built in Java, C++, or other lower-level languages, where there's less data.

They might only have two or three large internal codebases. So how do they get more examples? That's a great opportunity to use synthetic data: create versions of that code to build more instances.
Then you might have human annotators label that data and feed it back into the model, so it better understands your specific codebase.
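One common way to "create versions of that code," in Python at least, is a behavior-preserving rewrite such as identifier renaming over the syntax tree. A minimal sketch follows; the renaming map is made up for illustration, and real pipelines use a much wider range of transformations.

```python
import ast

class Rename(ast.NodeTransformer):
    """Rename identifiers to produce a behavior-preserving code variant."""
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):   # variable references in the body
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):    # function parameters
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

def synthetic_variant(source: str, mapping: dict) -> str:
    tree = Rename(mapping).visit(ast.parse(source))
    return ast.unparse(tree)      # ast.unparse requires Python 3.9+

original = "def area(w, h):\n    return w * h"
variant = synthetic_variant(original, {"w": "width", "h": "height"})
```

Each mapping yields another syntactically distinct but semantically identical example, multiplying a small codebase into many training instances.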
Keith: I missed this earlier: does the use of synthetic data happen more in specific industries or company departments? Where is it used most?
Alexius: We touched on this. Self-driving is a big one; there just isn't enough data. Healthcare is another. These are industries very focused on safety or heavy regulation. They need to work out all the bugs before deploying AI in production. You can't put a car on the road unless you're 99% sure it won't crash.

Keith: In healthcare, is this where they use AI to detect tumors early, where there's not enough real data?

Alexius: Maybe not tumors just yet. But think about care plans. You can create different patient scenarios with various diseases or symptoms, then fine-tune a model on that data to generate care plans. That's a great use case. There are HIPAA laws, so using real data is tough. Synthetic data can create realistic scenarios for model training.

Keith: Got it.
One fear I hear, and this might be more about synthetic content, is that results might be inaccurate. Is it like the "copy of a copy" problem, where synthetic data gets tweaked and re-fed into the system and degrades over time?

Alexius: That's a real concern: reinforced hallucinations. A hallucinated output gets into a core dataset, and if you retrain on that, it becomes baked into the system. That's why you need human annotators to look at the data, create new datasets, and catch those anomalies early.
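The "copy of a copy" degradation Keith describes, often called model collapse, can be demonstrated with a toy simulation: fit a simple model (here, a normal distribution) to data, sample a new dataset from the fit, and repeat. The specific numbers are arbitrary; the point is that repeatedly training on your own outputs loses diversity.

```python
import random
import statistics

def next_generation(samples, rng, n=20):
    """'Retrain' on the previous generation's own outputs: fit a normal
    distribution to the samples, then draw a fresh dataset from the fit."""
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(20)]  # generation 0: "human" data
initial_spread = statistics.stdev(data)

for _ in range(2000):            # each generation trains only on the last one
    data = next_generation(data, rng)

final_spread = statistics.stdev(data)
# Over many generations the spread collapses toward zero: rare, tail
# information disappears first, exactly the degradation being described.
```

Keeping fresh human data in every training round, as Alexius suggests with human annotators, is the standard countermeasure to this feedback loop.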
Keith: This is slightly off-topic, but in a prep call with your colleague Adam, we talked about a story where linguists noticed AI using the word "delve" more and more, and then humans started using it because they saw it in AI outputs.
If synthetic data creates something not based on human content, but we start believing it, is that dangerous? Or am I going down a Matrix rabbit hole?
Alexius: It's a fascinating question. As humans, as biological machines, we adapt. When we see things repeatedly in our environment, we adjust and assume it's normal. People use ChatGPT to write emails, respond to Slack messages. Over time, their own language starts mirroring the AI's style.

Keith: Right. We read AI-generated outputs without realizing it, and start adopting the same words. "Delve" is the one they highlighted. I first thought it was "realm," but it's "delve." There's an NPR article about it.

Alexius: That's fascinating. Language evolves naturally. Think about "who" versus "whom." Hardly anyone uses "whom" correctly anymore. Language just moves on.

Keith: I started as a newspaper writer. Now people are using acronyms in texts. It feels like language is devolving, almost like we're back to hieroglyphics with emojis.
Alexius: How many people have you heard actually pronounce "LOL"?

Keith: Nobody does. Wait, people do?

Alexius: Yeah. I hang out with people who actually say "LOL" out loud.

Keith: I've heard both "LOL" and "L-O-L," instead of laughing. Go back 25 years, before texting, and nobody was doing that. But as things enter the environment, we adapt.

Keith: Let's jump back to my regularly scheduled questions. I think you mentioned earlier, when it comes to synthetic data used in training, should it be identified? In the ChatGPT world, or with AI-generated images, it's good practice to disclose that they were created by AI.
Is there a similar conversation happening for synthetic training data? Or does it not matter as much?
Alexius: Anyone using data today wants to know its point of origin. If it's synthetic, they want to know that. As we said earlier, synthetic data has specific use cases, like improving models and getting them to baseline quality.

But if someone's not getting the results they expect, it's important to know whether the training data was synthetic or human. Transparency is incredibly important. And from my background, what matters most in data is where it comes from and how transparent the pipeline is between its origin and the user. If there's a black box in the middle, that's when things fall apart.
Keith: That leads me to explainability. It feels like the black box is a big issue. Ask a data scientist, "Why did the model make this decision?" and they usually say, "Well..."

Alexius: Exactly.
It gets into the whole concept of chain-of-thought reasoning. Explainability is still incredibly hard. You can trace it back to model weights and say, "this is kind of why," but to truly pin down the probabilistic reason one word followed another, we don't have that capability yet.

You'd need to tie model weights and training directly to output, and that would be close to machine code, stuff most people can't read.

Keith: Does it become a chicken-or-egg scenario? Or am I just making weird logic leaps?

Alexius: No, I get what you're saying. We don't know why. No matter how much we try to explain it, any explanation from the model is still the product of a probabilistic system we don't fully understand.

Even if a model gives reasoning for its answer, that reasoning is itself generated by another probabilistic process. To really explain it, you'd have to connect the training data and weights directly to outputs, and that would likely require understanding near-machine-level code.
Keith: With synthetic and human data, are there tools that detect if something was produced synthetically? I know you talked about labeling it, but can systems also detect it automatically?

Alexius: Absolutely. If we want students and learners to practice honestly, this is critical, so they don't just cheat. Detection is a huge part of model evaluation: understanding not just performance, but what kind of output a model will produce.

You want to be able to say: "Was this generated by a machine or a human?" In fraud detection and other enterprise use cases, this is a major investment. Enterprises need to know what's real and what's synthetic.
Keith: Is this mostly something only data scientists need to care about, or should the entire enterprise understand synthetic data?
Alexius: I don't think we're quite there yet, but we're heading in that direction. Just like you train your workforce around phishing and cybersecurity, we'll need similar education around synthetic data.

You'll get emails from "the CEO" that look completely legit, maybe even from their email address, and you'll need tools to protect against that.
Keith: But how does that tie into synthetic data?
Alexius: That is synthetic data: malicious synthetic data. Bad actors are going to use these models. We need systems that can detect when data didn't come from a real human.
Keith: Have best practices been established yet for handling synthetic data?
Alexius: Still very early, very much the Wild West. Some regulated industries, like full self-driving or healthcare, are starting to build frameworks. But for most enterprises, this is still early days. Although I'd say it's less Wild West and more risk-averse: there's more caution than recklessness. Companies aren't rushing tools to market without thinking through the risks.
Keith: One last analogy, Alexius. Was The Matrix the ultimate example of synthetic data? A simulated 1990s Earth?
Alexius: Remember: nobody knows what chicken tastes like.
Keith: So if I wanted to explain synthetic data to someone, could I say, "It's like the world created by the machines in The Matrix"?

Alexius: Yeah, I think that's a great example. It's virtual reality.

Keith: Alexius, thanks again for chatting with us about this great topic.

Alexius: Thanks, Keith.

Keith: That's going to do it for this week's show. Be sure to like the video, subscribe to the channel, and leave your thoughts below if you're watching on YouTube.
Join us every week for new episodes of Today in Tech. I'm Keith Shaw. Thanks for watching.