Synthetic Data: Who, What, When, Where, and Why?

Let's take a deep dive into the world of synthetic data and market research.

Synthetic data, a hot topic in the research world at the moment, has captured attention for its potential impact on the industry, from removing bias from data collection to speeding up the fielding process. Like most AI applications, synthetic data has both die-hard fans and skeptics, but the Advanced Analytics team at Dig is excited about the potential that synthetic datasets might offer.

What is it?

Synthetic data is data generated artificially, by AI or other algorithms, that might be able to slot into your research process. In market research, this essentially means giving an AI model various personas (demographic or psychographic) and asking it to complete a survey as if it were each persona. It’s not real human data, but through machine learning the AI can reflect some of the complexities and characteristics of humans and give the sort of answers that real humans might.
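To make the persona idea concrete, here is a minimal sketch in Python. It is an illustration only, not Dig's actual workflow: the `Persona` fields, the prompt wording, and the `ask_model` callable (a stand-in for whatever AI model you would use) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Persona:
    """Hypothetical persona: two demographic fields and one psychographic trait."""
    age_range: str
    region: str
    attitude: str


def build_prompt(persona: Persona, question: str, options: List[str]) -> str:
    """Turn a persona and a closed-ended survey question into a text prompt."""
    numbered = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options, start=1))
    return (
        f"You are a survey respondent aged {persona.age_range} "
        f"living in {persona.region}. "
        f"You would describe yourself as {persona.attitude}.\n"
        f"Question: {question}\n"
        f"Options:\n{numbered}\n"
        "Reply with the number of one option only."
    )


def run_survey(
    personas: List[Persona],
    question: str,
    options: List[str],
    ask_model: Callable[[str], str],
) -> List[Optional[str]]:
    """Ask the model once per persona; record None when the reply is unusable."""
    answers: List[Optional[str]] = []
    for persona in personas:
        reply = ask_model(build_prompt(persona, question, options)).strip()
        if reply.isdigit() and 1 <= int(reply) <= len(options):
            answers.append(options[int(reply) - 1])
        else:
            answers.append(None)  # the model did not return a valid option number
    return answers
```

In practice, `ask_model` would wrap a call to a real AI model. For a dry run you can pass a stub such as `lambda prompt: "2"`, which answers every question with the second option.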

Advantages of synthetic data

The main advantage of synthetic data is its potential to vastly reduce the time and cost involved in research. Much like the switch from CATI (Computer-Assisted Telephone Interviewing) to online surveys, synthetic data promises to make research faster and cheaper. With constant pressure on timelines and budgets, it could streamline and accelerate traditional data generation (who doesn’t want near-instantaneous results for something they’re working on?).

Synthetic data can also create or supplement large and diverse datasets where real-world data is difficult to come by, deepening the richness of the analysis and the insights drawn from it, and supporting model testing and proof-of-concept testing for new methodologies.

It also holds immense potential in proof-of-concept analysis, where it enables faster iterations. In our recent Between Two Joels Q&A session, Joel Armstrong, Director of AI, Dig Insights, emphasized that AI can kickstart the process by generating data, facilitating rapid experimentation and refinement of methodologies.

Moreover, synthetic data offers a unique advantage over human-generated responses: consistency. Unlike human participants, who may tire or answer differently depending on factors like time of day, AI remains steadfast in its responses, supporting reliability in data collection.

"AI-generated synthetic data is a powerful new tool to add to your toolbelt. We're working hard to learn when and where synthetic data is most useful, so we can use this new tool while maintaining our standards of quality for the work we deliver." - Joel Armstrong, Director, AI, Dig Insights

Challenges of synthetic data

The main challenge is trust: how can we trust the validity of the generated data? In our Between Two Joels session, Armstrong said that experimenting with ways to validate the generated data is crucial to understanding the accuracy of the output. When AI takes on different personas using machine learning models, there is a risk of bias, because those models are trained on available internet data (which is inherently biased).

The internet, which provided the training data for these models, is not representative of many cultures. Even if synthetic data represents a reasonable approximation of North American people, it won’t necessarily be useful for international research. And this brings up another challenge: the models are trained on existing, historical data. As a result, the AI can struggle to understand how consumers feel now and to project their future attitudes and behaviors.

As our experts pointed out in our recent Between Two Joels session, synthetic data isn't yet fully validated and thus may not be considered a reliable method. Additionally, concerns linger about bias in AI models, which reflect the biases present in the data they were trained on.
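One simple way to start validating synthetic responses is to compare how answers are distributed in a synthetic sample against a real human sample for the same question. The sketch below uses total variation distance, chosen here purely for illustration; it is not a method described in the session, and the function names are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List


def answer_shares(responses: List[str]) -> Dict[str, float]:
    """Return the share of respondents who picked each answer option."""
    if not responses:
        raise ValueError("responses must not be empty")
    counts = Counter(responses)
    total = len(responses)
    return {option: count / total for option, count in counts.items()}


def total_variation_distance(human: List[str], synthetic: List[str]) -> float:
    """Half the sum of absolute differences in answer shares.

    0.0 means the two samples have identical answer distributions;
    1.0 means they share no answers at all.
    """
    human_shares = answer_shares(human)
    synthetic_shares = answer_shares(synthetic)
    options = set(human_shares) | set(synthetic_shares)
    return 0.5 * sum(
        abs(human_shares.get(option, 0.0) - synthetic_shares.get(option, 0.0))
        for option in options
    )
```

A small distance on one question does not prove the synthetic data is valid. It only shows that the overall answer mix is similar, so checks like this would need to be repeated across questions, segments, and markets.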

Moreover, the applicability of synthetic data in certain contexts, such as business-to-business (B2B) research, remains in question, because validation is still ongoing and the models have limits in accurately replicating human decision-making processes.

Another risk of synthetic data is in the name. “Synthetic” immediately suggests that this is “fake” data. Some experts worry that this undersells its potential and makes people unnecessarily nervous.

What now?

While the potential of synthetic data to accelerate research processes and expand datasets has everyone in the market research industry excited, we need to tread cautiously. Validating generated data and addressing bias are essential to realizing the full potential of synthetic data. As advancements continue, the market research industry will need to remain adaptive, embracing new methodologies while upholding standards of quality and validity.

If you want to learn more about synthetic data, watch our webinar and Q&A with our team of AI experts here.