Can synthetic data really predict future behavior? We put it to the test.
There are multiple potential applications for AI in the insights industry, one of which is synthetic data. Companies can do primary research with AI-generated synthetic data. But how reliable is that data?
Our first investigation asks a simple question: can synthetic data predict consumer choice?
TLDR: Kind of… but no, not really.
What is synthetic data?
You can think of synthetic data as AI cosplaying a bunch of different personas and answering your survey as those personas.
Synthetic data has three primary advantages.
- It’s fast. Imagine completing thousands of interviews in minutes.
- It’s cheap. You don’t need to pay panels or incentives for human respondents.
- It has the potential to produce higher quality data: no respondent fatigue, no survey fraud.
The test
We wanted to see if AI could go beyond modeling existing data – if it would really understand consumers enough to predict the unexpected, like the Barbenheimer craze of 2023.
We randomly chose 30 movies from the top 200 box office performers of the year for 2018, 2019, and 2023 (90 movies total). We then took the title, the description from IMDB, and the domestic USA box office performance for each movie.
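For readers who want to reproduce the sampling step, here is a minimal sketch of drawing 30 titles per year from a top-200 list. The titles are placeholders and the `draw_sample` helper and its seed are our own illustrative assumptions, not the exact procedure used in the study.

```python
import random

YEARS = (2018, 2019, 2023)

def draw_sample(top_200_by_year, per_year=30, seed=42):
    """Randomly pick `per_year` titles from each year's list, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    return {
        year: rng.sample(titles, per_year)
        for year, titles in top_200_by_year.items()
    }

# Placeholder titles standing in for each year's real top-200 list.
top_200_by_year = {
    year: [f"{year} movie #{rank}" for rank in range(1, 201)]
    for year in YEARS
}

sample = draw_sample(top_200_by_year)
print(sum(len(titles) for titles in sample.values()))  # 90 movies in total
```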
We chose these years because the AI model we used (gpt-3.5-turbo-1106) was trained on data from before 2023. The model “knows” how the 2018 and 2019 movies performed (in terms of critical reviews and social media discussion), but it has no information about how the 2023 movies performed.
Testing the 2023 movies is analogous to using synthetic data to evaluate new innovations. Ideally, synthetic data will be able to predict performance of the 2023 movies based on the title and IMDB description because the model understands what consumers want. This would support the validity of using synthetic data to predict performance of new-to-market movies and potentially predict the performance of new-to-market ideas in other categories.
We ran a study with 500 synthetic respondents per year. As part of this, we provided the AI with “personas” based on the industry-reported demographic breakdowns of moviegoers. The personas covered gender, age, ethnicity, and number of key technology devices owned, and they mirrored the population distribution of household income.
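To make the persona idea concrete, here is a rough sketch of how 500 personas could be drawn from demographic shares and turned into a prompt. The shares, attribute levels, and prompt wording are hypothetical placeholders. The study used industry-reported moviegoer breakdowns and its own prompts, neither of which is reproduced here.

```python
import random

# Hypothetical placeholder shares, NOT the industry-reported figures.
DEMOGRAPHICS = {
    "gender": {"female": 0.5, "male": 0.5},
    "age": {"18-24": 0.2, "25-39": 0.3, "40-59": 0.3, "60+": 0.2},
}

def sample_personas(n, demographics, seed=0):
    """Draw n personas, sampling each attribute in proportion to its shares."""
    rng = random.Random(seed)
    personas = []
    for _ in range(n):
        persona = {}
        for attribute, shares in demographics.items():
            levels = list(shares)
            weights = list(shares.values())
            persona[attribute] = rng.choices(levels, weights=weights, k=1)[0]
        personas.append(persona)
    return personas

def persona_prompt(persona):
    """Build an illustrative instruction asking the model to answer as this persona."""
    traits = ", ".join(f"{key}: {value}" for key, value in persona.items())
    return f"Answer the survey as a moviegoer with this profile: {traits}."

personas = sample_personas(500, DEMOGRAPHICS)
print(persona_prompt(personas[0]))
```

Sampling each attribute independently is a simplification. Real populations have correlated attributes (age and income, for example), so a production setup would likely sample from joint distributions instead.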
We used our proprietary Upsiide platform for the movie evaluation and correlated performance of each movie in our test with the actual box office performance for each year.
The results
AI-generated synthetic data correlates well with movie performance in 2018 and 2019 (a correlation of 0.75 in each year). However, AI struggles to predict the performance of the 2023 movies (a correlation of 0.43).
The 2023 prediction gets much worse when we remove sequels: the correlation drops to 0.15. Sequels are easier for the model to predict because it can draw on the known performance of the earlier movie.
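The comparison above can be sketched in a few lines. This assumes a Pearson correlation between a per-movie score from the synthetic respondents and the box office figure; the article does not specify which correlation measure was used. The rows below are made-up illustrative values, not the study's data, and `synthetic_score` is a hypothetical stand-in for the platform's output.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up rows: (title, is_sequel, synthetic_score, box_office_in_millions)
movies = [
    ("Movie A", True, 0.82, 410.0),
    ("Movie B", True, 0.74, 290.0),
    ("Movie C", False, 0.61, 35.0),
    ("Movie D", False, 0.44, 120.0),
    ("Movie E", False, 0.57, 60.0),
    ("Movie F", False, 0.39, 48.0),
]

def correlation_for(rows):
    """Correlate synthetic scores with box office for the given rows."""
    scores = [row[2] for row in rows]
    box_office = [row[3] for row in rows]
    return pearson(scores, box_office)

all_r = correlation_for(movies)
originals_r = correlation_for([m for m in movies if not m[1]])  # sequels removed
print(f"all movies: {all_r:.2f}, sequels removed: {originals_r:.2f}")
```

Running both versions side by side is the point: if the correlation falls sharply once sequels are excluded, much of the apparent predictive power came from movies the model already had information about.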
AI-generated data looks backward effectively (using the data on which the model was trained) but struggles to predict future behavior.
As with anything, synthetic data can be useful in certain use cases right now. But when it comes to planning for your business’s future, it still has some catching up to do. We encourage critical investigation as part of AI adoption, and as such we promise to always tell you the good, the bad, and the ugly of our AI experiments.