On the predictability of Search Trends

Monday, August 17, 2009 at 02:19:00 PM



Since launching Google Trends and Google Insights for Search, we've been providing daily insight into what the world is searching for. An understanding of search trends can be useful for advertisers, marketers, economists, scholars, and anyone else interested in knowing more about their world and what's currently top-of-mind.

As many have observed, the trends of some search queries are quite seasonal and follow repeated patterns. For instance, the search trends for the query "ski" peak during the winter seasons in the US and Australia. The search trends for basketball correlate with annual league events and are consistent year over year. When looking at the aggregated volume of search queries related to particular categories, one can also observe regular patterns in categories like Food & Drink or Automotive. Such trend sequences appear quite predictable, and one would naturally expect the patterns of previous years to repeat going forward.

On the other hand, for many other search queries and categories, the trends are quite irregular and hard to predict. Examples include the search trends for obama, twitter, android, or global warming, and the trend of aggregate searches in the News & Current Events category.

Having predictable trends for a search query or for a group of queries could have interesting ramifications. One could forecast the trends into the future and use the forecast as a "best guess" for various business decisions such as budget planning, marketing campaigns and resource allocation. One could also identify deviations from such a forecast and look for new factors that are influencing the search volume, as demonstrated in Flu Trends.

We were therefore interested in the following questions:

  • How many search queries have trends that are predictable?
  • Are some categories more predictable than others? How are predictable trends distributed across the various categories?
  • How predictable are the trends of aggregated search queries for different categories? Which categories are more predictable and which are less so?
To learn about the predictability of search trends, and to overcome our basic limitation of not knowing what the future will entail, we characterize the predictability of a Trends series based on its historical performance. In other words, we estimate the a posteriori predictability of a sequence from the error between its forecasted trends and its actual performance.

Specifically, we have used a simple forecasting model that learns basic seasonality and general trend. For each trends sequence of interest, we take a point in time, t, about a year back, compute a one-year forecast from t based on the historical data available at time t, and compare it to the actual trends sequence observed since time t. The error between the forecasted trends and the actual trends characterizes the predictability level of a sequence, and when the error is smaller than a pre-defined threshold, we denote the trends query as predictable.
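
To make the procedure concrete, here is a minimal sketch of such a backtest in Python. It is not the model from the paper: the trend-plus-seasonal-profile forecaster, the function names (`forecast`, `mean_abs_prediction_error`, `backtest`), the monthly period of 12 and the 25% threshold are all assumptions chosen for illustration.

```python
from statistics import mean


def forecast(history, horizon=12, period=12):
    """Extrapolate a linear trend plus a repeating seasonal profile.

    Illustrative stand-in for "a simple model that learns basic
    seasonality and general trend"; not the paper's actual model.
    """
    if len(history) < 2 * period:
        raise ValueError("need at least two full periods of history")
    # General trend: average year-over-year change, expressed per time step.
    yoy = [history[i] - history[i - period] for i in range(period, len(history))]
    slope = mean(yoy) / period
    # Basic seasonality: average detrended level at each position in the cycle.
    detrended = [y - slope * i for i, y in enumerate(history)]
    profile = [mean(detrended[p::period]) for p in range(period)]
    n = len(history)
    return [slope * i + profile[i % period] for i in range(n, n + horizon)]


def mean_abs_prediction_error(predicted, actual):
    """Mean of |predicted - actual| / |actual|; assumes no actual value is zero."""
    return mean(abs(p - a) / abs(a) for p, a in zip(predicted, actual))


def backtest(series, horizon=12, period=12, threshold=0.25):
    """Hold out the last `horizon` points, forecast them, and score the forecast.

    Returns (error, predictable). The threshold is a hypothetical value,
    not one taken from the post.
    """
    history, actual = series[:-horizon], series[-horizon:]
    error = mean_abs_prediction_error(forecast(history, horizon, period), actual)
    return error, error < threshold
```

Calling `backtest(series)` on a list of monthly values covering at least three years holds out the final year as the "actual" sequence and forecasts it from the earlier data only, which mirrors the choice of t described above. A series with a steady trend and a stable seasonal shape scores a low error, while a series with large one-off spikes in the held-out year scores a high one.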

Our work to date is summarized in a paper called On the Predictability of Search Trends, which includes the following observations:
  • Over half of the most popular Google search queries are predictable in a 12-month-ahead forecast, with a mean absolute prediction error of about 12%.
  • Nearly half of the most popular queries are not predictable (with respect to the model we have used).
  • Some categories have a particularly high fraction of predictable queries; for instance, Health (74%), Food & Drink (67%) and Travel (65%).
  • Some categories have a particularly low fraction of predictable queries; for instance, Entertainment (35%) and Social Networks & Online Communities (27%).
  • The trends of aggregated queries per category are much more predictable: 88% of the aggregated category search trends of over 600 categories in Insights for Search are predictable, with a mean absolute prediction error of less than 6%.
  • There is a clear association between the existence of seasonality patterns and higher predictability, as well as an association between high levels of outliers and lower predictability. For the Entertainment category, which typically has less seasonal search behavior as well as a relatively higher number of singular spikes of interest, we have seen a predictability of 35%, whereas the Travel category, with very seasonal behavior and a lower tendency for short spikes of interest, had a predictability of 65%.
  • One should expect the actual search trends to deviate from forecast for many predictable queries, due to possible events and dynamic circumstances.
  • We show the forecast vs. the actual trends for a few categories, including some that were used recently for predicting the present of various economic indicators. This demonstrates how forecasting can serve as a good baseline for identifying interesting deviations in actual search traffic.
Since many of the search trends are predictable, we are introducing today a new forecasting feature in Insights for Search, along with a new version of the product. The forecasting feature is applied to queries which are identified as predictable (see, for instance, basketball or the trends in the Automotive category), and the forecast is shown as an extrapolation of the historical trends and search patterns.

There are many more questions that can be looked at regarding search trends in general, and their predictability in particular, including designing and testing more advanced forecasting models, getting other insights into the distributions of sequences, and demonstrating interesting deviations of actual-vs-forecast for predictable trends series. We'd love to hear from you - share with us your findings, published results or insights - email us at insightsforsearch@google.com.

10 comments:

Rob J Hyndman said...

Interesting analysis. However, it looks like there are a couple of problems:
1. The NMSSE uses the variance of the historical data as a denominator. This only works for stationary data and it is unlikely that any of the data here are stationary. See my paper at http://robjhyndman.com/papers/another-look-at-measures-of-forecast-accuracy for a discussion of such measures.
2. The use of regression on the de-seasonalised data assumes linear trends (assuming this is what was meant -- it is not explained clearly). A much better solution would be to use a local linear method to allow the trends to adapt over time. e.g., Holt's method.

Bertil Hatt said...

Time series econometrics... So many childhood memories coming back (I kinda had a nerdy childhood). So I guess you are all going through classic models: my professors were better teachers than I ever will be, so I'll let you pick from their books what refinement to choose.

What would be fascinating (and potentially fun with regards to identifiability and the law of very large numbers, etc.) is to try to tell what websites help you predict unpredictably popular search terms: where did those words appear *before* they were massively searched for?

The resulting classification would be fun to compare to whatever replaced PageRank, being a mostly off-line SocialRank; it would also be great to compare to Leskovek's classification of the blogs that broke the most news.

Wil said...

I just did a presentation on the trend for hangover, shockingly enough big spikes on Saturday and Sunday :)

Thanks for sharing guys/gals

MzGingeR said...

Predictability of some trends can help publishers, advertisers and businesses plan ahead for these trends. Great info!

incrediblehelp said...

I am surprised Entertainment was so low, considering we know when shows/actors will be popular, since release dates are always known in advance.

Salem said...

I think if bloggers were to read Google's reports more often, they'd find that Google knows what they're talking about when they say 'trending'. Fantastic read... Thanks again, Google.

Usman said...

Very interesting work! I'm not an expert at all in this area, but I'm curious about something (tell me if it's dumb).

I notice that your notion of predictability is defined by how well the model predicts based strictly on historical data. Have you, or has anyone, investigated the utility of a predictability measure that takes into account how well the model "predicts" trends/events given both past and "future" data (relative to the event instance)?

The task here would then change from "complete-the-sentence" to "fill-in-the-blank" as it were, where the sentence is the complete time series and the blank is a smaller "test" period from that series, which is withheld from the model. This might make "prediction" easier for the model, but any discrepancies might shed more insight as to the inherent unpredictability of such an event or trend.

I know it's not a new idea, but please pardon me if it's useless in the context of what you're trying to accomplish.

Prez said...

This is really interesting. If the search trends of young minds with terrorism and other bad thoughts were identified, they could be changed at the beginning itself; this can be used for a good cause if it works out well...

Michael F. Martin said...

The predictability measure is rather strange. Seems more accurate to call it a "repeatability" measure, no? It seems to merely translate a history in time and average to see how close various periods match.

A more sophisticated method for determining "predictability" might look for periodicities within same width time windows. The method described in the paper would seem to require that all periodicities persist over similar-width time-windows. What if there's "dispersion"?

Of course, there's no way to Fourier transform the normalized units... for us, the public.

Glad said...

Google can give us some more case studies or examples for some good analysis....