Here at Google it is not uncommon for researchers to work on products (it is my own modus operandi): you can come up with something interesting, experiment to convince yourself and others that it is worthwhile, and then work as part of the team to build it and help create world-changing products. But are you willing to do that? That decision is the hard part, not publishing a paper.
From a Google standpoint, I think we need to make sure these barriers don't form, and that making products, experimenting, and having a venue for trying bold new approaches continue to be part of the culture.
Research Blog
Can You Publish at Google?
Tuesday, May 06, 2008 at 03:10:00 PM
VisualRank
Thursday, May 01, 2008 at 01:30:00 PM
Posted by Shumeet Baluja and Yushi Jing
At WWW-2008, in Beijing, China, we presented our paper "PageRank for Product Image Search". The paper describes a system that uses visual cues, rather than text information alone, to determine the rank of images. The idea was simple: find common visual themes in a set of images, and then find a small set of images that best represented those themes. The resulting algorithm wound up being PageRank, but on an entirely inferred graph of image similarities. Since the release of the paper, we've noticed lots of coverage in the press and have received quite a few questions. We thought we could answer a few of them here.
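To make the idea concrete, here is a minimal sketch of PageRank-style ranking over an inferred similarity graph. It is an illustration under stated assumptions, not the system from the paper: the toy feature vectors, the cosine-similarity stand-in, and the names `rank_by_similarity`, `damping`, and `iterations` are all hypothetical.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Hypothetical stand-in for a real image-similarity measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    features: Sequence[Vector],
    similarity: Callable[[Vector, Vector], float] = cosine_similarity,
    damping: float = 0.85,
    iterations: int = 100,
) -> List[float]:
    """Score each item by power iteration on a graph inferred from similarities."""
    n = len(features)
    if n == 0:
        return []
    # Inferred graph: edge weight = similarity between two items (no self-loops).
    weights = [
        [0.0 if i == j else max(0.0, similarity(features[i], features[j]))
         for j in range(n)]
        for i in range(n)
    ]
    # Each item j passes its score to its neighbours in proportion to similarity.
    col_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            flow = 0.0
            for j in range(n):
                if col_sums[j] > 0.0:
                    flow += weights[i][j] / col_sums[j] * scores[j]
                else:
                    flow += scores[j] / n  # isolated item: spread its score evenly
            new_scores.append(damping * flow + (1.0 - damping) / n)
        scores = new_scores
    return scores


# Toy usage: three similar "images" and one outlier (made-up feature vectors).
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
toy_scores = rank_by_similarity(toy_features)
print(sorted(range(len(toy_scores)), key=lambda i: -toy_scores[i]))
```

In this sketch the similarity function is just a parameter, so a different measure of image similarity could be swapped in without touching the ranking loop.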
"Why did we choose to use products for our test case?" First and foremost, product queries are popular in actual usage; addressing them is important. Second, users have strong expectations of what results we should return for these queries; therefore, this category provides an important set of examples that we need to address especially carefully. Third, on a pragmatic note, they lend themselves well to the type of "image features" that we selected in this study. Since the publication of the paper, we've also extended our results to other query types, including travel-related queries. One of the nice features of the approach is that (we hope) it will be easy to extend to new domains; as research in measuring image or object similarity continues, the advances can easily be incorporated into the similarity calculation to compute the underlying graph; the computations on the graph do not change.
"Where are we going from here?" Besides broadening the sets of queries (and sets of features) for which we can use this approach, there are three directions we're exploring. First, estimating similarity measures for all of the images on the web is computationally expensive; approximations or alternative computations are needed. Second, we hope to evaluate our approach with respect to the large number of recently proposed alternative clustering methods. Third, many variations of PageRank can be used in quite interesting ways for image search. For example, we can use some of these previously published methods to reintroduce, in a meaningful manner, the textual information that the VisualRank algorithm removed. In the end, we have an approach that has an easy integration with both text and visual clues. Stay tuned for more on that in the coming months.
And now to answer the most commonly asked question, "Is it live?" Not yet. Currently, it is research in progress (click here to help speed up the process). In the meantime, though, if you'd like another sneak peek of our research on large graphs, this time in the context of YouTube datamining, just follow the link.
Finally, we want to extend our deepest thanks to the people who helped on this project, especially the image-search team; without their help, this research would not have been possible.
Research in the Cloud: Providing Cutting Edge Computational Resources to Scientists
Wednesday, April 23, 2008 at 02:13:00 PM
Posted by Christophe Bisciglia, Senior Software Engineer, and Alfred Spector, Vice President of Research
The emergence of extremely large datasets, well beyond the capacity of almost any single computer, has challenged traditional and contemporary methods of analysis in the research world. While a simple spreadsheet or modest database remains sufficient for some research, problems in the domain of "computational science," which explores mathematical models via computational simulation, require systems that provide huge amounts of data storage and computer processing (current research areas in computational science include climate modeling, gene sequencing, protein mapping, materials science and many more). As an added hurdle, this level of computational infrastructure is often not affordable to research teams, who usually work with significant budgetary restrictions.
Fortunately, as the Internet technology industry expands its global infrastructure, accessing world-class distributed computational and storage resources can be as simple as visiting a website. Building on their Academic Cloud Computing Initiative (ACCI) announced last October, Google and IBM, with the National Science Foundation, announced in February the CluE initiative to address this particular need. After coordinating the technical details with Google and IBM, the NSF posted the official solicitation of proposals last week.
Our primary goal in participating in the CluE initiative is to encourage the understanding, further refinement and, importantly, targeted application of the latest distributed computing technology and methods across many academic disciplines. Engaging educators and researchers with the new potential of distributed computing for processing and analyzing extremely large datasets is an invaluable investment for any technology company to make, and Google in particular is pleased to make a contribution to the academic community that has enabled so many recent advances in the industry.
We're looking forward to an eclectic collection of proposals from the NSF's solicitation. We believe many will leverage the power of distributed computing to produce a diverse range of knowledge that will provide long term benefit to both the research community and the public at large. We also hope that Google's contribution to this low cost, open source approach to distributed computing will allow many more in the academic community to take advantage of this pervasive technological shift.
More details, including information on how to apply for access to these resources, are available on the NSF site.
Deploying Goog411
Friday, March 28, 2008 at 03:34:00 PM
Posted by Francoise Beaufays
A couple of years ago, a few of us got together and decided to build Goog411. It would be a free phone service that users could call to connect to any business in the US, or simply to browse through a list of businesses such as "bookstores" in a given city. Everything would be fully automated, with no operator in the background, just a speech recognition system to converse with the user, and Google Maps to execute the business search.
We knew that speech recognition is not a solved problem; there would be users for whom the system wouldn't work well, and queries that would be harder to recognize than others. But hosting the service ourselves gave us big advantages: we could iterate as often as we wanted on any component of the system, we'd have access to all the data, and we could measure whatever seemed relevant to callers. So we built Goog411, started taking some traffic, defined metrics, and iterated many, many times.
We learned a few interesting things in the process (see our ICASSP paper). For example, we discovered that databases with lists of business names are almost useless for training a language model of how users answer the question "What business name or category?"; aggregated web query logs from Google Maps yield far better performance. And we found that the speech data we collect through our own service is almost as useful for modeling new queries as the web data, even though we have orders of magnitude less of it. After all, you may type "real estate" in Google Maps to glance at a few properties, but would you ask for it over the phone while driving your car?
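One common way to compare training sources for a language model is to train a model on each and measure perplexity on held-out queries (lower is better). The sketch below shows that comparison with an add-one-smoothed unigram model. It is only an illustration: the three tiny corpora are invented, the function names are hypothetical, and nothing here is meant to reproduce the models or metrics used in Goog411.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set

# Invented toy corpora, for illustration only.
LISTINGS = ["joes pizza incorporated", "smith and sons plumbing llc"]
TYPED_QUERIES = ["pizza", "plumber", "pizza near downtown", "real estate"]
SPOKEN_QUERIES = ["pizza", "plumber"]  # held-out queries to evaluate on


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def build_vocabulary(*corpora: Sequence[str]) -> Set[str]:
    """Shared vocabulary so that the models are comparable."""
    return {w for corpus in corpora for text in corpus for w in tokenize(text)}


def train_unigram(corpus: Sequence[str], vocabulary: Set[str]) -> Dict[str, float]:
    """Unigram model with add-one smoothing over the shared vocabulary."""
    counts = Counter(
        w for text in corpus for w in tokenize(text) if w in vocabulary
    )
    total = sum(counts.values()) + len(vocabulary)
    return {w: (counts[w] + 1) / total for w in vocabulary}


def perplexity(model: Dict[str, float], corpus: Sequence[str]) -> float:
    """Perplexity of a non-empty corpus whose words are all in the vocabulary."""
    words = [w for text in corpus for w in tokenize(text)]
    log_prob = sum(math.log(model[w]) for w in words)
    return math.exp(-log_prob / len(words))


vocab = build_vocabulary(LISTINGS, TYPED_QUERIES, SPOKEN_QUERIES)
listing_model = train_unigram(LISTINGS, vocab)
query_model = train_unigram(TYPED_QUERIES, vocab)
print("listings:", perplexity(listing_model, SPOKEN_QUERIES))
print("queries: ", perplexity(query_model, SPOKEN_QUERIES))
```

On this toy data the query-trained model assigns higher probability to the held-out queries, simply because its training text looks more like what callers say; real systems would use far larger corpora and richer models.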
Today Goog411 has grown from an experiment into a product, and we're working on expanding the service to Canada. As calls flow through the system, our focus is still on making the best use of the increasing data, defining metrics that best correlate to the user's experience, and taking advantage of the computer resources and data sources available within Google.
Maybe our most rewarding experience so far has been to see our traffic grow, and to see repeat callers succeed more and more often with the system. Have you tried it already? Just call 1-800-GOOG-411, and don't hesitate to send us feedback!
This year's scalability conference
Monday, February 11, 2008 at 11:53:00 AM
Posted by Andrew Schwerin, Software Engineer
Managing huge repositories of data and large clusters of machines is no easy task -- and building systems that use those clusters to usefully process that data is even harder. Last year, we held a conference on scalable systems so a bunch of people who work on these challenges could get together and share ideas. Well, it was so much fun that we've decided to do it again.
This year, the conference is taking place in Seattle on Saturday, June 14. (Registration is free.) If you'd like to talk about a topic on scalable or large-scale systems that is near and dear to your heart, we'd love to hear from you. Potential topics include:
Development, deployment and production:
- Systems, environments and languages for building, deploying and debugging complex datacenter-scale apps, or for allowing teams of collaborating engineers to work together on such apps more effectively
- Unique challenges of scaling services for mobile devices
- Location-aware scaling techniques
- Experiences designing scalable apps involving mobile devices
Google Education Summit
Thursday, October 18, 2007 at 05:31:00 PM
Posted by Jeff Walz and Kevin McCurley
The world's research and educational infrastructures are tightly intertwined. Research universities enable students to participate in research activities, and research contributes to the vitality of the educational experience. At Google, we also recognize the importance of education to our research and engineering activities. In addition to our own in-house activities, we maintain strong ties to academic institutions through visiting faculty programs and summer internships. In recognition of the importance of education to Google's mission, we also recently organized a Google Education Summit. Mehran Sahami has more to say about this in a recent blog post.
OpenHTMM Released
Sunday, September 23, 2007 at 02:01:00 PM
Posted by Ashok C. Popat, Research Scientist
Statistical methods of text analysis have become increasingly sophisticated over the years. A good example is automated topic analysis using latent models, two variants of which are Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation.
Earlier this year, Amit Gruber, a Ph.D. student at the Hebrew University of Jerusalem, presented a technique for analyzing the topical content of text at the Eleventh International Conference on Artificial Intelligence and Statistics in Puerto Rico.
Gruber's approach, dubbed Hidden Topic Markov Models (HTMM), was developed in collaboration with Michal Rosen-Zvi and Yair Weiss. It differs notably from others in that, rather than treating each document as a single "bag of words," it imposes a temporal Markov structure on the document. This lets it account for shifting topics within a document, which yields a topic segmentation of the document. It also seems to distinguish effectively among the multiple senses that the same word may have in different contexts within the same document.
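To give a feel for what a Markov structure over topics means, here is a small generative sketch. It is a simplification written for this post, not the OpenHTMM code or its API: it assumes that the topic can only change at sentence boundaries, and the names `generate_document`, `switch_prob`, `TOPIC_WORDS`, and `TOPIC_WEIGHTS`, as well as the toy vocabulary, are all hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy topics; note that "bank" appears in both with different senses.
TOPIC_WORDS: Dict[str, List[str]] = {
    "finance": ["bank", "loan", "interest", "account"],
    "river": ["bank", "water", "fishing", "boat"],
}
TOPIC_WEIGHTS: Dict[str, float] = {"finance": 0.5, "river": 0.5}


def generate_document(
    topic_words: Dict[str, List[str]],
    topic_weights: Dict[str, float],
    sentence_lengths: Sequence[int],
    switch_prob: float,
    rng: random.Random,
) -> List[Tuple[str, List[str]]]:
    """Generate (topic, words) pairs, one per sentence.

    Assumption of this sketch: at each sentence boundary the topic is redrawn
    with probability `switch_prob`; otherwise the previous topic is kept.
    """
    topics = list(topic_words)
    weights = [topic_weights[t] for t in topics]
    document: List[Tuple[str, List[str]]] = []
    current = None
    for length in sentence_lengths:
        if current is None or rng.random() < switch_prob:
            current = rng.choices(topics, weights=weights, k=1)[0]
        # Every word in the sentence is drawn from the current topic.
        words = [rng.choice(topic_words[current]) for _ in range(length)]
        document.append((current, words))
    return document


for topic, words in generate_document(
    TOPIC_WORDS, TOPIC_WEIGHTS, [4, 3, 5, 4], 0.3, random.Random(0)
):
    print(topic, " ".join(words))
```

Runs of consecutive sentences that share a topic form the segments, and a word such as "bank" takes its sense from the segment in which it appears. Fitting a model of this kind means inferring those hidden topics from the words alone, which this sketch does not attempt.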
Amit is currently doing a graduate internship at Google. As part of his project, he has developed a fresh implementation of his method in C++. We are pleased to release it as the OpenHTMM package to the research community under the Apache 2 license, in the hopes that it will be of general interest and facilitate further research in this area.
