Google Research Blog

From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas





Yet in each word some concept there must be...
— from Goethe's Faust (Part I, Scene III)

Human language is both rich and ambiguous. When we hear or read words, we resolve meanings to mental representations, for example recognizing and linking names to the intended persons, locations or organizations. Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is also Google's core competency, and important for many other tasks in information retrieval and natural language processing. We are happy to release a resource, spanning 7,560,141 concepts and 175,100,788 unique text strings, that we hope will help everyone working in these areas.

How do we represent concepts? Our approach piggybacks on the unique titles of entries from an encyclopedia, which are mostly proper and common noun phrases. We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia's groupings of articles into hierarchical categories.
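To make the anchor-harvesting idea concrete, here is a minimal sketch of how (anchor text, target URL) pairs could be pulled out of a web page. It is not the pipeline used to build the release: the `WIKI_PREFIX` value, the `AnchorCollector` class and the `harvest` helper are all illustrative assumptions, and it relies only on Python's standard `html.parser` module.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed URL prefix for English Wikipedia articles (illustrative only).
WIKI_PREFIX = "http://en.wikipedia.org/wiki/"


class AnchorCollector(HTMLParser):
    """Counts (anchor text, target URL) pairs for links into English Wikipedia."""

    def __init__(self):
        super().__init__()
        self.pairs = Counter()
        self._href = None   # target of the Wikipedia link we are inside, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self._href is None:
            href = dict(attrs).get("href") or ""
            if href.startswith(WIKI_PREFIX):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            if text:
                self.pairs[(text, self._href)] += 1
            self._href = None


def harvest(html):
    """Return a Counter of (anchor text, url) pairs found in one HTML page."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs
```

Summing such counters over many pages would give raw (text, url, count) observations; links to other sites are simply ignored.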

The data set contains triples, each consisting of (i) text, a short, raw natural language string; (ii) url, a related concept, represented by an English Wikipedia article's canonical location; and (iii) count, an integer indicating the number of times text has been observed connected with the concept's url. Our database thus includes weights that measure degrees of association. For example, the top two entries for football indicate that it is an ambiguous term, which is almost twice as likely to refer to what we in the US call soccer:



text=football url count
1.  Association football  44,984
2.  American football  23,373
⋮ 
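As a rough sketch of how such triples might be used, the snippet below loads the two football entries above into a words-to-concepts dictionary and ranks them. The in-memory layout, the underscore-style article names and the helper names `build_dictionary` and `ranked` are assumptions for illustration; the README describes the actual file format.

```python
from collections import defaultdict

# (text, url, count) triples; the two counts come from the table above.
triples = [
    ("football", "Association_football", 44984),
    ("football", "American_football", 23373),
]


def build_dictionary(triples):
    """Map each text string to a {url: count} dictionary."""
    dictionary = defaultdict(dict)
    for text, url, count in triples:
        dictionary[text][url] = dictionary[text].get(url, 0) + count
    return dictionary


def ranked(dictionary, text):
    """Concepts associated with `text`, most frequent first."""
    return sorted(dictionary.get(text, {}).items(),
                  key=lambda item: item[1], reverse=True)


words_to_concepts = build_dictionary(triples)
top = ranked(words_to_concepts, "football")
# How much more often the string points at the first concept than the second.
ratio = top[0][1] / top[1][1]
```

Here `ratio` is 44,984 / 23,373, a little over 1.9, which is the "almost twice as likely" figure quoted above.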

An inverted index can be used to perform reverse look-ups, identifying salient terms for each concept. Some of the highest-scoring strings — including synonyms and translations — for both sports, are listed below:




concept: soccer
football and Football
Soccer and soccer
Association football
fútbol and Fútbol
footballer
Futbol and futbol
Fußball
futebol
futbolista
サッカー
축구
footballeur
Fußballspieler
sepak bola
足球
فوتبال
футболист
כדורגל
piłkarz
voetbalclub
ฟุตบอล
bóng đá
voetbal
Foutbaal
futebolista
لعبة كرة القدم
fotbal

concept: football
American football
football and Football
fútbol americano
football américain
アメリカンフットボール
American football rules
futebol americano
فوتبال آمریکایی
美式足球
football americano
Amerikan futbolu
Le Football Américain
football field
อเมริกันฟุตบอล
פוטבול
كرة القدم الأمريكية
Futbol amerykański
미식축구
futbolu amerykańskiego
football team
американского футбола
Amerikai futball
sepak bola Amerika
football player
američki fudbal
反則
كرة القدم الأميركية
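The reverse look-up can be sketched as a simple inverted index from concept to strings. In the example below only the two counts for "football" come from the post; every other count is invented purely for illustration, and the `invert` helper is a hypothetical name.

```python
from collections import defaultdict

# Only the two "football" counts are real; the rest are made-up placeholders.
sample_triples = [
    ("football", "Association_football", 44984),
    ("soccer", "Association_football", 30000),          # invented count
    ("Fußball", "Association_football", 5000),           # invented count
    ("football", "American_football", 23373),
    ("American football", "American_football", 40000),  # invented count
]


def invert(triples):
    """Build {url: [(text, count), ...]} with the highest counts first."""
    index = defaultdict(list)
    for text, url, count in triples:
        index[url].append((text, count))
    for strings in index.values():
        strings.sort(key=lambda pair: pair[1], reverse=True)
    return index


concepts_to_words = invert(sample_triples)
soccer_strings = [text for text, _ in concepts_to_words["Association_football"]]
```

Because strings are kept as raw text, synonyms and translations such as "Fußball" end up in the same list as the English terms.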

Associated counts can easily be turned into percentages. The following table illustrates the concept-to-words dictionary direction — which may be useful for paraphrasing, summarization and topic modeling — for the idea of soft drink, restricted to English (and normalized for punctuation, pluralization and capitalization differences):



url=Soft_drink text %
1.  soft drink (and soft-drinks)     28.6 
2.  soda (and sodas)     5.5 
3.  soda pop 0.9 
4.  fizzy drinks 0.6 
5.  carbonated beverages (and beverage)     0.3 
6.  non-alcoholic 0.2 
7.  soft 0.1 
8.  pop 0.1 
9.  carbonated soft drink (and drinks)     0.1 
10.  aerated water 0.1 
11.  non-alcoholic drinks (and drink)     0.1 
12.  soft drink controversy 0.0 
13.  citrus-flavored soda 0.0 
14.  carbonated 0.0 
15.  soft drink topics 0.0 
⋮ 
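One possible way to get from raw counts to a table like this is to normalize each string, merge the counts of strings that collapse to the same form, and divide by the total. The sketch below is deliberately naive and is not the normalization used for the table; the counts are invented, and `normalize` and `percentages` are hypothetical helpers.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, turn punctuation into spaces, strip a simple trailing plural."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    words = text.split()
    # Naive plural handling on the last word only ("drinks" -> "drink").
    if words and words[-1].endswith("s") and not words[-1].endswith("ss"):
        words[-1] = words[-1][:-1]
    return " ".join(words)


def percentages(counts):
    """Merge counts of strings with the same normal form; return percentages."""
    merged = Counter()
    for text, count in counts.items():
        merged[normalize(text)] += count
    total = sum(merged.values())
    return {text: 100.0 * count / total for text, count in merged.most_common()}


# Invented counts for strings pointing at one concept.
raw_counts = {"soft drink": 60, "Soft drinks": 15, "soft-drinks": 5,
              "soda": 15, "Sodas": 5}
shares = percentages(raw_counts)
```

With these made-up counts, the three spellings of "soft drink" merge into one entry and the two spellings of "soda" into another.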

The words-to-concepts dictionary direction can disambiguate senses and link entities, which are often highly ambiguous, since people, places and organizations can (nearly) all be named after each other. The next table shows the top concepts meant by the string Stanford, which refers to all three (and other) types:



text=Stanford url % type
1.  Stanford University 50.3  ORGANIZATION
2.  Stanford (disambiguation) 7.7  a disambiguation page
3.  Stanford, California 7.5  LOCATION
4.  Stanford Cardinal football 5.7  ORGANIZATION
5.  Stanford Cardinal 4.1  multiple athletic programs
6.  Stanford Cardinal men's basketball 2.0  ORGANIZATION
7.  Stanford prison experiment 2.0  a famous psychology experiment
8.  Stanford, Kentucky 1.7  LOCATION
9.  Stanford, Norfolk 1.0  LOCATION
10.  Bank of the West Classic 1.0  a recurring sporting event
11.  Stanford, Illinois 0.9  LOCATION
12.  Leland Stanford 0.9  PERSON
13.  Charles Villiers Stanford 0.8  PERSON
14.  Stanford, New York 0.8  LOCATION
15.  Stanford, Bedfordshire 0.8  LOCATION
⋮ 
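A very simple use of this direction is a most-frequent-sense baseline: pick the highest-scoring concept for a string, optionally restricted to the entity types that make sense in context. The sketch below uses the first five rows of the table above; the `link` function and its `allowed_types` parameter are illustrative assumptions, not part of the release.

```python
# (concept, percentage, type) rows copied from the top of the table above.
STANFORD = [
    ("Stanford University", 50.3, "ORGANIZATION"),
    ("Stanford (disambiguation)", 7.7, "a disambiguation page"),
    ("Stanford, California", 7.5, "LOCATION"),
    ("Stanford Cardinal football", 5.7, "ORGANIZATION"),
    ("Stanford Cardinal", 4.1, "multiple athletic programs"),
]


def link(candidates, allowed_types=None):
    """Return the highest-scoring concept, optionally filtered by type."""
    pool = [c for c in candidates
            if allowed_types is None or c[2] in allowed_types]
    if not pool:
        return None
    return max(pool, key=lambda c: c[1])[0]


default_sense = link(STANFORD)
place_sense = link(STANFORD, allowed_types={"LOCATION"})
```

With no context the baseline picks the university; if surrounding text suggests a place, restricting to LOCATION selects Stanford, California instead.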

The database that we are providing was designed for recall. It is large and noisy, incorporating 297,073,139 distinct string-concept pairs, aggregated over 3,152,091,432 individual links, many of them referencing non-existent articles. For technical details, see our paper (to be presented at LREC 2012) and the README file accompanying the data.

We hope that this release will fuel numerous creative applications that haven't been previously thought of!


Produced by Angel X. Chang and Valentin I. Spitkovsky; parts of this work are descended from an earlier collaboration between University of Basque Country's Ixa Group's Eneko Agirre and Stanford's NLP Group, including Eric Yeh, presently of SRI International, and our Ph.D. advisors, Christopher D. Manning and Daniel Jurafsky.


Smart Pricing may increase average publisher revenue



Online publisher networks, such as Google’s AdSense or the Yahoo! Publisher Network, enable advertisers to simultaneously contest click auctions for thousands - even millions - of web publisher ad slots, all with a single max CPC bid. Recognizing that different publishers deliver disparate performance for advertisers, some networks feature automated systems to help advertisers bid more efficiently with that single bid - effectively discounting click prices on publishers according to the relative value of clicks on each publisher’s ad slots. Google, for example, applies Smart Pricing (SP) for this purpose to appropriately discount advertiser bids on the Google Display Network. 

It is widely accepted that a well-executed system like SP enhances advertiser value. Whether SP also improves network revenue (and hence, via publisher revenue-sharing agreements, average publisher revenue) remains a matter of some dispute. While it is clear that higher performing publishers will do better than lower performing publishers, opinion is divided as to whether publishers are on average better or worse off with SP.

Skepticism is understandable - the system by its very nature entails discounting advertiser bids. But if advertisers indeed get more value from a smart-priced network then we would expect them to bid higher because of that feature. The key question is whether the network revenue produced by their SP-discounted higher bids is more, less, or the same as the revenue produced by their undiscounted regular bids. In other words, does Smart Pricing grow the revenue pie?
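A toy calculation can make the question concrete. This is not the model from the paper: it assumes a hypothetical network in which every click is simply billed at the advertiser's bid times a discount factor, ignores auction dynamics entirely, and uses invented numbers throughout.

```python
def network_revenue(bid, discount, clicks):
    """Revenue if every click is billed at bid * discount (a simplification)."""
    return bid * discount * clicks


# Without SP: the advertiser pays the full, undiscounted bid.
baseline = network_revenue(bid=1.00, discount=1.0, clicks=1000)

# With SP: a hypothetical 20% average discount, and two possible reactions.
small_raise = network_revenue(bid=1.10, discount=0.8, clicks=1000)
large_raise = network_revenue(bid=1.50, discount=0.8, clicks=1000)

# Bid increase needed just to break even under this toy billing rule.
break_even_multiplier = 1 / 0.8
```

Under these assumptions the pie only grows if advertisers raise their bids by more than the break-even multiplier of 1.25; a 10% raise loses revenue, while a 50% raise gains it.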

In this paper, I develop a simple and tractable model of an auction-based publisher click network, replete with an idealized version of SP and profit-maximizing advertisers, and use it to derive insights into the revenue effects of systems like SP. While there is no claim here with regard to the revenue impact of SP-like systems on any actual publisher network, it is hoped that the arguments in the paper will help guide intuition and shape realistic expectations for publishers. And the main implication of this analysis is good news for networks and publishers alike - under reasonable conditions Smart Pricing, and its non-Google analogs, can significantly grow the pie.


Is beautiful usable? What is the influence of beauty and usability on reactions to a product?



Did you ever come across a product that looked beautiful but was awful to use? Or stumbled over something that was not nice to look at but did exactly what you wanted?

Product usability and aesthetics coexist, but they are not identical. To understand how usability and aesthetics influence reactions to a product, we conducted an experimental lab study with 80 participants. We created four versions of an online clothing shop varying in beauty (high vs. low) and usability (high vs. low). Participants had to find a number of items in one of those shops and buy them. To understand how beauty and usability influence end users' happiness, we measured how much they liked the shop before and after interaction.

The results showed that the beauty of the interface did not affect how users perceived the usability of the shops: participants were able to tell whether a product was usable or not, no matter how nice it looked. However, the experiment showed that the usability of the shops influenced how users rated the products' beauty. Participants using shops with bad usability rated the shops as less beautiful after using them. We showed that poor usability led to frustration, which put the users in a bad mood and made them rate the product as less beautiful than before interacting with the shop.


Successful products should be beautiful and usable. Our data provide insight into how these factors work together.


Google, the World Wide Web and WWW conference: years of progress, prosperity and innovation



More than forty members of Google’s technical staff gathered in Lyon, France in April to participate in the global dialogue around the state of the web at the World Wide Web conference (WWW) 2012. More than a decade ago, Larry Page and Sergey Brin applied their research to an information retrieval problem, and their work, presented at WWW in 1998, led to the invention of today’s most popular search engine.

As I've watched the WWW conference series evolve over the years, a couple of larger trends struck me in this year's edition. First, there seems to be more of a Mobile Web presence in the technical program, relative to recent years. The refereed program included several interesting Mobile papers, including the Best Student Paper Awardee from Stanford University researchers: Who Killed My Battery: Analyzing Mobile Browser Energy Consumption, Narendran Thiagarajan, Gaurav Aggarwal, Angela Nicoara, Dan Boneh, Jatinder Singh.

Second, one gets the sense that the WWW community is moving from the classic "bag of words" view of web pages to an entity-centric view. There were a number of papers on identifying and using entities in Web pages. While I'm loath to view this as a vindication of "the Semantic Web" (mainly because this has become an overloaded phrase that people elect to interpret as suits them), the technical capability to get at entities is clearly here. The question is -- what is the killer application? Finally, it’s nice to see that recommendation systems are becoming a major topic of focus at WWW. This paper was a personal favorite: Build Your Own Music Recommender by Modeling Internet Radio Streams, Natalie Aizenberg, Yehuda Koren, Oren Somekh.

In keeping with tradition, Google was a major supporter, sponsoring the conference, the Best Paper Award (Counting beyond a Yottabyte, or how SPARQL 1.1 Property Paths will prevent adoption of the standard, by Marcelo Arenas, Sebastián Conca and Jorge Pérez) and four PhD student travel grants. We chatted with hundreds of attendees who stopped by the Google booth to see demos of the latest Google product and research developments (see full schedule of booth talks).


Googlers were also active members of the vibrant research community at WWW:

David Assouline delivered the keynote for the Demo Track -- to a standing-room-only crowd -- on the Google Art Project, which uses a combination of various Google technologies and expert information provided by our museum partners to create a unique online art experience. Googler Alon Halevy served as a program committee member. Googlers were also co-authors of the following papers:
Googlers co-organized three workshops:
Additionally, a Googler led a tutorial:
Googlers presented a poster:
  • Google Image Swirl by Yushi Jing, Henry Rowley, Jingbin Wang, David Tsai, Chuck Rosenberg, Michele Covell (Googlers)
At the conference, we also paid homage to the founding of the World Wide Web and the strong community and enterprise it’s created since the 1990s, seen in the Euronews report: Web inventor Tim Berners-Lee on imagining worlds. Through our products and support of WWW in 2013, we look forward to continuing to nurture the world wide web’s open ecosystem of knowledge, innovation and progress.

Add Research at Google to your circles on G+ to learn more about our academic conference involvement, view pictures from events, and hear about upcoming programming and presence at conferences.


Video Stabilization on YouTube



One thing we have been working on within Research at Google is developing methods for making casual videos look more professional, thereby providing users with a better viewing experience. Professional videos have several characteristics that differentiate them from casually shot videos. For example, in order to tell a story, cinematographers carefully control lighting and exposure and use specialized equipment to plan camera movement.

We have developed a technique that mimics professional camera moves and applies them to videos recorded by hand-held devices. Cinematographers use specialized equipment such as tripods and dollies to plan their camera paths and hold them steady. In contrast, think of a video you shot using a mobile phone camera. How steady was your hand and were you able to anticipate an interesting moment and smoothly pan the camera to capture that moment? To bridge these differences, we propose an algorithm that automatically determines the best camera path and recasts the video as if it were filmed using stabilization equipment. Specifically, we divide the original, shaky camera path into a set of segments, each approximated by either a constant, linear or parabolic motion of the camera. Our optimization finds the best of all possible partitions using a computationally efficient and stable algorithm. For details, check out our earlier blog post or read our paper, Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths, published in IEEE CVPR 2011.
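One way to see the three segment types is through finite differences: a constant path has zero first difference, a linear path has zero second difference, and a parabolic path has zero third difference. The sketch below is not the published algorithm; it only measures the L1 norms of those differences for a one-dimensional path and labels the path accordingly. The function names and the one-dimensional setting are assumptions for illustration.

```python
def diff(path):
    """First finite difference of a sequence of camera positions."""
    return [b - a for a, b in zip(path, path[1:])]


def l1(values):
    """L1 norm: the sum of absolute values."""
    return sum(abs(v) for v in values)


def derivative_norms(path):
    """L1 norms of the first, second and third finite differences."""
    d1 = diff(path)
    d2 = diff(d1)
    d3 = diff(d2)
    return l1(d1), l1(d2), l1(d3)


def classify(path, tol=1e-9):
    """Label a path segment by the lowest-order difference that vanishes."""
    n1, n2, n3 = derivative_norms(path)
    if n1 <= tol:
        return "constant"   # camera stays put
    if n2 <= tol:
        return "linear"     # constant-velocity pan
    if n3 <= tol:
        return "parabolic"  # constant-acceleration move
    return "other"          # jittery, hand-held style motion
```

For example, `classify([0, 2, 4, 6, 8])` labels a steady pan as linear, while a zig-zag path such as `[0, 1, 0, 1, 0, 1]` has large norms at every order and falls into the "other" bucket.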

The next time you upload your videos to YouTube, try stabilizing them by going to the YouTube editor or directly from the video manager by clicking on Edit->Enhancements. For even more convenience, YouTube will automatically detect if your video needs stabilization and offer to do it for you. Many videos on YouTube have already been enhanced using this technology.

More recently, we have been working on a related problem common in videos shot from mobile phones. The camera sensors in these phones contain what is known as an electronic rolling shutter. When taking a picture with a rolling shutter camera, the image is not captured instantaneously. Instead, the camera captures the image one row of pixels at a time, with a small delay when going from one row to the next. Consequently, if the camera moves during capture, it will cause image distortions ranging from shear in the case of low-frequency motions (for instance an image captured from a driving car) to wobbly distortions in the case of high-frequency perturbations (think of a person walking while recording video). These distortions are especially noticeable in videos where the camera shake is independent across frames. For example, take a look at the video below.


Original video with rolling shutter distortions
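The shear effect can be illustrated with a small simulation. The sketch below assumes a static scene, a camera panning horizontally at constant speed, and a fixed read-out delay per row; the function name, the parameters and the numbers are all illustrative and are not taken from the paper.

```python
def rolling_shutter_x(x_true, row, velocity_px_per_s, row_delay_s):
    """Observed column of a static scene point while the camera pans.

    Row `row` is read out row * row_delay_s seconds after row 0. By then the
    camera has moved velocity * t pixels, so the scene appears shifted the
    other way by the same amount.
    """
    t = row * row_delay_s
    return x_true - velocity_px_per_s * t


# A vertical pole at column 100, imaged over five rows.
observed = [
    rolling_shutter_x(100.0, row, velocity_px_per_s=2000.0, row_delay_s=0.001)
    for row in range(5)
]

# With a global shutter (no per-row delay) the pole stays vertical.
global_shutter = [
    rolling_shutter_x(100.0, row, velocity_px_per_s=2000.0, row_delay_s=0.0)
    for row in range(5)
]
```

With these numbers each successive row is displaced by about two more pixels, so the vertical pole is rendered as a slanted line. If the velocity changed from row to row instead of staying constant, the same mechanism would produce the wobble described above.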


In our recent paper titled Calibration-Free Rolling Shutter Removal, which was awarded the best paper at IEEE ICCP 2012, we demonstrate a solution to correct these rolling shutter distortions in videos. A significant feature of our approach is that it does not require any knowledge of the camera used to shoot the video. The time delay in capturing two consecutive rows that we mention above is in fact different for every camera and affects the extent of distortions. Knowing this delay parameter can be useful, but it is difficult to obtain or estimate via calibration. Imagine a video that is already uploaded to YouTube -- it will be challenging to obtain this parameter! Instead, we show that just the visual data in the video has enough information to appropriately describe and compensate for the distortions caused by the camera motion, even in the presence of a rolling shutter. For more information, see the narrated video description of our paper.

This technique is already integrated with the YouTube stabilizer. Starting today, if you stabilize a video from a mobile phone or other rolling shutter cameras, we will also automatically compensate for rolling shutter distortions. To see our technique in action, check out the video below, obtained after applying rolling shutter compensation and stabilization to the one above.


After stabilization and rolling shutter removal



An Experiment in Music and Crowd-Sourcing



The Bodleian Library is the main research library at the University of Oxford. It is also one of the oldest libraries in the world, dating back to the 14th century. But the staff of the Bodleian operates very much in the 21st century, using the latest technology to solve their unique problems.

A few years ago, the library acquired a set of 4,000 popular piano pieces from the mid-Victorian period. There’s very little information available on these pieces, so Bodleian staff decided to use crowdsourcing to collect information on this corpus of music. Through a Google Focused Award, they have digitized and made the entire set of music available online.

By visiting the What’s-the-Score website, which opened yesterday, ‘citizen librarians’ can help by describing the scores and contributing to the creation of an online catalogue. They can also include links to audio or video recordings. This is the first time the Bodleian has used this approach to collect catalogue information. Typically, a large group of researchers is required to find this information.

Martin Holmes, Alfred Brendel Curator of Music at the Bodleian Libraries, commented: ‘In making the scores available online, they will not only be accessible for academic study and research but will also be there to enjoy for anyone who is interested in various aspects of Victorian music, culture and society.’

The Bodleian Library is one of dozens of recipients of Google Focused Awards. These awards are for research in areas of study that are of key interest to Google as well as the research community. These unrestricted gifts are for two to three years, and the recipients have the advantage of access to Google tools, technologies, and expertise. The Bodleian’s experiment in crowdsourcing to build up data on a specialized collection is timely and interesting, given the number of such collections becoming available on the web.


From Open Research to Open Flow



Did you know Open Flow has its roots in academia? Back in May 2006 Vint Cerf was visiting Stanford to deliver an invited lecture. Following the talk he met with Stanford Professor Nick McKeown and learned about the Clean Slate Internet project. Nick was looking for support and Google’s involvement in what he described as a lab for “radical new ideas in networking”. Vint felt the program looked “intellectually healthy but might be a very long term matter to bear fruit”. Vint explained that for us to get involved we would want to have Google engineers excited and engaged.

Professor McKeown met with Google networking and infrastructure experts to present his ideas. Everybody knew that Software-Defined Networking (SDN) had great promise but the Open Flow effort seemed a bit ambitious for a professor and a couple of grad students. Googlers Dave Presotto and Stephen Stuart agreed to take a chance on it and sponsor a small research grant to fund another student and to get Google engaged. As Google and the industry got more involved, Open Flow began to gain traction. In June 2008 Google provided another grant to support more students and in late 2009 Google joined the initiative’s consortium with other industry members.

Google engineers Stephen Stuart and Jim Wanderer worked closely with Stanford and led Google’s Open Flow development and deployment. In 2011 the Open Networking Foundation (ONF) was formed to accelerate Software-Defined Networking standards and foster a robust market and ecosystem. Google’s own Urs Hölzle became ONF’s first President and Chairman of the Board.

Google’s involvement in and support of this academic effort was a key factor in the speedy development and deployment of Open Flow and SDN - technology that made it from a university research project to running Google’s WAN in record time.
