If you squint the right way, you will notice that graphs are everywhere. For example, social networks, popularized by Web 2.0, are graphs that describe relationships among people. Transportation routes create a graph of physical connections among geographical locations. Paths of disease outbreaks form a graph, as do games among soccer teams, computer network topologies, and citations among scientific papers. Perhaps the most pervasive graph is the web itself, where documents are vertices and links are edges. Mining the web has become an important branch of information technology, and at least one major Internet company has been founded upon this graph.
Despite differences in structure and origin, many graphs out there have two things in common: each of them keeps growing in size, and there is a seemingly endless number of facts and details people would like to know about each one. Take, for example, geographic locations. A relatively simple analysis of a standard map (a graph!) can provide the shortest route between two cities. But progressively more sophisticated analysis could be applied to richer information such as speed limits, expected traffic jams, roadworks and even weather conditions. In addition to the shortest route, measured as sheer distance, you could learn about the most scenic route, or the most fuel-efficient one, or the one which has the most rest areas. All these options, and more, can all be extracted from the graph and made useful — provided you have the right tools and inputs. The web graph is similar. The web contains billions of documents, and that number increases daily. To help you find what you need from that vast amount of information, Google extracts more than 200 signals from the web graph, ranging from the language of a webpage to the number and quality of other pages pointing to it.
In order to achieve that, we have created scalable infrastructure, named Pregel, to mine a wide range of graphs. In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own and its outgoing edges' states, and mutate the graph's topology (experts in parallel processing will recognize that the Bulk Synchronous Parallel Model inspired Pregel).
Currently, Pregel scales to billions of vertices and edges, but this limit will keep expanding. Pregel's applicability is harder to quantify, but so far we haven't come across a type of graph or a practical graph computing problem which is not solvable with Pregel. It computes over large graphs much faster than alternatives, and the application programming interface is easy to use. Implementing PageRank, for example, takes only about 15 lines of code. Developers of dozens of Pregel applications within Google have found that "thinking like a vertex," which is the essence of programming in Pregel, is intuitive.
We've been using Pregel internally for a while now, but we are beginning to share information about it outside of Google. Greg Malewicz will be speaking at the joint industrial track between ACM PODC and ACM SPAA this August on the very subject. In case you aren't able to join us there, here's a spoiler: The seven bridges of Königsberg — inspiration for Leonhard Euler's famous theorem that established the basics of graph theory — spanned the Pregel river.
20 comments:
Wow. That is strange timing. We just pushed our new Spinn3r Social Media Rank this morning:
http://spinn3r.com/rank
Obviously we love graphs too ;)
is shortest path distance something your looking at? what kind of edge weights would you put for this?
A nice Exalead application which considers the web as a graph : Constellations
Well, will be very interesting to see this unfold. Do I read you correctly that Pregel is more geared towards OLAP than runtime transactional storage?
For those of you who can't wait, there's always the Neo4j open source / commercial graph database, which is available right now:
http://neo4j.org
(Disclaimer: I'm involved.)
-EE
I'm curious, Googlers, are you planing on releasing this to the world as either open source software, or a subscription service? Also, how about showing us that PageRank (or any other) example?
Can it do anything interesting with RDF graphs? either as a naive "object x has property p with value y" structure, or including information about which document/datasource claims that x-property-y triple?
I'd be very interesting in trying out this software, is it going to be opened up to the "outside" for use?
I believe that the path with the most resting places would be a very long path.
It sounds like Pregel seems a computing framework based on dynamic programming techniques for the graph operations. BTW, Does it have own structure, not uses BigTable?
I have been working with Dryad(and DryadLINQ) which is also a distributed graph processing engine from Microsoft for sometimes.
I would really like to see a code sample showing the programming model of Pregel.
How do the edges (communication channels) are implemented? Files, TCP pipes?
Thank you for Info.
Where can i get some good info on Pregel?
Wow this is the first real evidence I have seen of Google being EVIL.
If you have a look at www.feedbackloop.com.au
You will see it is basically the exact same idea.
It was launched in February and submitted to the Firefox plugin area.
Went straight into experimental. Was not fully approved. (Common for new Firefox extensions)
Looking back now, the Firefox plugin has now been removed from the extensions database, and Google comes out with exactly the same idea, and it goes straight into Release, and is promoted as a recommended Plugin.
Sounds terribly evil to me.
James
Hopefully, Google can provide the information from these graphs in a form understandable to ordinary people. The data can be quite helpful for those following the web. However, they aren't going to use it if they can't understand a single bit of it.
Just FYI, Apache Hama made the BSP package on Hadoop -- http://docs.google.com/present/view?id=dc8qtchr_1f44rjccd
How does Pregel avoid infinite looping?
There could be a bug or specific conditions in which a vertex never sets itself as "done".
From the initial paper, it sounds like a simple specialization of MapReduce, although I'm curious about optimization which are hinted in there.
Will the actual Pregel API be disclosed in a whitepaper?
Also, I'm wondering whether it was actually implemented on MapReduce or some other infrastructure.
I'd be very interested in more details about Pragel. LIke others I'm interested in details regarding the future of Pragel as well as some examples of the API it exposes.
I'm quite interested in this topic. Can you post the ACM paper on this topic?
Chris
take a look to http://incubator.apache.org/hama/
a distributed scientific package on Hadoop for massive matrix and graph data based on Bulk Synchronous Parallel Model.
Thanks redbochs!
Post a Comment