Thursday, February 21, 2008

Mining "Stories" from Blogs

I just finished reading a paper called Mining Blog Stories Using Community-Based and Temporal Clustering, which I found very interesting because it is closely related to the work that we have been doing. As I read, I had the following thoughts and questions.

  • It is refreshing to discover others who are pursuing research paths similar to ours.
  • Their work gives more fuel to the approaches we have been taking.
  • Some of the formalism is elegant and useful (particularly for talking about blogs and entries); however, some of it gets cumbersome (the lookup table is definitely necessary).
  • Using the Lucene library to index the data is an idea that we could consider (we have been storing the data in a MySQL database of our own making).
  • I'd be curious to know how many degrees of separation they crawled from their seed blogs. I would guess that they did not get too far, since the blogs in our study appear to be less sparse on average. They reported over 2,000 blogs having 1 million entries (on average, 50 entries per blog per month). In Social Capital in the Blogosphere, we retrieved blog content just two degrees away from Scoble (our single seed blog) and obtained over 38,000 blogs having 13 million entries (on average, 28.5 entries per blog per month).
  • How did they perform blog entity resolution? (here is an approach that we used)
(Note: Thanks to both Christophe, who sent me the paper, and Robbie Haertel, who sent it to Christophe. Social capital in action. ;) )
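
As a rough sanity check on the two sets of figures quoted in the list above, the totals and monthly averages can be turned into entries per blog and an implied collection window. This is only a back-of-the-envelope sketch: it treats "over 2,000" and "over 38,000" as exact counts, which is an assumption, and the function names are made up for illustration.

```python
def entries_per_blog(total_entries, total_blogs):
    """Average number of entries per blog over the whole data set."""
    return total_entries / total_blogs


def implied_months(total_entries, total_blogs, entries_per_blog_per_month):
    """Months of data implied by the totals and the quoted monthly average."""
    return entries_per_blog(total_entries, total_blogs) / entries_per_blog_per_month


# Figures quoted from the paper (blog count rounded down to 2,000; an assumption).
paper_months = implied_months(1_000_000, 2_000, 50)

# Figures quoted from Social Capital in the Blogosphere (rounded down to 38,000).
our_months = implied_months(13_000_000, 38_000, 28.5)

print(f"paper: {entries_per_blog(1_000_000, 2_000):.0f} entries per blog, "
      f"about {paper_months:.1f} months")
print(f"ours:  {entries_per_blog(13_000_000, 38_000):.0f} entries per blog, "
      f"about {our_months:.1f} months")
```

Under those rounding assumptions, the two data sets appear to cover windows of roughly similar length (about ten months versus about a year), so the difference in size would come mostly from the number of blogs reached.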

1 comment:

SteveG said...

We've had this same kind of overload from reading blogs and decided to graph the paths between blog references. A map makes a big difference.

Take a look and leave us some feedback.