BlogMatcher

BlogMatcher FAQ

What is BlogMatcher?

BlogMatcher is a program that helps people find like-minded weblogs that match their interests. Given the URL of a weblog (called the "Reference Blog"), the system finds other blogs that appear to discuss similar topics.

How do I use BlogMatcher?

Go to
Enter the URL to your weblog, or a weblog you generally find interesting.
Enjoy!

How does it work?

It's all really simple. The basic premise of BlogMatcher is that two blogs that link to the same sites share some sort of topical commonality. If you link to an article in your blog, then chances are you'll be interested in reading other people's opinions about the same article.
Here's a case in point. If I do a search on my blog, it will find other blogs that link to sites like slashdot.org, kuro5hin.org, wired.com, Safari and guardian.co.uk. As it turns out, people who also link to those sites tend to be technically oriented, Mac-using liberals, exactly the kind of people I like to hang out with. BlogMatcher seems to work because when it comes to blogs, you are who you link to.
But the really cool thing is, you never really know how it works. There are so many factors that you'll probably be surprised at the results. In most cases, you'll probably be pleasantly surprised.
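
To make the shared-links premise concrete, here's a rough Python sketch. It is not the actual BlogMatcher code (that's written in C++), and the blog names, the `build_link_index` and `shared_link_counts` helpers, and the plain count of shared links are all made up for illustration.

```python
from collections import defaultdict


def build_link_index(blogs):
    """Map each outbound link to the set of blogs that link to it."""
    index = defaultdict(set)
    for blog, links in blogs.items():
        for link in links:
            index[link].add(blog)
    return index


def shared_link_counts(reference, blogs, index):
    """Count how many links each other blog shares with the reference blog."""
    counts = defaultdict(int)
    for link in blogs[reference]:
        for other in index.get(link, ()):
            if other != reference:
                counts[other] += 1
    return dict(counts)


# Hypothetical data: each blog and the sites it links to.
blogs = {
    "ref.example": {"slashdot.org", "wired.com", "guardian.co.uk"},
    "a.example": {"slashdot.org", "wired.com"},
    "b.example": {"guardian.co.uk"},
    "c.example": {"example.org"},
}

index = build_link_index(blogs)
matches = shared_link_counts("ref.example", blogs, index)
```

Here `a.example` shares two links with the reference blog, `b.example` shares one, and `c.example` shares none, so it never shows up in `matches`.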

How does the scoring work?

Here are the basics:

Deep links score higher
Common links (links you share with many sites) score lower
Uncommon links score higher
Scores are generated dynamically for each search. The score for the same link will differ depending on the reference blog, and even on when you run the search.
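
One way to picture those rules is the Python sketch below. The real formula isn't spelled out here, so the log-based rarity weight, the `deep_bonus` multiplier, and the `is_deep` and `link_weight` names are assumptions for illustration only.

```python
import math
from urllib.parse import urlparse


def is_deep(url):
    """Treat a URL with a non-root path or a query string as a deep link."""
    parts = urlparse(url)
    return parts.path not in ("", "/") or bool(parts.query)


def link_weight(url, blogs_linking, total_blogs, deep_bonus=2.0):
    """Weight one shared link: rarer links and deep links count for more.

    blogs_linking is how many indexed blogs link to this URL right now,
    so the weight changes as the index changes.
    """
    rarity = math.log(total_blogs / max(blogs_linking, 1)) + 1.0
    return rarity * (deep_bonus if is_deep(url) else 1.0)


# Hypothetical numbers: a site-wide link shared by half the index
# versus a specific article that only a handful of blogs link to.
common = link_weight("http://slashdot.org/", 500, 1000)
rare_deep = link_weight("http://wired.com/news/story.html", 5, 1000)
```

Because the weight is computed from the current link counts at search time, the same link can score differently from one search to the next.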

How did you do it?

This was actually a weekend project (well, it's almost 5am on Monday morning as I write this). I came up with the idea on Friday and wrote a prototype on Friday evening. On Saturday, I improved the indexing portion of the system. On Sunday, I spent the afternoon trying to integrate with Postgres, but after getting horrible performance for some unknown reason, I started rewriting chunks of it in C at around midnight.
The basic search algorithm was really simple, but it took quite a few tries to get decent performance. The first PHP version took about 3 seconds to search through 2000 blogs. The first C version did it in around 1 second. The current C program takes around 0.3 seconds for 2900 blogs.
Update 4/28/03: Version 3.0 uses a completely rewritten search/matching engine. The engine, written in C++, is basically a highly specialized server daemon that keeps all the data in memory. An average search takes somewhere in the 0.01 - 0.05 second range, and generating the results takes around another 0.1 - 0.3 seconds. Version 3.0 also uses a new link-scoring algorithm.

The indexer...

The indexer runs every 4 hours (0,4,8,... CDT), and starts by fetching the changes.xml from weblogs.com. It then downloads all of the recently modified weblogs, and indexes them. Since the changes list goes back about 3 hours, the indexer catches most recently modified blogs, but not all. But over the course of a few days, hopefully it'll get most of them.
I'd index more often, except I'm a little wary of exceeding my bandwidth limit. We'll see how things go and I might start indexing more often.
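
For the curious, the first step of an indexer like this could look something like the Python sketch below. The sample XML is made up, and the element and attribute names (`weblog`, `url`, and `when` as seconds since the ping) are my assumption about the changes.xml layout, so check the real file before relying on them. The `recently_changed` helper is hypothetical too.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed changes.xml layout.
SAMPLE = """<weblogUpdates version="1" updated="Mon, 28 Apr 2003 09:00:00 GMT" count="2">
  <weblog name="Example Blog" url="http://blog.example/" when="120"/>
  <weblog name="Another Blog" url="http://another.example/" when="4000"/>
</weblogUpdates>"""


def recently_changed(xml_text, max_age_seconds):
    """Return URLs of blogs that pinged within the last max_age_seconds."""
    root = ET.fromstring(xml_text)
    urls = []
    for entry in root.iter("weblog"):
        url = entry.get("url")
        age = int(entry.get("when", "0"))  # assumed: seconds since the ping
        if url and age <= max_age_seconds:
            urls.append(url)
    return urls


last_hour = recently_changed(SAMPLE, max_age_seconds=60 * 60)
last_three_hours = recently_changed(SAMPLE, max_age_seconds=3 * 60 * 60)
```

Each URL returned would then be downloaded and indexed. The sketch parses a string rather than fetching anything, so it runs without touching the network.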

Can you index my blog?

Ping weblogs.com when you update your blog, and it will be indexed sooner or later.
Update 4/27/03: You can also use this form to have your blog (re)indexed. If your blog is being indexed for the first time, it will not show up until the next time the search engine is restarted.
Update 1/25/05: The indexer now fetches data from weblogs.com's shortChanges.xml which only contains 5 minutes' worth of updates. However, the indexer will run much more frequently than before to compensate. It still won't get every blog that pings weblogs.com, but it'll get a lot of them, sooner or later.

What can I do to improve my results?

Because BlogMatcher primarily uses links, the best thing to do is to update your blog often and add links. Don't just link to sites you visit often; include links to specific articles or stories that you find interesting. Linking to other like-minded blogs could help too.

Where are my Blogroll links?

If you're using the JavaScript-based Blogroll, it won't be indexed because the current indexing agent only downloads the top HTML file (and the indexer doesn't know anything about JavaScript). If you have links in your blog posts themselves, that should be enough for BlogMatcher to work off of.
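
Here's a small Python sketch of why that happens. It isn't the real indexing agent; the `LinkCollector` class, the `extract_links` helper, and the sample page are hypothetical. An indexer that only reads anchors out of the downloaded HTML sees the link in the post, but a blogroll pulled in by a script tag contributes nothing because the script is never run.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags found in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return every anchor href present in the HTML as downloaded."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical page: one link in a post, plus a script-based blogroll.
PAGE = """<html><body>
<p>Read <a href="http://wired.com/news/story.html">this story</a>.</p>
<script src="http://blogroll.example/roll.js"></script>
</body></html>"""

links = extract_links(PAGE)
```

Only the link inside the post ends up in `links`; whatever the blogroll script would have written into the page is invisible to this kind of parser.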

Is there anything similar out there?

I've been informed of a couple of vaguely similar sites:

Mark Pilgrim's NewDoor
And http://yuntis-usb.ecsl.cs.sunysb.edu/help/queries/#SimilarLists
Both are pretty different from BlogMatcher though. I also have a sneaking suspicion that BlogRolling.com is thinking about doing something similar (they own blogmatch.com).

This is cool, but... why?

I wanted to do mankind a service.
Okay, the truth is, I dream of working for Google, and I wanted to work on my data mining skillz. I also wanted to dig through the overwhelming number of blogs that are out there and find ones that I'd like without wasting time on those that I wouldn't be interested in.
Update 1/25/05: I'll be graduating this June, and am looking for a job. If you think you might have a fun job for me, please get in touch...

55953 blogs (1394 MB) indexed. Index last updated: 05/23/2005 03:42:04
BlogMatcher v3.0 - Brought to you by Ryo Chijiiwa.