Ben Allfree :: Painless Programming

Guaranteed results for Ruby on Rails, PHP, Facebook, MySpace, and more

Web harvesting

May 30th, 2008

I was talking to a client today who has business plans that revolve around harvesting data from web sites. How do you do it? Here’s my step-by-step process:

1) Crawl for links
2) Fetch link content
3) Index link content
4) Transform link content
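
To make the four phases concrete, here is a minimal Ruby sketch. Everything in it is an assumption for illustration: FAKE_WEB is a hypothetical in-memory stand-in for real HTTP requests, the method names are made up, and "indexing" is reduced to recording each page's size. It only shows the shape of the pipeline, not a production harvester.

```ruby
require "set"

# Hypothetical stand-in for the web: URL => HTML.
# A real harvester would make HTTP requests instead.
FAKE_WEB = {
  "http://example.com/"  => '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>',
  "http://example.com/a" => '<h1>Page A</h1> <a href="http://example.com/b">B</a>',
  "http://example.com/b" => "<h1>Page B</h1>"
}.freeze

# Phase 1: crawl for links, breadth-first from a seed URL.
# Crawling has to read pages to discover links, but it keeps only the URLs.
def crawl(seed, web)
  seen = Set.new([seed])
  queue = [seed]
  until queue.empty?
    url = queue.shift
    web.fetch(url, "").scan(/href="([^"]+)"/).flatten.each do |link|
      queue << link if seen.add?(link) # add? returns nil for known links
    end
  end
  seen.to_a
end

# Phase 2: fetch the content behind each link and keep it unmodified.
def fetch_all(urls, web)
  urls.each_with_object({}) { |url, raw| raw[url] = web[url] if web.key?(url) }
end

# Phase 3: index the fetched content (here: just its size per URL).
def index(raw)
  raw.map { |url, html| { url: url, bytes: html.bytesize } }
end

# Phase 4: transform raw content into the data you actually want
# (here: the first <h1> of each page, skipping pages without one).
def transform(raw)
  raw.each_with_object({}) do |(url, html), out|
    title = html[/<h1>(.*?)<\/h1>/, 1]
    out[url] = title if title
  end
end

urls   = crawl("http://example.com/", FAKE_WEB)
raw    = fetch_all(urls, FAKE_WEB)
idx    = index(raw)
titles = transform(raw)
```

Each phase takes only the output of the one before it, so any of them can be swapped out or run again without touching the others.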

If you design your system as four separate phases, each phase can be scaled on its own. You can also run or re-run any phase independently, which turns out to be really convenient because page formats change all the time. Always save the original fetched content so you can go back and re-process it as necessary.
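
Here is a small Ruby sketch of why saving the original content pays off. The RawStore class, the file layout, and the two transform versions are all hypothetical, and a temp directory stands in for wherever you would really keep the pages. The point is only that a newer transform can be run against the saved copy without fetching the page again.

```ruby
require "digest/sha1"
require "fileutils"
require "tmpdir"

# Hypothetical raw-content store: one file per URL, named by a hash of the URL.
class RawStore
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(dir)
  end

  # Save the page exactly as fetched.
  def save(url, html)
    File.write(path_for(url), html)
  end

  # Read the saved copy back; no network access involved.
  def read(url)
    File.read(path_for(url))
  end

  private

  def path_for(url)
    File.join(@dir, Digest::SHA1.hexdigest(url) + ".html")
  end
end

store = RawStore.new(Dir.mktmpdir("harvest"))
url = "http://example.com/item"

# Pretend this page was fetched after the site changed its markup.
store.save(url, '<div id="cost">$12</div>')

# Version 1 of the transform only knows the old page format.
v1 = ->(html) { html[/<span class="price">(.*?)<\/span>/, 1] }

# Version 2 is written later and also handles the new format.
v2 = ->(html) { v1.call(html) || html[/<div id="cost">(.*?)<\/div>/, 1] }

first_pass  = v1.call(store.read(url)) # misses the price
second_pass = v2.call(store.read(url)) # re-run on the saved copy finds it
```

Because the fetch phase wrote the page to disk untouched, fixing the transform is just a matter of re-running it over the stored files.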

I prefer Ruby, ActiveRecord, and MySQL for web data mining. What about you?

The topic is too large to discuss in depth here, but I thought I would throw out a teaser for anyone who is interested in this kind of thing.
