One of the classic Hadoop MapReduce tutorials counts words in a text corpus. Word counts are a great way to teach the fundamentals of MapReduce, and there are plenty of free books on Project Gutenberg.
To follow along at home, check out CouchRest from GitHub:
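For readers new to the idea, here is a minimal in-memory sketch of word counting as a map step and a reduce step. It is plain Ruby, not the CouchRest example itself, and the method names (`map_words`, `reduce_counts`, `word_count`) and the sample sentences are made up for illustration.

```ruby
# Map step: emit a [word, 1] pair for every word in one document.
def map_words(text)
  text.downcase.scan(/[a-z']+/).map { |word| [word, 1] }
end

# Reduce step: sum all the values emitted for a single key.
def reduce_counts(values)
  values.sum
end

# Run map over every document, group the pairs by word, then reduce each group.
def word_count(documents)
  pairs = documents.flat_map { |text| map_words(text) }
  grouped = pairs.group_by { |word, _count| word }
  grouped.transform_values { |group| reduce_counts(group.map { |_word, count| count }) }
end

docs = ["The whale, the sea.", "The sea was calm."]
counts = word_count(docs)
puts counts["the"]
```

A real MapReduce system runs the same two steps, but spreads the map calls across many documents and stores the reduced results so they do not have to be recomputed for every query.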
git clone git://github.com/jchris/couchrest.git
I’ve included the example code as well as three books in the “examples” directory.
The short version is:
cd couchrest
ruby examples/word_count/word_count.rb
Follow the instructions printed by the script to view the results in your browser. Or, go ahead and trigger the next step with:
ruby examples/word_count/word_count_query.rb
The initial reduction can take about 20 minutes to run on an average MacBook, so this Ruby script will probably time out and fail the first time. Go get some coffee. When you come back, run it again. Once the reduce has run, queries should be nearly instantaneous.
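If you would rather not rerun the script by hand, a retry wrapper along these lines is one option. This is only a sketch: `with_retries`, `attempts`, and `wait` are hypothetical names, not part of CouchRest, and it assumes the timeout surfaces as `Timeout::Error`. Your HTTP client may raise a different exception, so adjust the `rescue` clause to match. The block at the bottom fakes a slow view rather than talking to CouchDB.

```ruby
require "timeout"

# Call the block, retrying when it times out.
# `attempts` and `wait` are illustrative parameters, not CouchRest options.
def with_retries(attempts: 3, wait: 60)
  tries = 0
  begin
    tries += 1
    yield
  rescue Timeout::Error
    raise if tries >= attempts # give up and re-raise the last timeout
    sleep wait
    retry
  end
end

# Simulated query: times out twice while the view builds, then succeeds.
calls = 0
result = with_retries(attempts: 3, wait: 0) do
  calls += 1
  raise Timeout::Error, "view still building" if calls < 3
  "rows"
end
puts result
```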
The code teaches the fundamentals of CouchDB view functions, collation order, and reduce query params, and provides some helpful output while doing so.
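To make the collation and query-param ideas concrete, here is a small sketch that builds the query string for a grouped reduce over one word. It assumes the view emits keys of the form `[word, title]`, which is my assumption for illustration rather than something stated above, and `word_range_query` is a hypothetical helper name. `startkey`, `endkey`, and `group_level` are CouchDB's own view parameters, and their key values must be JSON-encoded.

```ruby
require "json"
require "uri"

# Build the query string for a grouped reduce over every key that starts
# with `word`. Assumes (hypothetically) that the view emits [word, title] keys.
def word_range_query(word, group_level: 2)
  URI.encode_www_form(
    "startkey"    => JSON.generate([word]),
    "endkey"      => JSON.generate([word, {}]),
    "group_level" => group_level
  )
end

puts word_range_query("flight")
```

The range works because of collation order: CouchDB compares arrays element by element, a shorter array sorts before a longer one with the same prefix, and objects sort after strings. So `["flight"]` sorts before every `["flight", title]` key, and `["flight", {}]` sorts after all of them.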
The upshot is that you can now query for the count of any word, in one of the three indexed books, or in all three. And those queries are fast!
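Here is a rough in-memory illustration of how one view can answer both the "one book" and the "all three books" questions. It assumes rows keyed by `[word, title]` with a value of 1, and the rows, titles, and the `grouped_counts` helper are invented for the example. It mimics the effect of `group_level` without contacting CouchDB.

```ruby
# Rows as an unreduced view might return them: key is [word, title], value is 1.
# The titles and rows are made up for illustration.
rows = [
  [["sea", "Book A"], 1],
  [["sea", "Book A"], 1],
  [["sea", "Book B"], 1],
  [["whale", "Book A"], 1]
]

# Sum values after truncating each key to its first `group_level` elements,
# which mimics what CouchDB's group_level parameter does for array keys.
def grouped_counts(rows, group_level)
  rows.each_with_object(Hash.new(0)) do |(key, value), totals|
    totals[key.first(group_level)] += value
  end
end

per_book  = grouped_counts(rows, 2) # one total per [word, title]
all_books = grouped_counts(rows, 1) # one total per [word]

puts per_book[["sea", "Book A"]]
puts all_books[["sea"]]
```

Grouping at level 2 keeps the book title in the key, so each book gets its own count. Grouping at level 1 drops the title, so the counts for a word are summed across every book.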
6 comments on CouchDB MapReduce example: word count
..oh.. I do has them.
Thanks for making this example.
[debug] [<0.51.0>] Spawning new update process for view group _design/word_count in database word-count-example.
[info] [<0.55.0>] Spawning new javascript instance.
[info] [<0.55.0>] HTTP Error (code 500): {'EXIT', {noproc, {gen_server, call, [<0.62.0>,{pread_bin,9073960}]}}}
[info] [<0.55.0>] 127.0.0.1 - - "GET /word-count-example/_view/word_count/count" 500
Any ideas? I tried to get more information about the error, but I’m not even sure where is the best place to start.
whoops
You need to use the public clone URL:
a lolcat for your troubles
(someone has to use the textile functionality)
thnks, fixing!
spgarbet$ git clone git://github.com/jchris/couchrest.git
Initialized empty Git repository in /Users/spgarbet/Projects/couchrest/.git/
github.com[0: 65.74.177.129]: errno=Operation timed out
fatal: unable to connect a socket (Operation timed out)
fetch-pack from 'git://github.com/jchris/couchrest.git' failed.
Stupid firewall (see previous post). Every time we get a port open, it lasts about a week, then the firewall guys close it again.