#clojure logs

2008-04-30

13:55 Chouser: Here's a fun challenge (I'll try to do it myself if I get some time later)...
13:56 Chouser: Implement MapReduce (and of course the standard wordcount example) for SMP (shared memory, not a cluster) in Clojure.
13:57 Chouser: I'm betting the MapReduce "framework" on top of agents is no more than 10 lines of code, and the wordcount example would be another 3.
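A minimal sketch of what Chouser's challenge might look like in modern Clojure (illustrative only — not the paste he shares later at 22:25; the names `map-reduce` and `word-count` are made up, and this naive version punts on the partitioning step rhickey raises later in the log):

```clojure
;; One agent per input doc; run mapf on each via send, await them
;; all, then merge the per-doc result maps with reducef.
(defn map-reduce [mapf reducef inputs]
  (let [agents (map agent inputs)]
    (doseq [a agents] (send a mapf))
    (apply await agents)
    (apply merge-with reducef (map deref agents))))

(defn word-count [docs]
  (map-reduce
    (fn [doc] (frequencies (re-seq #"\w+" doc)))
    +
    docs))

(word-count ["the cat" "the dog"])
;; => {"the" 2, "cat" 1, "dog" 1}  (key order may vary)
```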
13:57 rhickey: Chouser: link?
13:58 Chouser: to MapReduce?
13:59 rhickey: I've read the MapReduce papers, wondering if there was a benchmark thingy
13:59 rhickey: or implementation in another language...
13:59 Chouser: oh, no. I just made up the challenge because a friend was being impressed by this: http://labs.trolltech.com/blogs/category/labs/threads/qt-concurrent/
14:11 albino: Is that close to the widefinder description?
15:16 Chouser: is there some reason pmap manages its own thread pool instead of using agents?
15:17 Chouser: I'm not smart enough to know *how* it would use agents, I'm just assuming that's possible... somehow...
15:47 rhickey: Chouser: pmap may have predated agents
15:47 Chouser: ah, ok.
16:25 nsinghal: I have a question
16:25 nsinghal: http://paste.lisp.org/display/59989
16:28 Chouser: your initial example works for me, replacing (new Socket) with (throw (new Exception))
16:32 nsinghal: That's the question - that it works that way. See the example at the end. Why doesn't it work with a socket?
16:34 nsinghal: (import '(java.net Socket))
16:34 nsinghal: (defn connect [host port eatex]
16:34 nsinghal:   (try
16:34 nsinghal:     (new Socket host port)
16:34 nsinghal:     (catch Throwable ex
16:34 nsinghal:       (if eatex nil (throw ex)))))
16:34 nsinghal: (connect "127.0.0.1" 3432 true)
16:38 Chouser: oh, sorry, I missed that that was the point of the final example. I thought it had to do with the number of arguments.
16:54 Chouser: well, I'm stumped.
16:55 rhickey: probably a bug with local clearing prior to tail call - looking at it now
17:18 rhickey: nsinghal: fix is up
17:19 nsinghal: thx
17:31 MarkJP: how would I do a sleep in the first binding of a let...recur loop?
17:31 MarkJP: don't ask
17:32 MarkJP: I could have a flag I guess
17:42 Chouser: inside the body of the loop, or up at the top?
17:48 MarkJP: I wanted to do it at the top. For now I just added a new sleep flag to the let, so the recur passes in a false
17:48 MarkJP: the let binds it to true the first time
17:49 Chouser: (loop [_ (Thread.sleep 3) x [:a :b :c]] (prn x) (when x (recur _ (rest x))))
17:49 Chouser: (loop [_ (Thread.sleep 3) x [:a :b :c]] (prn x) (when x (recur _ (rest x))))
17:49 Chouser: oops, sorry for the double-post
17:50 MarkJP: cool, is that what _ does?
17:50 Chouser: ...and for the fact that it doesn't work.
17:50 Chouser: oh, no, it does work. (Thread.sleep 3000) is a bit more noticeable, though.
17:51 MarkJP: cool thanks
17:51 Chouser: No, _ is just a symbol. You could call it "blank" or "dummy" instead.
17:52 Chouser: then again, I don't see how that's better than just (do (Thread.sleep 3000) (loop ...))
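For the archive, here are both variants in modern Clojure syntax, where static-method calls are written `Thread/sleep` and the termination test uses `(seq x)` (today's `rest` returns an empty seq rather than nil):

```clojure
;; Sleep once before looping via a throwaway binding...
(loop [_ (Thread/sleep 30)       ; evaluated once, bound to nil
       x [:a :b :c]]
  (prn x)
  (when (seq x)
    (recur _ (rest x))))

;; ...or, as Chouser suggests, just sequence it with do:
(do (Thread/sleep 30)
    (loop [x [:a :b :c]]
      (prn x)
      (when (seq x)
        (recur (rest x)))))
```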
22:25 Chouser: so here's my attempt at map-reduce: http://paste.lisp.org/display/60005
22:26 Chouser: Also my first real attempt at using agents.
22:31 Chouser: I wonder if ref-toggle will be useful for solving other problems.
22:34 Chouser: ref-toggle, or the whole ugly mess?
22:35 Chouser: I don't really know how to comment code like this.
22:35 rhickey: Chouser: isn't there a partitioning between map and reduce, that makes each reduce job work on related data?
22:38 rhickey: I think the agents doing mapping and interacting with reducer is too coupled
22:39 Chouser: in map-reduce3?
22:39 rhickey: and 2
22:40 rhickey: number of agents should be independent of number of docs
22:40 Chouser: cool. Help me understand why. The number of agents is already independent of the number of processors -- so what should determine the number of agents?
22:41 rhickey: make a cycle of agents, distribute docs to each round robin...
22:41 rhickey: await them
22:41 rhickey: concat results, send to partitioner
22:41 rhickey: distribute partitioned results to agent with reduce function
22:41 Chouser: won't the thread pool cycle through the agents for me? Why should I write that logic again?
22:41 rhickey: await
22:41 rhickey: merge results
22:42 rhickey: more agents means more independent results to merge, should be a setting
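A hedged sketch of one stage of the pipeline rhickey describes: a fixed pool of agents, inputs dealt to them round robin, and per-agent results concatenated after an await. The name `pmap-stage` is invented for illustration:

```clojure
;; Deal inputs round robin to a fixed pool of num-agents agents,
;; each accumulating (f input) into a vector; await, then concat.
(defn pmap-stage [f num-agents inputs]
  (let [agents (vec (repeatedly num-agents #(agent [])))]
    (doseq [[i x] (map-indexed vector inputs)]
      (send (agents (mod i num-agents))
            (fn [acc] (conj acc (f x)))))
    (apply await agents)
    (mapcat deref agents)))
```

Note the result order groups by agent rather than following input order, which is fine here since the next stage partitions by key anyway.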
22:44 Chouser: every hash produced by a map run has to be merged, regardless of whether the agent producing it has produced other hashes already.
22:44 rhickey: would you have an agent per word?
22:45 Chouser: No, because there's no work to be done for a single word. I have one agent per parallelizable map run.
22:46 Chouser: The drawback of my solution is what, too much contention for the final-hash?
22:47 rhickey: I don't think you are capturing the mapreduce model
22:48 Chouser: hm, it's been a while since I've read the paper. Perhaps I should look at that again.
22:48 rhickey: partition input, divide among compute resources, partition intermediate results, divide reduction among compute resources, merge final result
22:48 Chouser: I see, and it's the partitioning at each step that I'm punting on.
22:49 rhickey: having a settable number of agents allows you to model compute resources, independent of how Clojure distributes to processors
22:50 rhickey: but it's likely numagents == numprocessors is optimal for this, since even in map they are doing combining (of counts)
22:53 rhickey: (assoc hash word (inc (get (hash word) 0))) more idiomatic
22:53 rhickey: oops: (assoc hash word (inc (get hash word 0)))
22:54 Chouser: I don't like inc. Another keyword to learn, when (+ 1 ...) is just as succinct (usually) and more obvious.
22:54 Chouser: (get hash word 0) is nice.
22:54 rhickey: the first partition is mechanical, the second logical
22:55 rhickey: inc may be faster
22:55 Chouser: hm.
22:59 rhickey: second partition can be made mechanical, use rem of hash of key to index reducing agents
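A tiny illustration of that mechanical second partition, with made-up names: hash each key modulo the number of reducing agents, so every pair for a given key is routed to the same reducer (Clojure's `mod` keeps the result non-negative even for negative hashes):

```clojure
;; Route a key to a reducer slot deterministically by its hash.
(defn reducer-index [k num-reducers]
  (mod (hash k) num-reducers))

;; Group intermediate key/val pairs by their reducer slot:
(group-by (fn [[k _]] (reducer-index k 3))
          [["the" 1] ["the" 2] ["cat" 1]])
;; all ["the" ...] pairs land in the same group
```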
22:59 Chouser: well, it was a lot of fun to write. I'll have to re-read the paper to understand the purpose of the partitioning.
23:00 rhickey: you need to end up with one result per key, so you need to send all of the data for a particular key to the same reducing agent
23:01 rhickey: otherwise the final merger is doing reducing
23:01 Chouser: Aaaaa...
23:02 Chouser: should the passed-in reducer then not be operating on two hashes, but on two values (ints in this case)?
23:03 Chouser: the reducer should just be +
23:03 rhickey: if you look at it as data flows...
23:03 rhickey: docs flow into maps...
23:04 rhickey: key/val maps flow out of maps into partition...
23:04 rhickey: key/val pairs flow from partition to reducers
23:05 rhickey: maps flow out of reducers and get concatenated as final result
23:05 Chouser: I think that Qt API I posted misled me. The reduce function in Google's paper takes a key and a seq of values. Qt's takes two hashes.
23:08 rhickey: key and seq of values means partition is doing more work, so batch is all together, makes sense for the distributed case
23:08 rhickey: the main thing is to keep the stages/steps separate
23:10 Chouser: if the seq only had 2 items, that would match your "key/val pairs flow ... to reducers" description, right?
23:11 rhickey: let's say there were 3 mapping agents
23:12 rhickey: the first found "the" 1 time, the second 2 times and the third 3 times
23:12 rhickey: one partitioning model determines that the "the" reducer is reducer #2
23:13 rhickey: when it gets the result from mapper one, it sends ["the" 1] to reducer #2
23:13 rhickey: later when it gets the results from mapper two, it sends ["the" 2] to reducer #2
23:13 rhickey: etc
23:14 rhickey: another model has it gather all the results from the mappers and send ["the" 1 2 3] to reducer #2
23:14 foones: Hi!
23:14 foones: I'm trying to understand Clojure's support for concurrency
23:14 foones: I think I've understood what agents and refs are
23:14 Chouser: Ok. And that's why I thought I should have one agent for each word, until I read your Socratic question from 30 minutes ago.
23:15 foones: but I'm not sure as to how I can execute two pieces of code in parallel
23:16 rhickey: foones: (send agent1 ajob) (send agent2 ajob) (await agent1 agent2)
23:17 Chouser: rhickey: bedtime for me. Thanks for walking through this with me, I really appreciate it.
23:17 foones: Aha, and if in one of the jobs I need to read input (e.g. user input)
23:17 rhickey: sure
23:17 foones: it won't wait for it to end?
23:18 rhickey: foones: for blocking jobs, use send-off and don't await it. It can put its result in a ref when it arrives
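Fleshing out rhickey's two recipes a little (the agent names and jobs here are invented for illustration):

```clojure
;; Two CPU-bound jobs run in parallel on agents, then awaited:
(def agent1 (agent nil))
(def agent2 (agent nil))

(send agent1 (fn [_] (reduce + (range 1000))))
(send agent2 (fn [_] (count "hello")))
(await agent1 agent2)
[@agent1 @agent2]  ;; => [499500 5]

;; A blocking job (e.g. reading user input) goes on send-off,
;; with no await; it drops its result into a ref when it arrives:
(def result (ref nil))
(send-off (agent nil)
          (fn [_]
            (let [line (read-line)]   ; blocks until input arrives
              (dosync (ref-set result line)))))
```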
23:19 foones: Ahh, I understand
23:19 foones: Thanks for your help
23:24 gomer-nyc: hi, does anybody have experience with returning proxies from a macro?
23:25 gomer-nyc: or rather, *a* proxy from a macro?
23:35 Chouser: gomer-nyc: you missed rhickey by 5 minutes.
23:35 Chouser: And I'm going to bed. :-/ Maybe try the google discussion group?
23:36 gomer-nyc: yeah, good idea; I've searched but haven't found much around proxies & macros that covers what I'm looking for. Maybe I'll post a question there. good night :-)