2008-04-30
| 13:55 | Chouser | Here's a fun challenge (I'll try to do it myself if I get some time later)... |
| 13:56 | Chouser | Implement MapReduce (and of course the standard wordcount example) for SMP (shared memory, not a cluster) in Clojure. |
| 13:57 | Chouser | I'm betting the MapReduce "framework" on top of agents is no more than 10 lines of code, and the wordcount example would be another 3. |
| 13:57 | rhickey | Chouser: link? |
| 13:58 | Chouser | to MapReduce? |
| 13:59 | rhickey | I've read the MapReduce papers, wondering if there was a benchmark thingy |
| 13:59 | rhickey | or implementation in another language... |
| 13:59 | Chouser | oh, no. I just made up the challenge because a friend was being impressed by this: http://labs.trolltech.com/blogs/category/labs/threads/qt-concurrent/ |
| 14:11 | albino | Is that close to the widefinder description? |
| 15:16 | Chouser | is there some reason pmap manages its own thread pool instead of using agents? |
| 15:17 | Chouser | I'm not smart enough to know *how* it would use agents, I'm just assuming that's possible... somehow... |
| 15:47 | rhickey | Chouser: pmap may have predated agents |
| 15:47 | Chouser | ah, ok. |
| 16:25 | nsinghal | I have a question |
| 16:25 | nsinghal | http://paste.lisp.org/display/59989 |
| 16:28 | Chouser | your initial example works for me, replacing (new Socket) with (throw (new Exception)) |
| 16:32 | nsinghal | That's the question: it does work that way. See the example at the end. Why doesn't it work with Socket? |
| 16:34 | nsinghal | (import '(java.net Socket)) |
| 16:34 | nsinghal | (defn connect [host port eatex] |
| 16:34 | nsinghal | (try |
| 16:34 | nsinghal | (new Socket host port) |
| 16:34 | nsinghal | (catch Throwable ex |
| 16:34 | nsinghal | (if eatex nil (throw ex))))) |
| 16:34 | nsinghal | (connect "127.0.0.1" 3432 true) |
| 16:38 | Chouser | oh, sorry, I missed that that was the point of the final example. I thought it had to do with the number of arguments. |
| 16:54 | Chouser | well, I'm stumped. |
| 16:55 | rhickey | probably a bug with local clearing prior to tail call - looking at it now |
| 17:18 | rhickey | nsinghal: fix is up |
| 17:19 | nsinghal | thx |
| 17:31 | MarkJP | how would I do a sleep in the first binding of a let...recur loop |
| 17:31 | MarkJP | don't ask |
| 17:32 | MarkJP | I could have a flag I guess |
| 17:42 | Chouser | inside the body of the loop, or up at the top? |
| 17:48 | MarkJP | I wanted to do it at the top. For now I just added a new sleep flag to the let, so the recur passes in a false |
| 17:48 | MarkJP | the let binds it to true the first time |
| 17:49 | Chouser | (loop [_ (Thread/sleep 3) x [:a :b :c]] (prn x) (when (seq x) (recur _ (rest x)))) |
| 17:49 | Chouser | (loop [_ (Thread/sleep 3) x [:a :b :c]] (prn x) (when (seq x) (recur _ (rest x)))) |
| 17:49 | Chouser | oops, sorry for the double-post |
| 17:50 | MarkJP | cool is that what _ does? |
| 17:50 | Chouser | ...and for the fact that it doesn't work. |
| 17:50 | Chouser | oh, no, it does work. (Thread/sleep 3000) is a bit more noticeable, though. |
| 17:51 | MarkJP | cool thanks |
| 17:51 | Chouser | No, _ is just a symbol. You could call it "blank" or "dummy" instead. |
| 17:52 | Chouser | then again, I don't see how that's better than just (do (Thread/sleep 3000) (loop ...)) |
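For reference, here are the two variants just discussed, sketched in current Clojure syntax (`Thread/sleep` for the static call; the 100 ms delay and the keyword vector are placeholders):

```clojure
;; Variant 1: sleep before the first iteration via a throwaway
;; binding. _ is an ordinary symbol, conventionally "ignored";
;; recur passes nil for it so the sleep happens only once.
(loop [_ (Thread/sleep 100)
       x [:a :b :c]]
  (when (seq x)
    (prn (first x))
    (recur nil (rest x))))

;; Variant 2: Chouser's simpler alternative, sleep then loop.
(do (Thread/sleep 100)
    (loop [x [:a :b :c]]
      (when (seq x)
        (prn (first x))
        (recur (rest x)))))
```

Both print `:a`, `:b`, `:c` after an initial delay.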
| 22:25 | Chouser | so here's my attempt at map-reduce: http://paste.lisp.org/display/60005 |
| 22:26 | Chouser | Also my first real attempt at using agents. |
| 22:31 | Chouser | I wonder if ref-toggle will be useful for solving other problems. |
| 22:34 | Chouser | ref-toggle, or the whole ugly mess? |
| 22:35 | Chouser | I don't really know how to comment code like this. |
| 22:35 | rhickey | Chouser: isn't there a partitioning between map and reduce, that makes each reduce job work on related data? |
| 22:38 | rhickey | I think the agents doing mapping and interacting with reducer is too coupled |
| 22:39 | Chouser | in map-reduce3? |
| 22:39 | rhickey | and 2 |
| 22:40 | rhickey | number of agents should be independent of number of docs |
| 22:40 | Chouser | cool. Help me understand why. The number of agents is already independent of the number of processors -- so what should determine the number of agents? |
| 22:41 | rhickey | make a cycle of agents, distribute docs to each round robin... |
| 22:41 | rhickey | await them |
| 22:41 | rhickey | concat results, send to partitioner |
| 22:41 | rhickey | distribute partitioned results to agent with reduce function |
| 22:41 | Chouser | won't the thread pool cycle through the agents for me? Why should I write that logic again? |
| 22:41 | rhickey | await |
| 22:41 | rhickey | merge results |
| 22:42 | rhickey | more agents means more independent results to merge, should be a setting |
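A minimal sketch of the steps rhickey lists above, assuming modern Clojure (`frequencies`, `merge-with`); the intermediate partitioning stage is elided here, so the final merge does the combining, and `map-stage`/`n` are invented names:

```clojure
(defn map-stage
  "Word count over docs (each doc a seq of words) using n mapping
   agents fed round robin. Intermediate results are concatenated
   and combined with merge-with +; a separate reducing stage is
   elided from this sketch."
  [n docs]
  (let [mappers (vec (repeatedly n #(agent [])))]
    ;; distribute docs to the mapping agents round robin
    (doseq [[a doc] (map vector (cycle mappers) docs)]
      (send a conj (frequencies doc)))
    ;; wait for every mapper to drain its queue
    (apply await mappers)
    ;; concat the intermediate word->count maps and merge counts
    (apply merge-with + (mapcat deref mappers))))

;; (map-stage 2 [["the" "cat"] ["the" "dog" "the"]])
;; yields a single map with "the" -> 3, "cat" -> 1, "dog" -> 1
```

`n` is the settable number of agents rhickey mentions, independent of the number of docs.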
| 22:44 | Chouser | every hash produced by a map run has to be merged, regardless of whether the agent producing it has produced other hashes already. |
| 22:44 | rhickey | would you have an agent per word? |
| 22:45 | Chouser | No, because there's no work to be done for a single word. I have one agent per parallelizable map run. |
| 22:46 | Chouser | The drawback of my solution is what, too much contention for the final-hash? |
| 22:47 | rhickey | I don't think you are capturing the mapreduce model |
| 22:48 | Chouser | hm, it's been a while since I've read the paper. Perhaps I should look at that again. |
| 22:48 | rhickey | partition input, divide among compute resources, partition intermediate results, divide reduction among compute resources, merge final result |
| 22:48 | Chouser | I see, and it's the partitioning at each step that I'm punting on. |
| 22:49 | rhickey | having a settable number of agents allows you to model compute resources, independent of how Clojure distributes to processors |
| 22:50 | rhickey | but it's likely numagents == numprocessors is optimal for this, since even in map they are doing combining (of counts) |
| 22:53 | rhickey | (assoc hash word (inc (get (hash word) 0))) more idiomatic |
| 22:53 | rhickey | oops: (assoc hash word (inc (get hash word 0))) |
| 22:54 | Chouser | I don't like inc. Another keyword to learn, when (+ 1 ...) is just as succinct (usually) and more obvious. |
| 22:54 | Chouser | (get hash word 0) is nice. |
| 22:54 | rhickey | the first partition is mechanical, the second logical |
| 22:55 | rhickey | inc may be faster |
| 22:55 | Chouser | hm. |
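For comparison, the two update spellings side by side (a standalone sketch; `counts` and the words are invented):

```clojure
(def counts {"the" 2})

;; rhickey's idiomatic version: get with a 0 default, then inc
(assoc counts "the" (inc (get counts "the" 0)))   ; {"the" 3}

;; Chouser's preferred spelling, same result
(assoc counts "the" (+ 1 (get counts "the" 0)))   ; {"the" 3}

;; a word not yet in the map falls back to the 0 default
(assoc counts "cat" (inc (get counts "cat" 0)))   ; adds "cat" -> 1
```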
| 22:59 | rhickey | second partition can be made mechanical, use rem of hash of key to index reducing agents |
| 22:59 | Chouser | well, it was a lot of fun to write. I'll have to re-read the paper to understand the purpose of the partitioning. |
| 23:00 | rhickey | you need to end up with one result per key, so you need to send all of the data for a particular key to the same reducing agent |
| 23:01 | rhickey | otherwise the final merger is doing reducing |
| 23:01 | Chouser | Aaaaa... |
| 23:02 | Chouser | should the passed-in reducer then not be operating on two hashes, but on two values (ints in this case)? |
| 23:03 | Chouser | the reducer should just be + |
| 23:03 | rhickey | if you look at it as data flows... |
| 23:03 | rhickey | docs flow into maps... |
| 23:04 | rhickey | key/val maps flow out of maps into partition... |
| 23:04 | rhickey | key/val pairs flow from partition to reducers |
| 23:05 | rhickey | maps flow out of reducers and get concatenated as final result |
| 23:05 | Chouser | I think that Qt API I posted misled me. The reduce function in Google's paper takes a key and a seq of values. Qt's takes two hashes. |
| 23:08 | rhickey | key and seq of values means partition is doing more work, so batch is all together, makes sense for the distributed case |
| 23:08 | rhickey | the main thing is to keep the stages/steps separate |
| 23:10 | Chouser | if the seq only had 2 items, that would match your "key/val pairs flow ... to reducers" description, right? |
| 23:11 | rhickey | let's say there were 3 mapping agents |
| 23:12 | rhickey | the first found "the" 1 time, the second 2 times and the third 3 times |
| 23:12 | rhickey | one partitioning model determines that the "the" reducer is reducer #2 |
| 23:13 | rhickey | when it gets the result from mapper one, it sends ["the" 1] to reducer #2 |
| 23:13 | rhickey | later when it gets the results from mapper two, it sends ["the" 2] to reducer #2 |
| 23:13 | rhickey | etc |
| 23:14 | rhickey | another model has it gather all the results from the mappers and send ["the" 1 2 3] to reducer #2 |
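The hash-partitioning scheme from above ("rem of hash of key to index reducing agents") can be sketched like this, assuming modern Clojure (`update`, `fnil`); the function names are invented:

```clojure
(defn word->reducer-index
  "Route every occurrence of a word to the same reducing agent."
  [n-reducers word]
  (mod (hash word) n-reducers))

(defn send-counts
  "Send each [word count] pair from one mapper's output to its
   designated reducing agent; the per-word reducer is just +."
  [reducers pairs]
  (doseq [[word cnt] pairs]
    (send (reducers (word->reducer-index (count reducers) word))
          update word (fnil + 0) cnt)))

;; three mapping results for "the" (1, 2, and 3 occurrences) all
;; land on the same reducer, so the final step is a plain merge
;; of disjoint maps, not more reducing
(def reducers (vec (repeatedly 3 #(agent {}))))
(send-counts reducers {"the" 1})
(send-counts reducers {"the" 2})
(send-counts reducers {"the" 3})
(apply await reducers)
(apply merge (map deref reducers))   ; contains "the" -> 6
```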
| 23:14 | foones | Hi! |
| 23:14 | foones | I'm trying to understand Clojure's support for concurrency |
| 23:14 | foones | I think I've understood what agents and refs are |
| 23:14 | Chouser | Ok. And that's why I thought I should have one agent for each word, until I read your Socratic question from 30 minutes ago. |
| 23:15 | foones | but I'm not sure as to how I can execute two pieces of code in parallel |
| 23:16 | rhickey | foones: (send agent1 ajob) (send agent2 ajob) (await agent1 agent2) |
| 23:17 | Chouser | rhickey: bedtime for me. Thanks for walking through this with me, I really appreciate it. |
| 23:17 | foones | Aha, and if in one of the jobs I need to read input (e.g. user input) |
| 23:17 | rhickey | sure |
| 23:17 | foones | it won't wait for it to end? |
| 23:18 | rhickey | foones: for blocking jobs use send-off, and don't await it. It can put its result in a ref when it arrives |
| 23:19 | foones | Ahh, I understand |
| 23:19 | foones | Thanks for your help |
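rhickey's two recipes above, sketched with invented names (`ajob` stands in for real work, and the string stands in for blocking input such as `read-line`):

```clojure
;; Parallel jobs: send to two agents, then await both.
(def agent1 (agent nil))
(def agent2 (agent nil))

(defn ajob [_]
  (reduce + (range 1000)))   ; stand-in for real work

(send agent1 ajob)
(send agent2 ajob)
(await agent1 agent2)
;; @agent1 and @agent2 each now hold the job's result

;; Blocking job: use send-off and don't await; the job delivers
;; its result into a ref when it arrives.
(def result (ref nil))
(def reader-agent (agent nil))
(send-off reader-agent
          (fn [_]
            (let [line "user input"]   ; stands in for (read-line)
              (dosync (ref-set result line)))))
```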
| 23:24 | gomer-nyc | hi, does anybody have experience with returning proxies from a macro? |
| 23:25 | gomer-nyc | or rather, *a* proxy from a macro? |
| 23:35 | Chouser | gomer-nyc: you missed rhickey by 5 minutes. |
| 23:35 | Chouser | And I'm going to bed. :-/ Maybe try the google discussion group? |
| 23:36 | gomer-nyc | yeah, good idea; I've searched but not much around proxies & macros that I'm looking for. maybe I'll post a q there. good night :-) |