2008-04-30
| 13:55 | Chouser | Here's a fun challenge (I'll try to do it myself if I get some time later)... |
| 13:56 | Chouser | Implement MapReduce (and of course the standard wordcount example) for SMP (shared memory, not a cluster) in Clojure. |
| 13:57 | Chouser | I'm betting the MapReduce "framework" on top of agents is no more than 10 lines of code, and the wordcount example would be another 3. |
| 13:57 | rhickey | Chouser: link? |
| 13:58 | Chouser | to MapReduce? |
| 13:59 | rhickey | I've read the MapReduce papers, wondering if there was a benchmark thingy |
| 13:59 | rhickey | or implementation in another language... |
| 13:59 | Chouser | oh, no. I just made up the challenge because a friend was being impressed by this: http://labs.trolltech.com/blogs/category/labs/threads/qt-concurrent/ |
| 14:11 | albino | Is that close to the widefinder description? |
| 15:16 | Chouser | is there some reason pmap manages its own thread pool instead of using agents? |
| 15:17 | Chouser | I'm not smart enough to know *how* it would use agents, I'm just assuming that's possible... somehow... |
| 15:47 | rhickey | Chouser: pmap may have predated agents |
| 15:47 | Chouser | ah, ok. |
| 16:25 | nsinghal | I have a question |
| 16:25 | nsinghal | http://paste.lisp.org/display/59989 |
| 16:28 | Chouser | your initial example works for me, replacing (new Socket) with (throw (new Exception)) |
| 16:32 | nsinghal | That's the question: it does work that way. See the example at the end. Why doesn't it work with Socket? |
| 16:34 | nsinghal | (import '(java.net Socket)) |
| 16:34 | nsinghal | (defn connect [host port eatex] |
| 16:34 | nsinghal | (try |
| 16:34 | nsinghal | (new Socket host port) |
| 16:34 | nsinghal | (catch Throwable ex |
| 16:34 | nsinghal | (if eatex nil (throw ex))))) |
| 16:34 | nsinghal | (connect "127.0.0.1" 3432 true) |
| 16:38 | Chouser | oh, sorry, I missed that that was the point of the final example. I thought it had to do with the number of arguments. |
| 16:54 | Chouser | well, I'm stumped. |
| 16:55 | rhickey | probably a bug with local clearing prior to tail call - looking at it now |
| 17:18 | rhickey | nsinghal: fix is up |
| 17:19 | nsinghal | thx |
| 17:31 | MarkJP | how would I do a sleep in the first binding of a let...recur loop |
| 17:31 | MarkJP | don't ask |
| 17:32 | MarkJP | I could have a flag I guess |
| 17:42 | Chouser | inside the body of the loop, or up at the top? |
| 17:48 | MarkJP | I wanted to do it at the top. For now I just added a new sleep flag to the let, so the recur passes in a false |
| 17:48 | MarkJP | the let binds it to true the first time |
| 17:49 | Chouser | (loop [_ (Thread/sleep 3) x [:a :b :c]] (prn x) (when (seq x) (recur _ (rest x)))) |
| 17:49 | Chouser | (loop [_ (Thread/sleep 3) x [:a :b :c]] (prn x) (when (seq x) (recur _ (rest x)))) |
| 17:49 | Chouser | oops, sorry for the double-post |
| 17:50 | MarkJP | cool is that what _ does? |
| 17:50 | Chouser | ...and for the fact that it doesn't work. |
| 17:50 | Chouser | oh, no, it does work. (Thread/sleep 3000) is a bit more noticeable, though. |
| 17:51 | MarkJP | cool thanks |
| 17:51 | Chouser | No, _ is just a symbol. You could call it "blank" or "dummy" instead. |
| 17:52 | Chouser | then again, I don't see how that's better than just (do (Thread/sleep 3000) (loop ...)) |
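For reference, here are the two variants just discussed, sketched in current Clojure syntax (`Thread/sleep` for the static call; the 100 ms delay and the keyword vector are placeholders):

```clojure
;; Variant 1: sleep before the first iteration via a throwaway
;; binding. _ is an ordinary symbol, conventionally "ignored";
;; recur passes nil for it so the sleep happens only once.
(loop [_ (Thread/sleep 100)
       x [:a :b :c]]
  (when (seq x)
    (prn (first x))
    (recur nil (rest x))))

;; Variant 2: Chouser's simpler alternative, sleep then loop.
(do (Thread/sleep 100)
    (loop [x [:a :b :c]]
      (when (seq x)
        (prn (first x))
        (recur (rest x)))))
```

Both print `:a`, `:b`, `:c` after an initial delay.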
| 22:25 | Chouser | so here's my attempt at map-reduce: http://paste.lisp.org/display/60005 |
| 22:26 | Chouser | Also my first real attempt at using agents. |
| 22:31 | Chouser | I wonder if ref-toggle will be useful for solving other problems. |
| 22:34 | Chouser | ref-toggle, or the whole ugly mess? |
| 22:35 | Chouser | I don't really know how to comment code like this. |
| 22:35 | rhickey | Chouser: isn't there a partitioning between map and reduce, that makes each reduce job work on related data? |
| 22:38 | rhickey | I think the agents doing mapping and interacting with reducer is too coupled |
| 22:39 | Chouser | in map-reduce3? |
| 22:39 | rhickey | and 2 |
| 22:40 | rhickey | number of agents should be independent of number of docs |
| 22:40 | Chouser | cool. Help me understand why. The number of agents is already independent of the number of processors -- so what should determine the number of agents? |
| 22:41 | rhickey | make a cycle of agents, distribute docs to each round robin... |
| 22:41 | rhickey | await them |
| 22:41 | rhickey | concat results, send to partitioner |
| 22:41 | rhickey | distribute partitioned results to agent with reduce function |
| 22:41 | Chouser | won't the thread pool cycle through the agents for me? Why should I write that logic again? |
| 22:41 | rhickey | await |
| 22:41 | rhickey | merge results |
| 22:42 | rhickey | more agents means more independent results to merge, should be a setting |
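A minimal sketch of the steps rhickey lists above, assuming modern Clojure (`frequencies`, `merge-with`); the intermediate partitioning stage is elided here, so the final merge does the combining, and `map-stage`/`n` are invented names:

```clojure
(defn map-stage
  "Word count over docs (each doc a seq of words) using n mapping
   agents fed round robin. Intermediate results are concatenated
   and combined with merge-with +; a separate reducing stage is
   elided from this sketch."
  [n docs]
  (let [mappers (vec (repeatedly n #(agent [])))]
    ;; distribute docs to the mapping agents round robin
    (doseq [[a doc] (map vector (cycle mappers) docs)]
      (send a conj (frequencies doc)))
    ;; wait for every mapper to drain its queue
    (apply await mappers)
    ;; concat the intermediate word->count maps and merge counts
    (apply merge-with + (mapcat deref mappers))))

;; (map-stage 2 [["the" "cat"] ["the" "dog" "the"]])
;; yields a single map with "the" -> 3, "cat" -> 1, "dog" -> 1
```

`n` is the settable number of agents rhickey mentions, independent of the number of docs.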
| 22:44 | Chouser | every hash produced by a map run has to be merged, regardless of whether the agent producing it has produced other hashes already. |
| 22:44 | rhickey | would you have an agent per word? |
| 22:45 | Chouser | No, because there's no work to be done for a single word. I have one agent per parallelizable map run. |
| 22:46 | Chouser | The drawback of my solution is what, too much contention for the final-hash? |
| 22:47 | rhickey | I don't think you are capturing the mapreduce model |
| 22:48 | Chouser | hm, it's been a while since I've read the paper. Perhaps I should look at that again. |
| 22:48 | rhickey | partition input, divide among compute resources, partition intermediate results, divide reduction among compute resources, merge final result |
| 22:48 | Chouser | I see, and it's the partitioning at each step that I'm punting on. |
| 22:49 | rhickey | having a settable number of agents allows you to model compute resources, independent of how Clojure distributes to processors |
| 22:50 | rhickey | but it's likely numagents == numprocessors is optimal for this, since even in map they are doing combining (of counts) |
| 22:53 | rhickey | (assoc hash word (inc (get (hash word) 0))) more idiomatic |
| 22:53 | rhickey | oops: (assoc hash word (inc (get hash word 0))) |
| 22:54 | Chouser | I don't like inc. Another keyword to learn, when (+ 1 ...) is just as succinct (usually) and more obvious. |
| 22:54 | Chouser | (get hash word 0) is nice. |
| 22:54 | rhickey | the first partition is mechanical, the second logical |
| 22:55 | rhickey | inc may be faster |
| 22:55 | Chouser | hm. |
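For comparison, the two update spellings side by side (a standalone sketch; `counts` and the words are invented):

```clojure
(def counts {"the" 2})

;; rhickey's idiomatic version: get with a 0 default, then inc
(assoc counts "the" (inc (get counts "the" 0)))   ; {"the" 3}

;; Chouser's preferred spelling, same result
(assoc counts "the" (+ 1 (get counts "the" 0)))   ; {"the" 3}

;; a word not yet in the map falls back to the 0 default
(assoc counts "cat" (inc (get counts "cat" 0)))   ; adds "cat" -> 1
```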
| 22:59 | rhickey | second partition can be made mechanical, use rem of hash of key to index reducing agents |
| 22:59 | Chouser | well, it was a lot of fun to write. I'll have to re-read the paper to understand the purpose of the partitioning. |
| 23:00 | rhickey | you need to end up with one result per key, so you need to send all of the data for a particular key to the same reducing agent |
| 23:01 | rhickey | otherwise the final merger is doing reducing |
| 23:01 | Chouser | Aaaaa... |
| 23:02 | Chouser | should the passed-in reducer then not be operating on two hashes, but on two values (ints in this case)? |
| 23:03 | Chouser | the reducer should just be + |
| 23:03 | rhickey | if you look at it as data flows... |
| 23:03 | rhickey | docs flow into maps... |
| 23:04 | rhickey | key/val maps flow out of maps into partition... |
| 23:04 | rhickey | key/val pairs flow from partition to reducers |
| 23:05 | rhickey | maps flow out of reducers and get concatenated as final result |
| 23:05 | Chouser | I think that Qt API I posted misled me. The reduce function in Google's paper takes a key and a seq of values. Qt's takes two hashes. |
| 23:08 | rhickey | key and seq of values means partition is doing more work, so batch is all together, makes sense for the distributed case |
| 23:08 | rhickey | the main thing is to keep the stages/steps separate |
| 23:10 | Chouser | if the seq only had 2 items, that would match your "key/val pairs flow ... to reducers" description, right? |
| 23:11 | rhickey | let's say there were 3 mapping agents |
| 23:12 | rhickey | the first found "the" 1 time, the second 2 times and the third 3 times |
| 23:12 | rhickey | one partitioning model determines that the "the" reducer is reducer #2 |
| 23:13 | rhickey | when it gets the result from mapper one, it sends ["the" 1] to reducer #2 |
| 23:13 | rhickey | later when it gets the results from mapper two, it sends ["the" 2] to reducer #2 |
| 23:13 | rhickey | etc |
| 23:14 | rhickey | another model has it gather all the results from the mappers and send ["the" 1 2 3] to reducer #2 |
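The hash-partitioning scheme from above ("rem of hash of key to index reducing agents") can be sketched like this, assuming modern Clojure (`update`, `fnil`); the function names are invented:

```clojure
(defn word->reducer-index
  "Route every occurrence of a word to the same reducing agent."
  [n-reducers word]
  (mod (hash word) n-reducers))

(defn send-counts
  "Send each [word count] pair from one mapper's output to its
   designated reducing agent; the per-word reducer is just +."
  [reducers pairs]
  (doseq [[word cnt] pairs]
    (send (reducers (word->reducer-index (count reducers) word))
          update word (fnil + 0) cnt)))

;; three mapping results for "the" (1, 2, and 3 occurrences) all
;; land on the same reducer, so the final step is a plain merge
;; of disjoint maps, not more reducing
(def reducers (vec (repeatedly 3 #(agent {}))))
(send-counts reducers {"the" 1})
(send-counts reducers {"the" 2})
(send-counts reducers {"the" 3})
(apply await reducers)
(apply merge (map deref reducers))   ; contains "the" -> 6
```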
| 23:14 | foones | Hi! |
| 23:14 | foones | I'm trying to understand Clojure's support for concurrency |
| 23:14 | foones | I think I've understood what agents and refs are |
| 23:14 | Chouser | Ok. And that's why I thought I should have one agent for each word, until I read your Socratic question from 30 minutes ago. |
| 23:15 | foones | but I'm not sure as to how I can execute two pieces of code in parallel |
| 23:16 | rhickey | foones: (send agent1 ajob) (send agent2 ajob) (await agent1 agent2) |
| 23:17 | Chouser | rhickey: bedtime for me. Thanks for walking through this with me, I really appreciate it. |
| 23:17 | foones | Aha, and if in one of the jobs I need to read input (e.g. user input) |
| 23:17 | rhickey | sure |
| 23:17 | foones | it won't wait for it to end? |
| 23:18 | rhickey | foones: for blocking jobs use send-off, and don't await it. It can put its result in a ref when it arrives |
| 23:19 | foones | Ahh, I understand |
| 23:19 | foones | Thanks for your help |
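rhickey's two recipes above, sketched with invented names (`ajob` stands in for real work, and the string stands in for blocking input such as `read-line`):

```clojure
;; Parallel jobs: send to two agents, then await both.
(def agent1 (agent nil))
(def agent2 (agent nil))

(defn ajob [_]
  (reduce + (range 1000)))   ; stand-in for real work

(send agent1 ajob)
(send agent2 ajob)
(await agent1 agent2)
;; @agent1 and @agent2 each now hold the job's result

;; Blocking job: use send-off and don't await; the job delivers
;; its result into a ref when it arrives.
(def result (ref nil))
(def reader-agent (agent nil))
(send-off reader-agent
          (fn [_]
            (let [line "user input"]   ; stands in for (read-line)
              (dosync (ref-set result line)))))
```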
| 23:24 | gomer-nyc | hi, does anybody have experience with returning proxies from a macro? |
| 23:25 | gomer-nyc | or rather, *a* proxy from a macro? |
| 23:35 | Chouser | gomer-nyc: you missed rhickey by 5 minutes. |
| 23:35 | Chouser | And I'm going to bed. :-/ Maybe try the google discussion group? |
| 23:36 | gomer-nyc | yeah, good idea; I've searched but not much around proxies & macros that I'm looking for. maybe I'll post a q there. good night :-) |