#clojure logs

2008-04-17

08:48ChouserHm, perhaps slurp should use FileChannel.map
08:49rhickeyThat could be cool - never used the NIO stuff
08:51rhickeycan you get it into a string without it being in memory twice?
08:54ChouserI dunno, but it might be possible.
08:56ChouserDocs say read is faster for files smaller than "a few tens of kilobytes"
08:57ChouserYou get a subclass of ByteBuffer, not a string.
09:02rhickeyare you proposing slurp return a ByteBuffer?
09:03rhickeyChouser: did you log yesterday's irc? I was offline
09:04Chouseryou didn't miss much: http://n01se.net/chouser/clojure-log/2008-04-16.html
09:04Chousershould I just let google loose on those logs? We can always move them later.
09:05rhickeysure
09:05rhickeythanks
09:05Chouserok
09:14drewrWorks in Safari, but not FF3.
09:17jteotrue
09:25ChouserI've had people complain about other of my sites in FF3. Works fine for me, though.
09:26drewrYou on OS X?
09:26Chousernope, Linux.
09:26ChouserI can try OS X. That's where it's failing for you?
09:27drewrYup. Just get a blank gradient background.
09:27cgrandChouser: I think FF3 chokes on the empty script tag (FF3/win xp) (if I edit it it works)
09:27jteosame here. FF3.
09:30Chouserhm!
09:31ChouserFF2 on OS X works fine. I don't have FF3 on that machine yet.
09:32ChouserThe script tag has a src="". Without that you're missing the navigation links, right?
09:33drewrI'm experimenting with this tracing code that rhickey posted to the list. I've evaled all the forms at the REPL, but when I get to (trace fact), I get "no such var: clojure/traced." What's the deal?
09:33cgrandchouser: use <script type="text/javascript" src="irc.js"></script> instead of <script type="text/javascript" src="irc.js"/> (<script> should not be "collapsed")
09:33drewrIt should be in the user ns.
09:34Chousercgrand: oh, of course. thanks.
09:39Chouserthere, how's that?
09:39drewrChouser: :-)
09:40cgrandchouser: works
09:40Chousergreat, thanks for your help.
09:44ChouserByteBuffer seems pretty hard to deal with, compared to String.
09:45cgrandchouser: will you regenerate logs anterior to 2008-04-13?
09:47Chousercgrand: I don't have them.
09:48ChouserMy IRC client is generating the raw logs that I'm using, so I've got nothing from before I joined the channel.
09:48Chouseroh, wait.
09:48Chousersorry, misunderstood
09:48cgrand:-)
09:49Chouserhm, those should have been done already.
09:56cgrandchouser: last modification time says 13-Apr-2008 12:52 :-(
10:03Chouserthere, try that.
10:10cgrandchouser: perfect
10:58rhickeydrewr: trace working now?
11:08drewrrhickey: No, not yet.
11:09drewrCan't figure out why it wants traced to be in clojure's namespace.
11:13ChouserHow do I write a type hint for a Java byte[]? #^byte[] doesn't work. ;-)
11:15rhickeyyou can type hint arrays with the java.lang.Class.getName format as a String: #^"[B"
11:16Chouseryum! ok.
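
[Editor's sketch] The array type hint rhickey describes can be tried as below. This assumes a current Clojure release, where hints are written `^` rather than the `#^` used in this log, and where `^bytes` is also accepted as an alias for `^"[B"`. The function names are made up for illustration.

```clojure
;; Hint a byte[] parameter using the Class.getName form as a string.
(defn sum-bytes
  [^"[B" bs]
  (areduce bs i acc 0 (+ acc (long (aget bs i)))))

;; The same function using the ^bytes alias available in later Clojure versions.
(defn sum-bytes*
  [^bytes bs]
  (areduce bs i acc 0 (+ acc (long (aget bs i)))))

(sum-bytes (byte-array [1 2 3]))  ; 6
```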
11:18Chouserhehe. this syntax highlighter totally wigs out on that.
11:19rhickeydrewr: want to try the latest (819) with clean/build?
11:19drewrHm, I'm at 818. Let me try that.
11:21Chouserre-seq on a 3MB file: with slurp 1016 msecs, with map-slurp 375 msecs
11:23drewrrhickey: Didn't help.
11:24drewr...for the Repl. It works now in Script.
11:24Chouserre-seq on a 13MB file: with map-slurp 1285 msecs, with slurp OutOfMemoryError: Java heap space
11:25rhickeyare you using asCharBuffer?
11:27Chouserno, I couldn't get that to work for me.
11:27rhickeydrewr: just did the same thing here, works in Repl fine, hmm...
11:27ChouserIt looks like maybe asCharBuffer is interpreting the bytes as UTF-16 or something.
11:27rhickeyChouser: so what does map-slurp do?
11:27ChouserI wrote my own CharSequence proxy
11:28drewrAnother related question. TRACE and UNTRACE both RESOLVE the function that gets passed in, however, when I do that directly, I get a ClassCastException. What's the difference between (resolve fact) at the REPL and that inside the macro?
11:28rhickey(resolve 'fact)
11:29drewrYes, but TRACE doesn't call (resolve (quote f)), it calls (resolve f).
11:29vincenzdrewr: but it's a macro, so it's passed in the symbol
11:29vincenzdrewr: that had me scratching my head a long time too, why is trace a macro
11:29drewrvincenz: Ah, thanks.
11:30vincenzit's to get the name of the function in there, not the value
11:30drewrOf course.
11:30rhickeylike doc, trace is a macro because it's really a repl-user convenience thing
11:31rhickeyI don't recommend doing that generally for things that take symbols
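
[Editor's sketch] The difference vincenz points out can be seen with a small stand-in macro. `resolved-name` is hypothetical and is not the TRACE code from the mailing list; it only shows that a macro receives the symbol itself, so it can call `resolve` on its argument without quoting.

```clojure
(defn fact [n]
  (if (<= n 1) 1 (* n (fact (dec n)))))

;; resolve is an ordinary function: its argument is evaluated first,
;; so at the REPL the symbol has to be quoted. (resolve fact) would
;; pass the function object instead, hence the ClassCastException.
(def fact-var (resolve 'fact))

;; Hypothetical macro for illustration: inside the macro body, f is
;; bound to the unevaluated symbol, so (resolve f) needs no quote.
(defmacro resolved-name [f]
  (str (resolve f)))

(resolved-name fact)  ; a string naming the var, ending in "/fact"
```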
11:31drewrrhickey: BTW, I did a C-c C-k to compile and load the file, and it worked doing that. I'm not sure why C-M-x on the forms didn't work originally.
11:32rhickeyI used C-M-x on each form
11:32drewrInteresting.
11:33rhickeybut get same error as you when I C-M-x on all the forms!
11:33drewrHm. What's the difference between your last two comments?
11:34rhickeyone-at-a-time vs block
11:34vincenzrather odd
11:34drewrCan you do C-M-x on a region? What do you mean by block?
11:35rhickeyregion, I don't know if it is supposed to work
11:36ChouserThere's probably a better way to do this, but here's my map-slurp: http://n01se.net/paste/F0I
11:37drewrI generally do this: M-< to get to the top of the buffer, and then C-M-x, C-M-e all the way down. That's what failed me the first time with the trace stuff. Not sure why that's different.
11:37drewrI had a clean JVM because I restarted after I rebuilt clojure.jar.
11:40rhickeyChouser: interesting. I'm not sure the length times matter, but the re-seq time diff is something. Are you running -server, multiple tries?
11:41rhickeyI've found generally that laziness has provided a whole additional set of benefits in performance due to reduced heap pressure, in spite of the ephemeral garbage it generates
11:42Chouserrhickey: no -server, and "multiple" only on the order of 4 or 5 times, but the results seem stable.
11:42rhickey-server rocks
11:43rhickeysome parts of Clojure can be 4-10x faster
11:43cgrandchouser: .length is unfair: with map-slurp it returns the byte-size and slurp the character size
11:44Chousercgrand: good point! I hadn't thought of that.
11:44rhickeyyeah, running through is all that matters
11:45ChouserI included the length example mainly to show slurp just falls over at that size.
11:47ChouserI also assume there are things you might want to do with slurp where you really want a String, where a CharSequence won't cut it.
11:47rhickeythat's the diff between eager and lazy
11:48ChouserUsing the toString method there would presumably destroy the benefit of map-slurp
11:48rhickeyfor the memory usage related benefits
11:48Chouserrhickey: yeah, I guess that's true. I hadn't thought of it that way, but this is lazy right through the OS down to the disk.
11:49rhickeyI think it is really interesting, need to look more at CharSequence
11:53rhickeya lot of the bridging Clojure does to String in API funcs could be done at CharSequence level
11:53rhickeyseq/nth/get/count
11:56ChouserI wonder if there's a better way to do toString there, too. I'm copying from the mapped buffer into an array, and then I think String makes another copy.
11:56Chousercan I proxy byte[]?
11:58cgrandchouser: I don't think so
12:03cgrandchouser: whatever (except another String) you pass to a String constructor will get copied because it's mutable (char[] or byte[] or StringBuilder/StringBuffer)...
12:08Chousersure, but I'd like to copy once instead of twice
12:08ChouserI'd like to hand a CharSequence directly to String, for example, instead of having to copy into an array first.
12:13cgrandthe best you can do is to build a char[] from the ByteBuffer and pass it to String :-( (If you pass a byte[], this array is copied before decoding (!) and then a char[] is allocated...)
12:15Chouser:-(
12:23cgrandhave you tried Charset.forName("UTF-8").newDecoder().decode(bytebuffer).subSequence... to get chouser: a CharSequence with correct charAt and length?
12:24cgrand(oops "chouser:" should have been at the start of the message...)
12:43Chouserheh. no, I didn't. What I've got already really stretched my Java (and JavaDoc) abilities.
12:43ChouserLet me try it...
12:51Chouserin that expression, decode() is eager, isn't it?
12:52cgrandI think it works by chunks, let me check
13:00cgrander.. You're right it's eager... if you want to process the input lazily you'll have to split it into multiple bytebuffers and write a charsequence proxy which delegates to the subsequences etc. :-<
13:02cgrandthe good news is that CharsetDecoder is stateful and hence should work even if you split the input in the middle of a multibyte character... pfff... no pain no gain I guess
13:03ChouserI just realized that charAt in my proxy isn't right for multibyte encodings anyway.
13:04ChouserFor the same reason you already pointed out for length.
13:04rhickeyasCharBuffer doesn't do the right things?
13:04ChouserI can't figure out how to tell asCharBuffer to use a specific encoding, and its default appears to be incorrect for an ASCII file.
13:08Chouserit's weird to me that the docs for CharBuffer make no mention of encoding at all.
13:08cgrandjust looked at the source for asCharBuffer: it's not pretty: UTF-16 is hardcoded (so to bypass all the encoding stuff)
13:09Chouserok, that's the impression I was starting to get from the docs.
13:10cgrandThe only way to go from a ByteBuffer to a CharBuffer with a specific encoding is through CharsetDecoder...
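
[Editor's sketch] The asCharBuffer behaviour discussed here can be checked with a few bytes of ASCII. The helper names are made up; `Charset.decode` is used as a convenience that runs a CharsetDecoder internally.

```clojure
(import '(java.nio ByteBuffer)
        '(java.nio.charset Charset StandardCharsets))

(defn utf16-view-length
  "Number of chars asCharBuffer sees: every two bytes become one char,
  with no charset decoding involved."
  [^bytes bs]
  (.length (.asCharBuffer (ByteBuffer/wrap bs))))

(defn decode-bytes
  "Decodes bs with an explicitly chosen charset."
  [^Charset cs ^bytes bs]
  (str (.decode cs (ByteBuffer/wrap bs))))

(utf16-view-length (.getBytes "hi!!" "US-ASCII"))              ; 2, not 4
(decode-bytes StandardCharsets/US_ASCII
              (.getBytes "hi!!" "US-ASCII"))                   ; "hi!!"
```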
13:11Chouserthere's really a fundamental problem for lazy file reading here. For fixed width encodings, it's easy(UTF-16 for asCharBuffer, ASCII for my proxy)
13:12rhickeyvariable byte chars stink and always have
13:12Chouserfor variable-width (like UTF-8), what do you do when someone asks for charAt 55? To do it correctly, you must scan to that point.
13:13ChouserI bet re-seq is scanning the input in order anyway, though, so there may still be speed improvements over slurp available here.
13:15cgrandtrue random access in strings is not that common (you always scan in one sense or the other)
13:16cgrand(interesting post on that stuff http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html)
13:16cgrand(string representation, not java.nio)
13:27Chousernice post, thanks.
13:37rhickeyit's a shame you can't piggyback on CharsetDecoder lazily, but it's hardwired to decode to CharBuffer, which is not an interface, but a large abstract class I imagine decode uses very little of. The java.io guys really need to learn about interfaces from the java.util guys
13:39ChouserI imagine it won't be too terribly hard to decode chunks of an input ByteBuffer into CharBuffers lazily, and make available as a CharSequence.
13:39ChouserSome clever use of lazy-cons may even make it somewhat attractive.
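
[Editor's sketch] One way the chunked, lazy decoding floated here might look, assuming a current Clojure (`lazy-seq` instead of `lazy-cons`). `decoded-chunks` is a hypothetical name. The sketch assumes `chunk-bytes` is at least as large as the longest encoded character (4 for UTF-8), and it skips the decoder's final `flush` step, which some charsets may need. Bytes of a character split across a chunk boundary are simply left in the input buffer and picked up by the next step.

```clojure
(import '(java.nio ByteBuffer CharBuffer)
        '(java.nio.charset Charset StandardCharsets))

(defn decoded-chunks
  "Lazy seq of strings decoded from buf, reading at most chunk-bytes
  input bytes per step. Mutates buf's position and limit as it goes."
  [^ByteBuffer buf ^Charset cs chunk-bytes]
  (let [decoder (.newDecoder cs)
        end     (.limit buf)
        out     (CharBuffer/allocate (int chunk-bytes))
        step    (fn step []
                  (lazy-seq
                    (when (< (.position buf) end)
                      (let [start (.position buf)
                            lim   (min end (+ start (long chunk-bytes)))
                            last? (= lim end)]
                        ;; Expose only the next chunk to the decoder.
                        (.limit buf (int lim))
                        (.clear out)
                        (let [result (.decode decoder buf out (boolean last?))]
                          (when (.isError result)
                            (.throwException result))
                          (when (= start (.position buf))
                            (throw (IllegalArgumentException.
                                     "chunk-bytes too small to hold one character"))))
                        (.flip out)
                        (cons (str out) (step))))))]
    (step)))

;; Usage: the chunks concatenate back to the original text.
(apply str (decoded-chunks
             (ByteBuffer/wrap (.getBytes "h\u00e9llo w\u00f6rld \u20ac" "UTF-8"))
             StandardCharsets/UTF_8
             4))
```

This still does not give cheap `charAt` or `length`; it only covers the in-order scanning case.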
13:41rhickeyrelying on the consuming code not calling charAt very far ahead, or length?
13:41Chouserooh, I hadn't thought of length.
13:41ChouserBut yes, O(n) access for charAt.
13:41cgrandor very far behind unless you retain the head...
13:42ChouserI guess it would have to be O(n) for length too.
13:42rhickeybut charAt is all you've got in CharSequence
13:42rhickeyit's not much of a sequence
13:43Chouserbah, re-seq uses the results of length.
13:44rhickeyI think all of the coolness of FileChannel.map disappears for variable-byte char files
13:44ChouserHmph. mmap is still generally more efficient than buffered reads.
13:45ChouserBut that doesn't mean you're wrong.
13:45cgrandwhat about doing a first decoding pass (by chunks) to remember some (char-offset, byte-offset) pairs (eg every 1k bytes) and then using this info to make charAt O(1)?
13:45rhickeyis 2 passes still more efficient than buffered reads?
13:46ChouserWhat about re-writing Java regex so it doesn't need length?
13:46rhickeythere's an amazing lack of connection between java.io and java.nio
13:47cgrandchouser: or better composing the encoding and the regex to get a regex on bytes... :-)
13:47rhickeycan't strap readers onto channels?
13:48Chouserrhickey: what would that buy you?
13:48rhickeythey must have the decoder logic built in
13:51Chouserio.InputStreamReader seems to, yes.
13:55rhickeycould extend InputStream for ByteBuffer, to test your mmap vs buffered io theory
13:56Chouseryep, halfway done. ;-)
13:56Chousercan I proxy methods with the same name and different arities?
13:56rhickeyone function handles all arities, just use the normal Clojure arity overloading
13:58Chouserok
14:01rhickeynot sure if this is useful: http://www.exampledepot.com/egs/java.nio/Buffer2Stream.html
14:09Chouseryep, I didn't realize I could provide such a small subset of InputStream methods.
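
[Editor's sketch] A minimal InputStream over a ByteBuffer via `proxy`, showing the point about arities: one `read` function with several argument lists covers all of the Java overloads. `buffer-input-stream` is a made-up name, and this is not the code from the linked example or from Chouser's paste.

```clojure
(import '(java.io InputStream)
        '(java.nio ByteBuffer))

(defn buffer-input-stream
  "Wraps buf in an InputStream that reads from the buffer's current position."
  ^InputStream [^ByteBuffer buf]
  (let [read-into (fn [^bytes dest off len]
                    (if (.hasRemaining buf)
                      (let [n (min (int len) (.remaining buf))]
                        (.get buf dest (int off) (int n))
                        n)
                      -1))]
    (proxy [InputStream] []
      ;; All overloads of read dispatch to this one function by arity.
      (read
        ([] (if (.hasRemaining buf)
              (bit-and (.get buf) 0xff)   ; unsigned byte value, 0-255
              -1))
        ([dest] (read-into dest 0 (alength ^bytes dest)))
        ([dest off len] (read-into dest off len))))))

;; Usage: anything that accepts an InputStream can now consume the buffer.
(slurp (buffer-input-stream (ByteBuffer/wrap (.getBytes "hello" "US-ASCII"))))
```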
14:42Chouserhttp://n01se.net/paste/8Oq -- cgrand's suggestion
14:42ChouserIt's not lazy, but it's apparently pretty efficient.
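
[Editor's sketch] The pasted code is not reproduced in this log. Under that caveat, an eager "map the file, then decode it in one go" function along the lines cgrand suggested might look like this; `mmap-decode` and its parameters are hypothetical names.

```clojure
(import '(java.io RandomAccessFile)
        '(java.nio.channels FileChannel$MapMode)
        '(java.nio.charset Charset))

(defn mmap-decode
  "Memory-maps the file at path read-only and eagerly decodes it with the
  named charset. Returns a CharBuffer, which is a CharSequence."
  [^String path ^String charset-name]
  (with-open [raf (RandomAccessFile. path "r")]
    (let [ch  (.getChannel raf)
          buf (.map ch FileChannel$MapMode/READ_ONLY 0 (.size ch))]
      ;; decode copies the characters out, so the result stays valid
      ;; after the channel is closed.
      (.decode (Charset/forName charset-name) buf))))

;; Usage, with a path of your own:
;; (re-seq #"\w+" (mmap-decode "/path/to/file.txt" "UTF-8"))
```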
16:20ChouserIssues with mmap in Java: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4724038
16:24rhickeyouch
16:25Chouserbut that's only on Windows. Does anyone still use that?
16:26rhickeyis it only Windows?
16:27rhickeyor just where reported?
16:29ChouserI'm not sure, but I think the reason it's a problem to keep the file mapped is that Windows locks access to it while open.
16:30rhickeyI guess I shouldn't change slurp just yet :)
16:31ChouserLinux would probably allow you to unlink the file, for example, and keep the blocks on disk until the map is GC'ed, while you can go ahead and reuse the filename.
20:34Chouserif I have a list of pairs [[1 2] [3 4] ...], it's very natural to consume them lazily using (for [[a b] lst] ...)
20:35Chouserif instead I have a list that I want to consume as pairs [1 2 3 4 ...], I can't think of any natural way to consume them lazily.
20:36Chouser(for [[a b] (apply array-map lst)] ...) ; convenient, but eager
20:39rhickey(map vector (take-nth 2 x) (take-nth 2 (rest x)))
20:41Chouserhm. better than the recursive lazy-cons thing I was growing...
20:42ChouserI don't suppose that's something that could be shimmed into destructuring?
20:42rhickeyI wanted to write a take-ns that would do that...
20:43rhickeythinking about destructuring now...
20:44ChouserI can't even think of how the syntax would work for destructuring, let alone how to implement it.
20:44rhickeyit's not really a good fit for destructuring
20:44Chousergenerally destructuring takes what it wants and throws away the rest.
20:44rhickeyright
20:44Chousertake-ns sounds nice, though. And you already wrote it. ;-)
20:46rhickey(defn take-ns [n xs]
20:46rhickey (when (seq xs)
20:46rhickey (lazy-cons (take n xs) (take-ns n (drop n xs)))))
20:58Chouserbeautiful
21:03jonathan_ncie
21:03jonathan_nice
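
[Editor's sketch] `lazy-cons` was replaced by `lazy-seq` in later Clojure releases, so rhickey's take-ns needs a small adaptation to run on a current version. The version below is that adaptation, not the original code; later versions of clojure.core also provide `partition-all`, which gives the same grouping.

```clojure
(defn take-ns
  "Lazily splits xs into consecutive groups of n items."
  [n xs]
  (lazy-seq
    (when (seq xs)
      (cons (take n xs) (take-ns n (drop n xs))))))

;; Consuming a flat list as pairs, lazily:
(for [[a b] (take-ns 2 [1 2 3 4])]
  (+ a b))                                  ; (3 7)

;; Works on an infinite seq because nothing is realized eagerly:
(take 2 (take-ns 2 (range)))                ; ((0 1) (2 3))

;; rhickey's one-liner from earlier in the log gives the same pairs:
(let [x [1 2 3 4]]
  (map vector (take-nth 2 x) (take-nth 2 (rest x))))   ; ([1 2] [3 4])
```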