2008-04-17
| 08:48 | Chouser | Hm, perhaps slurp should use FileChannel.map |
| 08:49 | rhickey | That could be cool - never used the NIO stuff |
| 08:51 | rhickey | can you get it into a string without it being in memory twice? |
| 08:54 | Chouser | I dunno, but it might be possible. |
| 08:56 | Chouser | Docs say read is faster for files smaller than "a few tens of kilobytes" |
| 08:57 | Chouser | You get a subclass of ByteBuffer, not a string. |
| 09:02 | rhickey | are you proposing slurp return a ByteBuffer? |
| 09:03 | rhickey | Chouser: did you log yesterday's irc? I was offline |
| 09:04 | Chouser | you didn't miss much: http://n01se.net/chouser/clojure-log/2008-04-16.html |
| 09:04 | Chouser | should I just let google loose on those logs? We can always move them later. |
| 09:05 | rhickey | sure |
| 09:05 | rhickey | thanks |
| 09:05 | Chouser | ok |
| 09:14 | drewr | Works in Safari, but not FF3. |
| 09:17 | jteo | true |
| 09:25 | Chouser | I've had people complain about some of my other sites in FF3. Works fine for me, though. |
| 09:26 | drewr | You on OS X? |
| 09:26 | Chouser | nope, Linux. |
| 09:26 | Chouser | I can try OS X. That's where it's failing for you? |
| 09:27 | drewr | Yup. Just get a blank gradient background. |
| 09:27 | cgrand | Chouser: I think FF3 chokes on the empty script tag (FF3/win xp) (if I edit it it works) |
| 09:27 | jteo | same here. FF3. |
| 09:30 | Chouser | hm! |
| 09:31 | Chouser | FF2 on OS X works fine. I don't have FF3 on that machine yet. |
| 09:32 | Chouser | The script tag has a src="". Without that you're missing the navigation links, right? |
| 09:33 | drewr | I'm experimenting with this tracing code that rhickey posted to the list. I've evaled all the forms at the REPL, but when I get to (trace fact), I get "no such var: clojure/traced." What's the deal? |
| 09:33 | cgrand | chouser: use <script type="text/javascript" src="irc.js"></script> instead of <script type="text/javascript" src="irc.js"/> (<script> should not be "collapsed") |
| 09:33 | drewr | It should be in the user ns. |
| 09:34 | Chouser | cgrand: oh, of course. thanks. |
| 09:39 | Chouser | there, how's that? |
| 09:39 | drewr | Chouser: :-) |
| 09:40 | cgrand | chouser: works |
| 09:40 | Chouser | great, thanks for your help. |
| 09:44 | Chouser | ByteBuffer seems pretty hard to deal with, compared to String. |
| 09:45 | cgrand | chouser: will you regenerate the logs from before 2008-04-13? |
| 09:47 | Chouser | cgrand: I don't have them. |
| 09:48 | Chouser | My IRC client is generating the raw logs that I'm using, so I've got nothing from before I joined the channel. |
| 09:48 | Chouser | oh, wait. |
| 09:48 | Chouser | sorry, misunderstood |
| 09:48 | cgrand | :-) |
| 09:49 | Chouser | hm, those should have been done already. |
| 09:56 | cgrand | chouser: last modification time says 13-Apr-2008 12:52 :-( |
| 10:03 | Chouser | there, try that. |
| 10:10 | cgrand | chouser: perfect |
| 10:58 | rhickey | drewr: trace working now? |
| 11:08 | drewr | rhickey: No, not yet. |
| 11:09 | drewr | Can't figure out why it wants traced to be in clojure's namespace. |
| 11:13 | Chouser | How do I write a type hint for a Java byte[]? #^byte[] doesn't work. ;-) |
| 11:15 | rhickey | you can type hint arrays with the java.lang.Class.getName format as a String: #^"[B" |
| 11:16 | Chouser | yum! ok. |
| 11:18 | Chouser | hehe. this syntax highlighter totally wigs out on that. |
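
The string in that type hint is the JVM's own name for the array class, as reported by `Class.getName`. A minimal Java check of where `"[B"` comes from (the class name `ArrayClassNames` and the helper `nameOf` are made up for illustration):

```java
final class ArrayClassNames {
    /** Returns the runtime name the JVM uses for a class, e.g. "[B" for byte[]. */
    static String nameOf(Class<?> c) {
        return c.getName();
    }
}
```

Here `nameOf(byte[].class)` gives `"[B"`, `nameOf(int[][].class)` gives `"[[I"`, and `nameOf(String[].class)` gives `"[Ljava.lang.String;"`.
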
| 11:19 | rhickey | drewr: want to try the latest (819) with clean/build? |
| 11:19 | drewr | Hm, I'm at 818. Let me try that. |
| 11:21 | Chouser | re-seq on a 3MB file: with slurp 1016 msecs, with map-slurp 375 msecs |
| 11:23 | drewr | rhickey: Didn't help. |
| 11:24 | drewr | ...for the Repl. It works now in Script. |
| 11:24 | Chouser | re-seq on a 13MB file: with map-slurp 1285 msecs, with slurp OutOfMemoryError: Java heap space |
| 11:25 | rhickey | are you using asCharBuffer? |
| 11:27 | Chouser | no, I couldn't get that to work for me. |
| 11:27 | rhickey | drewr: just did the same thing here, works in Repl fine, hmm... |
| 11:27 | Chouser | It looks like maybe asCharBuffer is interpreting the bytes as UTF-16 or something. |
| 11:27 | rhickey | Chouser: so what does map-slurp do? |
| 11:27 | Chouser | I wrote my own CharSequence proxy |
| 11:28 | drewr | Another related question. TRACE and UNTRACE both RESOLVE the function that gets passed in, however, when I do that directly, I get a ClassCastException. What's the difference between (resolve fact) at the REPL and that inside the macro? |
| 11:28 | rhickey | (resolve 'fact) |
| 11:29 | drewr | Yes, but TRACE doesn't call (resolve (quote f)), it calls (resolve f). |
| 11:29 | vincenz | drewr: but it's a macro, so it's passed in the symbol |
| 11:29 | vincenz | drewr: that had me scratching my head a long time too, why is trace a macro |
| 11:29 | drewr | vincenz: Ah, thanks. |
| 11:30 | vincenz | it's to get the name of the function in there, not the value |
| 11:30 | drewr | Of course. |
| 11:30 | rhickey | like doc, trace is a macro because it's really a repl-user convenience thing |
| 11:31 | rhickey | I don't recommend doing that generally for things that take symbols |
| 11:31 | drewr | rhickey: BTW, I did a C-c C-k to compile and load the file, and it worked doing that. I'm not sure why C-M-x on the forms didn't work originally. |
| 11:32 | rhickey | I used C-M-x on each form |
| 11:32 | drewr | Interesting. |
| 11:33 | rhickey | but get same error as you when I C-M-x on all the forms! |
| 11:33 | drewr | Hm. What's the difference between your last two comments? |
| 11:34 | rhickey | one-at-a-time vs block |
| 11:34 | vincenz | rather odd |
| 11:34 | drewr | Can you do C-M-x on a region? What do you mean by block? |
| 11:35 | rhickey | region, I don't know if it is supposed to work |
| 11:36 | Chouser | There's probably a better way to do this, but here's my map-slurp: http://n01se.net/paste/F0I |
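
The pasted code is not reproduced in this log. As a rough Java sketch of the idea described here (a `CharSequence` laid over a memory-mapped file, one byte per char), something like the following could work. The names `MappedChars`, `mapSlurp`, and `countMatches` are hypothetical, and the one-byte-per-char assumption only holds for single-byte encodings such as ASCII:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** CharSequence view over bytes; correct only for single-byte encodings such as ASCII. */
final class MappedChars implements CharSequence {
    private final ByteBuffer buf;
    private final int start;
    private final int end;

    MappedChars(ByteBuffer buf, int start, int end) {
        this.buf = buf;
        this.start = start;
        this.end = end;
    }

    /** Memory-maps the whole file read-only; a single mapping is limited to 2 GB. */
    static MappedChars mapSlurp(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            ByteBuffer b = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return new MappedChars(b, 0, b.limit());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override public int length() {
        return end - start;
    }

    @Override public char charAt(int i) {
        if (i < 0 || i >= length()) throw new IndexOutOfBoundsException("index " + i);
        return (char) (buf.get(start + i) & 0xFF); // one byte is treated as one char
    }

    @Override public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) throw new IndexOutOfBoundsException();
        return new MappedChars(buf, start + from, start + to);
    }

    /** Eager copy into a String; this is the step that gives up the memory benefit. */
    @Override public String toString() {
        StringBuilder sb = new StringBuilder(length());
        for (int i = 0; i < length(); i++) sb.append(charAt(i));
        return sb.toString();
    }

    /** Counts regex matches by scanning the CharSequence directly. */
    static int countMatches(Pattern p, CharSequence cs) {
        Matcher m = p.matcher(cs);
        int n = 0;
        while (m.find()) n++;
        return n;
    }
}
```

`java.util.regex.Pattern.matcher` accepts any `CharSequence`, which is why a view like this can be scanned without first building a `String`.
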
| 11:37 | drewr | I generally do this: M-< to get to the top of the buffer, and then C-M-x, C-M-e all the way down. That's what failed me the first time with the trace stuff. Not sure why that's different. |
| 11:37 | drewr | I had a clean JVM because I restarted after I rebuilt clojure.jar. |
| 11:40 | rhickey | Chouser: interesting. I'm not sure the length times matter, but the re-seq time diff is something. Are you running -server, multiple tries? |
| 11:41 | rhickey | I've found generally that laziness has provided a whole additional set of benefits in performance due to reduced heap pressure, in spite of the ephemeral garbage it generates |
| 11:42 | Chouser | rhickey: no -server, and "multiple" only on the order of 4 or 5 times, but the results seem stable. |
| 11:42 | rhickey | -server rocks |
| 11:43 | rhickey | some parts of Clojure can be 4-10x faster |
| 11:43 | cgrand | chouser: .length is unfair: with map-slurp it returns the byte-size and slurp the character size |
| 11:44 | Chouser | cgrand: good point! I hadn't thought of that. |
| 11:44 | rhickey | yeah, running through is all that matters |
| 11:45 | Chouser | I included the length example mainly to show slurp just falls over at that size. |
| 11:47 | Chouser | I also assume there are things you might want to do with slurp where you really want a String, where a CharSequence won't cut it. |
| 11:47 | rhickey | that's the diff between eager and lazy |
| 11:48 | Chouser | Using the toString method there would presumably destroy the benefit of map-slurp |
| 11:48 | rhickey | for the memory usage related benefits |
| 11:48 | Chouser | rhickey: yeah, I guess that's true. I hadn't thought of it that way, but this is lazy right through the OS down to the disk. |
| 11:49 | rhickey | I think it is really interesting, need to look more at CharSequence |
| 11:53 | rhickey | a lot of the bridging Clojure does to String in API funcs could be done at CharSequence level |
| 11:53 | rhickey | seq/nth/get/count |
| 11:56 | Chouser | I wonder if there's a better way to do toString there, too. I'm copying from the mapped buffer into an array, and then I think String makes another copy. |
| 11:56 | Chouser | can I proxy byte[]? |
| 11:58 | cgrand | chouser: I don't think so |
| 12:03 | cgrand | chouser: whatever (except another String) you pass to a String constructor will get copied because it's mutable (char[] or byte[] or StringBuilder/StringBuffer)... |
| 12:08 | Chouser | sure, but I'd like to copy once instead of twice |
| 12:08 | Chouser | I'd like to hand a CharSequence directly to String, for example, instead of having to copy into an array first. |
| 12:13 | cgrand | the best you can do is to build a char[] from the ByteBuffer and pass it to String :-( (If you pass a byte[], this array is copied before decoding (!) and then a char[] is allocated...) |
| 12:15 | Chouser | :-( |
| 12:23 | cgrand | have you tried Charset.forName("UTF-8").newDecoder().decode(bytebuffer).subSequence... to get chouser: a CharSequence with correct charAt and length? |
| 12:24 | cgrand | (oops "chouser:" should have been at the start of the message...) |
| 12:43 | Chouser | heh. no, I didn't. What I've got already really stretched my Java (and JavaDoc) abilities. |
| 12:43 | Chouser | Let me try it... |
| 12:51 | Chouser | in that expression, decode() is eager, isn't it? |
| 12:52 | cgrand | I think it works by chunks, let me check |
| 13:00 | cgrand | er.. You're right it's eager... if you want to process the input lazily you'll have to split it into multiple bytebuffers and write a charsequence proxy which delegates to the subsequences etc. :-< |
| 13:02 | cgrand | the good news is that CharsetDecoder is stateful and hence should work even if you split the input in the middle of a multibyte character... pfff... no pain no gain I guess |
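
A hedged Java sketch of chunked decoding with `CharsetDecoder` follows. The class `ChunkedDecode` and method `decodeInChunks` are illustrative names, and a byte array stands in for the mapped file. One detail worth noting: with the three-argument `decode`, the bytes of a character split across chunks are left unread in the input buffer, so the caller has to carry them over to the next call (done here with `compact()`).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

final class ChunkedDecode {
    /** Decodes UTF-8 input, feeding the decoder at most chunkSize new bytes per step. */
    static String decodeInChunks(byte[] utf8, int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        // Room for one chunk plus up to 3 carried-over bytes of a split character.
        ByteBuffer in = ByteBuffer.allocate(chunkSize + 4);
        CharBuffer out = CharBuffer.allocate(chunkSize + 4);
        StringBuilder sb = new StringBuilder();
        int offset = 0;
        while (true) {
            int n = Math.min(chunkSize, utf8.length - offset);
            in.put(utf8, offset, n);
            offset += n;
            boolean last = offset == utf8.length;
            in.flip();
            CoderResult r = decoder.decode(in, out, last);
            if (r.isError()) throw new IllegalArgumentException("undecodable input: " + r);
            in.compact();          // keep any bytes of an incomplete character
            out.flip();
            sb.append(out);        // drain the chars decoded in this step
            out.clear();
            if (last) break;
        }
        decoder.flush(out);        // emit anything the decoder still holds
        out.flip();
        sb.append(out);
        return sb.toString();
    }
}
```

With a chunk size of 1, every multibyte character is split across calls, and the result should still equal the original string.
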
| 13:03 | Chouser | I just realized that charAt in my proxy isn't right for multibyte encodings anyway. |
| 13:04 | Chouser | For the same reason you already pointed out for length. |
| 13:04 | rhickey | asCharBuffer doesn't do the right things? |
| 13:04 | Chouser | I can't figure out how to tell asCharBuffer to use a specific encoding, and its default appears to be incorrect for an ASCII file. |
| 13:08 | Chouser | it's weird to me that the docs for CharBuffer make no mention of encoding at all. |
| 13:08 | cgrand | just looked at the source for asCharBuffer: it's not pretty: UTF-16 is hardcoded (so to bypass all the encoding stuff) |
| 13:09 | Chouser | ok, that's the impression I was starting to get from the docs. |
| 13:10 | cgrand | The only way to go from a ByteBuffer to a CharBuffer with a specific encoding is through CharsetDecoder... |
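
The difference between the two routes can be seen in a small Java sketch (the class `BufferViews` and its method names are illustrative only). `asCharBuffer` reinterprets pairs of bytes as 16-bit chars in the buffer's byte order, while a charset decode actually interprets the encoding:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

final class BufferViews {
    /** Reinterprets the bytes as 16-bit chars in the buffer's byte order; no charset is consulted. */
    static String viaAsCharBuffer(byte[] bytes) {
        return ByteBuffer.wrap(bytes).asCharBuffer().toString();
    }

    /** Decodes the bytes with an explicit charset (eagerly, into a new CharBuffer). */
    static String viaDecoder(byte[] bytes, Charset cs) {
        CharBuffer chars = cs.decode(ByteBuffer.wrap(bytes));
        return chars.toString();
    }
}
```

For the two ASCII bytes of `"hi"` (0x68, 0x69), `viaAsCharBuffer` yields the single char U+6869, whereas `viaDecoder` with US-ASCII yields `"hi"`.
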
| 13:11 | Chouser | there's really a fundamental problem for lazy file reading here. For fixed-width encodings, it's easy (UTF-16 for asCharBuffer, ASCII for my proxy). |
| 13:12 | rhickey | variable byte chars stink and always have |
| 13:12 | Chouser | for variable-width (like UTF-8), what do you do when someone asks for charAt 55? To do it correctly, you must scan to that point. |
| 13:13 | Chouser | I bet re-seq is scanning the input in order anyway, though, so there may still be speed improvements over slurp available here. |
| 13:15 | cgrand | true random access in strings is not that common (you always scan in one sense or the other) |
| 13:16 | cgrand | (interesting post on that stuff http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html) |
| 13:16 | cgrand | (string representation, not java.nio) |
| 13:27 | Chouser | nice post, thanks. |
| 13:37 | rhickey | it's a shame you can't piggyback on CharsetDecoder lazily, but it's hardwired to decode to CharBuffer, which is not an interface, but a large abstract class I imagine decode uses very little of. The java.io guys really need to learn about interfaces from the java.util guys |
| 13:39 | Chouser | I imagine it won't be too terribly hard to decode chunks of an input ByteBuffer into CharBuffers lazily, and make available as a CharSequence. |
| 13:39 | Chouser | Some clever use of lazy-cons may even make it somewhat attractive. |
| 13:41 | rhickey | relying on the consuming code not calling charAt very far ahead, or length? |
| 13:41 | Chouser | ooh, I hadn't thought of length. |
| 13:41 | Chouser | But yes, O(n) access for charAt. |
| 13:41 | cgrand | or very far behind unless you retain the head... |
| 13:42 | Chouser | I guess it would have to be O(n) for length too. |
| 13:42 | rhickey | but charAt is all you've got in CharSequence |
| 13:42 | rhickey | it's not much of a sequence |
| 13:43 | Chouser | bah, re-seq uses the results of length. |
| 13:44 | rhickey | I think all of the coolness of FileChannel.map disappears for variable-byte char files |
| 13:44 | Chouser | Hmph. mmap is still generally more efficient than buffered reads. |
| 13:45 | Chouser | But that doesn't mean you're wrong. |
| 13:45 | cgrand | what about doing a first decoding pass (by chunks) to remember some (char-offset, byte-offset) pairs (eg every 1k bytes) and then using this info to make charAt O(1)? |
| 13:45 | rhickey | is 2 passes still more efficient than buffered reads? |
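
One way to picture the checkpoint idea is the Java sketch below. It is a simplification under stated assumptions: the input is valid UTF-8, positions are counted in code points rather than UTF-16 chars, checkpoints are taken every `stride` code points instead of every 1k bytes, and the class name `Utf8Index` is invented for this example. A lookup jumps to the nearest checkpoint and walks forward at most `stride - 1` code points, so its cost is bounded by the stride rather than by the file size.

```java
import java.nio.ByteBuffer;

/** Sparse index from code-point offsets to byte offsets over valid UTF-8 (bytes 0..limit). */
final class Utf8Index {
    private final ByteBuffer bytes;
    private final int stride;
    private final int[] checkpoints; // checkpoints[k] = byte offset of code point k * stride
    private final int length;        // total number of code points

    Utf8Index(ByteBuffer bytes, int stride) {
        if (stride <= 0) throw new IllegalArgumentException("stride must be positive");
        this.bytes = bytes;
        this.stride = stride;
        int limit = bytes.limit();
        int[] marks = new int[limit / stride + 1];
        int count = 0;
        // First pass: every byte not of the form 10xxxxxx starts a code point.
        for (int pos = 0; pos < limit; pos++) {
            boolean leadByte = (bytes.get(pos) & 0xC0) != 0x80;
            if (leadByte) {
                if (count % stride == 0) marks[count / stride] = pos;
                count++;
            }
        }
        this.checkpoints = marks;
        this.length = count;
    }

    int length() {
        return length;
    }

    /** Returns the code point at the given code-point index. */
    int codePointAt(int index) {
        if (index < 0 || index >= length) throw new IndexOutOfBoundsException("index " + index);
        int pos = checkpoints[index / stride];
        for (int skip = index % stride; skip > 0; skip--) {
            pos += sequenceLength(bytes.get(pos));
        }
        int b0 = bytes.get(pos) & 0xFF;
        int len = sequenceLength(bytes.get(pos));
        if (len == 1) return b0;
        int cp = b0 & (0xFF >> (len + 1));           // payload bits of the lead byte
        for (int i = 1; i < len; i++) {
            cp = (cp << 6) | (bytes.get(pos + i) & 0x3F); // 6 payload bits per continuation byte
        }
        return cp;
    }

    /** Length in bytes of the UTF-8 sequence that starts with the given lead byte. */
    private static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;
        if ((b & 0xE0) == 0xC0) return 2;
        if ((b & 0xF0) == 0xE0) return 3;
        return 4;
    }
}
```

The index costs one full pass over the bytes up front, which is the trade-off being questioned in the conversation.
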
| 13:46 | Chouser | What about re-writing Java regex so it doesn't need length? |
| 13:46 | rhickey | there's an amazing lack of connection between java.io and java.nio |
| 13:47 | cgrand | chouser: or better composing the encoding and the regex to get a regex on bytes... :-) |
| 13:47 | rhickey | can't strap readers onto channels? |
| 13:48 | Chouser | rhickey: what would that buy you? |
| 13:48 | rhickey | they must have the decoder logic built in |
| 13:51 | Chouser | io.InputStreamReader seems to, yes. |
| 13:55 | rhickey | could extend InputStream for ByteBuffer, to test your mmap vs buffered io theory |
| 13:56 | Chouser | yep, halfway done. ;-) |
| 13:56 | Chouser | can I proxy methods with the same name and different arities? |
| 13:56 | rhickey | one function handles all arities, just use the normal Clojure arity overloading |
| 13:58 | Chouser | ok |
| 14:01 | rhickey | not sure if this is useful: http://www.exampledepot.com/egs/java.nio/Buffer2Stream.html |
| 14:09 | Chouser | yep, I didn't realize I could provide such a small subset of InputStream methods. |
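
For reference, a minimal Java sketch of an `InputStream` over a `ByteBuffer` that overrides only the two `read` methods, then hands decoding to `InputStreamReader`. This is not the code from the link or the paste; `ByteBufferInputStream` and `slurp` are names chosen for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

/** InputStream that reads from a ByteBuffer's current position up to its limit. */
final class ByteBufferInputStream extends InputStream {
    private final ByteBuffer buf;

    ByteBufferInputStream(ByteBuffer buf) {
        this.buf = buf;
    }

    @Override public int read() {
        return buf.hasRemaining() ? (buf.get() & 0xFF) : -1;
    }

    @Override public int read(byte[] dst, int off, int len) {
        if (len == 0) return 0;
        if (!buf.hasRemaining()) return -1;
        int n = Math.min(len, buf.remaining());
        buf.get(dst, off, n);
        return n;
    }

    /** Reads the whole buffer into a String, letting InputStreamReader do the decoding. */
    static String slurp(ByteBuffer source, Charset cs) {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[8192];
        // duplicate() so the caller's buffer position is left untouched
        try (Reader r = new InputStreamReader(new ByteBufferInputStream(source.duplicate()), cs)) {
            int n;
            while ((n = r.read(chunk)) != -1) sb.append(chunk, 0, n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

The buffer passed to `slurp` could be a `MappedByteBuffer` from `FileChannel.map`, which is the comparison against buffered reads being discussed.
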
| 14:42 | Chouser | http://n01se.net/paste/8Oq -- cgrand's suggestion |
| 14:42 | Chouser | It's not lazy, but it's apparently pretty efficient. |
| 16:20 | Chouser | Issues with mmap in Java: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4724038 |
| 16:24 | rhickey | ouch |
| 16:25 | Chouser | but that's only on Windows. Does anyone still use that? |
| 16:26 | rhickey | is it only Windows? |
| 16:27 | rhickey | or just where reported? |
| 16:29 | Chouser | I'm not sure, but I think the reason it's a problem to keep the file mapped is that Windows locks access to it while open. |
| 16:30 | rhickey | I guess I shouldn't change slurp just yet :) |
| 16:31 | Chouser | Linux would probably allow you to unlink the file, for example, and keep the blocks on disk until the map is GC'ed, while you can go ahead and reuse the filename. |
| 20:34 | Chouser | if I have a list of pairs [[1 2] [3 4] ...], it's very natural to consume them lazily using (for [[a b] lst] ...) |
| 20:35 | Chouser | if instead I have a list that I want to consume as pairs [1 2 3 4 ...], I can't think of any natural way to consume them lazily. |
| 20:36 | Chouser | (for [[a b] (apply array-map lst)] ...) ; convenient, but eager |
| 20:39 | rhickey | (map vector (take-nth 2 x) (take-nth 2 (rest x))) |
| 20:41 | Chouser | hm. better than the recursive lazy-cons thing I was growing... |
| 20:42 | Chouser | I don't suppose that's something that could be shimmed into destructuring? |
| 20:42 | rhickey | I wanted to write a take-ns that would do that... |
| 20:43 | rhickey | thinking about destructuring now... |
| 20:44 | Chouser | I can't even think of how the syntax would work for destructuring, let alone how to implement it. |
| 20:44 | rhickey | it's not really a good fit for destructuring |
| 20:44 | Chouser | generally destructuring takes what it wants and throws away the rest. |
| 20:44 | rhickey | right |
| 20:44 | Chouser | take-ns sounds nice, though. And you already wrote it. ;-) |
| 20:46 | rhickey | (defn take-ns [n xs] |
| 20:46 | rhickey | (when (seq xs) |
| 20:46 | rhickey | (lazy-cons (take n xs) (take-ns n (drop n xs))))) |
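
For readers following along on the Java side, a loose analogue of `take-ns` can be written as an `Iterator` that builds each group only when it is asked for. This is a sketch, not a translation of Clojure's lazy sequences; the class name `TakeNs` and method `takeNs` are illustrative.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class TakeNs {
    /** Lazily groups the source into lists of up to n items; the last group may be shorter. */
    static <T> Iterator<List<T>> takeNs(final int n, final Iterator<T> source) {
        if (n <= 0) throw new IllegalArgumentException("n must be positive");
        return new Iterator<List<T>>() {
            @Override public boolean hasNext() {
                return source.hasNext();
            }

            @Override public List<T> next() {
                if (!source.hasNext()) throw new NoSuchElementException();
                List<T> group = new ArrayList<>(n);
                // Pull only as many items as this group needs.
                while (group.size() < n && source.hasNext()) {
                    group.add(source.next());
                }
                return group;
            }
        };
    }
}
```

Grouping `1 2 3 4 5` by twos this way produces `[1, 2]`, `[3, 4]`, then `[5]`, which matches what the `take`/`drop` recursion above does with a trailing odd element.
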
| 20:58 | Chouser | beautiful |
| 21:03 | jonathan_ | ncie |
| 21:03 | jonathan_ | nice |