2008-02-22
| 14:14 | Chouser | rhickey: fyi, I adjusted xml.clj slightly to use tagsoup instead of java's sax parser, and it's working quite nicely. |
| 14:15 | rhickey | cool, want to put it on the group? |
| 14:19 | rhickey | The author of TagSoup, John Cowan, is on the Clojure group |
| 14:23 | Chouser | heh, coo. |
| 14:23 | Chouser | cool |
| 14:23 | Chouser | well, what do you think of adding an optional parameter to xml.clj's parse to allow specifying a parser? |
| 14:24 | Chouser | I haven't tried to do that yet, but I assume it would be easy |
| 14:24 | rhickey | Is that all it takes? sure |
| 14:24 | Chouser | ok, if I get that working, I'll post it to the group. |
| 14:24 | rhickey | great |
| 14:51 | Chouser | rhickey: there. what could be easier? |
| 14:52 | rhickey | thanks |
| 16:08 | Chouser | huh. I think I just found a bug in xml.clj |
| 16:09 | Chouser | <td>some <b>bold</b> text</td> when parsed includes neither "some" nor "text", only "bold" |
| 16:18 | rhickey | I'll look at it |
| 16:18 | Chouser | ok, thanks. I can see the problem, but I'm not sure how best to fix it. |
| 16:19 | Chouser | characters can be called when *state* is :between, and usually that should be just fine. |
| 16:20 | Chouser | startElement would have to handle pushing an *sb* like endElement does |
| 16:23 | rhickey | yes on the startElement |
| 16:24 | rhickey | :between is kind of a broken notion. I put it in to deal with the junk ws/nl stuff I get from the SAX parser in places where no one would consider there to be interleaved text, and I didn't want to create content entries for it |
| 16:24 | Chouser | ok |
| 16:24 | rhickey | I'll have to dump ws-only character content to avoid that |
| 16:31 | Chouser | well, I don't mind the whitespace for now. |
| 16:31 | Chouser | I've got a sufficiently patched-up version I can proceed... |
| 16:37 | Chouser | whee! Ok, so to do the equivalent of the xpath: //td[b = 'Listing #']/node()[position() = last()] |
| 16:37 | Chouser | I can say: (seq-filter html flatten :td [:b "Listing #"] #(first (reverse (% :content)))) |
| 16:39 | rhickey | seq-filter? |
| 16:39 | Chouser | where "flatten" is a function that means "//" |
| 16:39 | albino | rhickey: Are you the principal creator of clojure? |
| 16:39 | Chouser | Um, yeah, lousy name. All the names are lousy, but it works. |
| 16:39 | rhickey | yes |
| 16:40 | albino | rhickey: do you get paid to do it? |
| 16:40 | rhickey | no |
| 16:41 | albino | rhickey: does anyone else make core contributions, or are you pretty much on your own? |
| 16:41 | rhickey | just me |
| 16:42 | albino | rhickey: very impressive, thanks for letting me take some of your time |
| 16:42 | rhickey | sure |
| 16:43 | Chouser | seq-filter is a macro that mainly applies mapcat to each expr, passing the result to the next expr. |
| 16:44 | Chouser | then sprinkle in a little sugar for tag names (:td), sub-queries ([...]), and content-matching for strings ("Listing #"), and you've got most of what you need for a flexible query system for xml.clj-produced vector/maps. |
| 16:45 | Chouser | and it's all lazy |
| 16:45 | rhickey | neat |
| 16:45 | Chouser | yeah, once I actually use it a bit more so as to wear down the rough edges, I hope to share it. |
| 16:45 | Chouser | Got any better ideas for the name? |
| 16:46 | Chouser | mapcat-> |
| 16:59 | rhickey | attempted fix for xml.clj is up |
| 17:03 | Chouser | thanks! |
| 17:05 | Chouser | works for me, and thanks for including my little patch. :-) |
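
The mapcat-chaining idea Chouser describes for seq-filter can be sketched roughly as below. This is not his macro, which is never shown in the log: it uses plain functions instead of a macro, and every name here (chain, descendants-or-self, tag=, having, children, text=, last-child) is hypothetical. The sugar for bare keywords, vectors, and strings is replaced by explicit helper calls, and the input is assumed to be the nested :tag/:attrs/:content maps that xml.clj's parse produces.

```clojure
;; Each step is a function from one node to a (possibly empty) seq of nodes.
(defn chain
  "Threads a seq of nodes through each step with mapcat. The results stay lazy."
  [nodes & steps]
  (reduce (fn [acc step] (mapcat step acc)) nodes steps))

(defn descendants-or-self
  "Rough stand-in for the log's `flatten` (XPath //): the node plus everything below it."
  [node]
  (tree-seq map? :content node))

(defn children [node]
  (when (map? node) (:content node)))

(defn tag=
  "Step that keeps a node only if it is an element with tag t."
  [t]
  (fn [node] (when (and (map? node) (= t (:tag node))) [node])))

(defn text=
  "Step that keeps an element whose content is exactly the string s."
  [s]
  (fn [node] (when (= [s] (:content node)) [node])))

(defn having
  "Sub-query step: keeps a node when the nested steps match something under it."
  [& steps]
  (fn [node] (when (seq (apply chain [node] steps)) [node])))

(defn last-child [node]
  (when-let [c (seq (:content node))] [(last c)]))

;; Made-up sample data in the shape xml.clj returns.
(def html
  {:tag :table :attrs nil
   :content [{:tag :tr :attrs nil
              :content [{:tag :td :attrs nil
                         :content [{:tag :b :attrs nil :content ["Listing #"]} "12345"]}
                        {:tag :td :attrs nil
                         :content [{:tag :b :attrs nil :content ["Price"]} "$1"]}]}]})

;; Roughly //td[b = 'Listing #']/node()[position() = last()]
(def result
  (chain [html]
         descendants-or-self
         (tag= :td)
         (having children (tag= :b) (text= "Listing #"))
         last-child))
;; result => ("12345")
```

Because every step returns a seq, a filter is just a step that returns either a one-element vector or nil, which is what lets a single mapcat cover both navigation and filtering.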