While indexing my document base, I stumbled upon a document with a control char (8) in the path (don't ask...)

I had to modify the apachesolr_node_to_document() function as in the code below. (Sorry for not providing a patch, but I'm not currently in sync with the *latest* CVS, though I always keep an eye on it... I need to test Solr 1.4 first.)

...
      if ($output && $output != $path) {
        $document->path = $output;
      }
...

became

...
      if ($output && $output != $path) {
        $document->path = apachesolr_strip_ctl_chars($output);
      }
...
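
For readers wondering what such a stripping step might look like, here is a minimal sketch. It is not the module's actual apachesolr_strip_ctl_chars(); the function name example_strip_ctl_chars() is hypothetical, and the choice to replace each control character with a space (rather than drop it) is an assumption. The character range is the set of ASCII control characters that XML 1.0 does not allow, i.e. everything below 0x20 except tab, line feed and carriage return.

```php
<?php
// Hypothetical helper for illustration only; not the module's real function.
function example_strip_ctl_chars($text) {
  // Replace every ASCII control character except tab (0x09), LF (0x0A)
  // and CR (0x0D) with a space. These bytes never occur inside a UTF-8
  // multibyte sequence, so the replacement is safe for UTF-8 text.
  return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $text);
}

// A path containing a backspace (character 8), as in the report above.
$path = "forum/topic" . chr(8) . "123";
$clean = example_strip_ctl_chars($path);
```

Replacing with a space rather than deleting keeps adjacent words from being fused together, which matters more for body text than for paths.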
Comment  File  Size  Author
#1  clean-path-360227-1.patch  683 bytes  pwolanin

Comments

pwolanin’s picture

Status: Active » Needs review
Status: new, file size: 683 bytes

Seems like a reasonable change. Ideally we might actually do this in the underlying PHP library for all fields added to a document.
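
As a rough sketch of that idea, the stripping could happen once, at the point where a field value is stored on the document, so that no caller has to remember it. The class ExampleSolrDocument and its methods below are hypothetical stand-ins written for this illustration; they are not the API of the PHP Solr client library.

```php
<?php
// Hypothetical stand-in for a Solr document class; not the real library API.
class ExampleSolrDocument {
  protected $fields = array();

  // Replace ASCII control characters (except tab, LF, CR) with a space.
  // Note: the value is cast to a string, which is an assumption of this sketch.
  public static function stripCtlChars($value) {
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', (string) $value);
  }

  // Every value passes through the filter before it is stored.
  public function setField($name, $value) {
    $this->fields[$name] = self::stripCtlChars($value);
  }

  public function getField($name) {
    return isset($this->fields[$name]) ? $this->fields[$name] : NULL;
  }
}

$document = new ExampleSolrDocument();
$document->setField('path', "node/1" . chr(8));
$stored = $document->getField('path');
```

The trade-off is a small cost on every field in exchange for not having to patch each call site individually.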

Please check this patch.

flexer’s picture

The patch is OK.

You're right... Anyway, I'm indexing 700,000 documents (comments from a very old phpbb2 forum) and I hit this problem for ONE document only.

BTW, I wrote a custom apachesolr_node_to_document() to index every comment as a single Solr document, using an isfield as the $cid... it works very nicely :)

flexer’s picture

Status: Needs review » Reviewed & tested by the community
pwolanin’s picture

Status: Reviewed & tested by the community » Fixed

Committed to 6.x.

pwolanin’s picture

Status: Fixed » Closed (fixed)