Even text that Drupal considers valid UTF-8 may contain control characters (or possibly high-byte characters) that cause Solr indexing to fail and Jetty to return an HTTP 500 error.

See, for example: http://www.mail-archive.com/solr-user@lucene.apache.org/msg13727.html

Note: the apachesolr_attachments module already strips one control character because of this problem, but other characters cause the same issue.

Comment  File                     Size     Author
#2       utf8-fix-342728-2.patch  4.79 KB  pwolanin

Comments

pwolanin’s picture

I can make node bodies that cause these errors using code like this:

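// Load an existing node (nid 8 here) to use as a test case.
$node = node_load(8);

// Append 9 chunks of raw 16-byte MD5 output, i.e. effectively random bytes.
for ($i = 1; $i < 10; $i++) {
  $node->body .= md5($node->body . mt_rand(), TRUE);
}
// Drop byte sequences that are not valid UTF-8; ASCII control characters survive this step.
$node->body = iconv("utf-8", "utf-8//IGNORE", $node->body);

node_save($node);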

or like this (from the link above):

echo "Andr\005é 3000";
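
The second test string shows the problem in isolation: it is valid UTF-8, yet it contains a control character (octal 005) that XML 1.0 does not allow. Below is a minimal sketch of such a check. The function name example_has_xml_invalid_ctl_chars() is hypothetical and not part of any module, and the character range is taken from the XML 1.0 definition of legal characters, not from the attached patch.

```php
<?php
// Hypothetical helper: TRUE if $text contains an ASCII control character
// that XML 1.0 forbids (everything below 0x20 except tab, LF and CR).
function example_has_xml_invalid_ctl_chars($text) {
  return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $text) === 1;
}

// Same string as above, with the "é" written as explicit UTF-8 bytes.
$text = "Andr\005\xC3\xA9 3000";

// With the /u modifier, preg_match() fails on invalid UTF-8,
// so a result of 1 means the string is well-formed UTF-8.
var_dump(preg_match('//u', $text) === 1);           // bool(true)
var_dump(example_has_xml_invalid_ctl_chars($text)); // bool(true)
```
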
pwolanin’s picture

Status: Active » Needs review
Status  Size
new     4.79 KB

This patch eliminates the errors I can produce with the testing methods above.
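
For readers without the patch at hand, here is a rough sketch of one way to neutralize such characters before text is sent to Solr. It is an illustration under stated assumptions, not the contents of utf8-fix-342728-2.patch: the function name example_strip_ctl_chars() is hypothetical, and the range covers only the ASCII control characters that XML 1.0 disallows.

```php
<?php
// Hypothetical helper: replace ASCII control characters that XML 1.0
// forbids with a space. Tab (0x09), LF (0x0A) and CR (0x0D) are kept.
function example_strip_ctl_chars($text) {
  return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $text);
}

// The "é" is written as explicit UTF-8 bytes so the file encoding does not matter.
$clean = example_strip_ctl_chars("Andr\005\xC3\xA9 3000");
var_dump($clean === "Andr \xC3\xA9 3000"); // bool(true)
```

Replacing with a space instead of deleting keeps words on either side of the control character from being merged in the index. Bytes in this range never occur inside a multi-byte UTF-8 sequence, so the replacement cannot corrupt valid characters.
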

pwolanin’s picture

Version: 6.x-1.x-dev » 5.x-1.x-dev
Status: Needs review » Patch (to be ported)

Committed to 6.x.

pwolanin’s picture

Version: 5.x-1.x-dev » 6.x-1.x-dev
Status: Patch (to be ported) » Closed (fixed)