HTTP Status 500 - Illegal character (NULL, unicode 0) encountered
Jeremy - June 27, 2009 - 03:33
| Project: | Apache Solr Search Integration |
| Version: | 6.x-1.0-beta11 |
| Component: | Code |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed |
Description
Having recently upgraded a large website to 6.x-1.0-beta11, indexing has stopped and the following error is being reported in the watchdog each time cron runs:
HTTP Status 500 - Illegal character (NULL, unicode 0) encountered: not valid in any content
at [row,col {unknown-source}]: [119,109]The full error is reported as follows:
"500" Status: Internal Server Error<html><head><title>Apache Tomcat/5.5.27 - Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style> </head><body><h1>HTTP Status 500 - Illegal character (NULL, unicode 0) encountered: not valid in any content
at [row,col {unknown-source}]: [119,109]
java.io.IOException: Illegal character (NULL, unicode 0) encountered: not valid in any content
at [row,col {unknown-source}]: [119,109]
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:73)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
at java.lang.Thread.run(Unknown Source)
Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character (NULL, unicode 0) encountered: not valid in any content
at [row,col {unknown-source}]: [119,109]
at com.ctc.wstx.sr.StreamScanner.constructNullCharException(StreamScanner.java:640)
at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:669)
at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4668)
at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:278)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
... 19 more
</h1><HR size="1" noshade="noshade"><p><b>type</b> Status report</p><p><b>message</b> <u>Illegal character (NULL, unicode 0) encountered: not valid in any content
at [row,col {unknown-source}]: [119,109]
java.io.IOException: Illegal character (NULL, unicode 0) encountered: not valid in any content
at [row,col {unknown-source}]: [119,109]
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:73)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
at java.lang.Thread.run(Unknown Source)
Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character (NULL, unicode 0) encountered: not valid in any content
at [row,col {unknown-source}]: [119,109]
at com.ctc.wstx.sr.StreamScanner.constructNullCharException(StreamScanner.java:640)
at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:669)
at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4668)
at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:278)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
... 19 more
</u></p><p><b>description</b> <u>The server encountered an internal error (Illegal character (NULL, unicode 0) encountered: not valid in any content
at [row,col {unknown-source}]: [119,109]
java.io.IOException: Illegal character (NULL, unicode 0) encountered: not valid in any content
at [row,col {unknown-source}]: [119,109]
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:73)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:875)
at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
at java.lang.Thread.run(Unknown Source)
Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character (NULL, unicode 0) encountered: not valid in any content
at [row,col {unknown-source}]: [119,109]
at com.ctc.wstx.sr.StreamScanner.constructNullCharException(StreamScanner.java:640)
at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:669)
at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4668)
at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:278)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
... 19 more
) that prevented it from fulfilling this request.</u></p><HR size="1" noshade="noshade"><h3>Apache Tomcat/5.5.27</h3></body></html>We have tried dropping the entire index and rebuilding it from scratch, however it runs into the same problems.

#1
That's odd - maybe there is a problem with the code intended to strip all such chars?
return preg_replace('@[\x00-\x08\x0B\x0C\x0E-\x1F]@', ' ', $text);Is it possible some odd multi-byte char could have the 1st byte stripped and then leave a null byte?
Can you isolate the offention portion(s) of that node?
#2
We're trying to isolate the offending node. Can you give us some direction to that end?
Here's what the search index page tells us:
The search index is generated by running cron. 3% of the site content has been sent to the server. There are 88328 items left to send.
The server has a 4 min delay before updates are processed.
Number of documents in index: 3266 (66 sent but not yet processed)
Numer of pending deletions: 0
.... We're thinking 3266 (or 3277) might be the offending node but we're not sure what order Solr indexes in.
Thanks,
Kevin
#3
When Solr hits a bug like this, the module will not continue indexing as if everything is fine.
So I would recommend setting the nodes per cron run to 100 or something, and keep running it. When it craps out, change it to 50 or something. If it still craps out, reduce to 20, etc until you narrow in on the one which is causing the problem (in which case you are runnin 1 per cron run).
Alternately, you can facilitate this process a little if you can write a little code by getting into the error handling during the index operation, and spitting out whatever vars will help you identify the node or the raw XML that was sent.
HTH,
Jacob
#4
Okay, so I'm guessing Solr just starts at the beginning of the node table (nid =1) and works it's way down the list. (apachesolr_search_node lists nodes in a similar fashion).
So, guessing again, when the index page says it was able to index 3266 documents, that should mean the problem is with the 3267th node in the node table.
But wait, Solr doesn't index documents with status = 0 so I'll need to exclude those from my count.
I think I understand. I'll work through the logic(?) listed above and see if I can pinpoint the problematic node.
#5
I think I was able to find the problematic node and set it to unpublished.
The first try (counting nodes with status =1) did not work so my presumption above about Solr ignoring documents with status =0 was probably incorrect.
When I tried a new query, and ignored status, I was able to disable a range of nodes (just to be safe) that appears to include the one with the offending character.
I found the "range" with this query:
mysql> set @num = 0;
Query OK, 0 rows affected (0.00 sec)
mysql> select (@num:=@num+1) AS row, nid from apachesolr_search_node order by nid limit 3300;
+------+------+
| row | nid |
+------+------+
| 1 | 4 |
| 2 | 7 |
...
| 3264 | 5218 |
| 3265 | 5219 |
| 3266 | 5220 |
| 3267 | 5221 |
| 3268 | 5222 |
| 3269 | 5223 |
...
| 3299 | 5263 |
| 3300 | 5264 |
+------+------+
3300 rows in set (0.00 sec)
Then I "unpublished" nodes in the range by setting their status = 0:
mysql> update node set status = 0 where nid IN(5223,5222,5221,5220,5219,5218) limit 6;Finally, I deleted the entire Solr index and waited for the magical 3267th node to be indexed. After 24 hours, we're now indexing node 41,000 so I believe the above steps were a success.
After the site has completed indexing (couple more days) I'll try to isolate the exact node that was causing the problems.
#6
We did have similar problems previously with odd character encodings (hence the regex I posted above). Obviously we'd be happy to make it more robust if there is a way to do so.
#7
Thanks pwolanin,
Do you know of a way that we can tell the sequence documents are indexed and which ones have (or have not) been indexed?
We're running into the same error at "Number of documents in index: 52345" but we're having a devil of a time locating the offending node.
The steps we took before (described above) are not working this time.
Hopefully re-indexing is not a step that we need to take ... having made it this far into the process, we're reluctant to start over.
TIA,
Kevin
#8
The apachesolr module maintains its own table and an index into that table as a Drupal variable. Use the devel module or look in the variables table to find the value of 'apachesolr_index_last'. Assuming you can limit to a few nodes indexed per cron run this should let you narrow closely to the problem node, e.g.:
SELECT asn.nid, asn.changed FROM apachesolr_search_node asn WHERE (asn.changed > %d OR (asn.changed = %d AND asn.nid > %d)) AND asn.status = 1 ORDER BY asn.changed ASC, asn.nid ASC LIMIT 0,5;Where you have to sub in for %d the values that are stored in that variable. This will tell you then next 5 nodes in the queue to be indexed.
Note, that if you want to unpublish nodes and have apachesolr respect that, you must either use the UI or use node_save(). Touching {node} directly means that the {apachesolr_search} table is not updated.
#9
> Note, that if you want to unpublish nodes and have apachesolr respect that, you must
> either use the UI or use node_save(). Touching {node} directly means that the
> {apachesolr_search} table is not updated.
Reviewing the code, it seems that _apachesolr_nodeapi_update() is called each time cron runs. Wouldn't that remove content from apachesolr_search if content is unpublished manually?
#10
Oh, yeah, it will get fixed up on cron, thouh it was really intended for the case where an update to solr fails.
#11
I applied the attached patch to track down the problematic node -- any chance of applying it or something like it? It dumps a list of up to 20 nodes in the chunk last sent to Solr when indexing fails.
(To track down the specific bad node, I temporarily hacked the following line in the same function:
$docs_chunk = array_chunk($documents, 20);changing 20 to 1 -- this resulted in a very specific error:
"Indexing failed on one of the following nodes: #" where # was the problematic node that we were unable to index.)
Indexing apparently failed due to the following characters in a comment:
"à uß%������˜‡í ����������������"
The characters do not appear to be copying and pasting correctly in my browser, however, so the above string may work for you in your testing.
#12
Yes, something like that patch looks like a good idea, though I might just concat that text with the watchdog call we are already making?
Try adding this additional processing to the indexing step prior to the step where we strip low bytes chars:
$text = iconv("UTF-8", "UTF-8//IGNORE", $text)We tried this at one point but it did not actually strip low byte chars (e.g. char 30). If your problem is due to some funky encoding or multi-byte char, this might be useful as an initial clean-up to insure that the text looks reasonably like the expected utf-8
#13
We will test this within the next few days and let you know whether or not it fixes this problem.
As for whether you concat the text or display it in its own watchdog is not important, just that the data is available to quickly debug and track down problematic nodes.
#14
Where is the best place to put that code? I tried adding it to the top of apachesolr_strip_ctl_chars(), but it did not help. We still get the "500 - Illegal character" error.
#15
There are 2 places it might be needed. See patch.
#16
Actually this might be better - it's a little silly to get the preg_replace in 2 places.
#17
I tried both in my previous testing, but neither worked. I also tried numerous variations involving combinations of iconv(), mb_convert_encoding() and mb_detect_encoding(), but none solved the problem.
Recently I unpublished the offending comment, and in doing so the problematic binary content was evidently stripped away. Re-publishing it we were able to index it without problems. I suspect now that this comment was either originally imported from another CMS, or created with a much earlier version of Drupal. In any case, I unfortunately no longer have the content available for easy testing.
#18
Ok, forget those patches. d.o is having the same issue and I think I tracked it down - if there is an encoded character like #031 we decode it with html_entity_decode() and it slips though because we removed those characters before the decode step, rather than after.
#19
I just confirmed the fix locally for one known problem from d.o.
#20
Confirm this patch works.
Added to a node, upon saving the node ApacheSolr tries to index it, and watchdog reports the Java exception.
Adding in the patch and indexing that same node fixes the problem.
#21
updated with the additional logging
#22
committed to 6.x
#23
Deployed on drupal.org, seems to work as expected.
#24
Automatically closed -- issue fixed for 2 weeks with no activity.