I'm using the geshifilter module on my site to highlight code written in an uncommon language. One problem I've noticed is that when people copy the highlighted text from their browser and paste it into the compiler for this language, the compiler gives errors, depending on the browser. I've traced this back to the fact that GeSHi adds &nbsp; entities wherever there are empty lines. Since a non-breaking space is not an ASCII character, different browsers seem to put different character codes on the clipboard when copying this text, and the compiler chokes on any character code above 127.

I am having GeSHi use the <pre> container to highlight my code, and that minimizes the number of &nbsp; characters, but does not eliminate them.

Do you have any ideas about how I could get rid of these, either in GeSHi itself or via the geshifilter module? I can't think of a way in the 5.x version, which is what I'm using. I haven't looked at the 6.x version, but it seems unlikely that you've added this feature since it's probably necessary only rarely.

If it's not possible to do right now, would you be willing to accept a patch that added this kind of functionality? For my purposes, just adding the following line at the end of _geshifilter_process() worked, but this is clearly not a robust solution:

$text = str_replace('&nbsp;', '', $text);

It might be best for me to just write my own small filter to do this, but I thought I'd see if this would be something you'd be interested in doing in the geshifilter module first.

Comments

soxofaan

Status: Active » Closed (won't fix)

The &nbsp; entities are needed to make indentation work, so str_replace('&nbsp;', '', $text); will break code indentation, which is not nice ;)

Also, I can't reproduce this with the browsers (Firefox, Safari) and editors (TextEdit, Komodo, even pasting directly into the command line) I tried, so I think the problem is tied to that particular compiler.

Which language and compiler are you talking about actually?

This seems to be a very exotic corner case. I don't think this "bug" should be fixed in the GeSHi library or the GeSHi filter module, as it is a problem in your copy-paste-to-compiler pipeline.

Like you suggested, writing a small post-processing filter (or maybe using your str_replace hack) seems the best solution for your use case.
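A post-processing step of that kind could be sketched as follows. This is only an illustration, not code from GeSHi or the GeSHi filter module: the function name example_nbsp_to_space_in_pre() is hypothetical, the anonymous function assumes PHP 5.3 or later, and the sketch assumes the highlighted code is wrapped in <pre> (as described above), where plain spaces are rendered as written, so swapping &nbsp; for a regular space should keep the indentation intact.

```php
<?php
/**
 * Hypothetical helper (not part of GeSHi or the GeSHi filter module):
 * swaps &nbsp; entities for plain spaces, but only inside <pre>...</pre>.
 */
function example_nbsp_to_space_in_pre($html) {
  return preg_replace_callback(
    // Match each <pre ...>...</pre> block, across newlines, case-insensitively.
    '#<pre\b[^>]*>.*?</pre>#is',
    function ($match) {
      // Whitespace inside <pre> is preserved by the browser, so a regular
      // space keeps the layout while staying within plain ASCII.
      return str_replace('&nbsp;', ' ', $match[0]);
    },
    $html
  );
}

// Sample input: one highlighted block with an "empty" line, then normal markup.
$sample = "<pre>Function Foo()\n&nbsp;\n\tVariable x</pre><p>a&nbsp;b</p>";
print example_nbsp_to_space_in_pre($sample) . "\n";
```

Entities outside <pre> are left alone on purpose, since regular spaces would collapse there.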

aclight

Sorry to bring up an older issue, but there was a bug in project_issue on d.o which caused me not to get an email when you replied to this.

If you want to see this happen for yourself, a page where this behavior can be seen is http://www.igorexchange.com/node/664

I believe that every empty line in the highlighted code has an &nbsp;, but I know for sure that the first empty line (right after the block of red comments) does. In my tests, only Safari on the Mac copies the &nbsp; as a non-breaking space instead of a normal space. When you paste the code into most text editors, the editor itself converts the non-breaking space into a regular space.

However, the editor/compiler in question here, Igor Pro, does not do this. The reason I've been given is that some people paste megabytes of data from the clipboard into text windows, and they don't want to slow this down by scanning all of the pasted text and replacing invalid characters with spaces. If you're so inclined you can download a demo version of Igor Pro for yourself at http://www.wavemetrics.com/support/demos.htm, but I suspect you'll trust me :)

Here's the output of a little script I wrote that, for each character in the clipboard, displays the ASCII representation of the character and the numerical value of the character.

When copied in Safari on a Mac, I get this:

  char:-->t<--, num:116
  char:-->i<--, num:105
  char:-->c<--, num:99
  char:-->
<--, num:13
  char:--> <--, num:-54       <-- -54 is the &nbsp
  char:-->
<--, num:13
  char:-->	<--, num:9
  char:-->V<--, num:86
  char:-->a<--, num:97
  char:-->r<--, num:114

When copied from Firefox (FF) on the Mac, I get this:

  char:-->t<--, num:116
  char:-->i<--, num:105
  char:-->c<--, num:99
  char:-->
<--, num:13
  char:--> <--, num:32     <-- FF has converted to regular space
  char:-->
<--, num:13
  char:-->	<--, num:9
  char:-->V<--, num:86
  char:-->a<--, num:97
  char:-->r<--, num:114
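
For anyone who wants to produce a similar dump, here is a rough PHP sketch. It is not the script used above: the function name example_dump_bytes() is hypothetical, and it assumes each byte is reported as a signed 8-bit value. Under that assumption, one plausible reading of the -54 is byte 0xCA (202 - 256 = -54), which is the non-breaking space in the Mac OS Roman encoding; whether the clipboard actually used that encoding is a guess.

```php
<?php
/**
 * Hypothetical helper: lists each byte of $text together with its value
 * as a signed 8-bit number, in the same "char / num" format as the dumps.
 */
function example_dump_bytes($text) {
  $lines = array();
  $length = strlen($text);
  for ($i = 0; $i < $length; $i++) {
    $value = ord($text[$i]);
    // Map 128..255 onto -128..-1, the way a signed char would report it.
    if ($value > 127) {
      $value -= 256;
    }
    $lines[] = sprintf('char:-->%s<--, num:%d', $text[$i], $value);
  }
  return $lines;
}

// "\xCA" stands in for a Mac OS Roman non-breaking space (assumed encoding).
print implode("\n", example_dump_bytes("c\r\xCA\r\tV")) . "\n";
```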

Thanks at least for taking a look. I realize this is probably a pretty rare edge case, since it requires two programs that both fail to convert &nbsp; into a space.

soxofaan

Status: Closed (won't fix) » Active

For the moment I don't have much time to look into it.

Anyway, consider that the GeSHi filter module is just a wrapper around the third party GeSHi library.
Not trying to point fingers, but all the syntax highlighting heavy lifting happens inside the GeSHi library, so the source of your problem is probably inside the GeSHi library. The GeSHi filter module contains just some Drupal flavored UI stuff basically.
I'm not very familiar with the inner workings of the GeSHi library itself, so maybe you should also try your luck by submitting a support request to the GeSHi library developers.

aclight

Status: Active » Closed (won't fix)

Won't fix is fine with me. I just wrote a simple filter that fixes things up after geshifilter does its processing.

As you said earlier, just removing &nbsp; entities will mess up the formatting in some cases, so reporting this as a bug in the GeSHi library itself probably isn't the right thing to do.

Thanks for taking a look.