Hi,

I'm developing a module that does some scraping. It does this by executing shell commands from PHP to get a web page as a text string (using curl on Ubuntu).

The problem I'm having is with sites that contain non-ASCII characters: those characters show up as a black diamond with a question mark inside, in the variable itself (i.e. when I print_r/echo the variable). For example, the Portuguese character 'ô' from a web page gets converted to a '�'. I suspect it has something to do with the server?

That being the case, the additional problem this causes is that when I scrape the page to populate fields on a form, any form field containing the invalid characters is simply empty (I'm guessing because it's an invalid string?).
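
One plausible explanation for the empty fields, sketched in PHP: htmlspecialchars() returns an empty string when its input is not valid in the given encoding, unless ENT_SUBSTITUTE or ENT_IGNORE is passed. Whether the form layer here actually runs values through htmlspecialchars() is an assumption, and the sample string assumes the page bytes are ISO-8859-1. The mbstring extension is needed for mb_check_encoding().

```php
<?php
// 'ê' is the single byte 0xEA in ISO-8859-1, but the two bytes 0xC3 0xAA in UTF-8.
$latin1 = "Prefer\xEAncias";     // assumed: what the scraper actually received
$utf8   = "Prefer\xC3\xAAncias"; // the same word as valid UTF-8

// The Latin-1 bytes do not form a valid UTF-8 string.
$isValid = mb_check_encoding($latin1, 'UTF-8');

// Without ENT_SUBSTITUTE / ENT_IGNORE, invalid input yields an empty string.
$escapedLatin1 = htmlspecialchars($latin1, ENT_QUOTES, 'UTF-8');
$escapedUtf8   = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');

var_dump($isValid);       // bool(false)
var_dump($escapedLatin1); // string(0) ""
var_dump($escapedUtf8 === $utf8); // bool(true)
```

If this is what is happening, the field is empty because the escaping step rejects the whole string, not because the scrape itself failed.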

If anyone has any advice on getting the text of a web page without this issue, or at least a workaround so that I can convert the invalid string to a valid one, it'd be appreciated.

Cheers!

Comments

dazmcg:

Actually after playing around a bit, let me clarify the problem I'm having:

Using curl/wget in either of the following ways:

wget http://www.google.com.br/
curl http://www.google.com.br > t.txt

I then view either of the resulting files in a text editor (I have tried 'vim' and 'text editor') and international characters display 'normally', e.g.:
"Preferências"

However when viewed using utilities like 'less' the same word is displayed as: "Preferncias".

Now, it's not really 'less' that I have a problem with. The problem is that when I load the text file via PHP, or execute a command like:
$page = shell_exec("curl www.google.com.br");

it seems to mess up the international characters, which display as a black diamond with a question mark inside.

I have tried "html_entity_decode()" but that doesn't help...
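
A short sketch of why html_entity_decode() is unlikely to help here: it only turns entities such as &ecirc; into characters and does not transcode raw bytes. The sample strings are assumptions; in particular, it is assumed that the page arrives as raw ISO-8859-1 bytes rather than as entities. mb_check_encoding() needs the mbstring extension.

```php
<?php
// An HTML entity is decoded to the matching UTF-8 character...
$fromEntity = html_entity_decode('Prefer&ecirc;ncias', ENT_QUOTES, 'UTF-8');

// ...but a raw ISO-8859-1 byte (0xEA for 'ê') is not an entity, so there is
// nothing for html_entity_decode() to decode.
$rawLatin1 = "Prefer\xEAncias";
$stillRaw  = html_entity_decode($rawLatin1, ENT_QUOTES, 'UTF-8');

var_dump($fromEntity === "Prefer\xC3\xAAncias");  // bool(true)
var_dump($stillRaw === $rawLatin1);               // the bytes come back as they went in
var_dump(mb_check_encoding($stillRaw, 'UTF-8'));  // bool(false): still not valid UTF-8
```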

Any tips?

dazmcg:

For posterity, and for anyone having a similar issue:

I had tried the "html_entity_decode" and "htmlspecialchars" functions, thinking they should have handled UTF-8 issues. The problem was solved by applying the "utf8_encode($string)" function! DOH!
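
For context, utf8_encode() converts ISO-8859-1 text to UTF-8, which suggests the scraped page was ISO-8859-1; it is deprecated as of PHP 8.2. Below is a minimal sketch of the same conversion using mb_convert_encoding() from the mbstring extension. The helper name to_utf8() and the ISO-8859-1 default are assumptions for illustration, not part of any module or library.

```php
<?php
// Hypothetical helper: the name and the ISO-8859-1 default are assumptions.
function to_utf8(string $text, string $sourceEncoding = 'ISO-8859-1'): string
{
    // Leave the text alone if it is already valid UTF-8, to avoid double encoding.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise transcode from the assumed source encoding.
    return mb_convert_encoding($text, 'UTF-8', $sourceEncoding);
}

// 'ê' as the single Latin-1 byte 0xEA becomes the UTF-8 pair 0xC3 0xAA.
$fixed = to_utf8("Prefer\xEAncias");
var_dump($fixed === "Prefer\xC3\xAAncias"); // bool(true)

// Running it again on the result changes nothing.
var_dump(to_utf8($fixed) === $fixed); // bool(true)
```

The validity check is only a heuristic: a few ISO-8859-1 byte sequences also happen to be valid UTF-8, so when the page's real charset is known (for example from its Content-Type header), passing it as $sourceEncoding is the safer choice.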