Hi,
I'm developing a module that does some scraping. It does this by executing shell commands from PHP to fetch a web page as a text string (using curl on Ubuntu).
The problem I'm having is with sites that contain foreign characters: those characters show up as a black diamond with a question mark inside, in the variable itself (i.e. when I print_r/echo the variable). For example, the Portuguese character 'ô' from a web page gets converted to a '�'. I'd guess it has to do with the server?
That being the case, the additional problem this causes is that when I scrape the page to populate fields on a form, any field containing the invalid characters is simply empty (I'm guessing because it's an invalid string?).
If anyone has any advice on getting the text of a web page without this issue, or at least a workaround so that I can convert the invalid string to a valid one, it'd be appreciated.
Cheers!
Comments
clarification
Actually after playing around a bit, let me clarify the problem I'm having:
Using Curl /wget in either of the following ways:
I then view either of the text files in a text editor (I have tried 'vim' and 'Text Editor') and international characters display 'normally', e.g.:
"Preferências"
However, when viewed using utilities like 'less', the same word is displayed as: "Preferncias".
Now it's not really 'less' that I have a problem with, but when I try to load the text file via PHP or execute a command like:
$page = shell_exec("curl www.google.com.br");
it seems to mess up the international characters, which display as a black diamond with a question mark inside.
I have tried "html_entity_decode()" but that doesn't help...
Any tips?
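A quick way to narrow this down is to check whether the fetched bytes are valid UTF-8 at all. The sketch below is only an illustration: the sample string is a hypothetical stand-in for the curl output, and it assumes the page is served as ISO-8859-1 (where 'ê' is the single byte 0xEA) and that the mbstring extension is available. It also shows one plausible reason for the empty form fields: htmlspecialchars() returns an empty string when its input is not valid in the given encoding, unless ENT_SUBSTITUTE or ENT_IGNORE is passed.

```php
<?php
// Hypothetical sample standing in for the curl output:
// "Preferências" as ISO-8859-1 bytes, where 'ê' is the single byte 0xEA.
$latin1 = "Prefer\xEAncias";

// Check whether the bytes form valid UTF-8. A lone 0xEA followed by an
// ASCII letter is not a valid UTF-8 sequence.
$isUtf8 = mb_check_encoding($latin1, 'UTF-8');

// With an invalid byte sequence and no ENT_SUBSTITUTE / ENT_IGNORE flag,
// htmlspecialchars() gives back an empty string instead of the text.
$escaped = htmlspecialchars($latin1, ENT_QUOTES, 'UTF-8');

var_dump($isUtf8);
var_dump($escaped);
```

If the check fails on the real page, the text is most likely in another encoding (often ISO-8859-1 or Windows-1252) and needs converting before it is displayed or put into a form.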
for posterity and anyone
For posterity and anyone having a similar issue...
I had tried the "html_entity_decode" and "htmlspecialchars" functions, thinking they should have handled the UTF-8 issues. Solved by applying the "utf8_encode($string)" function! DOH!
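For context, utf8_encode() converts ISO-8859-1 bytes to UTF-8, which is why it fixes this case. It is deprecated as of PHP 8.2, and calling it on text that is already UTF-8 double-encodes it. Below is a minimal sketch of a more defensive variant. The helper name to_utf8() is made up for illustration, the ISO-8859-1 default is an assumption about the scraped site, and the mbstring extension is assumed to be installed.

```php
<?php
/**
 * Return $text as UTF-8.
 *
 * to_utf8() is a hypothetical helper, not a PHP or Drupal function.
 * Assumption: anything that is not already valid UTF-8 is in
 * $assumedSource (ISO-8859-1 by default).
 */
function to_utf8(string $text, string $assumedSource = 'ISO-8859-1'): string
{
    // Leave valid UTF-8 alone; converting it again would double-encode it.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $assumedSource);
}

// Sample input: "Preferências" as ISO-8859-1 bytes ('ê' is 0xEA).
$page = to_utf8("Prefer\xEAncias");
echo $page, "\n";
```

With the command from the earlier comment, this would be used as $page = to_utf8((string) shell_exec("curl www.google.com.br")); the cast is there because shell_exec() can return null or false. If the site declares a different charset in its Content-Type header or meta tag, pass that as the second argument instead of relying on the default.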