Hi All,
I have written a little application that renames a TV episode by scraping a TV listing site for the episode name. It is written in SWT and works great apart from one small problem. When getting the HTML back from the site, it sometimes contains special characters that are not in the ISO-8859-1 (Windows filesystem) character set.
For example, this is the line that I have to parse:
<td style='padding-left: 6px;' class='b2'><a href='/Prison_Break/episodes/569183/03x01'>Orientaci??n</a></td>
When viewing it in a browser, it is:
<td style="padding-left: 6px;" class="b2"><a href="/Prison_Break/episodes/569183/03x01">Orientación</a></td>
Notice that the o in the title has an accent on it. While researching this problem I stumbled across 'HTML Entities to ISO 8859-1 Converter' at http://www.inweb.de/chetan/English/Resources/Java/HTML%202%20ISO.html. This open source project takes in an HTML entity like &amp; and returns '&'.
So that is not quite what I want, as my BufferedReader is already converting the HTML entity into its character representation. I need a way of detecting a non-ISO-8859-1 character within a string, and hopefully replacing it with its natural 'equivalent' (which would be o in this case).
Does anyone know how I could do this without having to check for every special character and replace it by hand (not really an option unless someone has done it before!!)?
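To make the detect-and-replace idea concrete, here is a minimal sketch, assuming Java 7 or later. The class and method names (FilenameSanitizer, isLatin1, stripAccents) are made up for illustration. It uses CharsetEncoder.canEncode to test whether a string fits in ISO-8859-1, and java.text.Normalizer to decompose accented letters so the accent marks can be dropped. One thing it shows along the way: ó (U+00F3) is itself part of ISO-8859-1, so the check reports the title above as encodable.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper class, not part of any library.
class FilenameSanitizer {

    // Returns true if every character of s can be encoded in ISO-8859-1.
    static boolean isLatin1(String s) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(s);
    }

    // Decomposes accented letters (NFD form), then removes the combining marks,
    // leaving the base letters behind.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // \u00f3 is o with an acute accent; the escape avoids source-encoding issues.
        String title = "Orientaci\u00f3n";
        System.out.println(isLatin1(title));
        System.out.println(stripAccents(title));
    }
}
```

Letters that have no decomposition (for example the Polish L with stroke, U+0141) pass through stripAccents unchanged, so those would still need a hand-written mapping.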
If not, is there perhaps another way to attack the problem?
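One other angle, sketched under the assumption that the site serves its pages as UTF-8 (I have not confirmed this): an accented letter showing up as two junk characters is typical of UTF-8 bytes being decoded with a single-byte charset. Passing an explicit charset to InputStreamReader avoids that. The class and method names below (CharsetDemo, readFirstLine) are hypothetical, and a byte array stands in for the real HTTP response.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class CharsetDemo {

    // Reads the first line of the stream, decoding bytes with the given charset.
    static String readFirstLine(InputStream in, Charset charset) {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the page body: the title encoded as UTF-8 (assumed encoding).
        byte[] page = "Orientaci\u00f3n".getBytes(StandardCharsets.UTF_8);

        String right = readFirstLine(new ByteArrayInputStream(page), StandardCharsets.UTF_8);
        String wrong = readFirstLine(new ByteArrayInputStream(page), StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 11 characters, accent intact
        System.out.println(wrong.length()); // 12 characters, accent split in two
    }
}
```

If the server declares a different encoding in its Content-Type header or a meta tag, that charset would be the one to pass in instead.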
Any help greatly appreciated ;)
Dave