Java Programming - How can I convert ASCII characters to ISO8859?
NZ_Dave
Posts:1
Registered: 11/6/07
How can I convert ASCII characters to ISO8859?   
Nov 6, 2007 12:48 AM
 
 
Hi All,

I have written a little application that renames a TV episode by scraping a TV listing site for the episode name. It is written in SWT and works great apart from one small problem. When getting the HTML back from the site, it sometimes contains special ASCII characters that are not in the ISO8859 (Windows filesystem) character set.

For example, this is the line that I have to parse:
<td style='padding-left: 6px;' class='b2'><a href='/Prison_Break/episodes/569183/03x01'>OrientaciÃ³n</a></td>


When viewing it in a browser, it is:
<td style="padding-left: 6px;" class="b2"><a href="/Prison_Break/episodes/569183/03x01">Orientación</a></td>


Notice that the o in the title has an accent on it. While researching this problem I stumbled across 'HTML Entities to ISO 8859-1 Converter' at http://www.inweb.de/chetan/English/Resources/Java/HTML%202%20ISO.html. This open source project takes in an HTML entity like
&amp;
and returns '&'.

So that is not quite what I want, as my BufferedReader is already converting the HTML entity into its ASCII representation. I need a way of detecting a non-ISO8859 character within an ASCII string and, hopefully, replacing it with its natural 'equivalent' (which would be o in this case).

Does anyone know how I could do this without having to check for every special char and replace it (not really an option unless someone has done it before!!)?

If not, then perhaps there is another way to attack the problem?

Any help greatly appreciated ;)

Dave
 
martin@work
Posts:126
Registered: 2/13/07
Re: How can I convert ASCII characters to ISO8859?   
Nov 6, 2007 1:34 AM (reply 1 of 1)  (In reply to original post )
 
 
Hi,

NZ_Dave wrote:
For example, this is the line that I have to parse:
<td style='padding-left: 6px;' class='b2'><a href='/Prison_Break/episodes/569183/03x01'>OrientaciÃ³n</a></td>

This is encoded in UTF-8. If you convert the bytes to a String using the UTF-8 encoding, then you will have the correct characters "Orientación" in the string.
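
As a rough illustration of the difference (a sketch, not code from the original application), the snippet below decodes the same bytes twice, once as UTF-8 and once as ISO 8859-1. The class name DecodeDemo and its two helper methods are made up for this example, and it assumes Java 7 or later for StandardCharsets. Note that ó (U+00F3) is itself part of ISO 8859-1, so once the bytes are decoded correctly the character should not need replacing at all.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how the choice of charset changes the decoded text.
public class DecodeDemo {

    // Decode the raw bytes as UTF-8 (the encoding the site appears to send).
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decode the same bytes as ISO 8859-1. Every byte becomes one character,
    // so the two-byte UTF-8 sequence for the accented o turns into two characters.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00f3 is the accented o; written as an escape so the source file
        // encoding does not matter.
        byte[] raw = "Orientaci\u00f3n".getBytes(StandardCharsets.UTF_8);

        String good = decodeUtf8(raw);
        String bad = decodeLatin1(raw);

        // The correctly decoded string has 11 characters, the wrong one has 12.
        System.out.println(good.length() + " chars when decoded as UTF-8");
        System.out.println(bad.length() + " chars when decoded as ISO 8859-1");
    }
}
```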

Check your parser where it converts the bytes (coming from e.g. an InputStream) to characters. Use UTF-8 as the charset when doing that conversion.
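
Something along these lines might work for the reading side. This is only a sketch under assumptions: the class Utf8LineReader and the method readFirstLine are hypothetical names, the real application will get its InputStream from the HTTP connection rather than from a byte array, and it assumes Java 8 or later for UncheckedIOException. The key point is that the charset is passed explicitly to InputStreamReader instead of relying on the platform default.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads text from a stream using an explicit UTF-8 charset.
public class Utf8LineReader {

    // Returns the first line of the stream, decoded as UTF-8
    // (or null if the stream is empty).
    static String readFirstLine(InputStream in) {
        // Naming the charset here avoids falling back to the platform default,
        // which on Windows is usually not UTF-8.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the HTTP response body: one line of UTF-8 encoded HTML.
        byte[] body = "<a>Orientaci\u00f3n</a>\n".getBytes(StandardCharsets.UTF_8);
        String line = readFirstLine(new ByteArrayInputStream(body));
        System.out.println(line);
    }
}
```

On older JVMs without StandardCharsets, the charset name can be passed as a string instead: new InputStreamReader(in, "UTF-8").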
 