Character Issues

Back Story

I retrieve strings from a database, alter some of the text in those strings, and then upload them back to the database, replacing the originals. After looking at the front end that displays those strings, I noticed the character issues. I no longer have the original strings, but I do have the updated strings.

The Issue

These strings contain characters from other languages, and those characters no longer display correctly. I looked at the code points, and it appears that each original character, which was one code point, is now two different code points.

"Je?ro^me" //code-points 8. Code-points: 74, 101, 63, 114, 111, 94, 109, 101
"Jéróme" //code-points 6.   Code-points: 74,   233,   114,    243,  109, 101

The question

How do I get "Je?ro^me" back to "Jéróme"?

Things that I have tried

a. Used Notepad++ to convert the encoding to or from UTF8, ANSI, and WINDOWS-1252.
b. Created a Map that looks for things like e? and converts them to é.

Issues with the two attempts to solve the problem

a. The issue still existed after trying different conversions.

b. Two issues here:

I don’t know all of the potential sequences (e?, o^, etc.) to look for. There are over 20,000 files that may cover many languages.
What if I have a sentence that legitimately ends in e? (for example, "How old are thee?")

Things I researched to gain a better understanding of the issue

MCVE

import java.util.HashMap;
import java.util.Map;

/**
 *https://stackoverflow.com/questions/5903008/what-is-a-surrogate-pair-in-java
 *https://docs.oracle.com/javase/tutorial/i18n/text/supplementaryChars.html
 *https://www.w3.org/International/questions/qa-what-is-encoding
 *https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
 * @author sedri
 */
public class App {
    
    static String outputString; 
    
    public static void main(String[] args) {
        
        //My approach to fix the issue
        //Use a map to replace each damaged sequence with the correct character
        //The output looks good, but I would need to include all special characters for many languages.
        //What if I have a sentence like: How old are thee? 
        Map<String, String> map = new HashMap<>();
        map.put("e?", "é");
        map.put("o^", "ó");
        
        final String string = "Je?ro^me";
        final String accentString = "Jéróme";
        outputString = string;
        map.forEach((t, u) -> {
            if(outputString.contains(t))
            {
                outputString = outputString.replace(t, u);
            }
        });
        System.out.println("Fixed output: " + outputString);        
        System.out.println("");                    
        //End of my attempt at a solution.
        
        System.out.println("code points: " + string.codePoints().count());                
        for(int i = 0; i < string.length(); i++)
        {
            System.out.println(string.charAt(i) + ": " + Character.codePointAt(string, i));
        }
        System.out.println("");    
        
        System.out.println("code points: " + accentString.codePoints().count());                
        for(int i = 0; i < accentString.length(); i++)
        {
            System.out.println(accentString.charAt(i) + ": " + Character.codePointAt(accentString, i));
        }
        System.out.println("");    
          
        System.out.println("code points: " + outputString.codePoints().count());  
        for(int i = 0; i < outputString.length(); i++)
        {
            System.out.println(outputString.charAt(i) + ": " + Character.codePointAt(outputString, i));
        }        
        System.out.println("");  
    }
}

Answer

The fact that one of your code points is 63 (a question mark) means that you won’t be able to reliably revert that data to the original format. The ? can represent many different characters that weren’t properly decoded, which means you’ve lost vital information for restoring the original characters.
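To illustrate why a 63 is unrecoverable, here is a minimal sketch of one way such a question mark can appear: encoding text to a charset that cannot represent the character. The class and method names are made up for this example, and US-ASCII is only an assumption; I don't know which conversion actually damaged your data.

```java
import java.nio.charset.StandardCharsets;

public class LossyEncodingDemo {

    // Encodes to US-ASCII and decodes again. String.getBytes(Charset)
    // replaces every unmappable character with the charset's
    // replacement byte, which for US-ASCII is '?' (code point 63).
    static String roundTripThroughAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String acute = "\u00e9"; // e with acute accent
        String grave = "\u00e8"; // e with grave accent

        String a = roundTripThroughAscii(acute);
        String b = roundTripThroughAscii(grave);

        System.out.println((int) a.charAt(0)); // 63
        System.out.println((int) b.charAt(0)); // 63
        System.out.println(a.equals(b));       // true
    }
}
```

Once two different characters have collapsed into the same ?, nothing in the string records which one it used to be.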

What you need to do is establish the correct encoding to use when you read from your database in the first place. Since you haven’t posted the code where you read these strings, I can’t tell you exactly how or where to do that.
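As a rough sketch of what "establishing the correct encoding" means in Java: name the charset explicitly wherever bytes are turned into a String, instead of relying on a default. The example below assumes the stored bytes are UTF-8, which may not be true for your database, and the class and method names are invented for illustration. With JDBC, the encoding is usually configured on the connection or the column rather than in code like this.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {

    // Decodes raw bytes with the charset they were written in
    // (assumed here to be UTF-8).
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decodes the same bytes with a mismatched charset, to show the damage.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "Jerome" with accented e and o, written with escapes so the
        // example does not depend on the source file's own encoding.
        byte[] stored = "J\u00e9r\u00f3me".getBytes(StandardCharsets.UTF_8);

        String right = decodeUtf8(stored);
        String wrong = decodeLatin1(stored);

        System.out.println(right.codePoints().count()); // 6
        System.out.println(wrong.codePoints().count()); // 8
    }
}
```

The mismatched decode turns each accented character into two code points, much like the symptom in the question, although the exact code points you see (63 and 94) suggest a different or additional conversion step.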

Hopefully the data in the DB itself hasn’t already been corrupted by bad character encoding, or else you’ve already lost the information you need.

You might be able to partially repair such damage by doing things like replacing “o^” with “ó”, but if, say, both “è” and “é” turn into “e?”, you can never be sure which was which.
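If you do attempt a partial repair, one cautious approach is to replace only the sequences you believe have a single possible original and to flag everything that still contains a ? for manual review. The sketch below is hypothetical: the table contents, class name, and method names are assumptions, and the review check will also flag ordinary questions such as "How old are thee?".

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartialRepair {

    // Hypothetical table of damaged sequences assumed to be unambiguous.
    // "e?" is deliberately left out because it could stand for several
    // different accented characters.
    static final Map<String, String> UNAMBIGUOUS = new LinkedHashMap<>();
    static {
        UNAMBIGUOUS.put("o^", "\u00f3"); // o with acute accent, as in the question
    }

    // Applies every replacement in the table to the damaged text.
    static String repair(String damaged) {
        String result = damaged;
        for (Map.Entry<String, String> entry : UNAMBIGUOUS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    // True if the text still contains a '?', meaning a person should look at it.
    static boolean needsReview(String text) {
        return text.indexOf('?') >= 0;
    }

    public static void main(String[] args) {
        String repaired = repair("Je?ro^me");
        System.out.println(repaired.length());     // 7: only "o^" was collapsed
        System.out.println(needsReview(repaired)); // true: "e?" is still there
    }
}
```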

Answer