Converting a string to a hash integer

I have to convert a string of length k over the 4-character alphabet {A, C, G, T} into an integer from 0 to 4^k - 1. The advice is to map {A, C, G, T} to {0, 1, 2, 3} respectively, but I do not know how to combine those digits into a single number in that range. For example, given the string “ACT”, I have to produce a number from 0 to 63.

Answer

You can do it like this:

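char[] chars=yourString.toCharArray();
int result=0;
for(char c:chars){
    result*=4;//shift the digits so far one base-4 place to the left
    switch(c){
        case 'A':
            result+=0;
            break;
        case 'C':
            result+=1;
            break;
        case 'G':
            result+=2;
            break;
        case 'T':
            result+=3;
            break;
        default://anything other than A/C/G/T is invalid input
            throw new IllegalArgumentException("Unexpected character: "+c);
    }
}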

For every character, this first multiplies the result by 4 in order to make room for the next base-4 digit, and then adds a value from 0 to 3 to the result (depending on the character).
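To make the multiply-then-add step concrete, here is a minimal sketch of the same idea packaged as a method. The class name SequenceEncoder and the method name encode are placeholders of mine, and looking the digit up with "ACGT".indexOf is just a shorter alternative to the switch. It assumes the sequence is short enough to fit in an int and does no overflow checking.

```java
final class SequenceEncoder {
    // The position of a character in this string is its base-4 digit: A=0, C=1, G=2, T=3.
    private static final String ALPHABET = "ACGT";

    // Hypothetical helper: encodes a sequence as a base-4 number.
    // Assumes the sequence is short enough that the result fits in an int.
    static int encode(String sequence) {
        int result = 0;
        for (int i = 0; i < sequence.length(); i++) {
            char c = sequence.charAt(i);
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
            // Shift the digits so far one base-4 place to the left, then append the new digit.
            result = result * 4 + digit;
        }
        return result;
    }
}
```

For "ACT" the intermediate results are 0, then 0*4+1 = 1, then 1*4+3 = 7, so SequenceEncoder.encode("ACT") evaluates to 7, and "TTT" gives the largest 3-character value, 63.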

Note that this is not hashing because it can be reversed easily.

A one-line version of the code would be:

Integer.parseInt(yourString.replace('A','0').replace('C','1').replace('G','2').replace('T','3'),4);

This replaces A/C/G/T with 0/1/2/3 and parses the result as a base-4 number.


You can also get the original String from the converted int:

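int intVal=7;//the encoded value, e.g. 7 for "ACT"
int k=3;//length of the original sequence
StringBuilder sb=new StringBuilder();
while(intVal!=0){
    switch(intVal%4){
        case 0:
            sb.append('A');
            break;
        case 1:
            sb.append('C');
            break;
        case 2:
            sb.append('G');
            break;
        case 3:
            sb.append('T');
            break;
    }
    intVal=intVal/4;
}
while(sb.length()<k){
    sb.append('A');//leading 'A's are leading zeros in the int, so restore them
}
String result=sb.reverse().toString();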

This gets each digit one after another and adds the corresponding value to the StringBuilder. Because it starts with the last digit, a reversal is needed.
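One thing to watch for when decoding is that a leading 'A' is encoded as a leading zero, so the integer alone does not tell you how long the original sequence was (7 could be "CT", "ACT", "AACT", and so on). A minimal sketch that handles this, assuming the original length k is known to the caller, fills a fixed-size array from the back. The names SequenceDecoder and decode are placeholders of mine.

```java
final class SequenceDecoder {
    // The base-4 digit is the index into this string: 0=A, 1=C, 2=G, 3=T.
    private static final String ALPHABET = "ACGT";

    // Hypothetical helper: decodes value into a sequence of exactly k characters.
    // k is assumed to be the length of the sequence that was originally encoded.
    static String decode(int value, int k) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative: " + value);
        }
        char[] out = new char[k];
        // The lowest base-4 digit is the last character, so fill from the end.
        for (int i = k - 1; i >= 0; i--) {
            out[i] = ALPHABET.charAt(value % 4);
            value /= 4;
        }
        if (value != 0) {
            throw new IllegalArgumentException("value does not fit in " + k + " characters");
        }
        return new String(out);
    }
}
```

With this, SequenceDecoder.decode(7, 3) yields "ACT" and SequenceDecoder.decode(0, 2) yields "AA". Filling the array from the back also avoids the separate reversal step.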

It is also possible to write this as a one-liner, although it drops leading 'A's because Integer.toString does not emit leading zeros:

Integer.toString(intVal,4).replace('0','A').replace('1','C').replace('2','G').replace('3','T');

Note that you might want to use long or BigInteger for longer sequences, as an int would overflow for those.

Each character takes 2 bits. A non-negative int has 31 usable bits, so it can hold sequences of up to 15 characters (16 characters fit in the 32 bits only if you treat the value as unsigned, in which case it may show up as a negative number). Likewise, a long can hold sequences of up to 31 characters (32 if treated as unsigned). With BigInteger, you would likely reach the memory limit of the JVM with your sequence string or the char[] used in the calculation before the limit of BigInteger becomes a problem (BigInteger supports magnitudes up to roughly 2 to the power of Integer.MAX_VALUE).
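For longer sequences, the same loop carries over to long and BigInteger almost unchanged. The following is a sketch only: the class and method names (LongSequenceEncoder, encodeLong, encodeBig, digitOf) are placeholders of mine, and the 31-character cap in encodeLong is an assumption chosen so that the result always stays non-negative.

```java
import java.math.BigInteger;

final class LongSequenceEncoder {
    private static final String ALPHABET = "ACGT";
    private static final BigInteger FOUR = BigInteger.valueOf(4);

    // Hypothetical helper: maps A/C/G/T to 0/1/2/3 and rejects anything else.
    private static int digitOf(char c) {
        int digit = ALPHABET.indexOf(c);
        if (digit < 0) {
            throw new IllegalArgumentException("Unexpected character: " + c);
        }
        return digit;
    }

    // Up to 31 characters fit in a non-negative long, since 4^31 = 2^62.
    static long encodeLong(String sequence) {
        if (sequence.length() > 31) {
            throw new IllegalArgumentException("too long for a long: " + sequence.length());
        }
        long result = 0L;
        for (int i = 0; i < sequence.length(); i++) {
            result = result * 4 + digitOf(sequence.charAt(i));
        }
        return result;
    }

    // No practical length limit; each step allocates a new BigInteger.
    static BigInteger encodeBig(String sequence) {
        BigInteger result = BigInteger.ZERO;
        for (int i = 0; i < sequence.length(); i++) {
            result = result.multiply(FOUR).add(BigInteger.valueOf(digitOf(sequence.charAt(i))));
        }
        return result;
    }
}
```

Both methods give 7 for "ACT". If you need to decode the value again, keep the sequence length alongside it, since leading 'A's do not change the number.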

User contributions licensed under: CC BY-SA