Find top k frequent words in real time data stream



I am trying to solve an algorithms problem using java tree set.

The problem as follows:

Find top k frequent words in realtime data stream.

Implement three methods for Topk Class:

TopK(k). The constructor.
add(word). Add a new word.
topk(). Get the current top k frequent words.

And my thought was to use a hashmap to remember frequencies and a treeset as a buffer.

My implementation passed most of the case, except one:

TopK(10)
add("aw")
add("fb")
add("fb")
topk()

The answer supposed to be [fb,aw] but now it’s [fb,aw, fb] However, my code passed test case like:

    TopK(10)
    add("iiiiii")
    add("fb")
    add("fb")
    topk()

and

TopK(10)
add("fb")
add("fb")
topk()

I have no idea what’s wrong, so I printed some value when the comparator is called. It gave me this:

 aw aw
11111111
fb aw
33333333
fb aw
33333333
fb aw
222222222
fb aw
222222222

Which means, the second “fb” was compared to “aw” twice and the comparator is done. I spent hours to debug and I have found nothing so far.

Here is my implementation:

public class TopK {
    int size;
    HashMap<String, Integer> map;
    TreeSet<String> seen;
    public TopK(int k) {
        // do intialization if necessary
        size = k;
        seen = new TreeSet<String>(new Comparator<String>(){
            @Override
            public int compare(String str1, String str2){
                System.out.println(str1 + " "+ str2);
                if (str1.equals(str2)){
                    System.out.println("11111111");
                    return 0;
                }
                // important !https://www.jiuzhang.com/qa/7646/
                // 128 以后integer就不同了
                int number1 = map.get(str1);
                int number2 = map.get(str2);
                if (number1 != number2){
                    System.out.println("222222222");
                    return map.get(str1) - map.get(str2);
                } else {
                    System.out.println("33333333");
                    return str2.compareTo(str1);
                }
            }
        });
        map = new HashMap<String, Integer>();
    }

    /*
     * @param word: A string
     * @return: nothing
     */
    public void add(String word) {
        // write your code here
        if (!map.containsKey(word)){
            map.put(word, 0);
        }
        map.put(word, map.get(word) + 1);

        if (seen.contains(word)){
            seen.remove(word);
            seen.add(word);
        } else {
            seen.add(word);
            if (seen.size() > size){
                seen.pollFirst();
            }
        }
    }

    /*
     * @return: the current top k frequent words.
     */
    public List<String> topk() {
        // Write your code here
        List<String> results = new ArrayList<String>();
        Iterator it = seen.iterator();
        while(it.hasNext()) {
             String str = (String)it.next();
             results.add(0, str);
        }
        return results;
    }
}

Answer

Our first clue is that case:

aw
fb
fb

Fails but:

iiiii
fb
fb

Succeed.

This can happen only because of the line: return str2.compareTo(str1); – if the number of appearances are different order by string compare (this could be check easily – please do that).

The only explanation I can think of is the contains function of java TreeSet has “optimization” of searching only until where the element should be – if you have order and the element is not where it should be then consider it as none exist in the TreeSet (think about array that should be sort for checking for number you run in log(n) but no on all array – so if he exist in wrong position you will miss him).

Notice that you change the place where the element should be before checking the contains function. So let look at the 2th iteration – we have fb and aw both with value of 1 in the map. On the TreeSet the are as [fb,aw] (because string compare as explain before). Now you change the map and fb have value of 2 -> it should be in the last place but the contains function compare to aw and think it should be after it – but he is the last element so it assume fb does not exist and just add him -> That why you see 2 compares between fb and aw – one for the contain and one for the add.

Hope that was understandable….



Source: stackoverflow