I am trying to solve an algorithms problem using java tree set.
The problem as follows:
Find top k frequent words in realtime data stream. Implement three methods for Topk Class: TopK(k). The constructor. add(word). Add a new word. topk(). Get the current top k frequent words.
And my thought was to use a hashmap to remember frequencies and a treeset as a buffer.
My implementation passed most of the case, except one:
TopK(10) add("aw") add("fb") add("fb") topk()
The answer supposed to be [fb,aw] but now it’s [fb,aw, fb] However, my code passed test case like:
TopK(10) add("iiiiii") add("fb") add("fb") topk()
and
TopK(10) add("fb") add("fb") topk()
I have no idea what’s wrong, so I printed some value when the comparator is called. It gave me this:
aw aw 11111111 fb aw 33333333 fb aw 33333333 fb aw 222222222 fb aw 222222222
Which means, the second “fb” was compared to “aw” twice and the comparator is done. I spent hours to debug and I have found nothing so far.
Here is my implementation:
public class TopK { int size; HashMap<String, Integer> map; TreeSet<String> seen; public TopK(int k) { // do intialization if necessary size = k; seen = new TreeSet<String>(new Comparator<String>(){ @Override public int compare(String str1, String str2){ System.out.println(str1 + " "+ str2); if (str1.equals(str2)){ System.out.println("11111111"); return 0; } // important !https://www.jiuzhang.com/qa/7646/ // 128 以后integer就不同了 int number1 = map.get(str1); int number2 = map.get(str2); if (number1 != number2){ System.out.println("222222222"); return map.get(str1) - map.get(str2); } else { System.out.println("33333333"); return str2.compareTo(str1); } } }); map = new HashMap<String, Integer>(); } /* * @param word: A string * @return: nothing */ public void add(String word) { // write your code here if (!map.containsKey(word)){ map.put(word, 0); } map.put(word, map.get(word) + 1); if (seen.contains(word)){ seen.remove(word); seen.add(word); } else { seen.add(word); if (seen.size() > size){ seen.pollFirst(); } } } /* * @return: the current top k frequent words. */ public List<String> topk() { // Write your code here List<String> results = new ArrayList<String>(); Iterator it = seen.iterator(); while(it.hasNext()) { String str = (String)it.next(); results.add(0, str); } return results; } }
Advertisement
Answer
Our first clue is that case:
aw fb fb
Fails but:
iiiii fb fb
Succeed.
This can happen only because of the line: return str2.compareTo(str1);
– if the number of appearances are different order by string compare (this could be check easily – please do that).
The only explanation I can think of is the contains
function of java TreeSet has “optimization” of searching only until where the element should be – if you have order and the element is not where it should be then consider it as none exist in the TreeSet (think about array that should be sort for checking for number you run in log(n) but no on all array – so if he exist in wrong position you will miss him).
Notice that you change the place where the element should be before checking the contains
function. So let look at the 2th iteration – we have fb
and aw
both with value of 1 in the map. On the TreeSet the are as [fb,aw]
(because string compare as explain before). Now you change the map and fb
have value of 2 -> it should be in the last place but the contains
function compare to aw
and think it should be after it – but he is the last element so it assume fb
does not exist and just add him -> That why you see 2 compares between fb
and aw
– one for the contain
and one for the add
.
Hope that was understandable….