Data set got merged incorrectly/randomly?



I have a .csv file that looks something like this:

123,1-1-2020,[Apple]
123,1-2-2020,[Apple]
123,1-2-2020,[Beer]
345,1-3-2020,[Bacon]
345,1-4-2020,[Cheese]
345,1-4-2020,[Sausage]
345,1-5-2020,[Bacon]

I made a function that finds if both the number and date of any line is similar, the items of the lines will appended with a number next to it to show how many items are there in the Set<String> of items:

123,1-1-2020,1,[Apple]
123,1-2-2020,2,[Apple, Beer]
345,1-3-2020,1,[Bacon]
345,1-4-2020,2,[Cheese,Sausage]
345,1-5-2020,1,[Bacon]

While this is the intended result, the actual result with a larger set of data using my algorithm, items are sometimes randomly not counted and goes missing (the entire line disappears). The above example sometimes would becomes:

123,1-1-2020,1,[Apple]
123,1-2-2020,2,[Apple, Beer]
345,1-3-2020,1,[Bacon]
345,1-4-2020,2,[Cheese,Sausage]
345,1-5-2020,1,[Bacon]
// Any one of those output lines would sometimes disappear entirely.

I am very confused why is this happening. Below is the algorithm I implemented:

protected List<Receipt> convert(List<Receipt> list) {

        List<Receipt> receipts = new ArrayList<>();
        for (int i = 0; i < list.size(); i++) {
            List<Receipt> temp = new ArrayList<>();
            temp.add(list.get(i));
            for (int j = i + 1; j < list.size(); j++) {
                if (list.get(i).getNumber().equals(list.get(j).getNumber())) {
                    temp.add(list.get(j));
                }
            }

            Map<String, Set<String>> map = new HashMap<>();
            for (Receipt r : temp) {
                if (!map.containsKey(r.getDate())) {
                    map.put(r.getDate(), r.getItems());
                } else {
                    map.replace(r.getDate(), merge(map.get(r.getDate()), r.getItems()));
                }
            }

            for (Map.Entry<String, Set<String>> m : map.entrySet()) {
                receipts.add(new Receipt(list.get(i).getNumber(), m.getKey(), m.getValue()));
            }
            i = i + temp.size();
        }
        return receipts;
    }

// Merge function used above to append two item sets.
private static Set<String> merge(Set<String> a, Set<String> b) {

        return new HashSet<>() {
            {
                addAll(a);
                addAll(b);
            }
        };
    }

Where each line in the .csv is made into a Receipt object that has (String) getNumber(), (String) getDate(), and (Set<String>) getItems() method.

For a .csv with a similar format that has hundreds of thousands of lines, in the original data set, for example, the item Bacon was found 1334 times but the output of my algorithm, Bacon only appears randomly somewhere between 1160 – 1240 times. This happens to all other items as well. There are also some strange behaviors where a few random lines got the items appended wrong as well (same number, different date but still appended).

What could possibly be the cause of this randomness?

Edit : as suggestions in the comment section mentioned, the cause seems to be at i++ and i = i + temp.size() incrementing the i value incorrectly.

Answer

First, you’re using double-brace syntax in the merge() method, which is highly discouraged. See e.g. this answer: What is Double Brace initialization in Java?

Second, your code is iterating the list too much, you can (and should) do it in a single iteration.

Before going on, we need to look at the Receipt class, which is not shown in the question, but we can infer that it looks like this:

public class Receipt {
    private final String number;
    private final String date;
    private final Set<String> items;
    public Receipt(String number, String date, Set<String> items) {
        this.number = number;
        this.date = date;
        this.items = items;
    }
    public String getNumber() {
        return this.number;
    }
    public String getDate() {
        return this.date;
    }
    public Set<String> getItems() {
        return this.items;
    }
}

The goal is to merge receipts that share number and date. To do that in a single iteration, we need a Map, keyed by the combination of number and date. There are 3 ways to do that:

  1. Implement equals() and hashCode() to be based on only those two fields, ignoring the items field. This is the easiest and simplest solution, but requires modifying the Receipt class, and to define “equality” to not include the item list, which is a debatable decision, so let us not do this.

  2. Implement a dedicated class with just the two fields, and implement equals() and hashCode(). This has the advantage of leaving Receipt alone, but does require a new class.

  3. Use the Receipt class as the key, but give the Map class a custom Comparator that only compares the two fields. With Java 8, we can do this without creating a new class, so let us try that. We’ll need to use a TreeMap in order to supply a custom Comparator.

So, we’ll create a TreeMap<Receipt, Set<String>>, using a custom Comparator, iterate the list once, merging the items as we go, then finally iterating the map to build the new list of merged receipts.

protected static List<Receipt> convert(List<Receipt> list) {
    Map<Receipt, Set<String>> map = new TreeMap<>(Comparator.comparing(Receipt::getDate)
                                                        .thenComparing(Receipt::getNumber));
    for (Receipt receipt : list) {
        map.merge(receipt, receipt.getItems(), (v1, v2) -> {
            Set<String> newSet = new TreeSet<>(v1);
            newSet.addAll(v2);
            return newSet;
        });
    }
    List<Receipt> result = new ArrayList<>();
    for (Entry<Receipt, Set<String>> entry : map.entrySet()) {
        result.add(new Receipt(entry.getKey().getNumber(),
                               entry.getKey().getDate(),
                               entry.getValue()));
    }
    return result;
}

It can of course also be done with Stream logic:

protected static List<Receipt> convert(List<Receipt> list) {
    return list.stream()
        .collect(Collectors.groupingBy(
                Function.identity(),
                () -> new TreeMap<>(Comparator.comparing(Receipt::getDate)
                                              .thenComparing(Receipt::getNumber)),
                Collectors.flatMapping(r -> r.getItems().stream(),
                        Collectors.toCollection(TreeSet::new))))
        .entrySet().stream()
        .map(e -> new Receipt(e.getKey().getNumber(), e.getKey().getDate(), e.getValue()))
        .collect(Collectors.toList());
}

Test
Uses Java 9 of methods.

List<Receipt> list = List.of(
        new Receipt("123", "1-1-2020", Set.of("Apple")),
        new Receipt("123", "1-2-2020", Set.of("Apple")),
        new Receipt("123", "1-2-2020", Set.of("Beer")),
        new Receipt("345", "1-3-2020", Set.of("Bacon")),
        new Receipt("345", "1-4-2020", Set.of("Cheese")),
        new Receipt("345", "1-4-2020", Set.of("Sausage")),
        new Receipt("345", "1-5-2020", Set.of("Bacon")));

List<Receipt> converted = convert(list);
converted.forEach(r -> System.out.println(r.getNumber() + "," + r.getDate() + "," +
                                          r.getItems().size() + "," + r.getItems()));

Output

123,1-1-2020,1,[Apple]
123,1-2-2020,2,[Apple, Beer]
345,1-3-2020,1,[Bacon]
345,1-4-2020,2,[Cheese, Sausage]
345,1-5-2020,1,[Bacon]


Source: stackoverflow