I have a .csv
file that looks something like this:
123,1-1-2020,[Apple] 123,1-2-2020,[Apple] 123,1-2-2020,[Beer] 345,1-3-2020,[Bacon] 345,1-4-2020,[Cheese] 345,1-4-2020,[Sausage] 345,1-5-2020,[Bacon]
I made a function that finds if both the number
and date
of any line is similar, the items of the lines will appended with a number next to it to show how many items are there in the Set<String>
of items:
123,1-1-2020,1,[Apple] 123,1-2-2020,2,[Apple, Beer] 345,1-3-2020,1,[Bacon] 345,1-4-2020,2,[Cheese,Sausage] 345,1-5-2020,1,[Bacon]
While this is the intended result, the actual result with a larger set of data using my algorithm, items are sometimes randomly not counted and goes missing (the entire line disappears). The above example sometimes would becomes:
123,1-1-2020,1,[Apple] 123,1-2-2020,2,[Apple, Beer] 345,1-3-2020,1,[Bacon] 345,1-4-2020,2,[Cheese,Sausage] 345,1-5-2020,1,[Bacon] // Any one of those output lines would sometimes disappear entirely.
I am very confused why is this happening. Below is the algorithm I implemented:
protected List<Receipt> convert(List<Receipt> list) { List<Receipt> receipts = new ArrayList<>(); for (int i = 0; i < list.size(); i++) { List<Receipt> temp = new ArrayList<>(); temp.add(list.get(i)); for (int j = i + 1; j < list.size(); j++) { if (list.get(i).getNumber().equals(list.get(j).getNumber())) { temp.add(list.get(j)); } } Map<String, Set<String>> map = new HashMap<>(); for (Receipt r : temp) { if (!map.containsKey(r.getDate())) { map.put(r.getDate(), r.getItems()); } else { map.replace(r.getDate(), merge(map.get(r.getDate()), r.getItems())); } } for (Map.Entry<String, Set<String>> m : map.entrySet()) { receipts.add(new Receipt(list.get(i).getNumber(), m.getKey(), m.getValue())); } i = i + temp.size(); } return receipts; } // Merge function used above to append two item sets. private static Set<String> merge(Set<String> a, Set<String> b) { return new HashSet<>() { { addAll(a); addAll(b); } }; }
Where each line in the .csv
is made into a Receipt
object that has (String)
getNumber()
, (String)
getDate()
, and (Set<String>)
getItems()
method.
For a .csv with a similar format that has hundreds of thousands of lines, in the original data set, for example, the item Bacon
was found 1334 times but the output of my algorithm, Bacon
only appears randomly somewhere between 1160 – 1240 times. This happens to all other items as well. There are also some strange behaviors where a few random lines got the items appended wrong as well (same number, different date but still appended).
What could possibly be the cause of this randomness?
Edit : as suggestions in the comment section mentioned, the cause seems to be at i++
and i = i + temp.size()
incrementing the i
value incorrectly.
Advertisement
Answer
First, you’re using double-brace syntax in the merge()
method, which is highly discouraged. See e.g. this answer: What is Double Brace initialization in Java?
Second, your code is iterating the list too much, you can (and should) do it in a single iteration.
Before going on, we need to look at the Receipt
class, which is not shown in the question, but we can infer that it looks like this:
public class Receipt { private final String number; private final String date; private final Set<String> items; public Receipt(String number, String date, Set<String> items) { this.number = number; this.date = date; this.items = items; } public String getNumber() { return this.number; } public String getDate() { return this.date; } public Set<String> getItems() { return this.items; } }
The goal is to merge receipts that share number
and date
. To do that in a single iteration, we need a Map
, keyed by the combination of number
and date
. There are 3 ways to do that:
Implement
equals()
andhashCode()
to be based on only those two fields, ignoring theitems
field. This is the easiest and simplest solution, but requires modifying theReceipt
class, and to define “equality” to not include the item list, which is a debatable decision, so let us not do this.Implement a dedicated class with just the two fields, and implement
equals()
andhashCode()
. This has the advantage of leavingReceipt
alone, but does require a new class.Use the
Receipt
class as the key, but give theMap
class a customComparator
that only compares the two fields. With Java 8, we can do this without creating a new class, so let us try that. We’ll need to use aTreeMap
in order to supply a customComparator
.
So, we’ll create a TreeMap<Receipt, Set<String>>
, using a custom Comparator
, iterate the list once, merging the items as we go, then finally iterating the map to build the new list of merged receipts.
protected static List<Receipt> convert(List<Receipt> list) { Map<Receipt, Set<String>> map = new TreeMap<>(Comparator.comparing(Receipt::getDate) .thenComparing(Receipt::getNumber)); for (Receipt receipt : list) { map.merge(receipt, receipt.getItems(), (v1, v2) -> { Set<String> newSet = new TreeSet<>(v1); newSet.addAll(v2); return newSet; }); } List<Receipt> result = new ArrayList<>(); for (Entry<Receipt, Set<String>> entry : map.entrySet()) { result.add(new Receipt(entry.getKey().getNumber(), entry.getKey().getDate(), entry.getValue())); } return result; }
It can of course also be done with Stream logic:
protected static List<Receipt> convert(List<Receipt> list) { return list.stream() .collect(Collectors.groupingBy( Function.identity(), () -> new TreeMap<>(Comparator.comparing(Receipt::getDate) .thenComparing(Receipt::getNumber)), Collectors.flatMapping(r -> r.getItems().stream(), Collectors.toCollection(TreeSet::new)))) .entrySet().stream() .map(e -> new Receipt(e.getKey().getNumber(), e.getKey().getDate(), e.getValue())) .collect(Collectors.toList()); }
Test
Uses Java 9 of
methods.
List<Receipt> list = List.of( new Receipt("123", "1-1-2020", Set.of("Apple")), new Receipt("123", "1-2-2020", Set.of("Apple")), new Receipt("123", "1-2-2020", Set.of("Beer")), new Receipt("345", "1-3-2020", Set.of("Bacon")), new Receipt("345", "1-4-2020", Set.of("Cheese")), new Receipt("345", "1-4-2020", Set.of("Sausage")), new Receipt("345", "1-5-2020", Set.of("Bacon"))); List<Receipt> converted = convert(list); converted.forEach(r -> System.out.println(r.getNumber() + "," + r.getDate() + "," + r.getItems().size() + "," + r.getItems()));
Output
123,1-1-2020,1,[Apple] 123,1-2-2020,2,[Apple, Beer] 345,1-3-2020,1,[Bacon] 345,1-4-2020,2,[Cheese, Sausage] 345,1-5-2020,1,[Bacon]