I have a large file containing two million lines. I want to traverse each line of the file, process it into a key-value pair, and store it in a HashMap for later comparisons. However, I do not want a HashMap with two million key-value pairs, in the interest of space complexity. Instead, I would like to iterate over N lines of the file, load their key-value pairs into the HashMap, make comparisons, then load the next N lines into the HashMap, and so on.
An example of the use case:

File.txt:

    1 Jack London
    2 Mary Boston
    3 Jay Chicago
    4 Mia Amsterdam
    5 Leah New York
    6 Bob Denver
    ...

Assuming N=3 as the size of my hashmap, on the first iteration my hashmap would store key-value pairs for the first three lines of the file, i.e.

    1 Jack London
    2 Mary Boston
    3 Jay Chicago

After making comparisons on these key-value pairs, the next 3 lines are loaded into the hashmap as key-value pairs:

    4 Mia Amsterdam
    5 Leah New York
    6 Bob Denver
and so on until all the lines in the file have been iterated over. How do I implement this using the iterator design pattern in Java?
Answer
You can do something like this:
    public class Temp {
        public static void main(String[] args) throws Exception {
            BufferedReader reader = new BufferedReader(new InputStreamReader(/* get file input stream here */));
            int maxSize = 3;
            Map<String, String> map = new HashMap<>(maxSize);
            String line = reader.readLine();
            while (line != null) {
                String[] data = line.split("\\s+"); // some dummy parsing
                String key = data[1];
                String value = data[2];
                map.put(key, value);
                if (map.size() == maxSize) {
                    // do whatever operations you need
                    System.out.println(map);
                    map.clear();
                }
                line = reader.readLine();
            }
            if (map.size() > 0) {
                // deal with leftovers
                System.out.println(map);
            }
        }
    }
Basically, read and parse until the map reaches its maximum size, then operate on the contents and empty it. Keep doing this until you have read the entire file. At the end, operate on the leftover contents, if there are any.
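To make the loop above concrete, here is a self-contained sketch of the same batch-and-clear idea, run against the sample data from the question. A `StringReader` stands in for the real file stream, and the `readInBatches` helper name is my own choice, not part of the answer:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchDemo {

    // Collects parsed key-value pairs into batches of at most maxSize entries.
    // Splitting with a limit of 2 keeps "New York" together as the value.
    static List<Map<String, String>> readInBatches(BufferedReader reader, int maxSize) {
        List<Map<String, String>> batches = new ArrayList<>();
        Map<String, String> map = new LinkedHashMap<>(maxSize);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] data = line.split("\\s+", 2); // "1 Jack London" -> ["1", "Jack London"]
                map.put(data[0], data[1]);
                if (map.size() == maxSize) {
                    batches.add(new LinkedHashMap<>(map)); // copy out the full batch
                    map.clear();
                }
            }
        } catch (IOException exc) {
            throw new RuntimeException("Error reading", exc);
        }
        if (!map.isEmpty()) {
            batches.add(map); // leftovers
        }
        return batches;
    }

    public static void main(String[] args) {
        String file = "1 Jack London\n2 Mary Boston\n3 Jay Chicago\n"
                + "4 Mia Amsterdam\n5 Leah New York\n6 Bob Denver\n7 Ann Oslo";
        // Prints three batches: two full ones of size 3, plus a leftover with entry 7.
        for (Map<String, String> batch : readInBatches(new BufferedReader(new StringReader(file)), 3)) {
            System.out.println(batch);
        }
    }
}
```

In a real run you would replace the `StringReader` with a reader over the actual file and do your comparisons where the demo copies the batch out.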
Edit: You need to wrap the reading inside an iterator and also keep track of the maximum number of lines to read at once.
    public class CountingFileLineIterator implements Iterator<Iterable<String>> {

        private final BufferedReader reader;
        private String line;
        private final int maxDataCount;

        public CountingFileLineIterator(InputStream inputStream, int maxDataCount) {
            this.reader = new BufferedReader(new InputStreamReader(inputStream));
            this.setLine();
            this.maxDataCount = maxDataCount;
        }

        private void setLine() {
            try {
                this.line = this.reader.readLine();
            } catch (IOException exc) {
                // handle however you need
                throw new RuntimeException("Error reading", exc);
            }
        }

        @Override
        public boolean hasNext() {
            return this.line != null;
        }

        @Override
        public Iterable<String> next() {
            List<String> next = new ArrayList<>(this.maxDataCount);
            for (int i = 0; i < this.maxDataCount; i++) {
                if (!this.hasNext()) {
                    break;
                }
                next.add(this.line);
                this.setLine();
            }
            return next;
        }
    }
This is only one possible solution, which returns an Iterable (a List in this exact implementation) containing up to the maximum number of lines to be processed at once. This is what I came up with in order to keep the processing done by the iterator to the absolute minimum. You can (and should) have another class which actually handles the processing of the data (parses it into a Map and so on). The thing is, even like this, the iterator has more responsibility than it should: creating the batches of data.
My suggestion would be to have the iterator only return the next line, with no processing at all; that is exactly what an iterator should be doing.
    public class FileLineIterator implements Iterator<String> {

        private final BufferedReader reader;
        private String line;

        public FileLineIterator(InputStream inputStream) {
            this.reader = new BufferedReader(new InputStreamReader(inputStream));
            this.setLine();
        }

        private void setLine() {
            try {
                this.line = this.reader.readLine();
            } catch (IOException exc) {
                // handle however you need
                throw new RuntimeException("Error reading", exc);
            }
        }

        @Override
        public boolean hasNext() {
            return this.line != null;
        }

        @Override
        public String next() {
            String line = this.line;
            this.setLine();
            return line;
        }
    }
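For reference, here is a self-contained sketch of how such a line iterator could be driven. The iterator body is the same as above, condensed; the `ByteArrayInputStream` stands in for the real file stream, and `readAll` is just a demo helper of my own:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class FileLineIteratorDemo {

    // Same iterator as above, condensed: always holds the next line to return,
    // so hasNext() is a simple null check.
    static class FileLineIterator implements Iterator<String> {
        private final BufferedReader reader;
        private String line;

        FileLineIterator(InputStream in) {
            this.reader = new BufferedReader(new InputStreamReader(in));
            advance();
        }

        private void advance() {
            try {
                this.line = this.reader.readLine();
            } catch (IOException exc) {
                throw new RuntimeException("Error reading", exc);
            }
        }

        @Override
        public boolean hasNext() {
            return this.line != null;
        }

        @Override
        public String next() {
            String current = this.line;
            advance();
            return current;
        }
    }

    // Demo helper: drains the iterator into a list.
    static List<String> readAll(InputStream in) {
        List<String> lines = new ArrayList<>();
        Iterator<String> it = new FileLineIterator(in);
        while (it.hasNext()) {
            lines.add(it.next());
        }
        return lines;
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "1 Jack London\n2 Mary Boston".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in)); // [1 Jack London, 2 Mary Boston]
    }
}
```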
Then create an abstraction, which will prepare data for handling:
    public interface DataPreparer {

        boolean hasMoreData();

        DataHandler prepareData();
    }
Like this you can have implementations that prepare data in batches (your case), line by line, or all at once, however you need. An exact implementation for batches may be:
    public class BatchDataPreparer implements DataPreparer {

        private final Iterator<String> iterator;
        private final int batchSize;

        public BatchDataPreparer(Iterator<String> iterator, int batchSize) {
            this.iterator = iterator;
            this.batchSize = batchSize;
        }

        @Override
        public boolean hasMoreData() {
            return this.iterator.hasNext();
        }

        @Override
        public DataHandler prepareData() {
            Map<String, String> data = new LinkedHashMap<>(this.batchSize);
            while (this.iterator.hasNext()) {
                String line = this.iterator.next();
                // parse in whatever way you need
                String[] parsed = line.split("\\s+");
                data.put(parsed[0] + " - " + parsed[1], parsed[2]);
                if (data.size() == this.batchSize) {
                    break;
                }
            }
            return new SimpleDataHandler(data);
        }
    }
Data parsing should be done separately (you could create an abstraction for this as well), but for this example I won't do it.
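As a sketch of what that parsing abstraction might look like (the `LineParser` name and the `Map.Entry`-based return type are my own choices, not part of the answer above; `BatchDataPreparer` could then take one in its constructor instead of splitting lines itself):

```java
import java.util.AbstractMap;
import java.util.Map;

public class ParserSketch {

    // A hypothetical parsing abstraction: turns one raw line into a key-value pair.
    @FunctionalInterface
    interface LineParser {
        Map.Entry<String, String> parse(String line);
    }

    // One possible implementation matching the answer's format, e.g. "5 Leah New York":
    // key = "<id> - <name>", value = the rest of the line (so "New York" stays intact).
    static final LineParser DEFAULT = line -> {
        String[] parsed = line.split("\\s+", 3);
        return new AbstractMap.SimpleEntry<>(parsed[0] + " - " + parsed[1], parsed[2]);
    };

    public static void main(String[] args) {
        Map.Entry<String, String> entry = DEFAULT.parse("5 Leah New York");
        System.out.println(entry.getKey() + " => " + entry.getValue()); // 5 - Leah => New York
    }
}
```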
The DataHandler interface from above:
    public interface DataHandler {

        void handleData();
    }
And a simple implementation:
    public class SimpleDataHandler implements DataHandler {

        private final Map<String, String> data;

        public SimpleDataHandler(Map<String, String> data) {
            this.data = data;
        }

        @Override
        public void handleData() {
            // handle however you need
            System.out.println(this.data);
            this.data.clear();
        }
    }
And combining it all into one:
    public class Temp {
        public static void main(String[] args) {
            int maxDataCount = 3;

            System.out.println("------------------------Concern Separation Result-------------------------------");
            Iterator<String> iterator = new FileLineIterator(/* file input stream here */);
            DataPreparer dataPreparer = new BatchDataPreparer(iterator, maxDataCount);
            while (dataPreparer.hasMoreData()) {
                DataHandler dataHandler = dataPreparer.prepareData();
                dataHandler.handleData();
            }

            System.out.println("------------------------Counting Iterator Result--------------------------------");
            System.out.println("------------------------Just to test it works-----------------------------------");
            Iterator<Iterable<String>> anotherIterator = new CountingFileLineIterator(/* file input stream here */, maxDataCount);
            while (anotherIterator.hasNext()) {
                Iterable<String> lines = anotherIterator.next();
                // some handling
                System.out.println(lines);
            }
        }
    }
- Your main (or however your program is structured) does not care how data is prepared; the DataPreparer implementations are concerned with that.
- Neither the preparer nor main is concerned with how data is handled; only the DataHandler is.
- It's easy to change behaviour, fix bugs without breaking something else, extend functionality, etc.