I have a large file containing two million lines. I want to traverse each line of the file, process it into a key-value pair, and store it in a HashMap for later comparisons. However, I do not want a HashMap with two million key-value pairs, in the interest of space complexity. Instead, I would like to iterate over N lines of the file, load their key-value pairs into the HashMap, make comparisons, then load the next N lines into the HashMap, and so on.
An example of the use case:

File.txt:

    1 Jack London
    2 Mary Boston
    3 Jay Chicago
    4 Mia Amsterdam
    5 Leah New York
    6 Bob Denver
    ...

Assuming N=3 as the size of my hashmap, on the first iteration my hashmap would store key-value pairs for the first three lines of the file, i.e.

    1 Jack London
    2 Mary Boston
    3 Jay Chicago

After making comparisons on these key-value pairs, the next 3 lines are loaded into the hashmap as key-value pairs:

    4 Mia Amsterdam
    5 Leah New York
    6 Bob Denver
and so on until all the lines in the file have been iterated over. How do I implement this using the iterator design pattern in Java?
Answer
You can do something like this:
    public class Temp {
        public static void main(String[] args) throws Exception {
            BufferedReader reader = new BufferedReader(new InputStreamReader(/* get file input stream here */));
            int maxSize = 3;
            Map<String, String> map = new HashMap<>(maxSize);
            String line = reader.readLine();
            while (line != null) {
                String[] data = line.split("\\s+"); // some dummy parsing
                String key = data[1];
                String value = data[2];
                map.put(key, value);
                if (map.size() == maxSize) {
                    // do whatever operations you need
                    System.out.println(map);
                    map.clear();
                }
                line = reader.readLine();
            }
            if (map.size() > 0) {
                // deal with leftovers
                System.out.println(map);
            }
        }
    }
Basically, read and parse until the map reaches its maximum size, then operate on the contents and empty it. Keep doing this until you have read the entire file. At the end, operate on the leftover contents, if there are any.
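To make the loop above concrete, here is a self-contained sketch of the same batch-and-clear idea, run against the sample data from the question. A `StringReader` stands in for the real file stream, and the `readInBatches` helper name is my own choice, not part of the answer:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchDemo {

    // Collects parsed key-value pairs into batches of at most maxSize entries.
    // Splitting with a limit of 2 keeps "New York" together as the value.
    static List<Map<String, String>> readInBatches(BufferedReader reader, int maxSize) {
        List<Map<String, String>> batches = new ArrayList<>();
        Map<String, String> map = new LinkedHashMap<>(maxSize);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] data = line.split("\\s+", 2); // "1 Jack London" -> ["1", "Jack London"]
                map.put(data[0], data[1]);
                if (map.size() == maxSize) {
                    batches.add(new LinkedHashMap<>(map)); // copy out the full batch
                    map.clear();
                }
            }
        } catch (IOException exc) {
            throw new RuntimeException("Error reading", exc);
        }
        if (!map.isEmpty()) {
            batches.add(map); // leftovers
        }
        return batches;
    }

    public static void main(String[] args) {
        String file = "1 Jack London\n2 Mary Boston\n3 Jay Chicago\n"
                + "4 Mia Amsterdam\n5 Leah New York\n6 Bob Denver\n7 Ann Oslo";
        // Prints three batches: two full ones of size 3, plus a leftover with entry 7.
        for (Map<String, String> batch : readInBatches(new BufferedReader(new StringReader(file)), 3)) {
            System.out.println(batch);
        }
    }
}
```

In a real run you would replace the `StringReader` with a reader over the actual file and do your comparisons where the demo copies the batch out.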
Edit: You need to wrap the reading inside an iterator and also keep track of the maximum number of lines to read at once.
    public class CountingFileLineIterator implements Iterator<Iterable<String>> {

        private final BufferedReader reader;
        private String line;
        private final int maxDataCount;

        public CountingFileLineIterator(InputStream inputStream, int maxDataCount) {
            this.reader = new BufferedReader(new InputStreamReader(inputStream));
            this.setLine();
            this.maxDataCount = maxDataCount;
        }

        private void setLine() {
            try {
                this.line = this.reader.readLine();
            } catch (IOException exc) {
                // handle however you need
                throw new RuntimeException("Error reading", exc);
            }
        }

        @Override
        public boolean hasNext() {
            return this.line != null;
        }

        @Override
        public Iterable<String> next() {
            List<String> next = new ArrayList<>(this.maxDataCount);
            for (int i = 0; i < this.maxDataCount; i++) {
                if (!this.hasNext()) {
                    break;
                }
                next.add(this.line);
                this.setLine();
            }
            return next;
        }
    }
This is only one possible solution, which returns an Iterable (a List in this exact implementation) containing up to the maximum number of lines to be processed at once. This is what I came up with in order to keep the processing done by the iterator to the absolute minimum. You can (and should) have another class which actually handles the processing of the data (parses it into a Map and so on). The thing is, even like this, the iterator has more responsibility than it should: creating the batches of data.
My suggestion would be to have the iterator only return the next line, with no processing at all; that is exactly what an iterator should be doing.
    public class FileLineIterator implements Iterator<String> {

        private final BufferedReader reader;
        private String line;

        public FileLineIterator(InputStream inputStream) {
            this.reader = new BufferedReader(new InputStreamReader(inputStream));
            this.setLine();
        }

        private void setLine() {
            try {
                this.line = this.reader.readLine();
            } catch (IOException exc) {
                // handle however you need
                throw new RuntimeException("Error reading", exc);
            }
        }

        @Override
        public boolean hasNext() {
            return this.line != null;
        }

        @Override
        public String next() {
            String line = this.line;
            this.setLine();
            return line;
        }
    }
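For reference, here is a self-contained sketch of how such a line iterator could be driven. The iterator body is the same as above, condensed; the `ByteArrayInputStream` stands in for the real file stream, and `readAll` is just a demo helper of my own:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class FileLineIteratorDemo {

    // Same iterator as above, condensed: always holds the next line to return,
    // so hasNext() is a simple null check.
    static class FileLineIterator implements Iterator<String> {
        private final BufferedReader reader;
        private String line;

        FileLineIterator(InputStream in) {
            this.reader = new BufferedReader(new InputStreamReader(in));
            advance();
        }

        private void advance() {
            try {
                this.line = this.reader.readLine();
            } catch (IOException exc) {
                throw new RuntimeException("Error reading", exc);
            }
        }

        @Override
        public boolean hasNext() {
            return this.line != null;
        }

        @Override
        public String next() {
            String current = this.line;
            advance();
            return current;
        }
    }

    // Demo helper: drains the iterator into a list.
    static List<String> readAll(InputStream in) {
        List<String> lines = new ArrayList<>();
        Iterator<String> it = new FileLineIterator(in);
        while (it.hasNext()) {
            lines.add(it.next());
        }
        return lines;
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "1 Jack London\n2 Mary Boston".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in)); // [1 Jack London, 2 Mary Boston]
    }
}
```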
Then create an abstraction, which will prepare data for handling:
    public interface DataPreparer {

        boolean hasMoreData();

        DataHandler prepareData();
    }
Like this you can have implementations that prepare data in batches (your case), line by line, or all at once, however you need. An exact implementation for batches may be:
    public class BatchDataPreparer implements DataPreparer {

        private final Iterator<String> iterator;
        private final int batchSize;

        public BatchDataPreparer(Iterator<String> iterator, int batchSize) {
            this.iterator = iterator;
            this.batchSize = batchSize;
        }

        @Override
        public boolean hasMoreData() {
            return this.iterator.hasNext();
        }

        @Override
        public DataHandler prepareData() {
            Map<String, String> data = new LinkedHashMap<>(this.batchSize);
            while (this.iterator.hasNext()) {
                String line = this.iterator.next();
                // parse in whatever way you need
                String[] parsed = line.split("\\s+");
                data.put(parsed[0] + " - " + parsed[1], parsed[2]);
                if (data.size() == this.batchSize) {
                    break;
                }
            }
            return new SimpleDataHandler(data);
        }
    }
Data parsing should be done separately (you could create an abstraction for this as well), but for this example I won't do it.
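As a sketch of what that parsing abstraction might look like (the `LineParser` name and the `Map.Entry`-based return type are my own choices, not part of the answer above; `BatchDataPreparer` could then take one in its constructor instead of splitting lines itself):

```java
import java.util.AbstractMap;
import java.util.Map;

public class ParserSketch {

    // A hypothetical parsing abstraction: turns one raw line into a key-value pair.
    @FunctionalInterface
    interface LineParser {
        Map.Entry<String, String> parse(String line);
    }

    // One possible implementation matching the answer's format, e.g. "5 Leah New York":
    // key = "<id> - <name>", value = the rest of the line (so "New York" stays intact).
    static final LineParser DEFAULT = line -> {
        String[] parsed = line.split("\\s+", 3);
        return new AbstractMap.SimpleEntry<>(parsed[0] + " - " + parsed[1], parsed[2]);
    };

    public static void main(String[] args) {
        Map.Entry<String, String> entry = DEFAULT.parse("5 Leah New York");
        System.out.println(entry.getKey() + " => " + entry.getValue()); // 5 - Leah => New York
    }
}
```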
The DataHandler interface from above:
    public interface DataHandler {

        void handleData();
    }
And a simple implementation:
    public class SimpleDataHandler implements DataHandler {

        private final Map<String, String> data;

        public SimpleDataHandler(Map<String, String> data) {
            this.data = data;
        }

        @Override
        public void handleData() {
            // handle however you need
            System.out.println(this.data);
            this.data.clear();
        }
    }
And combining it all into one:
    public class Temp {
        public static void main(String[] args) {
            int maxDataCount = 3;

            System.out.println("------------------------Concern Separation Result-------------------------------");
            Iterator<String> iterator = new FileLineIterator(/* file input stream here */);
            DataPreparer dataPreparer = new BatchDataPreparer(iterator, maxDataCount);
            while (dataPreparer.hasMoreData()) {
                DataHandler dataHandler = dataPreparer.prepareData();
                dataHandler.handleData();
            }

            System.out.println("------------------------Counting Iterator Result--------------------------------");
            System.out.println("------------------------Just to test it works-----------------------------------");
            Iterator<Iterable<String>> anotherIterator = new CountingFileLineIterator(/* file input stream here */, maxDataCount);
            while (anotherIterator.hasNext()) {
                Iterable<String> lines = anotherIterator.next();
                // some handling
                System.out.println(lines);
            }
        }
    }
- Your main (or however your program is structured) does not care how data is prepared; the DataPreparer implementations are concerned with that.
- Neither the preparer nor main is concerned with how data is handled; only the DataHandler is.
- It's easy to change behaviour, fix bugs without breaking something else, extend functionality, etc.