
How to use the iterator pattern in Java to load a file into a HashMap in batches

I have a large file containing two million lines. I want to traverse each line of the file, process it into a key-value pair, and store it in a HashMap to make comparisons later on. However, in the interest of space complexity, I do not want a HashMap holding 2 million key-value pairs. Instead, I would like to iterate through N lines of the file, load their key-value pairs into the HashMap, make comparisons, then load the next N lines into the HashMap, and so on.

An example of the use case:

File.txt:

1 Jack London
2 Mary Boston
3 Jay  Chicago
4 Mia  Amsterdam
5 Leah New York
6 Bob  Denver
.
.
.

Assuming N=3 as the size of my HashMap, at the first iteration my HashMap would store key-value pairs for the first three lines of the file, i.e.:

1 Jack London
2 Mary Boston
3 Jay  Chicago

After making comparisons on these key-value pairs, the next 3 lines are loaded into the HashMap as key-value pairs:

4 Mia  Amsterdam
5 Leah New York
6 Bob  Denver

and so on until all the lines in the file have been iterated over. How do I implement this using the iterator design pattern in Java?


Answer

You can do something like this:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

public class Temp {

  public static void main(String[] args) throws Exception {
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("File.txt")));
    int maxSize = 3;
    Map<String, String> map = new HashMap<>(maxSize);
    String line = reader.readLine();
    while (line != null) {
      String[] data = line.split("\\s+"); //some dummy parsing
      String key = data[1];
      String value = data[2];
      map.put(key, value);
      if (map.size() == maxSize) {
        //do whatever operations you need
        System.out.println(map);
        map.clear();
      }
      line = reader.readLine();
    }
    if (!map.isEmpty()) {
      //deal with leftovers
      System.out.println(map);
    }
    reader.close();
  }
}

Basically, read and parse until the map reaches its max size, then operate on the contents and empty it. Keep doing this until you have read the entire file. At the end, operate on the leftover contents, if there are any.
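The same batching logic can be exercised without touching the file system by extracting it into a helper that works over an in-memory list. This is a sketch under the same assumptions as the loop above (whitespace-separated lines, second token as key, third as value); the class and method names here are made up for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchDemo {

  // Splits lines into maps of at most maxSize entries each; the last
  // map holds the leftovers — the same flow as the reader loop above.
  static List<Map<String, String>> toBatches(List<String> lines, int maxSize) {
    List<Map<String, String>> batches = new ArrayList<>();
    Map<String, String> map = new HashMap<>(maxSize);
    for (String line : lines) {
      String[] data = line.split("\\s+"); // same dummy parsing as above
      map.put(data[1], data[2]);
      if (map.size() == maxSize) {
        batches.add(map);
        map = new HashMap<>(maxSize);
      }
    }
    if (!map.isEmpty()) {
      batches.add(map); // leftovers
    }
    return batches;
  }

  public static void main(String[] args) {
    List<String> lines = List.of(
        "1 Jack London", "2 Mary Boston", "3 Jay Chicago", "4 Mia Amsterdam");
    System.out.println(toBatches(lines, 3));
  }
}
```

With the four sample lines and maxSize 3, this yields two batches: one of three entries and one leftover batch of one.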

Edit: You need to wrap the reading inside an iterator, and also have it keep the maximum number of lines to read at once.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CountingFileLineIterator implements Iterator<Iterable<String>> {

  private final BufferedReader reader;
  private String line;
  private final int maxDataCount;

  public CountingFileLineIterator(InputStream inputStream, int maxDataCount) {
    this.reader = new BufferedReader(new InputStreamReader(inputStream));
    this.setLine();
    this.maxDataCount = maxDataCount;
  }

  private void setLine() {
    try {
      this.line = this.reader.readLine();
    } catch (IOException exc) {
      //handle however you need
      throw new RuntimeException("Error reading", exc);
    }
  }

  @Override
  public boolean hasNext() {
    return this.line != null;
  }

  @Override
  public Iterable<String> next() {
    List<String> next = new ArrayList<>(this.maxDataCount);
    for (int i = 0; i < this.maxDataCount; i++) {
      if (!this.hasNext()) {
        break;
      }
      next.add(this.line);
      this.setLine();
    }
    return next;
  }
}

This is only one possible solution. It returns an Iterable (a List in this exact implementation) containing up to the maximum number of lines to be processed at once. This was what I came up with in order to keep the processing done by the iterator to the absolute minimum. You can (and should) have another class that actually handles the processing of the data (parsing it into a Map and so on). The thing is, even like this, the iterator has more responsibility than it should: creating the batches of data.

My proposal would be to have the iterator only return the next line, with no processing at all; that is exactly what it should be doing.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Iterator;

public class FileLineIterator implements Iterator<String> {

  private final BufferedReader reader;
  private String line;

  public FileLineIterator(InputStream inputStream) {
    this.reader = new BufferedReader(new InputStreamReader(inputStream));
    this.setLine();
  }

  private void setLine() {
    try {
      this.line = this.reader.readLine();
    } catch (IOException exc) {
      //handle however you need
      throw new RuntimeException("Error reading", exc);
    }
  }

  @Override
  public boolean hasNext() {
    return this.line != null;
  }

  @Override
  public String next() {
    String line = this.line;
    this.setLine();
    return line;
  }
}
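A quick way to sanity-check the iterator is to drive it from an in-memory stream instead of a real file. The sketch below repeats a minimal copy of the FileLineIterator so it compiles standalone; the `readAll` helper is a made-up name for illustration:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class FileLineIteratorDemo {

  // Minimal copy of the FileLineIterator above, so this compiles standalone.
  static class FileLineIterator implements Iterator<String> {
    private final BufferedReader reader;
    private String line;

    FileLineIterator(InputStream in) {
      this.reader = new BufferedReader(new InputStreamReader(in));
      setLine();
    }

    private void setLine() {
      try {
        this.line = reader.readLine();
      } catch (IOException exc) {
        throw new RuntimeException("Error reading", exc);
      }
    }

    @Override public boolean hasNext() { return line != null; }

    @Override public String next() {
      String current = line;
      setLine(); // advance the lookahead
      return current;
    }
  }

  // Drains the iterator into a list, for a quick check.
  static List<String> readAll(InputStream in) {
    List<String> out = new ArrayList<>();
    Iterator<String> it = new FileLineIterator(in);
    while (it.hasNext()) {
      out.add(it.next());
    }
    return out;
  }

  public static void main(String[] args) {
    InputStream in = new ByteArrayInputStream(
        "1 Jack London\n2 Mary Boston".getBytes(StandardCharsets.UTF_8));
    System.out.println(readAll(in)); // [1 Jack London, 2 Mary Boston]
  }
}
```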

Then create an abstraction that will prepare the data for handling:

public interface DataPreparer {

  boolean hasMoreData();
  DataHandler prepareData();
}

This way you can have implementations that prepare data in batches (your case), line by line, or all at once, however you need. An exact implementation for batches might be:

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class BatchDataPreparer implements DataPreparer {

  private final Iterator<String> iterator;
  private final int batchSize;

  public BatchDataPreparer(Iterator<String> iterator, int batchSize) {
    this.iterator = iterator;
    this.batchSize = batchSize;
  }

  @Override
  public boolean hasMoreData() {
    return this.iterator.hasNext();
  }

  @Override
  public DataHandler prepareData() {
    Map<String, String> data = new LinkedHashMap<>(this.batchSize);
    while (this.iterator.hasNext()) {
      String line = this.iterator.next();
      //parse in whatever way you need
      String[] parsed = line.split("\\s+");
      data.put(parsed[0] + " - " + parsed[1], parsed[2]);
      if (data.size() == this.batchSize) {
        break;
      }
    }
    return new SimpleDataHandler(data);
  }
}
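For contrast, a line-by-line DataPreparer (one of the alternatives mentioned above) could look like the sketch below. The interfaces are repeated so the snippet compiles standalone, and the parsing follows the same assumptions as the batch version; the class names and the `sink` output target are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class LineByLineDemo {

  // Repeated from the text so this example is self-contained.
  interface DataHandler { void handleData(); }
  interface DataPreparer { boolean hasMoreData(); DataHandler prepareData(); }

  // Alternative preparer: one line per call instead of a batch.
  static class LineDataPreparer implements DataPreparer {
    private final Iterator<String> iterator;
    private final List<String> sink; // hypothetical output target

    LineDataPreparer(Iterator<String> iterator, List<String> sink) {
      this.iterator = iterator;
      this.sink = sink;
    }

    @Override public boolean hasMoreData() { return iterator.hasNext(); }

    @Override public DataHandler prepareData() {
      String[] parsed = iterator.next().split("\\s+"); // same dummy parsing
      String entry = parsed[0] + " - " + parsed[1] + " -> " + parsed[2];
      // DataHandler is a functional interface, so a lambda works here.
      return () -> sink.add(entry);
    }
  }

  // Runs every prepared handler and collects what they produced.
  static List<String> run(List<String> lines) {
    List<String> out = new ArrayList<>();
    DataPreparer preparer = new LineDataPreparer(lines.iterator(), out);
    while (preparer.hasMoreData()) {
      preparer.prepareData().handleData();
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(run(List.of("1 Jack London", "2 Mary Boston")));
  }
}
```

The driving loop is identical to the batch case, which is the point of the abstraction: only the DataPreparer implementation changes.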

Data parsing should be done separately (you can create an abstraction for this as well), but I won't do it for this example.

The DataHandler interface from above:

public interface DataHandler {

  void handleData();
}

And a simple implementation:

import java.util.Map;

public class SimpleDataHandler implements DataHandler {

  private final Map<String, String> data;

  public SimpleDataHandler(Map<String, String> data) {
    this.data = data;
  }

  @Override
  public void handleData() {
    //handle however you need
    System.out.println(this.data);
    this.data.clear();
  }
}
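The question mentions making comparisons on each batch; as an illustration of how a different DataHandler slots in, here is a hypothetical handler that flags values appearing more than once within its batch. The interface is repeated so the snippet compiles standalone, and all names here are made up:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DuplicateValueHandlerDemo {

  // Repeated from the text so this example is self-contained.
  interface DataHandler { void handleData(); }

  // Hypothetical handler: records values that occur more than once
  // within its batch — a stand-in for "make comparisons".
  static class DuplicateValueHandler implements DataHandler {
    private final Map<String, String> data;
    private final List<String> duplicates;

    DuplicateValueHandler(Map<String, String> data, List<String> duplicates) {
      this.data = data;
      this.duplicates = duplicates;
    }

    @Override public void handleData() {
      Set<String> seen = new HashSet<>();
      for (String value : data.values()) {
        if (!seen.add(value)) { // add() returns false on a repeat
          duplicates.add(value);
        }
      }
      data.clear();
    }
  }

  // Convenience wrapper for a single batch.
  static List<String> findDuplicates(Map<String, String> batch) {
    List<String> dups = new ArrayList<>();
    new DuplicateValueHandler(batch, dups).handleData();
    return dups;
  }

  public static void main(String[] args) {
    Map<String, String> batch = new LinkedHashMap<>();
    batch.put("1 - Jack", "London");
    batch.put("2 - Mary", "Boston");
    batch.put("3 - Jay", "London");
    System.out.println(findDuplicates(batch)); // [London]
  }
}
```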

And combining it all into one:

import java.io.FileInputStream;
import java.util.Iterator;

public class Temp {

  public static void main(String[] args) throws Exception {
    int maxDataCount = 3;
    System.out.println("------------------------Concern Separation Result-------------------------------");
    Iterator<String> iterator = new FileLineIterator(new FileInputStream("File.txt"));
    DataPreparer dataPreparer = new BatchDataPreparer(iterator, maxDataCount);
    while (dataPreparer.hasMoreData()) {
      DataHandler dataHandler = dataPreparer.prepareData();
      dataHandler.handleData();
    }
    System.out.println("------------------------Counting Iterator Result--------------------------------");
    System.out.println("------------------------Just to test it works-----------------------------------");
    Iterator<Iterable<String>> anotherIterator =
        new CountingFileLineIterator(new FileInputStream("File.txt"), maxDataCount);
    while (anotherIterator.hasNext()) {
      Iterable<String> lines = anotherIterator.next();
      //some handling
      System.out.println(lines);
    }
  }
}

  • Your main (or however your program is structured) does not care how data is prepared; the DataPreparer implementations are concerned with that
  • Neither the preparer nor main is concerned with how data is handled; only the DataHandler is
  • It's easy to change behaviour, fix bugs without breaking something else, extend functionality, etc.
User contributions licensed under: CC BY-SA