Skip to content
Advertisement

Java Programmingtechnic to check files performant?

I’m searching for a pattern or programming technic in Java for my service.

The Input of the service is an Array of Strings. These Strings represent a file and some information.

The output of the service is a Map of the exact String and a Boolean. The Boolean represents that the information exists in the file.

e.g. Input

["file1/dog","file2/cat","file1/rabbit"]

Output

{"file1/dog":"false","file2/cat":"true","file1/rabbit":"true"}

I want to open files only once and search for all information in this file. e.g. Open file one and search for Dog and rabbit.

How to do this fast in Java?

Should I use a Map with the file as a key to gathering the information? e.g.

{"file1": ["dog","rabbit"], "file2": ["cat"]}

The next step is to loop over the keys to check every file.

Advertisement

Answer

To make an efficient solution, the assumptions and preconditions of the system must be considered.

Assuming the file sizes are small

If each file is going to be relatively small, then I’d recommend caching the contents of the files.

When search patterns are unpredictable

If you cannot predict the possible search terms or the contents of the files cannot be split into searchable items, then simply checking the cached file contents with String#contains will have to do.

However, if search terms are predictable, or files contain a set number of search terms, then you can attempt to index these terms using the methods discussed below (each have different computation requirements).

O(m) time complexity approach – where m is number of files

Assuming each search term is going to be a word in the file, it’s a good idea to store the file contents as a HashSet (to take advantage of the O(1) time complexity of HashSet#contains), with each element being a word or potential search term.

That way, you could have a Map like so:

//the following code runs once at startup
//map containing file names against their words as a HashSet
Map<String, HashSet<String>> fileContents = new HashMap<>();

//add files to Map (assuming parseFileContents returns a HashSet<String>)
fileContents.put("file1", parseFileContents("file1"));
fileContents.put("file2", parseFileContents("file2"));
fileContents.put("file3", parseFileContents("file3"));

//the following code runs every time a request is made
//get files containing the word "dog" (time complexity is O(m) where m is the number of files)
fileContents.entrySet()
    .stream()
    .filter(e -> e.getValue().contains("dog"))
    .map(Map.Entry::getKey)
    .collect(LinkedList::new);

O(1) time complexity approach

If the search patterns were predictable and there weren’t many, then you could create a Map where search terms are keys and a List of files are the values:

//the following code runs once at startup
//map containing possible search patterns against files that they appear in
Map<String, List<String>> patternsAgainstFiles = new HashMap<>();

//then loop through all files and each term found in the file as a key to the map,
//with the file being added to the list in the value

//the following code runs every time a request is made
//get files containing the word "dog" (time complexity is O(1) after initial computation at startup)
patternsAgainstFiles.getOrDefault("dog", new LinkedList<>());

Other option

Use this option when:

  • Files are too large to store in memory
  • Files could change during program execution

In this case, you have no choice but to actually search the file in secondary storage every time you wish to search for the given pattern in the file.

The best method I can think for doing this at the time of writing, is opening a FileInputStream (or something similar) for each file, then reading one byte at a time and appending it to a ‘current buffer’ which is never longer than the length of the search term in bytes. Whenever the ‘current buffer’ has the same length (in bytes) as the search term, it is evaluated against the search term, and if they match then the search term is in the file, but if they don’t match pop the first byte in the ‘current buffer’ and continue.

The above is all my suggestions and there may be more efficient ways to achieve the task at hand. These are just the best approaches I could think of as of writing.

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement