My method reads a text file and counts how many lines contain the word “the”. It works, but I need to count only the lines that contain “the” as a standalone word, not as a substring of a longer word.
For example, I wouldn’t want “therefore” to count: it contains “the”, but not as a word by itself.
I’m trying to find a way to limit the lines to those that contain “the” where the word is exactly 3 characters long, but I haven’t been able to do that.
Here is my method right now:
    public static long findThe(String filename) {
        long count = 0;
        try {
            Stream<String> lines = Files.lines(Paths.get(filename));
            count = lines.filter(w->w.contains("the"))
                         .count();
        } catch (IOException x) {
            // TODO Auto-generated catch block
            System.out.println("File: " + filename + " not found");
        }
        System.out.println(count);
        return count;
    }
For example, if a text file contains these lines:
    This is the first line
    This is the second line
    This is the third line
    This is the fourth line
    Therefore, this is a name.
The method would return 4
Answer
Use regex to enforce word boundaries:
count = lines.filter(w -> w.matches("(?i).*\\bthe\\b.*")).count();
or for the general case:
count = lines.filter(w -> w.matches("(?i).*\\b" + search + "\\b.*")).count();
Details:
\b means “word boundary” (written \\b inside a Java string literal)
(?i) means “ignore case”
Using word boundaries prevents "Therefore" from matching.
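To see the word-boundary behaviour in isolation, here is a minimal sketch. The class name WordBoundaryDemo and the helper containsWord are made up for illustration, and wrapping the search term in Pattern.quote is an extra precaution that is not part of the one-liners above (it keeps characters such as "." in the term from being read as regex syntax).

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {

    // Hypothetical helper: true if `line` contains `word` as a whole word, ignoring case.
    static boolean containsWord(String line, String word) {
        // "\\b" in Java source is the regex \b (word boundary).
        // Pattern.quote makes the search term match literally.
        return line.matches("(?i).*\\b" + Pattern.quote(word) + "\\b.*");
    }

    public static void main(String[] args) {
        System.out.println(containsWord("This is the first line", "the"));     // true
        System.out.println(containsWord("Therefore, this is a name.", "the")); // false
        System.out.println(containsWord("The end", "the"));                    // true
    }
}
```

The second call is false because in "Therefore" the letters "The" are followed by another word character, so there is no boundary after them.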
Note that in Java, unlike many other languages, String#matches() must match the entire string (not just find a match within the string) to return true, hence the .* at either end of the regex.
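If the leading and trailing .* feel awkward, one alternative is a precompiled Pattern with Matcher#find(), which looks for a match anywhere in the line. The sketch below is only one way to arrange it: the class name CountLines and the helper countThe are hypothetical, and it lets the IOException propagate instead of printing a message as the question's method does. It also opens the stream in try-with-resources so the file is closed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class CountLines {

    // Compiled once; find() searches within the line, so no ".*" padding is needed.
    private static final Pattern THE =
            Pattern.compile("\\bthe\\b", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: counts the lines that contain "the" as a whole word.
    static long countThe(Stream<String> lines) {
        return lines.filter(line -> THE.matcher(line).find()).count();
    }

    static long findThe(String filename) throws IOException {
        // try-with-resources closes the underlying file when the stream is done.
        try (Stream<String> lines = Files.lines(Paths.get(filename))) {
            return countThe(lines);
        }
    }

    public static void main(String[] args) throws IOException {
        // Write the sample lines from the question to a temporary file.
        Path tmp = Files.createTempFile("the-demo", ".txt");
        Files.write(tmp, Arrays.asList(
                "This is the first line",
                "This is the second line",
                "This is the third line",
                "This is the fourth line",
                "Therefore, this is a name."));
        System.out.println(findThe(tmp.toString())); // prints 4
        Files.delete(tmp);
    }
}
```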