How can I count the lines that contain a certain word using a Java Stream?

My method reads a text file and counts how many lines contain the word “the”. It works, but I need to count only the lines that contain the word by itself, not lines where it appears only as a substring of another word.

For example, I wouldn’t want “therefore” to count: even though it contains “the”, the word is not by itself.

I’m trying to find a way to limit the lines to those that contain “the” as a word of exactly 3 letters, but I’m unable to do that.

Here is my method right now:

public static long findThe(String filename) {
    long count = 0;

    // try-with-resources closes the underlying file once the stream is consumed
    try (Stream<String> lines = Files.lines(Paths.get(filename))) {
        // contains() also matches substrings such as "then" or "other"
        count = lines.filter(w -> w.contains("the")).count();
    } catch (IOException x) {
        System.out.println("File: " + filename + " not found");
    }

    System.out.println(count);
    return count;
}

For example, if a text file contains these lines:

This is the first line
This is the second line
This is the third line
This is the fourth line
Therefore, this is a name.

The method should return 4.

Answer

Use regex to enforce word boundaries:

count = lines.filter(w -> w.matches("(?i).*\\bthe\\b.*")).count();

or for the general case:

count = lines.filter(w -> w.matches("(?i).*\\b" + search + "\\b.*")).count();

Details:

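\b means “word boundary” (written as "\\b" inside a Java string literal, because a single "\b" there is a backspace character)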
(?i) means “ignore case”

Using word boundaries prevents "Therefore" from matching.

Note that in Java, unlike many other languages, String#matches() must match the entire string (not just find a match within the string) to return true, hence the .* at either end of the regex.
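As an alternative to padding the regex with .*, the pattern can be compiled once and applied with find-style matching. The following is only a minimal sketch of that idea, not the answer's own code: the class name WordLineCounter and the method countLinesWithWord are hypothetical, and wrapping the search term in Pattern.quote() is an added assumption so that a term containing regex metacharacters is treated literally.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical helper class, introduced only for this illustration.
final class WordLineCounter {

    // Counts the lines that contain `word` as a whole word, ignoring case.
    static long countLinesWithWord(Stream<String> lines, String word) {
        // Pattern.quote() makes the search term literal; \\b marks word boundaries.
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        // asPredicate() tests with find(), so no leading or trailing .* is needed.
        return lines.filter(pattern.asPredicate()).count();
    }

    // Same count for a file; try-with-resources closes the file afterwards.
    static long countLinesWithWord(Path file, String word) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return countLinesWithWord(lines, word);
        }
    }

    public static void main(String[] args) {
        Stream<String> sample = Stream.of(
                "This is the first line",
                "This is the second line",
                "This is the third line",
                "This is the fourth line",
                "Therefore, this is a name.");
        System.out.println(countLinesWithWord(sample, "the")); // prints 4
    }
}
```

Compiling the pattern once avoids recompiling the regex for every line, which is what String#matches() does internally on each call.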

Answer