How do we split words from an HTML file using string manipulation in Java?

I need to create a method that reads an HTML file and then displays the number of occurrences of each word in a given list.

For example: String [] words = {"happy", "nice", "good"};

The word happy was used 7 times. The word nice was used 1 times. The word good was used 2 times.

This is what I did:

public static void ReadWriteDisplay() {
    
 Path in = Paths.get("E:\\TextToHTML.html");
 Path out = Paths.get("E:\\HTMLToText.txt");
 String s = "";
 String str = "";
 try {
    InputStream input = new BufferedInputStream(Files.newInputStream(in));
    BufferedReader reader = new BufferedReader(new InputStreamReader(input));
        
    OutputStream output = new BufferedOutputStream(Files.newOutputStream(out, CREATE, WRITE, TRUNCATE_EXISTING));
    BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(output));
        
    s = reader.readLine();
    while(s != null) {
      str += s;
      writer.write(s);
      writer.newLine();
      s = reader.readLine();
    }
reader.close();
writer.close();
        
String a[] = str.split(" ");
System.out.println("str: "+str);
String [] positive = {"happy", "nice", "good", "joy", "love"};
int [] count = {0, 0, 0, 0, 0};
for (int i = 0; i < a.length; i++) {
    if(positive[0].equalsIgnoreCase(a[i]))
                count[0]++;
    if(positive[1].equalsIgnoreCase(a[i]))
                count[1]++;
    if(positive[2].equalsIgnoreCase(a[i]))
                count[2]++;
    if(positive[3].equalsIgnoreCase(a[i]))
                count[3]++;
    if(positive[4].equalsIgnoreCase(a[i]))
                count[4]++;
}
        
for (int x = 0; x < 5; x++) {
    System.out.println("The word "+positive[x]+" was used "+count[x]+" times.");
}
        
} catch(Exception e) {
    System.err.println("Message: "+ e);
  } 
}

My method runs, but it does not report the correct number of occurrences. The reason is that some words in the HTML are enclosed in tags, so a token such as <>Hello<> is stored in my string array instead of the word Hello.

Here is the sample output:

str: <!DOCTYPE html><html lang="en"><head>    <meta charset="utf-8">    <meta http-equiv="X-UA-Compatible" content="IE=edge">    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>    <meta http-equiv="content-language" content="en" />    <meta name="viewport" content="width=device-width, initial-scale=1">    <meta name="google-site-verification" content="rUp8isOBygjhxPJ2qyy6QtBi9vWRFhIboMXucJsCtrE" />    <title>JustPaste.it - Share Text &amp; Images the Easy Way</title>    <link rel="preload" href="/static/img/jp_logo_1_en_v4.png" as="image" />                <meta name="robots" content="noindex, nofollow" />        <meta name="googlebot" content="noindex, nofollow" />                                <link rel="preload" href="/build/global.395f53d0.css" as="style" />            <link rel="stylesheet" type="text/css"  href="/build/global.395f53d0.css" />                    <link rel="shortcut icon" href="/static/other/fav.ico" />             <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->        <!-- WARNING: Respond.js doesn't work if you view the page via file:// -->        <!--[if lt IE 9]>            <script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>            <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>        <![endif]-->        <script>      window.article = 
{"id":42017684,"url":"https://justpaste.it/6fn9m","shortUrl":"https://jpst.it/2wiek","pdfUrl":"https://justpaste.it/6fn9m/pdf","qrCodeData":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFcAAABXCAIAAAD+qk47AAAACXBIWXMAAA7EAAAOxAGVKw4bAAACCklEQVR4nO2by27DMAwEx0X//5fTAwFdaNB8SEmB7BzjSDEWy4ikpOv1evH1/Hz6Bf4FUgGkgiEVQCoYv/6j67omM65FJzOPX6HWKD9PaebSj8oLIBWMm4hYlBIq79Jg+Pqyd3vpR4dvuJAXQCoYUUQsAi9lPOlt74dnloZzbygvgFQwUhExpJft9EKjh7wAUsF4R0QE+Bh5g/898gJIBSMVEUNzDjOiDMN55AWQCkYUEcOWTqlrtL18KCEvgFQwbiJie7qSMXkpELa/obwAUsFI7UcEpXHw397bmMh0cXtJVzBKXgCpYFyB3xYlT/Ye3bzZ7q264EflBZAKRmqHLmPyYJR/5IeXEqrt8SgvgFQwojoiY9feEpN5VCLo4maQF0AqGLVzTcM/50UpEdpVj+sUxwNSAao7dJk6erHrhN65umYhL4BUMGoRUTJ56TsBw/UoM0peAKlg1CrrRamgLnEu6VLW9IBUgLj7Ouz/DJePHr16RF4AqWA096yDc92lCXs3hjzDyJIXQCoYB+/Q9Q4vDS9cBPOojnhAKsDRO3R+nl3dp94uhrKmB6QCHL1Dlznp1GsWbUdeAKlgvOPGUK8juqt5mymx5QWQCsbBiCglS5+9KCEvgFQwDt6hO3djdHtfV14AqWAcvEO36B1M6mVNvQpFXgCpYNzs0H0h8gJIBUMqgFQwpALAH/JvmLtnlWjnAAAAAElFTkSuQmCC"};      window.statsUrl = 'httpsu003A//stats.justpaste.it';      window.viewKey = 'x6ER';      window.barOptions = 
{"isLoggedIn":false,"hasPublicProfile":false,"displayOwnership":false,"isArticleOwner":false,"isPasswordProtected":false,"isCaptchaRequired":null,"isCaptchaEntered":false,"captchaSettings":null,"premiumUserData":null,"isPrivate":false,"isExpired":false,"expireAfterRead":false,"isShared":false,"defaultAvatar":"/static/img/avatar60.jpg","createdText":"6h","showLastEdit":false,"modifiedText":"6h","isInTrash":false,"viewsText":"2","favouritesCount":0,"onlineText":"1","getFavouriteArticleUrl":"https://justpaste.it/api/account/v1/favourite-article/42017684","addFavouriteArticleUrl":"https://justpaste.it/api/account/v1/favourite-article","removeFavouriteArticleUrl":"https://justpaste.it/api/account/v1/favourite-article-delete/42017684","apiShowArticleDynamicUrl":"/api/v1/article-dynamic","voteUrl":"/api/account/v1/vote","contentLang":"en","positiveVotes":0,"negativeVotes":0,"currentVote":"empty","linkSharingUrl":null,"linkSharingSecret":null};          </script>        <script src="/build/runtime.a1e5a72a.js" async></script>        <script src="/build/1676.2c557867.js" async></script>        <script src="/build/8452.a9a1e0c5.js" async></script>        <script src="/build/5936.ad26e56d.js" async></script>        <script src="/build/9412.4a605741.js" async></script>        <script src="/build/showarticlewidget.3bbca334.js" async></script>        </head><body marginwidth="0" dir="ltr" marginheight="0"><!-- Static navbar --><div class="navbar navbar-default navbar-static-top mainTableTopMiddle" role="navigation">    <div class="container">        <div class="navbar-header pull-left">            <a href="/"><img src="/static/img/jp_logo_1_en_v4.png" width="186px" height="54px" alt="JustPaste.it" /></a>        </div>        <div class="navbar-header pull-left">            <div class="nav navbar-nav mainTableTopMiddleRight hidden-xs hidden-sm">                <img src="/static/img/jp_logo_2_en_v5.png" width="390px" height="54px" />            </div>        </div>        <div 
class="navbar-header pull-right" style="padding-top:8px">            <div id="mainPanelButtons"></div>        </div>    </div><!--/.nav-collapse --></div><div id="headContainer" class="container" style="max-width: 960px">    <div class="row">        <div class="col-md-12">            <div id="mainTableContent">                <div style="max-width: 960px; vertical-align: top">            <div id="showArticleWidget"><div class="showArticleWidgetPlaceholder"></div></div>        <div id="articleContent">        <p>happy</p> <p>nice nice</p> <p>good good good</p> <p>joy Joy joy Joy joy</p> <p>Love love Love love Love</p>    </div>            <div id="showArticleBottomWidget"><div class="articleBottomWidgetPlaceholder"></div></div>    <span style="visibility:hidden" class="glyphicon glyphicon-link"></span></div>            </div>        </div>    </div> <!-- /row --></div> <!-- /container --><div id="footer" style="min-height: 30px;">    <div class="container" style="vertical-align: middle">        <div class="col-md-3 col-xs-5 col-sm-4 text-muted" style="font-size: 95%;" align="left">            &copy; 2021 <span class="hidden-xs">justpaste.it</span>        </div>        <div class="col-md-9 col-xs-7 col-sm-8 text-muted"  align="right">            <ul class="list-inline basePageFooterList">                <li class="hidden-xs">                    <a href="/login">Account</a>                </li>                <li class="hidden-xs">                    <a href="/terms">Terms</a>                </li>                <li class="hidden-xs">                    <a href="/privacypolicy">Privacy</a>                </li>                <li class="hidden-xs">                    <a href="/cookies">Cookies</a>                </li>                <li>                    <a href="/u/justpasteit">Blog</a>                </li>                <li>                    <a href="/about">About</a>                </li>            </ul>        </div>    </div></div>        <script>      
window.mainPanelOptions = {        addArticleUrl: '/',        loginUrl: '/login',        logoutUrl: '/logout',        favouriteArticlesUrl: '/account/favourite',        subscribedArticlesUrl: '/account/subscribed',        sharedArticlesUrl: '/account/shared',        manageAccountUrl: '/account/manage',        messagesUrl: '/account/messages',        articlesStatsUrl: '/account/articles-stats',        premiumUrl: '/premium/subscription',        unreadMessagesUrl: 'https://msg.justpaste.it/api/v1/conversation/unread',        profileSettings: '/account/settings',        isLoggedIn: false,        userEmail: null,        userPermalink: null,        userProfileIsPublic: false,        userProfileLink: null      };          </script>        <script src="/build/mainpanelwidget.80530742.js" async></script>        </body></html>

    The word happy was used 0 times.
    The word nice was used 0 times.
    The word good was used 1 times.
    The word joy was used 3 times.
    The word love was used 3 times.

How do I properly split the text or count the number of occurrences? Thank you!

Answer

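This removes special characters and keeps only letters, so a token such as <>Hello<> becomes Hello. It also removes spaces, so apply it to each token after splitting rather than to the whole text. Note that tag names are letters too: a token such as <p>happy</p> would become phappyp, so real tags need to be removed before this step.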

String alphaOnly = input.replaceAll("[^a-zA-Z]+", "");
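
Building on that idea, here is a minimal sketch of one way to get the counts from the sample page. It is not the only approach, and it rests on a few assumptions: the class and method names (HtmlWordCounter, stripHtml, countWords) are made up for illustration, the HTML is passed in as a single string, and the tags are removed with rough regular expressions rather than a real HTML parser, which would be more robust on messy pages. It drops comments and script/style blocks first, then the remaining tags and entities such as &amp;, and only then splits on anything that is not a letter.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class HtmlWordCounter {

    // Rough regex-based cleanup (an assumption of this sketch, not a full HTML parser).
    static String stripHtml(String html) {
        return html
                .replaceAll("(?is)<!--.*?-->", " ")                       // HTML comments
                .replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ")  // script/style blocks with their content
                .replaceAll("(?s)<[^>]*>", " ")                           // any remaining tags
                .replaceAll("&[a-zA-Z]+;|&#\\d+;", " ");                  // entities such as &amp; or &#169;
    }

    // Counts case-insensitive occurrences of each word in the visible text.
    static Map<String, Integer> countWords(String html, String[] words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words) {
            counts.put(w.toLowerCase(Locale.ROOT), 0);
        }
        // Split on runs of non-letters so punctuation and whitespace never stick to a word.
        for (String token : stripHtml(html).split("[^a-zA-Z]+")) {
            String key = token.toLowerCase(Locale.ROOT);
            if (counts.containsKey(key)) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The article fragment from the sample output in the question.
        String html = "<div id=\"articleContent\"> <p>happy</p> <p>nice nice</p> <p>good good good</p>"
                + " <p>joy Joy joy Joy joy</p> <p>Love love Love love Love</p> </div>";
        String[] positive = {"happy", "nice", "good", "joy", "love"};
        countWords(html, positive).forEach((word, n) ->
                System.out.println("The word " + word + " was used " + n + " times."));
    }
}
```

For that fragment this prints 1, 2, 3, 5 and 5 for happy, nice, good, joy and love. If you keep reading the file line by line as in the question, append a space or newline after each line (str += s + " ") so that the last word of one line is not glued to the first word of the next.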