I need to create a method that reads an HTML file and then displays the number of occurrences of each word.
For example: String[] words = {"happy", "nice", "good"};
The word happy was used 7 times. The word nice was used 1 times. The word good was used 2 times.
This is what I did:
// Assumes imports of java.io.*, java.nio.file.* and
// import static java.nio.file.StandardOpenOption.*;
public static void ReadWriteDisplay() {
    Path in = Paths.get("E:\\TextToHTML.html");
    Path out = Paths.get("E:\\HTMLToText.txt");
    String s = "";
    String str = "";
    try {
        InputStream input = new BufferedInputStream(Files.newInputStream(in));
        BufferedReader reader = new BufferedReader(new InputStreamReader(input));
        OutputStream output = new BufferedOutputStream(Files.newOutputStream(out, CREATE, WRITE, TRUNCATE_EXISTING));
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(output));

        // Copy the file line by line and collect its text in str
        s = reader.readLine();
        while (s != null) {
            str += s;
            writer.write(s);
            writer.newLine();
            s = reader.readLine();
        }
        reader.close();
        writer.close();

        // Split on single spaces and compare each token with the word list
        String a[] = str.split(" ");
        System.out.println("str: " + str);
        String[] positive = {"happy", "nice", "good", "joy", "love"};
        int[] count = {0, 0, 0, 0, 0};
        for (int i = 0; i < a.length; i++) {
            if (positive[0].equalsIgnoreCase(a[i])) count[0]++;
            if (positive[1].equalsIgnoreCase(a[i])) count[1]++;
            if (positive[2].equalsIgnoreCase(a[i])) count[2]++;
            if (positive[3].equalsIgnoreCase(a[i])) count[3]++;
            if (positive[4].equalsIgnoreCase(a[i])) count[4]++;
        }
        for (int x = 0; x < 5; x++) {
            System.out.println("The word " + positive[x] + " was used " + count[x] + " times.");
        }
    } catch (Exception e) {
        System.err.println("Message: " + e);
    }
}
My method runs, but it does not report the correct number of occurrences. The reason is that some words in the HTML are enclosed in tags (<>), which causes <>Hello<> to be stored in my string array instead of the word Hello.
Here is the sample output:
str: <!DOCTYPE html><html lang="en"><head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> <meta http-equiv="content-language" content="en" /> <meta name="viewport" content="width=device-width, initial-scale=1"> <meta name="google-site-verification" content="rUp8isOBygjhxPJ2qyy6QtBi9vWRFhIboMXucJsCtrE" /> <title>JustPaste.it - Share Text & Images the Easy Way</title> <link rel="preload" href="/static/img/jp_logo_1_en_v4.png" as="image" /> <meta name="robots" content="noindex, nofollow" /> <meta name="googlebot" content="noindex, nofollow" /> <link rel="preload" href="/build/global.395f53d0.css" as="style" /> <link rel="stylesheet" type="text/css" href="/build/global.395f53d0.css" /> <link rel="shortcut icon" href="/static/other/fav.ico" /> <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries --> <!-- WARNING: Respond.js doesn't work if you view the page via file:// --> <!--[if lt IE 9]> <script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script> <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script> <![endif]--> <script> window.article = 
{"id":42017684,"url":"https://justpaste.it/6fn9m","shortUrl":"https://jpst.it/2wiek","pdfUrl":"https://justpaste.it/6fn9m/pdf","qrCodeData":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFcAAABXCAIAAAD+qk47AAAACXBIWXMAAA7EAAAOxAGVKw4bAAACCklEQVR4nO2by27DMAwEx0X//5fTAwFdaNB8SEmB7BzjSDEWy4ikpOv1evH1/Hz6Bf4FUgGkgiEVQCoYv/6j67omM65FJzOPX6HWKD9PaebSj8oLIBWMm4hYlBIq79Jg+Pqyd3vpR4dvuJAXQCoYUUQsAi9lPOlt74dnloZzbygvgFQwUhExpJft9EKjh7wAUsF4R0QE+Bh5g/898gJIBSMVEUNzDjOiDMN55AWQCkYUEcOWTqlrtL18KCEvgFQwbiJie7qSMXkpELa/obwAUsFI7UcEpXHw397bmMh0cXtJVzBKXgCpYFyB3xYlT/Ye3bzZ7q264EflBZAKRmqHLmPyYJR/5IeXEqrt8SgvgFQwojoiY9feEpN5VCLo4maQF0AqGLVzTcM/50UpEdpVj+sUxwNSAao7dJk6erHrhN65umYhL4BUMGoRUTJ56TsBw/UoM0peAKlg1CrrRamgLnEu6VLW9IBUgLj7Ouz/DJePHr16RF4AqWA096yDc92lCXs3hjzDyJIXQCoYB+/Q9Q4vDS9cBPOojnhAKsDRO3R+nl3dp94uhrKmB6QCHL1Dlznp1GsWbUdeAKlgvOPGUK8juqt5mymx5QWQCsbBiCglS5+9KCEvgFQwDt6hO3djdHtfV14AqWAcvEO36B1M6mVNvQpFXgCpYNzs0H0h8gJIBUMqgFQwpALAH/JvmLtnlWjnAAAAAElFTkSuQmCC"}; window.statsUrl = 'httpsu003A//stats.justpaste.it'; window.viewKey = 'x6ER'; window.barOptions = 
{"isLoggedIn":false,"hasPublicProfile":false,"displayOwnership":false,"isArticleOwner":false,"isPasswordProtected":false,"isCaptchaRequired":null,"isCaptchaEntered":false,"captchaSettings":null,"premiumUserData":null,"isPrivate":false,"isExpired":false,"expireAfterRead":false,"isShared":false,"defaultAvatar":"/static/img/avatar60.jpg","createdText":"6h","showLastEdit":false,"modifiedText":"6h","isInTrash":false,"viewsText":"2","favouritesCount":0,"onlineText":"1","getFavouriteArticleUrl":"https://justpaste.it/api/account/v1/favourite-article/42017684","addFavouriteArticleUrl":"https://justpaste.it/api/account/v1/favourite-article","removeFavouriteArticleUrl":"https://justpaste.it/api/account/v1/favourite-article-delete/42017684","apiShowArticleDynamicUrl":"/api/v1/article-dynamic","voteUrl":"/api/account/v1/vote","contentLang":"en","positiveVotes":0,"negativeVotes":0,"currentVote":"empty","linkSharingUrl":null,"linkSharingSecret":null}; </script> <script src="/build/runtime.a1e5a72a.js" async></script> <script src="/build/1676.2c557867.js" async></script> <script src="/build/8452.a9a1e0c5.js" async></script> <script src="/build/5936.ad26e56d.js" async></script> <script src="/build/9412.4a605741.js" async></script> <script src="/build/showarticlewidget.3bbca334.js" async></script> </head><body marginwidth="0" dir="ltr" marginheight="0"><!-- Static navbar --><div class="navbar navbar-default navbar-static-top mainTableTopMiddle" role="navigation"> <div class="container"> <div class="navbar-header pull-left"> <a href="/"><img src="/static/img/jp_logo_1_en_v4.png" width="186px" height="54px" alt="JustPaste.it" /></a> </div> <div class="navbar-header pull-left"> <div class="nav navbar-nav mainTableTopMiddleRight hidden-xs hidden-sm"> <img src="/static/img/jp_logo_2_en_v5.png" width="390px" height="54px" /> </div> </div> <div class="navbar-header pull-right" style="padding-top:8px"> <div id="mainPanelButtons"></div> </div> </div><!--/.nav-collapse --></div><div 
id="headContainer" class="container" style="max-width: 960px"> <div class="row"> <div class="col-md-12"> <div id="mainTableContent"> <div style="max-width: 960px; vertical-align: top"> <div id="showArticleWidget"><div class="showArticleWidgetPlaceholder"></div></div> <div id="articleContent"> <p>happy</p> <p>nice nice</p> <p>good good good</p> <p>joy Joy joy Joy joy</p> <p>Love love Love love Love</p> </div> <div id="showArticleBottomWidget"><div class="articleBottomWidgetPlaceholder"></div></div> <span style="visibility:hidden" class="glyphicon glyphicon-link"></span></div> </div> </div> </div> <!-- /row --></div> <!-- /container --><div id="footer" style="min-height: 30px;"> <div class="container" style="vertical-align: middle"> <div class="col-md-3 col-xs-5 col-sm-4 text-muted" style="font-size: 95%;" align="left"> © 2021 <span class="hidden-xs">justpaste.it</span> </div> <div class="col-md-9 col-xs-7 col-sm-8 text-muted" align="right"> <ul class="list-inline basePageFooterList"> <li class="hidden-xs"> <a href="/login">Account</a> </li> <li class="hidden-xs"> <a href="/terms">Terms</a> </li> <li class="hidden-xs"> <a href="/privacypolicy">Privacy</a> </li> <li class="hidden-xs"> <a href="/cookies">Cookies</a> </li> <li> <a href="/u/justpasteit">Blog</a> </li> <li> <a href="/about">About</a> </li> </ul> </div> </div></div> <script> window.mainPanelOptions = { addArticleUrl: '/', loginUrl: '/login', logoutUrl: '/logout', favouriteArticlesUrl: '/account/favourite', subscribedArticlesUrl: '/account/subscribed', sharedArticlesUrl: '/account/shared', manageAccountUrl: '/account/manage', messagesUrl: '/account/messages', articlesStatsUrl: '/account/articles-stats', premiumUrl: '/premium/subscription', unreadMessagesUrl: 'https://msg.justpaste.it/api/v1/conversation/unread', profileSettings: '/account/settings', isLoggedIn: false, userEmail: null, userPermalink: null, userProfileIsPublic: false, userProfileLink: null }; </script> <script 
src="/build/mainpanelwidget.80530742.js" async></script> </body></html>
The word happy was used 0 times.
The word nice was used 0 times.
The word good was used 1 times.
The word joy was used 3 times.
The word love was used 3 times.
How do I properly split the text or count the number of occurrences? Thank you!
Answer
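This will help you remove special characters. It keeps only letters, so, for example, <>Hello<> becomes Hello. Note that it strips only the non-letter characters: letters inside a tag name stay, so <p>happy</p> would become phappyp, and applying it to the whole string would also remove the spaces between words.
String alphaOnly = input.replaceAll("[^a-zA-Z]+", "");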
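Because letters inside tag names survive that replacement, one option is to remove the markup first and only then split on non-letters. The following is a minimal sketch under stated assumptions, not a full HTML parser: WordCounter, stripHtml and countWords are hypothetical names, the input is a shortened copy of the articleContent part of the page from the question, and the tag regex assumes that no attribute value contains a > character. A real HTML parser library would likely be more robust for arbitrary pages.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class WordCounter {

    // Removes script/style blocks, comments and tags, leaving the visible text.
    // Assumption: no attribute value contains a '>' character.
    static String stripHtml(String html) {
        return html
                .replaceAll("(?is)<(script|style)\\b.*?</\\1>", " ") // script/style with content
                .replaceAll("(?s)<!--.*?-->", " ")                   // comments
                .replaceAll("<[^>]*>", " ");                         // remaining tags
    }

    // Counts case-insensitive occurrences of each given word in the visible text.
    static Map<String, Integer> countWords(String html, String[] words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words) {
            counts.put(w.toLowerCase(Locale.ROOT), 0);
        }
        // Split on runs of non-letters so punctuation and line breaks do not matter
        for (String token : stripHtml(html).split("[^a-zA-Z]+")) {
            counts.computeIfPresent(token.toLowerCase(Locale.ROOT), (k, v) -> v + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Shortened sample taken from the articleContent div in the question
        String html = "<div id=\"articleContent\"> <p>happy</p> <p>nice nice</p>"
                + " <p>good good good</p> <p>joy Joy joy Joy joy</p>"
                + " <p>Love love Love love Love</p> </div>";
        String[] positive = {"happy", "nice", "good", "joy", "love"};

        for (Map.Entry<String, Integer> e : countWords(html, positive).entrySet()) {
            System.out.println("The word " + e.getKey() + " was used " + e.getValue() + " times.");
        }
    }
}
```

For the sample above this reports 1, 2, 3, 5 and 5 occurrences for happy, nice, good, joy and love. In the original method, the file contents read into str could be passed to countWords in place of the str.split(" ") loop.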