
Tag: web-crawler

How to parse a sitemap index that has compressed links

I’ve made a program that reads the /robots.txt and the /sitemap.xml of a site, extracts the available sitemaps, and stores them in the siteMapsUnsorted list. From there I use the crawler-commons library to determine whether each link is a SiteMap or a SiteMapIndex (a collection of SiteMaps). When I use it on a normal SiteMapIndex it works; the problem occurs in some cases
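To make the steps described above concrete, here is a minimal sketch that uses only the Java standard library, not crawler-commons. It assumes the failing cases are the compressed (`.gz`) links mentioned in the title, which the excerpt does not confirm. The class `SitemapSniffer` and all of its method names are hypothetical, and the root-element check is a deliberately naive string match rather than a real XML parse.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; none of these names come from crawler-commons.
public class SitemapSniffer {

    public enum Kind { INDEX, SITEMAP, UNKNOWN }

    /** Collects the URLs from the "Sitemap:" lines of a robots.txt body. */
    public static List<String> sitemapUrlsFromRobots(String robotsTxt) {
        List<String> urls = new ArrayList<>();
        for (String line : robotsTxt.split("\\R")) {
            String trimmed = line.trim();
            // Match the "Sitemap:" field name case-insensitively.
            if (trimmed.regionMatches(true, 0, "sitemap:", 0, 8)) {
                String url = trimmed.substring(8).trim();
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
        }
        return urls;
    }

    /** True if the payload starts with the gzip magic bytes 0x1f 0x8b. */
    public static boolean isGzip(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    /** Returns the payload unchanged, or decompressed if it is gzip data. */
    public static byte[] gunzipIfNeeded(byte[] data) {
        if (!isGzip(data)) {
            return data;
        }
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Naive classification by root element name (ignores namespace prefixes). */
    public static Kind classify(byte[] payload) {
        String xml = new String(gunzipIfNeeded(payload), StandardCharsets.UTF_8);
        if (xml.contains("<sitemapindex")) {
            return Kind.INDEX;
        }
        if (xml.contains("<urlset")) {
            return Kind.SITEMAP;
        }
        return Kind.UNKNOWN;
    }

    /** Demo-only helper for trying the sketch without network access. */
    public static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

Checking the magic bytes instead of the `.gz` file extension is a design choice: some servers send gzip data from URLs that do not end in `.gz`, and others decompress transparently, so the bytes actually received are the more reliable signal. Fetching the payload over HTTP is left out of the sketch.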
