How to parse a sitemap index that has compressed links

I've made a program that reads the /robots.txt and the /sitemap.xml of a site, extracts the available sitemaps, and stores them in the siteMapsUnsorted list. From there I use the crawler-commons library to determine whether the links are SiteMaps or SiteMapIndexes (clusters of SiteMaps).
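
A minimal sketch of that collection step, assuming the sitemap URLs come from the Sitemap: lines of robots.txt (the helper name and site URL are illustrative, not the actual program):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper: collect the "Sitemap:" entries of a site's robots.txt
static List<String> collectSitemapUrls(String site) throws IOException {
    List<String> siteMapsUnsorted = new ArrayList<>();
    URL robotsUrl = new URL(site + "/robots.txt");
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(robotsUrl.openStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // robots.txt lists sitemaps as "Sitemap: <url>" (case-insensitive)
            if (line.toLowerCase().startsWith("sitemap:")) {
                siteMapsUnsorted.add(line.substring("sitemap:".length()).trim());
            }
        }
    }
    return siteMapsUnsorted;
}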

When I use it on a normal SiteMapIndex it works; the problem occurs in some cases where bigger sites serve their list of SiteMapIndexes in a compressed format, e.g.:

https://www.tripadvisor.es/sitemap/2/es/sitemap-1662847-es-articles-1644753222.xml.gz

The code I’m using:

SiteMapParser sitemapParser = new SiteMapParser();

for (String sitemapURLStr : siteMapsUnsorted) {
    AbstractSiteMap siteMapCandidate = sitemapParser.parseSiteMap(new URL(sitemapURLStr));
    //AbstractSiteMap siteMapCandidate = sitemapParser.parseSiteMap("xml", content, new URL(sitemapURLStr));

    // Check if the elements inside the list are SiteMapIndexes or SiteMaps; if they are SiteMapINDEXES, we need to break them down into SiteMaps
    if (siteMapCandidate instanceof SiteMapIndex){
        SiteMapIndex siteMapIndex = (SiteMapIndex) siteMapCandidate;

        for (AbstractSiteMap aSiteMap : siteMapIndex.getSitemaps()){
            if (aSiteMap instanceof  SiteMap){
                String siteMapString = aSiteMap.getUrl().toString();
                System.out.println(siteMapString);
                siteMaps.add(siteMapString);
            } else{
                LOG.warn("ignoring site map index inside site map index: " + aSiteMap.getUrl());
            }
        }
    }
    // If the elements inside the list are individual SiteMaps we add them to the SiteMaps list
    else {
        siteMaps.add(siteMapCandidate.getUrl().toString());
    }
}

I've noticed that the parseSiteMap method behaves differently depending on the parameters you pass to it, but after trying multiple times I couldn't find a way to handle the compressed elements.

My last resort would be to write a method that downloads every .gz file, decompresses it, reads the decompressed list of links, stores them, and finally deletes the directory; but that would be extremely slow and inefficient, so first I came here to see if anyone has a better idea or could help me with parseSiteMap().

Thanks in advance to anyone who can help.

Answer

The reason this is failing is that Tripadvisor doesn’t set the correct mime type on its sitemaps:

$ curl --head https://www.tripadvisor.es/sitemap/2/es/sitemap-1662847-es-articles-1644753222.xml.gz
...
content-type: text/plain; charset=utf-8

and the library you are using only applies gzip decoding when the content type is one of:

private static String[] GZIP_MIMETYPES = new String[] { 
  "application/gzip",
  "application/gzip-compressed",
  "application/gzipped",
  "application/x-gzip",
  "application/x-gzip-compressed",
  "application/x-gunzip",
  "gzip/document"
};

You could probably work around this by implementing better detection of gzip and XML (e.g. when the URL ends in .xml.gz) and calling the processGzippedXML method directly after downloading the sitemap to a byte[].
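
A rough sketch of that workaround, assuming processGzippedXML(URL, byte[]) is accessible to a subclass (it is declared protected in recent crawler-commons releases, and its exact signature may differ in the version you use):

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMapParser;
import crawlercommons.sitemaps.UnknownFormatException;

// Sketch: pick the parsing path based on the URL instead of the Content-Type header
public class GzipAwareSiteMapParser extends SiteMapParser {

    public AbstractSiteMap parse(URL sitemapUrl) throws IOException, UnknownFormatException {
        if (sitemapUrl.getPath().endsWith(".gz")) {
            // Download the compressed sitemap into a byte[] ourselves...
            byte[] content;
            try (InputStream in = sitemapUrl.openStream()) {
                content = in.readAllBytes(); // Java 9+
            }
            // ...and hand it straight to the gzip branch, bypassing the
            // GZIP_MIMETYPES content-type check (parameter order assumed: URL, byte[])
            return processGzippedXML(sitemapUrl, content);
        }
        // Uncompressed sitemaps can keep going through the normal path
        return parseSiteMap(sitemapUrl);
    }
}

In the loop from the question you would then call something like new GzipAwareSiteMapParser().parse(new URL(sitemapURLStr)) instead of sitemapParser.parseSiteMap(new URL(sitemapURLStr)).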
