StormCrawler: The URL Database Specifications




I am quite new to StormCrawler. While exploring the documentation, the READMEs, and other resources, I have noticed frequent references to a “URL database” that is supposed to store information about the URLs encountered during a crawl (for example here).

I have, however, not found anywhere what type of database this is, nor how to customize it or replace it with custom modules. Following the code, I got as far as IOOutputController, which has some rather confusing methods; with no docstrings, it is hard to even determine which class is responsible for this.

I would be very grateful for any guidance!

Thank you for your time, Matyáš

Answer

The most commonly used storage for URLs in StormCrawler is Elasticsearch, as illustrated in the tutorials. Other backends are available as well, such as SQL or SOLR; StormCrawler is not tied to a specific database.
In most cases, people simply use an existing backend implementation such as the Elasticsearch one.
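To make this concrete, the URL status storage is wired into the topology via the Flux YAML file: the spout reads URLs from the backend and the status updater bolt writes discovered/fetched URLs back. Below is a minimal sketch based on the Elasticsearch module's example topology; the exact class names (`AggregationSpout`, `StatusUpdaterBolt`) follow the `com.digitalpebble.stormcrawler` 1.x package layout and may differ in your version, so check the `es-crawler.flux` shipped with your release.

```yaml
# Sketch of the status-storage portion of a StormCrawler Flux topology,
# assuming the Elasticsearch module (package names from stormcrawler 1.x).
spouts:
  - id: "spout"
    # Reads URLs due for (re)fetching from the "status" index
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 1

bolts:
  - id: "status"
    # Writes new and updated URL statuses back to the "status" index
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1
```

Swapping the backend (e.g. to the SQL module) amounts to replacing these two class names with the corresponding spout and status-updater implementations, and writing a custom backend means providing your own spout plus a bolt extending the abstract status updater.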



Source: stackoverflow