DigitalPebble's Blog: plugins

Thursday 26 October 2023

Focus on protocol improvements in StormCrawler 2.10

StormCrawler 2.10 was released yesterday and, as usual, it contains loads of improvements, dependency upgrades and bug fixes. Instead of going through each one of them, we will focus specifically on what was done for protocols.

First, every protocol implementation can now easily be tested on the command line, even FileProtocol or DelegatorProtocol thanks to #1097. For instance,

storm local target/xxx-1.0-SNAPSHOT.jar com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol -f crawler-conf.yaml https://storm.apache.org/ -b

which configures the RemoteDriverProtocol with the content of crawler-conf.yaml and display info on the console.

You might have noticed that the option to specify a configuration file has changed from -c to -f as the former conflicted with a Storm operator. We also added an option

-b which dumps the content of the URL to a file in the temp folder, making it very easy to check what the protocol actually retrieved for a given configuration.

Using the command above in combination with debugging is particularly powerful. This can easily be done with

export STORM_JAR_JVM_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=localhost:8000"

One of the main changes is about the Selenium module. It had been a while since it had any work done to it and its configuration was pretty obsolete. With #1093, we added some much needed unit tests and removed some incompatible configuration. #1100 added an option to deactivate tracing and we fixed the user agent substitution (#1109).

An important and incompatible change in the Selenium module is about the way the timeouts are configured. The previous mechanism was opaque and error prone. This has been replaced in #1101, the timeouts are now configured with a map

selenium.timeouts:
script: -1
pageLoad: -1
implicit: -1

with -1 preserving the Selenium default values.

The DelegatorProtocol has also been greatly improved. If you are not familiar with it, it allows you to determine which protocol implementation should be used for a URL given the metadata it has. For instance,

  # use the normal protocol for sitemaps
  protocol.delegator.config:
   - className: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
     filters:
       isSitemap: "true"
   - className: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"

will use the OKHTTP protocol for a URL if is has a key isSitemap in its metadata with a value of true. Otherwise it will use the Selenium implementation.

With #1098, we added an operator indicating whether the conditions should be treated as an AND or OR. We also added the possibility to triage based on regular expressions on the URL itself (#1110).

You can now express more complex configurations such as

# use the normal protocol for sitemaps, robots and if asked explicitly

protocol.delegator.config:

- className: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"

operator: OR

filters:

isSitemap: "true"

robots.txt:

skipSelenium:

regex:

- \.pdf

- \.doc

- className: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"

As a result, we removed the deprecated class DelegatorRemoteDriverProtocol.

The DelegatorProtocol is of course particularly useful for avoiding sending URLs to the Selenium implementation unnecessarily, as illustrated above.

StormCrawler 2.10 contains of course other changes and dependency updates and, as usual, we recommend that you switch to it. As we have seen today, the improvements we added to make protocol implementations easier to test and configure should be a reason to upgrade.

We would like to thank all the users and contributors to the 2.10 release.

Happy crawling!

Friday 8 March 2013

Free your Nutch crawls with pluggable indexers

I have just committed what should be a very important new feature of the next 1.x release of Apache Nutch, namely the possibility to implement indexing backends via plugins. This is currently on the trunk only but should hopefully be ported to 2.x at some point. The Nutch-1047 JIRA issue contains a history of patches and discussions for this feature.

As you'll see by reading the explanations below, this is not the same thing as the indexing filters or the storage backends in Nutch 2.x.

Historically, Nutch used to manage its own Lucene indices itself and provide a web interface for querying them. Support for SOLR was added much later in the 1.0 release (NUTCH-442) and users had two separate commands for indexing directly with Lucene or sending the documents to SOLR, in which case the search could be done outside the Nutch search servers and directly with SOLR. We then decided to drop the Nutch search servers and the Lucene-based indexing altogether in Nutch 1.3 (NUTCH-837) and let the SOLR indexer become the only option. This was an excellent move as it greatly reduced the amount of code we had to look after and meant that we could focus on the crawling while benefiting from the advances in SOLR.

One of the nice things about Nutch is that most of its components are based on plugins. The actual plugin mechanism was borrowed from Eclipse and allows to have endpoints and extensions. Nutch has extension points for URLFilters, URLNormalizers, Parsers, Protocols, etc... The full list of Nutch extensions can be found here. Basically pretty much everything in Nutch is done via plugins and I found that most customisations of Nutch I do for my clients are usually implemented via plugins only.

As you've guessed, NUTCH-1047 is about having generic commands for indexing and handling the backend implementations via plugins. Instead of piggybacking the SOLR indexer code to send the documents to a different backend, one can now use the brand new generic IndexingJob and isolate the logic of how the documents are sent to the backend via an extension of the new IndexWriter endpoint in a custom plugin.

The IndexWriter interface is pretty straightforward :

public String describe();

public void open(JobConf job, String name) throws IOException;

public void write(NutchDocument doc) throws IOException;

public void delete(String key) throws IOException;

public void update(NutchDocument doc) throws IOException;

public void commit() throws IOException;

public void close() throws IOException;

Having this mechanism allows us to move most of the SOLR-specific code to the new indexer-solr plugin (and hopefully all of it as soon as we have a generic de-duplicator which could use the IndexWriter plugins) but more importantly will facilitate the implementation of popular indexing backends such as ElasticSearch or Amazon's CloudSearch service without making the core code of Nutch more complex. We frequently get people on the mailing list asking how to store the Nutch documents on such or such database and being able to do that in a plugin will definitely make it easier. It will also be a good way of storing Nutch documents as files, etc...

This is quite a big change to the architecture of Nutch but we tried to make it as transparent as possible for end users. The only indexer plugin currently available is a port of the existing code for SOLR and is activated by default. We left the old solr* commands and modified them so that they use the generic commands with the indexing plugins in the background so from a user point of view there should be no difference at all.

There is already a JIRA for a text-based CSV indexing plugin and I expect that the ElasticSearch one will get rapid adoption.

I had been willing to find the time to work on this for quite some time and I'm very pleased it is now committed, thanks to the comments and reviews I got from my fellow Nutch developers. I look forward to getting more feedback and seeing it being used, extended, improved, etc...