Showing posts with label noSQL. Show all posts
Showing posts with label noSQL. Show all posts

Friday 9 September 2016

Index the web with StormCrawler (revisited)


I hope you all had an enjoyable summer. I can't believe it's not even a year since I published the (relatively popular) post on Index the web with AWS CloudSearch! At the time we had just released the version 0.6 of StormCrawler and the post explained how to use SC to crawl a website and index it with CloudSearch. The tutorial also covered the same operations with Apache Nutch and helped users understand how the two projects differ.

StormCrawler has evolved a lot in just one year! In won't go into much details as this is explained in the previous posts but we are now at version 1.0, have a proper website for the project (albeit in constant need of improvements), a logo, many new resources, including a Maven archetype. The latest of these new resources is a new module for the popular Redis data structure store.

Last year's post is now quite outdated as a result so we'll now revisit the same use case (crawling http://www.tescobank.com/) but this time using Redis to store the crawl frontier and URL information and bootstrapping the project with the Maven archetype. This time, we won't index its content with Cloudsearch (or anything else) to keep the configuration to a minimum.

If you are new to StormCrawler, please read  the Cloudsearch post for an introduction as well as the material on the website. Please bear in mind that although this short tutorial covers a single website processed with a single machine, StormCrawler is distributed by nature (thanks to Apache Storm) and can run on a cluster to deal with millions of pages.

Prerequisites

The instructions below are based on a Linux distribution. You will need to install the following software :
The Redis module is currently a PR with the code stored in a separate branch. This might be merged based on user feedback and be available in the next release but until then you can download the code for the redis branch or clone it with Git. Unzip the archive and from its root dir call 'mvn clean install', this will put all the necessary jars in your local repository.

The Storm command must be on your PATH, you don't need its servers to be running if you only want to run the crawls in local (non-distributed) mode.

Redis

Apache Storm as a framework is 'source agnostic' (if such a term exists), all it expects is that the Spout implementations provide the topologies with a steady stream of tuples. In the case of StormCrawler, these tuples are of the form <URL, Metadata>. Depending on the use case, they might come from a distributed queue (e.g. RabbitMQ), a database (MySQL) or a search system (Elasticsearch, SOLR).

The choice of tool to use depends on the following factors :
  • do you follow outlinks?
  • if so, is the crawl recursive i.e. can you get to the same URL via different pages?
  • do you need to revisit the pages?
StormCrawler provides a number of resources in its external plugins and so does Apache Storm itself.

Often the same data structure is used to both persist the information we have about the URLs and queue the URLs to be fetched (crawl frontier). With a key / (structured) value store like Redis we can use a slightly different strategy and separate the crawl frontier from the status of the URLs. For the frontier, we use keys with the prefix 'q_' followed by the host or domain name with a List of URLs to fetch as value. The Spout iterates on the queue entries and removes the head of the queue to send them as Tuples in the topology. This has the advantage of guaranteeing a perfect diversity of URLs in the topology and hence optimal performance. We also the information about the URLs with the prefix 's_' followed by the URL, the value associated with such keys is a String containing the status and metadata of the URLs. When discovering new URLs in recursive crawls, we can check whether the URL is already known, in which case we won't add it to the queues again.

One limitation of our Redis spout and updater is that we can't reschedule URLs for revisiting them but for many use cases, this is absolutely fine.

Let's get started. With the Redis server running, open a client session with redis-cli and type

FLUSHALL RPUSH q_www.tescobank.com http://www.tescobank.com/

We'll skip the creation of the corresponding s_http://www.tescobank.com/ entry out of pure laziness. It will get created once the URL is fetched and its status updated. What we just did is that we created a queue for the host tescobank.com with a list as value which contains a single URL to fetch.

Boostrap a project with an archetype

Instead of having to build everything from scratch we'll use our Maven archetype to bootstrap our crawl project. From anywhere you want on your filesystem do :

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.1-SNAPSHOT -DgroupId=net.stormcrawler -DartifactId=redis-crawler -Dversion=1.0

and press enter to confirm. Change the directory to redis-crawler, you should see a basic set of config and resource files.

├── crawler-conf.yaml ├── crawler.flux ├── pom.xml ├── README.md └── src └── main ├── java │   └── net │   └── stormcrawler │   └── CrawlTopology.java └── resources ├── default-regex-filters.txt ├── default-regex-normalizers.xml ├── parsefilters.json └── urlfilters.json

This will be the starting point for our modifications.

Customisation of resources

Edit the pom.xml file and add the redis module to the list of dependencies

<dependency> <groupId>com.digitalpebble.stormcrawler</groupId> <artifactId>storm-crawler-redis</artifactId> <version>1.1-SNAPSHOT</version> </dependency>

Next, we'll edit crawler-conf.yaml and add the following configuration parameters :

http.content.limit: -1
fetcher.server.delay: 2.0
redis.status.max.urls.per.bucket: 5

Let's now edit urlfilters.json by setting ignoreOutsideHost to true and adding

{ "class": "com.digitalpebble.stormcrawler.filtering.robots.RobotsFilter", "name": "RobotsFilter", "params": { } }

to the list of filters (don't forget to add a comma before this section!).

Next add https to the second line of regex-filter.txt, and finally set the content of regex-normalizers.xml to

<?xml version="1.0"?>
<regex-normalize>
<!-- removes parameters from URL -->
<regex>
  <pattern>\?.+</pattern>
  <substitution></substitution>
</regex>
</regex-normalize>

Everything should now be similar to the configuration of last year's tutorial.

Customisation of the topology class

The example topology has a simple memory based Spout which reads preset URLs and a dummy StatusUpdater. What we need to do instead is to use the Redis-based equivalents.

Replace the spout declaration in CrawlTopology.java (line 49) with

builder.setSpout("spout", new com.digitalpebble.stormcrawler.redis.RedisSpout());

and StdOutStatusUpdater() (line 67) with com.digitalpebble.stormcrawler.redis.StatusUpdaterBolt().

For good measure let's get rid of StdOutIndexer() and use com.digitalpebble.stormcrawler.indexing.DummyIndexer() instead.

Run the crawl

First, let's build an uber-jar with 

mvn clean package

Then as mentioned in the README.md we can start the crawl with :

storm jar target/redis-crawler-1.0.jar net.stormcrawler.CrawlTopology -conf crawler-conf.yaml -local

or without '-local' if Storm is running as a (pseudo?) distributed cluster and we want the benefits of the UI, proper logging etc...

You should see the usual log entries and metrics info scrolling on the console. For a better understanding of what's going on, open a console and type

redis-cli KEYS s_* | sort

to see all the URLs discovered during the crawl, regardless of whether they were fetched or not.

To see the entire content of the fetch queue, do :

redis-cli LRANGE q_www.tescobank.com 0 -1

whereas

redis-cli LLEN q_www.tescobank.com

returns its size only. When it returns 0, your crawl is finished and you can kill the process with CTRL-C or do it with STORM KILL if you are running in distributed mode.

Conclusion

Looking back at last year's post, I realised that StormCrawler has evolved a lot since and the previous instructions were not quite up to date. The archetype, for instance, is a good way of getting started with StormCrawler and provides a solid starting point. There are also a lot more useful resources that users can leverage, including our brand new Redis components.

The approach we used for Redis where 2 different sets of keys are used for the crawl frontier and the URLs status could be reused for other key value stores like HBase which do not necessarily have secondary indices that we can use for sharding the URLs per host and guarantee a good diversity of URLs in the topology.

Please do get involved and help reviewing the PR for Redis. If you have any questions or problems, see http://stormcrawler.net/support/.

Happy crawling!

PS: Exercise for the reader

You probably noticed the 'crawler.flux' file in the directory generated by the artefact. Flux is a recent addition to Apache Storm : instead of defining a topology via a Java class as we did above, Flux allows you to do define the components of the topology and their interactions via a yaml file in a language-neutral way. Better, it means that you don't need to recompile the code if you change something in your topology (you'll still need to turn it off and restart it though).

The crawler.flux file corresponds to the default topology which we modified above to use Redis. As stated in the README, you can start a topology in the following way :

storm jar target/redis-crawler-1.0.jar org.apache.storm.flux.Flux --local crawler.flux

As an exercise, why don't you have a look at the Flux file and modify it so that it runs the Redis-based topology?

PPS: Exercise for the reader #2

Crawling is very nice but unless you store or index the documents you crawl it remains a pretty pointless exercise. Why don't you modify the crawl above so that it sends the data to Elasticsearch, SOLR or Cloudsearch? If you are feeling adventurous you could also try the WARC module and generate some great web archives to play with.




Monday 9 July 2012

Nutch 2.0 is out (at last!)

Like pretty much any 2.0 release, Nutch 2.0 marks a radical change from the 1.x branch. I've mentioned 2.0 in previous posts but let's do a bit of history first. Nutch was initially started by Doug Cutting (Lucene's creator) and Mike Caffarella around 2002, then came the MapReduce paper from Google and in 2005 MapReduce was implemented as part of Nutch then became a sub project of Lucene at Apache. You know what happened to Hadoop after that : open source super-stardom, millions of dollars in investment, fierce competition between commercial distributions but also a myriad of related projects (HBase, ZooKeeper, Pig, Hive, Mahout etc...) with in the background the emergence of new concepts such as Big Data or NoSQL.

Meanwhile Nutch tagged along following the various releases of Hadoop but was based on the same architecture. It simply started relying on other projects more and more instead of implementing its own stuff, mainly Apache Tika (another offspring of Nutch) for parsing and extracting metadata from various document formats and Apache SOLR for indexing and searching documents. This made the code much lighter, easier to maintain and also up to date with all sorts of functionalities provided by these projects. However the way we stored and access data in Nutch remained the same since the beginning of Hadoop i.e. SequenceFiles and MapFiles.

Nutch 2.0 (a.k.a NutchGora) started in earnest 2 years ago when one of our clients decided to  invest in the development of a NoSQL-based version of Nutch. There had been a preliminary version called NutchBase developed by Dogacan Guney which was used as a basis except that instead of relying exclusively on HBase, we decided to implement our own Backend-Neutral-MapReduce-friendly-ORM which is now an Apache Top Level Project known as Apache GORA and serializes data with Apache AVRO. GORA provides us with a unified  access to various backends, NoSQL or not, an object-to-datastore mapping mechanism and utilities for MapReduce. This means that Nutch 2.0 can run on HBase, Cassandra, Accumulo or MySQL with just a few configuration files to modify.

One major change in 2.0 is that, instead of having a separation between the status of the URLs (crawlDB) separated from the data for these URLs (content and text in segments) and the webgraph (linkDB), we have a single table-like representation of the data where each entry contains everything we know about a URL, even the links that point to it or the various versions of its content (depending on the backend used). Not having separate segments is definitely good news. One of the side effects is that a fetch or parse step can be resumed. 

From a technical point of view this means that Nutch is not limited to the sequential processing of Hadoop data structures  but can operate at a more atomic level (GET, PUT). Most Nutch tasks are still MapReduce operations though but at least we can get the backends to filter the data and provide only what is needed for a specific task to the MapReduce operations.

The best example of this that I can think of is the update step in a Nutch crawl. Basically what this step does is to merge the information from a round of fetching with the rest of the CrawlDB, typically to change the status of the URLs we have fetched and add the new URLs we have discovered when parsing. With the 1.x branch this is done with a MapReduce operation which takes both the CrawlDB  and the segment as input, reduces on the URLs and updates the status of the CrawlDatum objects in the reduce step. All good. Except that as the crawlDB gets larger and larger, the time taken by the update step gets longer and longer up to a point where it ends up being the slowest part of the crawl. Think about a billion entries in the crawlDB and a single URL to update and you'll get the picture.

There are ways of alleviating this for 1.x (i.e. generate multiple segments in one go and update them with the crawldb at the same time) but the point is that with Nutch 2.0 the equivalent operation would be linear with the number of URLs modified, not the whole crawl dataset.

The change of paradigm between sequential datastructures to a table-like representation is a major change for Nutch which will certainly have many positive side-effects. Being the first release of 2.0, we can expect quite a few fixes to be needed and a massive overhaul of the documentation in the next months but the move seems to be positively welcomed by the Nutch community. Of course 1.x will continue to be the trunk for as long as necessary, i.e. until 2.0 is stable and has all the functionalities that  1.x has.

BTW my slides about 2.0 from last year's Berlin Buzzword are now here.

It is also a symbolic move, with Nutch being at the origin of many successful projects, it was about time it caught up with its famous offspring and the concepts which arose from it.