
Monday 13 May 2019

What's new in StormCrawler 1.14

StormCrawler 1.14 was released yesterday and as usual, contains loads of improvements and bugfixes.


This release contains a number of breaking changes, mostly related to the move to Elasticsearch 7. We recommend that all users upgrade to this version as it contains very important fixes and performance improvements.

Dependency upgrades

  • crawler-commons 1.0 (#693)
  • okhttp 3.14.0 (#692)
  • guava 27.1 (#702)
  • icu4j 64.1 (#702)
  • httpclient 4.5.8 (#702)
  • Snakeyaml 1.24 (#702)
  • wiremock 2.22.0 (#702)
  • rometools 1.12.0 (#702)
  • Elasticsearch 7.0.0 (#708)

Core



  • Track how long a spout has been without any URLs in its buffer (#685)
  • Change ack mechanism for StatusUpdaterBolts (#689)
  • Robots URL filter to get instructions from cache only (#700)
  • Allow indexing under canonical URL if in the same domain, not just host (#703)
  • /bugfix/ URLs ending with a space are fetched over and over again (#704)
  • ParseFilter to normalise the mime-type of documents into simple values (#707)
  • Robot rules should check the cache in case of a redirection (#709)
  • /bugfix/ Fix the logic around sitemap = false (#710)
  • Reduce logging of exceptions in FetcherBolt (#719)

Elasticsearch



  • Asynchronous spouts (i.e. ES) can send queries after a max delay since the previous one ended (#683)
  • StatusUpdaterBolt to load config from non-default param names (#687)
  • Add a ScrollSpout to read all the documents from a shard (#688 and #690) - see in our guest post how this can be used to reindex a status index.
  • ES IndexerBolt: check success of batches before acking tuples (#647)
  • /bugfix/ URLs with content that breaks ES get refetched over and over again (#705)
  • /bugfix/ URLs without valid host name (and routing) stay DISCOVERED forever (#706)
  • /bugfix/ ESSeedInjector: no URLs injected because URL filter does not subscribe to status stream (#715)
  • MetricsConsumer to include topology ID in metrics (#714)

WARC

  • Generate WARC request records (#509)
  • WARC format improvements (#691)

Tika


  • Set mimetype whitelist for Tika Parser (#712)

*********

I will be running a workshop on StormCrawler next month at the Web Archiving Conference in Zagreb and will give a presentation jointly with Sebastian Nagel of CommonCrawl. I will come with loads of presents generously given by our friends at Elastic.


As usual, thanks to all contributors and users.

Happy crawling!


Thursday 22 November 2018

What's new in StormCrawler 1.12

The previous release was only last month but I decided to ship this one now as it contains several bugfixes and improvements which many users would benefit from.


As you can see below, the main changes are around protocols and sitemaps. We have used Selenium and OKHTTP a lot recently to deal with dynamic websites and the changes below definitely help for these. There is also an important bugfix for JSOUP (#653) and various other improvements.

As usual, we advise users to upgrade to this version.

Dependency upgrades


  • JSOUP 1.11.3 (#663)
  • Elasticsearch 6.5.0  (#661)
  • Jackson and Wiremock dependencies (#640)

Core

  • Post JSON data with OKHTTP protocol via metadata (#641)
  • Selenium RemoteDriverProtocol triggered by K/V in metadata (#642)
  • SeleniumProtocol: NavigationFilters not reached in case of a redirection (#643)
  • Limit crawl to URLs found in sitemaps (#645)
  • spout.reset.fetchdate.after based on time when query was set to NOW (#648)
  • Avoid StackOverflowError when generating DocumentFragment from JSOUP (#653)
  • Redirected sitemaps don't have isSitemap=true (#660)
  • Staggered scheduling of sitemap URLs (#657)
  • Scheduling: round to the closest second, minute or hour (#654)
  • FetcherBolt: don't add discovered sitemaps if the robots rules do not allow them (#662)

WARC

  • WARC record format: trailing zero byte causes WARC parser to fail  (#652)

Elasticsearch

  • ES IndexerBolt: track the number of batches sent (#540)
  • Rename 'index' index into 'docs' (#649)
  • ES StatusMetricsBolt: generate metrics for the total number of docs (#651)

Coming next...



The release of Storm 2.0.0 has taken longer than expected, which is partly my fault as I reported a number of issues. These issues have now been fixed and hopefully, 2.0.0 will be out soon. As mentioned last month, there's a branch of StormCrawler which works on the Storm 2.x branch. Give it a try if you want to be on the cutting edge!

Finally, there will be a StormCrawler workshop in Vilnius next week. I am sure tickets are still available if you fancy a last minute trip to Lithuania.

As usual, thanks to all contributors and users. Happy crawling!

UPDATE

There were two bugs in release 1.12; these have been fixed in 1.12.1.

Tuesday 20 March 2018

What's new in StormCrawler 1.8

I have just released StormCrawler 1.8. As usual, here is a summary of the main changes:


Dependency updates

Core

  • Add option to send only N bytes of text to indexers #476
  • BasicURLNormalizer to optionally convert IDN host names to ASCII/Punycode #522
  • MemorySpout to generate tuples with DISCOVERED status #529
  • OKHttp configure type of proxy #530
  • http.content.limit inconsistent default to -1 #534
  • Track time spent in the FetcherBolt queues #535
  • Increase detect.charset.maxlength default value #537
  • FeedParserBolt: metadata added by parse filters not passed forward in topology #541
  • Use UTF-8 for input encoding of seeds (FileSpout) #542
  • Default URL filter: exclude localhost and private address spaces #543
  • URLStreamGrouping returns the taskIDs and not their index #547

WARC

  • Upgrade WARC module to 1.1.0 version of storm-hdfs, fixes #520

SOLR

  • Schema for status index needs date type for nextFetchDate #544
  • SOLR indexer: use field type text for content field #545

Elasticsearch

  • AggregationSpout fails with default value of es.status.bucket.field == _routing #521
  • Move to Elasticsearch REST API #539
We recommend all users to move to this version as it fixes several bugs (#541, #547) and adds some great new features. In particular, the move to the Elasticsearch REST API makes the module future-proof and easier to configure; #535 and #543 are also well worth having.

As usual, thanks to all contributors and users. Happy crawling!

Thursday 27 April 2017

Crawl dynamic content with Selenium and StormCrawler

Many websites rely on AJAX to provide smooth and reactive web applications and/or single-page websites. While this works fine for humans using modern browsers, it is often challenging for robots, as they can’t interpret the JavaScript and usually rely on low-level HTTP protocol implementations to get the binary content. Even Google announced only as recently as October 2015 that their crawlers can handle dynamic content, even though tests have shown that this is still far from perfect.

Support for dynamic content is something that many users have asked for in StormCrawler and I am pleased to announce that we have recently committed code for this. The next release of StormCrawler (1.5) will contain a Selenium WebDriver-based protocol implementation so let’s have a sneak preview of how to use it and what it can do for you.

Prerequisites

The instructions below are based on Linux commands. You will need to install Java 8 and Maven to compile StormCrawler as well as PhantomJS (2.1.1 or above), which we will connect to via WebDriver. You might want to install Apache Storm, even though this is not a strict requirement as we’ll see below.
Until StormCrawler 1.5 is released, you will need to get the master branch, either with Git or by downloading the code from https://github.com/DigitalPebble/storm-crawler. Once this is done, cd to storm-crawler and run `mvn clean install`. This should put the storm-crawler artefacts in your local Maven repository, ready to use for the next step. This won’t be needed once 1.5 is released, as you will be able to get the artefacts straight from Maven Central.

Simple example

Let’s first build a StormCrawler project using the Maven archetype:

mvn archetype:generate -B -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.5 -DgroupId=com.digitalpebble.crawl -DartifactId=selenium-tutorial -Dversion=1.0-SNAPSHOT -Dpackage=com.digitalpebble.crawl


This will give you a basic set of resources and configuration for StormCrawler.  Go to the selenium-tutorial directory and build the uber jar with `mvn clean package`. We are now ready to go with a simple example.

Edit the file crawler.flux and set https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing as value for the constructorArgs in the spout config as shown below:
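
A sketch of what the spout section ends up looking like, based on the archetype's default layout at the time; the exact class name and surrounding entries in your generated file may differ slightly:

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.MemorySpout"
    parallelism: 1
    constructorArgs:
      - ["https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing"]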



If you look at the source of that page, you’ll see that it consists mostly of JavaScript. Fine for our browsers, but how does StormCrawler fare on it? With Storm installed and accessible on the command line, let’s do


storm jar target/selenium-tutorial-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local crawler.flux --sleep 60000


This will start the topology defined in the Flux file and let it run for one minute.


Note: the command above assumes that you have installed Storm. Alternatively, you can run the code directly with Maven like so:
mvn clean compile exec:java -Dexec.mainClass=org.apache.storm.flux.Flux -Dexec.args="--local crawler.flux --sleep 60000"


The console will display a lot of logs about the components being initialised, but also the status of the URLs (e.g. FETCHED, DISCOVERED, etc.), the fields extracted from the fetched documents and various metrics. To remove the latter, you can comment out the topology.metrics.consumer.register section in crawler-conf.yaml.
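
For reference, that section of the generated crawler-conf.yaml looks something like the sketch below (the exact metrics consumer class registered by the archetype may differ):

  topology.metrics.consumer.register:
       - class: "org.apache.storm.metric.LoggingMetricsConsumer"
         parallelism.hint: 1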


Tip: if you are feeling adventurous, have a look at the other entries from the conf files e.g. remove domain=domain from indexer.md.mapping and see how that affects the output below.


Regardless of whether you ran the topology using Storm or Maven, you should see an output similar to this:


content
url https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing
domain dagbladet.no
description Bakte poteter blir like gode når de bakes i ovnen uten folie rundt.
title Dagbladet Mat


https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing FETCHED Thu Apr 27 14:46:59 BST 2017


The first 5 lines were generated by the StdOutIndexer and, as we can see, no text content was extracted at all, the title is a generic one and no other fields could be extracted. Further down, a single line was generated by the StdOutStatusUpdater, indicating that the URL was successfully fetched; however, no outlinks were discovered at all (we would otherwise have seen lines with a DISCOVERED status).


Selenium to the rescue


Time to put our brand new protocol implementation to use. Edit the file crawler-conf.yaml and add


 http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
 https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
 selenium.addresses: "http://localhost:9515"


This tells StormCrawler to use the custom protocol implementations and connect to a WebDriver server on port 9515.


Open a different console and run `phantomjs --webdriver 9515` then run the topology again and look at the output


content 2873 chars
url https://www.dagbladet.no/mat/oppskrift/bakt-potet-med-romme-og-blamuggostdressing
keywords mat,oppskrift,kokker,råvarer,ingredienser,bakt,potet,med,rømme-,og,blåmuggostdressing
domain dagbladet.no
description Bakte poteter blir like gode når de bakes i ovnen uten folie rundt.
title Bakt potet med rømme- og blåmuggostdressing - Oppskrift | Dagbladet Mat


This time we got some textual content, the correct title and were able to extract keywords. As you’ve certainly noticed, we got all sorts of outlinks, similar to what we can observe with a browser.


What happened under the bonnet is that PhantomJS gave us a fully interpreted HTML page, on which we ran our JSoup parser. The latter used the ParseFilters defined in src/main/resources/parsefilters.json to extract the metadata displayed by the indexer later on (i.e. title, description, domain, keywords, canonical).


Let’s now look at a slightly more complex scenario.


NavigationFilters


Websites often use JavaScript for interactions within a page and navigation through the content. If we look at https://rn12.ultipro.com/SOU1022/JobBoard/ListJobs.aspx for instance, we can see that the pagination for the result lists is done in JavaScript. Assuming that we want to extract all the jobs listed for that board, we would be able to get the links from the initial page with the simple HTTP protocol implementation, but not the links to the following result pages as they are handled with AJAX.

Luckily, we can add the navigation logic by implementing a class extending NavigationFilter. First, let’s create a new file JobBoardNavigationFilter.java in src/main/java/com/digitalpebble/crawl and fill it with the content below.



Tip: wget "https://s.apache.org/mOkz" -O src/main/java/com/digitalpebble/crawl/JobBoardNavigationFilter.java


The approach used here is to generate dummy HTML content with links to all the job pages, while iterating over the result pages. This class gets called by the Selenium-based protocol implementation.
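
If you'd rather see the general shape of such a class before fetching the real one with the wget command above, here is a rough sketch. It is only an illustration: the CSS selectors are hypothetical placeholders, a real implementation needs to wait for the AJAX calls to complete after clicking, and the exact NavigationFilter and ProtocolResponse signatures may vary between StormCrawler versions.

package com.digitalpebble.crawl;

import java.nio.charset.StandardCharsets;
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.remote.RemoteWebDriver;

import com.digitalpebble.stormcrawler.Metadata;
import com.digitalpebble.stormcrawler.protocol.ProtocolResponse;
import com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilter;

/** Sketch only: collects job links across the paginated results and returns
 *  them as a dummy HTML page so that the JSoup parser discovers them as outlinks. */
public class JobBoardNavigationFilter extends NavigationFilter {

    public ProtocolResponse filter(RemoteWebDriver driver, Metadata metadata) {
        StringBuilder dummyHtml = new StringBuilder("<html><body>");
        while (true) {
            // collect the links to the individual job pages on the current result page
            // ("a.job-link" is a placeholder selector, not the job board's real markup)
            for (WebElement link : driver.findElements(By.cssSelector("a.job-link"))) {
                dummyHtml.append("<a href=\"").append(link.getAttribute("href")).append("\">job</a>\n");
            }
            // move on to the next result page, if there is one
            List<WebElement> next = driver.findElements(By.cssSelector("a.next-page"));
            if (next.isEmpty())
                break;
            next.get(0).click();
            // a real implementation would wait here for the new results to be loaded
        }
        dummyHtml.append("</body></html>");
        byte[] content = dummyHtml.toString().getBytes(StandardCharsets.UTF_8);
        return new ProtocolResponse(content, 200, metadata);
    }
}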


Now, let’s create a new file navigationfilters.json in the directory resources and give it the following content
{
 "com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters": [
   {
     "class": "com.digitalpebble.crawl.JobBoardNavigationFilter",
     "name": "JobBoard"
   }
 ]
}


Finally, we specify the name of the file we just created in the config with

navigationfilters.config.file: navigationfilters.json

Don’t forget to recompile the code with `mvn clean package` before launching the crawl. This time we’ll just check that we get all the links to the job pages in one go.

storm jar target/selenium-tutorial-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local crawler.flux --sleep 60000 | grep DISCOVERED


Note: why not download chromedriver and use it instead of PhantomJS? By default, chromedriver does not run in headless mode so you could see the browser being driven by the navigation filter, including the stuff you usually don’t notice, like the robots.txt file being fetched.


Conclusion


The resources covered here are only a first step towards making StormCrawler handle dynamic content and there is much work to do to improve them; however, the brand new Selenium-based protocol should already be a useful starting point. I hope you'll give it a try. Happy crawling!



Friday 9 September 2016

Index the web with StormCrawler (revisited)


I hope you all had an enjoyable summer. I can't believe it's not even a year since I published the (relatively popular) post on Index the web with AWS CloudSearch! At the time we had just released version 0.6 of StormCrawler and the post explained how to use SC to crawl a website and index it with CloudSearch. The tutorial also covered the same operations with Apache Nutch and helped users understand how the two projects differ.

StormCrawler has evolved a lot in just one year! I won't go into much detail as this is explained in the previous posts, but we are now at version 1.0, have a proper website for the project (albeit in constant need of improvements), a logo and many new resources, including a Maven archetype. The latest of these new resources is a new module for the popular Redis data structure store.

Last year's post is now quite outdated as a result, so we'll revisit the same use case (crawling http://www.tescobank.com/), this time using Redis to store the crawl frontier and URL information and bootstrapping the project with the Maven archetype. We won't index the content with CloudSearch (or anything else) this time, to keep the configuration to a minimum.

If you are new to StormCrawler, please read the CloudSearch post for an introduction as well as the material on the website. Please bear in mind that although this short tutorial covers a single website processed with a single machine, StormCrawler is distributed by nature (thanks to Apache Storm) and can run on a cluster to deal with millions of pages.

Prerequisites

The instructions below are based on a Linux distribution. You will need to install the following software: Java, Maven, Apache Storm and Redis.

The Redis module is currently a PR with the code stored in a separate branch. This might be merged based on user feedback and be available in the next release, but until then you can download the code for the redis branch or clone it with Git. Unzip the archive and from its root dir call 'mvn clean install'; this will put all the necessary jars in your local repository.

The Storm command must be on your PATH, but you don't need its servers to be running if you only want to run the crawls in local (non-distributed) mode.

Redis

Apache Storm as a framework is 'source agnostic' (if such a term exists): all it expects is that the Spout implementations provide the topologies with a steady stream of tuples. In the case of StormCrawler, these tuples are of the form <URL, Metadata>. Depending on the use case, they might come from a distributed queue (e.g. RabbitMQ), a database (MySQL) or a search system (Elasticsearch, SOLR).
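
To make that contract concrete, here is a minimal spout sketch (a toy illustration, not one of the bundled spouts): anything that emits tuples carrying a URL and a Metadata object on the field names used by the StormCrawler bolts can feed a topology.

import java.util.Map;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

import com.digitalpebble.stormcrawler.Metadata;

/** Toy spout which serves a single hard-coded URL, just to show the tuple format. */
public class SingleUrlSpout extends BaseRichSpout {

    private SpoutOutputCollector collector;
    private boolean emitted = false;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    public void nextTuple() {
        if (!emitted) {
            // the Metadata object carries arbitrary key/values alongside the URL
            collector.emit(new Values("http://www.tescobank.com/", new Metadata()));
            emitted = true;
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("url", "metadata"));
    }
}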

The choice of tool to use depends on the following factors:
  • do you follow outlinks?
  • if so, is the crawl recursive i.e. can you get to the same URL via different pages?
  • do you need to revisit the pages?
StormCrawler provides a number of resources in its external plugins and so does Apache Storm itself.

Often the same data structure is used both to persist the information we have about the URLs and to queue the URLs to be fetched (the crawl frontier). With a key / (structured) value store like Redis we can use a slightly different strategy and separate the crawl frontier from the status of the URLs. For the frontier, we use keys with the prefix 'q_' followed by the host or domain name, with a list of URLs to fetch as value. The Spout iterates over the queues and removes the head of each one to send it as a Tuple into the topology. This has the advantage of guaranteeing a perfect diversity of URLs in the topology and hence optimal performance. We also store the information about the URLs under keys with the prefix 's_' followed by the URL; the value associated with such keys is a String containing the status and metadata of the URL. When discovering new URLs in recursive crawls, we can check whether the URL is already known, in which case we won't add it to the queues again.
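
To make the key layout concrete, here is a small sketch using the Jedis client. This is purely illustrative of the q_/s_ scheme described above (not how the actual storm-crawler-redis module is implemented), and the status value format is simplified.

import redis.clients.jedis.Jedis;

/** Illustration of the q_<host> / s_<url> key layout described above. */
public class RedisFrontierExample {

    public static void main(String[] args) {
        try (Jedis redis = new Jedis("localhost", 6379)) {
            String url = "http://www.tescobank.com/";
            String queue = "q_www.tescobank.com";

            // only enqueue the URL if it is not already known (recursive crawls)
            if (!redis.exists("s_" + url)) {
                redis.rpush(queue, url);
                redis.set("s_" + url, "DISCOVERED"); // simplified status value
            }

            // roughly what a spout iteration does: pop the head of each queue so that
            // every host contributes at most one URL at a time to the topology
            String next = redis.lpop(queue);
            System.out.println("Next URL to fetch: " + next);
        }
    }
}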

One limitation of our Redis spout and updater is that we can't reschedule URLs for revisiting, but for many use cases this is absolutely fine.

Let's get started. With the Redis server running, open a client session with redis-cli and type

FLUSHALL
RPUSH q_www.tescobank.com http://www.tescobank.com/

We'll skip the creation of the corresponding s_http://www.tescobank.com/ entry out of pure laziness. It will get created once the URL is fetched and its status updated. What we just did is create a queue for the host www.tescobank.com whose value is a list containing a single URL to fetch.

Bootstrap a project with an archetype

Instead of having to build everything from scratch we'll use our Maven archetype to bootstrap our crawl project. From anywhere you want on your filesystem do:

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.1-SNAPSHOT -DgroupId=net.stormcrawler -DartifactId=redis-crawler -Dversion=1.0

and press enter to confirm. Change directory to redis-crawler; you should see a basic set of config and resource files.

├── crawler-conf.yaml
├── crawler.flux
├── pom.xml
├── README.md
└── src
    └── main
        ├── java
        │   └── net
        │       └── stormcrawler
        │           └── CrawlTopology.java
        └── resources
            ├── default-regex-filters.txt
            ├── default-regex-normalizers.xml
            ├── parsefilters.json
            └── urlfilters.json

This will be the starting point for our modifications.

Customisation of resources

Edit the pom.xml file and add the redis module to the list of dependencies

<dependency>
  <groupId>com.digitalpebble.stormcrawler</groupId>
  <artifactId>storm-crawler-redis</artifactId>
  <version>1.1-SNAPSHOT</version>
</dependency>

Next, we'll edit crawler-conf.yaml and add the following configuration parameters:

http.content.limit: -1
fetcher.server.delay: 2.0
redis.status.max.urls.per.bucket: 5

Let's now edit urlfilters.json by setting ignoreOutsideHost to true and adding

{ "class": "com.digitalpebble.stormcrawler.filtering.robots.RobotsFilter", "name": "RobotsFilter", "params": { } }

to the list of filters (don't forget to add a comma before this section!).

Next add https to the second line of default-regex-filters.txt, and finally set the content of default-regex-normalizers.xml to

<?xml version="1.0"?>
<regex-normalize>
<!-- removes parameters from URL -->
<regex>
  <pattern>\?.+</pattern>
  <substitution></substitution>
</regex>
</regex-normalize>

Everything should now be similar to the configuration of last year's tutorial.

Customisation of the topology class

The example topology has a simple memory-based Spout which reads preset URLs, and a dummy StatusUpdater. What we need to do instead is use the Redis-based equivalents.

Replace the spout declaration in CrawlTopology.java (line 49) with

builder.setSpout("spout", new com.digitalpebble.stormcrawler.redis.RedisSpout());

and StdOutStatusUpdater() (line 67) with com.digitalpebble.stormcrawler.redis.StatusUpdaterBolt().

For good measure let's get rid of StdOutIndexer() and use com.digitalpebble.stormcrawler.indexing.DummyIndexer() instead.

Run the crawl

First, let's build an uber-jar with 

mvn clean package

Then, as mentioned in the README.md, we can start the crawl with:

storm jar target/redis-crawler-1.0.jar net.stormcrawler.CrawlTopology -conf crawler-conf.yaml -local

or without '-local' if Storm is running as a (pseudo-)distributed cluster and we want the benefits of the UI, proper logging, etc.

You should see the usual log entries and metrics info scrolling on the console. For a better understanding of what's going on, open a console and type

redis-cli KEYS s_* | sort

to see all the URLs discovered during the crawl, regardless of whether they were fetched or not.

To see the entire content of the fetch queue, do:

redis-cli LRANGE q_www.tescobank.com 0 -1

whereas

redis-cli LLEN q_www.tescobank.com

returns its size only. When it returns 0, your crawl is finished and you can kill the process with CTRL-C, or with `storm kill` if you are running in distributed mode.

Conclusion

Looking back at last year's post, I realised that StormCrawler has evolved a lot since then and the previous instructions were not quite up to date. The archetype, for instance, is a good way of getting started with StormCrawler and provides a solid starting point. There are also a lot more useful resources that users can leverage, including our brand new Redis components.

The approach we used for Redis, where two different sets of keys are used for the crawl frontier and the URL statuses, could be reused for other key-value stores like HBase which do not necessarily have secondary indices that we could use for sharding the URLs per host and guaranteeing a good diversity of URLs in the topology.

Please do get involved and help review the PR for Redis. If you have any questions or problems, see http://stormcrawler.net/support/.

Happy crawling!

PS: Exercise for the reader

You probably noticed the 'crawler.flux' file in the directory generated by the archetype. Flux is a recent addition to Apache Storm: instead of defining a topology via a Java class as we did above, Flux allows you to define the components of the topology and their interactions via a YAML file in a language-neutral way. Better still, it means that you don't need to recompile the code if you change something in your topology (you'll still need to stop and restart it though).

The crawler.flux file corresponds to the default topology which we modified above to use Redis. As stated in the README, you can start a topology in the following way:

storm jar target/redis-crawler-1.0.jar org.apache.storm.flux.Flux --local crawler.flux

As an exercise, why don't you have a look at the Flux file and modify it so that it runs the Redis-based topology?

PPS: Exercise for the reader #2

Crawling is very nice, but unless you store or index the documents you crawl it remains a pretty pointless exercise. Why don't you modify the crawl above so that it sends the data to Elasticsearch, SOLR or CloudSearch? If you are feeling adventurous you could also try the WARC module and generate some great web archives to play with.