
Thursday 26 October 2023

Focus on protocol improvements in StormCrawler 2.10

StormCrawler 2.10 was released yesterday and, as usual, it contains loads of improvements, dependency upgrades and bug fixes. Instead of going through each one of them, we will focus specifically on what was done for protocols.

First, every protocol implementation can now easily be tested on the command line, even FileProtocol or DelegatorProtocol thanks to #1097. For instance, 

storm local target/xxx-1.0-SNAPSHOT.jar com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol -f crawler-conf.yaml https://storm.apache.org/ -b

which configures the RemoteDriverProtocol with the content of crawler-conf.yaml and displays information on the console.

You might have noticed that the option to specify a configuration file has changed from -c to -f as the former conflicted with a Storm operator. We also added an option
-b which dumps the content of the URL to a file in the temp folder, making it very easy to check what the protocol actually retrieved for a given configuration.

Using the command above in combination with a remote debugger is particularly powerful. This can easily be done with 

export STORM_JAR_JVM_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=localhost:8000"

One of the main changes concerns the Selenium module. It had not seen any work for a while and its configuration was pretty obsolete. With #1093, we added some much-needed unit tests and removed some incompatible configuration. #1100 added an option to deactivate tracing and we fixed the user agent substitution (#1109).

An important and incompatible change in the Selenium module concerns the way timeouts are configured. The previous mechanism was opaque and error-prone. It was replaced in #1101; the timeouts are now configured with a map

  selenium.timeouts:
    script: -1
    pageLoad: -1
    implicit: -1

with -1 preserving the Selenium default values.
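To make the -1 convention concrete, here is a small stand-alone sketch of how such values could be resolved into Selenium Duration objects (a hypothetical illustration, not StormCrawler's actual code; the class and method names are invented):

```java
import java.time.Duration;
import java.util.Map;

public class TimeoutResolver {

    // Resolves a timeout from the configuration map; a value of -1
    // (or a missing key) means "keep Selenium's own default", which
    // we represent here as null.
    static Duration resolve(Map<String, Long> timeouts, String key) {
        Long seconds = timeouts.get(key);
        if (seconds == null || seconds < 0) {
            return null; // leave the Selenium default untouched
        }
        return Duration.ofSeconds(seconds);
    }

    public static void main(String[] args) {
        Map<String, Long> conf = Map.of("script", -1L, "pageLoad", 30L);
        System.out.println(resolve(conf, "script"));   // null -> Selenium default
        System.out.println(resolve(conf, "pageLoad")); // PT30S
    }
}
```

Returning null for -1 lets the caller simply skip the corresponding driver.manage().timeouts() call, so Selenium's defaults stay in place.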

The DelegatorProtocol has also been greatly improved. If you are not familiar with it, it allows you to determine which protocol implementation should be used for a URL given the metadata it has. For instance, 

  # use the normal protocol for sitemaps
  protocol.delegator.config:
   - className: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
     filters:
       isSitemap: "true"
   - className: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"

will use the OkHttp protocol for a URL if it has a key isSitemap in its metadata with a value of true. Otherwise, it will use the Selenium implementation.
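The first-match semantics can be pictured with a small self-contained sketch (a simplified model with invented names, not the actual DelegatorProtocol code):

```java
import java.util.List;
import java.util.Map;

public class DelegatorSketch {

    // One delegator entry: a target protocol class and the metadata
    // key/value pairs that must all match (class names shortened).
    record Entry(String className, Map<String, String> filters) {}

    // Returns the first entry whose filters all match the metadata;
    // an entry with no filters matches any URL, acting as a fallback.
    static String select(List<Entry> entries, Map<String, String> metadata) {
        for (Entry e : entries) {
            boolean allMatch = e.filters().entrySet().stream()
                    .allMatch(f -> f.getValue().equals(metadata.get(f.getKey())));
            if (allMatch) {
                return e.className();
            }
        }
        return null; // no entry matched
    }

    public static void main(String[] args) {
        List<Entry> config = List.of(
                new Entry("okhttp.HttpProtocol", Map.of("isSitemap", "true")),
                new Entry("selenium.RemoteDriverProtocol", Map.of()));
        // a sitemap goes through okhttp, everything else through Selenium
        System.out.println(select(config, Map.of("isSitemap", "true")));
        System.out.println(select(config, Map.of()));
    }
}
```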

With #1098, we added an operator indicating whether the conditions should be treated as an AND or OR. We also added the possibility to triage based on regular expressions on the URL itself (#1110). 

You can now express more complex configurations such as  

  # use the normal protocol for sitemaps, robots and if asked explicitly
  protocol.delegator.config:
   - className: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
     operator: OR
     filters:
       isSitemap: "true"
       robots.txt:
       skipSelenium:
     regex:
       - \.pdf
       - \.doc
   - className: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"

As a result, we removed the deprecated class DelegatorRemoteDriverProtocol.

The DelegatorProtocol is of course particularly useful for avoiding sending URLs to the Selenium implementation unnecessarily, as illustrated above.


StormCrawler 2.10 of course contains other changes and dependency updates and, as usual, we recommend that you switch to it. As we have seen today, the improvements that make protocol implementations easier to test and configure are reason enough to upgrade.

We would like to thank all the users and contributors to the 2.10 release.

Happy crawling!




 

Tuesday 22 March 2022

What's new in StormCrawler 2.3

StormCrawler 2.3 was released yesterday. It contains a relatively small number of changes compared to previous releases but these include important bug fixes. We have also ported existing ParseFilters to JSoupParseFilters, leading to some noticeable performance improvements and an exuberant tweet


We also welcomed Richard Zowalla as a new committer on the project.

Here are the main changes.

Dependency upgrades

  • Elasticsearch 7.17.0 
  • Tika 2.3.0
  • Caffeine 2.9.3 

Core

  • Convert LinkParseFilter into a JSoupFilter (#944)
  • Rewrote LinkParseFilter + added XPathFilter + tests for JSOUPFilters (#953)

  • General Code Refactoring and Good Practices (#937)

  • Add unified way of initializing classes via string … (#943)

  • Changed order of emit outlinks and emit of parent url ... (#954)

Elasticsearch 

  • Enable compression (#941)
  • Enable _source for content index in ES archetype (#958)

URLFrontier

  • Spout does not reconnect to URLFrontier if an exception occurs (#956)

The next release will probably include a new module for Elasticsearch 8, see #945. If you have experience with the new ES client library, your contribution will be very welcome.

Thank you to all users and contributors, in particular Felix Engl for his work on the code refactoring and Julian Alvarez for reporting and fixing the bug in #954.

Our users Gage Piracy have also been very generous in donating some of the customisations we wrote for them back to the project.

Happy crawling!

Monday 21 March 2022

Unlock your web crawl with URLFrontier

 Our guest writer today is Richard Zowalla.


Richard is a committer on StormCrawler, CrawlerCommons and other open source projects such as Apache TomEE. He is a PhD student in the field of medical web data mining. His recent work “Crawling the German Health Web” was published in the Journal of Medical Internet Research and is about using StormCrawler as a focused web crawler to collect a large sample of the German Health Web.

Richard will now tell us about his experimentation with URLFrontier and crawler4j. As you probably know, URLFrontier is a project sponsored by the NLNet foundation that we, at DigitalPebble, have been working on for just over a year and it is now in its second iteration. Let’s start by explaining what it is all about…

What is URLFrontier?

Web crawlers need to store information about the URLs they process; this is called a crawl frontier. Typically, each web crawling software has its own way of implementing it. Our very own StormCrawler is no exception, except that it is not tied to one specific backend and can use several implementations such as Elasticsearch, SOLR or SQL.


What URLFrontier does is provide a crawler- and language-neutral API for the operations that web crawlers perform when communicating with a crawl frontier, e.g. get the next URLs to crawl, update the information about URLs already processed, change the crawl rate for a particular hostname, get the list of active hosts, get statistics, etc.

URLFrontier is based on gRPC and provides not only an API but also an implementation of the service and client code in Java that can be used to communicate with it. Because the API and implementations are based on gRPC, URLFrontier can be used by web crawlers regardless of the programming language they are written in. As you would expect, StormCrawler has a module for URLFrontier, which was used extensively last year in a large-scale crawl described here.

By externalising the frontier logic from web crawlers, we can reuse the same implementation across different web crawlers and improve it as a community instead of having each crawler project constantly reinvent the wheel. It also helps modularize a crawler setup and make it distributed.

Let’s now see what Richard has been up to. 

The crawler4j framework

Crawler4j is an open source web crawler written in Java, which provides a simple interface for crawling the Web in a single process. Sadly, the original (academic) project became mostly inactive with its last release in 2018 leaving users only two options: (1) migrate to another crawler framework or (2) maintain a fork of the library and release it to Maven Central. In the end, we decided to do the latter and forked the repository to continue using crawler4j within our academic research projects.


Setting up a multi-threaded web crawler with crawler4j is fairly simple, and a fully distributed web crawler would have been overkill for our small use cases (i.e. focused on fetching single web sites). Therefore, we decided to maintain our own fork with up-to-date libraries and the possibility to (easily) switch between different frontier implementations, as Oracle's Sleepycat licence does not comply with some of our use cases.


To start with crawler4j, you need to choose one of the available crawl frontier implementations:

  • Sleepycat (the original implementation)
  • HSQLDB
  • URLFrontier

The HSQLDB and URLFrontier frontier implementations are only available in our fork. They aim to mitigate the rather strict licensing policies of Sleepycat.


After choosing a crawl frontier implementation, you can simply add the required dependency to your project via Maven (here we choose URLFrontier):


   <dependency>
     <groupId>de.hs-heilbronn.mi</groupId>
     <artifactId>crawler4j-with-urlfrontier</artifactId>
     <version>4.8.2</version>
     <type>pom</type>
   </dependency>


Next, you have to create a crawler class which extends WebCrawler. This class decides which URLs should be crawled and handles the fetched web pages.


public class FrontierWebCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // determines if a given URL should be visited by the crawler
        return true;
    }

    @Override
    public void visit(Page page) {
        // handle a fetched page, e.g. store it
    }
}


In addition, you need to implement a controller class which specifies the seeds for the web crawl, the folder in which crawler4j will store intermediate crawl data and some other config options such as the number of crawler threads or if the web crawler should be polite and/or honour the robots exclusion protocol. This can be done like this:


protected CrawlController init() throws Exception {
    final CrawlConfig config = new CrawlConfig();
    config.setCrawlStorageFolder("/tmp");
    config.setPolitenessDelay(800);
    config.setMaxDepthOfCrawling(3);
    config.setIncludeBinaryContentInCrawling(false);
    config.setResumableCrawling(true);
    config.setHaltOnError(false);
    final BasicURLNormalizer normalizer = BasicURLNormalizer.newBuilder()
            .idnNormalization(BasicURLNormalizer.IdnNormalization.NONE).build();
    final PageFetcher pageFetcher = new PageFetcher(config, normalizer);
    final RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    // we skip the robots checks for adding seeds (will be checked later on demand)
    robotstxtConfig.setSkipCheckForSeeds(true);
    final int maxQueues = 10;
    final int port = 10;
    final FrontierConfiguration frontierConfiguration =
            new URLFrontierConfiguration(config, maxQueues, "localhost", port);
    final RobotstxtServer robotstxtServer =
            new RobotstxtServer(robotstxtConfig, pageFetcher, frontierConfiguration.getWebURLFactory());
    return new CrawlController(config, normalizer, pageFetcher, robotstxtServer, frontierConfiguration);
}


Seeds can then be added via the CrawlController. To increase performance, you can skip the robots.txt check while adding new seeds.

Crawler4j with URLFrontier

The integration of URLFrontier in crawler4j basically boils down to three (adapter) classes and some boilerplate code to connect with the gRPC code provided by URLFrontier. This significantly reduces the amount of crawler logic needed to handle the crawl frontier.


As URLFrontier handles duplicate URLs and acts as a remote crawl frontier, it is now fairly simple to run crawler4j on different machines, with URLFrontier acting as the single point of synchronisation. Consequently, this approach can turn crawler4j into a simple distributed web crawler. Without a remote frontier (like URLFrontier), we would have had to implement a custom distributed frontier using a framework like Hazelcast. In both cases, distributing crawler4j comes at the cost of implementing additional business logic to handle or store the fetched web pages in a distributed way. Nevertheless, the ease of implementing a web crawler with crawler4j outweighs this issue.

The default URLFrontier service implementation is based on RocksDB and it is publicly available as a Docker image.

Experimenting with different frontier implementations

For our experiment, we relied on three virtual machines (VMs). Each VM was equipped with 4 vCPUs and 10 GB of memory, running Ubuntu 20.04 LTS with the latest OpenJDK 17. We used a seed list of 1M URLs generated from the site rankings computed by CommonCrawl.


Each web crawl was started simultaneously on each VM and ran for exactly 48 hours. We limited the crawling depth per URL to 3. URLFrontier was run as a Docker container residing on the same VM as the crawler. Every 30 seconds, we checked the number of processed (i.e. completed) URLs.


Note that we did not apply any further processing of fetched web pages, as this was not in the scope of our experiment. The example's code is available on GitHub.

Results

On average, the crawler4j framework was capable of downloading up to 90 web pages per minute with a politeness delay of 800ms between each request to the same host. The detailed statistics are:


  • Sleepycat: fetched ~90 pages/min
  • URLFrontier: fetched ~72 pages/min
  • HSQLDB: fetched ~68 pages/min


Figure 1 depicts the number of processed (i.e. fetched) URLs over the time period of 48 hours. 



Overall, there is a noticeable difference between the Sleepycat, URLFrontier and HSQLDB frontier implementations. However, HSQLDB is only a few pages per minute slower than the URLFrontier implementation. As the aggregate numbers show, Sleepycat is faster than the other implementations. We can assume that the proprietary Sleepycat communication protocol outperforms the gRPC (URLFrontier) and JDBC (HSQLDB) calls by adding less communication overhead.


Conclusion

Performance aside, one benefit of using URLFrontier is that the code needed to integrate it in crawler4j boils down to only three classes, while the other two implementations required a significantly more complex integration.


In addition, by adopting URLFrontier as a backend, it is possible to easily exchange the crawler implementation and re-use the same data as before. We also benefit from any improvements to the service implementation without the need to change a single line of code. In particular, the forthcoming versions of URLFrontier should contain some very useful features.

Another important advantage of URLFrontier is that it opens the content of the frontier to the outside world: you can view or manipulate the content of the frontier during an ongoing web crawl via the CLI. This is not possible with the other frontier implementations.


Overall, our experiment showed that URLFrontier is slower than the original Sleepycat implementation, which most likely originates from the overhead introduced by the gRPC calls to communicate with URLFrontier. This is also true for the JDBC-based HSQLDB implementation. On the plus side, URLFrontier does not suffer from (commercial) licensing issues such as Sleepycat and can turn crawler4j into a simple distributed web crawler with little additional work, unlike using the other two implementations.  


The figures given in this post depend on the particular seed list, the ordering of URLs, and the hardware used for the experiment. Therefore, you might get different results for your specific use case. Since the resources and configurations of this experiment are publicly available, you can reproduce and extend it as you wish.


Next steps

This experiment has been very successful and informative, and we hope to run more benchmarks in the future, for instance larger-scale crawling in fully distributed mode.

URLFrontier is getting many improvements in its current phase of development and we are beginning to see alternative implementations of the service, like this one based on Opensearch. We are also seeing the project gain some traction with existing web crawlers.

An alternative experiment would be to compare the performance of the different URLFrontier service implementations available. Exciting times ahead!

Happy crawling everyone and a massive thank you to Richard for being our guest writer. 


Tuesday 11 January 2022

What's new in StormCrawler 2.2

StormCrawler 2.2 has just been released. This marks the beginning of having releases only for 2.x; 1.18 was the last release for the 1.x branch, which is now discontinued. In case you were wondering why there was no "What's new in StormCrawler 2.1": it simply contained the same modifications as 1.18 and did not get its own announcement.

This version contains many bugfixes; as usual, users are advised to upgrade to it.

Happy crawling and thanks to our sponsors, contributors and users! PS: I am tempted to run a workshop on webcrawling with StormCrawler at the BigData conference in Vilnius in November. Anyone interested? If so please get in touch and let me know what you'd like to learn about. https://bigdataconference.eu/

Wednesday 5 May 2021

What's new in StormCrawler 1.18

 
StormCrawler 1.18 has just been released. Since the previous version dates from nearly 10 months ago, the number of changes is rather large (see below).

This version contains many bugfixes; as usual, users are advised to upgrade to it. One of the notable new features is the module for URLFrontier (if you haven't checked it out, do so right now!); I will publish a tutorial on how to use it soon.

1.18 is also likely to be the last release based on Apache Storm 1.x; our 2.x branch will become master as soon as I have released 2.1.

Happy crawling and thanks to our sponsors, contributors and users!

Monday 20 July 2020

What's new in StormCrawler 1.17


I have just released StormCrawler 1.17. As you can see in the list below, this release contains important bugfixes and improvements. For this reason, we recommend that all users upgrade to this version; however, please check the breaking changes below if you apply it to an existing crawl.

Dependency upgrades

  • Various dependency upgrades  #808
  • CrawlerCommons 1.1 dependency #807
  • Tika 1.24.1 #797
  • Jackson-databind  #803 #793 #798

Core

  • Use regular expressions for custom number of threads per queue fetcher #788
  • /!breaking!/ Prefix protocol metadata #789
  • Basic authentication for OKHTTP #792
  • Utility to debug / test parsefilters #794
  • /!breaking!/ Remove deprecated methods and fields enhancement #791
  • AdaptiveScheduler to set last-modified time in metadata  #777 #812
  • /bugfix/ _fetch.exception_ key should be removed from metadata if subsequent fetches are successful #813
  • /bugfix/ SimpleFetcherBolt maxThrottleSleepMSec not deactivated #814
  • /!breaking!/ Index pages with content="noindex,follow" meta tag #750
  • Enable extension parsing for SitemapParser enhancement parser #749 #815

WARC



Elasticsearch


  • /bugfix/ AggregationSpout error due SimpleDateFormat not thread safe #809
  • /bugfix/ IndexerBolt issue causing ack failures #801
  • Allow ES to connect over a proxy #787

Of the breaking changes above, #789 is particularly important. If you want to use SC 1.17 on an existing crawl, make sure you add 

protocol.md.prefix: ""

to the configuration. Similarly, http.skip.robots has changed to http.robots.file.skip


Thanks to all contributors and users! Happy crawling! 

PS: something equally exciting is coming next ;-)



Thursday 16 January 2020

What's new in StormCrawler 1.16?

Happy new year!

StormCrawler 1.16 was released a couple of days ago. You can find the full list of changes on https://github.com/DigitalPebble/storm-crawler/milestone/26?closed=1

As usual, we recommend that all users upgrade to this version as it contains important fixes and performance improvements.

Dependency upgrades

  • Tika 1.23 (#771)
  • ES 7.5.0 (#770)
  • jackson-databind from 2.9.9.2 to 2.9.10.1 dependency (#767)

Core

  • OKHttp configure authentication for proxies (#751)
  • Make URLBuffer configurable + AbstractURLBuffer uses URLPartitioner (#754)
  • /bugfix/ okhttp protocol: reliably mark trimmed content because of content limit (#757)
  • /!breaking!/ urlbuffer code in a separate package + 2 new implementations (#764)
  • Crawl-delay handling: allow `fetcher.max.crawl.delay` exceed 300 sec.(#768)
  • okhttp protocol: HTTP request header lacks protocol name and version (#775)
  • Locking mechanism for Metadata objects (#781)

LangID

  • /bugfix/ langID parse filter gets stuck (#758)

Elasticsearch

  • /bugfix/ Fix NullPointerException in JSONResourceWrappers  (#760)
  • ES specify field used for grouping the URLs explicitly in mapping (#761)
  • Use search after for pagination in HybridSpout (#762)
  • Filter queries in ES can be defined as lists (#765)
  • es.status.bucket.sort.field can take a list of values (#766)
  • Archetype for SC+Elasticsearch (#773)
  • ES merge seed injection into crawl topology (#778)
  • Kibana - change format of templates to ndjson (#780)
  • /bugfix/ HybridSpout get key for results when prefixed by "metadata." (#782)
  • AggregationSpout to store sortValues for the last result of each bucket (#783)
  • Import Kibana dashboards using the API (#785)
  • Include Kibana script and resources in ES archetype (#786)

One of the main improvements in 1.16 is the addition of a Maven archetype to generate a crawl topology using Elasticsearch as a backend (#773). This is done by calling

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-elasticsearch-archetype -DarchetypeVersion=LATEST

The generated project also contains a script and resources to load templates into Kibana.

The topology for Elasticsearch now includes the injection of seeds from a file, which was previously in a separate topology. These changes should help beginners get started with StormCrawler.

The previous release included URLBuffers, with just one simple implementation. Two new implementations have been added in #764. The brand new PriorityURLBuffer sorts the buckets by the number of acks they got since the last sort whereas the SchedulingURLBuffer tries to guess when a queue should release a URL based on how long it took its previous URLs to be acked on average. The former has been used extensively with the HybridSpout but the latter is still experimental.

Finally, we added a soft locking mechanism to Metadata (#781) to help trace the source of ConcurrentModificationExceptions. If you are experiencing such exceptions, calling metadata.lock() when emitting, e.g.

collector.emit(StatusStreamName, tuple, new Values(url, metadata.lock(), Status.FETCHED))

will trigger an exception whenever the metadata object is modified somewhere else. You might need to call unlock() in the subsequent bolts.

This does not change the way the Metadata works but is just there to help you debug.
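The locking idea can be illustrated with a self-contained sketch (invented names, not the actual Metadata class): once locked, any mutation throws a ConcurrentModificationException, which surfaces the offending call site in the stack trace.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class LockableMetadata {
    private final Map<String, String> values = new HashMap<>();
    private boolean locked = false;

    // Marks the object read-only and returns it, so the call can be
    // inlined into an emit, mirroring metadata.lock() in the post.
    public LockableMetadata lock() { locked = true; return this; }
    public LockableMetadata unlock() { locked = false; return this; }

    public void put(String key, String value) {
        if (locked) {
            // fails fast at the place where the illegal write happens
            throw new ConcurrentModificationException("metadata is locked");
        }
        values.put(key, value);
    }

    public String get(String key) { return values.get(key); }

    public static void main(String[] args) {
        LockableMetadata md = new LockableMetadata();
        md.put("fetch.statusCode", "200");
        md.lock();
        try {
            md.put("oops", "boom"); // this would be the bug we are hunting
        } catch (ConcurrentModificationException expected) {
            System.out.println("caught illegal write after lock()");
        }
        md.unlock();
        md.put("now", "fine"); // allowed again after unlock()
    }
}
```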

Hopefully, we should be able to release 2.0 in the next few months. In the meantime, happy crawling and a massive thank you to all contributors!