Showing posts with label crawler-commons. Show all posts
Showing posts with label crawler-commons. Show all posts

Monday 20 July 2020

What's new in StormCrawler 1.17


I have just released StormCrawler 1.17. As you can see in the list below, this contains important bugfixes and improvements. For this reason, we recommend that all users upgrade to this version, however, please check the breaking changes below if you apply it to an existing crawl.

Dependency upgrades

  • Various dependency upgrades  #808
  • CrawlerCommons 1.1 dependency #807
  • Tika 1.24.1 #797
  • Jackson-databind  #803 #793 #798

Core

  • Use regular expressions for custom number of threads per queue fetcher #788
  • /!breaking!/ Prefix protocol metadata #789
  • Basic authentication for OKHTTP #792
  • Utility to debug / test parsefilters #794
  • /!breaking!/ Remove deprecated methods and fields enhancement #791
  • AdaptiveScheduler to set last-modified time in metadata  #777 #812
  • /bugfix/ _fetch.exception_ key should be removed from metadata if subsequent fetches are successful #813
  • /bugfix/ SimpleFetcherBolt maxThrottleSleepMSec not deactivated #814
  • /!breaking!/ Index pages with content="noindex,follow" meta tag #750
  • Enable extension parsing for SitemapParser enhancement parser #749 #815

WARC



Elasticsearch


  • /bugfix/ AggregationSpout error due SimpleDateFormat not thread safe #809
  • /bugfix/ IndexerBolt issue causing ack failures #801
  • Allow ES to connect over a proxy #787
Of the breaking changes above, #789 is particularly important. If you want to use SC 1.17 on an existing crawl, make sure you add 

protocol.md.prefix: ""

to the configuration. Similarly, http.skip.robots has changed to http.robots.file.skip


Thanks to all contributors and users! Happy crawling! 

PS: something equally exciting is coming next ;-)



Monday 13 May 2019

What's new in StormCrawler 1.14

StormCrawler 1.14 was released yesterday and as usual, contains loads of improvements and bugfixes.


This release contains a number of breaking changes, mostly related to the move to Elasticsearch 7. We recommend that all users upgrade to this version as it contains very important fixes and performance improvements.

Dependency upgrades

  • crawler-commons 1.0 #693
  • okhttp 3.14.0 #692
  • guava 27.1 (#702)
  • icu4j 64.1  #702)
  • httpclient 4.5.8  #702)
  • Snakeyaml 1.24  #702)
  • wiremock 2.22.0  #702)
  • rometools 1.12.0  #702)
  • Elasticsearch 7.0.0 (#708)

Core



  • Track how long a spout has been without any URLs in its buffer (#685)
  • Change ack mechanism for StatusUpdaterBolts (#689)
  • Robots URL filter to get instructions from cache only (#700)
  • Allow indexing under canonical URL if in the same domain, not just host (#703)
  • /bugfix/ URLs ending with a space are fetched over and over again (#704)
  • ParseFilter to normalise the mime-type of documents into simple values (#707)
  • Robot rules should check the cache in case of a redirection (#709)
  • /bugfix/ Fix the logic around sitemap = false (#710)
  • Reduce logging of exceptions in FetcherBolt (#719)

Elasticsearch



  • Asynchronous spouts (i.e ES) can send queries after max delay since previous one ended  (#683)
  • StatusUpdaterBolt to load config from non-default param names (#687)
  • Add a ScrollSpout to read all the documents from a shard (#688 and #690) - see in our guest post how this can be used to reindex a status index.
  • ES IndexerBolt : check success of batches before acking tuples (#647)
  • /bugfix/ URLs with content that breaks ES get refetched over and over again (#705)
  • /bugfix/ URLs without valid host name (and routing) stay DISCOVERED forever (#706)
  • /bugfix/ ESSeedInjector: no URLs injected because URL filter does not subscribe to status stream (#715)
  • MetricsConsumer to include topology ID in metrics(#714)

WARC

  • Generate WARC request records (#509)
  • WARC format improvements (#691)

Tika


  • Set mimetype whitelist for Tika Parser (#712)

*********

I will be running a workshop on StormCrawler next month at the Web Archiving Conference in Zagreb and give a presentation jointly with Sebastian Nagel of CommonCrawl. I will come with loads of presents generously given by our friends at Elastic.


As usual, thanks to all contributors and users.

Happy crawling!


Thursday 23 March 2017

What’s new in StormCrawler 1.4

StormCrawler 1.4 has just been released! As usual, all users are advised to upgrade to this version as it fixes some bugs and contains quite a few new functionalities.

Core dependencies upgrades

  • Httpclient 4.5.3
  • Storm 1.0.3 #437

Core module

  • JSoupParser does not dedup outlinks properly, #375
  • Custom schedule based on metadata for non-success pages, #386
  • Adaptive fetch scheduler #407
  • Sitemap: increased default offset for guessing + made it configurable  #409
  • Added URLFilterBolt + use it in ESSeedInjector #421
  • URLStreamGrouping 425
  • Better handling of redirections for HTTP robots #4372d16
  • HTTP Proxy over Basic Authentication #432
  • Improved metrics for status updater cache (hits and misses) #434
  • File protocol implementation #436
  • Added CollectionMetrics (used in ES MetricsConsumer + ES Spout, see below) #7d35acb

AWS

  • Added code for caching and retrieving content from AWS S3 #e16b66ef

SOLR

  • Basic upgrade to Solr 6.4.1
  • Use ConcurrentUpdateSolrClient; #183

Elasticsearch

  • Various changes to StatusUpdaterBolt
    Fixed bugs introduced in 1.3 (use of SHA ID), synchronisation issues, better logging, optimisation of docs sent and more robust handling of tuples waiting to be acked (#426). The most important change is a bug fix whereby the cache was never hit (#442) which had a large impact on performance.
  • Simplified README + removed bigjar profile from pom #414
  • Provide basic mapping for doc index #433
  • Simple Grafana dashboard for SC metrics, #380
  • Generate metrics about status counts, #389
  • Spouts report time taken by queries using CollectionMetric, #439 - as illustrated below
Spout query times displayed by Grafana
(illustrating the impact of SamplerAggregationSpout on a large status index )

Coming next?

As usual, it is not clear what the next release will contain but hopefully, we'll switch to Elasticsearch 5 (you can already take it from the branch es5.3) and provide resources for Selenium (see branch jBrowserDriver). As I pointed out in my previous post, getting early feedback on work in progress is a great way of contributing to the project.

We'll probably also upgrade to the next release of crawler-commons, which will have a brand new SAX-based Sitemap parser. We might move to one of the next releases of Apache Storm, where a recent contribution I made will make it possible to use Elasticsearch 5. Also, some of our StormCrawler code has been donated to Storm, which is great!

In the meantime and as usual, thanks to all contributors and users and happy crawling!

PS: I will be running a workshop in Berlin next month about StormCrawler, Storm in general and Elasticsearch


Wednesday 6 July 2011

Crawler-Commons 0.1 released

As announced on various mailing-lists : 

The initial release of crawler-commons is available from : http://code.google.com/p/crawler-commons/downloads/list


The purpose of this project is to develop a set of reusable Java components that implement functionality common to any web crawler. These components would benefit from collaboration among various existing web crawler projects, and reduce duplication of effort. 

The current version contains resources for :
- parsing robots.txt
- parsing sitemaps
- URL analyzer which returns Top Level Domains
- a simple HttpFetcher

This release is available on Sonatype's OSS Nexus repository [https://oss.sonatype.org/content/repositories/releases/com/google/code/crawler-commons/] and should be available on Maven Central soon.

Please send your questions, comments or suggestions to http://groups.google.com/group/crawler-commons
Doing the release was quite an interesting experience as I'd never done that before. This was the opportunity to have a closer look at ANT+Maven, how to publish artefacts and use Nexus etc... which I am sure will be useful at some point (Behemoth? GORA? Nutch?).

Now that crawler-commons is released we can start using it from Nutch, Bixo [see https://issues.apache.org/jira/browse/NUTCH-1031].