wofg-web-filters

Filters for processing Web ARChive (WARC) files as part of the WofG Web Reporting Service

Overview

These filters were originally developed with Funnelback for use in both in-crawl and post-crawl filtering of data gathered during a Whole-of-Australian Government web crawl.

Pre-gather workflow tasks generate mappings from domains to portfolios (drawn from the Australian Government Organisation Register) and augment them with other external data sources.
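As an illustration of the domain-to-portfolio mapping, here is a minimal Python sketch of a lookup by longest matching domain suffix. It is not taken from this repository: the function name `portfolio_for`, the `DOMAIN_TO_PORTFOLIO` table, and its sample entries are all hypothetical, and the real mapping format may differ.

```python
from urllib.parse import urlsplit

# Hypothetical sample data. The real mapping is generated from the
# Australian Government Organisation Register and other sources.
DOMAIN_TO_PORTFOLIO = {
    "example.gov.au": "Example Portfolio",
    "agency.example.gov.au": "Example Agency Portfolio",
}


def portfolio_for(url, mapping=DOMAIN_TO_PORTFOLIO):
    """Return the portfolio for the longest matching domain suffix, or None."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Try the full host first, then progressively shorter parent domains.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in mapping:
            return mapping[candidate]
    return None
```

Matching the longest suffix first lets a subdomain with its own entry override its parent domain, while unlisted subdomains such as `www` fall back to the parent.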

Post-gather, several content checks are run. These are written in Groovy, and are run with Funnelback's filter framework. Tools for splitting WARC files are also included at this stage.

Post-filtering, metadata is written to JSON for ingestion into Elasticsearch.
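One plausible shape for that JSON is the newline-delimited format accepted by the Elasticsearch bulk API, where each document is preceded by an action line. The Python sketch below assumes that format; the repository may write its JSON differently. The index name `wofg-web`, the record fields, and the choice of the page URL as the document `_id` are hypothetical.

```python
import json


def to_bulk_ndjson(records, index_name="wofg-web"):
    """Serialise metadata records as an Elasticsearch bulk request body.

    `index_name` and the use of each record's "url" as the document
    _id are illustrative assumptions, not taken from this repository.
    """
    lines = []
    for record in records:
        # Action line: tells Elasticsearch to index the next line's document.
        action = {"index": {"_index": index_name, "_id": record["url"]}}
        lines.append(json.dumps(action))
        # Source line: the metadata document itself.
        lines.append(json.dumps(record))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string can be posted to the `_bulk` endpoint with the `application/x-ndjson` content type.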