Communities of practice

Whole-of-Australian Government Web Crawl dataset


#1

The DTA has recently published the first of several whole-of-Australian Government web crawl datasets.

The dataset is big - the largest so far on data.gov.au. We’ve made it available both as a single 66 GB WARC file and as a series of 57 smaller WARC files.

This raw data is part of a broader service being provided by the DTA - WofG Web Reporting. Further detail about the background of the service (and our next steps) is described in a recent post on blog.data.gov.au.

Questions? Comments? Feel free to discuss in this thread.


#2

Working with this data, I found https://github.com/internetarchive/warctools the easiest Python library to use. It also includes some tools for basic exploration and indexing of WARC files.


#3

It’s still early days for this GitHub repo, but it could make for some interesting background reading:


#4

Here’s a little snippet that I’ve tested on the first split .warc file.

It should be pretty easy to modify to do whatever processing you want on each entry.

I had some issues getting Hanzo’s warctools (which asadleir linked to) working the way I wanted, so the snippet uses the warc library instead:

"""Iterate through a compressed warc file, decompressing the payloads
and outputting a chunk"""
import zlib

import warc  # https://github.com/internetarchive/warc (Python 2 only)

def decompressed_records(f):
    """Iterate through records, yielding decompressed payloads
    to avoid loading the entire warc into memory"""
    for rec in f:
        try:
            # read the record's payload
            payload = rec.payload.read()
            # decompress gzip'd content; 16 + zlib.MAX_WBITS tells zlib
            # to expect a gzip header and trailer
            plain_text = zlib.decompress(payload, 16 + zlib.MAX_WBITS)
            yield plain_text
        except zlib.error:
            # skip records whose payload can't be decompressed; catching only
            # zlib.error means Ctrl-C and unrelated bugs are not swallowed
            pass


if __name__ == '__main__':
    file_path = 'dta-report01-1.warc'
    f = warc.open(file_path)
    for record in decompressed_records(f):
        # you could write the record to a file/db 
        # or process it however you want here
        # instead we'll just print out a chunk
        # for demonstration purposes
        print("")
        print("####################################################")
        print(record[4500:6000])
        print("####################################################")
        print("")
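
Since the warc library used above is Python 2 only, here is a rough standard-library-only sketch of the same "one record at a time" idea for Python 3. It is not the library's API and not a full WARC parser: the function name `iter_warc_records` is made up for illustration, it assumes an uncompressed WARC file in which every record carries a `Content-Length` header, and the sample record at the bottom is invented test data rather than anything from the DTA crawl. It yields the raw record payload, so any decompression step would still need to be applied afterwards.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Hypothetical helper: reads one record at a time so the whole file
    never has to be loaded into memory.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected WARC version line, got %r" % line[:40])

        # Header block: "Name: value" lines, terminated by a blank line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()

        # The payload is exactly Content-Length bytes long.
        length = int(headers["content-length"])
        payload = stream.read(length)
        yield headers, payload


if __name__ == "__main__":
    # Invented sample record, for demonstration only.
    sample = (
        b"WARC/1.0\r\n"
        b"WARC-Type: response\r\n"
        b"WARC-Target-URI: http://example.gov.au/\r\n"
        b"Content-Length: 5\r\n"
        b"\r\n"
        b"hello"
        b"\r\n\r\n"
    )
    for headers, payload in iter_warc_records(io.BytesIO(sample)):
        print(headers["warc-type"], headers["warc-target-uri"], len(payload))
```

To try it on a real file, pass a file object opened in binary mode, e.g. `open('dta-report01-1.warc', 'rb')`, in place of the `io.BytesIO` sample.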