Communities of practice

Web scraping: how to measure and provide insights?

Some of my colleagues contend that our website is a significant target of web scraping as a way of consuming our information. However, when I ask for any actual metrics or data about this type of interaction (the who and the why), all I get are vague statements that it happens, plus the same anecdotal stories about the importance of scraping provided by a single external user many years ago. I think this may be over-influencing some thinking, especially without clear evidence.

With that said, what are the best methods for gathering analytics and metrics about web scraping going forward, and is there any way to integrate that with, say, a Google Analytics/GTM based solution and derive some useful insights?

1 Like

It’s worth also thinking about search engines as scrapers that now deliver “zero-click searches”, including via digital assistants, which you won’t see any traffic for in traditional analytics https://www.binaryfountain.com/blog/zero-click-searches/
There are estimates that more than 50% of searches did not result in a site visit as of August 2019 https://sparktoro.com/blog/less-than-half-of-google-searches-now-result-in-a-click/

So how can you identify the scrapers? Each request in your web server logs includes a User Agent.
While some scrapers will pretend to be a normal web browser, others will use clearly different user agents such as “curl”, “Mechanize” or “python-requests”.
https://github.com/matomo-org/device-detector/blob/master/regexes/bots.yml lists the more web-search/utility style “bot” User Agents, while I would consider the code-library User Agents listed in https://github.com/matomo-org/device-detector/blob/master/regexes/client/libraries.yml to be used more by those intending to extract data from your specific website.
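As a rough illustration of the User Agent approach, here is a minimal Python sketch that tallies User Agents from a log and flags the obvious library-style ones. It assumes the common Apache/nginx “combined” log format (User Agent as the last double-quoted field); the log path and the signature list are placeholders to adapt.

```python
# Tally User Agents from an access log and flag likely scraping libraries.
# Assumes the "combined" log format, where the User Agent is the last
# double-quoted field; adjust the path and signatures for your environment.
import re
from collections import Counter

# Substrings typical of code libraries rather than browsers (not exhaustive)
LIBRARY_SIGNATURES = ("curl", "python-requests", "mechanize", "wget",
                      "scrapy", "go-http-client", "libwww-perl")

ua_counts = Counter()
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        quoted = re.findall(r'"([^"]*)"', line)
        if quoted:
            ua_counts[quoted[-1]] += 1  # last quoted field is the User Agent

print("Likely scraping-library User Agents:")
for ua, count in ua_counts.most_common():
    if any(sig in ua.lower() for sig in LIBRARY_SIGNATURES):
        print(f"{count:8d}  {ua}")
```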

You can use https://goaccess.io/ to turn many formats of web server logs into reports.

2 Likes

People will scrape your website only because you’re not offering an API.

So: offer the data via API and you’ll get excellent statistics on downloads etc. (and you’ll make everyone else’s life easier).
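To illustrate the idea (this is only a sketch, not a recommendation of any particular stack: Flask, the api_key parameter and the file names are assumptions), serving the data through an API endpoint turns every download into an explicit, countable event tied to a known consumer:

```python
# Minimal sketch: expose the dataset via an API so each download is logged
# per consumer. Flask, the api_key parameter and dataset.csv are assumptions.
from collections import Counter
from flask import Flask, request, send_file

app = Flask(__name__)
download_counts = Counter()  # in practice, write to your analytics store

@app.route("/api/v1/dataset")
def dataset():
    consumer = request.args.get("api_key", "anonymous")
    download_counts[consumer] += 1           # who downloaded, and how often
    return send_file("dataset.csv", mimetype="text/csv")

@app.route("/api/v1/stats")
def stats():
    return dict(download_counts)             # downloads per consumer

if __name__ == "__main__":
    app.run()
```

In practice those counts would feed whatever analytics store you already use, which is also where a GA/GTM integration could hook in.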

1 Like

@JackH There are a number of seemingly complex (but actually pretty simple) ways to detect automated OSInt or web scraping operations. @asadleir is right that there are already ways to detect the presence of legitimate scrapers/crawlers (the NLA’s Trove spider or the ANDS archiver are good examples), and those links are invaluable.

Web scraping, however, shows up quite differently from a digital assistant or a zero-click search. So, beyond the information asadleir provided, the key thing you will need more than anything else is access to your web server logs. What you will be looking for in these logs are patterns that indicate automated activity is taking place, regardless of how the User Agent reports itself. I could write an entire post on this sort of thing, but it boils down to things like automated GET requests within a single directory where the resources are requested alphabetically or by date modified. To spot crawlers, try tactics like looking for obvious attempts to follow links that are disallowed in your robots.txt file, or set up dummy hidden links to dead pages (which won’t affect screen readers or Google spiders) that only a bot will attempt to follow.
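To make that concrete, here is an illustrative sketch of those two heuristics (ordered walks through a directory, and honeypot hits), again assuming the combined log format; the /reports/ directory, the honeypot path and the threshold are purely made-up examples.

```python
# Flag clients whose requests within one directory arrive in alphabetical
# order, and clients that follow a honeypot link. Log format, paths and the
# threshold below are illustrative assumptions only.
import re
from collections import defaultdict

HONEYPOT_PATH = "/hidden-trap/"    # dummy hidden link only a bot would follow
WATCHED_DIR = "/reports/"          # directory to check for ordered walks

requests_by_ip = defaultdict(list)
honeypot_hits = set()

line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+)')

with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = line_re.match(line)
        if not match:
            continue
        ip, path = match.groups()
        if path.startswith(HONEYPOT_PATH):
            honeypot_hits.add(ip)
        if path.startswith(WATCHED_DIR):
            requests_by_ip[ip].append(path)

for ip, paths in requests_by_ip.items():
    if len(paths) > 20 and paths == sorted(paths):
        print(f"{ip}: {len(paths)} requests to {WATCHED_DIR} in alphabetical order")

for ip in honeypot_hits:
    print(f"{ip}: followed the honeypot link {HONEYPOT_PATH}")
```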

While I agree that providing an API is always great, especially for data, I’m not sure @mhenry’s observation is entirely correct - people will scrape websites for a variety of reasons, most of which have nothing to do with an API, and everything to do with getting information you probably don’t want them to have, or attempting to replicate a site for unapproved syndication or phishing purposes.

Which leads me to the single most important thing you can do as a Web practitioner in Government who is worried about this: talk to your Cyber Security team and Network Operations people. They will be monitoring all traffic across the network and sites, and likely have specialists who can assist you in spotting attempts to scrape, or in protecting your site from unauthorised archival or exfiltration activities.

1 Like