@JackH There are a number of seemingly complex (but actually pretty simple) ways to detect automated OSINT or web scraping operations. @asadleir is right that there are already established ways to detect the presence of legitimate scrapers/crawlers (the NLA’s Trove spider or the ANDS archiver are good examples), and those links are invaluable.
Web scraping, however, shows up quite differently from how a digital assistant or zero-click search will. So beyond the information @asadleir provided, the thing you will need more than anything else is access to your web server logs. What you will be looking for in those logs are patterns that indicate automated activity is taking place, regardless of how the User Agent reports itself. I could write an entire post on this sort of thing, but really it’s things like automated GET requests in a single directory where the resources are requested alphabetically or by date modified. To spot crawlers, try tactics like looking for obvious attempts to follow links that are disallowed in your robots.txt file, or set up dummy hidden links to dead pages (which won’t affect screen readers or Google spiders) that only a bot will attempt to follow. There’s a rough sketch of this kind of log check below.
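To make that a bit more concrete, here is a minimal sketch (in Python) of the kind of log analysis I mean. It’s an illustration only: it assumes a Common Log Format access log, and the disallowed prefixes, honeypot path, and file name are placeholders you would swap for your own. It flags clients that sweep a single directory in strict alphabetical order, request paths your robots.txt disallows, or follow a hidden dummy link.

```python
# Rough sketch only. Assumes a Common Log Format access log; the paths and
# file name below are illustrative placeholders, not from the original post.
import re
from collections import defaultdict

LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "GET (\S+) HTTP/[\d.]+" \d+ \d+')

# Prefixes your robots.txt disallows, plus a dummy "honeypot" link that is
# hidden from humans and excluded in robots.txt -- only a bot should request it.
DISALLOWED_PREFIXES = ("/admin/", "/internal/")
HONEYPOT_PATH = "/not-a-real-page.html"

def analyse(log_path):
    requests_by_ip = defaultdict(list)
    flagged = defaultdict(set)

    with open(log_path) as fh:
        for line in fh:
            m = LOG_LINE.match(line)
            if not m:
                continue
            ip, path = m.group(1), m.group(2)
            requests_by_ip[ip].append(path)

            if path == HONEYPOT_PATH:
                flagged[ip].add("followed honeypot link")
            if path.startswith(DISALLOWED_PREFIXES):
                flagged[ip].add("ignored robots.txt")

    # Flag clients that walk a single directory in strict alphabetical order --
    # a common signature of an automated GET loop rather than a human reader.
    for ip, paths in requests_by_ip.items():
        by_dir = defaultdict(list)
        for p in paths:
            directory, _, name = p.rpartition("/")
            by_dir[directory].append(name)
        for directory, names in by_dir.items():
            if len(names) >= 5 and names == sorted(names):
                flagged[ip].add(f"alphabetical sweep of {directory or '/'}")

    return flagged

if __name__ == "__main__":
    for ip, reasons in analyse("access.log").items():
        print(ip, "->", ", ".join(sorted(reasons)))
```

Your network or security people will almost certainly have better tooling than this, but even a quick pass like the above over a day’s logs will surface the obvious automated sweeps.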
While I agree that providing an API is always great, especially for data, I’m not sure @mhenry’s observation is entirely correct: people will scrape websites for a variety of reasons, most of which have nothing to do with whether an API exists and everything to do with getting information you probably don’t want them to have, or attempting to replicate a site for unapproved syndication or phishing purposes.
Which leads me to the single most important thing you can do as a Web practitioner in Government who is worried about this: talk to your Cyber Security team and Network Operations people. They will be monitoring all traffic across the network and sites, and likely have specialists who can assist you in spotting attempts to scrape, or in protecting your site from unauthorised archival or exfiltration activities.