Scraping via Googlebot – How is it possible?
Hi,
I run a website that recently saw unusually high traffic from what appeared to be legitimate Googlebot requests. After analyzing the access patterns, I was able to identify the source.
Background
Someone has been scraping my website extensively using what appears to be the authentic Googlebot. I traced the activity back to the person responsible, and they revealed they're using a commercial API service that can trigger real Googlebot crawls on demand.
Technical Details
I tested the service myself to verify their claims, and confirmed that it really does dispatch a legitimate Googlebot crawl to any URL within 1–2 seconds.
Verified Googlebot IPs (via reverse DNS; a reproduction sketch follows the list):
- 66.249.76.65 → crawl-66-249-76-65.googlebot.com
- 192.178.4.87 → crawl-192-178-4-87.googlebot.com
- 2001:4860:4801:002d::0006 → crawl-2001-4860-4801-002d...googlebot.com
- Additional IPs in the 34.96.x.x range → googleusercontent.com
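For anyone who wants to reproduce the verification, here's a minimal Python sketch of the standard forward-confirmed reverse DNS check (PTR lookup, domain suffix check, then a forward lookup that must resolve back to the original IP). The suffix list is the one from Google's published verification docs; note that googleusercontent.com names identify Google's user-triggered fetchers rather than the crawler proper.

```python
import ipaddress
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS, per Google's documented procedure."""
    addr = ipaddress.ip_address(ip)  # normalizes IPv4/IPv6 text forms
    try:
        host, _, _ = socket.gethostbyaddr(str(addr))  # PTR (reverse) lookup
    except OSError:
        return False
    # Googlebot PTRs end in googlebot.com or google.com; user-triggered
    # fetchers (like the 34.96.x.x hits above) end in googleusercontent.com.
    if not host.endswith((".googlebot.com", ".google.com",
                          ".googleusercontent.com")):
        return False
    try:
        # Forward lookup: the claimed hostname must resolve back to the IP.
        forward = {ipaddress.ip_address(info[4][0])
                   for info in socket.getaddrinfo(host, None)}
    except OSError:
        return False
    return addr in forward

print(is_verified_googlebot("66.249.76.65"))  # expected: True
print(is_verified_googlebot("192.178.4.87"))  # expected: True
```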
Request Headers:
- User-Agent: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
- From: googlebot(at)googlebot.com
- Referer: https://www.google.com/
What Makes This Unusual:
- The service returns scraped HTML within 1–2 seconds
- It works for completely fresh URLs that have never been crawled
- All reverse DNS lookups confirm legitimate Google infrastructure
- The requests are triggered on demand via an API call
Verification Offer
I'm happy to validate these claims by having the service trigger a crawl of a unique test URL, so you can verify in your internal logs that it's genuinely Googlebot being dispatched.
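If it helps, here's roughly how such a test can be run end to end, as a hedged sketch: mint a nonce URL that exists nowhere else, hand it to the service, then scan the access log for hits on that path and run the reverse DNS check from the sketch above. The log path and combined log format are assumptions (nginx/Apache defaults), and example.com stands in for the real domain.

```python
import re
import secrets

# Mint a URL that has never been published anywhere, so any crawl of it
# must have been triggered by the service under test. Placeholder domain.
token = secrets.token_hex(16)
print(f"Give this URL to the service: https://example.com/googlebot-test/{token}")

# Afterwards, scan the access log for that token. Assumes a combined-format
# log at /var/log/nginx/access.log; adjust path/regex for your setup.
line_re = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)')
with open("/var/log/nginx/access.log") as log:
    for line in log:
        m = line_re.match(line)
        if m and token in m.group("path"):
            # is_verified_googlebot: the rDNS checker sketched earlier.
            print(m.group("ip"), "verified:", is_verified_googlebot(m.group("ip")))
```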
Any insights into how this is technically possible?
Thanks!
Search Console lets you enter a URL and test-fetch it to see how the page looks to the bot. Could be some reverse engineering/abuse of that API.
Correct me if I'm wrong, but I believe you're referring to the Rich Results Test. Fetching through that embeds `Google-InspectionTool` in the user agent, which isn't the case here.
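For anyone checking their own logs, the distinction is a plain product-token match in the User-Agent header. A tiny sketch (tokens as Google documents them; the sample UA is the one quoted in the post above):

```python
def classify_google_ua(user_agent: str) -> str:
    # Search Console's testing tools (URL Inspection live test, Rich
    # Results Test) send "Google-InspectionTool"; the crawler proper
    # sends "Googlebot". Check the more specific token first.
    if "Google-InspectionTool" in user_agent:
        return "search-console-test"
    if "Googlebot" in user_agent:
        return "googlebot"
    return "other"

ua = ("Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Mobile "
      "Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
print(classify_google_ua(ua))  # -> googlebot
```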
There are blockers for web crawlers. A few dozen were supplied with my neocities.org account, but I had to uncomment them.
Not sure how this is relevant.