Cloudflare: Perplexity bots are scraping websites despite owner restrictions
-
Cloudflare has reported that web crawlers from the AI search startup Perplexity are bypassing website restrictions, even when explicitly blocked by site owners.
Here's what's happening:
Since July 1, 2025, Cloudflare has been automatically blocking AI crawlers on customer websites. But many site admins noticed that Perplexity bots were still getting through, despite being denied access via robots.txt and Web Application Firewall (WAF) rules.
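For readers who want to see what "denied via robots.txt" means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `is_allowed` helper, and the example.com URL are illustrative assumptions, not taken from Cloudflare's report. Keep in mind that robots.txt is only advisory: it tells a well-behaved crawler what it may fetch, but it cannot stop one that ignores it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: deny two Perplexity user-agent
# tokens site-wide and allow everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # placeholder URL
    for agent in ("PerplexityBot", "Perplexity-User", "Mozilla/5.0"):
        print(agent, "->", is_allowed(ROBOTS_TXT, agent, page))
```

A compliant crawler would run a check like this before every fetch and skip disallowed URLs. Nothing here enforces the rule on the server side, which is why site owners also rely on WAF rules.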
Upon investigation, Cloudflare discovered that:
• Perplexity disguises its bots as real users by spoofing browser headers, for example presenting as Chrome on macOS.
• The bots rotate IP addresses and ASNs (autonomous system numbers), allowing them to operate outside Perplexity's known ranges.
• When disguised, the bots crawl more slowly, dropping from 20-25 million requests per day to 3-6 million, which makes them harder to detect.
• If blocked completely, Perplexity tries to reconstruct page data from third-party sources, even if those sources are outdated or inaccurate.
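To make the spoofing point concrete, here is a rough sketch of why a block rule keyed only on the User-Agent header stops working once a crawler presents itself as an ordinary browser. The token list, the `blocked_by_user_agent` helper, and both header strings are illustrative assumptions; they are not copied from Cloudflare's logs or from any real WAF rule.

```python
# Tokens a site owner might block (assumed for illustration).
BLOCKED_AGENT_TOKENS = ("perplexitybot", "perplexity-user")


def blocked_by_user_agent(user_agent: str) -> bool:
    """Naive rule: block a request if its User-Agent contains a known bot token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


# Illustrative header values, not real captured traffic.
DECLARED_UA = "Mozilla/5.0 (compatible; PerplexityBot/1.0)"
SPOOFED_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

if __name__ == "__main__":
    print("declared bot blocked:", blocked_by_user_agent(DECLARED_UA))
    print("spoofed browser blocked:", blocked_by_user_agent(SPOOFED_UA))
```

The declared crawler is caught, while the request dressed up as Chrome on macOS passes straight through. Catching the second case takes signals beyond the header itself, which is presumably why this kind of detection is left to services like Cloudflare.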
The good news:
Cloudflare has rolled out new protections against stealth crawlers, even on free-tier plans. Users just need to enable the feature in their dashboard.
Also worth noting: Cloudflare confirmed that OpenAI's ChatGPT bots respected website rules and did not violate crawling policies.
Cloudflare reminds AI crawler operators:
Stay transparent, ethical, and responsible.
• Don't overload websites
• Don't harvest personal data
• Always identify your bot clearly
Bottom line: If you're running a site and want to protect your content from unauthorized AI scraping, now's the time to double-check your Cloudflare settings. The AI web crawler war is heating up, and staying one step ahead is key.
-
This is a huge red flag for web transparency and control. If site owners are explicitly blocking crawlers via robots.txt or other means, and those instructions are still being bypassed, that's not just a tech glitch, it's a trust issue. Cloudflare's involvement makes it even more complex, because they're often seen as the protectors of web infrastructure. If platforms like Perplexity are getting through, intentionally or not, it raises serious questions about consent, enforcement, and the future of content ownership in the age of AI.
-
Scraping the web is nothing new, but ignoring explicit opt-outs is where it crosses a line. If bots are bypassing standard blocks, that's not "indexing", that's digital trespassing. The irony is that these AI models depend on the open web, yet risk poisoning that same ecosystem by overreaching. Platforms need to be held accountable before this becomes the norm, not the exception. Respect to the post for calling this out: these are the conversations we need to have now, not later.