Detecting and Blocking Stealth Crawlers with .htaccess + Cloudflare

While investigating unexplained server strain for an online bookstore with over 50,000 titles, we discovered a stealth bot problem that had bypassed Cloudflare's initial defenses.

The culprits were stealth crawlers: bad bots disguised as real visitors.

Phase 1: Bots on Cloud Servers Pretending to Be Googlebot

We found a lot of bot traffic coming from these cloud platforms:

  • Google Cloud Platform (GCP)
  • Amazon AWS
  • Alibaba Cloud
  • OVH & M247

Some of these bots were even pretending to be Googlebot while running on GCP. This is particularly harmful because websites usually trust Google's crawlers.

Important: Just because it's Google Cloud doesn't mean it's Googlebot. Anyone can rent a GCP server and fake the Googlebot info.

πŸ” How to Spot a Fake Googlebot

  1. Reverse DNS Lookup – Find the hostname for the IP address.
  2. Check the Hostname – Make sure it is a subdomain of googlebot.com or google.com (for example, a name ending in .googlebot.com).
  3. Forward DNS Lookup – Check that the hostname resolves back to the same IP address.

Only crawlers that passed all three checks were allowed through as Googlebot.
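
The three checks above can be sketched in Python using only the standard library. This is a minimal illustration, not the exact script we ran: the function names are hypothetical, and the resolver functions are passed in as parameters so the logic can be tried without live DNS.

```python
import socket

# Hostname suffixes from the check described above.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip):
    """Return the PTR hostname for an IP, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Return the set of IP addresses a hostname resolves to."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except OSError:
        return set()
    return {info[4][0] for info in infos}


def is_verified_googlebot(ip, reverse=reverse_lookup, forward=forward_lookup):
    """Apply the three steps: reverse DNS, hostname check, forward DNS."""
    hostname = reverse(ip)
    if not hostname:
        return False
    hostname = hostname.rstrip(".").lower()
    if not hostname.endswith(GOOGLE_SUFFIXES):
        return False
    # The hostname must resolve back to the IP that made the request.
    return ip in forward(hostname)
```

Calling is_verified_googlebot(ip) with the default resolvers performs real DNS queries, so in practice you would probably cache the result per IP rather than look it up on every request.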

Phase 2: Bots Masking Their Identity at the Edge

After we blocked the obvious cloud bots, some trickier crawlers showed up. These bots hide their true identity from Cloudflare but reveal it to our server.

Example: Barkrowler

What Cloudflare Sees:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/132.0.0.0 Safari/537.36

What the Server Sees:
Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)

This gets around Cloudflare's firewall (Free plan) because it only checks the first request.

Why does this happen?

Cloudflare only sees the initial request from the client. Some bots change their info when they talk to your server. If you don't check the headers on your server, you won't notice.

How to Protect Your Website If You Don't Have Full Control

If you can't use an advanced firewall or manage your own server (for example, because you lack root access), here's a simple way to protect your website using Cloudflare and .htaccess.

✅ Step 1: Cloudflare Firewall Rules (Works with the Free Plan)

Make firewall rules to reduce the number of bots that reach your server.

🔒 Block Headers That Are Empty or Suspicious

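Rule: Block requests without a user-agent, and block or challenge requests without a referrer.

Why: Real browsers always send a user-agent. They usually send a referrer when following a link, but direct visits and bookmarks legitimately arrive without one, so a challenge is safer than an outright block for that signal.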

🚫 Block Fake or Known Bot User-Agents

Example:

  • Browser versions that do not exist yet, such as Chrome/134.0.0.0 (not released at the time of our investigation)
  • curl, wget, python, libwww, axios, etc.

Use patterns to find common scraper user agents.
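
Before turning a pattern into a firewall rule, it can help to try it against user-agent strings from your own logs. The sketch below is one way to do that in Python; the pattern list and the function name are illustrative assumptions built from the examples in this article, not a complete blocklist.

```python
import re

# Substrings taken from the examples in this article; extend to suit your logs.
SCRAPER_PATTERN = re.compile(
    r"curl|wget|python|libwww|axios|barkrowler", re.IGNORECASE
)


def classify_user_agent(user_agent):
    """Return 'empty', 'scraper', or 'unknown' for a User-Agent string."""
    if not user_agent or not user_agent.strip():
        return "empty"
    if SCRAPER_PATTERN.search(user_agent):
        return "scraper"
    # 'unknown' only means no pattern matched, not that the client is human.
    return "unknown"
```

Running it over a sample of real browser user agents first is a cheap way to catch a pattern that would block legitimate visitors.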

⚠️ JavaScript Challenge or Rate Limiting

Use the JS challenge for pages that get a lot of abuse.

Limit how many requests an IP address can make to stop fast scraping.
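
Cloudflare applies rate limiting at the edge, so there is nothing to code for it there. To illustrate the idea behind a per-IP limit, here is a small sliding-window sketch in Python. The class name and parameters are hypothetical, and this is not how Cloudflare implements it.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per IP within any window_seconds span."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip, now):
        """Return True if a request from ip at time now is within the limit."""
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

For example, SlidingWindowLimiter(3, 10.0) lets an IP make three requests in any ten-second span and rejects the fourth until the oldest one ages out. Rejected requests are not counted against the limit in this sketch.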

✅ Step 2: Use .htaccess to Filter at the Server

For bots that get past Cloudflare, .htaccess can be a last line of defense.

# Enable mod_rewrite (needed once, before any RewriteCond/RewriteRule)
RewriteEngine On

# Block requests with an empty User-Agent
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule ^ - [F,L]

# Block known scraper libraries (case-insensitive substring match)
RewriteCond %{HTTP_USER_AGENT} (curl|wget|scrapy|httpclient|axios|go-http-client|python|java|libwww|perl) [NC]
RewriteRule ^ - [F,L]

# Block Barkrowler and other aggressive crawlers
RewriteCond %{HTTP_USER_AGENT} Barkrowler [NC]
RewriteRule ^ - [F,L]

Be careful: .htaccess filtering uses server resources, so it's best as a backup, not your main filter.

πŸ” Other Things to Consider

  1. Check All Headers at the Server
    Record all request headers (like user-agent, referrer, and cookies) to find bots that are hiding.
  2. Identify Bots by Their Behavior
    → How many requests they make per second
    → If they're missing cookies or have inconsistent headers
    → If they use headless browsing tools (like Playwright, Puppeteer), which have unique patterns
  3. Use Threat Intelligence
    Check IP addresses against lists of known bad actors (like AbuseIPDB, Spamhaus).
  4. Turnstile or CAPTCHA
    If you have a lot of abuse, use Cloudflare's Turnstile CAPTCHA on important pages like search or login.
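
As a rough illustration of the header checks in items 1 and 2, the Python sketch below scores a single request from its logged headers. The function name, the chosen signals, and the weights are all assumptions for illustration. In particular, some headless Chrome setups advertise HeadlessChrome in the user-agent unless it is overridden, and first-time visitors legitimately arrive without cookies, so treat the score as a hint for further inspection rather than a verdict.

```python
def suspicion_score(headers):
    """Score one request's headers; a higher score means more bot-like."""
    # Header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}
    user_agent = h.get("user-agent", "")
    score = 0
    if not user_agent:
        score += 2  # real browsers send a User-Agent
    if "cookie" not in h:
        score += 1  # weak signal: first visits have no cookies either
    if "accept-language" not in h:
        score += 1  # mainstream browsers send this by default
    if "HeadlessChrome" in user_agent:
        score += 2  # headless browser that did not hide itself
    return score
```

Combined with a per-IP request rate, a score like this can help decide which IP addresses to check against a threat intelligence list or to challenge with Turnstile.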

The Takeaway:

Stealth crawlers are getting smarter, so Cloudflare alone isn't always enough, especially if you're on the free plan. By verifying crawlers, layering defenses at Cloudflare and at your server, and keeping good logs, you can take back control of your server.

Even without root access, you can greatly reduce bad bot traffic and protect your website once you know what to look for.


THE AUTHOR

Deepak - VizConn

Deepak founded VizConn in 2011 and currently serves as the Principal Consultant, leading the charge on CMS and AI adoption for clients across APAC, EU, and NA regions. Let us know and we will pass it on to him. He responds to most questions via email.

Deepak Selvan
Co-Founder - VizConn
