“Shielding Your U.S. Online Store: How to Block Unwanted Bots and Optimize Performance”, by Nikolay Gul.
Bots are an inevitable part of the online world, but not all bots are beneficial. For a U.S.-only online store, excessive bot traffic from unnecessary or suspicious crawlers can strain your website’s resources and compromise user experience. In this guide, we’ll show you how to block popular bots like Baiduspider, AhrefsBot, and Mail.RU Bot using simple yet effective methods, including robots.txt, .htaccess rules, and advanced tools. By the end, you’ll know how to optimize your site for legitimate traffic while minimizing unwanted bot activity.
Below is an explanation of each bot or crawler commonly seen in server logs, its potential impact on a U.S.-only Zen Cart 1.3.8 shopping site, and a recommendation for handling it. The advice is tailored to a store that serves U.S.-based customers and wants to avoid unnecessary traffic.
1. Slackbot
- What it does: Slackbot fetches website content when someone shares a link in Slack (a team communication tool).
- Usefulness: Minimal for your site unless employees or partners share links on Slack.
- Recommendation: Allow if you or your team use Slack for work-related purposes involving your site; otherwise, disallow to reduce noise in logs.
2. Barkrowler
- What it does: A general-purpose web crawler used for website analytics and SEO tools.
- Usefulness: Limited, as it’s unlikely to drive targeted U.S. traffic or benefit your Zen Cart.
- Recommendation: Block it unless you explicitly use a service that relies on Barkrowler for analytics.
3. YandexImages
- What it does: Crawls images for Yandex (a Russian search engine).
- Usefulness: Likely none, as you don’t sell outside the U.S., and Yandex’s user base is predominantly Russian.
- Recommendation: Block it to save bandwidth unless you plan to target a Russian-speaking audience.
4. Go-http-client
- What it does: A generic user agent for various tools and scripts.
- Usefulness: Hard to determine since it’s often associated with automated requests (e.g., API testing, scrapers).
- Recommendation: Block unless you can verify the specific purpose and legitimacy of requests.
5. SeznamBot
- What it does: A Czech search engine crawler.
- Usefulness: Minimal, as its users are mostly in the Czech Republic and neighboring regions.
- Recommendation: Block it unless you see unexpected, genuine Czech traffic.
6. Scrapy-user-agents
- What it does: Indicates web scraping activity using the Scrapy framework.
- Usefulness: None for your purposes; likely scraping data like prices or product info.
- Recommendation: Block it to protect your site from potential scrapers.
7. ZoomInfoBot
- What it does: Gathers data for ZoomInfo, a business contact and intelligence platform.
- Usefulness: None for your e-commerce site unless you actively use ZoomInfo for sales or marketing.
- Recommendation: Block it, as it doesn’t benefit your traffic or sales directly.
8. Lightspeed
- What it does: A generic crawler often used by developers or SEO tools.
- Usefulness: Likely none unless linked to a specific analytics tool you’ve authorized.
- Recommendation: Block it unless you find evidence of legitimate use.
9. YisouSpider
- What it does: A Chinese search engine crawler.
- Usefulness: None for a U.S.-only audience, as its users are based in China.
- Recommendation: Block it to reduce unnecessary traffic and bandwidth usage.
10. Curl
- What it does: A command-line tool used for automated requests or API testing.
- Usefulness: None for general traffic, as it’s often used for scraping or testing.
- Recommendation: Block unless you or a trusted partner use it for legitimate purposes.
11. DomainStatsBot
- What it does: Collects data on websites for DomainStats.com, often for SEO and analytics.
- Usefulness: Limited unless you use their service or find value in their analysis.
- Recommendation: Block to avoid unnecessary traffic.
12. Mail.RU Bot
- What it does: Crawls for Mail.ru, a Russian search engine and email service.
- Usefulness: None for a U.S.-only audience, as its users are predominantly Russian.
- Recommendation: Block it to save resources.
13. MojeekBot
- What it does: Crawls for Mojeek, a small, privacy-focused search engine based in the UK.
- Usefulness: Minimal, as its user base is small and unlikely to drive significant traffic or sales.
- Recommendation: Allow if you value privacy-centric search engines or block to reduce unnecessary load.
14. Nautic Expo using Firefox/1.5
- What it does: Associated with crawling data for industry-related expos and directories.
- Usefulness: None for your U.S.-based e-commerce store unless you are marketing products for boating or marine industries specifically listed in such directories.
- Recommendation: Block to avoid unnecessary traffic.
15. Survey
- What it does: Likely used for automated data collection or polling purposes.
- Usefulness: None for a shopping cart website, as it does not drive customer traffic or improve SEO.
- Recommendation: Block to prevent misuse of resources.
16. nbot
- What it does: A generic bot with minimal documentation, often associated with unknown or suspicious activities.
- Usefulness: Unlikely to provide any benefit.
- Recommendation: Block due to lack of transparency and potential for misuse.
17. DoCoMo
- What it does: A crawler for NTT DoCoMo, a major Japanese telecommunications company.
- Usefulness: None for a U.S.-only audience.
- Recommendation: Block to save resources and avoid irrelevant traffic.
18. The Knowledge AI
- What it does: Crawls websites to gather data for AI or knowledge-building purposes.
- Usefulness: Limited or none, as it doesn’t directly contribute to customer traffic or sales.
- Recommendation: Block unless there is a specific reason to allow AI data collection.
19. Baiduspider
- What it does: The crawler for Baidu, a major Chinese search engine.
- Usefulness: None for a U.S.-only audience, as it targets a predominantly Chinese user base.
- Recommendation: Block to reduce unnecessary traffic and protect bandwidth.
20. Slackbot-LinkExpanding
- What it does: Fetches link previews when someone shares your website link on Slack.
- Usefulness: Minimal unless your team uses Slack for collaboration.
- Recommendation: Block unless you anticipate link-sharing within Slack that benefits your site.
21. SemrushBot
- What it does: Collects data for Semrush, a widely used SEO tool.
- Usefulness: Potentially helpful if you use Semrush for SEO monitoring. Otherwise, it consumes bandwidth with no direct benefit.
- Recommendation: Allow only if you or your SEO agency actively use Semrush; otherwise, block.
22. aiHitBot
- What it does: Gathers data for aiHit, a company database and research tool.
- Usefulness: None for a U.S.-focused online store.
- Recommendation: Block to prevent unnecessary resource consumption.
23. Baidu (catchall)
- What it does: A collective name for Baidu-related bots.
- Usefulness: None for U.S.-only businesses.
- Recommendation: Block all Baidu bots to avoid irrelevant traffic.
24. SemrushBot-SI
- What it does: A variant of SemrushBot focusing on specific indexing or analytics tasks.
- Usefulness: Similar to SemrushBot—only helpful if you actively use the Semrush tool.
- Recommendation: Same as SemrushBot—block unless actively used.
25. Dalvik
- What it does: User agent used by Android-based applications, often associated with automated or testing tools.
- Usefulness: None for your e-commerce store unless linked to a specific app integration.
- Recommendation: Block unless required for app-based use cases.
General Recommendations
If I were the site owner, I’d take a pragmatic approach:
Analyze server logs: Identify which bots consume significant resources without providing value.
Update robots.txt: Add specific disallow rules for unhelpful bots.
Monitor results: Use tools like Google Search Console and analytics software to ensure that blocking these bots doesn’t negatively impact legitimate traffic.
As a rule of thumb, block bots that:
Primarily target non-U.S. audiences (e.g., Baiduspider, YisouSpider, Mail.RU Bot).
Gather data for purposes unrelated to your business (e.g., Nautic Expo, aiHitBot, Survey).
Consume bandwidth without offering any SEO or customer acquisition benefits (e.g., nbot, Scrapy, ZoomInfoBot).
Next Steps
Update your robots.txt file: Include rules to disallow these bots explicitly.
Monitor your logs: Use server logs to identify any new or missed bots consuming resources unnecessarily.
Evaluate SEO tools: Only allow bots like SemrushBot if actively using their services.
What you might consider doing.
Adding bots to your robots.txt file will not stop all of them from visiting your site, because many crawlers (especially malicious or resource-heavy ones) ignore robots.txt directives entirely. Legitimate bots, like those from Google, Bing, and Yahoo, respect robots.txt, but others, including many unnecessary or harmful crawlers, will still access your site unless additional measures are in place.
Here’s an explanation and recommendations:
1. What Happens with robots.txt
- Respected by Legitimate Bots: Search engines like Google, Bing, Yahoo, and DuckDuckGo respect robots.txt directives.
- Ignored by Suspicious or Malicious Bots: Many bad actors, scrapers, and resource-heavy crawlers (e.g., Mail.RU Bot, AhrefsBot, Baiduspider) ignore robots.txt. They will continue crawling your site unless blocked at the server level or through other methods.
2. What Happens with .htaccess
.htaccess rules are enforced server-side, meaning:
- Effectively Block Bots: Bots matching the user agent rules in .htaccess are blocked entirely at the server level and cannot access your site.
- Clever Bots May Circumvent: Some advanced bots can spoof legitimate user agents (e.g., pretending to be Googlebot) to bypass such rules.
However, .htaccess blocking is far more effective than robots.txt alone, especially when combined with IP blocking for persistent bad actors (see the sketch below).
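As an illustration of pairing user-agent rules with IP blocking, here is a minimal .htaccess sketch for Apache 2.4+. The IP range shown (203.0.113.0/24) is a placeholder documentation range, not a real offender; substitute the ranges you actually see misbehaving in your logs, and note that Require directives need AllowOverride AuthConfig (or All) to work from .htaccess.
# Allow everyone except a persistent bad actor's IP range (Apache 2.4+, mod_authz_core)
# 203.0.113.0/24 is a placeholder documentation range; replace it with ranges from your logs
<RequireAll>
  Require all granted
  Require not ip 203.0.113.0/24
</RequireAll>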
3. Why Bots Might Still Visit
- User-Agent Spoofing: Some bots disguise themselves as legitimate crawlers like Googlebot or Bingbot. They can bypass both robots.txt and .htaccess rules unless deeper verification (such as a reverse DNS lookup) is applied.
- IP Address Rotation: Bots frequently rotate their IP addresses, making it difficult to block them entirely based on IP or other static identifiers.
4. Recommendations for Effective Bot Management
To ensure unwanted bots are blocked as much as possible:
A. Use robots.txt + .htaccess Together
- Start with robots.txt for standard compliance with legitimate bots.
- Use .htaccess to enforce blocking for resource-intensive or suspicious crawlers.
B. Implement Advanced Server-Side Bot Filtering
- Reverse DNS Lookup: Verify bots claiming to be Googlebot, Bingbot, or other legitimate crawlers. For example, Googlebot's IPs should resolve to googlebot.com; if a bot claims to be Googlebot but doesn't resolve correctly, block it. This requires custom server configuration or tools (see the sketch after this list).
- Dynamic Firewall Rules: Tools like Fail2Ban or Cloudflare can detect and block abusive behavior in real-time.
- Third-Party Plugins (for WordPress sites): Plugins like Wordfence or SecuPress can monitor and block bots, help identify patterns of unwanted traffic, and provide automated blocking. These only apply if part of your site runs on WordPress; a Zen Cart store relies on the server-level methods above.
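For the reverse DNS check mentioned above, here is one possible .htaccess sketch (many admins prefer to do this with a firewall or a scheduled script instead). It assumes HostnameLookups is enabled at the Apache server level; without that, %{REMOTE_HOST} contains only the client IP, the conditions below would also match the real Googlebot, and legitimate crawling would be blocked. Test carefully in staging before enabling anything like this.
# Block clients that claim to be Googlebot but whose reverse DNS does not end
# in googlebot.com or google.com. Requires HostnameLookups On in the main
# Apache configuration; verify behavior in staging first.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{REMOTE_HOST} !\.googlebot\.com$ [NC]
RewriteCond %{REMOTE_HOST} !\.google\.com$ [NC]
RewriteRule .* - [F,L]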
C. Monitor Your Logs
- Check your server access logs for repeated visits by bots you’re trying to block. Look for patterns like:
- Excessive requests from the same IP.
- User agents pretending to be legitimate bots.
- Use these patterns to update your .htaccess rules or ban IP ranges.
D. Use Cloud-Based Security Tools
- Services like Cloudflare or Sucuri can block harmful bots before they even reach your server. Cloudflare’s Bot Management and firewall rules are particularly effective.
Summary
- robots.txt: Helps manage legitimate bots but is ignored by many harmful crawlers. Use it as a first line of defense.
- .htaccess: Much stronger, as it enforces server-side blocking, but sophisticated bots can still spoof user agents to bypass these rules.
- Advanced Tools: To truly minimize bot traffic, integrate tools like reverse DNS verification, dynamic firewalls, or cloud-based protection (e.g., Cloudflare).
Bonus from Nikolay Gul.
Block Using robots.txt
The robots.txt file is a standard way to tell crawlers not to access certain parts of your site. However, malicious bots often ignore this file, so it is best combined with .htaccess rules for enforcement.
Add the following to your robots.txt file:
Example of the robots.txt
User-agent: Baiduspider
Disallow: /
User-agent: YandexImages
Disallow: /
User-agent: Mail.RU Bot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: Slackbot-LinkExpanding
Disallow: /
Commentary:
- Baiduspider, YandexImages, and Mail.RU Bot: These primarily serve non-U.S. search engines and are irrelevant to your store.
- SemrushBot and AhrefsBot: These crawlers are resource-heavy and used by SEO professionals to analyze websites, which does not benefit your site unless you actively use their tools.
- Slackbot-LinkExpanding: Only useful if you expect Slack users to share links, which is uncommon for most e-commerce businesses.
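If you also want to cover the other crawlers recommended for blocking earlier in this article (for example YisouSpider, SeznamBot, Barkrowler, DomainStatsBot, ZoomInfoBot, and aiHitBot), the pattern is identical. This is only a sketch: the tokens below are the bot names used above, and not every crawler honors robots.txt, so confirm the exact user-agent strings in your access logs and back this up with .htaccess rules.
User-agent: YisouSpider
Disallow: /
User-agent: SeznamBot
Disallow: /
User-agent: Barkrowler
Disallow: /
User-agent: DomainStatsBot
Disallow: /
User-agent: ZoomInfoBot
Disallow: /
User-agent: aiHitBot
Disallow: /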
Block Using .htaccess
To block these bots more effectively, use .htaccess rules. These rules are processed server-side, preventing the bots from ever reaching your website. Below are the safest rules to block the selected bots while ensuring minimal risk to your site's functionality.
Example of the .htaccess
RewriteEngine On
# Block Baiduspider
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule .* - [F,L]
# Block YandexImages
RewriteCond %{HTTP_USER_AGENT} YandexImages [NC]
RewriteRule .* - [F,L]
Explanation of the Code:
- RewriteEngine On: Enables Apache's rewrite engine so the rules below are processed (harmless if it is already enabled elsewhere in the file).
- RewriteCond %{HTTP_USER_AGENT}: Checks whether the request comes from a bot matching the user agent string (case-insensitive with [NC]).
- RewriteRule .* - [F,L]: Denies access ([F] = Forbidden) and stops processing further rules for the request ([L] = Last rule for this request).
Why This is Safe:
- The rules only match specific user agents and do not interfere with other server functions or legitimate traffic.
- They are independent of your PHP version (7.4 or otherwise), because they rely on Apache's built-in mod_rewrite module rather than on application code.
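To extend the same approach to the other crawlers listed in the robots.txt example above, one option is a single rule set with [OR] conditions. This is a sketch only: the patterns are simply the bot names used in this article, so check your access logs for the exact user-agent substrings (for instance, Mail.ru's bot typically identifies itself with a Mail.RU token) before deploying it.
# Block several unwanted crawlers with one rule set. [OR] means "match any of
# these conditions"; the final condition has no [OR].
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} YandexImages [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mail\.RU [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SemrushBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} AhrefsBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Slackbot-LinkExpanding [NC]
RewriteRule .* - [F,L]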
Testing and Monitoring
- Test Locally First: Before applying these rules on a live site, test them in a staging environment to ensure they don’t block legitimate traffic.
- Check Server Logs: After implementation, monitor your server logs to ensure the rules are working and no unexpected issues arise.
- Use Online Tools: The robots.txt report in Google Search Console (which replaced the older robots.txt Tester) can help you verify the validity of your robots.txt.
“Ready to take control of your website’s traffic? Learn how to block unwanted bots and focus on what matters—your customers. Implement these tips today and create a safer, faster e-commerce experience!”
Disclaimer:
The information provided in this blog is intended for general guidance only. Improper configuration of robots.txt or .htaccess files may cause unintended issues on your site. Test all changes in a staging environment before implementing them on your live website. For advanced server configurations, consult a qualified professional.