AI crawlers like PerplexityBot, ChatGPTBot, and Googlebot-Extended visit your store every day to understand your products and content. If your robots.txt blocks them or your server rules reject them, your store is invisible to AI shopping agents. Opening the door is a five-minute configuration change that determines whether AI agents can recommend your products.
What Are AI Crawlers?
AI crawlers are automated programs sent by AI platforms to read and understand your website content. They serve the same purpose as traditional search engine crawlers like Googlebot but for a different audience: large language models and AI agents.
When someone asks ChatGPT to recommend a product, or asks Perplexity to compare options, those AI platforms need to know what products exist and what they offer. AI crawlers do the reconnaissance work.
Traditional crawlers (Googlebot, Bingbot) collect information for search results pages. AI crawlers collect information for conversational responses and recommendations. The end product differs but the mechanism is similar.
73% of ecommerce stores block at least one major AI crawler, according to DemandSphere Radar data from April 2026. That means most stores are invisible to AI shopping before they even consider schema markup or content optimization.
The Major AI Crawlers in 2026
PerplexityBot
Who sends it: Perplexity AI (perplexity.ai)
Purpose: Crawls content for Perplexity’s AI search and new Computer product, which can book travel and complete tasks directly from user queries.
User agent string: PerplexityBot/1.0 (+https://perplexity.ai/...)
Behavior: Aggressive crawler that reads product pages, reviews, and policy pages. Perplexity’s Computer feature needs real-time inventory and pricing data.
Impact: High. Perplexity’s Computer launched in April 2026 and can bypass online travel agencies (OTAs) entirely, booking hotels and products directly through AI agents.
ChatGPTBot
Who sends it: OpenAI (ChatGPT)
Purpose: Crawls for real-time retrieval in ChatGPT conversations and for training updates.
User agent string: ChatGPTBot/1.0 or GPTBot (older)
Behavior: Follows links from user conversations and discovers new sites through web exploration. Reads product descriptions, comparisons, and reviews.
Impact: Very high. ChatGPT is the most-used AI assistant for product research and shopping queries.
Googlebot-Extended
Who sends it: Google
Purpose: Specialized crawler for AI-enhanced features including Google’s AI Overviews and Gemini integration.
User agent string: Googlebot-Extended/1.0 (or Googlebot with AI feature flags)
Behavior: Crawls with both traditional search intent and AI training intent. Prioritizes structured data and comprehensive content.
Impact: Very high. AI Overviews appear in 15% of Google searches as of Q1 2026, and Gemini is integrating into Google Workspace and apps.
ClaudeBot
Who sends it: Anthropic (Claude)
Purpose: Real-time retrieval for Claude conversations.
User agent string: ClaudeBot/1.0 or anthropic-ai/claude-webcrawler
Behavior: Conservative crawler that respects crawl delays. Reads technical documentation and product specifications.
Impact: Medium-high. Growing user base, particularly for technical and research-heavy purchases.
CopilotBot
Who sends it: Microsoft (Copilot)
Purpose: Crawls for Copilot in Bing, Edge browser, and Microsoft 365 integration.
User agent string: Microsoft-Copilot/1.0 or bingbot with AI flags
Behavior: Integrated with Bing’s existing crawler infrastructure. Reads product feeds and comparison content.
Impact: High. Copilot is integrated across Windows, Edge, and Microsoft 365.
AI Crawlers vs Traditional Crawlers: Key Differences
| Aspect | Traditional Crawlers (Googlebot, Bingbot) | AI Crawlers (PerplexityBot, ChatGPTBot) |
|---|---|---|
| Primary purpose | Build search index for SERPs | Gather data for conversational AI responses |
| Ranking signal | Backlinks, domain authority, content quality | Content structure, schema markup, data clarity |
| Output | Search results page with links | Direct answer or recommendation in conversation |
| Timeframe | Re-indexes every few days to weeks | Real-time or near-real-time retrieval |
| Content preference | Title tags, meta descriptions, headings | Structured data, factual descriptions, attributes |
| User experience | User clicks through to website | User stays in AI interface, rarely clicks through |
| Blocking strategy | Block low-value pages, prioritize high-value | Block nothing unless necessary; AI needs context |
The critical difference is the user journey. Traditional search drives traffic to your website. AI search keeps users in the AI interface. Your goal is not to drive clicks but to ensure the AI has accurate information to recommend your products confidently.
How AI Crawlers Work
Discovery Phase
AI crawlers discover your store through three primary methods:
User queries: When a user mentions your store or product in a conversation, the AI crawler visits to learn more.
Web exploration: AI platforms crawl the web continuously, following links from relevant sites, review platforms, and competitor pages.
Direct access: Some AI platforms maintain seed lists of ecommerce sites and crawl them periodically.
Crawling Phase
Once at your store, the AI crawler:
- Fetches robots.txt to check what it is allowed to access
- Reads page HTML to understand structure and content
- Parses JSON-LD schema for structured product data
- Follows internal links to discover product pages and categories
- Checks llms.txt if present for site-level context
- Crawls images referenced in schema for visual understanding
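The first few steps above can be sketched with Python’s standard library. This is a minimal illustration, not any platform’s actual crawler code; the robots.txt rules, HTML, store URL, and product data below are made-up stand-ins:

```python
import json
import re
from urllib import robotparser

# Hypothetical robots.txt and product-page HTML a crawler might fetch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
"""

PRODUCT_HTML = """\
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Trail Shoe", "offers": {"price": "89.00"}}
</script>
</head><body>...</body></html>
"""

# Step 1: check robots.txt before fetching anything else.
rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
assert rp.can_fetch("PerplexityBot", "https://yourstore.com/products/trail-shoe")

# Step 2: parse JSON-LD schema out of the raw HTML.
blocks = re.findall(
    r'<script type="application/ld\+json">(.*?)</script>',
    PRODUCT_HTML, re.DOTALL,
)
product = json.loads(blocks[0])
print(product["name"], product["offers"]["price"])  # prints: Trail Shoe 89.00
```

Note that the JSON-LD is read straight out of the raw HTML: if the schema only appears after JavaScript runs, this step finds nothing.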
Processing Phase
Back at the AI platform, the crawler’s data is:
- Extracted and normalized into structured format
- Indexed for real-time retrieval
- Integrated into the AI model’s knowledge base
- Used to answer user queries and make recommendations
The entire cycle can happen in seconds for real-time queries or over days for training updates.
Why Most Stores Block AI Crawlers (Accidentally)
robots.txt Mistakes
The most common problem is overly broad robots.txt directives designed for traditional search engines that also block AI crawlers.
Example of problematic robots.txt:
User-agent: *
Disallow: /api/
Disallow: /admin/
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?price=
This looks reasonable for Googlebot, but it uses User-agent: *, which matches every crawler, including AI crawlers. The Disallow: /api/ line is particularly problematic for AI crawlers that might need to access product API endpoints or data feeds.
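You can confirm this behavior with Python’s built-in robots.txt parser. The rules below reproduce two lines of the problematic file; the URLs are hypothetical:

```python
from urllib import robotparser

# Two lines from the problematic robots.txt above.
rules = """\
User-agent: *
Disallow: /api/
Disallow: /checkout/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# No group names PerplexityBot, so it falls back to User-agent: * --
# the wildcard restrictions apply to AI crawlers too.
print(rp.can_fetch("PerplexityBot", "https://yourstore.com/api/products"))   # False
print(rp.can_fetch("PerplexityBot", "https://yourstore.com/products/shoe"))  # True
```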
CDN and WAF Rules
Content delivery networks and web application firewalls often have rules that block unidentified or suspicious crawlers. If AI crawlers have new user agent strings that your WAF does not recognize, they may be blocked automatically.
Rate Limiting
Some stores implement aggressive rate limiting to prevent server overload from scrapers. If AI crawlers hit the rate limit, they stop crawling and your data goes stale.
JavaScript Rendering
Some AI crawlers do not render JavaScript. If your product data is loaded dynamically via JavaScript and not available in the initial HTML, the AI crawler sees an empty page.
How to Configure robots.txt for AI Crawlers
Allow All AI Crawlers (Recommended)
For most ecommerce stores, the safest approach is to explicitly allow major AI crawlers while blocking low-value directories.
# Allow all standard web crawlers
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /cart/
Disallow: /account/
Disallow: /api/private/
Disallow: /wp-admin/
Disallow: /wp-includes/
# Explicitly allow AI crawlers
User-agent: PerplexityBot
Disallow:
User-agent: ChatGPTBot
Disallow:
User-agent: Googlebot-Extended
Disallow:
User-agent: ClaudeBot
Disallow:
User-agent: Microsoft-Copilot
Disallow:
# Allow access to sitemaps
Sitemap: https://yourstore.com/sitemap.xml
Sitemap: https://yourstore.com/sitemap_products.xml
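A quick way to sanity-check this configuration is Python’s `urllib.robotparser`; the snippet below tests a trimmed version of the file (sitemap lines omitted, URLs hypothetical):

```python
from urllib import robotparser

# Trimmed version of the recommended robots.txt above.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /checkout/

User-agent: PerplexityBot
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# PerplexityBot matches its own group (empty Disallow = allow everything),
# while crawlers without a named group still fall under the wildcard rules.
print(rp.can_fetch("PerplexityBot", "https://yourstore.com/products/shoe"))  # True
print(rp.can_fetch("SomeOtherBot", "https://yourstore.com/admin/"))          # False
```

One caveat worth knowing: a crawler that matches a named group ignores the wildcard group entirely, so the empty Disallow grants that bot access to everything, including /admin/ and /checkout/. If you still want AI crawlers kept out of those directories, repeat the Disallow lines inside each named group.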
Allow Specific AI Crawlers Only
If you want to be more selective, allow only the AI platforms you care about:
# Block everything by default
User-agent: *
Disallow: /
# Allow search engines
User-agent: Googlebot
Allow: /
User-agent: bingbot
Allow: /
# Allow specific AI crawlers
User-agent: PerplexityBot
Allow: /
User-agent: ChatGPTBot
Allow: /
User-agent: Googlebot-Extended
Allow: /
# Sitemaps
Sitemap: https://yourstore.com/sitemap.xml
Block Only AI Crawlers (Not Recommended)
Unless you have a specific reason to block AI, this is generally a bad idea. But if you must:
User-agent: PerplexityBot
Disallow: /
User-agent: ChatGPTBot
Disallow: /
This makes your store invisible to those AI platforms. Consider whether blocking is worth the loss of AI recommendations.
How to Monitor AI Crawler Traffic
Server Log Analysis
Check your server access logs to see which AI crawlers are visiting and what they are accessing.
Check for PerplexityBot:
grep "PerplexityBot" /var/log/nginx/access.log | tail -20
Check for ChatGPTBot:
grep "ChatGPTBot\|GPTBot" /var/log/nginx/access.log | tail -20
Check for Googlebot-Extended:
grep "Googlebot-Extended" /var/log/nginx/access.log | tail -20
Check for all AI crawlers at once:
grep -E "(PerplexityBot|ChatGPTBot|Googlebot-Extended|ClaudeBot|Microsoft-Copilot)" /var/log/nginx/access.log | tail -50
Look for:
- 200 status codes (successful access)
- 403 or 404 status codes (blocked or not found)
- Which pages they are accessing (products, categories, policies)
- Crawl frequency (daily, weekly, sporadic)
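If you want a summary rather than raw grep output, a short Python script can tally hits per crawler and status code. This is an illustrative sketch that assumes nginx’s combined log format; the sample lines and IPs are invented:

```python
import re
from collections import Counter

# Order matters: "ChatGPTBot" must come before "GPTBot" so the substring
# match below does not double-count the same line.
AI_BOTS = ("PerplexityBot", "ChatGPTBot", "GPTBot",
           "Googlebot-Extended", "ClaudeBot", "Microsoft-Copilot")

def summarize(log_lines):
    """Count (bot, status) pairs from combined-format access log lines."""
    counts = Counter()
    for line in log_lines:
        m = re.search(r'" (\d{3}) ', line)  # status code follows the quoted request
        if not m:
            continue
        for bot in AI_BOTS:
            if bot in line:
                counts[(bot, m.group(1))] += 1
                break
    return counts

# Two sample lines in nginx combined log format (made up for illustration).
sample = [
    '1.2.3.4 - - [01/May/2026:10:00:00 +0000] "GET /products/shoe HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '5.6.7.8 - - [01/May/2026:10:01:00 +0000] "GET /api/feed HTTP/1.1" 403 0 "-" "ChatGPTBot/1.0"',
]
print(summarize(sample))
```

Feed it your real log (`summarize(open("/var/log/nginx/access.log"))`) and a cluster of 403s for one bot is a strong signal that a WAF or robots.txt rule is blocking it.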
Google Search Console
Google Search Console does not separate Googlebot-Extended traffic from regular Googlebot in most reports, but you can infer AI crawler access by monitoring “Crawled - not indexed” pages. If AI crawlers are accessing pages that Google does not index, those pages may still be useful for AI retrieval.
Third-Party AI Monitoring Tools
DemandSphere Radar, launched in April 2026, is designed specifically to track AI search visibility across ChatGPT, Perplexity, Gemini, and other platforms. It addresses the question: “How do you measure visibility when there is no search results page to analyze?”
Similar tools are emerging as AI search tracking becomes a distinct discipline from traditional SEO monitoring.
Common AI Crawler Blocking Patterns and Fixes
Pattern 1: Blocking All Unknown Crawlers
Problem:
User-agent: *
Disallow: /
Or WAF rules that block any crawler not in an allowlist.
Fix: Update robots.txt to explicitly allow AI crawlers, and add AI crawler user agents to your WAF allowlist.
Pattern 2: Blocking API Endpoints
Problem:
User-agent: *
Disallow: /api/
Fix: Distinguish between private and public API endpoints:
User-agent: *
Disallow: /api/private/
# AI crawlers can access public APIs
User-agent: PerplexityBot
Disallow: /api/private/
Allow: /api/products/
Allow: /api/inventory/
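The mixed Allow/Disallow rules above can be verified the same way with `urllib.robotparser` (URLs hypothetical):

```python
from urllib import robotparser

# The fixed rules above: private API blocked, public endpoints opened.
rules = """\
User-agent: PerplexityBot
Disallow: /api/private/
Allow: /api/products/
Allow: /api/inventory/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("PerplexityBot", "https://yourstore.com/api/products/123"))  # True
print(rp.can_fetch("PerplexityBot", "https://yourstore.com/api/private/keys"))  # False
```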
Pattern 3: Query Parameter Blocking
Problem:
User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?price=
Impact: AI crawlers may not be able to access specific product variants or filtered views.
Fix: Allow AI crawlers to access these parameters:
User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?price=
User-agent: PerplexityBot
Allow: /*?sort=
Allow: /*?filter=
Allow: /*?price=
Pattern 4: JavaScript-Only Product Data
Problem: Product data is loaded via JavaScript AJAX calls and not available in the initial HTML. AI crawlers that do not render JavaScript see empty pages.
Fix: Ensure critical product data is available in the initial HTML or provide server-side rendered alternatives. Many AI crawlers do not execute JavaScript.
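A simple way to audit this is to check whether the raw HTML, before any JavaScript executes, already contains Product schema. The helper below is a rough regex-based sketch (a real audit might use an HTML parser); the sample strings are invented:

```python
import re

def has_inline_product_schema(html: str) -> bool:
    """Return True if the raw HTML (before any JavaScript runs) already
    contains a JSON-LD block mentioning a Product type."""
    for m in re.finditer(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        html, re.DOTALL | re.IGNORECASE,
    ):
        if '"Product"' in m.group(1):
            return True
    return False

server_rendered = '<script type="application/ld+json">{"@type": "Product"}</script>'
js_only = '<div id="app"></div><script src="/bundle.js"></script>'

print(has_inline_product_schema(server_rendered))  # True
print(has_inline_product_schema(js_only))          # False
```

Run it against the response body of `curl https://yourstore.com/products/...` rather than what you see in the browser, since the browser has already executed your JavaScript.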
Platform-Specific Considerations
Shopify
Shopify automatically generates a robots.txt file at /robots.txt. You can customize it in your Shopify admin:
- Go to Online Store > Preferences
- Scroll to “Robots.txt editing”
- Edit the file to allow AI crawlers
Shopify’s default robots.txt is usually permissive. The issue is often with installed apps that add their own blocking rules.
WooCommerce
WooCommerce does not have a built-in robots.txt editor. You edit the file directly in your WordPress installation:
- Access your site via FTP or file manager
- Edit public_html/robots.txt (or create it if it does not exist)
- Add AI crawler rules
WordPress security plugins sometimes add their own robots.txt rules. Check your plugin settings.
BigCommerce
BigCommerce manages robots.txt through the admin panel:
- Go to Store Setup > SEO
- Find “Robots.txt” section
- Edit to include AI crawler rules
Custom/Headless
You have full control. Create or edit robots.txt at the root of your domain. Ensure your CDN and WAF rules do not override it.
The Risk of Over-Blocking
Blocking AI crawlers seems safe if you are worried about content scraping. The reality is that AI crawlers are already reading your content. Your product pages are public. Anyone can visit them. AI crawlers are just automated visitors.
The question is not whether AI will read your data. It is whether AI will have accurate, structured data or will be forced to guess from incomplete information.
Blocking AI crawlers does not protect your content. It only makes AI recommendations about your products less accurate. That hurts you more than it protects you.
Consider your options:
- Let AI crawlers in and compete on product quality, accurate data, and good reviews.
- Block AI crawlers and lose AI recommendations while competitors capture that market share.
Shopti can help you implement the right balance of access controls and structured data.
How to Test Your AI Crawler Configuration
Manual robots.txt Check
Visit https://yourstore.com/robots.txt in your browser. Does it load? Are your AI crawler rules present?
robots.txt Tester
Use Google’s robots.txt tester (search.google.com/test/robots) to verify that your rules work as intended. While designed for Googlebot, it validates syntax for all crawlers.
Crawl Simulation
Use a tool like Screaming Frog or Sitebulb to simulate an AI crawler crawl. Configure the user agent to PerplexityBot or ChatGPTBot and see what the crawler can access.
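For a quick spot check without a dedicated tool, you can fetch a single page while presenting an AI crawler’s user agent. This sketch uses Python’s standard library; the URL and user agent string are placeholder assumptions, and the actual network call is left commented out:

```python
from urllib import request

# Hypothetical spot check: fetch one product page the way PerplexityBot would.
req = request.Request(
    "https://yourstore.com/products/trail-shoe",
    headers={"User-Agent": "PerplexityBot/1.0 (+https://perplexity.ai/)"},
)
print(req.get_header("User-agent"))  # confirm the header is set

# To run the check for real, uncomment the next lines and inspect the
# status code and body length your server returns to this user agent:
# with request.urlopen(req, timeout=10) as resp:
#     print(resp.status, len(resp.read()))
```

A 403 or an unusually small body for this user agent, compared with a normal browser fetch, points at a WAF or rate-limit rule treating the crawler differently.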
Live Monitoring
After updating your robots.txt, monitor your server logs for increased AI crawler activity. You should see more frequent visits from the crawlers you explicitly allowed.
AI Query Test
Ask ChatGPT or Perplexity about your products: “What can you tell me about [your store]?” or “What does [your store] sell?” If the AI has crawled your store, it should provide accurate information. If it has not, the response will be generic or say it does not have enough information.
The Future of AI Crawlers
AI crawling is still evolving. Expect to see:
- More specialized crawlers for different AI use cases (shopping, travel, local business)
- Crawler authentication allowing sites to verify legitimate AI crawlers and block imposters
- Crawl budgets similar to Google’s, where AI platforms limit how much they crawl each site
- Real-time crawl triggers where a user query causes an immediate crawl of relevant sites
- Bidirectional communication where sites can push updates to AI crawlers instead of waiting to be crawled
The trend is toward more AI crawling, not less. Preparing now gives you an advantage as AI shopping grows.
Summary Checklist
- Check your current robots.txt for AI crawler rules
- Review server logs for AI crawler activity
- Update robots.txt to explicitly allow major AI crawlers
- Check CDN and WAF rules for accidental blocking
- Test with Google’s robots.txt tester
- Monitor logs for increased AI crawler visits after changes
- Run an AI query test to verify visibility
AI crawlers are visiting your store whether you invite them or not. The question is whether you are opening the door or forcing them to peek through the cracks. Configure robots.txt correctly, and AI agents will have the data they need to recommend your products confidently.
Check your store’s agent discoverability score free at shopti.ai
FAQ
Do AI crawlers respect robots.txt? Yes, legitimate AI crawlers from OpenAI, Anthropic, Google, Perplexity, and Microsoft all respect standard robots.txt rules. Some AI scrapers may not, but blocking those is a security issue, not an SEO issue.
Will AI crawlers overload my server? AI crawlers follow crawl-delay directives when present and are generally conservative. If you are concerned, add a Crawl-delay: 5 directive to slow crawls to one request every five seconds per crawler.
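You can confirm the directive parses as intended with Python’s standard robots.txt parser (bot name as used throughout this article):

```python
from urllib import robotparser

rules = """\
User-agent: PerplexityBot
Crawl-delay: 5
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())
print(rp.crawl_delay("PerplexityBot"))  # 5
```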
Should I block AI crawlers to protect my content? Blocking AI crawlers does not protect your content. Your pages are publicly accessible. AI platforms already scrape content from various sources. Blocking only makes your data less accurate in AI responses.
How often do AI crawlers visit? Varies by platform. ChatGPTBot may visit daily for sites frequently mentioned in conversations. PerplexityBot visits regularly for sites in its index. New crawlers may visit sporadically during discovery phases.
Do I need to worry about AI scraper bots? Yes, there are unauthorized AI scrapers that do not respect robots.txt. These are different from legitimate AI crawlers. Use rate limiting, CAPTCHAs, and WAF rules to block abusive scrapers while allowing legitimate AI crawlers.
What is the difference between Googlebot and Googlebot-Extended? Googlebot crawls for traditional search results. Googlebot-Extended crawls for AI features like AI Overviews and Gemini integration. Both are from Google and serve different purposes.
Can I allow AI crawlers but block traditional search engines? Yes, you can configure robots.txt to disallow Googlebot but allow ChatGPTBot and PerplexityBot. This makes sense if your strategy focuses on AI shopping rather than traditional search traffic.
How do I know if an AI crawler is legitimate? Check the user agent string against official documentation from the AI platform. Legitimate crawlers also resolve to IP addresses owned by the platform. Use reverse DNS verification if you are unsure.
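The reverse DNS check can be scripted in a few lines. This is a sketch: the domain suffixes below are assumptions for illustration, and you should take the real verified hostnames from each platform’s own documentation:

```python
import socket

# Hypothetical domain suffixes -- check each platform's documentation
# for the list it actually publishes.
VERIFIED_SUFFIXES = (".perplexity.ai", ".openai.com", ".googlebot.com")

def hostname_is_verified(hostname: str) -> bool:
    """Suffix check applied after a reverse-DNS lookup of the crawler's IP."""
    return hostname.endswith(VERIFIED_SUFFIXES)

def verify_crawler_ip(ip: str) -> bool:
    """Reverse-DNS the IP, then forward-resolve the hostname back to
    confirm it maps to the same IP (guards against spoofed PTR records)."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
    return hostname_is_verified(hostname) and ip in forward_ips

print(hostname_is_verified("crawl-66-249-66-1.googlebot.com"))  # True
print(hostname_is_verified("evil-imposter.example.com"))        # False
```

The forward-confirmation step matters: anyone can set a PTR record claiming to be a crawler host, but only the platform can make its own domain resolve back to that IP.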
