AI shopping agents discover products through six distinct pathways: web crawlers, structured feeds, API integrations, real-time scraping, user-uploaded data, and semantic search engines. Each pathway requires a different technical implementation, and stores that optimize for all six see 3.2x higher AI recommendation rates than stores relying on just one or two methods.
The discovery pathway an AI agent uses determines how it finds, parses, and recommends your products. ChatGPT might discover your store through web crawling, while Perplexity’s Computer agent accesses your products via API integration. Amazon’s AI assistant uses structured feeds, and a user might upload product details directly to an agent interface.
Understanding all six pathways lets you build comprehensive discoverability. Most stores optimize for crawlers only, missing the other five pathways entirely. This fragmentation explains why 67% of ecommerce products never appear in AI recommendations despite having schema markup.
Pathway 1: Web Crawlers (PerplexityBot, GPTBot, Google-Extended)
Web crawlers are the most common discovery mechanism. AI platforms send crawlers to read your website content, parse product pages, and extract structured data.
How it works:
- Crawler requests your homepage
- Follows internal links to product pages
- Parses HTML and JSON-LD schema markup
- Extracts product attributes, prices, availability
- Returns data to AI platform for indexing
Key optimization requirements:
- robots.txt must allow crawlers (73% of stores block at least one major AI crawler)
- Product pages need Product schema markup with required fields
- Fast page load times (under 2 seconds)
- Server can handle crawler traffic without rate limiting
- Sitemap.xml includes all product URLs
What AI crawlers need:
- Product titles and descriptions
- Price with currency
- Availability status
- GTIN, SKU, or MPN identifiers
- Product images with alt text
- Variant data (size, color, material)
- Category and brand information
- Review ratings and counts
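The fields listed above map onto schema.org Product markup. The following is a minimal sketch of how a product page could emit that markup as JSON-LD. The input dictionary and its key names (`in_stock`, `review_count`, and so on) are hypothetical and not taken from any platform API; only the schema.org property names are standard.

```python
import json


def product_jsonld(product: dict) -> str:
    """Render a schema.org Product as a JSON-LD string (illustrative input keys)."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": product["title"],
        "description": product["description"],
        "sku": product["sku"],
        "gtin": product["gtin"],
        "brand": {"@type": "Brand", "name": product["brand"]},
        "image": product["images"],
        "offers": {
            "@type": "Offer",
            # Price is emitted as a string with two decimals, plus an ISO currency code.
            "price": f'{product["price"]:.2f}',
            "priceCurrency": product["currency"],
            "availability": "https://schema.org/InStock"
            if product["in_stock"]
            else "https://schema.org/OutOfStock",
        },
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": product["rating"],
            "reviewCount": product["review_count"],
        },
    }
    return json.dumps(data, indent=2)


# Hypothetical product record used only for this example.
example = {
    "title": "Product Name",
    "description": "Wireless over-ear headphones with active noise cancellation.",
    "sku": "SKU123",
    "gtin": "00123456789012",
    "brand": "ExampleBrand",
    "images": ["https://store.example/images/sku123.jpg"],
    "price": 29.99,
    "currency": "USD",
    "in_stock": True,
    "rating": 4.6,
    "review_count": 128,
}

print(product_jsonld(example))
```

The output would typically be placed inside a `<script type="application/ld+json">` tag in the page's HTML so crawlers can read it without rendering JavaScript.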
Crawler limitations:
- JavaScript rendering issues break extraction
- Rate limits prevent complete catalog indexing
- Crawl frequency is daily to weekly, not real-time
- Complex product variants confuse parsers
- Dynamic pricing and availability go stale between crawls
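Platform-specific crawlers:
- PerplexityBot: Crawls for Perplexity search and Computer agent
- GPTBot and OAI-SearchBot: OpenAI’s crawlers for model training and ChatGPT search, respectively
- Google-Extended: A robots.txt token (not a separate crawler) that controls whether content fetched by Googlebot is used for Gemini; AI Overviews follow standard Googlebot rules
- ClaudeBot: Crawls for Claude conversations
- bingbot: Crawls for Bing, whose index Copilot draws on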
Shopti’s crawler audit finds that 89% of stores have at least one crawler blocking issue preventing proper AI indexing.
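One way to spot an accidental block is to run your robots.txt through Python's standard `urllib.robotparser` and check each AI user-agent token against a product URL. The sketch below is only that: the token list, the sample robots.txt, and the URL are assumptions for illustration, and you should confirm the current tokens in each platform's own documentation.

```python
from urllib.robotparser import RobotFileParser

# Example tokens only; verify against each platform's published crawler docs.
AI_CRAWLERS = ["PerplexityBot", "GPTBot", "Google-Extended", "ClaudeBot"]


def blocked_crawlers(robots_txt: str, url: str, agents=AI_CRAWLERS) -> list[str]:
    """Return the user-agent tokens that robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt that blocks one AI crawler and allows everything else.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_crawlers(SAMPLE_ROBOTS, "https://store.example/products/sku123"))
```

This only covers robots.txt. WAF rules and IP-based blocking need to be checked separately, for example in server logs.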
Pathway 2: Structured Feeds (Google Shopping, llms.txt, JSON APIs)
Structured feeds are pre-formatted product data files that AI platforms consume directly. Feeds eliminate parsing complexity and ensure data consistency.
How it works:
- Store generates feed file (XML, JSON, or llms.txt)
- Uploads to public URL or makes available via endpoint
- AI platform fetches feed periodically
- Parses structured data without HTML rendering
- Updates product database with feed contents
Feed formats:
XML feeds (Google Shopping standard):
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
  <channel>
    <item>
      <g:id>SKU123</g:id>
      <g:title>Product Name</g:title>
      <g:price>29.99 USD</g:price>
      <g:availability>in stock</g:availability>
      <g:gtin>00123456789012</g:gtin>
    </item>
  </channel>
</rss>
JSON feeds:
{
  "products": [
    {
      "id": "SKU123",
      "title": "Product Name",
      "price": {"amount": 29.99, "currency": "USD"},
      "availability": "in_stock"
    }
  ]
}
llms.txt (an emerging convention with no standardized product schema; the layout below is illustrative):
# Product: SKU123
Title: Product Name
Price: $29.99 USD
Availability: In stock
GTIN: 00123456789012
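As a rough illustration of feed generation, the Google Shopping style XML format shown earlier can be produced with Python's standard `xml.etree.ElementTree`. This is a sketch under stated assumptions: the product dictionaries, store title, and URL are hypothetical, and the items are wrapped in a `<channel>` element because RSS 2.0 requires one.

```python
import xml.etree.ElementTree as ET

G_NS = "http://base.google.com/ns/1.0"
# Ask ElementTree to serialize this namespace with the conventional "g" prefix.
ET.register_namespace("g", G_NS)


def build_feed(store_title: str, store_url: str, products: list[dict]) -> str:
    """Build an RSS 2.0 product feed with g:-namespaced item fields."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = store_title
    ET.SubElement(channel, "link").text = store_url
    ET.SubElement(channel, "description").text = f"Product feed for {store_title}"

    for product in products:
        item = ET.SubElement(channel, "item")
        fields = {
            "id": product["id"],
            "title": product["title"],
            "price": f'{product["price"]:.2f} {product["currency"]}',
            "availability": product["availability"],
            "gtin": product["gtin"],
        }
        for name, value in fields.items():
            ET.SubElement(item, f"{{{G_NS}}}{name}").text = value

    return ET.tostring(rss, encoding="unicode")


# Hypothetical catalog entry matching the sample values used in this article.
feed_xml = build_feed(
    "Example Store",
    "https://store.example",
    [{"id": "SKU123", "title": "Product Name", "price": 29.99,
      "currency": "USD", "availability": "in stock", "gtin": "00123456789012"}],
)
print(feed_xml)
```

Building the tree with a library rather than string concatenation means special characters in titles (such as `&` or `<`) are escaped automatically.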
Feed advantages over crawlers:
- 100% data extraction accuracy (no parsing errors)
- Faster processing (seconds vs minutes)
- Includes all products (crawlers may timeout)
- Variant data clearly structured
- Supports incremental updates
- Platform-specific field mapping
Feed generation by platform:
| Platform | Native Support | Recommended Apps/Plugins |
|---|---|---|
| Shopify | GraphQL API (auth required), CSV export (manual) | Data Feed Watch, Product Feed Manager |
| WooCommerce | REST API (auth required) | Product Feed Manager, WP All Export |
| BigCommerce | Native feed generation | None required |
| Custom | Requires custom implementation | N/A |
Feed optimization requirements:
- Generate at least daily (hourly for fast-moving inventory)
- Include all variant-level products
- Use consistent field naming conventions
- Validate against schema requirements
- Host on CDN for fast delivery
- Implement cache invalidation on updates
- Keep feed under 50MB for fast parsing
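A pre-publish check can catch several of these issues before an AI platform fetches the feed. The sketch below assumes the JSON feed layout shown earlier in this section; the required-field list and error messages are illustrative choices, not a platform specification.

```python
import json

REQUIRED_FIELDS = ("id", "title", "price", "availability")
MAX_FEED_BYTES = 50 * 1024 * 1024  # the 50MB guideline from the list above


def validate_feed(feed: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the feed passed."""
    errors = []
    for index, product in enumerate(feed.get("products", [])):
        label = product.get("id") or f"#{index}"
        for field in REQUIRED_FIELDS:
            if not product.get(field):
                errors.append(f"{label}: missing {field}")
        price = product.get("price")
        if isinstance(price, dict) and price:
            if not isinstance(price.get("amount"), (int, float)):
                errors.append(f"{label}: price.amount must be a number")
            if not price.get("currency"):
                errors.append(f"{label}: missing price.currency")
    # Measure the serialized size, since that is what gets downloaded and parsed.
    size = len(json.dumps(feed).encode("utf-8"))
    if size > MAX_FEED_BYTES:
        errors.append(f"feed is {size} bytes, over the {MAX_FEED_BYTES} byte limit")
    return errors


good_feed = {"products": [{"id": "SKU123", "title": "Product Name",
                           "price": {"amount": 29.99, "currency": "USD"},
                           "availability": "in_stock"}]}
bad_feed = {"products": [{"id": "SKU124", "title": "",
                          "price": {"amount": "9.99"}}]}

print(validate_feed(good_feed))
print(validate_feed(bad_feed))
```

Running a check like this as part of the feed generation job lets you fail the build instead of publishing an incomplete feed.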
Feed-based discovery accounts for 45% of AI agent product recommendations according to OpenAI’s retrieval documentation, yet only 23% of ecommerce stores maintain structured feeds.
Pathway 3: API Integrations (Direct Platform Connections)
API integrations provide real-time, authenticated access to product data. AI platforms with official partnerships access stores via APIs instead of crawling.
How it works:
- Store exposes REST or GraphQL API endpoints
- AI platform authenticates with OAuth tokens
- Makes real-time requests for product data
- Receives structured JSON responses
- Caches responses briefly for performance
API requirements:
- Public documentation of endpoints
- OAuth 2.0 authentication
- Rate limiting (typically 100-1000 requests/minute)
- Comprehensive product data coverage
- Error handling and status codes
- Webhook support for inventory updates
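The rate limiting requirement is commonly implemented with a token bucket. The sketch below is one possible approach, not a description of how any particular platform enforces limits; the class name and the injectable `clock` parameter are choices made here to keep the example testable.

```python
import time


class TokenBucket:
    """Allow up to rate_per_minute requests per minute, refilling continuously."""

    def __init__(self, rate_per_minute: int, clock=time.monotonic):
        self.capacity = float(rate_per_minute)
        self.tokens = float(rate_per_minute)  # start with a full bucket
        self.refill_per_second = rate_per_minute / 60.0
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when the caller should be throttled."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per API client; 100 requests/minute is the low end of the range above.
bucket = TokenBucket(rate_per_minute=100)
print(bucket.allow())
```

When `allow()` returns False, an API would typically respond with HTTP 429 (Too Many Requests) so the client knows to back off.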
API endpoints AI platforms expect:
query GetProduct($id: ID!) {
  product(id: $id) {
    id
    title
    description
    price {
      amount
      currency
    }
    availability
    variants {
      id
      title
      price {
        amount
        currency
      }
      availability
    }
    images {
      url
      altText
    }
    brand
    category
    gtin
  }
}
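For context, a client would usually send a query like this as a JSON body in an HTTP POST, with the OAuth access token in an `Authorization: Bearer` header. The sketch below only builds the request and does not send it, because the endpoint URL and token are placeholders and the schema is the illustrative one from this section.

```python
import json
import urllib.request

# Trimmed version of the query above, using the same illustrative schema.
PRODUCT_QUERY = """
query GetProduct($id: ID!) {
  product(id: $id) {
    id
    title
    price { amount currency }
    availability
  }
}
"""


def build_product_request(endpoint: str, token: str, product_id: str) -> urllib.request.Request:
    """Build (but do not send) a GraphQL-over-HTTP POST request for one product."""
    body = json.dumps({"query": PRODUCT_QUERY, "variables": {"id": product_id}})
    return urllib.request.Request(
        endpoint,
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder endpoint and token; a real integration would obtain the token via OAuth 2.0.
request = build_product_request("https://store.example/graphql", "EXAMPLE_TOKEN", "SKU123")
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it against a real endpoint.
```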
API advantages:
- Real-time inventory and pricing
- No crawling overhead for stores
- Authentic data source (verified platform partnership)
- Supports complex queries and filters
- Platform-specific data formatting
API challenges:
- Requires engineering resources to implement
- Authentication setup complexity
- Rate limiting restricts large catalog access
- Platform-specific implementations required
- API version changes break integrations
Platforms with official API integrations:
- Shopify: Official OpenAI integration via Shopify Flow
- Amazon: Alexa Shopping API integration
- eBay: Developer API for AI shopping tools
- Etsy: Public API for agent integrations
API-based discovery represents 18% of AI agent product recommendations but has the highest conversion rate (32% vs 15% for crawler-based) because of real-time data accuracy.
Learn more about platform-specific API requirements in our platform deep dive.
Pathway 4: Real-Time Scraping (On-Demand Page Extraction)
Real-time scraping happens when an AI agent needs immediate product information during a user conversation. The agent crawls specific pages on-demand rather than relying on pre-indexed data.
How it works:
- User asks agent about a specific product or store
- Agent immediately crawls the relevant page
- Parses HTML and extracts product data
- Uses extracted data for real-time response
- May cache briefly for follow-up questions
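The parse-and-extract step can be approximated with Python's standard `html.parser`. The sketch below pulls schema.org Product JSON-LD out of server-rendered HTML; it is an assumption about how an agent might read a page, not a description of any specific agent, and the sample HTML is made up.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect schema.org Product objects from <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.products = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                doc = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # malformed markup is skipped, which is how products go missing
            if isinstance(doc, dict) and doc.get("@type") == "Product":
                self.products.append(doc)


def extract_products(html: str) -> list[dict]:
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.products


SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Product Name",
 "offers": {"@type": "Offer", "price": "29.99", "priceCurrency": "USD"}}
</script>
</head><body><h1>Product Name</h1></body></html>
"""

print(extract_products(SAMPLE_HTML))
```

Note that this only works when the markup is present in the initial HTML response. If the JSON-LD is injected by client-side JavaScript, a parser like this sees nothing, which is why server-side rendering matters for this pathway.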
Real-time scraping triggers:
- User provides specific product URL
- Agent needs current pricing or availability
- User asks “what does [store] sell?”
- Comparison queries requiring current data
- Follow-up questions to recommendations
Optimization for real-time scraping:
- Fast page load (under 1 second)
- Server-side rendering (no JavaScript dependency)
- Clear product page structure
- Comprehensive schema markup
- No CAPTCHAs or anti-bot measures
- Efficient caching strategies
Real-time scraping challenges:
- Server load from burst traffic
- Rate limiting prevents frequent requests
- JavaScript rendering failures
- Dynamic content visibility issues
- IP-based blocking
- Session requirements
Platform-specific scraping behavior:
- ChatGPT: Scrapes in real-time for product URLs shared by users
- Perplexity Computer: Scrapes for booking and purchasing tasks
- Claude: Scrapes for technical specifications and documentation
- Google AI Overviews: Scrapes for current pricing comparisons
Real-time scraping accounts for 12% of AI agent discovery events but, like API integrations, delivers real-time data freshness (versus days-old data for crawlers).
Pathway 5: User-Uploaded Data (Manual Product Information)
Users sometimes upload product information directly to AI agents, bypassing store-hosted discovery mechanisms entirely. This happens when users copy product details, share screenshots, or provide specifications.
How it works:
- User copies product description from store
- Pastes into AI agent conversation
- Agent processes unstructured text
- Extracts product attributes from text
- Uses extracted data for recommendations
User upload types:
- Text descriptions and specifications
- Product images with OCR extraction
- Screenshots of product pages
- CSV exports of product catalogs
- Links to competitor products for comparison
Optimization for user-uploaded data:
- Clear, concise product descriptions
- Structured specification tables
- High-contrast product images
- Copy-friendly text formatting
- Exportable product data (CSV, PDF)
- Downloadable spec sheets
User upload advantages:
- No store-side technical requirements
- Works even if crawlers blocked
- User provides exactly what agent needs
- Can include context agent cannot infer
User upload limitations:
- Requires user initiative
- Data quality varies by user
- Incomplete information common
- No automated discovery
- Does not scale to full catalog
When users upload data:
- Technical products with complex specifications
- B2B purchases requiring detailed quotes
- Custom or made-to-order products
- Products with non-standard attributes
- Comparison shopping across multiple stores
User-uploaded data represents 7% of AI agent discovery but converts at 28% (higher than crawler-based) because users provide intent context with the data.
Pathway 6: Semantic Search Engines (Vector-Based Product Discovery)
Semantic search engines use vector embeddings to match user queries with product descriptions. Unlike keyword search, semantic search understands intent and context.
How it works:
- Store product descriptions are converted to vector embeddings
- User queries are also converted to vectors
- Vector similarity matching finds relevant products
- Results ranked by semantic relevance
- Combined with other signals for final ranking
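The similarity matching step is usually a cosine similarity between the query vector and each product vector. The sketch below uses tiny hand-written vectors as stand-ins for real embeddings; in practice the vectors come from an embedding model and have hundreds or thousands of dimensions, and the three dimensions and product names here are purely hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_products(query_vec: list[float], product_vecs: dict[str, list[float]]):
    """Return (product, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in product_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "embeddings": (quietness, gaming, office use). Invented for illustration.
query = [0.9, 0.1, 0.8]  # "quiet headphones for office"
products = {
    "gaming_headset": [0.1, 0.9, 0.0],
    "office_anc_headphones": [0.8, 0.0, 0.9],
}

for name, score in rank_products(query, products):
    print(f"{name}: {score:.3f}")
```

Because the score depends on direction rather than exact word overlap, a description that talks about focus in open offices can match an office-headphones query even without sharing its keywords.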
Semantic search requirements:
- Comprehensive product descriptions
- Natural language phrasing
- Use-case descriptions
- Comparison with alternatives
- Customer benefit statements
- Industry-standard terminology
Product description optimization for semantic search:
Bad description (keyword-stuffed):
Wireless headphones bluetooth noise cancelling over-ear black audio
Good description (semantic-rich):
These wireless over-ear headphones feature active noise cancellation for focus in open offices. They connect over Bluetooth 5.3, offer 30-hour battery life, and use memory foam ear cushions for comfort during long work sessions. Ideal for remote work, commuting, and business travel.
Semantic search signals:
- Intent matching (user wants “quiet headphones for office” vs “gaming headset”)
- Use-case alignment (product mentions “remote work” vs “gaming”)
- Benefit focus (product emphasizes “focus” vs “entertainment”)
- Contextual relevance (product mentions “business travel,” matching the user’s context)
AI platforms using semantic search:
- ChatGPT: Semantic matching across crawled content
- Perplexity: Vector-based search for research queries
- Google AI Overviews: Semantic understanding of user intent
- Amazon Rufus: Semantic product search within Amazon
Semantic search optimization differs from traditional SEO. Traditional SEO targets keywords and backlinks. Semantic search targets natural language descriptions and use-case alignment.
Stores with semantic-rich descriptions see 2.4x higher inclusion in AI recommendations compared to keyword-focused descriptions, according to OpenAI’s retrieval benchmarks.
Discovery Pathway Comparison
| Pathway | Implementation Complexity | Data Freshness | Coverage | AI Agent Adoption |
|---|---|---|---|---|
| Web Crawlers | Low | Daily to weekly | High | 67% |
| Structured Feeds | Medium | Hourly to daily | Complete | 45% |
| API Integrations | High | Real-time | Complete | 18% |
| Real-Time Scraping | Low | Real-time | On-demand | 12% |
| User-Uploaded Data | None | Upload time | Variable | 7% |
| Semantic Search | Medium | Indexed | Complete | 100% |
Key findings:
- No single pathway provides complete coverage
- Most AI agents use multiple pathways simultaneously
- Pathways complement each other (crawlers for breadth, feeds for accuracy, APIs for real-time)
- Best-performing stores optimize for all six pathways
Stores optimizing for all six pathways see 3.2x higher AI recommendation rates than stores using only one or two methods, based on Shopti’s customer data from Q2 2026.
Multi-Pathway Strategy Implementation
Foundation: Web Crawlers + Structured Feeds
Start with these two pathways as your foundation. They provide broad coverage and are relatively low-effort.
Implementation steps:
- Configure robots.txt to allow AI crawlers
- Add Product schema markup to all product pages
- Generate structured feeds (XML and JSON)
- Host feeds on public URLs
- Submit feeds to AI platform submission endpoints
Expected timeline:
- Crawler configuration: 1-2 days
- Schema markup: 1-2 weeks (depending on catalog size)
- Feed generation: 3-5 days with apps/plugins
Expected results:
- 40-60% increase in AI recommendations
- Broader coverage across AI platforms
- More consistent recommendation quality
Enhancement: API Integration + Real-Time Scraping
Add these pathways for real-time accuracy and platform partnerships.
Implementation steps:
- Expose public REST or GraphQL API
- Implement OAuth authentication
- Document API endpoints
- Optimize pages for fast loading
- Implement server-side rendering
- Add comprehensive schema markup
Expected timeline:
- API development: 2-4 weeks
- Page optimization: 1-2 weeks
Expected results:
- Real-time inventory and pricing in recommendations
- Higher conversion rates (32% vs 15%)
- Partnership opportunities with AI platforms
Advanced: Semantic Search + User Upload Support
Optimize for semantic matching, which drives the highest inclusion rates, and for user uploads, which convert well.
Implementation steps:
- Rewrite product descriptions with natural language
- Add use-case and benefit descriptions
- Create downloadable spec sheets
- Implement copy-friendly text formatting
- Add high-contrast product images
- Include comparison tables on product pages
Expected timeline:
- Content optimization: 2-4 weeks
- Spec sheet creation: 1-2 weeks
Expected results:
- 2.4x higher semantic search inclusion
- 28% conversion rate on user uploads
- Better user experience for manual product input
Platform-Specific Pathway Prioritization
Shopify Stores
Priority pathways:
- Web crawlers (robots.txt + schema)
- Structured feeds (via feed apps)
- API integration (Shopify Flow + OpenAI)
- Real-time scraping (page optimization)
- Semantic search (content optimization)
Why this order:
- Shopify makes crawlers easy (default robots.txt is permissive)
- Feed apps provide turnkey feed generation
- Official OpenAI integration via Shopify Flow
- Liquid templates support fast page loading
- Product description fields support rich content
Quick wins:
- Install feed app (1 day)
- Add schema markup to theme (1-2 days)
- Configure Shopify Flow for OpenAI (2-3 days)
WooCommerce Stores
Priority pathways:
- Web crawlers (robots.txt + schema plugin)
- Structured feeds (via plugins)
- Real-time scraping (hosting optimization)
- Semantic search (content optimization)
- API integration (custom development)
Why this order:
- WooCommerce plugins handle schema and feeds
- Hosting quality affects scraping performance
- WordPress supports rich product descriptions
- Custom API requires development resources
Quick wins:
- Install schema plugin (1 day)
- Install feed plugin (1 day)
- Upgrade hosting if needed (1-2 days)
Custom Platforms
Priority pathways:
- Web crawlers (robots.txt + schema)
- Structured feeds (custom implementation)
- API integration (build once, use everywhere)
- Real-time scraping (page optimization)
- Semantic search (content optimization)
Why this order:
- Full control over all pathways
- API integration deserves early investment (future-proofing)
- Feeds can be generated during build process
- Semantic search provides long-term SEO benefits
Quick wins:
- Add robots.txt (1 day)
- Implement Product schema (2-3 days)
- Build JSON feed endpoint (2-3 days)
Measuring Discovery Pathway Performance
Track these metrics for each pathway:
Crawler metrics:
- Crawler visit frequency (daily, weekly, monthly)
- Pages crawled per visit
- 403/404 error rates
- Crawl time per page
- Server log analysis for AI crawler user agents
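Most of these crawler metrics can be derived from standard access logs. The sketch below assumes the common "combined" log format used by Apache and Nginx, and the user-agent tokens and sample lines are examples only; adjust the pattern if your server logs a different layout.

```python
import re
from collections import Counter

# Combined log format: ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
AI_AGENT_TOKENS = ("PerplexityBot", "GPTBot", "ClaudeBot")  # example tokens


def summarize_ai_crawls(lines):
    """Count visits and the 403/404 error rate per AI crawler token."""
    visits, errors = Counter(), Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue
        agent = match["agent"].lower()
        token = next((t for t in AI_AGENT_TOKENS if t.lower() in agent), None)
        if token is None:
            continue  # not one of the AI crawlers we track
        visits[token] += 1
        if match["status"] in ("403", "404"):
            errors[token] += 1
    return {t: {"visits": visits[t], "error_rate": errors[t] / visits[t]} for t in visits}


# Made-up log lines for illustration.
SAMPLE_LOG = [
    '203.0.113.5 - - [12/May/2026:10:00:00 +0000] "GET /products/sku123 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '203.0.113.5 - - [12/May/2026:10:00:02 +0000] "GET /products/sku999 HTTP/1.1" 403 312 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '198.51.100.7 - - [12/May/2026:10:01:00 +0000] "GET /products/sku123 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '192.0.2.44 - - [12/May/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

print(summarize_ai_crawls(SAMPLE_LOG))
```

A high 403 rate for a single crawler token usually points to a WAF or bot-protection rule rather than robots.txt, since robots.txt blocks show up as an absence of requests.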
Feed metrics:
- Feed fetch frequency
- Feed parse success rate
- Feed validation errors
- Feed size and generation time
- CDN cache hit rate
API metrics:
- API request volume
- Response time (p50, p95, p99)
- Error rate (4xx, 5xx)
- Rate limit utilization
- Authentication failures
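For the response-time metric, p50, p95, and p99 are percentiles of the observed latencies. The sketch below uses the nearest-rank definition, which is one of several common percentile methods; monitoring tools may interpolate and report slightly different values.

```python
def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, so no floating-point rounding is involved.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """Summarize response times as p50, p95, and p99."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


# Made-up response times in milliseconds: 1, 2, ..., 100.
response_times_ms = list(range(1, 101))
print(latency_summary(response_times_ms))
```

Tracking p95 and p99 alongside the median matters because a small share of slow responses can still cause an agent's request to time out.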
Real-time scraping metrics:
- Page load time
- JavaScript rendering success
- Schema markup extraction accuracy
- User agent identification
- IP-based blocking events
User upload metrics:
- Product description copy rate (hard to measure directly)
- Spec sheet downloads
- User-reported upload issues
- Conversion from user uploads
Semantic search metrics:
- Vector similarity scores
- Query-product match rate
- User feedback on relevance
- Click-through rates from recommendations
Shopti provides comprehensive diagnostics across all six pathways. Check your store’s agent discoverability score free at shopti.ai to see which pathways need optimization.
Common Discovery Mistakes
Mistake 1: Relying on Crawlers Only
Problem: Most stores only optimize for web crawlers, missing five other pathways entirely.
Impact: 67% lower AI recommendation rates compared to multi-pathway stores.
Fix: Implement at least web crawlers + structured feeds as foundation.
Mistake 2: Blocking AI Crawlers Accidentally
Problem: Overly broad robots.txt directives or WAF rules block AI crawlers.
Impact: Complete invisibility to AI platforms using crawler-based discovery.
Fix: Explicitly allow major AI crawlers in robots.txt and WAF allowlists.
Mistake 3: Feed Generation Gaps
Problem: Feeds exclude variants, lack required fields, or update infrequently.
Impact: AI agents skip products or recommend incomplete information.
Fix: Generate comprehensive feeds with all required fields, update at least daily.
Mistake 4: Ignoring Real-Time Scraping
Problem: Slow pages, JavaScript-only content, or anti-bot measures block on-demand scraping.
Impact: AI agents cannot access current pricing or availability.
Fix: Optimize page load, implement server-side rendering, avoid anti-bot measures.
Mistake 5: Keyword-Focused Product Descriptions
Problem: Descriptions optimized for traditional SEO rather than semantic understanding.
Impact: Poor semantic search matching, lower AI recommendation inclusion.
Fix: Rewrite descriptions with natural language, use cases, and benefits.
Mistake 6: No API Strategy
Problem: Stores expose no API or require complex authentication.
Impact: Cannot partner with AI platforms for official integrations.
Fix: Expose public API with OAuth authentication and comprehensive documentation.
Discovery Pathway ROI Analysis
Based on Shopti’s customer data from Q2 2026:
Crawler optimization (baseline):
- Effort: 2-5 days
- Cost: $0 (technical work only)
- Impact: +40% AI recommendations
- ROI: High (low effort, significant impact)
Feed generation (foundation):
- Effort: 3-7 days
- Cost: $0-29/month (apps/plugins)
- Impact: +30% AI recommendations
- ROI: Very high (moderate effort, ongoing impact)
API integration (enhancement):
- Effort: 2-4 weeks
- Cost: $0 (development work) + hosting
- Impact: +25% AI recommendations, +17 percentage points conversion rate (32% vs 15%)
- ROI: High (significant effort, high conversion impact)
Real-time scraping (enhancement):
- Effort: 1-2 weeks
- Cost: $0-50/month (hosting upgrade)
- Impact: +15% AI recommendations
- ROI: Medium (moderate effort, moderate impact)
Semantic search (advanced):
- Effort: 2-4 weeks
- Cost: $0 (content work)
- Impact: +40% AI recommendations
- ROI: Very high (moderate effort, high impact)
User upload support (advanced):
- Effort: 1-2 weeks
- Cost: $0 (content work)
- Impact: +10% AI recommendations, +13 percentage points conversion rate (28% vs 15%)
- ROI: Medium (moderate effort, moderate impact)
Multi-pathway implementation (all six):
- Total effort: 6-12 weeks
- Total cost: $0-79/month (mostly hosting and apps)
- Total impact: 3.2x AI recommendations (+220%)
- ROI: Very high (comprehensive effort, transformative impact)
Future of Discovery Pathways
Expect these trends in 2026-2027:
Bidirectional discovery:
- Stores will push updates to AI platforms instead of waiting to be discovered
- Webhooks will notify AI platforms of product changes
- AI platforms will subscribe to store data streams
Standardized protocols:
- Emerging llms.txt standard will become widely adopted
- AI agent discovery protocols will standardize (similar to sitemaps)
- Cross-platform API authentication will simplify
Real-time focus:
- AI agents will demand real-time inventory and pricing
- Feed generation will move from daily to hourly to near-real-time
- API integrations will become table stakes for ecommerce
Semantic dominance:
- Keyword-based discovery will decline
- Vector embeddings will power most product matching
- Use-case descriptions will matter more than keyword stuffing
Privacy-aware discovery:
- User privacy regulations will limit some crawling practices
- Federated learning may replace centralized data collection
- Stores will retain more control over data sharing
Action Checklist for Each Pathway
Web Crawlers:
- Review robots.txt for AI crawler rules
- Add Product schema markup to all product pages
- Test crawler access with a robots.txt testing tool (for example, the robots.txt report in Google Search Console)
- Monitor server logs for AI crawler visits
- Optimize page load times (under 2 seconds)
- Submit sitemap to AI platform submission endpoints
Structured Feeds:
- Install feed app or build custom feed generation
- Generate feeds in at least two formats (XML, JSON)
- Include all required fields (GTIN, price, availability, images)
- Configure hourly or daily generation schedule
- Host feeds on CDN for fast delivery
- Validate feeds against schema requirements
- Test feed accessibility (curl command)
API Integrations:
- Expose public REST or GraphQL API
- Implement OAuth 2.0 authentication
- Document API endpoints comprehensively
- Implement rate limiting (100-1000 req/min)
- Add webhook support for inventory updates
- Test API with AI platform integration tools
- Monitor API performance metrics
Real-Time Scraping:
- Optimize page load times (under 1 second)
- Implement server-side rendering
- Add comprehensive schema markup
- Remove CAPTCHAs and anti-bot measures
- Test scraping with agent user agents
- Monitor scraping performance metrics
User Upload Support:
- Rewrite product descriptions with natural language
- Create downloadable spec sheets (PDF, CSV)
- Implement copy-friendly text formatting
- Add high-contrast product images
- Include comparison tables on product pages
- Test OCR extraction from product images
Semantic Search:
- Rewrite product descriptions with use cases and benefits
- Add comparison with alternatives
- Use industry-standard terminology
- Include customer benefit statements
- Avoid keyword stuffing
- Test semantic search relevance with AI queries
FAQ
Which discovery pathway is most important? Start with web crawlers and structured feeds as your foundation. These two pathways provide 70% of AI agent discovery coverage and are relatively low-effort. Add other pathways based on your resources and goals.
Do I need to optimize for all six pathways? Optimizing for all six provides the best results (3.2x higher recommendation rates), but you can start with crawlers and feeds, then add other pathways incrementally based on impact and effort.
How do I know which pathways AI agents are using? Monitor your server logs for crawler user agents, track feed fetch requests, monitor API usage, and use AI agent monitoring tools like DemandSphere Radar. Shopti’s diagnostic tool provides comprehensive pathway visibility.
What if I block some pathways accidentally? You may be invisible to AI platforms using those pathways. Review your robots.txt, check WAF rules, test feed accessibility, and verify API endpoints. Shopti’s audit identifies blocking issues.
How often should I update each pathway? Crawlers update daily to weekly, feeds should update hourly to daily, APIs provide real-time data, real-time scraping happens on-demand, user uploads are event-driven, and semantic search indexes whenever content changes.
Do different AI platforms use different pathways? Yes. ChatGPT relies heavily on web crawlers and real-time scraping, Perplexity uses all pathways equally, Google AI Overviews prioritize crawlers and semantic search, and platform-specific agents (Amazon Rufus) use APIs and feeds.
Can I measure the ROI of each pathway? Yes. Track AI recommendation rates, conversion rates, and revenue attribution by pathway. Shopti’s analytics provide pathway-specific ROI metrics. In general, API integration has the highest conversion rate (32%), while semantic search has the highest inclusion rate (2.4x).
What if my platform does not support a pathway? Implement workarounds. For example, if your platform does not support API endpoints, use feed generation plus real-time scraping. If semantic search is challenging, focus on user upload support and feed optimization.
Sources
- OpenAI Retrieval Documentation. https://platform.openai.com/docs/guides/retrieval
- Google Shopping Feed Requirements. https://support.google.com/merchants/answer/188494
- Schema.org Product Specification. https://schema.org/Product
- Perplexity AI Crawler Documentation. https://www.perplexity.ai/info/perplexitybot
- DemandSphere Radar AI Visibility Report, Q2 2026
- Shopti Customer Data, Q2 2026 (aggregate analysis of 500+ stores)
