Troubleshooting Guide
Common issues and solutions for web crawling: quick fixes for slow crawls, blocked requests, and data quality problems. Most problems have simple solutions.
Normal Behavior
Before troubleshooting, check if what you're seeing is actually normal:
- Crawl speed: 2-3 pages per minute is normal for quality data
- Some errors: 5-10% error rate is expected on large sites
- Processing time: Results take a few minutes to appear
- Data format: We extract clean content, not raw HTML
If these look normal, your crawl is working fine.
Speed Issues
Slow Crawling
Normal speed: 2-3 pages per minute
Why: Quality over speed - slower crawling prevents corrupted data and getting blocked
When to investigate:
- Less than 1 page per minute consistently
- Crawl stuck on same page for 10+ minutes
- Multiple timeouts in a row
Solutions:
# Reduce timeout for faster failures
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://target.com",
    "maxPages": 25,
    "timeout": 15000
  }'
# Skip problematic pages
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://target.com",
    "maxPages": 50,
    "skipErrors": true
  }'
Tips:
- Start with 10-20 pages first
- Try different times of day
- Target specific sections vs the entire site (see the example below)
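A small, targeted test crawl combines these tips: point the crawl at the section you care about with a low page cap, then widen once the results look right. A minimal sketch (the /products path is just an example - use whatever section matters to you):
# Test crawl: one section of the site, small page cap
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://target.com/products",
    "maxPages": 10
  }'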
Page Timeouts
Common causes and fixes:
Slow website (60% of cases)
# Increase timeout
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "timeout": 45000,
    "maxPages": 20
  }'
Heavy JavaScript (25% of cases)
# Enable JavaScript rendering
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "enableJavaScript": true,
    "timeout": 30000
  }'
Server overload (10% of cases)
# Add delays between requests
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "crawlDelay": 3000,
    "respectRobots": true
  }'
Geographic restrictions (5% of cases)
- Contact support for proxy options
- Try different start URLs on same domain
Access Issues
Getting Blocked (403/429 Errors)
Why: Website thinks you're crawling too aggressively
Frequency: Less than 2% of crawls with default settings
Immediate fix:
{
  "respectRobots": true,
  "crawlDelay": 5000,
  "maxConcurrentPages": 1,
  "userAgent": "BestAIScraper/1.0"
}
Gentle approach:
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://target.com",
    "crawlDelay": 10000,
    "maxPages": 10,
    "respectRobots": true
  }'
Authentication Required
Basic authentication:
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://secure-site.com",
    "auth": {
      "type": "basic",
      "username": "your-username",
      "password": "your-password"
    }
  }'
Cookie-based sessions:
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://member-site.com",
    "cookies": [
      {
        "name": "session_id",
        "value": "abc123xyz",
        "domain": "member-site.com"
      }
    ]
  }'
Custom headers (API keys, tokens):
# The -H Authorization header authenticates you to BestAIScraper;
# the "headers" in the body are sent to the target site.
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://api-site.com",
    "headers": {
      "X-API-Key": "your-api-key",
      "Authorization": "Bearer your-token"
    }
  }'
Contact support if you need help setting up authentication.
Data Quality Issues
Data Looks Different
This is usually good - we extract clean, structured data instead of raw HTML.
What we transform:
<!-- Raw HTML -->
<div class="product-title">Widget Pro</div>
<span class="price">$299.99</span>
<!-- Clean output -->
Product: Widget Pro
Price: $299.99
Common "issues" that are actually features:
Missing navigation/footer: We extract main content, skip boilerplate
- Solution: Use
includeNavigation: true
if needed
Different formatting: We convert HTML to clean markdown
- Solution: Raw HTML available with
includeRawHTML: true
Missing images: We extract text content by default
- Solution: Enable with
extractImages: true
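If you want any of that content back, the options above can be combined in a single request. For example:
# Request navigation, raw HTML, and images alongside the clean extraction
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://target.com",
    "includeNavigation": true,
    "includeRawHTML": true,
    "extractImages": true
  }'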
Pages Have No Data
JavaScript-heavy content (70%)
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "enableJavaScript": true,
    "waitForContent": 5000
  }'
Empty/error pages (20%)
- Normal - we skip pages with no valuable content
- Check status codes - 404s and redirects are expected
Authentication required (5%)
- See authentication section above
Content behind forms (3%)
- Contact support for form automation
Unusual page structure (2%)
- Single-page apps, unusual CMS systems
- Contact support for custom extractors
API Issues
API Calls Not Working
Quick diagnostics:
# Test basic API access
curl -X GET https://api.bestaiscraper.com/projects \
-H "Authorization: Bearer YOUR_API_KEY"
# Check API key format
echo -n "$YOUR_API_KEY" | wc -c # Should be 40+ characters (-n avoids counting the newline)
# Verify project exists
curl -X GET https://api.bestaiscraper.com/projects/123 \
-H "Authorization: Bearer YOUR_API_KEY"
Common issues:
401 Unauthorized: Wrong or expired API key
- Generate new key in dashboard settings
404 Not Found: Wrong project ID or endpoint URL
- Check project ID in dashboard URL
429 Too Many Requests: You're being rate limited
- Add 1-second delays between API calls (see the sketch below)
Timeout: Large crawls or slow networks
- Use webhooks for async processing
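If a script is tripping the rate limit, a short sleep between calls is usually enough. A minimal bash sketch polling several sessions (the session IDs are placeholders):
# Poll sessions with a 1-second gap to stay under the rate limit
for id in 456 457 458; do
  curl -s https://api.bestaiscraper.com/sessions/$id \
    -H "Authorization: Bearer YOUR_API_KEY"
  sleep 1
done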
Webhook Issues
Debugging checklist:
# Test endpoint manually
curl -X POST https://your-app.com/webhook \
-H "Content-Type: application/json" \
-d '{"test": true}'
# Check webhook config
curl -X GET https://api.bestaiscraper.com/projects/123/webhooks \
-H "Authorization: Bearer YOUR_API_KEY"
Common problems:
- SSL certificate: Ensure valid HTTPS
- Response code: Return 200 OK
- Timeout: Respond within 10 seconds
- Content-Type: Accept application/json
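You can verify all four points in one command: curl reports the status code and total response time, and --max-time fails the call if your endpoint takes longer than 10 seconds (an invalid HTTPS certificate also makes it fail):
curl -s -o /dev/null -w "status: %{http_code}, time: %{time_total}s\n" \
  --max-time 10 \
  -X POST https://your-app.com/webhook \
  -H "Content-Type: application/json" \
  -d '{"test": true}'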
Getting Help
When to Contact Support
Contact us for:
- Billing or account issues
- Site-specific blocking problems
- Custom authentication setup
- Enterprise features
Try self-service first for:
- Slow crawls (usually normal)
- Basic timeouts (increase timeout)
- Standard authentication (use guides above)
How to Get Help
Include this info:
- Project ID (from dashboard URL)
- Crawl Session ID (if applicable)
- Target website
- Expected vs actual result
- Steps already tried
Contact methods:
- Email: support@bestaiscraper.com (2-4 hours)
- Live chat: Dashboard support widget (immediate)
Advanced Diagnostics
Check Crawl Health
# Session overview
curl -X GET https://api.bestaiscraper.com/sessions/456 \
-H "Authorization: Bearer $API_KEY"
# Failed pages analysis
curl -X GET https://api.bestaiscraper.com/sessions/456/pages?error=true \
-H "Authorization: Bearer $API_KEY"
# Queue status
curl -X GET https://api.bestaiscraper.com/sessions/456/queue \
-H "Authorization: Bearer $API_KEY"
Custom Configurations
High-security sites:
{
  "respectRobots": true,
  "crawlDelay": 10000,
  "userAgent": "Mozilla/5.0 (compatible; BestAIScraper/1.0)",
  "maxRetries": 1,
  "timeout": 60000
}
JavaScript-heavy sites:
{
  "enableJavaScript": true,
  "waitForContent": 10000,
  "scrollPage": true,
  "timeout": 45000
}
Large e-commerce sites:
{
  "maxPages": 100,
  "crawlDelay": 2000,
  "focusAreas": ["products", "categories"],
  "skipPatterns": ["reviews", "user-content"]
}
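Each of these is a request body for the crawl endpoint used throughout this guide. Applying the JavaScript-heavy profile, for example:
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://target.com",
    "enableJavaScript": true,
    "waitForContent": 10000,
    "scrollPage": true,
    "timeout": 45000
  }'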
Prevention
Best Practices
Before every crawl:
- Start small (10-20 pages) to test
- Check robots.txt for restrictions (see the check below)
- Test during off-peak hours
- Use descriptive project names
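Checking robots.txt takes one command - look for Disallow rules and any Crawl-delay directive before you start:
# Print the first rules from the target's robots.txt
curl -s https://target.com/robots.txt | head -20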
For long-term success:
- Monitor success rates, adjust settings
- Track site changes that affect crawling
- Set up alerts for important sites
- Document what works for different site types
Success Metrics
Excellent performance:
- 90%+ success rate
- 2-4 pages per minute
- Rich content from most pages
Good performance:
- 80-90% success rate
- 1-2 pages per minute
- Clean content from key pages
Investigate if:
- Below 70% success rate
- Less than 1 page per minute
- Mostly empty pages
Focus on quality over quantity - 50 valuable pages beats 200 empty ones.
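To see where a crawl falls against these benchmarks, pull the session overview and compute the rate. A sketch using jq - note that pagesCompleted and pagesFailed are illustrative field names, so check the actual shape of your session response first:
# Field names below are assumptions - inspect your session response first
curl -s https://api.bestaiscraper.com/sessions/456 \
  -H "Authorization: Bearer YOUR_API_KEY" \
  | jq '{successRate: (.pagesCompleted / (.pagesCompleted + .pagesFailed) * 100)}'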
You're Set
This covers 95% of issues you'll encounter. The system is reliable - most problems have simple solutions.
Remember: Every "problem" is just a step toward better data extraction. Teams that push through initial challenges build the best competitive intelligence systems.
Contact support if you're still stuck. We solve problems, not create more paperwork.