Troubleshooting Guide

Common issues and solutions for web crawling. Quick fixes for slow crawls, blocked requests, and data quality problems.

Last updated January 6, 2025


Common problems and how to fix them. Most issues have simple solutions.

Normal Behavior

Before troubleshooting, check if what you're seeing is actually normal:

  • Crawl speed: 2-3 pages per minute is normal for quality data
  • Some errors: 5-10% error rate is expected on large sites
  • Processing time: Results take a few minutes to appear
  • Data format: We extract clean content, not raw HTML

If these look normal, your crawl is working fine.
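At 2-3 pages per minute, you can estimate how long a crawl should take before worrying that it is stuck. A small helper for the arithmetic (just a convenience sketch, not part of the API):

```shell
# Estimate crawl duration in minutes, rounding up.
# Usage: crawl_eta <pages> <pages-per-minute>
crawl_eta() {
  pages=$1
  per_min=$2
  echo $(( (pages + per_min - 1) / per_min ))
}

crawl_eta 50 2   # a 50-page crawl at 2 pages/min: about 25 minutes
```

If a crawl has run well past this estimate with no progress, move on to the speed checks below.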

Speed Issues

Slow Crawling

Normal speed: 2-3 pages per minute
Why: we prioritize quality over speed - slower crawling prevents corrupted data and blocks from target sites

When to investigate:

  • Less than 1 page per minute consistently
  • Crawl stuck on same page for 10+ minutes
  • Multiple timeouts in a row

Solutions:

# Reduce timeout for faster failures
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://target.com",
    "maxPages": 25,
    "timeout": 15000
  }'

# Skip problematic pages
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://target.com",
    "maxPages": 50,
    "skipErrors": true
  }'

Tips:

  • Start with 10-20 pages first
  • Try different times of day
  • Target specific sections vs entire site

Page Timeouts

Common causes and fixes:

Slow website (60% of cases)

# Increase timeout
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://target.com",
    "timeout": 45000,
    "maxPages": 20
  }'

Heavy JavaScript (25% of cases)

# Enable JavaScript rendering
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://target.com",
    "enableJavaScript": true,
    "timeout": 30000
  }'

Server overload (10% of cases)

# Add delays between requests
curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://target.com",
    "crawlDelay": 3000,
    "respectRobots": true
  }'

Geographic restrictions (5% of cases)

  • Contact support for proxy options
  • Try different start URLs on same domain

Access Issues

Getting Blocked (403/429 Errors)

Why: Website thinks you're crawling too aggressively
Frequency: Less than 2% of crawls with default settings

Immediate fix:

{
  "respectRobots": true,
  "crawlDelay": 5000,
  "maxConcurrentPages": 1,
  "userAgent": "BestAIScraper/1.0"
}

Gentle approach:

curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://target.com",
    "crawlDelay": 10000,
    "maxPages": 10,
    "respectRobots": true
  }'

Authentication Required

Basic authentication:

curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://secure-site.com",
    "auth": {
      "type": "basic",
      "username": "your-username",
      "password": "your-password"
    }
  }'

Cookie-based sessions:

curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://member-site.com",
    "cookies": [
      {
        "name": "session_id",
        "value": "abc123xyz",
        "domain": "member-site.com"
      }
    ]
  }'

Custom headers (API keys, tokens):

curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://api-site.com",
    "headers": {
      "X-API-Key": "your-api-key",
      "Authorization": "Bearer your-token"
    }
  }'

Contact support if you need help setting up authentication.

Data Quality Issues

Data Looks Different

This is usually good - we extract clean, structured data instead of raw HTML.

What we transform:

<!-- Raw HTML -->
<div class="product-title">Widget Pro</div>
<span class="price">$299.99</span>

<!-- Clean output -->
Product: Widget Pro
Price: $299.99

Common "issues" that are actually features:

Missing navigation/footer: We extract main content, skip boilerplate

  • Solution: Use includeNavigation: true if needed

Different formatting: We convert HTML to clean markdown

  • Solution: Raw HTML available with includeRawHTML: true

Missing images: We extract text content by default

  • Solution: Enable with extractImages: true

Pages Have No Data

JavaScript-heavy content (70%)

curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://target.com",
    "enableJavaScript": true,
    "waitForContent": 5000
  }'

Empty/error pages (20%)

  • Normal - we skip pages with no valuable content
  • Check status codes - 404s and redirects are expected

Authentication required (5%)

  • See authentication section above

Content behind forms (3%)

  • Contact support for form automation

Unusual page structure (2%)

  • Single-page apps, unusual CMS systems
  • Contact support for custom extractors

API Issues

API Calls Not Working

Quick diagnostics:

# Test basic API access
curl -X GET https://api.bestaiscraper.com/projects \
  -H "Authorization: Bearer YOUR_API_KEY"

# Check API key format
printf '%s' "$YOUR_API_KEY" | wc -c  # Should be 40+ characters

# Verify project exists
curl -X GET https://api.bestaiscraper.com/projects/123 \
  -H "Authorization: Bearer YOUR_API_KEY"

Common issues:

401 Unauthorized: Wrong or expired API key

  • Generate new key in dashboard settings

404 Not Found: Wrong project ID or endpoint URL

  • Check project ID in dashboard URL

429 Rate Limiting: Too many requests

  • Add 1-second delays between API calls

Timeout: Large crawls or slow networks

  • Use webhooks for async processing
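For 429s specifically, exponential backoff works better than a fixed delay. A generic retry wrapper you could adapt (a sketch; the delay values are illustrative, and curl's -f flag makes HTTP errors exit non-zero so they get retried):

```shell
# Retry a command with exponential backoff: 1s, 2s, 4s, ...
# Usage: retry <max-attempts> <command> [args...]
retry() {
  max=$1; shift
  attempt=1
  delay=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example: retry the projects listing up to 5 times.
# retry 5 curl -sf https://api.bestaiscraper.com/projects \
#   -H "Authorization: Bearer YOUR_API_KEY"
```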

Webhook Issues

Debugging checklist:

# Test endpoint manually
curl -X POST https://your-app.com/webhook \
  -H "Content-Type: application/json" \
  -d '{"test": true}'

# Check webhook config
curl -X GET https://api.bestaiscraper.com/projects/123/webhooks \
  -H "Authorization: Bearer YOUR_API_KEY"

Common problems:

  • SSL certificate: Ensure valid HTTPS
  • Response code: Return 200 OK
  • Timeout: Respond within 10 seconds
  • Content-Type: Accept application/json
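You can capture the status code and response time from the manual test above with curl's -w flag, then check them against this list in one place. A small helper (our sketch, not part of the API):

```shell
# Validate a webhook response against the checklist above.
# Feed it values captured via:
#   curl -s -o /dev/null -w '%{http_code} %{time_total}' https://your-app.com/webhook ...
webhook_ok() {
  code=$1   # HTTP status code
  secs=$2   # total response time in seconds (may be fractional)
  [ "$code" -eq 200 ] || { echo "expected 200 OK, got $code"; return 1; }
  # Compare whole seconds against the 10-second limit
  [ "${secs%.*}" -lt 10 ] || { echo "too slow: ${secs}s"; return 1; }
  echo "ok"
}
```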

Getting Help

When to Contact Support

Contact us for:

  • Billing or account issues
  • Site-specific blocking problems
  • Custom authentication setup
  • Enterprise features

Try self-service first for:

  • Slow crawls (usually normal)
  • Basic timeouts (increase timeout)
  • Standard authentication (use guides above)

How to Get Help

Include this info:

  • Project ID (from dashboard URL)
  • Crawl Session ID (if applicable)
  • Target website
  • Expected vs actual result
  • Steps already tried


Advanced Diagnostics

Check Crawl Health

# Session overview
curl -X GET https://api.bestaiscraper.com/sessions/456 \
  -H "Authorization: Bearer $API_KEY"

# Failed pages analysis
curl -X GET "https://api.bestaiscraper.com/sessions/456/pages?error=true" \
  -H "Authorization: Bearer $API_KEY"

# Queue status
curl -X GET https://api.bestaiscraper.com/sessions/456/queue \
  -H "Authorization: Bearer $API_KEY"

Custom Configurations

High-security sites:

{
  "respectRobots": true,
  "crawlDelay": 10000,
  "userAgent": "Mozilla/5.0 (compatible; BestAIScraper/1.0)",
  "maxRetries": 1,
  "timeout": 60000
}

JavaScript-heavy sites:

{
  "enableJavaScript": true,
  "waitForContent": 10000,
  "scrollPage": true,
  "timeout": 45000
}

Large e-commerce sites:

{
  "maxPages": 100,
  "crawlDelay": 2000,
  "focusAreas": ["products", "categories"],
  "skipPatterns": ["reviews", "user-content"]
}

Prevention

Best Practices

Before every crawl:

  • Start small (10-20 pages) to test
  • Check robots.txt for restrictions
  • Test during off-peak hours
  • Use descriptive project names

For long-term success:

  • Monitor success rates, adjust settings
  • Track site changes that affect crawling
  • Set up alerts for important sites
  • Document what works for different site types

Success Metrics

Excellent performance:

  • 90%+ success rate
  • 2-4 pages per minute
  • Rich content from most pages

Good performance:

  • 80-90% success rate
  • 1-2 pages per minute
  • Clean content from key pages

Investigate if:

  • Below 70% success rate
  • Less than 1 page per minute
  • Mostly empty pages

Focus on quality over quantity - 50 valuable pages beats 200 empty ones.
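To see which band a crawl falls into, divide successful pages by total pages attempted (both are visible in the dashboard). The helper below is plain shell arithmetic, just for convenience:

```shell
# Integer success rate in percent.
# Usage: success_rate <successful-pages> <total-pages>
success_rate() {
  ok=$1
  total=$2
  [ "$total" -gt 0 ] || { echo 0; return; }
  echo $(( ok * 100 / total ))
}

success_rate 46 50   # prints 92, the "excellent" band
```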

You're Set

This covers 95% of issues you'll encounter. The system is reliable - most problems have simple solutions.

Remember: Every "problem" is just a step toward better data extraction. Teams that push through initial challenges build the best competitive intelligence systems.

Contact support if you're still stuck. We solve problems, not create more paperwork.
