Crawler Core Concepts

Deep dive into the web crawling engine that powers Best AI Scraper - understand projects, sessions, pages, and the intelligent crawling system.

Last updated January 6, 2025
12 min read

Best AI Scraper's crawling engine is built for scale, intelligence, and reliability. Our project-centric architecture organizes everything around business goals, making complex web data extraction as simple as organizing folders.

Architecture Overview

The Big Picture

🏢 Project
├── 🎯 Crawl Sessions (individual runs)
│   ├── 📄 Pages (discovered content)
│   ├── 🔗 Links (connection graph)
│   └── 📊 Queue (intelligent scheduling)
└── ⚙️ Settings (configuration & limits)

Every piece of data flows through this hierarchy, ensuring your crawling activities stay organized and purposeful.

Projects: Your Crawling Command Center

What Makes Projects Special

Projects are containers where you organize multiple crawl sessions around a shared goal:

Smart Organization

{
  "id": 123,
  "name": "E-commerce Competitor Analysis"
}

Project Lifecycle

  1. Creation: Define scope, target domain, and crawling behavior
  2. Configuration: Set limits, delays, and extraction preferences
  3. Execution: Run sessions with different parameters
  4. Learning: System adapts based on success patterns
  5. Evolution: Refine settings based on results
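
To make steps 1 and 2 concrete, the sketch below gathers the configuration knobs shown throughout this guide (maxPages, timeout, followDepth, crawlDelay, respectRobots) into a single TypeScript shape. The type name and default values are illustrative, not part of the API.

interface ProjectConfig {
  name: string;           // e.g. "E-commerce Competitor Analysis"
  domain: string;         // target domain shared by all sessions
  maxPages: number;       // hard cap per session
  timeout: number;        // per-page timeout in milliseconds
  followDepth: number;    // how many link levels to follow
  crawlDelay: number;     // politeness delay between requests, ms
  respectRobots: boolean; // honor robots.txt
}

// Illustrative starting point for a first discovery run
const discoveryDefaults: ProjectConfig = {
  name: "New Project",
  domain: "https://target.com",
  maxPages: 50,
  timeout: 30000,
  followDepth: 2,
  crawlDelay: 1000,
  respectRobots: true,
};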

Crawl Sessions: Individual Runs

Session Anatomy

Each session represents one focused crawl run with specific goals:

{
  "id": 456,
  "projectId": 123,
  "domain": "https://competitor.com",
  "status": "running",
  "startedAt": "2025-01-06T10:00:00Z",
  "pagesScanned": 25,
  "maxPages": 100,
  "config": {
    "timeout": 20000,
    "headless": true,
    "followDepth": 3
  }
}

Session States & Intelligence

Status Progression

  • running: Active crawling in progress
  • completed: Successfully finished within limits
  • warning: Completed but hit limits or constraints
  • failed: Stopped due to errors or restrictions
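
The four states map naturally onto a TypeScript union. The helper below is a small sketch (not part of the API) showing how a client might decide whether a session has reached a terminal state.

type SessionStatus = 'running' | 'completed' | 'warning' | 'failed';

// Terminal states: the session will make no further progress.
const TERMINAL_STATES: SessionStatus[] = ['completed', 'warning', 'failed'];

function isFinished(status: SessionStatus): boolean {
  return TERMINAL_STATES.includes(status);
}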

Smart Limits

The system automatically adjusts session parameters based on:

  • Your current usage quota
  • Website response patterns
  • Content discovery rate
  • Error frequency

Session Types

Discovery Sessions

Perfect for exploring new websites:

curl -X POST /api/projects/123/crawl \
  -H "Authorization: Bearer YOUR_KEY" \
  -d '{
    "url": "https://target.com",
    "maxPages": 50,
    "timeout": 25000
  }'

Focused Sessions

Target specific sections:

curl -X POST /api/projects/123/crawl \
  -d '{
    "url": "https://target.com/products",
    "maxPages": 200,
    "followDepth": 2
  }'

Monitoring Sessions

Regular checks for changes:

curl -X POST /api/projects/123/crawl \
  -d '{
    "url": "https://target.com/pricing",
    "maxPages": 10,
    "timeout": 15000
  }'

Pages: The Content Foundation

Page Data Model

Every crawled page becomes a rich data structure:

interface Page {
  id: number;
  projectId: number;
  sessionId: number;
  url: string;
  title: string;
  metaDescription: string;
  content: string;         // Clean text content
  html: string;           // Raw HTML for future processing
  wordCount: number;
  internalLinksCount: number;
  externalLinksCount: number;
  statusCode: number;
  crawledAt: Date;
  lastAnalyzedAt?: Date;  // Track when actions were run
}

Content Processing Pipeline

  1. Fetch: Retrieve page using intelligent browser automation
  2. Parse: Extract structured information with Cheerio
  3. Clean: Convert to readable markdown format
  4. Analyze: Extract metadata, links, and structured data
  5. Store: Save with full context and relationships
  6. Index: Prepare for search and analysis
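
To make steps 2 through 4 more concrete, here is a heavily simplified sketch of the parse stage using Cheerio. The function and variable names are illustrative; the production pipeline also renders pages with browser automation, converts content to markdown, and records relationships.

import * as cheerio from 'cheerio';

// Simplified parse step: extract title, meta description, word count,
// and link counts from raw HTML. Illustrative only.
function parsePage(html: string, pageUrl: string) {
  const $ = cheerio.load(html);
  const origin = new URL(pageUrl).origin;

  const title = $('title').text().trim();
  const metaDescription = $('meta[name="description"]').attr('content') ?? '';
  const text = $('body').text().replace(/\s+/g, ' ').trim();

  let internalLinksCount = 0;
  let externalLinksCount = 0;
  $('a[href]').each((_, el) => {
    const href = $(el).attr('href') ?? '';
    try {
      const absolute = new URL(href, pageUrl);
      if (absolute.origin === origin) internalLinksCount++;
      else externalLinksCount++;
    } catch {
      // Ignore hrefs that cannot be parsed as URLs.
    }
  });

  return {
    title,
    metaDescription,
    wordCount: text.split(' ').filter(Boolean).length,
    internalLinksCount,
    externalLinksCount,
  };
}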

Page Intelligence Features

Smart Content Extraction

  • Automatic main content identification
  • Navigation and footer filtering
  • Structured data recognition
  • Image and media cataloging

Quality Scoring

  • Content depth analysis
  • Link authority assessment
  • SEO signal evaluation
  • User experience indicators

Change Detection

  • Content modification tracking
  • Link structure evolution
  • Performance degradation alerts
  • Update frequency patterns

The Intelligent Crawl Queue

Priority Intelligence

Automatic Prioritization

  • 100: Start URLs (highest priority)
  • 75: Direct navigation pages
  • 50: Content pages discovered through links
  • 25: External resources and assets
  • 10: Low-value utility pages
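
A priority assignment that follows these tiers could look like the sketch below. The page-kind classification and the cap at the start-URL tier are illustrative assumptions, not the engine's actual heuristics.

type PageKind = 'start' | 'navigation' | 'content' | 'external' | 'utility';

// Base priority per the tiers listed above.
const BASE_PRIORITY: Record<PageKind, number> = {
  start: 100,
  navigation: 75,
  content: 50,
  external: 25,
  utility: 10,
};

function priorityFor(kind: PageKind, boost = 0): number {
  // Dynamic boosts (described below) stack on top of the base tier,
  // capped here so nothing outranks a start URL.
  return Math.min(BASE_PRIORITY[kind] + boost, 100);
}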

Dynamic Adjustments

The system automatically boosts priority for:

  • Pages with high link density
  • Content matching your project type
  • Recently updated pages
  • Pages with strong SEO signals

Queue Management

Retry Logic

  • Failed pages get 3 retry attempts
  • Exponential backoff between attempts (1s initial delay, doubling up to a 30s cap by default)
  • Different retry strategies by error type
  • Automatic session termination on repeated failures

Resource Management

  • Concurrent page limits per session
  • Rate limiting per domain
  • Memory usage monitoring
  • Queue size optimization
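
As a rough illustration of per-domain rate limiting, the helper below spaces requests to the same host by a minimum delay. It is a sketch of the general technique, not the engine's implementation.

// Minimal per-domain rate limiter: ensure at least `minDelayMs`
// between requests to the same host.
class DomainRateLimiter {
  private lastRequest = new Map<string, number>();

  constructor(private minDelayMs: number) {}

  async wait(url: string): Promise<void> {
    const host = new URL(url).host;
    const now = Date.now();
    const last = this.lastRequest.get(host) ?? 0;
    const waitMs = Math.max(0, last + this.minDelayMs - now);
    if (waitMs > 0) {
      await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
    this.lastRequest.set(host, Date.now());
  }
}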

Links: The Connection Graph

Every discovered link becomes part of a rich relationship graph:

interface Link {
  id: number;
  sessionId: number;
  fromPageId: number;
  toUrl: string;
  toPageId?: number;     // Set when target is crawled
  anchorText: string;
  isInternal: boolean;
  context: string;       // Surrounding text
}

Relationship Mapping

  • Parent-child page hierarchies
  • Cross-reference networks
  • Content cluster identification
  • Navigation pattern analysis

Quality Assessment

  • Anchor text relevance scoring
  • Link placement context
  • Target page authority
  • User journey optimization

Opportunity Detection

interface LinkOpportunity {
  fromPageId: number;
  toPageId: number;
  relevanceScore: number;  // 0-1 ML-generated score
  keyword: string;
  context: string;
  reason: string;
}

Session Monitoring & Analytics

Real-time Progress Tracking

Monitor your crawling sessions as they happen:

curl -X GET /api/sessions/456/status
{
  "id": 456,
  "status": "running",
  "progress": {
    "pages_crawled": 47,
    "pages_found": 112,
    "pages_remaining": 65,
    "current_url": "https://target.com/products/item-47",
    "estimated_completion": "2025-01-06T10:45:00Z",
    "crawl_speed": "2.3 pages/minute",
    "success_rate": 0.94
  }
}
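
A minimal polling sketch against that status endpoint, using the global fetch (the 10-second interval and the Authorization header are assumptions, matching the other examples in this guide):

// Poll the status endpoint every 10 seconds until the session leaves
// the "running" state. Sketch only: add error handling and backoff
// before using anything like this in production.
async function waitForSession(sessionId: number, apiKey: string) {
  while (true) {
    const res = await fetch(`/api/sessions/${sessionId}/status`, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    const session = await res.json();
    console.log(`${session.progress.pages_crawled}/${session.progress.pages_found} pages crawled`);
    if (session.status !== 'running') return session;
    await new Promise((resolve) => setTimeout(resolve, 10_000));
  }
}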

Performance Insights

Crawl Efficiency

  • Pages per minute rates
  • Success vs failure ratios
  • Bottleneck identification
  • Resource utilization patterns

Content Discovery

  • New page discovery rates
  • Link density analysis
  • Content type distribution
  • Value extraction metrics

Quality Metrics

  • Error pattern analysis
  • Timeout frequency
  • Redirect chain analysis
  • Content richness scoring

Advanced Crawling Features

Intelligent Page Discovery

Multi-level Link Following

{
  "followDepth": 3,
  "discoverPatterns": [
    "*/products/*",
    "*/categories/*",
    "*/blog/*"
  ],
  "avoidPatterns": [
    "*/admin/*",
    "*/api/*",
    "*download*"
  ]
}
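
One plausible reading of these wildcard patterns is a straightforward glob-to-regex translation, sketched below; the engine's actual matching semantics may differ.

// Convert a wildcard pattern such as "*/products/*" into a RegExp,
// assuming "*" matches any run of characters.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*') + '$');
}

// Decide whether a URL should be queued under the config shown above.
function shouldCrawl(url: string, discover: string[], avoid: string[]): boolean {
  const path = new URL(url).pathname;
  if (avoid.some((p) => globToRegExp(p).test(path))) return false;
  if (discover.length === 0) return true;
  return discover.some((p) => globToRegExp(p).test(path));
}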

Content-aware Crawling

  • Skip low-value pages automatically
  • Prioritize content-rich sections
  • Detect and avoid infinite scrolls
  • Handle dynamic content loading

Retry & Recovery Systems

Smart Error Handling

interface RetryConfig {
  maxRetries: number;     // Default: 3
  initialDelay: number;   // Default: 1000ms
  maxDelay: number;       // Default: 30000ms
  backoffMultiplier: number; // Default: 2
}
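
Assuming the config is applied as standard exponential backoff, the delay before each retry could be computed like this:

// Delay before retry `attempt` (0-based): initialDelay * multiplier^attempt,
// never exceeding maxDelay. With the defaults: 1s, 2s, 4s.
function retryDelay(config: RetryConfig, attempt: number): number {
  const delay = config.initialDelay * Math.pow(config.backoffMultiplier, attempt);
  return Math.min(delay, config.maxDelay);
}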

Failure Analysis

  • Network timeout detection
  • Rate limiting response
  • Content blocking identification
  • Server error categorization

Automatic Recovery

  • Failed page retry scheduling
  • Session resumption after interruption
  • Partial data preservation
  • Progress checkpoint restoration

Database Architecture

Schema Design

The crawler uses PostgreSQL with Drizzle ORM for robust data management:

-- Core tables with intelligent indexing
CREATE TABLE crawl_sessions (
  id SERIAL PRIMARY KEY,
  project_id INTEGER REFERENCES projects(id),
  user_id TEXT REFERENCES users(id),
  domain TEXT NOT NULL,
  status TEXT DEFAULT 'running',
  pages_scanned INTEGER DEFAULT 0,
  max_pages INTEGER DEFAULT 50,
  config JSONB DEFAULT '{}'
);

-- Optimized for fast lookups
CREATE INDEX pages_project_url_idx ON pages(project_id, url);
CREATE INDEX crawl_queue_session_status_idx ON crawl_queue(session_id, status);
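
As a sketch of what the crawl_sessions table above looks like in Drizzle terms (column names follow the SQL; foreign-key references and several columns are omitted to keep the example self-contained):

import { pgTable, serial, integer, text, jsonb } from 'drizzle-orm/pg-core';

// Sketch of the crawl_sessions table from the SQL above.
// References to projects/users are left out for brevity.
export const crawlSessions = pgTable('crawl_sessions', {
  id: serial('id').primaryKey(),
  projectId: integer('project_id'),
  userId: text('user_id'),
  domain: text('domain').notNull(),
  status: text('status').default('running'),
  pagesScanned: integer('pages_scanned').default(0),
  maxPages: integer('max_pages').default(50),
  config: jsonb('config').default({}),
});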

Performance Optimizations

Intelligent Indexing

  • Composite indexes for common queries
  • Partial indexes for active sessions
  • GIN indexes for JSONB search
  • Automatic index maintenance

Data Partitioning

  • Session-based table partitioning
  • Time-based archival strategies
  • Hot/cold data separation
  • Query performance optimization

Usage Limits & Resource Management

Smart Quota Management

The system automatically balances your usage across sessions:

interface UsageCheck {
  allowed: boolean;
  current: number;
  limit: number;
  resetDate: Date;
  estimatedCrawlCost: number;
}
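
A client-side guard could use that shape to clamp a requested session size to whatever quota remains; the helper below is a sketch, with field meanings assumed from their names.

// Clamp a requested maxPages to the quota that is still available.
function plannedMaxPages(requested: number, usage: UsageCheck): number {
  if (!usage.allowed) return 0;
  const remaining = Math.max(0, usage.limit - usage.current);
  return Math.min(requested, remaining);
}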

Dynamic Adjustment

  • Real-time quota monitoring
  • Session limit auto-adjustment
  • Priority-based resource allocation
  • Usage prediction and warnings

Resource Optimization

Memory Management

  • Page content streaming
  • Automatic garbage collection
  • Large page handling
  • Memory leak prevention

CPU Optimization

  • Concurrent processing limits
  • Queue batching strategies
  • Background job scheduling
  • System load monitoring

Best Practices for Effective Crawling

Project Setup

  1. Choose meaningful names: Include target domain and purpose
  2. Set appropriate limits: Start small, scale based on results
  3. Configure delays: Respect target website resources
  4. Plan your approach: Discovery → Focused → Monitoring

Session Configuration

For Discovery

{
  "maxPages": 50,
  "timeout": 30000,
  "followDepth": 2,
  "respectRobots": true
}

For Production Monitoring

{
  "maxPages": 200,
  "timeout": 15000,
  "followDepth": 3,
  "crawlDelay": 1000
}

Performance Optimization

  1. Monitor success rates: Adjust timeouts for better completion
  2. Analyze queue patterns: Optimize priority settings
  3. Track resource usage: Stay within quotas efficiently
  4. Review failure patterns: Improve configuration based on errors

Content Quality

  1. Focus on value: Target content-rich sections
  2. Avoid noise: Skip navigation, ads, and boilerplate
  3. Validate extraction: Check content quality regularly
  4. Maintain freshness: Balance discovery with updates

Error Handling & Troubleshooting

Common Crawling Issues

Rate Limiting

⚠️ Getting 429 responses?
→ Increase crawlDelay in project settings
→ Reduce maxConcurrentPages
→ Check robots.txt compliance

Timeouts

⏱️ Pages timing out frequently?
→ Increase timeout value (max 120s)
→ Check target website performance
→ Consider headless vs full browser

Memory Issues

💾 Running out of resources?
→ Reduce maxPages per session
→ Break large crawls into smaller sessions
→ Clean up old session data

Debugging Tools

Session Analysis

# Get detailed session information
curl -X GET /api/sessions/456 \
  -H "Authorization: Bearer YOUR_KEY"

# Check failed pages
curl -X GET /api/sessions/456/pages?error=true

# Retry failed pages
curl -X POST /api/sessions/456/retry-failed

Real-time Monitoring

  • Live crawl progress tracking
  • Error rate monitoring
  • Resource usage alerts
  • Queue depth analysis

Ready to start crawling? Check our Quick Start Guide to launch your first project, or dive into Advanced Configuration for power user features.

Need help troubleshooting crawling issues? Our Support Center has solutions for common problems and direct access to our technical team.