Crawler Core Concepts
Deep dive into the web crawling engine that powers Best AI Scraper - understand projects, sessions, pages, and the intelligent crawling system.
Best AI Scraper's crawling engine is built for scale, intelligence, and reliability. Our project-centric architecture organizes everything around business goals, making complex web data extraction as simple as organizing folders.
Architecture Overview
The Big Picture
🏢 Project
├── 🎯 Crawl Sessions (individual runs)
│ ├── 📄 Pages (discovered content)
│ ├── 🔗 Links (connection graph)
│ └── 📊 Queue (intelligent scheduling)
└── ⚙️ Settings (configuration & limits)
Every piece of data flows through this hierarchy, ensuring your crawling activities stay organized and purposeful.
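In code terms, the hierarchy maps onto nested records. The sketch below is for orientation only; the concrete field names come from the data models shown later on this page, and anything beyond them is illustrative.

interface ProjectHierarchy {
  project: { id: number; name: string };  // the container (see Projects below)
  sessions: Array<{
    id: number;
    pages: number[];   // ids of pages discovered in this run
    links: number[];   // ids of links in the connection graph
    queue: string[];   // URLs awaiting intelligent scheduling
  }>;
  settings: Record<string, unknown>;      // limits, delays, extraction preferences
}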
Projects: Your Crawling Command Center
What Makes Projects Special
Projects are containers that organize multiple crawl sessions around a single goal:
Smart Organization
{
"id": 123,
"name": "E-commerce Competitor Analysis",
}
Project Lifecycle
- Creation: Define scope, target domain, and crawling behavior
- Configuration: Set limits, delays, and extraction preferences
- Execution: Run sessions with different parameters
- Learning: System adapts based on success patterns
- Evolution: Refine settings based on results
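As a rough illustration of the Configuration and Evolution steps, the snippet below shows project-level defaults being tightened after a first discovery run. The field names (maxPages, timeout, followDepth, crawlDelay) come from the session and best-practice examples later on this page; the specific values are only an example.

// Initial, conservative defaults for a new project (discovery-oriented).
const initialSettings = { maxPages: 50, timeout: 30000, followDepth: 2 };

// Refined after reviewing results: broader coverage, plus a politeness delay.
const refinedSettings = { ...initialSettings, maxPages: 200, followDepth: 3, crawlDelay: 1000 };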
Crawl Sessions: Individual Crawl Runs
Session Anatomy
Each session represents one focused crawl run with specific goals:
{
"id": 456,
"projectId": 123,
"domain": "https://competitor.com",
"status": "running",
"startedAt": "2025-01-06T10:00:00Z",
"pagesScanned": 25,
"maxPages": 100,
"config": {
"timeout": 20000,
"headless": true,
"followDepth": 3
}
}
Session States & Intelligence
Status Progression
- running: Active crawling in progress
- completed: Successfully finished within limits
- warning: Completed but hit limits or constraints
- failed: Stopped due to errors or restrictions
Smart Limits
The system automatically adjusts session parameters based on the following factors (a brief sketch follows the list):
- Your current usage quota
- Website response patterns
- Content discovery rate
- Error frequency
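A minimal sketch of how these pieces fit together, assuming the status values listed above and quota fields like those in the UsageCheck model further down. The adjustment heuristic itself is illustrative, not the engine's actual logic.

type SessionStatus = 'running' | 'completed' | 'warning' | 'failed';

interface QuotaSnapshot {
  current: number; // pages already consumed this period
  limit: number;   // pages allowed this period
}

// Cap a requested session size to whatever quota remains.
function effectiveMaxPages(requested: number, quota: QuotaSnapshot): number {
  const remaining = Math.max(0, quota.limit - quota.current);
  return Math.min(requested, remaining);
}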
Session Types
Discovery Sessions
Perfect for exploring new websites:
curl -X POST /api/projects/123/crawl \
-H "Authorization: Bearer YOUR_KEY" \
-d '{
"url": "https://target.com",
"maxPages": 50,
"timeout": 25000
}'
Focused Sessions
Target specific sections:
curl -X POST /api/projects/123/crawl \
-H "Authorization: Bearer YOUR_KEY" \
-d '{
"url": "https://target.com/products",
"maxPages": 200,
"followDepth": 2
}'
Monitoring Sessions
Regular checks for changes:
curl -X POST /api/projects/123/crawl \
-H "Authorization: Bearer YOUR_KEY" \
-d '{
"url": "https://target.com/pricing",
"maxPages": 10,
"timeout": 15000
}'
Pages: The Content Foundation
Page Data Model
Every crawled page becomes a rich data structure:
interface Page {
id: number;
projectId: number;
sessionId: number;
url: string;
title: string;
metaDescription: string;
content: string; // Clean text content
html: string; // Raw HTML for future processing
wordCount: number;
internalLinksCount: number;
externalLinksCount: number;
statusCode: number;
crawledAt: Date;
lastAnalyzedAt?: Date; // Track when actions were run
}
Content Processing Pipeline
- Fetch: Retrieve page using intelligent browser automation
- Parse: Extract structured information with Cheerio
- Clean: Convert to readable markdown format
- Analyze: Extract metadata, links, and structured data
- Store: Save with full context and relationships
- Index: Prepare for search and analysis
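To make the Parse and Clean steps concrete, here is a condensed sketch using Cheerio (which the pipeline uses for parsing). The markdown conversion, storage, and indexing steps are omitted, and the helper itself is illustrative; the field names mirror the Page model above.

import * as cheerio from 'cheerio';

function parsePage(url: string, html: string) {
  const $ = cheerio.load(html);
  const title = $('title').text().trim();
  const metaDescription = $('meta[name="description"]').attr('content') ?? '';

  // Crude "clean" step: collapse whitespace in the visible text.
  const content = $('body').text().replace(/\s+/g, ' ').trim();

  // Collect outgoing links and split internal vs external.
  const hrefs = $('a[href]').map((_, el) => $(el).attr('href') ?? '').get().filter(Boolean);
  const internal = hrefs.filter((href) => href.startsWith('/') || href.startsWith(url));

  return {
    url,
    title,
    metaDescription,
    content,
    wordCount: content.split(/\s+/).filter(Boolean).length,
    internalLinksCount: internal.length,
    externalLinksCount: hrefs.length - internal.length,
  };
}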
Page Intelligence Features
Smart Content Extraction
- Automatic main content identification
- Navigation and footer filtering
- Structured data recognition
- Image and media cataloging
Quality Scoring
- Content depth analysis
- Link authority assessment
- SEO signal evaluation
- User experience indicators
Change Detection
- Content modification tracking
- Link structure evolution
- Performance degradation alerts
- Update frequency patterns
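Change detection can be pictured as comparing a fingerprint of the cleaned content against the last stored value. This is only a sketch of the idea; the engine's actual tracking is internal.

import { createHash } from 'node:crypto';

// Stable fingerprint of a page's cleaned text content.
function contentFingerprint(content: string): string {
  return createHash('sha256').update(content).digest('hex');
}

// Compare against the fingerprint saved on the previous crawl (if any).
function hasChanged(previous: string | undefined, content: string): boolean {
  return previous !== contentFingerprint(content);
}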
The Intelligent Crawl Queue
Priority Intelligence
Automatic Prioritization
- 100: Start URLs (highest priority)
- 75: Direct navigation pages
- 50: Content pages discovered through links
- 25: External resources and assets
- 10: Low-value utility pages
Dynamic Adjustments
The system automatically boosts priority for the following (see the sketch after this list):
- Pages with high link density
- Content matching your project type
- Recently updated pages
- Pages with strong SEO signals
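Put together, priority assignment might look like the sketch below. The base scores follow the tiers listed above, while the link-density boost and its threshold are hypothetical.

type PageKind = 'start' | 'navigation' | 'content' | 'external' | 'utility';

const BASE_PRIORITY: Record<PageKind, number> = {
  start: 100,
  navigation: 75,
  content: 50,
  external: 25,
  utility: 10,
};

function queuePriority(kind: PageKind, outgoingLinks: number): number {
  const linkDensityBoost = outgoingLinks > 20 ? 10 : 0; // hypothetical threshold and boost
  return Math.min(100, BASE_PRIORITY[kind] + linkDensityBoost);
}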
Queue Management
Retry Logic
- Failed pages get 3 retry attempts
- Exponential backoff between attempts (1s initial delay, doubling up to a 30s cap; see RetryConfig below)
- Different retry strategies by error type
- Automatic session termination on repeated failures
Resource Management
- Concurrent page limits per session
- Rate limiting per domain
- Memory usage monitoring
- Queue size optimization
Links: The Connection Graph
Link Intelligence
Every discovered link becomes part of a rich relationship graph:
interface Link {
id: number;
sessionId: number;
fromPageId: number;
toUrl: string;
toPageId?: number; // Set when target is crawled
anchorText: string;
isInternal: boolean;
context: string; // Surrounding text
}
Link Analysis Features
Relationship Mapping
- Parent-child page hierarchies
- Cross-reference networks
- Content cluster identification
- Navigation pattern analysis
Quality Assessment
- Anchor text relevance scoring
- Link placement context
- Target page authority
- User journey optimization
Opportunity Detection
interface LinkOpportunity {
fromPageId: number;
toPageId: number;
relevanceScore: number; // 0-1 ML-generated score
keyword: string;
context: string;
reason: string;
}
Session Monitoring & Analytics
Real-time Progress Tracking
Monitor your crawling sessions as they happen:
curl -X GET /api/sessions/456/status \
-H "Authorization: Bearer YOUR_KEY"
{
"id": 456,
"status": "running",
"progress": {
"pages_crawled": 47,
"pages_found": 112,
"pages_remaining": 65,
"current_url": "https://target.com/products/item-47",
"estimated_completion": "2025-01-06T10:45:00Z",
"crawl_speed": "2.3 pages/minute",
"success_rate": 0.94
}
}
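A simple way to consume this endpoint is to poll it until the session leaves the running state. The sketch below assumes the same Bearer-token auth as the other examples, a placeholder base URL, and a 30-second interval; none of these are prescribed by the API.

const API_BASE = 'https://api.example.com'; // placeholder, substitute your deployment

async function waitForSession(sessionId: number, apiKey: string): Promise<void> {
  for (;;) {
    const res = await fetch(`${API_BASE}/api/sessions/${sessionId}/status`, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    const session: any = await res.json();
    console.log(`${session.status}: ${session.progress?.pages_crawled ?? 0} pages crawled`);
    if (session.status !== 'running') return;
    await new Promise((resolve) => setTimeout(resolve, 30_000)); // poll every 30 seconds
  }
}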
Performance Insights
Crawl Efficiency
- Pages per minute rates
- Success vs failure ratios
- Bottleneck identification
- Resource utilization patterns
Content Discovery
- New page discovery rates
- Link density analysis
- Content type distribution
- Value extraction metrics
Quality Metrics
- Error pattern analysis
- Timeout frequency
- Redirect chain analysis
- Content richness scoring
Advanced Crawling Features
Intelligent Page Discovery
Multi-level Link Following
{
"followDepth": 3,
"discoverPatterns": [
"*/products/*",
"*/categories/*",
"*/blog/*"
],
"avoidPatterns": [
"*/admin/*",
"*/api/*",
"*download*"
]
}
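One plausible way to apply these patterns to candidate URLs is a simple wildcard-to-regex translation, sketched below. Only the pattern syntax comes from the config above; the matching rules (avoid patterns win, an empty discover list allows everything) are assumptions.

function globToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&'); // escape regex metacharacters
  return new RegExp('^' + escaped.replace(/\*/g, '.*') + '$');   // '*' matches any run of characters
}

function shouldCrawl(url: string, discoverPatterns: string[], avoidPatterns: string[]): boolean {
  if (avoidPatterns.some((p) => globToRegExp(p).test(url))) return false;
  if (discoverPatterns.length === 0) return true;
  return discoverPatterns.some((p) => globToRegExp(p).test(url));
}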
Content-aware Crawling
- Skip low-value pages automatically
- Prioritize content-rich sections
- Detect and avoid infinite scrolls
- Handle dynamic content loading
Retry & Recovery Systems
Smart Error Handling
interface RetryConfig {
maxRetries: number; // Default: 3
initialDelay: number; // Default: 1000ms
maxDelay: number; // Default: 30000ms
backoffMultiplier: number; // Default: 2
}
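With the defaults above, the wait before each retry works out to roughly 1s, 2s, 4s, capped at maxDelay. A minimal sketch of that calculation, assuming the RetryConfig interface just shown:

function retryDelay(attempt: number, cfg: RetryConfig): number {
  const delay = cfg.initialDelay * Math.pow(cfg.backoffMultiplier, attempt); // attempt 0, 1, 2, ...
  return Math.min(delay, cfg.maxDelay);
}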
Failure Analysis
- Network timeout detection
- Rate limiting response
- Content blocking identification
- Server error categorization
Automatic Recovery
- Failed page retry scheduling
- Session resumption after interruption
- Partial data preservation
- Progress checkpoint restoration
Database Architecture
Schema Design
The crawler uses PostgreSQL with Drizzle ORM for robust data management:
-- Core tables with intelligent indexing
CREATE TABLE crawl_sessions (
id SERIAL PRIMARY KEY,
project_id INTEGER REFERENCES projects(id),
user_id TEXT REFERENCES users(id),
domain TEXT NOT NULL,
status TEXT DEFAULT 'running',
pages_scanned INTEGER DEFAULT 0,
max_pages INTEGER DEFAULT 50,
config JSONB DEFAULT '{}'
);
-- Optimized for fast lookups
CREATE INDEX pages_project_url_idx ON pages(project_id, url);
CREATE INDEX crawl_queue_session_status_idx ON crawl_queue(session_id, status);
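Since the stack uses Drizzle ORM, the same table can be expressed roughly as the schema sketch below. Column names follow the SQL above; the foreign-key references and index definitions are omitted for brevity, so treat this as illustrative rather than the project's actual schema file.

import { pgTable, serial, integer, text, jsonb } from 'drizzle-orm/pg-core';

export const crawlSessions = pgTable('crawl_sessions', {
  id: serial('id').primaryKey(),
  projectId: integer('project_id'),
  userId: text('user_id'),
  domain: text('domain').notNull(),
  status: text('status').default('running'),
  pagesScanned: integer('pages_scanned').default(0),
  maxPages: integer('max_pages').default(50),
  config: jsonb('config').default({}),
});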
Performance Optimizations
Intelligent Indexing
- Composite indexes for common queries
- Partial indexes for active sessions
- GIN indexes for JSONB search
- Automatic index maintenance
Data Partitioning
- Session-based table partitioning
- Time-based archival strategies
- Hot/cold data separation
- Query performance optimization
Usage Limits & Resource Management
Smart Quota Management
The system automatically balances your usage across sessions:
interface UsageCheck {
allowed: boolean;
current: number;
limit: number;
resetDate: Date;
estimatedCrawlCost: number;
}
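A pre-flight check built on the UsageCheck shape above might look like this sketch: skip the run when the estimated cost would push usage past the limit. The exact decision rule is an assumption.

function canStartSession(check: UsageCheck): boolean {
  if (!check.allowed) return false; // hard stop from the quota service
  return check.current + check.estimatedCrawlCost <= check.limit;
}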
Dynamic Adjustment
- Real-time quota monitoring
- Session limit auto-adjustment
- Priority-based resource allocation
- Usage prediction and warnings
Resource Optimization
Memory Management
- Page content streaming
- Automatic garbage collection
- Large page handling
- Memory leak prevention
CPU Optimization
- Concurrent processing limits
- Queue batching strategies
- Background job scheduling
- System load monitoring
Best Practices for Effective Crawling
Project Setup
- Choose meaningful names: Include target domain and purpose
- Set appropriate limits: Start small, scale based on results
- Configure delays: Respect target website resources
- Plan your approach: Discovery → Focused → Monitoring
Session Configuration
For Discovery
{
"maxPages": 50,
"timeout": 30000,
"followDepth": 2,
"respectRobots": true
}
For Production Monitoring
{
"maxPages": 200,
"timeout": 15000,
"followDepth": 3,
"crawlDelay": 1000
}
Performance Optimization
- Monitor success rates: Adjust timeouts for better completion
- Analyze queue patterns: Optimize priority settings
- Track resource usage: Stay within quotas efficiently
- Review failure patterns: Improve configuration based on errors
Content Quality
- Focus on value: Target content-rich sections
- Avoid noise: Skip navigation, ads, and boilerplate
- Validate extraction: Check content quality regularly
- Maintain freshness: Balance discovery with updates
Error Handling & Troubleshooting
Common Crawling Issues
Rate Limiting
⚠️ Getting 429 responses?
→ Increase crawlDelay in project settings
→ Reduce maxConcurrentPages
→ Check robots.txt compliance
Timeouts
⏱️ Pages timing out frequently?
→ Increase timeout value (max 120s)
→ Check target website performance
→ Consider headless vs full browser
Memory Issues
💾 Running out of resources?
→ Reduce maxPages per session
→ Break large crawls into smaller sessions
→ Clean up old session data
Debugging Tools
Session Analysis
# Get detailed session information
curl -X GET /api/sessions/456 \
-H "Authorization: Bearer YOUR_KEY"
# Check failed pages
curl -X GET "/api/sessions/456/pages?error=true"
# Retry failed pages
curl -X POST /api/sessions/456/retry-failed
Real-time Monitoring
- Live crawl progress tracking
- Error rate monitoring
- Resource usage alerts
- Queue depth analysis
Ready to start crawling? Check our Quick Start Guide to launch your first project, or dive into Advanced Configuration for power user features.
Need help troubleshooting crawling issues? Our Support Center has solutions for common problems and direct access to our technical team.