Managing Projects

Learn how to organize your scraping activities using projects, sessions, and effective project management strategies.

Projects are the organizational foundation of Best AI Scraper. They help you group related scraping activities, manage configurations, and track results across different websites or campaigns.

Project Structure

Hierarchy Overview

Project
├── Actions (what to extract)
├── Sessions (individual crawl runs)  
│   ├── Pages (crawled URLs)
│   └── Results (extracted data)
└── Settings (project configuration)

Project Lifecycle

  1. Create Project: Define scope and objectives
  2. Configure Actions: Specify what data to extract
  3. Run Sessions: Execute crawling with different parameters
  4. Analyze Results: Review and export extracted data
  5. Iterate: Refine actions and re-run as needed

Creating Projects

Via Dashboard

  1. Navigate to your Dashboard
  2. Click "New Project"
  3. Fill in project details:
    • Name: Descriptive project name
    • Description: Project goals and scope
    • Tags: Organizational labels
    • Website: Primary target website (optional)

Via API

curl -X POST https://api.bestaiscraper.com/projects \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "E-commerce Competitor Analysis",
    "description": "Track pricing and product information from competitor sites",
    "tags": ["ecommerce", "monitoring", "competitive"],
    "settings": {
      "respectRobots": true,
      "crawlDelay": 2000,
      "userAgent": "BestAIScraper/1.0"
    }
  }'
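
A successful request returns the new project record, including the project ID used throughout the examples below. An illustrative response shape (exact fields may vary):

{
  "id": "proj_123",
  "name": "E-commerce Competitor Analysis",
  "createdAt": "2025-01-06T09:00:00Z"
}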

Project Configuration

Basic Settings

Every project has configurable settings that apply to all sessions:

{
  "settings": {
    "respectRobots": true,
    "crawlDelay": 1000,
    "maxConcurrentPages": 5,
    "userAgent": "BestAIScraper/1.0", 
    "timeout": 30000,
    "retryCount": 3
  }
}

Setting Descriptions:

  • respectRobots (boolean): Follow robots.txt directives
  • crawlDelay (number): Delay between requests in milliseconds
  • maxConcurrentPages (number): Maximum concurrent page requests
  • userAgent (string): User agent string for requests
  • timeout (number): Page load timeout in milliseconds
  • retryCount (number): Number of retries for failed pages
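
The API example above sets these values at creation time. As a minimal sketch, assuming the API also accepts updates via PATCH on the project resource (an endpoint not documented above), adjusting a setting on an existing project might look like:

curl -X PATCH https://api.bestaiscraper.com/projects/proj_123 \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {
      "crawlDelay": 2000
    }
  }'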

Advanced Configuration

{
  "settings": {
    "authentication": {
      "type": "basic",
      "username": "user",
      "password": "pass"
    },
    "headers": {
      "X-Custom-Header": "value"
    },
    "cookies": [
      {
        "name": "session",
        "value": "abc123",
        "domain": "example.com"
      }
    ]
  }
}

Project Templates

Speed up project creation with pre-configured templates:

E-commerce Analysis

{
  "template": "ecommerce-analysis",
  "actions": [
    {
      "type": "ecommerce-data",
      "config": {
        "extractPrices": true,
        "extractReviews": true
      }
    },
    {
      "type": "structured-data", 
      "config": {
        "schemas": ["Product", "Offer"]
      }
    }
  ]
}

SEO Audit

{
  "template": "seo-audit",
  "actions": [
    {
      "type": "internal-links",
      "config": {
        "followDepth": 3,
        "extractAnchorContext": true
      }
    },
    {
      "type": "structured-data",
      "config": {
        "validateSchema": true
      }
    }
  ]
}

Content Discovery

{
  "template": "content-discovery", 
  "actions": [
    {
      "type": "internal-links",
      "config": {
        "followDepth": 5,
        "filterPatterns": ["*/blog/*", "*/articles/*"]
      }
    }
  ]
}
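
As a sketch, assuming the template key shown above is accepted in the project-creation payload, creating a project from a template might look like:

curl -X POST https://api.bestaiscraper.com/projects \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Competitor SEO Audit",
    "template": "seo-audit"
  }'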

Managing Sessions

Session Types

One-time Crawl

Perfect for ad-hoc analysis:

{
  "type": "one-time",
  "startUrl": "https://example.com",
  "maxPages": 100
}

Scheduled Crawl

For regular monitoring. The schedule field takes a standard five-field cron expression; the example below runs every Monday at 09:00:

{
  "type": "scheduled",
  "startUrl": "https://example.com", 
  "schedule": "0 9 * * 1",
  "maxPages": 50
}

Incremental Crawl

Crawl only pages that are new or have changed since the last run:

{
  "type": "incremental",
  "startUrl": "https://example.com",
  "lastCrawlDate": "2025-01-01T00:00:00Z"
}
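
To launch a session with one of these configurations, a request along the following lines should work, assuming sessions are created under the project resource (the exact path is an assumption):

curl -X POST https://api.bestaiscraper.com/projects/proj_123/sessions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "one-time",
    "startUrl": "https://example.com",
    "maxPages": 100
  }'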

Session Monitoring

Track session progress in real time:

curl -X GET https://api.bestaiscraper.com/sessions/sess_123/status \
  -H "Authorization: Bearer YOUR_API_KEY"

Response:

{
  "id": "sess_123",
  "status": "running", 
  "progress": {
    "pages_crawled": 25,
    "pages_found": 47,
    "pages_remaining": 22,
    "current_url": "https://example.com/page-25",
    "estimated_completion": "2025-01-06T10:15:00Z"
  }
}
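
For unattended runs, the status endpoint above can be polled from a shell script. A minimal sketch (assumes jq is installed):

# Poll every 10 seconds until the session leaves the "running" state
while true; do
  STATUS=$(curl -s https://api.bestaiscraper.com/sessions/sess_123/status \
    -H "Authorization: Bearer YOUR_API_KEY" | jq -r '.status')
  echo "Session status: $STATUS"
  [ "$STATUS" != "running" ] && break
  sleep 10
done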

Project Collaboration

Team Access

Invite team members to collaborate on projects:

curl -X POST https://api.bestaiscraper.com/projects/proj_123/members \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "email": "teammate@example.com",
    "role": "editor"
  }'

Role Permissions:

  • Viewer: Read-only access to project and results
  • Editor: Can run sessions and modify actions
  • Admin: Full project management including member management
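
To change a member's role later, a PATCH on the member resource is a reasonable sketch (the path and the mem_456 identifier are assumptions):

curl -X PATCH https://api.bestaiscraper.com/projects/proj_123/members/mem_456 \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "role": "admin"
  }'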

Sharing Results

Generate shareable links for results:

curl -X POST https://api.bestaiscraper.com/projects/proj_123/share \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "permissions": "read",
    "expireAt": "2025-02-01T00:00:00Z"
  }'
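
The response should contain the shareable link itself. An illustrative shape (the field names and share URL format here are assumptions, not a documented schema):

{
  "url": "https://bestaiscraper.com/share/abc123",
  "permissions": "read",
  "expireAt": "2025-02-01T00:00:00Z"
}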

Project Analytics

Usage Tracking

Monitor project resource usage:

{
  "project_id": "proj_123",
  "analytics": {
    "total_sessions": 15,
    "pages_crawled": 1250,
    "data_points_extracted": 5430,
    "success_rate": 0.96,
    "avg_session_duration": 180,
    "monthly_usage": {
      "api_calls": 847,
      "storage_mb": 23.4
    }
  }
}
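
A payload with this shape would typically be fetched from an analytics endpoint on the project resource. As a sketch (the path is an assumption):

curl https://api.bestaiscraper.com/projects/proj_123/analytics \
  -H "Authorization: Bearer YOUR_API_KEY"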

Performance Insights

Identify optimization opportunities:

{
  "insights": [
    {
      "type": "efficiency",
      "message": "Consider reducing followDepth to improve crawl speed",
      "action": "internal-links",
      "impact": "medium"
    },
    {
      "type": "coverage", 
      "message": "12% of pages failed to extract data",
      "pages": ["https://example.com/js-heavy-page"],
      "suggestion": "Enable JavaScript rendering"
    }
  ]
}

Data Export & Integration

Export Formats

Export project results in multiple formats:

  • JSON: Programmatic integration
  • CSV: Spreadsheet analysis
  • XML: Legacy system integration
  • Webhook: Real-time streaming

Export API

curl -X GET https://api.bestaiscraper.com/projects/proj_123/export \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Accept: text/csv" \
  -G -d "format=csv" -d "action_type=internal-links"

Webhook Integration

Set up webhooks for automated data processing:

{
  "webhook": {
    "url": "https://your-app.com/webhook",
    "events": ["session.completed"],
    "filters": {
      "project_id": "proj_123",
      "action_types": ["ecommerce-data"]
    }
  }
}
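
Assuming webhooks are registered through their own resource (the endpoint path is an assumption), the configuration above would be submitted like this:

curl -X POST https://api.bestaiscraper.com/webhooks \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://your-app.com/webhook",
    "events": ["session.completed"],
    "filters": {
      "project_id": "proj_123",
      "action_types": ["ecommerce-data"]
    }
  }'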

Best Practices

Project Organization

  1. Use descriptive names: Include target website and purpose
  2. Tag consistently: Develop a tagging taxonomy
  3. Document objectives: Clear descriptions help team members
  4. Archive completed projects: Keep workspace clean

Performance Optimization

  1. Start small: Test with limited pages before scaling
  2. Monitor resources: Track usage against plan limits
  3. Optimize actions: Remove unnecessary extractors
  4. Use filters: Target specific content areas

Data Management

  1. Regular exports: Don't rely solely on platform storage
  2. Version control: Track configuration changes
  3. Data validation: Verify extraction accuracy
  4. Cleanup old data: Remove outdated sessions

Need help organizing your projects? Check our Getting Started guide or contact support for personalized assistance.