Getting Started

Set up your first crawling project and start extracting data from websites. Simple setup, reliable results.

Last updated January 6, 2025
8 min read

This guide walks you through setting up your first project and running your first crawl. It takes about 10 minutes to get your first results.

The system is built around projects that contain crawl sessions. Each session crawls a website and extracts the content into structured data you can analyze or export.

How It Works

The crawler works in layers:

  1. Projects - Organize related crawling activities
  2. Sessions - Individual crawl runs with specific settings
  3. Pages - Extracted content with metadata and links
  4. Queue - Manages page discovery and crawling order

Data comes out clean and structured, not as raw HTML. You can export to CSV, JSON, or connect via API.
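
Each crawled page comes back as a structured record built from the fields described later in this guide: cleaned text, title, description, word count, and outgoing links. A single record might look roughly like this (the field names are illustrative, not the exact export schema):

{
  "url": "https://example.com/pricing",
  "title": "Pricing",
  "description": "Plans and pricing overview",
  "wordCount": 640,
  "links": [
    "https://example.com/",
    "https://example.com/signup"
  ],
  "content": "Clean extracted text without HTML tags..."
}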

Quick Start: First Crawl

Step 1: Create a Project

Projects group related crawls together.

Via Dashboard:

  1. Go to Dashboard
  2. Click "New Project"
  3. Give it a name and description
  4. Optionally choose a project type (it sets sensible defaults)

Via API:

curl -X POST https://api.bestaiscraper.com/projects \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Site Analysis",
    "description": "Analyze competitor site structure"
  }'
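
The response returns the new project's ID, which the rest of this guide references as 123. The exact response fields shown here are an illustrative assumption, not the documented schema:

{
  "id": 123,
  "name": "Site Analysis",
  "description": "Analyze competitor site structure"
}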

Use clear names - you'll have multiple projects eventually.

Step 2: Crawl Settings

Default settings work for most sites:

{
  "settings": {
    "respectRobots": true,
    "crawlDelay": 2000,
    "maxPages": 50,
    "timeout": 25000,
    "userAgent": "BestAIScraper/1.0"
  }
}

These defaults reduce the chance of getting blocked and keep extraction reliable. You can adjust them later if needed.
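
If a site needs different values (for example a slower crawl delay), one likely place to set them is on the project itself. Whether the projects endpoint accepts a settings object like this is an assumption, so treat the request below as a sketch rather than the documented API:

# Sketch only: passing "settings" at project creation is assumed, not confirmed
curl -X POST https://api.bestaiscraper.com/projects \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Slow Site Analysis",
    "settings": {
      "crawlDelay": 5000,
      "maxPages": 100
    }
  }'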

Step 3: Start Crawling

Dashboard Method:

  1. Open your project
  2. Click "Start Crawl"
  3. Enter the website URL
  4. Set max pages (start with 25)
  5. Click "Start"

API Method:

curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "maxPages": 25,
    "timeout": 25000
  }'
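
The response includes the ID of the new crawl session, which later examples reference as 456. The fields shown here are illustrative:

{
  "sessionId": 456,
  "status": "queued"
}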

What Happens:

  • System discovers pages by following links
  • Content gets extracted and cleaned
  • Links between pages are tracked
  • Data gets organized for export

Step 4: Monitor Progress

Check crawl status:

curl -X GET https://api.bestaiscraper.com/sessions/456/status \
  -H "Authorization: Bearer YOUR_API_KEY"

Response:

{
  "status": "running",
  "progress": {
    "pages_crawled": 12,
    "pages_found": 28,
    "current_url": "https://example.com/page-12",
    "crawl_speed": "2.4 pages/minute"
  }
}

When status shows "completed", your data is ready.
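
If you're scripting against the API, a simple polling loop works. This sketch assumes jq is installed and reads the top-level status field shown in the response above:

# Check the session status every 30 seconds until the crawl completes
while true; do
  status=$(curl -s https://api.bestaiscraper.com/sessions/456/status \
    -H "Authorization: Bearer YOUR_API_KEY" | jq -r '.status')
  echo "Session status: $status"
  if [ "$status" = "completed" ]; then
    break
  fi
  sleep 30
done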

Step 5: Get Your Data

View Results: Go to Projects → Your Project → Results

Export Data:

curl -X GET https://api.bestaiscraper.com/projects/123/export \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Accept: text/csv"

What You Get:

  • Clean page content (no HTML tags)
  • Page titles, descriptions, word counts
  • Link relationships between pages
  • Export as CSV, JSON, or via API

What You Get

After a successful crawl:

  • All page content cleaned and structured
  • Page titles, descriptions, metadata
  • Link relationships mapped
  • Content organized by page hierarchy
  • Data ready for analysis or export

This beats manual copy-pasting because it's consistent, complete, and repeatable.

Common Uses

Marketing: Find content gaps, analyze messaging, discover keywords
E-commerce: Compare prices, track product changes, monitor inventory
Analysis: Research markets, track trends, gather competitive data

The data format stays consistent, so you can build analysis workflows around it.

Next Steps

Multiple Sites

Crawl several sites in the same project:

sites=("site1.com" "site2.com" "site3.com")

# Start a crawl for each site in the same project
for site in "${sites[@]}"; do
  curl -X POST https://api.bestaiscraper.com/projects/123/crawl \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"url\": \"https://$site\", \"maxPages\": 30}"
done

Webhooks

Get notified when crawls complete:

curl -X POST https://api.bestaiscraper.com/projects/123/webhooks \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://your-app.com/webhook",
    "events": ["session.completed"]
  }'
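
Before relying on the webhook, you can test your receiver by posting a sample payload to it yourself. The field names below are assumptions for testing, not the documented delivery format:

# Simulate a delivery to your own endpoint; the payload fields
# (event, projectId, sessionId) are assumed, not the documented shape
curl -X POST https://your-app.com/webhook \
  -H "Content-Type: application/json" \
  -d '{
    "event": "session.completed",
    "projectId": 123,
    "sessionId": 456
  }'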

Scheduled Crawls

Planned feature - run crawls automatically on a cron schedule. The example below would run every Monday at 09:00:

curl -X POST https://api.bestaiscraper.com/projects/123/schedule \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "schedule": "0 9 * * 1",
    "maxPages": 50
  }'

Why This Works Better

vs Manual Research:

  • Faster and more consistent
  • Repeatable process
  • Better data format
  • No copy-paste errors

vs Developer Tools:

  • No setup or maintenance
  • Works immediately
  • No technical knowledge needed
  • Predictable costs

vs Enterprise Software:

  • Quick to implement
  • Reasonable pricing
  • Built for actual business needs

The main benefit is having reliable, current data instead of guessing or using outdated information.

Learning Path

Week 1: Run several crawls, understand the data format, try exports
Week 2: Learn about Actions and Project Management
Week 3: Set up monitoring, integrate with your existing tools

The core concepts are simple - projects contain sessions, sessions crawl sites, sites become structured data.

Getting Help

Common Questions

"Crawls seem slow" - Normal. 2-3 pages per minute ensures good data quality.

"Can I crawl protected sites?" - Yes, see authentication docs.

"Getting blocked?" - Rare with default settings. See troubleshooting guide.

"Need more pages?" - Increase maxPages or upgrade plan.

Support

We respond quickly and solve problems instead of sending you in circles.

You're Done

You now have a working system for extracting structured data from websites. It's faster and more reliable than manual research.

Next steps:

  1. Run more crawls to get familiar with the data
  2. Export results and see how they fit your workflow
  3. Read other docs to understand advanced features

The system is straightforward - you'll figure out the rest by using it.
