The page.extract() method uses AI to extract structured data from web pages. You can optionally provide a Zod schema for type-safe extraction with automatic validation.

Signature

// With schema (type-safe)
page.extract<T>(
  instruction: string,
  schema: z.ZodType<T>
): Promise<ExtractResult<T>>

// Without schema (string extraction)
page.extract(
  instruction: string
): Promise<ExtractResult<{ extraction: string }>>
Parameters:
  • instruction - Natural language description of what data to extract
  • schema (optional) - Zod schema for validation and type safety
Returns: Promise<ExtractResult<T>>
{
  success: boolean
  data?: T              // Typed based on your schema
  error?: string
  reasoning?: string    // AI's explanation
}
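Because data, error, and reasoning are all optional, callers have to check several fields before they can use a result. The sketch below shows one way to centralize that check. The ExtractResult interface is copied from the shape above; unwrapExtract is a hypothetical helper and not part of the library.

```typescript
// Shape copied from the "Returns" description above.
interface ExtractResult<T> {
  success: boolean
  data?: T
  error?: string
  reasoning?: string
}

// Hypothetical helper (not a library function): return the data or throw.
function unwrapExtract<T>(result: ExtractResult<T>): T {
  if (result.success && result.data !== undefined) {
    return result.data
  }
  const reason = result.reasoning ? ` (reasoning: ${result.reasoning})` : ''
  throw new Error(`Extraction failed: ${result.error ?? 'unknown error'}${reason}`)
}

// A hand-written result stands in for a real page.extract() call here.
const sampleResult: ExtractResult<{ extraction: string }> = {
  success: true,
  data: { extraction: 'How to Build Better Software' }
}
const sampleTitle = unwrapExtract(sampleResult).extraction
```

Throwing suits scripts that should stop on a failed extraction. Where a failure is expected and recoverable, branching on success directly, as the examples below do, is usually the better fit.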

Installation

You’ll need to install Zod for schema-based extraction:
npm install zod

Basic Examples

Simple Text Extraction

Extract text without a schema:
await page.goto('https://example.com/article')

const result = await page.extract('Extract the article title')

if (result.success && result.data) {
  console.log(result.data.extraction)
  // "How to Build Better Software"
}

Single Object Extraction

Extract a structured object with type safety:
import { z } from 'zod'

const productSchema = z.object({
  name: z.string(),
  price: z.number(),
  inStock: z.boolean()
})

await page.goto('https://shop.example.com/product/123')

const result = await page.extract(
  'Extract the product information',
  productSchema
)

if (result.success && result.data) {
  console.log(result.data.name)      // string
  console.log(result.data.price)     // number
  console.log(result.data.inStock)   // boolean
  // TypeScript knows the exact types!
}

Intermediate Examples

Multiple Fields

Extract complex objects with many fields:
const userSchema = z.object({
  name: z.string(),
  email: z.string().email(),
  joinDate: z.string(),
  verified: z.boolean(),
  profileImage: z.string().url()
})

await page.goto('https://example.com/profile/johndoe')

const result = await page.extract(
  'Extract user profile information',
  userSchema
)

if (result.success && result.data) {
  const user = result.data
  console.log(`${user.name} (${user.email})`)
  console.log(`Joined: ${user.joinDate}`)
  console.log(`Verified: ${user.verified}`)
}

Optional Fields

Handle optional data with Zod:
const articleSchema = z.object({
  title: z.string(),
  author: z.string(),
  publishDate: z.string(),
  updatedDate: z.string().optional(),  // May not be present
  tags: z.array(z.string()).optional(),
  readTime: z.number().optional()
})

const result = await page.extract(
  'Extract article metadata',
  articleSchema
)
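Optional fields come back as undefined when the page does not contain them, so code that consumes the result needs a fallback for each one. A minimal sketch follows, assuming the extracted data has the type Zod would infer from articleSchema. The ArticleMeta type is written out by hand so the snippet runs without Zod, and describeArticle is a hypothetical helper.

```typescript
// Mirrors the type Zod would infer from articleSchema above.
type ArticleMeta = {
  title: string
  author: string
  publishDate: string
  updatedDate?: string
  tags?: string[]
  readTime?: number
}

// Hypothetical helper: build a one-line summary, skipping absent fields.
function describeArticle(meta: ArticleMeta): string {
  const tags = meta.tags ?? []
  const parts = [`${meta.title} by ${meta.author}`, `published ${meta.publishDate}`]
  if (meta.updatedDate !== undefined) parts.push(`updated ${meta.updatedDate}`)
  if (meta.readTime !== undefined) parts.push(`${meta.readTime} min read`)
  if (tags.length > 0) parts.push(`tags: ${tags.join(', ')}`)
  return parts.join(' | ')
}
```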

Advanced Examples

Array Extraction

Extract lists of items:
const searchResultsSchema = z.object({
  query: z.string(),
  results: z.array(z.object({
    title: z.string(),
    url: z.string().url(),
    description: z.string(),
    price: z.number().optional()
  })),
  totalResults: z.number()
})

await page.goto('https://example.com/search?q=laptop')

const result = await page.extract(
  'Extract all search results with their details',
  searchResultsSchema
)

if (result.success && result.data) {
  console.log(`Found ${result.data.totalResults} results for "${result.data.query}"`)

  result.data.results.forEach(item => {
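    // price is optional in the schema, so guard against undefined
    const price = item.price !== undefined ? `$${item.price}` : 'price not listed'
    console.log(`${item.title} - ${price}`)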
    console.log(item.url)
  })
}

Nested Objects

Extract complex nested data structures:
const restaurantSchema = z.object({
  name: z.string(),
  rating: z.number(),
  priceLevel: z.string(),
  address: z.object({
    street: z.string(),
    city: z.string(),
    state: z.string(),
    zip: z.string()
  }),
  hours: z.object({
    monday: z.string(),
    tuesday: z.string(),
    wednesday: z.string(),
    thursday: z.string(),
    friday: z.string(),
    saturday: z.string(),
    sunday: z.string()
  }),
  reviews: z.array(z.object({
    author: z.string(),
    rating: z.number(),
    text: z.string(),
    date: z.string()
  }))
})

const result = await page.extract(
  'Extract complete restaurant information including address, hours, and recent reviews',
  restaurantSchema
)

Table Extraction

Extract data from HTML tables:
const tableSchema = z.object({
  headers: z.array(z.string()),
  rows: z.array(z.array(z.string()))
})

await page.goto('https://example.com/data')

const result = await page.extract(
  'Extract the pricing table data',
  tableSchema
)

if (result.success && result.data) {
  // Print as CSV
  console.log(result.data.headers.join(','))
  result.data.rows.forEach(row => {
    console.log(row.join(','))
  })
}
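Joining cells with a bare comma produces broken CSV as soon as a cell contains a comma, a quote, or a line break, which is common in pricing tables (for example "$1,200"). The sketch below quotes such cells following the usual CSV convention of wrapping the cell in double quotes and doubling any embedded quotes. csvCell and toCsv are hypothetical helper names.

```typescript
// Quote a cell only when it contains a character that would break the row.
function csvCell(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value
}

// Serialize headers and rows (the shape of tableSchema above) to CSV text.
function toCsv(headers: string[], rows: string[][]): string {
  return [headers, ...rows]
    .map(row => row.map(csvCell).join(','))
    .join('\n')
}
```

With an extracted table in result.data, the call would be toCsv(result.data.headers, result.data.rows).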

Zod Schema Primer

Basic Types

import { z } from 'zod'

z.string()              // string
z.number()              // number
z.boolean()             // boolean
z.date()                // Date object
z.string().url()        // URL string
z.string().email()      // Email string
z.string().uuid()       // UUID string
z.literal('specific')   // Exact value
z.enum(['a', 'b', 'c']) // One of several values

Optional and Nullable

z.string().optional()         // string | undefined
z.string().nullable()         // string | null
z.string().nullish()          // string | null | undefined
z.string().default('hello')   // string with default value

Arrays and Objects

z.array(z.string())           // string[]
z.object({                    // { name: string, age: number }
  name: z.string(),
  age: z.number()
})

Validation

z.string().min(3)             // At least 3 characters
z.string().max(100)           // At most 100 characters
z.number().positive()         // Must be positive
z.number().int()              // Must be integer
z.number().min(0).max(100)    // Between 0 and 100

Error Handling

Handle validation errors gracefully:
const schema = z.object({
  price: z.number().positive(),
  email: z.string().email()
})

const result = await page.extract(
  'Extract product price and contact email',
  schema
)

if (result.success && result.data) {
  // Data is valid and typed
  console.log('Price:', result.data.price)
  console.log('Email:', result.data.email)
} else {
  // Extraction failed or validation failed
  console.error('Extraction error:', result.error)

  // See AI reasoning
  if (result.reasoning) {
    console.log('AI reasoning:', result.reasoning)
  }
}

Best Practices

Be Specific in Instructions

Clear instructions lead to better extraction results.
Good:
await page.extract(
  'Extract the product name, price in USD, and availability status from the product details section',
  productSchema
)
Bad:
await page.extract('Get the info', productSchema) // Too vague

Design Schemas Carefully

Match your schema to the actual data structure:
// If prices include currency symbols, extract them as strings
const priceSchema = z.string() // "$29.99"
// Not: z.number() (would fail validation)

// Or extract as a string and parse it into a number
const parsedPriceSchema = z.string().transform(str =>
  parseFloat(str.replace(/[$,]/g, ''))
)
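One caveat with parsing inside a transform: parseFloat returns NaN for strings with no leading number (such as "Call for price"), and NaN is easy to miss downstream. The sketch below is a standalone parser that returns null instead. parsePrice is a hypothetical helper, and it assumes US-style formatting (comma as thousands separator, dot as decimal point).

```typescript
// Hypothetical helper: parse a US-formatted price string, or return null.
function parsePrice(raw: string): number | null {
  // Keep only digits, the decimal point, and a minus sign.
  const cleaned = raw.replace(/[^0-9.\-]/g, '')
  const value = Number.parseFloat(cleaned)
  return Number.isNaN(value) ? null : value
}
```

A function like this could be passed to .transform() in place of the inline arrow function, or applied to the string after extraction.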

Handle Missing Data

Use optional fields for data that might not be present:
const schema = z.object({
  title: z.string(),
  // These might not always be present
  subtitle: z.string().optional(),
  author: z.string().optional(),
  rating: z.number().optional()
})

Test with Real Pages

Always test extraction with actual pages:
// Test extraction
const result = await page.extract(instruction, schema)

if (!result.success) {
  console.error('Extraction failed:', result.error)
  // Adjust instruction or schema
}
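When testing shows that one instruction fails on some pages, one option is to try a short list of instructions in order, from most to least specific. The sketch below illustrates that idea. extractWithFallback and the Extractor type are hypothetical, and the ExtractResult interface is copied from the Signature section.

```typescript
interface ExtractResult<T> {
  success: boolean
  data?: T
  error?: string
  reasoning?: string
}

// Any function that takes an instruction and resolves to an ExtractResult,
// for example: instruction => page.extract(instruction, schema)
type Extractor<T> = (instruction: string) => Promise<ExtractResult<T>>

// Try each instruction in turn; return the first success or the last failure.
async function extractWithFallback<T>(
  extract: Extractor<T>,
  instructions: string[]
): Promise<ExtractResult<T>> {
  let last: ExtractResult<T> = { success: false, error: 'No instructions provided' }
  for (const instruction of instructions) {
    last = await extract(instruction)
    if (last.success && last.data !== undefined) return last
  }
  return last
}
```

Each attempt is a separate AI call, so the list should stay short.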

Common Use Cases

E-commerce Product Data

const productSchema = z.object({
  name: z.string(),
  brand: z.string(),
  price: z.number(),
  originalPrice: z.number().optional(),
  discount: z.number().optional(),
  rating: z.number(),
  reviewCount: z.number(),
  inStock: z.boolean(),
  images: z.array(z.string().url()),
  description: z.string()
})

const result = await page.extract(
  'Extract all product details',
  productSchema
)

Article Metadata

const articleSchema = z.object({
  headline: z.string(),
  subheadline: z.string().optional(),
  author: z.string(),
  publishDate: z.string(),
  readingTime: z.number(),
  tags: z.array(z.string()),
  summary: z.string()
})

const result = await page.extract(
  'Extract article metadata and summary',
  articleSchema
)

Contact Information

const contactSchema = z.object({
  name: z.string(),
  email: z.string().email().optional(),
  phone: z.string().optional(),
  address: z.string().optional(),
  website: z.string().url().optional(),
  socialMedia: z.object({
    twitter: z.string().optional(),
    linkedin: z.string().optional(),
    facebook: z.string().optional()
  }).optional()
})

const result = await page.extract(
  'Extract contact information from the page',
  contactSchema
)

Reviews and Ratings

const reviewsSchema = z.object({
  overallRating: z.number(),
  totalReviews: z.number(),
  reviews: z.array(z.object({
    author: z.string(),
    rating: z.number(),
    title: z.string(),
    text: z.string(),
    date: z.string(),
    helpful: z.number().optional()
  }))
})

const result = await page.extract(
  'Extract product reviews and ratings',
  reviewsSchema
)

Performance Tips

  1. Be specific - Clear instructions reduce processing time
  2. Use appropriate schemas - Don’t over-complicate schemas
  3. Extract once - Cache results instead of re-extracting
  4. Batch extraction - Extract multiple fields in one call rather than in separate calls
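Tip 3 can be sketched as a small in-memory cache keyed by URL and instruction. This is an illustration under stated assumptions, not a library feature: cachedExtract and extractCache are hypothetical names, and the run callback stands in for a call such as () => page.extract(instruction, schema). The cache stores the promise itself, so concurrent callers share one in-flight request, and failed attempts are evicted so they can be retried.

```typescript
interface ExtractResult<T> {
  success: boolean
  data?: T
  error?: string
  reasoning?: string
}

const extractCache = new Map<string, Promise<ExtractResult<unknown>>>()

// Return the cached promise for (url, instruction), or start a new extraction.
function cachedExtract<T>(
  url: string,
  instruction: string,
  run: () => Promise<ExtractResult<T>>
): Promise<ExtractResult<T>> {
  const key = `${url}\n${instruction}`
  const hit = extractCache.get(key)
  // The cast relies on callers using a consistent type per key.
  if (hit) return hit as Promise<ExtractResult<T>>

  const pending = run().then(
    result => {
      if (!result.success) extractCache.delete(key) // do not keep failures
      return result
    },
    err => {
      extractCache.delete(key) // do not keep rejected promises either
      throw err
    }
  )
  extractCache.set(key, pending)
  return pending
}
```

The cache never expires entries, so it only suits pages whose content does not change during the run.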

Limitations

AI extraction has some limitations:
  • Requires API calls (adds latency and cost)
  • May not work on obfuscated or heavily JavaScript-rendered content
  • Accuracy depends on page structure and instruction clarity
  • Rate limits apply based on your AI provider
