The page.extract() method uses AI to extract structured data from web pages. You can optionally provide a Zod schema for type-safe extraction with automatic validation.

Signature

// With schema (type-safe)
page.extract<T>(
  instruction: string,
  schema: z.ZodType<T>
): Promise<ExtractResult<T>>

// Without schema (string extraction)
page.extract(
  instruction: string
): Promise<ExtractResult<{ extraction: string }>>
Parameters:
  • instruction - Natural language description of what data to extract
  • schema (optional) - Zod schema for validation and type safety
Returns: Promise<ExtractResult<T>>
{
  success: boolean
  data?: T              // Typed based on your schema
  error?: string
  reasoning?: string    // AI's explanation
}
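Because data is optional in this shape, callers usually need to check both success and data before reading fields. A minimal sketch of one way to centralize that check is shown here. The unwrap helper is hypothetical and not part of the library, and the ExtractResult interface is a local copy of the shape documented above so the snippet stands alone.

```typescript
// Local mirror of the documented result shape (assumption: copied from the docs above).
interface ExtractResult<T> {
  success: boolean
  data?: T
  error?: string
  reasoning?: string
}

// Hypothetical helper, not a library function: return the data or throw.
function unwrap<T>(result: ExtractResult<T>): T {
  if (!result.success || result.data === undefined) {
    throw new Error(result.error ?? 'Extraction failed without an error message')
  }
  return result.data
}

// A hand-written result object stands in for a real page.extract() call.
const okResult: ExtractResult<{ extraction: string }> = {
  success: true,
  data: { extraction: 'How to Build Better Software' }
}
console.log(unwrap(okResult).extraction)
```

Throwing suits scripts that should stop on the first failed extraction. Code that needs to inspect reasoning on failure may prefer the explicit if/else check instead.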

Installation

You’ll need to install Zod for schema-based extraction:
npm install zod

Basic Examples

Simple Text Extraction

Extract text without a schema:
await page.goto('https://example.com/article')

const result = await page.extract('Extract the article title')

if (result.success && result.data) {
  // data is optional on ExtractResult, so check it before reading fields
  console.log(result.data.extraction)
  // "How to Build Better Software"
}

Single Object Extraction

Extract a structured object with type safety:
import { z } from 'zod'

const productSchema = z.object({
  name: z.string(),
  price: z.number(),
  inStock: z.boolean()
})

await page.goto('https://shop.example.com/product/123')

const result = await page.extract(
  'Extract the product information',
  productSchema
)

if (result.success && result.data) {
  console.log(result.data.name)      // string
  console.log(result.data.price)     // number
  console.log(result.data.inStock)   // boolean
  // TypeScript knows the exact types!
}

Intermediate Examples

Multiple Fields

Extract complex objects with many fields:
const userSchema = z.object({
  name: z.string(),
  email: z.string().email(),
  joinDate: z.string(),
  verified: z.boolean(),
  profileImage: z.string().url()
})

await page.goto('https://example.com/profile/johndoe')

const result = await page.extract(
  'Extract user profile information',
  userSchema
)

if (result.success && result.data) {
  const user = result.data
  console.log(`${user.name} (${user.email})`)
  console.log(`Joined: ${user.joinDate}`)
  console.log(`Verified: ${user.verified}`)
}

Optional Fields

Handle optional data with Zod:
const articleSchema = z.object({
  title: z.string(),
  author: z.string(),
  publishDate: z.string(),
  updatedDate: z.string().optional(),  // May not be present
  tags: z.array(z.string()).optional(),
  readTime: z.number().optional()
})

const result = await page.extract(
  'Extract article metadata',
  articleSchema
)
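Optional fields arrive as undefined when the page does not contain them, so downstream code needs fallbacks. The sketch below shows one way to consume such a result. The ArticleMeta type is written out by hand to keep the snippet dependency-free (with Zod installed it would normally come from z.infer<typeof articleSchema>), and describeArticle is a hypothetical helper used only for illustration.

```typescript
// Hand-written equivalent of the inferred schema type (assumption: mirrors articleSchema above).
type ArticleMeta = {
  title: string
  author: string
  publishDate: string
  updatedDate?: string
  tags?: string[]
  readTime?: number
}

// Hypothetical helper: builds a one-line summary with fallbacks for missing fields.
function describeArticle(meta: ArticleMeta): string {
  const tags = meta.tags ?? []                              // missing list becomes empty
  const lastTouched = meta.updatedDate ?? meta.publishDate  // fall back to the publish date
  const readTime =
    meta.readTime !== undefined ? `${meta.readTime} min read` : 'read time unknown'
  return `${meta.title} by ${meta.author} (${lastTouched}, ${readTime}, ${tags.length} tags)`
}

console.log(describeArticle({ title: 'T', author: 'A', publishDate: '2024-01-01' }))
```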

Advanced Examples

Array Extraction

Extract lists of items:
const searchResultsSchema = z.object({
  query: z.string(),
  results: z.array(z.object({
    title: z.string(),
    url: z.string().url(),
    description: z.string(),
    price: z.number().optional()
  })),
  totalResults: z.number()
})

await page.goto('https://example.com/search?q=laptop')

const result = await page.extract(
  'Extract all search results with their details',
  searchResultsSchema
)

if (result.success && result.data) {
  console.log(`Found ${result.data.totalResults} results for "${result.data.query}"`)

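  result.data.results.forEach(item => {
    // price is optional in the schema, so fall back when it is missing
    const price = item.price !== undefined ? `$${item.price}` : 'price unavailable'
    console.log(`${item.title} - ${price}`)
    console.log(item.url)
  })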
}

Nested Objects

Extract complex nested data structures:
const restaurantSchema = z.object({
  name: z.string(),
  rating: z.number(),
  priceLevel: z.string(),
  address: z.object({
    street: z.string(),
    city: z.string(),
    state: z.string(),
    zip: z.string()
  }),
  hours: z.object({
    monday: z.string(),
    tuesday: z.string(),
    wednesday: z.string(),
    thursday: z.string(),
    friday: z.string(),
    saturday: z.string(),
    sunday: z.string()
  }),
  reviews: z.array(z.object({
    author: z.string(),
    rating: z.number(),
    text: z.string(),
    date: z.string()
  }))
})

const result = await page.extract(
  'Extract complete restaurant information including address, hours, and recent reviews',
  restaurantSchema
)

Table Extraction

Extract data from HTML tables:
const tableSchema = z.object({
  headers: z.array(z.string()),
  rows: z.array(z.array(z.string()))
})

await page.goto('https://example.com/data')

const result = await page.extract(
  'Extract the pricing table data',
  tableSchema
)

if (result.success && result.data) {
  // Print as CSV
  console.log(result.data.headers.join(','))
  result.data.rows.forEach(row => {
    console.log(row.join(','))
  })
}
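Joining cells with a bare comma produces malformed CSV when a cell itself contains a comma, a double quote, or a line break, which is common in pricing tables ("$1,200"). A minimal sketch of RFC 4180 style quoting follows. The toCsvCell and toCsv helpers are hypothetical names introduced for this example, and the sample table is made up.

```typescript
// Quote a cell only when it contains a comma, double quote, or line break.
function toCsvCell(value: string): string {
  if (/[",\r\n]/.test(value)) {
    // Embedded double quotes are escaped by doubling them.
    return `"${value.replace(/"/g, '""')}"`
  }
  return value
}

// Serialize headers and rows (the shape of tableSchema above) into CSV text.
function toCsv(headers: string[], rows: string[][]): string {
  return [headers, ...rows]
    .map(row => row.map(toCsvCell).join(','))
    .join('\n')
}

const csv = toCsv(
  ['Plan', 'Price'],
  [['Starter', '$1,200'], ['Pro "Plus"', '$2,400']]
)
console.log(csv)
```

With a real extraction, the call would be toCsv(result.data.headers, result.data.rows) inside the success branch.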

Zod Schema Primer

Basic Types

import { z } from 'zod'

z.string()              // string
z.number()              // number
z.boolean()             // boolean
z.date()                // Date object
z.string().url()        // URL string
z.string().email()      // Email string
z.string().uuid()       // UUID string
z.literal('specific')   // Exact value
z.enum(['a', 'b', 'c']) // One of several values

Optional and Nullable

z.string().optional()         // string | undefined
z.string().nullable()         // string | null
z.string().nullish()          // string | null | undefined
z.string().default('hello')   // string with default value

Arrays and Objects

z.array(z.string())           // string[]
z.object({                    // { name: string, age: number }
  name: z.string(),
  age: z.number()
})

Validation

z.string().min(3)             // At least 3 characters
z.string().max(100)           // At most 100 characters
z.number().positive()         // Must be positive
z.number().int()              // Must be integer
z.number().min(0).max(100)    // Between 0 and 100

Error Handling

Handle validation errors gracefully:
const schema = z.object({
  price: z.number().positive(),
  email: z.string().email()
})

const result = await page.extract(
  'Extract product price and contact email',
  schema
)

if (result.success && result.data) {
  // Data is valid and typed
  console.log('Price:', result.data.price)
  console.log('Email:', result.data.email)
} else {
  // Extraction failed or validation failed
  console.error('Extraction error:', result.error)

  // See AI reasoning
  if (result.reasoning) {
    console.log('AI reasoning:', result.reasoning)
  }
}

Best Practices

Be Specific in Instructions

Clear instructions lead to better extraction results.
Good:
await page.extract(
  'Extract the product name, price in USD, and availability status from the product details section',
  productSchema
)
Bad:
await page.extract('Get the info', productSchema) // Too vague

Design Schemas Carefully

Match your schema to the actual data structure:
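// Option 1: if prices include currency symbols, keep them as strings
const priceTextSchema = z.string() // "$29.99"
// Not: z.number() (would fail validation)

// Option 2: extract as a string, then parse to a number
// (strips "$" and "," only; other currency symbols need their own handling)
const priceNumberSchema = z.string().transform(str =>
  parseFloat(str.replace(/[$,]/g, ''))
)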
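One caveat with a parseFloat-based transform is that unparseable text such as "Free" silently becomes NaN. The sketch below shows a stricter parser that returns null instead. parsePrice is a hypothetical helper, not part of Zod or this library, and it assumes US-style formatting (comma as thousands separator, dot as decimal point).

```typescript
// Hypothetical helper: parse a displayed price into a number, or null if it is not one.
// Assumes "," is a thousands separator and "." is the decimal point.
function parsePrice(raw: string): number | null {
  // Drop everything except digits, dots, and minus signs (currency symbols, commas, spaces).
  const cleaned = raw.replace(/[^0-9.\-]/g, '')
  if (cleaned === '') return null
  const value = Number(cleaned)
  // Number() yields NaN for leftovers like "10-20" or "1.2.3".
  return Number.isFinite(value) ? value : null
}

console.log(parsePrice('$1,299.99'))
console.log(parsePrice('Free'))
```

A function like this could be passed to z.string().transform(...) in place of the inline parseFloat call, which would make the resulting field type number | null.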

Handle Missing Data

Use optional fields for data that might not be present:
const schema = z.object({
  title: z.string(),
  // These might not always be present
  subtitle: z.string().optional(),
  author: z.string().optional(),
  rating: z.number().optional()
})

Test with Real Pages

Always test extraction with actual pages:
// Test extraction
const result = await page.extract(instruction, schema)

if (!result.success) {
  console.error('Extraction failed:', result.error)
  // Adjust instruction or schema
}

Common Use Cases

E-commerce Product Data

const productSchema = z.object({
  name: z.string(),
  brand: z.string(),
  price: z.number(),
  originalPrice: z.number().optional(),
  discount: z.number().optional(),
  rating: z.number(),
  reviewCount: z.number(),
  inStock: z.boolean(),
  images: z.array(z.string().url()),
  description: z.string()
})

const result = await page.extract(
  'Extract all product details',
  productSchema
)

Article Metadata

const articleSchema = z.object({
  headline: z.string(),
  subheadline: z.string().optional(),
  author: z.string(),
  publishDate: z.string(),
  readingTime: z.number(),
  tags: z.array(z.string()),
  summary: z.string()
})

const result = await page.extract(
  'Extract article metadata and summary',
  articleSchema
)

Contact Information

const contactSchema = z.object({
  name: z.string(),
  email: z.string().email().optional(),
  phone: z.string().optional(),
  address: z.string().optional(),
  website: z.string().url().optional(),
  socialMedia: z.object({
    twitter: z.string().optional(),
    linkedin: z.string().optional(),
    facebook: z.string().optional()
  }).optional()
})

const result = await page.extract(
  'Extract contact information from the page',
  contactSchema
)

Reviews and Ratings

const reviewsSchema = z.object({
  overallRating: z.number(),
  totalReviews: z.number(),
  reviews: z.array(z.object({
    author: z.string(),
    rating: z.number(),
    title: z.string(),
    text: z.string(),
    date: z.string(),
    helpful: z.number().optional()
  }))
})

const result = await page.extract(
  'Extract product reviews and ratings',
  reviewsSchema
)

Performance Tips

  1. Be specific - Clear instructions reduce processing time
  2. Use appropriate schemas - Include only the fields you need and avoid deeply nested structures you won’t use
  3. Extract once - Cache results instead of re-extracting
  4. Batch extraction - Extract multiple fields at once rather than separate calls
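Tip 3 can be sketched as a small in-memory cache keyed by whatever identifies an extraction, for example the page URL plus the instruction. This is an illustration under assumptions, not a library feature: cachedExtract and extractionCache are hypothetical names, and the cache stores promises so that concurrent callers share one in-flight request.

```typescript
// Hypothetical in-memory cache: key -> pending or completed extraction.
const extractionCache = new Map<string, Promise<unknown>>()

// Run `run` once per key; later calls with the same key reuse the stored promise.
function cachedExtract<R>(key: string, run: () => Promise<R>): Promise<R> {
  const hit = extractionCache.get(key)
  if (hit) return hit as Promise<R>

  const pending = run().catch(err => {
    extractionCache.delete(key) // do not keep rejected calls in the cache
    throw err
  })
  extractionCache.set(key, pending)
  return pending
}

// Usage (assumes `page`, `instruction`, and `schema` from the examples above):
// const result = await cachedExtract(
//   `${url}::${instruction}`,
//   () => page.extract(instruction, schema)
// )
```

Note that this sketch also caches results with success: false, since those resolve normally. Evicting such entries, or expiring entries when the page content changes, is left to the caller.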

Limitations

AI extraction has some limitations:
  • Requires API calls (adds latency and cost)
  • May not work on obfuscated or heavily JavaScript-rendered content
  • Accuracy depends on page structure and instruction clarity
  • Rate limits apply based on your AI provider
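Transient failures such as rate limiting are often handled by retrying with exponential backoff. The sketch below assumes failures are reported through success: false. How rate-limit errors actually surface is not specified here, and thrown errors simply propagate. extractWithRetry, backoffDelay, and the default delays are hypothetical choices for illustration.

```typescript
// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ... capped at maxMs.
function backoffDelay(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt))
}

const sleep = (ms: number): Promise<void> =>
  new Promise(resolve => setTimeout(resolve, ms))

// Hypothetical wrapper: re-runs `run` while it reports success: false,
// up to maxAttempts total calls, waiting longer between each attempt.
async function extractWithRetry<R extends { success: boolean }>(
  run: () => Promise<R>,
  maxAttempts = 3,
  baseMs = 500
): Promise<R> {
  let result = await run()
  for (let retry = 0; retry < maxAttempts - 1 && !result.success; retry++) {
    await sleep(backoffDelay(retry, baseMs))
    result = await run()
  }
  return result // the last result, successful or not
}

// Usage (assumes `page`, `instruction`, and `schema` from the examples above):
// const result = await extractWithRetry(() => page.extract(instruction, schema))
```

Each retry is another AI call, so the attempt limit directly bounds the extra latency and cost.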