Type-Safe Data Extraction

The page.extract() method uses AI to extract structured data from web pages. Pass a Zod schema to get validated, typed data back; without a schema, the result is a plain string in data.extraction.

Signature

// With schema (type-safe)
page.extract<T>(
  instruction: string,
  schema: z.ZodType<T>
): Promise<ExtractResult<T>>

// Without schema (string extraction)
page.extract(
  instruction: string
): Promise<ExtractResult<{ extraction: string }>>

Parameters:

instruction - Natural language description of what data to extract
schema (optional) - Zod schema for validation and type safety

Returns: Promise<ExtractResult<T>>

{
  success: boolean
  data?: T              // Typed based on your schema
  error?: string
  reasoning?: string    // AI's explanation
}
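
Because data is optional, every caller has to check both success and data before using the result. The sketch below restates the result shape as a TypeScript interface (written out from the listing above, so an assumption about the SDK's real type definitions) and adds a hypothetical unwrapExtract helper that returns the data or throws with the error and reasoning attached.

```typescript
// Result shape restated from the listing above; an assumption about the
// SDK's actual type definitions, not a copy of them.
interface ExtractResult<T> {
  success: boolean
  data?: T
  error?: string
  reasoning?: string
}

// Hypothetical helper (not part of the SDK): return the typed data,
// or throw an Error that carries the reported error and reasoning.
function unwrapExtract<T>(result: ExtractResult<T>): T {
  if (result.success && result.data !== undefined) {
    return result.data
  }
  const reason = result.reasoning ? ` (reasoning: ${result.reasoning})` : ''
  throw new Error(`Extraction failed: ${result.error ?? 'unknown error'}${reason}`)
}
```

A call such as unwrapExtract(await page.extract(instruction, schema)) would then yield the typed data directly, which suits scripts that prefer exceptions over branching on success.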

Installation

You’ll need to install Zod for schema-based extraction:

npm install zod

Basic Examples

Simple Text Extraction

Extract text without a schema:

await page.goto('https://example.com/article')

const result = await page.extract('Extract the article title')

if (result.success && result.data) {
  console.log(result.data.extraction)
  // "How to Build Better Software"
}

Single Object Extraction

Extract a structured object with type safety:

import { z } from 'zod'

const productSchema = z.object({
  name: z.string(),
  price: z.number(),
  inStock: z.boolean()
})

await page.goto('https://shop.example.com/product/123')

const result = await page.extract(
  'Extract the product information',
  productSchema
)

if (result.success && result.data) {
  console.log(result.data.name)      // string
  console.log(result.data.price)     // number
  console.log(result.data.inStock)   // boolean
  // TypeScript knows the exact types!
}

Intermediate Examples

Multiple Fields

Extract complex objects with many fields:

const userSchema = z.object({
  name: z.string(),
  email: z.string().email(),
  joinDate: z.string(),
  verified: z.boolean(),
  profileImage: z.string().url()
})

await page.goto('https://example.com/profile/johndoe')

const result = await page.extract(
  'Extract user profile information',
  userSchema
)

if (result.success && result.data) {
  const user = result.data
  console.log(`${user.name} (${user.email})`)
  console.log(`Joined: ${user.joinDate}`)
  console.log(`Verified: ${user.verified}`)
}

Optional Fields

Handle optional data with Zod:

const articleSchema = z.object({
  title: z.string(),
  author: z.string(),
  publishDate: z.string(),
  updatedDate: z.string().optional(),  // May not be present
  tags: z.array(z.string()).optional(),
  readTime: z.number().optional()
})

const result = await page.extract(
  'Extract article metadata',
  articleSchema
)
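
Optional fields are undefined when the page has no matching data, so code that consumes the result needs fallbacks. A minimal sketch, assuming the ArticleMeta interface below mirrors the type inferred from articleSchema; summarizeArticle is a hypothetical helper used only for illustration.

```typescript
// Mirrors the type inferred from articleSchema above, written out by hand.
interface ArticleMeta {
  title: string
  author: string
  publishDate: string
  updatedDate?: string
  tags?: string[]
  readTime?: number
}

// Hypothetical helper: build a one-line summary with a fallback for each
// optional field that is absent.
function summarizeArticle(article: ArticleMeta): string {
  const lastTouched = article.updatedDate ?? article.publishDate
  const tags = article.tags && article.tags.length > 0
    ? article.tags.join(', ')
    : 'untagged'
  const readTime = article.readTime !== undefined
    ? `${article.readTime} min read`
    : 'read time unknown'
  return `${article.title} by ${article.author} (${lastTouched}; ${tags}; ${readTime})`
}
```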

Advanced Examples

Array Extraction

Extract lists of items:

const searchResultsSchema = z.object({
  query: z.string(),
  results: z.array(z.object({
    title: z.string(),
    url: z.string().url(),
    description: z.string(),
    price: z.number().optional()
  })),
  totalResults: z.number()
})

await page.goto('https://example.com/search?q=laptop')

const result = await page.extract(
  'Extract all search results with their details',
  searchResultsSchema
)

if (result.success && result.data) {
  console.log(`Found ${result.data.totalResults} results for "${result.data.query}"`)

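  result.data.results.forEach(item => {
    // price is optional in the schema, so fall back when it is missing
    const price = item.price !== undefined ? `$${item.price}` : 'price unavailable'
    console.log(`${item.title} - ${price}`)
    console.log(item.url)
  })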
}

Nested Objects

Extract complex nested data structures:

const restaurantSchema = z.object({
  name: z.string(),
  rating: z.number(),
  priceLevel: z.string(),
  address: z.object({
    street: z.string(),
    city: z.string(),
    state: z.string(),
    zip: z.string()
  }),
  hours: z.object({
    monday: z.string(),
    tuesday: z.string(),
    wednesday: z.string(),
    thursday: z.string(),
    friday: z.string(),
    saturday: z.string(),
    sunday: z.string()
  }),
  reviews: z.array(z.object({
    author: z.string(),
    rating: z.number(),
    text: z.string(),
    date: z.string()
  }))
})

const result = await page.extract(
  'Extract complete restaurant information including address, hours, and recent reviews',
  restaurantSchema
)

Table Extraction

Extract data from HTML tables:

const tableSchema = z.object({
  headers: z.array(z.string()),
  rows: z.array(z.array(z.string()))
})

await page.goto('https://example.com/data')

const result = await page.extract(
  'Extract the pricing table data',
  tableSchema
)

if (result.success && result.data) {
  // Print as CSV
  console.log(result.data.headers.join(','))
  result.data.rows.forEach(row => {
    console.log(row.join(','))
  })
}
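
Joining cells with a bare comma produces malformed CSV whenever a cell contains a comma, a double quote, or a line break, which pricing tables often do ("$1,200"). A minimal sketch of the usual quoting convention (as described in RFC 4180); escapeCsvCell and toCsv are hypothetical helper names.

```typescript
// Quote a cell when it contains a comma, double quote, or line break,
// doubling any embedded double quotes.
function escapeCsvCell(cell: string): string {
  return /[",\r\n]/.test(cell) ? `"${cell.replace(/"/g, '""')}"` : cell
}

// Serialize the headers/rows shape used by tableSchema above into CSV text.
function toCsv(headers: string[], rows: string[][]): string {
  return [headers, ...rows]
    .map(row => row.map(escapeCsvCell).join(','))
    .join('\n')
}
```

With the result from the example above, console.log(toCsv(result.data.headers, result.data.rows)) would replace the manual join calls.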

Zod Schema Primer

Basic Types

import { z } from 'zod'

z.string()              // string
z.number()              // number
z.boolean()             // boolean
z.date()                // Date object
z.string().url()        // URL string
z.string().email()      // Email string
z.string().uuid()       // UUID string
z.literal('specific')   // Exact value
z.enum(['a', 'b', 'c']) // One of several values

Optional and Nullable

z.string().optional()         // string | undefined
z.string().nullable()         // string | null
z.string().nullish()          // string | null | undefined
z.string().default('hello')   // string with default value

Arrays and Objects

z.array(z.string())           // string[]
z.object({                    // { name: string, age: number }
  name: z.string(),
  age: z.number()
})

Validation

z.string().min(3)             // At least 3 characters
z.string().max(100)           // At most 100 characters
z.number().positive()         // Must be positive
z.number().int()              // Must be integer
z.number().min(0).max(100)    // Between 0 and 100

Error Handling

Handle validation errors gracefully:

const schema = z.object({
  price: z.number().positive(),
  email: z.string().email()
})

const result = await page.extract(
  'Extract product price and contact email',
  schema
)

if (result.success && result.data) {
  // Data is valid and typed
  console.log('Price:', result.data.price)
  console.log('Email:', result.data.email)
} else {
  // Extraction failed or validation failed
  console.error('Extraction error:', result.error)

  // See AI reasoning
  if (result.reasoning) {
    console.log('AI reasoning:', result.reasoning)
  }
}

Best Practices

Be Specific in Instructions

Clear instructions lead to better extraction results.

Good:

await page.extract(
  'Extract the product name, price in USD, and availability status from the product details section',
  productSchema
)

Bad:

await page.extract('Get the info', productSchema) // Too vague

Design Schemas Carefully

Match your schema to the actual data structure:

// If prices include currency symbols
const priceSchema = z.string() // "$29.99"
// Not: z.number() (would fail validation)

// Or extract and parse (separate name, since priceSchema is already declared)
const parsedPriceSchema = z.string().transform(str =>
  parseFloat(str.replace(/[$,]/g, ''))
)
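
One caveat with the parsing approach: parseFloat returns NaN for text such as "N/A", and a transform passes that through as a number. The sketch below shows a stricter parser that could be called from inside .transform(); parsePrice is a hypothetical helper and assumes US-style formatting (a $ prefix and , as the thousands separator).

```typescript
// Parse a US-formatted price string; return null when the text is not a
// plain decimal number after stripping "$", "," and whitespace.
function parsePrice(text: string): number | null {
  const cleaned = text.replace(/[$,\s]/g, '')
  // Require the whole string to be digits with an optional fraction,
  // so inputs like "N/A" or "12abc" are rejected instead of becoming NaN.
  if (!/^\d+(\.\d+)?$/.test(cleaned)) return null
  return Number(cleaned)
}
```

Returning null rather than NaN makes the missing-price case explicit, so callers can decide whether to skip the item or report it.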

Handle Missing Data

Use optional fields for data that might not be present:

const schema = z.object({
  title: z.string(),
  // These might not always be present
  subtitle: z.string().optional(),
  author: z.string().optional(),
  rating: z.number().optional()
})

Test with Real Pages

Always test extraction with actual pages:

// Test extraction
const result = await page.extract(instruction, schema)

if (!result.success) {
  console.error('Extraction failed:', result.error)
  // Adjust instruction or schema
}

Common Use Cases

E-commerce Product Data

const productSchema = z.object({
  name: z.string(),
  brand: z.string(),
  price: z.number(),
  originalPrice: z.number().optional(),
  discount: z.number().optional(),
  rating: z.number(),
  reviewCount: z.number(),
  inStock: z.boolean(),
  images: z.array(z.string().url()),
  description: z.string()
})

const result = await page.extract(
  'Extract all product details',
  productSchema
)

Article Metadata

const articleSchema = z.object({
  headline: z.string(),
  subheadline: z.string().optional(),
  author: z.string(),
  publishDate: z.string(),
  readingTime: z.number(),
  tags: z.array(z.string()),
  summary: z.string()
})

const result = await page.extract(
  'Extract article metadata and summary',
  articleSchema
)

Contact Information

const contactSchema = z.object({
  name: z.string(),
  email: z.string().email().optional(),
  phone: z.string().optional(),
  address: z.string().optional(),
  website: z.string().url().optional(),
  socialMedia: z.object({
    twitter: z.string().optional(),
    linkedin: z.string().optional(),
    facebook: z.string().optional()
  }).optional()
})

const result = await page.extract(
  'Extract contact information from the page',
  contactSchema
)

Reviews and Ratings

const reviewsSchema = z.object({
  overallRating: z.number(),
  totalReviews: z.number(),
  reviews: z.array(z.object({
    author: z.string(),
    rating: z.number(),
    title: z.string(),
    text: z.string(),
    date: z.string(),
    helpful: z.number().optional()
  }))
})

const result = await page.extract(
  'Extract product reviews and ratings',
  reviewsSchema
)

Performance Tips

Be specific - Clear instructions reduce processing time
Use appropriate schemas - Include only the fields and nesting you need
Extract once - Cache results instead of re-extracting
Batch extraction - Extract multiple fields in one call rather than in separate calls
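
The "extract once" tip can be sketched as a small in-memory cache keyed by URL and instruction. cachedExtract is a hypothetical helper, not part of the SDK. It stores the pending promise so concurrent callers share one request, and it evicts entries whose request rejects.

```typescript
// In-memory cache of pending or settled extraction promises.
const extractionCache = new Map<string, Promise<unknown>>()

// Hypothetical helper: run `extract` at most once per (url, instruction) pair.
function cachedExtract<R>(
  url: string,
  instruction: string,
  extract: () => Promise<R>
): Promise<R> {
  const key = JSON.stringify([url, instruction])
  const cached = extractionCache.get(key)
  if (cached) return cached as Promise<R>

  const pending = extract()
  extractionCache.set(key, pending)
  // Drop rejected requests so a later call can try again.
  pending.catch(() => extractionCache.delete(key))
  return pending
}
```

A call might look like cachedExtract(url, instruction, () => page.extract(instruction, schema)), where url is the address passed to page.goto(). A result with success: false still resolves, so it stays cached; delete the entry yourself if you want to retry it.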

Limitations

AI extraction has some limitations:

Requires API calls (adds latency and cost)
May not work on obfuscated or heavily JavaScript-rendered content
Accuracy depends on page structure and instruction clarity
Rate limits apply based on your AI provider
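
When a provider rate limit is hit, retrying with increasing delays is a common mitigation. The sketch below is a generic pattern rather than documented SDK behavior: backoffDelayMs and retryWithBackoff are hypothetical helpers, the delay values are arbitrary, and it assumes a rate-limited call surfaces as a rejected promise (the SDK may instead resolve with success: false, in which case the operation would need to throw on that).

```typescript
// Exponential backoff: baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt)
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

// Retry an async operation, waiting longer after each rejected attempt.
async function retryWithBackoff<R>(
  operation: () => Promise<R>,
  maxAttempts = 4
): Promise<R> {
  let lastError: unknown
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation()
    } catch (error) {
      lastError = error
      // No need to wait after the final attempt.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt))
    }
  }
  throw lastError
}
```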

Related

Natural Language Actions - Perform actions with page.act()
AI Setup - Configure AI agents and providers
Best Practices - Effective AI automation patterns
JavaScript Evaluation - Manual data extraction with evaluate()
