The page.extract() method uses AI to extract structured data from web pages. You can optionally provide a Zod schema for type-safe extraction with automatic validation.
Signature
// With schema (type-safe)
page.extract<T>(
instruction: string,
schema: z.ZodType<T>
): Promise<ExtractResult<T>>
// Without schema (string extraction)
page.extract(
instruction: string
): Promise<ExtractResult<{ extraction: string }>>
Parameters:
instruction - Natural language description of what data to extract
schema (optional) - Zod schema for validation and type safety
Returns: Promise<ExtractResult<T>>
{
success: boolean
data?: T // Typed based on your schema
error?: string
reasoning?: string // AI's explanation
}
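Because data is optional on the result, callers need to check both success and data before reading any field. A small helper can centralize that check. The sketch below re-declares the ExtractResult shape by hand from the fields listed above rather than importing it, and unwrapExtract is a hypothetical helper name, not part of the library.

```typescript
// Local copy of the result shape documented above (assumed, not imported).
interface ExtractResult<T> {
  success: boolean
  data?: T
  error?: string
  reasoning?: string
}

// Hypothetical helper: return the extracted data or throw with the reported error.
function unwrapExtract<T>(result: ExtractResult<T>): T {
  if (!result.success || result.data === undefined) {
    const reason = result.reasoning ? ` (reasoning: ${result.reasoning})` : ''
    throw new Error(`Extraction failed: ${result.error ?? 'unknown error'}${reason}`)
  }
  return result.data
}
```

Whether throwing is preferable to branching on success depends on the caller; the helper mainly saves repeating the same guard in every call site.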
Installation
You’ll need to install Zod for schema-based extraction, for example by running npm install zod.
Basic Examples
Extract text without a schema:
await page.goto('https://example.com/article')
const result = await page.extract('Extract the article title')
if (result.success && result.data) {
console.log(result.data.extraction)
// "How to Build Better Software"
}
Extract a structured object with type safety:
import { z } from 'zod'
const productSchema = z.object({
name: z.string(),
price: z.number(),
inStock: z.boolean()
})
await page.goto('https://shop.example.com/product/123')
const result = await page.extract(
'Extract the product information',
productSchema
)
if (result.success && result.data) {
console.log(result.data.name) // string
console.log(result.data.price) // number
console.log(result.data.inStock) // boolean
// TypeScript knows the exact types!
}
Multiple Fields
Extract complex objects with many fields:
const userSchema = z.object({
name: z.string(),
email: z.string().email(),
joinDate: z.string(),
verified: z.boolean(),
profileImage: z.string().url()
})
await page.goto('https://example.com/profile/johndoe')
const result = await page.extract(
'Extract user profile information',
userSchema
)
if (result.success && result.data) {
const user = result.data
console.log(`${user.name} (${user.email})`)
console.log(`Joined: ${user.joinDate}`)
console.log(`Verified: ${user.verified}`)
}
Optional Fields
Handle optional data with Zod:
const articleSchema = z.object({
title: z.string(),
author: z.string(),
publishDate: z.string(),
updatedDate: z.string().optional(), // May not be present
tags: z.array(z.string()).optional(),
readTime: z.number().optional()
})
const result = await page.extract(
'Extract article metadata',
articleSchema
)
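Fields marked optional are simply absent (undefined) when the page does not provide them, so code that consumes the result needs fallbacks. The following is a minimal sketch: ArticleMeta is written by hand to mirror the schema above, and describeArticle is a hypothetical formatter used only for illustration.

```typescript
// Hand-written type mirroring the article schema above.
type ArticleMeta = {
  title: string
  author: string
  publishDate: string
  updatedDate?: string
  tags?: string[]
  readTime?: number
}

// Hypothetical formatter that tolerates missing optional fields.
function describeArticle(meta: ArticleMeta): string {
  const tags = meta.tags && meta.tags.length > 0 ? meta.tags.join(', ') : 'none'
  const readTime = meta.readTime !== undefined ? `${meta.readTime} min` : 'unknown'
  // Fall back to the publish date when no update date was extracted
  const updated = meta.updatedDate ?? meta.publishDate
  return `${meta.title} by ${meta.author} | updated ${updated} | tags: ${tags} | read time: ${readTime}`
}
```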
Advanced Examples
Extract lists of items:
const searchResultsSchema = z.object({
query: z.string(),
results: z.array(z.object({
title: z.string(),
url: z.string().url(),
description: z.string(),
price: z.number().optional()
})),
totalResults: z.number()
})
await page.goto('https://example.com/search?q=laptop')
const result = await page.extract(
'Extract all search results with their details',
searchResultsSchema
)
if (result.success && result.data) {
console.log(`Found ${result.data.totalResults} results for "${result.data.query}"`)
result.data.results.forEach(item => {
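// price is optional in the schema, so avoid printing "$undefined"
const price = item.price !== undefined ? `$${item.price}` : 'n/a'
console.log(`${item.title} - ${price}`)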
console.log(item.url)
})
}
Nested Objects
Extract complex nested data structures:
const restaurantSchema = z.object({
name: z.string(),
rating: z.number(),
priceLevel: z.string(),
address: z.object({
street: z.string(),
city: z.string(),
state: z.string(),
zip: z.string()
}),
hours: z.object({
monday: z.string(),
tuesday: z.string(),
wednesday: z.string(),
thursday: z.string(),
friday: z.string(),
saturday: z.string(),
sunday: z.string()
}),
reviews: z.array(z.object({
author: z.string(),
rating: z.number(),
text: z.string(),
date: z.string()
}))
})
const result = await page.extract(
'Extract complete restaurant information including address, hours, and recent reviews',
restaurantSchema
)
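The nested result can be post-processed like any plain object. As a minimal sketch, the hypothetical helper below averages the ratings in the extracted reviews array. Comparing that average with the top-level rating can serve as a rough sanity check, although the two may legitimately differ when the page shows only recent reviews. The Review type is written by hand to mirror the schema above.

```typescript
// Hand-written type mirroring one entry of the reviews array.
type Review = { author: string; rating: number; text: string; date: string }

// Average of the review ratings, or null when there are no reviews.
function averageRating(reviews: Review[]): number | null {
  if (reviews.length === 0) return null
  const total = reviews.reduce((sum, review) => sum + review.rating, 0)
  return total / reviews.length
}
```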
Extract data from HTML tables:
const tableSchema = z.object({
headers: z.array(z.string()),
rows: z.array(z.array(z.string()))
})
await page.goto('https://example.com/data')
const result = await page.extract(
'Extract the pricing table data',
tableSchema
)
if (result.success && result.data) {
// Print as CSV
console.log(result.data.headers.join(','))
result.data.rows.forEach(row => {
console.log(row.join(','))
})
}
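Joining cells with a bare comma produces broken CSV as soon as a cell itself contains a comma, a double quote, or a line break (for example a price like "$1,200"). A minimal sketch of the usual CSV quoting rules follows; csvCell and toCsv are hypothetical helper names.

```typescript
// Quote a cell if it contains a comma, double quote, or line break,
// doubling any embedded double quotes.
function csvCell(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value
}

// Build CSV text from a header row and data rows.
function toCsv(headers: string[], rows: string[][]): string {
  return [headers, ...rows]
    .map(row => row.map(csvCell).join(','))
    .join('\n')
}
```

With the extraction result above, toCsv(result.data.headers, result.data.rows) would replace the manual join calls.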
Zod Schema Primer
Basic Types
import { z } from 'zod'
z.string() // string
z.number() // number
z.boolean() // boolean
z.date() // Date object
z.string().url() // URL string
z.string().email() // Email string
z.string().uuid() // UUID string
z.literal('specific') // Exact value
z.enum(['a', 'b', 'c']) // One of several values
Optional and Nullable
z.string().optional() // string | undefined
z.string().nullable() // string | null
z.string().nullish() // string | null | undefined
z.string().default('hello') // string with default value
Arrays and Objects
z.array(z.string()) // string[]
z.object({ // { name: string, age: number }
name: z.string(),
age: z.number()
})
Validation
z.string().min(3) // At least 3 characters
z.string().max(100) // At most 100 characters
z.number().positive() // Must be positive
z.number().int() // Must be integer
z.number().min(0).max(100) // Between 0 and 100
Error Handling
Handle validation errors gracefully:
const schema = z.object({
price: z.number().positive(),
email: z.string().email()
})
const result = await page.extract(
'Extract product price and contact email',
schema
)
if (result.success && result.data) {
// Data is valid and typed
console.log('Price:', result.data.price)
console.log('Email:', result.data.email)
} else {
// Extraction failed or validation failed
console.error('Extraction error:', result.error)
// See AI reasoning
if (result.reasoning) {
console.log('AI reasoning:', result.reasoning)
}
}
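Since extraction is AI-driven, a failed attempt may succeed when repeated. One possible pattern is a small retry wrapper. This is a sketch under assumptions: extractWithRetry is a hypothetical helper, the ExtractResult shape is re-declared by hand, and the try/catch assumes the call might reject as well as return success: false.

```typescript
// Local copy of the documented result shape (assumed, not imported).
interface ExtractResult<T> {
  success: boolean
  data?: T
  error?: string
  reasoning?: string
}

// Hypothetical wrapper: call `run` up to `attempts` times until it succeeds.
async function extractWithRetry<T>(
  run: () => Promise<ExtractResult<T>>,
  attempts = 3
): Promise<ExtractResult<T>> {
  let last: ExtractResult<T> = { success: false, error: 'not attempted' }
  for (let i = 0; i < attempts; i++) {
    try {
      last = await run()
    } catch (err) {
      // Treat a thrown error like a failed result
      last = { success: false, error: err instanceof Error ? err.message : String(err) }
    }
    if (last.success && last.data !== undefined) return last
  }
  // All attempts failed: return the last failure so the caller can inspect it
  return last
}
```

With the page and schema from the example above, a call might look like extractWithRetry(() => page.extract('Extract product price and contact email', schema)). Each retry is another AI call, so a small attempt count keeps latency and cost bounded.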
Best Practices
Be Specific in Instructions
Clear, specific instructions lead to better extraction results.
Good:
await page.extract(
'Extract the product name, price in USD, and availability status from the product details section',
productSchema
)
Bad:
await page.extract('Get the info', productSchema) // Too vague
Design Schemas Carefully
Match your schema to the actual data structure:
// If prices include currency symbols, extract them as strings
const rawPriceSchema = z.string() // "$29.99"
// Not: z.number() (would fail validation)
// Or extract as a string and parse it into a number
const parsedPriceSchema = z.string().transform(str =>
parseFloat(str.replace(/[$,]/g, ''))
)
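One caveat with parsing: parseFloat returns NaN for text such as "Call for price", and NaN would then flow through as if it were a valid number. A standalone parser that reports failure explicitly is sketched below. parsePrice is a hypothetical helper, and it assumes US-style formatting (comma as thousands separator, dot as decimal point).

```typescript
// Parse a display price such as "$1,299.99" into a number.
// Returns null when no number can be read from the text.
function parsePrice(text: string): number | null {
  // Keep only digits, the decimal point, and a minus sign
  const cleaned = text.replace(/[^0-9.-]/g, '')
  const value = Number.parseFloat(cleaned)
  return Number.isNaN(value) ? null : value
}
```

A function like this could be passed to Zod's transform in place of the inline arrow function, giving a number | null field instead of a possibly NaN number.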
Handle Missing Data
Use optional fields for data that might not be present:
const schema = z.object({
title: z.string(),
// These might not always be present
subtitle: z.string().optional(),
author: z.string().optional(),
rating: z.number().optional()
})
Test with Real Pages
Always test extraction with actual pages:
// Test extraction
const result = await page.extract(instruction, schema)
if (!result.success) {
console.error('Extraction failed:', result.error)
// Adjust instruction or schema
}
Common Use Cases
E-commerce Product Data
const productSchema = z.object({
name: z.string(),
brand: z.string(),
price: z.number(),
originalPrice: z.number().optional(),
discount: z.number().optional(),
rating: z.number(),
reviewCount: z.number(),
inStock: z.boolean(),
images: z.array(z.string().url()),
description: z.string()
})
const result = await page.extract(
'Extract all product details',
productSchema
)
Article Metadata
const articleSchema = z.object({
headline: z.string(),
subheadline: z.string().optional(),
author: z.string(),
publishDate: z.string(),
readingTime: z.number(),
tags: z.array(z.string()),
summary: z.string()
})
const result = await page.extract(
'Extract article metadata and summary',
articleSchema
)
Contact Information
const contactSchema = z.object({
name: z.string(),
email: z.string().email().optional(),
phone: z.string().optional(),
address: z.string().optional(),
website: z.string().url().optional(),
socialMedia: z.object({
twitter: z.string().optional(),
linkedin: z.string().optional(),
facebook: z.string().optional()
}).optional()
})
const result = await page.extract(
'Extract contact information from the page',
contactSchema
)
Reviews and Ratings
const reviewsSchema = z.object({
overallRating: z.number(),
totalReviews: z.number(),
reviews: z.array(z.object({
author: z.string(),
rating: z.number(),
title: z.string(),
text: z.string(),
date: z.string(),
helpful: z.number().optional()
}))
})
const result = await page.extract(
'Extract product reviews and ratings',
reviewsSchema
)
Performance Tips
- Be specific - Clear instructions reduce processing time
- Use appropriate schemas - Don’t over-complicate schemas
- Extract once - Cache results instead of re-extracting
- Batch extraction - Extract multiple fields at once rather than separate calls
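The "extract once" tip can be sketched as a small in-memory cache keyed by page URL and instruction. Everything here is an assumption for illustration: cachedExtract and extractionCache are hypothetical names, and the url argument is whatever URL you passed to page.goto.

```typescript
// Hypothetical in-memory cache keyed by page URL and instruction.
const extractionCache = new Map<string, Promise<unknown>>()

function cachedExtract<R>(
  url: string,
  instruction: string,
  run: () => Promise<R>
): Promise<R> {
  const key = `${url}\n${instruction}`
  const hit = extractionCache.get(key)
  if (hit) return hit as Promise<R>
  const pending = run().catch(err => {
    // Do not keep rejected calls in the cache, so a later call can try again
    extractionCache.delete(key)
    throw err
  })
  extractionCache.set(key, pending)
  return pending
}
```

Storing the promise rather than the resolved value means concurrent callers share one in-flight extraction. Note that this sketch also caches results with success: false, and it never expires entries, so a long-running process would likely want an eviction policy.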
Limitations
AI extraction has some limitations:
- Requires API calls (adds latency and cost)
- May not work on obfuscated or heavily JavaScript-rendered content
- Accuracy depends on page structure and instruction clarity
- Rate limits apply based on your AI provider