The page.extract() method uses AI to extract structured data from web pages. You can optionally provide a Zod schema for type-safe extraction with automatic validation.
Signature
// With schema (type-safe)
page.extract<T>(
  instruction: string,
  schema: z.ZodType<T>
): Promise<ExtractResult<T>>
// Without schema (string extraction)
page.extract(
  instruction: string
): Promise<ExtractResult<{ extraction: string }>>
Parameters:
instruction - Natural language description of what data to extract
schema (optional) - Zod schema for validation and type safety
Returns: Promise<ExtractResult<T>>
{
  success: boolean
  data?: T // Typed based on your schema
  error?: string
  reasoning?: string // AI's explanation
}
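Because data is optional even on success, callers usually need a guard before reading it. The sketch below restates the result shape from the fields listed above and adds unwrapExtract, a hypothetical helper (not part of the library) that either returns the data or throws with the error and any reasoning attached. The sample result is hand-written and stands in for real page.extract() output.

```typescript
// Assumed shape, restated from the documented return type above.
interface ExtractResult<T> {
  success: boolean
  data?: T
  error?: string
  reasoning?: string
}

// Hypothetical helper: return the data, or throw with the error and,
// when present, the AI's reasoning appended.
function unwrapExtract<T>(result: ExtractResult<T>): T {
  if (result.success && result.data !== undefined) {
    return result.data
  }
  const reason = result.reasoning ? ` (reasoning: ${result.reasoning})` : ''
  throw new Error(`${result.error ?? 'Extraction failed'}${reason}`)
}

// Hand-written result standing in for a real page.extract() call.
const ok: ExtractResult<{ extraction: string }> = {
  success: true,
  data: { extraction: 'How to Build Better Software' }
}
console.log(unwrapExtract(ok).extraction)
```

Throwing keeps the calling code linear. If you prefer branching, the same check works inline as if (result.success && result.data).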
Installation
You'll need to install Zod for schema-based extraction, for example with npm:
npm install zod
Basic Examples
Extract text without a schema:
await page.goto('https://example.com/article')
const result = await page.extract('Extract the article title')
// data is optional in ExtractResult, so check it before reading
if (result.success && result.data) {
  console.log(result.data.extraction)
  // "How to Build Better Software"
}
Extract a structured object with type safety:
import { z } from 'zod'
const productSchema = z.object({
  name: z.string(),
  price: z.number(),
  inStock: z.boolean()
})
await page.goto('https://shop.example.com/product/123')
const result = await page.extract(
  'Extract the product information',
  productSchema
)
if (result.success && result.data) {
  console.log(result.data.name) // string
  console.log(result.data.price) // number
  console.log(result.data.inStock) // boolean
  // TypeScript knows the exact types!
}
Multiple Fields
Extract complex objects with many fields:
const userSchema = z.object({
  name: z.string(),
  email: z.string().email(),
  joinDate: z.string(),
  verified: z.boolean(),
  profileImage: z.string().url()
})
await page.goto('https://example.com/profile/johndoe')
const result = await page.extract(
  'Extract user profile information',
  userSchema
)
if (result.success && result.data) {
  const user = result.data
  console.log(`${user.name} (${user.email})`)
  console.log(`Joined: ${user.joinDate}`)
  console.log(`Verified: ${user.verified}`)
}
Optional Fields
Handle optional data with Zod:
const articleSchema = z.object({
  title: z.string(),
  author: z.string(),
  publishDate: z.string(),
  updatedDate: z.string().optional(), // May not be present
  tags: z.array(z.string()).optional(),
  readTime: z.number().optional()
})
const result = await page.extract(
  'Extract article metadata',
  articleSchema
)
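Optional fields come back as undefined when the page lacks them, so downstream code needs fallbacks. The sketch below is dependency-free: the Article interface mirrors what articleSchema above would infer, and describeArticle is a hypothetical helper that applies display defaults. The sample object is hand-written, not real extraction output.

```typescript
// Mirrors the type inferred from articleSchema; optional keys may be undefined.
interface Article {
  title: string
  author: string
  publishDate: string
  updatedDate?: string
  tags?: string[]
  readTime?: number
}

// Hypothetical helper: substitute defaults for fields that were not extracted.
function describeArticle(article: Article): string {
  const tags = article.tags ?? []
  const updated = article.updatedDate ?? article.publishDate
  const readTime =
    article.readTime !== undefined ? `${article.readTime} min read` : 'read time unknown'
  const tagText = tags.length > 0 ? tags.join(', ') : 'none'
  return `${article.title} by ${article.author}, updated ${updated}, ${readTime}, tags: ${tagText}`
}

// Hand-written sample with every optional field missing.
const sample: Article = { title: 'Intro', author: 'Ada', publishDate: '2024-01-01' }
console.log(describeArticle(sample))
```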
Advanced Examples
Extract lists of items:
const searchResultsSchema = z.object({
  query: z.string(),
  results: z.array(z.object({
    title: z.string(),
    url: z.string().url(),
    description: z.string(),
    price: z.number().optional()
  })),
  totalResults: z.number()
})
await page.goto('https://example.com/search?q=laptop')
const result = await page.extract(
  'Extract all search results with their details',
  searchResultsSchema
)
if (result.success && result.data) {
  console.log(`Found ${result.data.totalResults} results for "${result.data.query}"`)
  result.data.results.forEach(item => {
    // price is optional, so avoid printing "$undefined"
    const price = item.price !== undefined ? `$${item.price}` : 'price not listed'
    console.log(`${item.title} - ${price}`)
    console.log(item.url)
  })
}
Nested Objects
Extract complex nested data structures:
const restaurantSchema = z.object({
  name: z.string(),
  rating: z.number(),
  priceLevel: z.string(),
  address: z.object({
    street: z.string(),
    city: z.string(),
    state: z.string(),
    zip: z.string()
  }),
  hours: z.object({
    monday: z.string(),
    tuesday: z.string(),
    wednesday: z.string(),
    thursday: z.string(),
    friday: z.string(),
    saturday: z.string(),
    sunday: z.string()
  }),
  reviews: z.array(z.object({
    author: z.string(),
    rating: z.number(),
    text: z.string(),
    date: z.string()
  }))
})
const result = await page.extract(
  'Extract complete restaurant information including address, hours, and recent reviews',
  restaurantSchema
)
Extract data from HTML tables:
const tableSchema = z.object({
  headers: z.array(z.string()),
  rows: z.array(z.array(z.string()))
})
await page.goto('https://example.com/data')
const result = await page.extract(
  'Extract the pricing table data',
  tableSchema
)
if (result.success && result.data) {
  // Print as CSV
  console.log(result.data.headers.join(','))
  result.data.rows.forEach(row => {
    console.log(row.join(','))
  })
}
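Joining cells with a bare comma works for simple tables, but pricing data often contains commas ("$1,000") or quotes, which would corrupt the CSV. A minimal sketch of a safer writer follows, quoting cells in the style of RFC 4180. The names csvCell and toCsv are hypothetical, and the table literal is hand-written in the shape tableSchema describes.

```typescript
// Quote a cell when it contains a comma, double quote, or line break,
// doubling any embedded double quotes.
function csvCell(value: string): string {
  if (/[",\r\n]/.test(value)) {
    return `"${value.replace(/"/g, '""')}"`
  }
  return value
}

// Serialize a header row plus data rows into one CSV string.
function toCsv(headers: string[], rows: string[][]): string {
  return [headers, ...rows]
    .map(row => row.map(csvCell).join(','))
    .join('\n')
}

// Hand-written data in the { headers, rows } shape of tableSchema.
const table = {
  headers: ['Plan', 'Price'],
  rows: [['Starter', '$1,000'], ['Pro "Plus"', '$2,500']]
}
console.log(toCsv(table.headers, table.rows))
```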
Zod Schema Primer
Basic Types
import { z } from 'zod'
z.string() // string
z.number() // number
z.boolean() // boolean
z.date() // Date object
z.string().url() // URL string
z.string().email() // Email string
z.string().uuid() // UUID string
z.literal('specific') // Exact value
z.enum(['a', 'b', 'c']) // One of several values
Optional and Nullable
z.string().optional() // string | undefined
z.string().nullable() // string | null
z.string().nullish() // string | null | undefined
z.string().default('hello') // string with default value
Arrays and Objects
z.array(z.string()) // string[]
z.object({ // { name: string, age: number }
  name: z.string(),
  age: z.number()
})
Validation
z.string().min(3) // At least 3 characters
z.string().max(100) // At most 100 characters
z.number().positive() // Must be positive
z.number().int() // Must be integer
z.number().min(0).max(100) // Between 0 and 100
Error Handling
Handle validation errors gracefully:
const schema = z.object({
  price: z.number().positive(),
  email: z.string().email()
})
const result = await page.extract(
  'Extract product price and contact email',
  schema
)
if (result.success && result.data) {
  // Data is valid and typed
  console.log('Price:', result.data.price)
  console.log('Email:', result.data.email)
} else {
  // Extraction failed or validation failed
  console.error('Extraction error:', result.error)
  // See AI reasoning
  if (result.reasoning) {
    console.log('AI reasoning:', result.reasoning)
  }
}
Best Practices
Be Specific in Instructions
Clear instructions lead to better extraction results.
Good:
await page.extract(
  'Extract the product name, price in USD, and availability status from the product details section',
  productSchema
)
Bad:
await page.extract('Get the info', productSchema) // Too vague
Design Schemas Carefully
Match your schema to the actual data structure:
// If prices include currency symbols, extract them as strings
const priceSchema = z.string() // "$29.99"
// Not: z.number() (would fail validation)
// Or extract as a string and parse it into a number
// (named differently so both schemas can coexist in one file)
const parsedPriceSchema = z.string().transform(str =>
  parseFloat(str.replace(/[$,]/g, ''))
)
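One caveat with the parsing approach: parseFloat returns NaN for text such as "Call for price", and the transform would pass that through silently. The sketch below shows the parsing step as a standalone function that returns null instead, so it can be tested without Zod. parsePrice is a hypothetical helper name, and it assumes US-style prices with "$" and comma thousands separators.

```typescript
// Hypothetical helper: strip "$", commas, and whitespace, then parse.
// Returns null instead of NaN when no usable number remains.
function parsePrice(raw: string): number | null {
  const cleaned = raw.replace(/[$,\s]/g, '')
  // Number('') is 0, so an empty string must be rejected explicitly.
  if (cleaned === '') return null
  const value = Number(cleaned)
  return Number.isFinite(value) ? value : null
}

console.log(parsePrice('$29.99'))
console.log(parsePrice('$1,299.00'))
console.log(parsePrice('Call for price'))
```

A function like this could be called inside a Zod transform, or applied after extraction when the price is kept as a plain string.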
Handle Missing Data
Use optional fields for data that might not be present:
const schema = z.object({
  title: z.string(),
  // These might not always be present
  subtitle: z.string().optional(),
  author: z.string().optional(),
  rating: z.number().optional()
})
Test with Real Pages
Always test extraction with actual pages:
// Test extraction
const result = await page.extract(instruction, schema)
if (!result.success) {
  console.error('Extraction failed:', result.error)
  // Adjust instruction or schema
}
Common Use Cases
E-commerce Product Data
const productSchema = z.object({
  name: z.string(),
  brand: z.string(),
  price: z.number(),
  originalPrice: z.number().optional(),
  discount: z.number().optional(),
  rating: z.number(),
  reviewCount: z.number(),
  inStock: z.boolean(),
  images: z.array(z.string().url()),
  description: z.string()
})
const result = await page.extract(
  'Extract all product details',
  productSchema
)
Article Metadata
const articleSchema = z.object({
  headline: z.string(),
  subheadline: z.string().optional(),
  author: z.string(),
  publishDate: z.string(),
  readingTime: z.number(),
  tags: z.array(z.string()),
  summary: z.string()
})
const result = await page.extract(
  'Extract article metadata and summary',
  articleSchema
)
Contact Information
const contactSchema = z.object({
  name: z.string(),
  email: z.string().email().optional(),
  phone: z.string().optional(),
  address: z.string().optional(),
  website: z.string().url().optional(),
  socialMedia: z.object({
    twitter: z.string().optional(),
    linkedin: z.string().optional(),
    facebook: z.string().optional()
  }).optional()
})
const result = await page.extract(
  'Extract contact information from the page',
  contactSchema
)
Reviews and Ratings
const reviewsSchema = z.object({
  overallRating: z.number(),
  totalReviews: z.number(),
  reviews: z.array(z.object({
    author: z.string(),
    rating: z.number(),
    title: z.string(),
    text: z.string(),
    date: z.string(),
    helpful: z.number().optional()
  }))
})
const result = await page.extract(
  'Extract product reviews and ratings',
  reviewsSchema
)
Performance Tips
Be specific - Clear instructions reduce processing time
Use appropriate schemas - Don’t over-complicate schemas
Extract once - Cache results instead of re-extracting
Batch extraction - Extract multiple fields at once rather than separate calls
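The "extract once" tip can be sketched as a small in-memory cache keyed by URL and instruction. Everything here is an assumption for illustration: cachedExtract is a hypothetical wrapper, and the run callback stands in for a call such as () => page.extract(instruction, schema). A fake extractor is used so the example runs without a browser.

```typescript
// In-flight or completed extractions, keyed by URL plus instruction.
const extractionCache = new Map<string, Promise<unknown>>()

// Hypothetical wrapper: reuse the stored promise for a repeated key,
// otherwise start the extraction and remember it.
function cachedExtract<T>(
  url: string,
  instruction: string,
  run: () => Promise<T>
): Promise<T> {
  const key = `${url}\n${instruction}`
  const hit = extractionCache.get(key)
  if (hit) return hit as Promise<T>
  const pending = run()
  extractionCache.set(key, pending)
  // Forget rejected attempts so a later call can try again.
  pending.catch(() => extractionCache.delete(key))
  return pending
}

// Fake extractor standing in for page.extract(); counts how often it runs.
let calls = 0
const fakeExtract = async () => {
  calls += 1
  return { success: true, data: { extraction: 'Example title' } }
}

const first = cachedExtract('https://example.com/article', 'Extract the article title', fakeExtract)
const second = cachedExtract('https://example.com/article', 'Extract the article title', fakeExtract)
console.log(first === second, calls)
```

Storing the promise rather than the resolved value means two concurrent requests for the same key share one extraction. Note that this sketch also caches results with success: false; a real cache might skip those and would likely need expiry for pages that change.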
Limitations
AI extraction has some limitations:
Requires API calls (adds latency and cost)
May not work on obfuscated or heavily JavaScript-rendered content
Accuracy depends on page structure and instruction clarity
Rate limits apply based on your AI provider
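For the rate-limit case, one common mitigation is to retry with exponential backoff. The sketch below is an illustration under assumptions, not library behavior: backoffDelayMs and extractWithRetry are hypothetical names, the default delays are arbitrary, and run stands in for a page.extract() call. It retries on any unsuccessful result, which is cruder than inspecting the error for a rate-limit signal.

```typescript
// Exponential backoff: baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt))
}

// Hypothetical wrapper: call `run` up to maxAttempts times, waiting
// longer between attempts while the result reports failure.
async function extractWithRetry<T extends { success: boolean }>(
  run: () => Promise<T>,
  maxAttempts = 3
): Promise<T> {
  let result = await run()
  for (let attempt = 0; attempt < maxAttempts - 1 && !result.success; attempt++) {
    await new Promise(resolve => setTimeout(resolve, backoffDelayMs(attempt)))
    result = await run()
  }
  return result
}

// Delays used before the first four retries with the default settings.
console.log([0, 1, 2, 3].map(attempt => backoffDelayMs(attempt)))
```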