Example of Markdown Imported.

This is an example of how our plugin creates WordPress content from real Markdown files. The layout, CSS, and anchors are all generated by our 50 KB plugin.

The Markdown below documents how to build a web scraping automation. If you need help with that, or if this document reads like a foreign language, we are happy to discuss consulting.

The following is all generated from our plugin.


Playwright Multi-Purpose Automation

This document provides comprehensive guidance for using Playwright as an all-purpose automation tool for PDF generation, HTML saving, and Server-Side Rendering (SSR) cache with multi-architecture support.

Architecture Compatibility

⚠️ CRITICAL: All services must support both Apple Silicon (ARM64) and Intel (x86/x64) architectures across Node.js versions.

📖 See: ARCHITECTURE COMPATIBILITY.md for detailed multi-architecture requirements, Docker configurations, and testing procedures.

Node.js Dependencies

Core Requirements

  • Node.js Version: 24 (Latest LTS), using the node:24-slim base image; the engines field in package.json accepts 18.0.0 and later
  • Playwright: ^1.40.0 (1.40.0 or any later 1.x release)
  • Architecture: Auto-detects and downloads correct Chromium binaries

Core Playwright Service Dependencies

{
  "dependencies": {
    "express": "^4.18.2",
    "bullmq": "^5.58.5",
    "redis": "^5.8.2",
    "playwright": "^1.40.0",
    "mime-types": "^2.1.35",
    "compression": "^1.7.4"
  },
  "devDependencies": {
    "nodemon": "^3.0.0"
  },
  "engines": {
    "node": ">=18.0.0"
  }
}
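
npm treats the `engines` field as advisory unless `engine-strict` is enabled, so a service may also want to check the running Node.js version at startup. The following is a minimal sketch, assuming the `>=18.0.0` floor from the package.json above; `parseMajor` and `assertNodeVersion` are hypothetical helper names, not part of any library.

```javascript
// Hypothetical startup guard: fail fast when the runtime is older than
// the minimum declared in package.json ("engines": { "node": ">=18.0.0" }).
const MIN_NODE_MAJOR = 18;

function parseMajor(version) {
  // Accepts "24.1.0" or "v24.1.0" and returns the major version number.
  const match = /^v?(\d+)\./.exec(version);
  if (!match) throw new Error(`Unrecognized Node.js version: ${version}`);
  return Number(match[1]);
}

function assertNodeVersion(version = process.versions.node, min = MIN_NODE_MAJOR) {
  const major = parseMajor(version);
  if (major < min) {
    throw new Error(`Node.js ${min}+ is required, found ${version}`);
  }
  return major;
}
```

Calling `assertNodeVersion()` with no arguments at the top of the service entry point checks the current runtime.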

Installation Commands

# Install Node.js dependencies
npm install

# Install Playwright browsers (architecture auto-detected)
npx playwright install chromium
npx playwright install-deps chromium

Python Dependencies

Core Requirements

  • Python: 3.11+ recommended
  • Playwright: 1.40.0 (pinned; keep it in step with the Playwright version the Node.js service installs)
  • Architecture: Auto-detects and downloads correct Chromium binaries

Core Python Service Dependencies

flask==2.3.3
playwright==1.40.0
gunicorn==21.2.0
nest-asyncio==1.5.8
redis==4.6.2

Installation Commands

# Install Python dependencies
pip install -r requirements.txt

# Install Playwright browsers (architecture auto-detected)
playwright install chromium

Core Use Cases

Playwright serves as a powerful all-purpose automation tool for:

1. PDF Generation

  • Web-to-PDF: Convert any webpage or HTML content to high-quality PDFs
  • Dynamic Content: Handle JavaScript-rendered content and interactive elements
  • Custom Formatting: Control page size, margins, headers, footers
  • Batch Processing: Generate multiple PDFs efficiently via queue system

2. HTML Saving & Caching

  • Complete Pages: Save fully rendered HTML with all assets loaded
  • Static Site Generation: Pre-render dynamic content for faster delivery
  • Content Archival: Preserve web content exactly as rendered
  • Offline Access: Create local copies of web applications

3. Server-Side Rendering (SSR) Cache

  • Performance Optimization: Pre-render pages for faster initial load
  • SEO Enhancement: Provide search engines with fully rendered content
  • Dynamic Caching: Cache complex, data-driven pages efficiently
  • Progressive Enhancement: Serve static content while JavaScript loads

Architecture Overview

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Client Apps   │───▶│  Job Processor  │───▶│  Playwright     │
│   (Web/API)     │    │    (BullMQ)     │    │   Workers       │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                              │                        │
                              ▼                        ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │  Redis Queue    │    │  File Storage   │
                       │   (Jobs)        │    │  (PDFs/HTML)    │
                       └─────────────────┘    └─────────────────┘

BullMQ Integration

Version Requirements

  • BullMQ: 5.58.5 (latest stable)
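  • Redis: ^5.8.2 (Official Node.js Redis client, used for application-level access such as the SSR cache; BullMQ itself connects through ioredis, which it installs as its own dependency)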
  • Redis Server: 7.0+ (server requirement)

Job Processor Implementation

Queue Configuration

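import { Queue, Worker } from 'bullmq';
import { createClient } from 'redis';

// BullMQ connects through ioredis (installed as its dependency), so it takes
// plain connection options rather than a node-redis client instance
const connection = {
  host: process.env.REDIS_HOST || 'localhost',
  port: Number(process.env.REDIS_PORT) || 6379,
  password: process.env.REDIS_PASSWORD || undefined,
  db: Number(process.env.REDIS_DB) || 0,
};

// Official Redis client for application-level access (e.g. the SSR cache)
const redisClient = createClient({
  socket: { host: connection.host, port: connection.port },
  password: connection.password,
  database: connection.db,
});

// Register handlers before connecting so early failures are not missed
redisClient.on('error', (err) => console.error('Redis Client Error', err));
redisClient.on('connect', () => console.log('Connected to Redis'));
await redisClient.connect();

// Create queue (the name must match the one the worker listens on)
const playwrightQueue = new Queue('playwright-jobs', {
  connection,
  defaultJobOptions: {
    removeOnComplete: 100,
    removeOnFail: 50,
    attempts: 3,
    backoff: {
      type: 'exponential',
      delay: 2000,
    },
  },
});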
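
With `attempts: 3` and an exponential backoff `delay` of 2000, a failing job is retried twice before it is marked as failed. The sketch below works out the waits, assuming BullMQ's built-in exponential strategy doubles the base delay on each retry (`delay * 2^(attemptsMade - 1)`) and that no jitter is configured. `retryDelays` is a hypothetical helper for illustration, not a BullMQ API.

```javascript
// Compute the wait before each retry for a job with the given settings.
function retryDelays(attempts, baseDelayMs) {
  const delays = [];
  // The first attempt runs immediately; only later attempts are delayed.
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    delays.push(Math.round(Math.pow(2, attemptsMade - 1) * baseDelayMs));
  }
  return delays;
}

// Settings from the queue configuration above
const delays = retryDelays(3, 2000); // two retries: 2000 ms, then 4000 ms
```

Under these assumptions a job that keeps failing occupies the queue for about six seconds of backoff in total, which is worth keeping in mind when choosing `removeOnFail`.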

Worker Implementation

import { chromium } from 'playwright';

// Multi-purpose Playwright worker
const worker = new Worker('playwright-jobs', async (job) => {
  // The job name carries the type ('pdf', 'html' or 'cache'),
  // because jobs are added as queue.add(type, { url, options })
  const type = job.name;
  const { url, options = {} } = job.data;
  
  // Launch Playwright browser with detected Chromium path
  const browser = await chromium.launch({
    headless: true,
    executablePath: CHROMIUM_PATH,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
  
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle' });
    
    switch (type) {
      case 'pdf': {
        const pdf = await page.pdf({
          format: options.format || 'A4',
          margin: options.margin || { top: '1in', right: '1in', bottom: '1in', left: '1in' },
          printBackground: true
        });
        return { success: true, pdf: pdf.toString('base64') };
      }
        
      case 'html': {
        const html = await page.content();
        return { success: true, html, title: await page.title() };
      }
        
      case 'cache': {
        const cachedHtml = await page.content();
        // Save cachedHtml to cache storage (Redis, file system, etc.) under options.cacheKey
        return { success: true, cached: true, key: options.cacheKey };
      }
        
      default:
        throw new Error(`Unknown job type: ${type}`);
    }
  } finally {
    // Always release the browser, even when the job throws
    await browser.close();
  }
}, { 
  connection,
  // Environment variables are strings, so convert before use
  concurrency: Number(process.env.WORKER_CONCURRENCY) || 5,
});

Job Types and Examples

// Core job types for all-purpose automation
const JobTypes = {
  PDF_GENERATION: 'pdf',
  HTML_SAVING: 'html', 
  SSR_CACHE: 'cache'
};

// PDF Generation
await playwrightQueue.add(JobTypes.PDF_GENERATION, {
  url: 'https://example.com/report',
  options: {
    format: 'A4',
    margin: { top: '0.5in', right: '0.5in', bottom: '0.5in', left: '0.5in' },
    displayHeaderFooter: true,
    headerTemplate: '<div style="font-size:10px;">Report - <span class="date"></span></div>'
  }
});

// HTML Saving
await playwrightQueue.add(JobTypes.HTML_SAVING, {
  url: 'https://dynamic-app.com/dashboard',
  options: {
    waitFor: 'networkidle',
    includeAssets: true
  }
});

// SSR Cache
await playwrightQueue.add(JobTypes.SSR_CACHE, {
  url: 'https://spa-app.com/product/123',
  options: {
    cacheKey: 'product_123_rendered',
    ttl: 3600 // 1 hour cache
  }
});
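
Because every failed job is retried, it can help to reject malformed payloads before they reach the queue. The following is a minimal sketch of such a check, assuming the three job types listed above and that cache jobs always need `options.cacheKey`; `validateJob` is a hypothetical helper, and the rules are illustrative rather than something BullMQ enforces.

```javascript
const JOB_TYPES = new Set(['pdf', 'html', 'cache']);

// Returns a list of problems; an empty list means the job can be enqueued.
function validateJob(name, data) {
  const errors = [];
  if (!JOB_TYPES.has(name)) errors.push(`unknown job type: ${name}`);

  let parsed = null;
  try {
    parsed = new URL(data && data.url);
  } catch {
    errors.push('url is missing or not a valid URL');
  }
  // Only fetch over HTTP(S); this keeps file:// and similar schemes out.
  if (parsed && !['http:', 'https:'].includes(parsed.protocol)) {
    errors.push(`unsupported protocol: ${parsed.protocol}`);
  }

  const options = (data && data.options) || {};
  if (name === 'cache' && !options.cacheKey) {
    errors.push('cache jobs need options.cacheKey');
  }
  return errors;
}
```

A producer would call `validateJob(name, data)` and only call `playwrightQueue.add(name, data)` when the returned list is empty.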

Python Integration with BullMQ

# BullMQ is Node.js-native, so Python services can:
# 1. Hand jobs to the Node.js job processor through a plain Redis list
# 2. Use python-rq as an alternative queue
# 3. Communicate via HTTP APIs with the Node.js job processor

import redis
import json

# Redis client for job communication (Python redis package)
redis_client = redis.Redis(host='localhost', port=6379, db=0)

# Hand a job to the Node.js side, which enqueues it with queue.add().
# BullMQ stores jobs in its own hashes and lists, so pushing JSON straight
# into bull:* keys does not create a job that a worker can pick up.
# The "handoff:" prefix is an illustrative name, not a BullMQ convention.
def add_job_to_queue(queue_name, job_data):
    job = {
        'data': job_data,
        'opts': {'attempts': 3}
    }
    redis_client.lpush(f"handoff:{queue_name}", json.dumps(job))

Migration from IORedis to Official Redis Client

Why Migrate?

  • Official Support: The redis package is officially maintained by Redis
  • Better Security: Fewer vulnerabilities and faster security updates
  • Active Maintenance: More frequent updates and better long-term support
  • Smaller Bundle: Reduced dependency footprint

Breaking Changes

// OLD: IORedis
import IORedis from 'ioredis';
const redis = new IORedis({
  host: 'localhost',
  port: 6379,
  maxRetriesPerRequest: 3,
});

// NEW: Official Redis Client
import { createClient } from 'redis';
const redis = createClient({
  socket: {
    host: 'localhost',
    port: 6379,
  },
});
await redis.connect(); // Required explicit connection

Configuration Differences

// IORedis configuration
const ioRedisConfig = {
  host: 'localhost',
  port: 6379,
  password: 'secret',
  db: 0,
  retryDelayOnFailover: 100,
  maxRetriesPerRequest: 3,
};

// Official Redis client configuration
const redisConfig = {
  socket: {
    host: 'localhost',
    port: 6379,
    connectTimeout: 5000,
    commandTimeout: 5000,
  },
  password: 'secret',
  database: 0,
};
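
The two configuration shapes above differ mainly in nesting and in the name of the database field. As a rough sketch, the mapping can be written as a small function; `toNodeRedisConfig` is a hypothetical helper, and it deliberately drops `retryDelayOnFailover` and `maxRetriesPerRequest` because they have no one-to-one equivalent (reconnect behavior in the official client is configured through `socket.reconnectStrategy`).

```javascript
// Convert an ioredis-style options object into the shape createClient() expects.
function toNodeRedisConfig(ioRedisConfig) {
  const { host = 'localhost', port = 6379, password, db = 0 } = ioRedisConfig;
  const config = {
    socket: { host, port: Number(port) }, // host and port move under "socket"
    database: Number(db),                 // "db" is called "database"
  };
  // Only set a password when one is configured.
  if (password) config.password = password;
  return config;
}
```

The result can be passed to `createClient()`; retry and timeout behavior still needs to be reviewed by hand.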

BullMQ Connection Update

// Update your existing BullMQ setup
import { Queue, Worker } from 'bullmq';
import { createClient } from 'redis';

const host = process.env.REDIS_HOST || 'localhost';
const port = parseInt(process.env.REDIS_PORT, 10) || 6379;
const password = process.env.REDIS_PASSWORD || undefined;
const db = parseInt(process.env.REDIS_DB, 10) || 0;

// Official Redis client for your own commands (cache reads, writes, etc.)
async function createRedisConnection() {
  const client = createClient({
    socket: { host, port },
    password,
    database: db,
  });
  client.on('error', (err) => console.error('Redis Client Error', err));
  await client.connect();
  return client;
}

// BullMQ still connects through ioredis internally, so give it
// connection options instead of the node-redis client
const connection = { host, port, password, db };
const queue = new Queue('jobs', { connection });
const worker = new Worker('jobs', async (job) => {
  // Process job
}, { connection });

Development Setup

Local Development Environment

# 1. Start Redis server
docker run -d --name redis -p 6379:6379 redis:7-alpine

# 2. Install Node.js dependencies
cd job-processor
npm install
npx playwright install chromium

# 3. Install Python dependencies
cd ../scraper-api
pip install -r requirements.txt
playwright install chromium

# 4. Start services
npm run dev  # Node.js services
python app.py  # Python services

Docker Development

# Build multi-architecture images
docker buildx build --platform linux/amd64,linux/arm64 -t job-processor:latest .
docker buildx build --platform linux/amd64,linux/arm64 -t scraper-api:latest .

# Run with docker-compose
docker-compose up -d

Environment Variables

# Redis Configuration
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=
REDIS_DB=0

# Job Processor
WORKER_CONCURRENCY=5
MAX_RETRIES=3

# Playwright Configuration  
PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
# Note: the * wildcard is not expanded inside an environment variable; set the full versioned path
PLAYWRIGHT_CHROMIUM_EXECUTABLE_PATH=/ms-playwright/chromium-*/chrome-linux/chrome

# Architecture Detection
NODE_ENV=production
PLATFORM_ARCH=auto
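
Environment variables always arrive as strings, so numeric settings such as `REDIS_PORT` and `WORKER_CONCURRENCY` need to be parsed before use. Here is a minimal sketch of a loader with the defaults listed above; `loadConfig` and `toInt` are hypothetical helper names.

```javascript
// Parse an integer setting, falling back when it is unset or malformed.
function toInt(value, fallback) {
  const n = Number.parseInt(value, 10);
  return Number.isNaN(n) ? fallback : n;
}

function loadConfig(env = process.env) {
  return {
    redis: {
      host: env.REDIS_HOST || 'localhost',
      port: toInt(env.REDIS_PORT, 6379),
      // An empty REDIS_PASSWORD= line means "no password".
      password: env.REDIS_PASSWORD || undefined,
    },
    workerConcurrency: toInt(env.WORKER_CONCURRENCY, 5),
    maxRetries: toInt(env.MAX_RETRIES, 3),
  };
}
```

Loading configuration once at startup keeps the string-to-number conversions in a single place instead of scattering them across queue and worker setup.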

Performance Considerations

Resource Management

  • Memory: 2GB minimum per worker process
  • CPU: Optimal at 2-4 cores per worker
  • Disk: 1GB for Chromium binaries per architecture

Scaling Guidelines

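// Horizontal scaling configuration
import os from 'os';

const workerConfig = {
  concurrency: Math.max(1, Math.floor(os.cpus().length / 2)),
  maxStalledCount: 1,
  stalledInterval: 30 * 1000,
};
// The eviction policy is a Redis server setting, not a worker option.
// BullMQ expects maxmemory-policy noeviction so queue keys are never evicted.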
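
The CPU-based rule above ignores memory, which is often the tighter limit for headless Chromium. The following sketch combines both limits, under the assumption that each concurrent job needs roughly the 2GB listed under Resource Management; `recommendedConcurrency` is a hypothetical helper and the numbers should be tuned against real measurements.

```javascript
const os = require('os');

const BYTES_PER_GB = 1024 ** 3;

// Pick the smaller of the CPU-based and memory-based limits.
function recommendedConcurrency(cpuCount, totalMemBytes, memPerJobGb = 2) {
  const byCpu = Math.floor(cpuCount / 2);
  const byMemory = Math.floor(totalMemBytes / (memPerJobGb * BYTES_PER_GB));
  // Never drop below one slot, even on very small machines.
  return Math.max(1, Math.min(byCpu, byMemory));
}

// Value for the current machine
const concurrency = recommendedConcurrency(os.cpus().length, os.totalmem());
```

On an 8-core host with 4GB of RAM this yields 2 rather than 4, because memory runs out before CPU does.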

Monitoring and Health Checks

// Health check endpoint (app is the Express application)
import os from 'os';

app.get('/health', async (req, res) => {
  const queueHealth = await playwrightQueue.getJobCounts();
  const workerHealth = worker.isRunning();
  
  res.json({
    status: 'healthy',
    queue: queueHealth,
    worker: { running: workerHealth },
    architecture: os.arch(),
    nodeVersion: process.version
  });
});
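
The endpoint above always reports `healthy`, even when the worker has stopped. A small sketch of deriving the status from the same data follows; `summarizeHealth` is a hypothetical helper, and the backlog threshold of 1000 waiting jobs is an arbitrary assumption.

```javascript
// Derive a status string from BullMQ job counts and the worker state.
function summarizeHealth(counts, workerRunning, maxWaiting = 1000) {
  if (!workerRunning) return 'unhealthy';                   // nothing is consuming jobs
  if ((counts.waiting || 0) > maxWaiting) return 'degraded'; // backlog is building up
  return 'healthy';
}

// Map the status to an HTTP code so load balancers can react to it.
function httpStatusFor(status) {
  return status === 'unhealthy' ? 503 : 200;
}
```

In the handler this would replace the hard-coded value, for example `const status = summarizeHealth(queueHealth, workerHealth)` followed by `res.status(httpStatusFor(status)).json({ status })`.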

Testing and Validation

Required Test Coverage

  • ✅ Multi-architecture builds (ARM64 + x86/x64)
  • ✅ Node.js version compatibility (18 + 24)
  • ✅ BullMQ job processing reliability
  • ✅ Playwright browser launch consistency
  • ✅ Memory leak prevention
  • ✅ Error handling and retry logic

Integration Testing

# Run comprehensive test suite
npm test -- --coverage
pytest --cov=src tests/

# Architecture-specific testing
docker buildx build --platform linux/arm64 -t test:arm64 .
docker buildx build --platform linux/amd64 -t test:amd64 .

Comprehensive Implementation Examples

PDF Generation Service

// Advanced PDF generation with custom options
async function generatePDF(url, options = {}) {
  const browser = await chromium.launch({
    headless: true,
    executablePath: CHROMIUM_PATH,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
  
  try {
    const page = await browser.newPage();
    
    // Set viewport for consistent rendering
    await page.setViewportSize({ width: 1200, height: 800 });
    
    // Navigate and wait for content
    await page.goto(url, { waitUntil: 'networkidle' });
    
    // Wait for specific elements if needed
    if (options.waitForSelector) {
      await page.waitForSelector(options.waitForSelector);
    }
    
    const pdf = await page.pdf({
      format: options.format || 'A4',
      margin: options.margin || { 
        top: '1in', 
        right: '1in', 
        bottom: '1in', 
        left: '1in' 
      },
      printBackground: true,
      displayHeaderFooter: options.displayHeaderFooter || false,
      headerTemplate: options.headerTemplate || '',
      footerTemplate: options.footerTemplate || '',
      landscape: options.landscape || false,
      scale: options.scale || 1
    });
    
    return pdf;
  } finally {
    await browser.close();
  }
}

// Usage examples
const reportPDF = await generatePDF('https://app.com/report', {
  format: 'A4',
  displayHeaderFooter: true,
  headerTemplate: '<div style="font-size:10px; width:100%; text-align:center;">Monthly Report</div>',
  footerTemplate: '<div style="font-size:10px; width:100%; text-align:center;">Page <span class="pageNumber"></span> of <span class="totalPages"></span></div>'
});
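
The option defaults inside `generatePDF` can be pulled into a pure function so they can be unit tested without launching a browser. This is a sketch rather than the original implementation: `buildPdfOptions` is a hypothetical helper, and unlike the code above it merges a partial `margin` with the defaults instead of replacing it.

```javascript
const DEFAULT_MARGIN = { top: '1in', right: '1in', bottom: '1in', left: '1in' };

// Build the object passed to page.pdf() from caller-supplied options.
function buildPdfOptions(options = {}) {
  return {
    format: options.format || 'A4',
    // Merge so a caller can override a single side.
    margin: { ...DEFAULT_MARGIN, ...(options.margin || {}) },
    printBackground: true,
    displayHeaderFooter: Boolean(options.displayHeaderFooter),
    headerTemplate: options.headerTemplate || '',
    footerTemplate: options.footerTemplate || '',
    landscape: Boolean(options.landscape),
    scale: options.scale ?? 1,
  };
}
```

With this in place the call becomes `await page.pdf(buildPdfOptions(options))`.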

HTML Saving & Archival Service

// Complete HTML saving with assets
async function saveCompleteHTML(url, options = {}) {
  const browser = await chromium.launch({
    headless: true,
    executablePath: CHROMIUM_PATH,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
  
  try {
    const page = await browser.newPage();
    
    // Intercept and save resources if needed
    const resources = [];
    if (options.saveAssets) {
      await page.route('**/*', (route) => {
        const request = route.request();
        resources.push({
          url: request.url(),
          method: request.method(),
          headers: request.headers()
        });
        route.continue();
      });
    }
    
    await page.goto(url, { waitUntil: 'networkidle' });
    
    // Wait for dynamic content
    if (options.waitTime) {
      await page.waitForTimeout(options.waitTime);
    }
    
    const html = await page.content();
    const title = await page.title();
    const screenshot = options.includeScreenshot ? 
      await page.screenshot({ fullPage: true }) : null;
    
    return {
      html,
      title,
      url,
      timestamp: new Date().toISOString(),
      resources: options.saveAssets ? resources : [],
      screenshot: screenshot ? screenshot.toString('base64') : null
    };
  } finally {
    await browser.close();
  }
}

// Usage
const savedPage = await saveCompleteHTML('https://dynamic-app.com/dashboard', {
  saveAssets: true,
  includeScreenshot: true,
  waitTime: 2000
});
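
`saveCompleteHTML` returns the page in memory; for archival the result usually has to be written somewhere with a stable name. The sketch below derives a file name from the `url` and `timestamp` fields of the result. `archiveFileName` is a hypothetical helper and the naming scheme is only one possible convention.

```javascript
// Build a filesystem-safe file name from a page URL and an ISO timestamp.
function archiveFileName(url, timestamp) {
  const { hostname, pathname } = new URL(url);
  const slug = `${hostname}${pathname}`
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse anything unsafe into a dash
    .replace(/^-+|-+$/g, '');    // trim leading and trailing dashes
  // Colons and dots in the timestamp are awkward on some filesystems.
  const stamp = timestamp.replace(/[:.]/g, '-');
  return `${slug}_${stamp}.html`;
}
```

A caller could then write `savedPage.html` to `archiveFileName(savedPage.url, savedPage.timestamp)` with `fs.promises.writeFile`.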

SSR Cache Implementation

// Server-Side Rendering cache with Redis
import { chromium } from 'playwright';
import { createClient } from 'redis';

class SSRCache {
  constructor(redisClient) {
    this.redis = redisClient;
    this.defaultTTL = 3600; // 1 hour
  }
  
  async renderAndCache(url, cacheKey, options = {}) {
    // Check cache first
    const cached = await this.redis.get(cacheKey);
    if (cached && !options.forceRefresh) {
      return JSON.parse(cached);
    }
    
    // Render with Playwright
    const browser = await chromium.launch({
      headless: true,
      executablePath: CHROMIUM_PATH,
      args: ['--no-sandbox', '--disable-setuid-sandbox']
    });
    
    try {
      // Set user agent and viewport for consistent rendering.
      // Playwright takes these as page options; it has no page.setUserAgent().
      const page = await browser.newPage({
        userAgent: 'Mozilla/5.0 (compatible; SSRCache/1.0)',
        viewport: { width: 1200, height: 800 }
      });
      
      await page.goto(url, { waitUntil: 'networkidle' });
      
      // Wait for SPA to fully load
      if (options.waitForSelector) {
        await page.waitForSelector(options.waitForSelector);
      }
      
      const html = await page.content();
      const title = await page.title();
      const meta = await page.evaluate(() => {
        const metaTags = {};
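        document.querySelectorAll('meta').forEach(meta => {
          // "property" (e.g. og:title) is not exposed as a DOM property, so read the attribute
          const property = meta.getAttribute('property');
          if (meta.name) metaTags[meta.name] = meta.content;
          if (property) metaTags[property] = meta.content;
        });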
        return metaTags;
      });
      
      const result = {
        html,
        title,
        meta,
        url,
        cached: true,
        timestamp: new Date().toISOString()
      };
      
      // Cache the result
      const ttl = options.ttl || this.defaultTTL;
      await this.redis.setEx(cacheKey, ttl, JSON.stringify(result));
      
      return result;
    } finally {
      await browser.close();
    }
  }
  
  async invalidateCache(pattern) {
    // KEYS walks the whole keyspace and blocks Redis while it runs; prefer SCAN on large datasets
    const keys = await this.redis.keys(pattern);
    if (keys.length > 0) {
      await this.redis.del(keys);
    }
    return keys.length;
  }
}

// Usage
// redisConnection is a connected client created with createClient()
const ssrCache = new SSRCache(redisConnection);

// Cache SPA pages for SEO
const cachedPage = await ssrCache.renderAndCache(
  'https://spa-app.com/product/123',
  'product_page_123',
  {
    ttl: 1800, // 30 minutes
    waitForSelector: '.product-details'
  }
);

// Serve cached HTML
app.get('/product/:id', async (req, res) => {
  const cacheKey = `product_page_${req.params.id}`;
  const cached = await ssrCache.renderAndCache(
    `https://spa-app.com/product/${req.params.id}`,
    cacheKey
  );
  
  res.send(cached.html);
});
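
The examples above use hand-written cache keys such as `product_page_123`. For arbitrary URLs, a key can be derived from the URL itself so that equivalent URLs share one cache entry. This is a minimal sketch, assuming that query parameter order and the fragment do not affect the rendered page; `cacheKeyForUrl` and the `ssr` prefix are hypothetical.

```javascript
const crypto = require('crypto');

// Derive a stable cache key from a URL.
function cacheKeyForUrl(url, prefix = 'ssr') {
  const parsed = new URL(url);
  parsed.hash = '';           // fragments never reach the server
  parsed.searchParams.sort(); // ?b=2&a=1 and ?a=1&b=2 become identical
  const normalized = parsed.toString();
  const digest = crypto.createHash('sha256').update(normalized).digest('hex');
  // A short prefix of the digest keeps keys compact.
  return `${prefix}:${digest.slice(0, 16)}`;
}
```

Hashing keeps keys short and free of special characters, at the cost of making them unreadable; keeping a readable key such as the product ID is a reasonable alternative when the URL space is small.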

Critical Path Detection Functions

Both Node.js and Python services include essential functions for detecting the correct Chromium path across architectures. These functions are mandatory for reliable cross-platform operation.

Node.js Implementation

// Smart browser detection for cross-platform compatibility
// Requires the glob package (npm install glob); it is not in the dependency list above
const os = require('os');

function detectBestChromiumPath() {
  const arch = os.arch();
  const platform = os.platform();
  // os.homedir() avoids building "undefined/..." paths when HOME is unset
  const home = os.homedir();
  
  console.log(`🔍 Detecting browser: platform=${platform}, arch=${arch}`);
  
  // Try Playwright's downloaded browsers first (works best on x64)
  const playwrightPaths = [
    home + '/.cache/ms-playwright/chromium-*/chrome-linux/chrome',
    '/root/.cache/ms-playwright/chromium-*/chrome-linux/chrome',
    home + '/.cache/ms-playwright/chromium_headless_shell-*/chrome-linux/headless_shell',
    '/root/.cache/ms-playwright/chromium_headless_shell-*/chrome-linux/headless_shell'
  ];
  
  // System browsers (fallback, especially for ARM64)
  const systemPaths = [
    '/usr/bin/chromium',
    '/usr/bin/chromium-browser', 
    '/usr/bin/google-chrome',
    '/usr/bin/google-chrome-stable'
  ];
  
  // For x64, prefer Playwright browsers; for ARM64, prefer system browsers
  const pathsToTry = (arch === 'x64' || arch === 'x86_64') 
    ? [...playwrightPaths, ...systemPaths]
    : [...systemPaths, ...playwrightPaths];
  
  for (const pathPattern of pathsToTry) {
    if (pathPattern.includes('*')) {
      // Handle glob patterns for Playwright paths
      try {
        const matches = require('glob').sync(pathPattern);
        for (const match of matches) {
          if (require('fs').existsSync(match)) {
            console.log(`✅ Found Playwright browser: ${match}`);
            return match;
          }
        }
      } catch (error) {
        continue;
      }
    } else {
      if (require('fs').existsSync(pathPattern)) {
        console.log(`✅ Found system browser: ${pathPattern}`);
        return pathPattern;
      }
    }
  }
  
  console.log(`⚠️ No browser found, using default system path: /usr/bin/chromium`);
  return '/usr/bin/chromium';
}

// Usage at startup
const CHROMIUM_PATH = detectBestChromiumPath();
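
The detection function depends on the `glob` package for the wildcard paths. Where adding a dependency is undesirable, the same lookup can be done with `fs` alone. The sketch below assumes the `chromium-<revision>/chrome-linux/chrome` layout used in the paths above; `findPlaywrightChromium` is a hypothetical helper and only covers the regular Chromium build, not the headless shell.

```javascript
const fs = require('fs');
const path = require('path');

// Look for a Playwright-installed Chromium binary without using glob.
function findPlaywrightChromium(cacheDir) {
  let entries;
  try {
    entries = fs.readdirSync(cacheDir);
  } catch {
    return null; // cache directory missing or unreadable
  }
  const candidates = entries
    .filter((name) => /^chromium-\d+$/.test(name))
    // Highest revision first, so the newest install wins.
    .sort((a, b) => Number(b.split('-')[1]) - Number(a.split('-')[1]))
    .map((name) => path.join(cacheDir, name, 'chrome-linux', 'chrome'));
  // Return the first candidate whose binary actually exists.
  return candidates.find((candidate) => fs.existsSync(candidate)) || null;
}
```

It could be called as `findPlaywrightChromium(path.join(os.homedir(), '.cache', 'ms-playwright'))`, falling back to the system paths when it returns `null`.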

Python Implementation

import os
import glob
import platform
import logging

def detect_best_chromium_path():
    arch = platform.machine().lower()
    system = platform.system().lower()
    
    logging.info(f"🔍 Detecting browser: platform={system}, arch={arch}")
    
    # Try Playwright's downloaded browsers first (works best on x64)
    playwright_paths = [
        os.path.expanduser('~/.cache/ms-playwright/chromium-*/chrome-linux/chrome'),
        '/root/.cache/ms-playwright/chromium-*/chrome-linux/chrome',
        os.path.expanduser('~/.cache/ms-playwright/chromium_headless_shell-*/chrome-linux/headless_shell'),
        '/root/.cache/ms-playwright/chromium_headless_shell-*/chrome-linux/headless_shell'
    ]
    
    # System browsers (fallback, especially for ARM64)
    system_paths = [
        '/usr/bin/chromium',
        '/usr/bin/chromium-browser', 
        '/usr/bin/google-chrome',
        '/usr/bin/google-chrome-stable'
    ]
    
    # For x64, prefer Playwright browsers; for ARM64, prefer system browsers
    paths_to_try = playwright_paths + system_paths if 'x86_64' in arch or 'amd64' in arch else system_paths + playwright_paths
    
    for path_pattern in paths_to_try:
        if '*' in path_pattern:
            # Handle glob patterns for Playwright paths
            try:
                matches = glob.glob(path_pattern)
                for match in matches:
                    if os.path.exists(match) and os.access(match, os.X_OK):
                        logging.info(f"✅ Found Playwright browser: {match}")
                        return match
            except Exception:
                continue
        else:
            if os.path.exists(path_pattern) and os.access(path_pattern, os.X_OK):
                logging.info(f"✅ Found system browser: {path_pattern}")
                return path_pattern
    
    logging.warning(f"⚠️ No browser found, using default system path: /usr/bin/chromium")
    return '/usr/bin/chromium'

# Usage at startup
CHROMIUM_PATH = detect_best_chromium_path()

Integration with Playwright Launch

// Node.js usage
const browser = await chromium.launch({
  headless: true,
  executablePath: CHROMIUM_PATH,
  args: ['--no-sandbox', '--disable-setuid-sandbox']
});

# Python usage
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        executable_path=CHROMIUM_PATH,
        args=['--no-sandbox', '--disable-setuid-sandbox']
    )

Troubleshooting

Common Issues

  1. Chromium path mismatch: Use detectBestChromiumPath() functions above
  2. Architecture incompatibility: Check Docker TARGETPLATFORM
  3. BullMQ connection errors: Validate Redis connectivity
  4. Memory exhaustion: Adjust worker concurrency
  5. Version conflicts: Ensure consistent Playwright versions

Debugging Commands

# Check Playwright installation
npx playwright --version
playwright --version

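# Verify Chromium binary (asks Playwright for the executable path it would launch)
file "$(node -e "console.log(require('playwright').chromium.executablePath())")"

# Test Redis connectivity (required by BullMQ)
node -e "import('redis').then(async (r) => { const c = r.createClient(); await c.connect(); console.log(await c.ping()); await c.quit(); })"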

# Architecture verification
uname -m && node -e "console.log(os.arch())" && python -c "import platform; print(platform.machine())"

Next Steps: Review the Architecture Compatibility Guide for detailed implementation requirements and testing procedures.
