Meta Tag Generator

Technical documentation of the Meta Tag Generator: PHP-based URL crawler, multicrawl with Server-Sent Events, multi-provider AI architecture with rule-based fallback, and outputs for Title, Description, Keywords, Open Graph, Twitter Card, JSON-LD and Robots.

v1.0 | 7 AI providers | Single URL + Multicrawl | PHP · Vanilla JS | Open Source

About the Meta Tag Generator

The Meta Tag Generator crawls any publicly reachable URL server-side and extracts the page content: the H1 and H2 headings, paragraphs, existing meta tags, JSON-LD blocks and body text. This content is passed to an AI provider (or a rule-based fallback), which generates optimised meta tags based on what the page is actually about.

Version 1.0 introduces Multicrawl mode: the tool discovers the URLs of a domain via sitemap.xml (falling back to robots.txt and link extraction), then processes each page sequentially, up to a configurable limit of 500 URLs per run, with live progress updates via Server-Sent Events. This makes it possible to generate meta tags for an entire website in one run.
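
The discovery step can be sketched roughly as follows. This is a minimal illustration, not the code in SitemapExtractor.php: the function names normaliseUrl and extractSitemapUrls are hypothetical, only the sitemap.xml path is shown (no robots.txt or link-extraction fallback), and the host comparison is a plain string match, so www and non-www hosts count as different domains here.

```php
<?php
// Hypothetical helpers for illustration; the real SitemapExtractor.php may differ.

/** Normalise a URL for deduplication: lower-case scheme and host, drop the fragment. */
function normaliseUrl(string $url): ?string
{
    $parts = parse_url(trim($url));
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    $normalised = strtolower($parts['scheme']) . '://' . strtolower($parts['host']);
    if (isset($parts['port'])) {
        $normalised .= ':' . $parts['port'];
    }
    $normalised .= $parts['path'] ?? '/';
    if (isset($parts['query'])) {
        $normalised .= '?' . $parts['query'];
    }
    return $normalised; // the fragment is intentionally left out
}

/** Extract same-domain, deduplicated URLs from sitemap XML, capped at $maxUrls. */
function extractSitemapUrls(string $xml, string $domain, int $maxUrls = 500): array
{
    $previous = libxml_use_internal_errors(true); // do not emit warnings on bad XML
    $doc = simplexml_load_string($xml);
    libxml_use_internal_errors($previous);
    if ($doc === false) {
        return [];
    }

    $seen = [];
    // local-name() matches <loc> regardless of the sitemap namespace prefix.
    foreach ($doc->xpath('//*[local-name()="loc"]') ?: [] as $loc) {
        $url = normaliseUrl((string) $loc);
        if ($url === null) {
            continue;
        }
        if (parse_url($url, PHP_URL_HOST) !== strtolower($domain)) {
            continue; // only same-domain URLs are processed
        }
        $seen[$url] = true; // array keys give deduplication for free
        if (count($seen) >= $maxUrls) {
            break;
        }
    }
    return array_keys($seen);
}
```

Calling extractSitemapUrls($xml, 'example.com') on a fetched sitemap body returns a list of unique URLs that can then be processed one by one.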

The AI provider is fully swappable via a single config line – from no AI (free, rule-based) to Gemini, Claude, GPT-4o, Perplexity, Grok or any OpenAI-compatible endpoint.

Tool scope

  • Modes: Single URL or Multicrawl (full domain)
  • Sitemap discovery: sitemap.xml → sitemap_index.xml → robots.txt → link extraction fallback
  • Max URLs per run: 500 (configurable)
  • Live updates: Server-Sent Events (SSE) during multicrawl
  • Retry logic: Auto-retry on 503/429 (3 attempts, increasing delays of 2s, 5s, 10s)
  • AI providers: 7 (none/rule-based, Anthropic, OpenAI, Google, Perplexity, Grok, OwnAI)
  • Output types: Title, Description, Keywords, Open Graph, Twitter Card, JSON-LD, Robots
  • Rate limiting: File-based, per IP, configurable window
  • Deployment: Single PHP directory, no framework, no database
  • License: MIT – self-hosted, open source
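
The retry behaviour in the list above could look roughly like this, using the 2s, 5s and 10s delays given in the Multicrawl details. It is a sketch under assumptions: requestWithRetry is a hypothetical name, the request is passed in as a callable returning an array with a status key, and "3 attempts" is read here as one initial request plus up to three retries. The sleep function is injectable so the logic can be exercised without waiting.

```php
<?php
/**
 * Call $request again while it returns 429 or 503 and delays remain.
 *
 * @param callable $request Returns ['status' => int, 'body' => string] (assumed shape).
 * @param int[]    $delays  Seconds to wait before each retry.
 * @param callable|null $sleep Injectable wait function; defaults to sleep().
 */
function requestWithRetry(callable $request, array $delays = [2, 5, 10], ?callable $sleep = null): array
{
    $sleep ??= static fn (int $seconds) => sleep($seconds);
    $retryable = [429, 503];

    $response = $request();
    foreach ($delays as $delay) {
        if (!in_array($response['status'], $retryable, true)) {
            break; // success or a non-retryable error: stop here
        }
        $sleep($delay);
        $response = $request();
    }
    return $response; // may still be 429/503 if every retry failed
}
```

If the last retry also fails, the caller receives the final 429 or 503 response and can report it as an error for that URL.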

Technical details

Crawler (Crawler.php)

  • Protocol: HTTP/HTTPS via PHP cURL
  • Redirects: Up to 5 hops
  • Timeout: 10 seconds per request
  • SSL: CURLOPT_SSL_VERIFYPEER enabled
  • User-Agent: MetaTagGenerator/1.0
  • Content limit: 50,000 characters (configurable)
  • Parser: PHP DOMDocument + DOMXPath
  • Encoding: HTML-ENTITIES via mb_convert_encoding
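
As an illustration, the fetch settings above map onto PHP cURL options roughly as shown below. The function names crawlerCurlOptions and fetchPage are hypothetical, and the error handling is an assumption; Crawler.php may structure this differently. The content limit and the encoding step are left out of this sketch.

```php
<?php
/** cURL options mirroring the crawler settings listed above. */
function crawlerCurlOptions(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 5,     // up to 5 redirect hops
        CURLOPT_TIMEOUT        => 10,    // seconds per request
        CURLOPT_SSL_VERIFYPEER => true,
        CURLOPT_USERAGENT      => 'MetaTagGenerator/1.0',
    ];
}

/** Fetch a page and return its final HTTP status and body, or throw on transport errors. */
function fetchPage(string $url): array
{
    $handle = curl_init();
    curl_setopt_array($handle, crawlerCurlOptions($url));

    $body = curl_exec($handle);
    if ($body === false) {
        throw new RuntimeException('Fetch failed: ' . curl_error($handle));
    }

    $status = (int) curl_getinfo($handle, CURLINFO_RESPONSE_CODE);
    // No curl_close() call: since PHP 8.0 the handle is freed when it goes out of scope.
    return ['status' => $status, 'html' => $body];
}
```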

Content extraction

  • Title: <title> tag
  • Meta tags: description, keywords, author, robots, canonical
  • Open Graph: og:title, og:description, og:image, og:type
  • Twitter Card: twitter:card, twitter:title, twitter:description
  • H1: First <h1> element
  • H2s: Up to 5 subheadings
  • Paragraphs: Up to 6 paragraphs >80 chars
  • JSON-LD: All <script type="application/ld+json"> blocks
  • Body text: Stripped of nav/header/footer/scripts
  • Page type: Auto-detected from JSON-LD or URL pattern
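
A few of the extraction rules above can be sketched with DOMDocument and DOMXPath as follows. The function name extractContent and the shape of the returned array are assumptions made for this example, and only a subset of the fields is covered. Character-encoding handling is omitted, so the sketch is only reliable for ASCII input or pages that declare their charset.

```php
<?php
/** Extract title, meta description, H1, up to 5 H2s and up to 6 long paragraphs. */
function extractContent(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $xpath = new DOMXPath($doc);

    // Collapse runs of whitespace so headings and paragraphs are single-line strings.
    $text = static fn (DOMNode $node): string
        => trim((string) preg_replace('/\s+/', ' ', $node->textContent));

    // First match of an XPath query as text, or null when nothing matches.
    $first = static function (string $query) use ($xpath, $text): ?string {
        $nodes = $xpath->query($query);
        return ($nodes !== false && $nodes->length > 0) ? $text($nodes->item(0)) : null;
    };

    $h2s = [];
    foreach ($xpath->query('//h2') as $node) {
        if (count($h2s) >= 5) {
            break;
        }
        $h2s[] = $text($node);
    }

    $paragraphs = [];
    foreach ($xpath->query('//p') as $node) {
        $value = $text($node);
        if (mb_strlen($value) > 80 && count($paragraphs) < 6) {
            $paragraphs[] = $value;
        }
    }

    return [
        'title'       => $first('//title'),
        'description' => $first('//meta[@name="description"]/@content'),
        'h1'          => $first('//h1'),
        'h2s'         => $h2s,
        'paragraphs'  => $paragraphs,
    ];
}
```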

Multicrawl (multicrawl.php + SitemapExtractor.php)

  • URL discovery: sitemap.xml → sitemap_index.xml → robots.txt → link extraction
  • Max URLs: 500 per run (configurable)
  • Live updates: Server-Sent Events (SSE)
  • SSE events: status, urls_found, progress, result, error, done
  • Retry logic: 3 attempts on 503/429 (2s, 5s, 10s delays)
  • Request delay: Configurable ms between requests
  • Domain filtering: Only same-domain URLs processed
  • Deduplication: URL normalisation, fragments stripped
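
The SSE events listed above are plain text messages on a long-lived HTTP response. A minimal sketch of how such an event could be formatted and sent is shown below; the helper names and the payload fields (current, total, url) are hypothetical, and the actual payloads in multicrawl.php are not documented here.

```php
<?php
/** Format one Server-Sent Events message: an event name plus a single JSON data line. */
function formatSseEvent(string $event, array $payload): string
{
    $json = json_encode($payload, JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE);
    // A blank line terminates the message, as the SSE format requires.
    return "event: {$event}\n" . "data: {$json}\n\n";
}

/** Write an event to the response and push it to the client immediately. */
function sendSseEvent(string $event, array $payload): void
{
    echo formatSseEvent($event, $payload);
    if (ob_get_level() > 0) {
        ob_flush(); // flush PHP's output buffer if one is active
    }
    flush();
}

// In the endpoint, before the first event (headers must precede any output):
// header('Content-Type: text/event-stream');
// header('Cache-Control: no-cache');
```

On the browser side, an EventSource listener registered for the same event name receives the JSON string in the data property of the event.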

AI provider system

  • Interface: ProviderInterface with generate(string): string
  • Factory: ProviderFactory::create($config)
  • none: Rule-based, no API key required
  • anthropic: Claude via /v1/messages
  • openai: GPT-4o via /v1/chat/completions
  • google: Gemini via generateContent API
  • perplexity: OpenAI-compatible endpoint
  • grok: xAI via api.x.ai
  • ownai: Any OpenAI-compatible custom endpoint
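
The interface and factory named above could be arranged roughly like this. Only generate(string): string and ProviderFactory::create($config) come from the documentation; StubProvider, the 'provider' config key, and returning null for "none" are assumptions made so the sketch runs without API keys. The real provider classes would send the prompt to the respective vendor endpoint.

```php
<?php
interface ProviderInterface
{
    /** Takes a prompt and returns the provider's raw text response. */
    public function generate(string $prompt): string;
}

/** Stand-in for the HTTP-backed providers (hypothetical, for illustration only). */
final class StubProvider implements ProviderInterface
{
    public function __construct(private string $name)
    {
    }

    public function generate(string $prompt): string
    {
        // A real provider would call its API here and return the model output.
        return json_encode(['provider' => $this->name, 'prompt_length' => strlen($prompt)]);
    }
}

final class ProviderFactory
{
    private const SUPPORTED = ['anthropic', 'openai', 'google', 'perplexity', 'grok', 'ownai'];

    /** Assumption: null means "none", so the caller uses the rule-based path. */
    public static function create(array $config): ?ProviderInterface
    {
        $provider = strtolower($config['provider'] ?? 'none');
        if ($provider === 'none') {
            return null;
        }
        if (!in_array($provider, self::SUPPORTED, true)) {
            throw new InvalidArgumentException("Unknown provider: {$provider}");
        }
        return new StubProvider($provider);
    }
}
```

With this arrangement, switching providers means changing one value in the config array passed to ProviderFactory::create(), which matches the "single config line" idea described earlier.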

Generator (Generator.php)

  • Rule-based title: Strips domain suffix, trims to 60 chars
  • Rule-based description: First complete sentences up to 155 chars
  • AI prompt: 4,000 chars of page content + H1, H2s, keywords
  • AI output: JSON with title, description, keywords, suggestions
  • Fallback: Rule-based if AI response is unparseable
  • OG type: article or website based on page type
  • JSON-LD type: Article, Product, Organization or WebPage
  • Robots: 3 variants – standard, AI open, AI block

API (api.php)

  • Method: POST, JSON body
  • Action crawl: Fetches URL, returns page data + provider info
  • Action generate: Takes page data + overrides, returns all tags
  • CORS: Origin-restricted to deploying domain
  • Rate limiting: File-based per IP, 10 req / 60s (configurable)
  • Error handling: JSON error responses with HTTP status codes
  • Response format: {ok, data} or {ok, error}
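
A file-based limiter of the kind described above (10 requests per 60 seconds per IP) might be implemented along these lines. The function name allowRequest, the file naming scheme, the sliding-window approach and the fail-open behaviour are all assumptions of this sketch; api.php may use a fixed window or a different file layout.

```php
<?php
/**
 * Sliding-window limiter: stores request timestamps per IP in a JSON file.
 * Returns true if the request is allowed. $now is injectable for testing.
 */
function allowRequest(string $ip, string $dir, int $limit = 10, int $window = 60, ?int $now = null): bool
{
    $now ??= time();
    // Hash the IP so the file name is safe and does not expose the address.
    $file = rtrim($dir, '/') . '/rl_' . hash('sha256', $ip) . '.json';

    $handle = fopen($file, 'c+'); // create if missing, do not truncate
    if ($handle === false) {
        return true; // Assumption: fail open if the directory is not writable.
    }
    flock($handle, LOCK_EX); // serialise concurrent requests from the same IP

    $stored = json_decode(stream_get_contents($handle) ?: '[]', true);
    $recent = array_values(array_filter(
        is_array($stored) ? $stored : [],
        static fn ($timestamp) => is_int($timestamp) && $timestamp > $now - $window
    ));

    $allowed = count($recent) < $limit;
    if ($allowed) {
        $recent[] = $now;
    }

    // Rewrite the file with only the timestamps still inside the window.
    ftruncate($handle, 0);
    rewind($handle);
    fwrite($handle, json_encode($recent));
    flock($handle, LOCK_UN);
    fclose($handle);

    return $allowed;
}
```

A request handler would call this with the client IP and the tmp/ directory, and answer with HTTP 429 and a JSON error body when it returns false.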

Frontend (index.html)

  • JavaScript: Vanilla ES6+, no framework
  • Mode toggle: Single URL / Multicrawl
  • Single URL flow: Crawl → auto-generate or manual entry → generate
  • Multicrawl flow: Enter domain → live progress table via SSE
  • Char counters: Live feedback for title (30–60) and description (120–160)
  • Tabs: Title & Desc, Open Graph, Twitter Card, JSON-LD, Robots
  • Copy buttons: Per output block, clipboard API
  • AI badge: Shows active provider name

File structure

  • api.php: Request handler, rate limiter, CORS
  • multicrawl.php: SSE endpoint for full-domain crawl + generation
  • config.php: Provider, API key, model, crawler settings (reads .env)
  • Crawler.php: URL fetch + DOM content extraction
  • Generator.php: Rule-based + AI tag generation
  • SitemapExtractor.php: URL discovery via sitemap + fallbacks
  • providers/: ProviderInterface, Factory, 6 provider classes
  • tmp/: Rate limit JSON files (needs write permission)

Known limitations (v1.0)

  • JS-rendered pages: No headless browser – content rendered only by JavaScript is not crawled
  • Login-protected pages: No authentication support
  • AI response time: 3–10s per page depending on provider
  • Multicrawl speed: Sequential by design – parallel crawling would exceed the rate limits of free provider tiers
  • Rate limiting: File-based only, no distributed cache
  • JSON-LD output: Simplified schema, not a full structured data audit

Built by Sören Meier, 2026 · github.com/soeren777/metatag-generator
Stack: PHP 8.1+ · cURL · DOMDocument · SSE · Vanilla JS · Lighttpd on Raspberry Pi 5