Documentation

Meta Tag Generator

Technical documentation of the Meta Tag Generator: PHP-based URL crawler, background job queue for crawling multiple URLs without timeouts, multi-provider AI architecture with rule-based fallback, and outputs for Title, Description, Keywords, Open Graph, Twitter Card, JSON-LD and Robots.

v1.0 · 7 AI providers · Single URL + Crawl multiple URLs · Retry + XLSX export · PHP · Vanilla JS · Open Source


About the Meta Tag Generator

The Meta Tag Generator crawls any URL server-side and extracts the page content: the H1, H2 headings, paragraphs, existing meta tags, JSON-LD blocks and body text. Both Single URL and Crawl multiple URLs modes hand the long-running work to a background PHP CLI worker. api.php responds as soon as the worker has started, so slow AI responses or 503 retries cannot run into an HTTP timeout.
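A minimal sketch of how such a hand-off could look, based on the nohup launch and the /tmp/metatag-jobs/JOB_ID.json state file listed under the technical details. The function names, the worker's argument order and the state fields are assumptions for illustration, not the tool's actual code. The command relies on a POSIX shell.

```php
<?php
// Hypothetical helper: builds the shell command that detaches a CLI worker.
// The argument order (job id, then target) is an assumption.
function buildWorkerCommand(string $workerScript, string $jobId, string $target): string
{
    return sprintf(
        'nohup php %s %s %s > /dev/null 2>&1 &',
        escapeshellarg($workerScript),
        escapeshellarg($jobId),
        escapeshellarg($target)
    );
}

// Hypothetical helper: writes the job state file that a status endpoint can read.
function writeJobState(string $jobDir, string $jobId, array $state): string
{
    if (!is_dir($jobDir)) {
        mkdir($jobDir, 0700, true);
    }
    $path = $jobDir . '/' . $jobId . '.json';
    // Write to a temporary file first, then rename, so a poll never
    // reads a half-written JSON document.
    $tmp = $path . '.tmp';
    file_put_contents($tmp, json_encode($state, JSON_THROW_ON_ERROR));
    rename($tmp, $path);
    return $path;
}

// Creates the job, starts the worker and returns at once with the job id.
function startJob(string $workerScript, string $jobDir, string $target): string
{
    $jobId = bin2hex(random_bytes(8));
    writeJobState($jobDir, $jobId, ['status' => 'queued', 'done' => 0, 'total' => 0]);
    exec(buildWorkerCommand($workerScript, $jobId, $target));
    return $jobId;
}
```

Because the command ends in `&` and redirects its output, exec() returns without waiting for the worker, which is what lets the HTTP response go out immediately.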

Version 1.0 introduces Crawl multiple URLs mode: the tool discovers the URLs of a domain via sitemap.xml (falling back to robots.txt and link extraction), then processes the pages sequentially, up to 500 per run by default. A background PHP CLI worker runs independently of the web server, so there are no HTTP timeouts even on sites with 20+ pages. The frontend shows live progress by polling the job status every 2 seconds.
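The discovery step can be illustrated with a short sketch that reads page or child-sitemap URLs from a sitemap document and Sitemap: lines from robots.txt. The function names and the returned array shape are assumptions. The real SitemapExtractor.php may work differently.

```php
<?php
// Hypothetical helper: returns the root type ('urlset', 'sitemapindex' or
// 'invalid') plus every <loc> value found in the document.
function parseSitemap(string $xml): array
{
    $result = ['type' => 'invalid', 'urls' => []];
    if (trim($xml) === '') {
        return $result; // loadXML() rejects empty input
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    $loaded = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded || $doc->documentElement === null) {
        return $result;
    }
    $root = $doc->documentElement->localName;
    if ($root !== 'urlset' && $root !== 'sitemapindex') {
        return $result;
    }
    $result['type'] = $root;
    // Matches unprefixed <loc> only, so prefixed entries such as <image:loc> are skipped.
    foreach ($doc->getElementsByTagName('loc') as $loc) {
        $url = trim($loc->textContent);
        if ($url !== '') {
            $result['urls'][] = $url;
        }
    }
    return $result;
}

// Hypothetical helper: collects the "Sitemap:" entries of a robots.txt body.
function sitemapsFromRobots(string $robotsTxt): array
{
    preg_match_all('/^\s*sitemap:\s*(\S+)/im', $robotsTxt, $matches);
    return $matches[1];
}
```

A caller would treat the URLs of a 'sitemapindex' result as further sitemaps to fetch, and those of a 'urlset' result as pages to crawl.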

The AI provider is fully swappable via a single config line – from no AI (free, rule-based) to Gemini, Claude, GPT-4o, Perplexity, Grok or any OpenAI-compatible endpoint.

Tool scope

Modes: Single URL or Crawl multiple URLs (full domain)
Sitemap discovery: sitemap.xml → robots.txt → link extraction fallback
Max URLs per run: 500 (configurable)
Live updates: Job polling every 2s (background worker architecture)
Retry logic: Auto-retry on 503/429 (3 attempts, increasing delays of 2s, 5s and 10s)
Retry failed pages: Manual retry of failed URLs after a crawl run – preserves successful results
XLSX export: Export results via SheetJS – Excel, LibreOffice, Google Sheets compatible
AI providers: 7 (none/rule-based, Anthropic, OpenAI, Google, Perplexity, Grok, OwnAI)
Output types: Title, Description, Keywords, Open Graph, Twitter Card, JSON-LD, Robots
Rate limiting: File-based, per IP, configurable window
Deployment: Single PHP directory, no framework, no database
License: MIT – self-hosted, open source
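The retry behaviour on 503/429 responses could be sketched as follows, using the 2s, 5s and 10s delays listed in the multicrawl details. The function name, the response array shape and the injectable sleep callback are assumptions. The sketch also assumes one initial request followed by up to three retries, which is one possible reading of "3 attempts".

```php
<?php
/**
 * Hypothetical helper: repeats a request while it answers 429 or 503.
 *
 * @param callable $request Returns ['status' => int, 'body' => string]
 * @param int[]    $delays  Seconds to wait before each retry
 * @param callable|null $sleep Injected for testing, defaults to sleep()
 */
function requestWithRetry(callable $request, array $delays = [2, 5, 10], ?callable $sleep = null): array
{
    $sleep ??= static function (int $seconds): void {
        sleep($seconds);
    };

    $response = $request();
    foreach ($delays as $delay) {
        // Only rate-limit and temporary-unavailable answers are retried.
        if (!in_array($response['status'], [429, 503], true)) {
            break;
        }
        $sleep($delay);
        $response = $request();
    }
    // Either a final answer or the last 429/503 after all retries.
    return $response;
}
```

Passing the sleep function as a parameter keeps the helper testable without waiting 17 seconds in a test run.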

Technical details

Crawler (Crawler.php)

Protocol: HTTP/HTTPS via PHP cURL
Redirects: Up to 5 hops
Timeout: 10 seconds per request
SSL: CURLOPT_SSL_VERIFYPEER enabled
User-Agent: MetaTagGenerator/1.0
Content limit: 50,000 characters (configurable)
Parser: PHP DOMDocument + DOMXPath
Encoding: HTML-ENTITIES via mb_convert_encoding
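The fetch settings above map onto standard cURL options roughly as shown in this sketch. The function names are hypothetical, and applying the 50,000-character limit to the raw HTML is an assumption. The actual Crawler.php may apply it to the extracted text instead.

```php
<?php
// Hypothetical helper: the listed crawler settings expressed as cURL options.
function crawlerCurlOptions(string $url, int $timeout = 10, int $maxRedirects = 5): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,           // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,           // follow redirects ...
        CURLOPT_MAXREDIRS      => $maxRedirects,  // ... but only up to 5 hops
        CURLOPT_TIMEOUT        => $timeout,       // 10 seconds per request
        CURLOPT_SSL_VERIFYPEER => true,           // reject invalid certificates
        CURLOPT_USERAGENT      => 'MetaTagGenerator/1.0',
    ];
}

// Hypothetical helper: fetches a page and truncates the body (assumed behaviour).
function fetchPage(string $url, int $contentLimit = 50000): array
{
    $handle = curl_init();
    curl_setopt_array($handle, crawlerCurlOptions($url));
    $body = curl_exec($handle);
    if ($body === false) {
        throw new RuntimeException('Fetch failed: ' . curl_error($handle));
    }
    $status = (int) curl_getinfo($handle, CURLINFO_RESPONSE_CODE);
    // The handle is released automatically when it goes out of scope.
    return ['status' => $status, 'html' => mb_substr($body, 0, $contentLimit)];
}
```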

Content extraction

Title: <title> tag
Meta tags: description, keywords, author, robots, canonical
Open Graph: og:title, og:description, og:image, og:type
Twitter Card: twitter:card, title, description
H1: First heading element
H2s: Up to 5 subheadings
Paragraphs: Up to 6 paragraphs >80 chars
JSON-LD: All <script type="application/ld+json"> blocks
Body text: Stripped of nav/header/footer/scripts
Page type: Auto-detected from JSON-LD or URL pattern
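A reduced sketch of the extraction with DOMDocument and DOMXPath, covering the title, meta description, H1, H2s, paragraphs and JSON-LD blocks. The function name and the result keys are assumptions. The sketch uses mb_encode_numericentity() in place of the HTML-ENTITIES conversion listed above, because that mb_convert_encoding target is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: pulls a subset of the listed fields out of an HTML string.
function extractContent(string $html): array
{
    // Turn non-ASCII characters into numeric entities so DOMDocument
    // does not misread UTF-8 input.
    $html = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $xpath = new DOMXPath($doc);

    // Collapses whitespace and tolerates missing nodes.
    $text = static fn (?DOMNode $node): string =>
        $node ? trim(preg_replace('/\s+/', ' ', $node->textContent)) : '';

    $h2s = [];
    foreach ($xpath->query('//h2') as $node) {
        if (count($h2s) >= 5) {
            break; // up to 5 subheadings
        }
        $h2s[] = $text($node);
    }

    $paragraphs = [];
    foreach ($xpath->query('//p') as $node) {
        $paragraph = $text($node);
        // up to 6 paragraphs longer than 80 characters
        if (mb_strlen($paragraph) > 80 && count($paragraphs) < 6) {
            $paragraphs[] = $paragraph;
        }
    }

    $jsonLd = [];
    foreach ($xpath->query('//script[@type="application/ld+json"]') as $node) {
        $decoded = json_decode($node->textContent, true);
        if (is_array($decoded)) {
            $jsonLd[] = $decoded; // invalid JSON blocks are skipped
        }
    }

    return [
        'title'       => $text($xpath->query('//title')->item(0)),
        'description' => $xpath->evaluate('string(//meta[@name="description"]/@content)'),
        'h1'          => $text($xpath->query('//h1')->item(0)),
        'h2s'         => $h2s,
        'paragraphs'  => $paragraphs,
        'json_ld'     => $jsonLd,
    ];
}
```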

Crawl multiple URLs (multicrawl-worker.php + job-status.php)

URL discovery: sitemap.xml → sitemap_index.xml → robots.txt → link extraction
Max URLs: 500 per run (configurable)
Architecture: Background PHP CLI worker via nohup
Job state: Written to /tmp/metatag-jobs/JOB_ID.json after each URL
Progress polling: Frontend polls job-status.php every 2s
Job cleanup: Auto-deleted after 1 hour
Retry logic: 3 attempts on 503/429 (2s, 5s, 10s delays)
Request delay: Configurable ms between requests
Domain filtering: Only same-domain URLs processed
Deduplication: URL normalisation, fragments stripped
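Domain filtering and deduplication could be combined as in the following sketch. The function names are hypothetical, and the normalisation rules beyond stripping fragments (lowercasing scheme and host, defaulting an empty path to "/", dropping ports and credentials) are assumptions.

```php
<?php
// Hypothetical helper: returns a canonical form of an http(s) URL, or null.
function normaliseUrl(string $url): ?string
{
    $parts = parse_url(trim($url));
    if ($parts === false || empty($parts['scheme']) || empty($parts['host'])) {
        return null;
    }
    $scheme = strtolower($parts['scheme']);
    if ($scheme !== 'http' && $scheme !== 'https') {
        return null; // mailto:, tel:, javascript: and similar are ignored
    }
    $host  = strtolower($parts['host']);
    $path  = $parts['path'] ?? '/';
    $query = isset($parts['query']) ? '?' . $parts['query'] : '';
    // The fragment is left out on purpose: #section links point at the same page.
    return $scheme . '://' . $host . $path . $query;
}

// Hypothetical helper: keeps unique same-domain URLs, capped at $max.
function filterUrls(array $urls, string $domain, int $max = 500): array
{
    $domain = strtolower($domain);
    $seen = [];
    foreach ($urls as $url) {
        $normalised = normaliseUrl($url);
        if ($normalised === null || parse_url($normalised, PHP_URL_HOST) !== $domain) {
            continue;
        }
        $seen[$normalised] = true; // array keys deduplicate for free
        if (count($seen) >= $max) {
            break;
        }
    }
    return array_keys($seen);
}
```

Note that this strict host comparison treats www.example.com and example.com as different domains. Whether the tool does the same is not stated here.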

AI provider system

Interface: ProviderInterface with generate(string): string
Factory: ProviderFactory::create($config)
none: Rule-based, no API key required
anthropic: Claude via /v1/messages
openai: GPT-4o via /v1/chat/completions
google: Gemini via generateContent API
perplexity: OpenAI-compatible endpoint
grok: xAI via api.x.ai
ownai: Any OpenAI-compatible custom endpoint
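The interface and factory pattern might look like the sketch below. It only covers the rule-based mode and the OpenAI-compatible providers. The config keys (provider, api_key, model, base_url), the class name OpenAiCompatibleProvider, the buildPayload() helper, the default base URLs and the choice to return null for "none" are all assumptions, not the project's actual code.

```php
<?php
interface ProviderInterface
{
    public function generate(string $prompt): string;
}

// Hypothetical class: talks to any endpoint that follows the
// OpenAI chat-completions request and response format.
final class OpenAiCompatibleProvider implements ProviderInterface
{
    public function __construct(
        private string $baseUrl,
        private string $apiKey,
        private string $model
    ) {
    }

    public function buildPayload(string $prompt): array
    {
        return [
            'model'    => $this->model,
            'messages' => [['role' => 'user', 'content' => $prompt]],
        ];
    }

    public function generate(string $prompt): string
    {
        $handle = curl_init(rtrim($this->baseUrl, '/') . '/chat/completions');
        curl_setopt_array($handle, [
            CURLOPT_POST           => true,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HTTPHEADER     => [
                'Content-Type: application/json',
                'Authorization: Bearer ' . $this->apiKey,
            ],
            CURLOPT_POSTFIELDS     => json_encode($this->buildPayload($prompt), JSON_THROW_ON_ERROR),
        ]);
        $raw = curl_exec($handle);
        if ($raw === false) {
            throw new RuntimeException('Provider request failed: ' . curl_error($handle));
        }
        $data = json_decode($raw, true);
        return (string) ($data['choices'][0]['message']['content'] ?? '');
    }
}

final class ProviderFactory
{
    // Returns null for "none", which the caller treats as rule-based mode (assumed).
    public static function create(array $config): ?ProviderInterface
    {
        $name = $config['provider'] ?? 'none';
        $defaults = [ // assumed default base URLs
            'openai'     => 'https://api.openai.com/v1',
            'perplexity' => 'https://api.perplexity.ai',
            'grok'       => 'https://api.x.ai/v1',
        ];
        if ($name === 'none') {
            return null;
        }
        if ($name === 'ownai' || isset($defaults[$name])) {
            $baseUrl = $config['base_url'] ?? $defaults[$name] ?? null;
            if ($baseUrl === null) {
                throw new InvalidArgumentException('ownai requires a base_url');
            }
            return new OpenAiCompatibleProvider($baseUrl, $config['api_key'] ?? '', $config['model'] ?? '');
        }
        throw new InvalidArgumentException("Provider '$name' is not covered by this sketch");
    }
}
```

Anthropic and Google use their own request formats and would need separate classes behind the same interface.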

Generator (Generator.php)

Rule-based title: Strips domain suffix, trims to 60 chars
Rule-based description: First complete sentences up to 155 chars
AI prompt: 4,000 chars of page content + H1, H2s, keywords
AI prompt length rules: title 50–60 / description 150–160 chars marked as hard limits, with examples and a self-check instruction before responding
AI output: JSON with title, description, keywords, suggestions
Post-processing: title and description run through trimToLength() as a safety net, regardless of AI compliance
trimToLength() – 3-step cut: (1) cut at sentence end (./!/?) if past half the limit, no ellipsis needed; (2) else cut at last natural separator (|, –, —, :, ,), no ellipsis; (3) else cut at word boundary, strip trailing stopwords (prepositions/articles/conjunctions, DE+EN), then append "…"
Fallback: Rule-based if AI response is unparseable
OG type: article or website based on page type
JSON-LD type: Article, Product, Organization or WebPage
Robots: 3 variants – standard, AI open, AI block
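The three-step cut of trimToLength() can be sketched as follows. This is a reading of the description above, not the code from Generator.php: the signature, the abbreviated stopword list and the handling of edge cases are assumptions. The dash separators and the ellipsis are written as Unicode escapes.

```php
<?php
function trimToLength(string $text, int $limit): string
{
    $text = trim($text);
    if (mb_strlen($text) <= $limit) {
        return $text; // already short enough
    }
    $window = mb_substr($text, 0, $limit);

    // Finds the last position of any of the given marks inside the window.
    $lastOf = static function (array $marks) use ($window): int {
        $best = -1;
        foreach ($marks as $mark) {
            $pos = mb_strrpos($window, $mark);
            if ($pos !== false && $pos > $best) {
                $best = $pos;
            }
        }
        return $best;
    };

    // Step 1: sentence end past half the limit, keep the punctuation, no ellipsis.
    $pos = $lastOf(['.', '!', '?']);
    if ($pos >= intdiv($limit, 2)) {
        return mb_substr($window, 0, $pos + 1);
    }

    // Step 2: last natural separator (pipe, en dash, em dash, colon, comma), no ellipsis.
    $pos = $lastOf(['|', "\u{2013}", "\u{2014}", ':', ',']);
    if ($pos > 0) {
        return rtrim(mb_substr($window, 0, $pos));
    }

    // Step 3: word boundary, leaving one character for the ellipsis.
    $cut = mb_substr($text, 0, $limit - 1);
    $space = mb_strrpos($cut, ' ');
    if (mb_substr($text, $limit - 1, 1) !== ' ' && $space !== false) {
        $cut = mb_substr($cut, 0, $space); // the cut landed mid-word
    }
    $words = preg_split('/\s+/u', trim($cut), -1, PREG_SPLIT_NO_EMPTY);
    // Abbreviated EN + DE stopword list (assumed).
    $stopwords = ['and', 'or', 'the', 'a', 'an', 'of', 'for', 'in', 'on', 'to', 'with',
                  'und', 'oder', 'der', 'die', 'das', 'mit', 'von', 'im'];
    while (count($words) > 1 && in_array(mb_strtolower(end($words)), $stopwords, true)) {
        array_pop($words);
    }
    return implode(' ', $words) . "\u{2026}";
}
```

For example, with a limit of 22 the input "Best widgets for the home and garden" has no sentence end or separator, so step 3 applies: the text is cut before "home", the trailing "the" and "for" are stripped, and the result is "Best widgets" followed by an ellipsis.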

API (api.php)

Method: POST, JSON body
Action crawl: Fetches URL, returns page data + provider info
Action generate: Starts background worker, returns immediately – no timeout risk
Action start_multicrawl: Creates job, starts multicrawl-worker.php via nohup
Action retry_multicrawl: Creates new job with only failed URLs, merges results on completion
CORS: Origin-restricted to deploying domain
Rate limiting: File-based per IP, 10 req / 60s (configurable)
Error handling: JSON error responses with HTTP status codes
Response format: {ok, data} or {ok, error}
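A file-based, per-IP rate limiter with the 10 requests per 60 seconds default could be sketched like this. The function name, the sliding-window strategy, the hashed file names and the fail-open behaviour on I/O errors are assumptions. The tool's own limiter in api.php may differ.

```php
<?php
// Hypothetical helper: returns true if the request is allowed and records it.
function allowRequest(string $dir, string $ip, int $limit = 10, int $window = 60, ?int $now = null): bool
{
    $now ??= time();
    if (!is_dir($dir)) {
        mkdir($dir, 0700, true);
    }
    // Hash the IP so the file name is safe for IPv6 and reveals no address.
    $file = $dir . '/' . hash('sha256', $ip) . '.json';

    $handle = fopen($file, 'c+'); // create if missing, do not truncate
    if ($handle === false) {
        return true; // fail open rather than block everyone (assumed policy)
    }
    flock($handle, LOCK_EX); // serialise concurrent requests from the same IP

    $stamps = json_decode(stream_get_contents($handle) ?: '[]', true);
    if (!is_array($stamps)) {
        $stamps = [];
    }
    // Keep only the timestamps that are still inside the window.
    $stamps = array_values(array_filter(
        $stamps,
        static fn ($stamp) => is_int($stamp) && $stamp > $now - $window
    ));

    $allowed = count($stamps) < $limit;
    if ($allowed) {
        $stamps[] = $now;
    }

    ftruncate($handle, 0);
    rewind($handle);
    fwrite($handle, json_encode($stamps));
    flock($handle, LOCK_UN);
    fclose($handle);

    return $allowed;
}
```

A handler would call this first and answer with HTTP 429 and an {ok, error} body when it returns false.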

Frontend (index.html)

JavaScript: Vanilla ES6+, no framework
Mode toggle: Single URL / Crawl multiple URLs
Single URL flow: Crawl → worker generates in background → result polled and rendered
Crawl multiple URLs flow: Enter domain → background job → live progress table via polling
Retry failed pages: After a crawl run, a retry button appears for failed URLs – starts a new job with only those URLs, merges results back into the existing table without touching successful rows
XLSX export: Export results as .xlsx via SheetJS (CDN) – compatible with Excel, LibreOffice and Google Sheets without encoding issues
Char counters: Live feedback for title (30–60) and description (120–160)
Tabs: Title & Desc, Open Graph, Twitter Card, JSON-LD, Robots
Copy buttons: Per output block, clipboard API
AI badge: Shows active provider name

File structure

api.php: Request handler, rate limiter, CORS – actions: crawl, generate, start_multicrawl, retry_multicrawl
generate-worker.php: Background CLI worker for Single URL generation
multicrawl-worker.php: Background CLI worker for crawl multiple URLs – also accepts direct URL list for retry runs
multicrawl.php: Legacy SSE endpoint (superseded by job queue)
job-status.php: Job progress endpoint, polled every 2s
config.php: Provider, API key, model, crawler settings (reads .env)
Crawler.php: URL fetch + DOM content extraction
Generator.php: Rule-based + AI tag generation with trimToLength() post-processing
SitemapExtractor.php: URL discovery via sitemap + fallbacks
providers/: ProviderInterface, ProviderFactory, 6 provider classes
tmp/: Rate limit JSON files (needs write permission)

Known limitations (v1.0)

JS-rendered pages: No headless browser – JS-only content not crawled
Login-protected pages: No authentication support
AI response time: 3–10s per page depending on provider
Multicrawl speed: Sequential by design – parallel crawling would exceed free-tier rate limits
Rate limiting: File-based only, no distributed cache
JSON-LD output: Simplified schema, not a full structured data audit

Built by Sören Meier, 2026 · github.com/soeren777/metatag-generator
Stack: PHP 8.1+ · cURL · DOMDocument · SSE · Vanilla JS · Lighttpd on Raspberry Pi 5