Web Crawler

Technical documentation of the server-side web crawler. Covers architecture, URL normalisation pipeline, deduplication, internal PageRank engine, adaptive priority scoring, orphan detection and all export formats.

v1.0 · 250 pages default limit · 10 analysis signals · Node.js · Express · Cheerio · Axios · Port 5000

Overview

The Web Crawler is a lightweight but feature-rich server-side crawling engine built entirely in Node.js without browser automation. Starting from a given base URL it recursively discovers internal links, analyses each page and produces a comprehensive dataset covering SEO signals, content quality, canonical state, link structure and sitemap data.

The crawler integrates with an existing sitemap.xml if present, uses it as a seed list and enriches crawl results with original priority and lastmod values. After crawling, a multi-stage post-processing pipeline runs duplicate detection, internal PageRank calculation, orphan detection and priority scoring before the final result is returned.

Capabilities at a glance

  • Max pages: 250 (configurable per request)
  • Concurrency: Sequential with 200 ms politeness delay
  • Redirect handling: Up to 5 hops (301/302/303/307/308)
  • robots.txt: Full parsing, wildcard support, optional
  • Sitemap integration: Reads existing sitemap.xml as seed
  • Duplicate detection: Path-normalised key comparison
  • Internal PageRank: 25 iterations, damping 0.85
  • Orphan detection: 3 categories (true / sitemap / near)
  • Thin content scoring: 5 weighted signals
  • Export formats: JSON, CSV, TXT, XML Sitemap
  • JS navigation: Not supported (static HTML only)
  • Dependencies: Express, Axios, Cheerio

URL Normalisation Pipeline

Every URL discovered during crawling passes through a multi-step normalisation pipeline before being added to the queue. This prevents duplicate visits caused by trailing slashes, index files, tracking parameters or inconsistent query string ordering.

Normalisation steps

  • Trailing slash after query string: Removed (e.g. ?tid=2/ → ?tid=2)
  • Double slashes: Collapsed to single slash
  • index.php / index.html: Replaced with /
  • Trailing slash enforcement: Added to directory-style paths
  • Tracking parameters: Stripped (UTM, fbclid, gclid, ref etc.)
  • Query string sorting: Alphabetical – prevents order-based duplicates
  • Path + query conflict: Trailing slash removed if query params present
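The steps above can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: the function name normaliseUrl, the TRACKING_PARAMS set (filled from the tracking parameters this document lists) and the "no dot in the last segment" test for directory-style paths are all assumptions.

```javascript
// Hypothetical helper: name, step order and the directory heuristic are assumptions.
const TRACKING_PARAMS = new Set([
  'utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content',
  'fbclid', 'gclid', 'msclkid', 'mc_eid', 'ref', 'source', 'affiliate',
]);

function normaliseUrl(rawUrl) {
  // Drop a trailing slash that follows a query string (?tid=2/ -> ?tid=2).
  const cleaned = rawUrl.includes('?') ? rawUrl.replace(/\/+$/, '') : rawUrl;
  const url = new URL(cleaned);

  // Collapse repeated slashes, then map index files to their directory.
  let path = url.pathname.replace(/\/{2,}/g, '/');
  path = path.replace(/\/index\.(php|html)$/i, '/');

  // Strip tracking parameters and sort the remaining ones by key.
  const params = [...url.searchParams.entries()]
    .filter(([key]) => !TRACKING_PARAMS.has(key.toLowerCase()))
    .sort(([a], [b]) => a.localeCompare(b));

  // Directory-style paths get a trailing slash, unless query parameters
  // remain, in which case the trailing slash is removed instead.
  const looksLikeFile = path.split('/').pop().includes('.');
  if (params.length > 0) {
    if (path.length > 1) path = path.replace(/\/$/, '');
  } else if (!path.endsWith('/') && !looksLikeFile) {
    path += '/';
  }

  url.pathname = path;
  url.search = new URLSearchParams(params).toString();
  return url.toString();
}
```

With this sketch, /about?b=2&a=1&gclid=zz and /about?a=1&b=2 map to the same queue key, so the page is visited only once.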

URL validation rules

  • Same domain only: External links discarded
  • Protocol: http: and https: only
  • Max query params: 1 (prevents filter/forum explosions)
  • PHP with params: Whitelist only (showthread, viewtopic, topic.php etc.)
  • Binary assets: Images, fonts, CSS, JS, PDF, ZIP filtered out
  • Feeds: RSS, Atom in all variants excluded
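A possible shape for these validation rules is shown below. The helper name isCrawlable, the regular expressions and the shortened whitelist are illustrative assumptions; the real extension, feed and whitelist lists are not reproduced in this document, and "same domain" is interpreted here as an exact hostname match.

```javascript
// Illustrative subsets only: the crawler's real lists are not shown in this document.
const ASSET_EXT = /\.(png|jpe?g|gif|svg|webp|ico|woff2?|ttf|css|js|pdf|zip)$/i;
const FEED_PATH = /\/(feed|rss|atom)(\/|\.xml)?$/i;
const PHP_WHITELIST = ['showthread', 'viewtopic', 'topic.php'];

function isCrawlable(candidate, baseUrl) {
  let url;
  try {
    url = new URL(candidate, baseUrl); // resolves relative hrefs
  } catch {
    return false; // unparsable href
  }
  const base = new URL(baseUrl);

  if (url.protocol !== 'http:' && url.protocol !== 'https:') return false;
  if (url.hostname !== base.hostname) return false; // assumption: exact host match
  if (ASSET_EXT.test(url.pathname) || FEED_PATH.test(url.pathname)) return false;

  // At most one query parameter; PHP scripts with parameters must be whitelisted.
  const paramCount = [...url.searchParams.keys()].length;
  if (paramCount > 1) return false;
  const path = url.pathname.toLowerCase();
  if (paramCount > 0 && path.endsWith('.php')) {
    return PHP_WHITELIST.some((name) => path.includes(name));
  }
  return true;
}
```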

Exclude patterns (43 rules)

  • WordPress: /wp-admin/, /wp-login, /wp-json/, login, cart, checkout
  • Forum systems: MyBB, phpBB, vBulletin, SMF action URLs
  • System pages: /print/, /feed/, /xmlrpc, /cgi-bin/, /admin/, /uploads/
  • Action parameters: action=, sort=, filter=, doing_wp_cron= etc.

Tracking parameters stripped

  • Analytics: utm_source, utm_medium, utm_campaign, utm_term, utm_content
  • Ad platforms: fbclid, gclid, msclkid, mc_eid
  • General: ref, source, affiliate

Page Analysis & Metadata Extraction

Each crawled page is parsed with Cheerio and a comprehensive set of signals is extracted. Navigation, header and footer elements are removed before text analysis to avoid inflating word counts with boilerplate content.

SEO signals

  • Title: Text content, length tracked
  • Meta description: Presence and content
  • Meta robots: noindex detection
  • X-Robots-Tag header: noindex / none detection
  • Canonical tag: Extracted, normalised, compared to page URL
  • H1 / H2 / H3 count: All heading levels counted
  • Image count: All <img> tags
  • Internal link count: All <a href> tags
  • External link count: Links to other domains

Content quality signals

  • Word count: Words >2 chars after nav/header/footer removal
  • Text/HTML ratio: Body text length ÷ raw HTML length
  • Last modified: HTTP header → article:modified_time → dateModified
  • JSON-LD: Organization / WebSite schema extraction (email, description)
  • Redirect detection: Final URL after all hops recorded

Thin content scoring (thinScore 0.0–1.0)

  • Word count <300: −0.3
  • Word count <100: −0.2 (cumulative)
  • Text/HTML ratio <0.1: −0.2
  • Link count <2: −0.1
  • External link ratio >70%: −0.2 (link directory signal)
  • Result: Clamped to a minimum of 0.0 and rounded to 1 decimal place
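The penalties above could be combined as in the following sketch. It assumes that scoring starts at 1.0 (the five penalties sum to exactly 1.0) and that the external link ratio is externalLinkCount divided by linkCount; neither detail is stated explicitly here, and computeThinScore is a hypothetical name.

```javascript
// Assumptions: start value 1.0; external ratio = externalLinkCount / linkCount.
function computeThinScore({ wordCount, textHtmlRatio, linkCount, externalLinkCount }) {
  let score = 1.0;
  if (wordCount < 300) score -= 0.3;
  if (wordCount < 100) score -= 0.2; // cumulative with the <300 penalty
  if (textHtmlRatio < 0.1) score -= 0.2;
  if (linkCount < 2) score -= 0.1;
  const externalRatio = linkCount > 0 ? externalLinkCount / linkCount : 0;
  if (externalRatio > 0.7) score -= 0.2; // link directory signal
  // Clamp to 0.0 and round to one decimal place.
  return Math.round(Math.max(0, score) * 10) / 10;
}
```

A page with 200 words, a healthy text/HTML ratio and mostly internal links only triggers the first penalty and ends at 0.7.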

Per-page output fields

  • url, title, description, bodyText, jsonLd
  • canonical, isCanonical, isNoIndex
  • h1Count, h2Count, h3Count
  • imageCount, linkCount, externalLinkCount
  • wordCount, textHtmlRatio, thinScore, qualityScore
  • depth, rawDepth, lastModified, changefreq
  • priority, pageRank, pageRankRaw
  • source, inOriginalSitemap
  • isDuplicate, duplicateOf
  • isOrphan, isSitemapOrphan, isNearOrphan, incomingLinks
  • isLegalPage

Sitemap Integration

Before crawling begins, the crawler fetches the existing sitemap.xml from the target domain's root. The sitemap URLs are added to the crawl queue with depth=null (meaning: present in sitemap but not yet reached via hyperlinks). This ensures that sitemap-only pages are always checked even if they are not linked from the crawled pages.

The original sitemap's priority and lastmod values are stored per URL and used later in the hybrid priority calculation. Pages found both in the sitemap and via crawling receive the BOTH source flag; those found only via crawling get CRAWLED; those only in the sitemap get a null depth and SITEMAP_ONLY treatment in the priority engine.
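The source flags described above might be assigned along these lines. The function name classifySources, the input shapes and the field names originalPriority and originalLastmod are assumptions made for illustration only.

```javascript
// Hypothetical merge of sitemap seeds and crawl discoveries.
// sitemapEntries: [{ loc, priority, lastmod }]; crawledDepths: Map<url, depth>
function classifySources(sitemapEntries, crawledDepths) {
  const pages = new Map();

  // Sitemap seeds start with depth=null until a hyperlink reaches them.
  for (const { loc, priority, lastmod } of sitemapEntries) {
    pages.set(loc, {
      url: loc, depth: null, source: 'SITEMAP_ONLY', inOriginalSitemap: true,
      originalPriority: priority, originalLastmod: lastmod,
    });
  }

  // Pages reached by crawling either upgrade a seed to BOTH or are new.
  for (const [url, depth] of crawledDepths) {
    const existing = pages.get(url);
    if (existing) {
      existing.depth = depth;
      existing.source = 'BOTH';
    } else {
      pages.set(url, { url, depth, source: 'CRAWLED', inOriginalSitemap: false });
    }
  }
  return [...pages.values()];
}
```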

Duplicate Detection

After all pages are crawled, a path-normalised deduplication pass runs over the full results. The normalisation key is derived from the URL pathname, lowercased and with trailing slashes stripped. The first occurrence keeps isDuplicate=false; all subsequent occurrences for the same key are marked isDuplicate=true with a duplicateOf reference.

This detects URL variants such as /blog/page1 vs /blog/page1?ref=share, or /page/ vs /page/index.html. Duplicate pages are excluded from all frontend display views, CSV/TXT/JSON exports and the generated sitemap.xml.
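A minimal sketch of such a pass is shown below. The name markDuplicates is hypothetical, and the sketch assumes that index files have already been rewritten by the normalisation pipeline, so that the pathname alone is enough to build the key.

```javascript
// Hypothetical dedup pass: key = lowercased pathname without trailing slashes.
function markDuplicates(pages) {
  const firstSeen = new Map(); // key -> URL of the first occurrence
  return pages.map((page) => {
    const key = new URL(page.url).pathname.toLowerCase().replace(/\/+$/, '');
    if (firstSeen.has(key)) {
      return { ...page, isDuplicate: true, duplicateOf: firstSeen.get(key) };
    }
    firstSeen.set(key, page.url);
    return { ...page, isDuplicate: false, duplicateOf: null };
  });
}
```

Because the query string is not part of the key, /blog/page1 and /blog/page1?ref=share collapse to the same entry.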

Internal PageRank Engine

After crawling and deduplication, a simplified internal PageRank calculation runs over all crawled pages. The algorithm uses 25 iterations with a damping factor of 0.85, consistent with the standard PageRank formulation. Outgoing link counts are approximated from each page's linkCount field.

The raw PageRank values are normalised to a 0.1–1.0 scale. The normalised value (pageRank) is then used as an additive bonus or penalty of (pageRank − 0.5) × 0.2 in the final priority engine: pages with pageRank=1.0 receive +0.1, pages at 0.5 receive ±0, and pages at the 0.1 lower bound receive −0.08.
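The calculation could look roughly like this. It is a sketch under several assumptions: the edge list of [from, to] URL pairs, the function name internalPageRank, min-max scaling as the normalisation method and the neutral 0.5 for a site where all ranks are equal are not taken from the crawler's source.

```javascript
// Hypothetical sketch. pages: [{ url, linkCount }]; edges: [[fromUrl, toUrl], ...]
function internalPageRank(pages, edges, iterations = 25, damping = 0.85) {
  const n = pages.length;
  // Outgoing link counts are approximated from linkCount, as described above.
  const outCount = new Map(pages.map((p) => [p.url, Math.max(p.linkCount, 1)]));
  let rank = new Map(pages.map((p) => [p.url, 1 / n]));

  for (let i = 0; i < iterations; i++) {
    const next = new Map(pages.map((p) => [p.url, (1 - damping) / n]));
    for (const [from, to] of edges) {
      if (!next.has(to) || !rank.has(from)) continue; // ignore unknown URLs
      next.set(to, next.get(to) + (damping * rank.get(from)) / outCount.get(from));
    }
    rank = next;
  }

  // Min-max scale the raw values onto 0.1-1.0 (assumed normalisation method).
  const values = [...rank.values()];
  const min = Math.min(...values);
  const max = Math.max(...values);
  return pages.map((p) => {
    const raw = rank.get(p.url);
    const scaled = max > min ? 0.1 + (0.9 * (raw - min)) / (max - min) : 0.5;
    return { ...p, pageRankRaw: raw, pageRank: scaled };
  });
}
```

Dividing by linkCount rather than by the true number of internal out-links means rank is not conserved across the graph; the normalisation step hides this, which is why the document treats the result as an approximation.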

Adaptive Priority Engine

The final page priority (used in the generated sitemap.xml) is computed in three stages. The result is clamped to 0.1–1.0 and rounded to one decimal place. The root URL always receives priority 1.0.

Stage 1 – Depth-based base priority

  • Depth 0 (root): 1.0
  • Depth 1: 0.8
  • Depth 2: 0.6
  • Depth 3: 0.5
  • Depth 4+: 0.3
  • Sitemap-only (depth=null): Median site depth used

Stage 2 – Hybrid weighting

  • BOTH (crawled + sitemap): 70% crawler + 30% original priority
  • SITEMAP_ONLY: 50% crawler + 50% original priority
  • CRAWLED only: 100% crawler priority

Stage 3 – Final priority signals

  • PageRank bonus: (pageRank − 0.5) × 0.2 → range ±0.1
  • Thin content penalty: (thinScore − 1.0) × 0.1 → max −0.1
  • Legal page floor: Minimum 0.3, changefreq forced yearly
  • Orphan penalty: True orphan −0.2, near-orphan −0.1, sitemap-orphan −0.1
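The three stages can be combined as in the sketch below. The function names, the originalPriority field, the fallback when the sitemap carries no priority, the integer medianDepth argument and the order in which the Stage 3 signals are applied (legal floor last) are assumptions rather than confirmed behaviour.

```javascript
// Stage 1: depth-based base priority.
function basePriority(depth) {
  if (depth === 0) return 1.0;
  if (depth === 1) return 0.8;
  if (depth === 2) return 0.6;
  if (depth === 3) return 0.5;
  return 0.3;
}

// Hypothetical combination of all stages; medianDepth is assumed to be an integer.
function computePriority(page, medianDepth) {
  if (page.depth === 0) return 1.0; // root always receives 1.0

  const crawler = basePriority(page.depth === null ? medianDepth : page.depth);
  // Assumption: fall back to the crawler value if the sitemap had no priority.
  const original = page.originalPriority ?? crawler;

  // Stage 2: hybrid weighting by source.
  let priority = crawler;
  if (page.source === 'BOTH') priority = 0.7 * crawler + 0.3 * original;
  else if (page.source === 'SITEMAP_ONLY') priority = 0.5 * crawler + 0.5 * original;

  // Stage 3: PageRank bonus, thin content penalty, orphan penalty, legal floor.
  priority += (page.pageRank - 0.5) * 0.2;
  priority += (page.thinScore - 1.0) * 0.1;
  if (page.isOrphan) priority -= 0.2;
  else if (page.isNearOrphan || page.isSitemapOrphan) priority -= 0.1;
  if (page.isLegalPage) priority = Math.max(priority, 0.3);

  // Clamp to 0.1-1.0 and round to one decimal place.
  return Math.round(Math.min(1.0, Math.max(0.1, priority)) * 10) / 10;
}
```

For example, a depth-1 page found in both sources with an original priority of 0.5 and pageRank=1.0 yields 0.7 × 0.8 + 0.3 × 0.5 + 0.1 = 0.81, which rounds to 0.8.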

Adaptive changefreq

  • Root (/): daily
  • Legal pages: yearly (forced)
  • Blog/news paths: weekly
  • Depth ≤1: weekly
  • lastmod ≤1 day old: daily
  • lastmod ≤14 days: weekly
  • lastmod ≤365 days: monthly
  • lastmod >365 days: yearly
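One way to read these rules is as a first-match-wins cascade in the listed order, as sketched here. That precedence, the blog/news path pattern and the monthly fallback for pages without a lastmod value are assumptions; computeChangefreq is a hypothetical name.

```javascript
// Hypothetical cascade: the first matching rule wins, in the order listed above.
function computeChangefreq(page, now = new Date()) {
  const path = new URL(page.url).pathname;
  if (path === '/') return 'daily';
  if (page.isLegalPage) return 'yearly';
  if (/\/(blog|news)(\/|$)/i.test(path)) return 'weekly'; // assumed path pattern
  if (page.depth !== null && page.depth <= 1) return 'weekly';

  if (!page.lastModified) return 'monthly'; // assumption: not specified above
  const ageDays = (now - new Date(page.lastModified)) / 86400000;
  if (ageDays <= 1) return 'daily';
  if (ageDays <= 14) return 'weekly';
  if (ageDays <= 365) return 'monthly';
  return 'yearly';
}
```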

Orphan Detection

After the PageRank pass, every page is classified for link isolation. Three categories are distinguished to avoid false positives for pages that are legitimately only in the sitemap.

True orphan

  • Condition: 0 incoming links, not in original sitemap, no crawled parent path
  • Priority penalty: −0.2
  • Severity: Critical – page may be inaccessible

Sitemap orphan

  • Condition: 0 incoming links but present in original sitemap
  • Priority penalty: −0.1
  • Severity: Warning – page exists but internal linking missing

Near-orphan

  • Condition: Exactly 1 incoming link AND depth >2
  • Priority penalty: −0.1
  • Severity: Notice – weakly linked, may be hard to discover
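The three conditions can be expressed compactly as below. The helper names parentUrl and classifyOrphan are hypothetical, and "no crawled parent path" is interpreted here as "the URL one path segment up is not among the crawled URLs", which is an assumption. Special handling of the root URL is not covered.

```javascript
// Returns the URL one path segment up, or null for the root.
function parentUrl(url) {
  const u = new URL(url);
  const segments = u.pathname.split('/').filter(Boolean);
  if (segments.length === 0) return null;
  segments.pop();
  u.pathname = segments.length ? `/${segments.join('/')}/` : '/';
  u.search = '';
  return u.toString();
}

// crawledUrls: Set of normalised URLs that were actually crawled.
function classifyOrphan(page, crawledUrls) {
  const parent = parentUrl(page.url);
  const hasCrawledParent = parent !== null && crawledUrls.has(parent);
  const noIncoming = page.incomingLinks === 0;
  return {
    isOrphan: noIncoming && !page.inOriginalSitemap && !hasCrawledParent,
    isSitemapOrphan: noIncoming && page.inOriginalSitemap,
    isNearOrphan: page.incomingLinks === 1 && page.depth > 2,
  };
}
```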

Legal Page Protection

Pages matching known legal/trust URL patterns (impressum, imprint, datenschutz, privacy, terms, agb, contact, about, kontakt, ueber-uns) are never filtered or penalised. Their priority floor is 0.3, changefreq is forced to yearly, and their qualityScore is clamped to a minimum of 0 regardless of thin content signals. This ensures trust signals for search engines are never accidentally removed from generated sitemaps.

Export Formats

All exports apply the same filter: noindex pages and pages marked isDuplicate are excluded. The totalPages count in the JSON export reflects the filtered set.

JSON

  • Content: Full crawl dataset including all analysis fields
  • totalPages: Reflects deduplicated count
  • Use case: Integration, further processing, archiving

CSV

  • Columns: Page, Title, Canonical, isCanonical, H1, H2, Links, Images
  • Use case: Spreadsheet analysis, client reporting

TXT

  • Format: URL | Title | Canonical:Yes/No | H1:n H2:n Links:n Images:n
  • Use case: Quick overview, copy-paste, plain text audits

XML Sitemap

  • Standard: Sitemap Protocol 0.9
  • Filter: noindex, duplicates, non-canonical, binary assets excluded
  • Fields: loc, lastmod (today), priority (from the adaptive priority engine), changefreq
  • Use case: Direct submission to Search Console
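A sitemap with these fields could be generated as in the following sketch. The function names are hypothetical and the filter is simplified to the isNoIndex, isDuplicate and isCanonical flags from the per-page output; binary assets are assumed to have been rejected during URL validation already.

```javascript
// Escape the five XML special characters so URLs with "&" stay well-formed.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;').replace(/'/g, '&apos;');
}

// Hypothetical generator; "today" is injectable to keep the output reproducible.
function buildSitemapXml(pages, today = new Date().toISOString().slice(0, 10)) {
  const entries = pages
    .filter((p) => !p.isNoIndex && !p.isDuplicate && p.isCanonical)
    .map((p) => [
      '  <url>',
      `    <loc>${escapeXml(p.url)}</loc>`,
      `    <lastmod>${today}</lastmod>`,
      `    <changefreq>${p.changefreq}</changefreq>`,
      `    <priority>${p.priority.toFixed(1)}</priority>`,
      '  </url>',
    ].join('\n'));

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</urlset>',
  ].join('\n');
}
```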

Tree Visualisation

After a completed crawl, the frontend can generate an SVG tree diagram of the entire site structure. The tree is built from all crawled pages, grouped by URL path segments into a hierarchical node structure. Layout is computed with a recursive space-partitioning algorithm that distributes nodes proportionally based on subtree width.

The SVG is generated entirely client-side without external libraries and can be downloaded as a standalone file. Node colours indicate depth: root in red, first-level categories in purple, deeper levels in blue. Each node shows page title, path, H1/link/image counts and a child-count badge.
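The grouping step and the subtree width used for proportional layout might be implemented along these lines. The node shape and the names buildTree and subtreeWidth are assumptions, and the SVG rendering itself is omitted.

```javascript
// Hypothetical tree builder: one node per URL path segment.
function buildTree(pages) {
  const root = { segment: '/', page: null, children: new Map() };
  for (const page of pages) {
    const segments = new URL(page.url).pathname.split('/').filter(Boolean);
    let node = root;
    for (const segment of segments) {
      if (!node.children.has(segment)) {
        node.children.set(segment, { segment, page: null, children: new Map() });
      }
      node = node.children.get(segment);
    }
    node.page = page; // intermediate nodes without a crawled page keep page=null
  }
  return root;
}

// A leaf occupies one unit of width; an inner node is as wide as its children combined.
function subtreeWidth(node) {
  if (node.children.size === 0) return 1;
  let width = 0;
  for (const child of node.children.values()) width += subtreeWidth(child);
  return width;
}
```

A layout pass can then give each child a horizontal share of subtreeWidth(child) / subtreeWidth(parent) of the space available to its parent.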

Backend Architecture

Server

  • Runtime: Node.js (CommonJS)
  • Framework: Express
  • HTTP client: Axios (redirects, timeouts, status validation)
  • HTML parser: Cheerio
  • Port: 5000
  • CORS: ai-ready-check.de + llmshub.de allowed

Network

  • Main page timeout: 10 seconds
  • robots.txt timeout: 5 seconds
  • Sitemap fetch timeout: 5 seconds
  • Max redirects: 5 hops
  • Politeness delay: 200 ms between requests
  • User-Agent: Mozilla/5.0 (compatible; Crawler/1.0)

API endpoints

  • POST /api/webcrawl/start: Start crawl, returns full result
  • GET /api/webcrawl/health: Health check
  • GET /api/fetch-sitemap?url=: CORS proxy for sitemap fetching
  • Security: Origin-restricted, rate limited

Frontend

  • JavaScript: Vanilla ES6+, no framework
  • Deduplication: Client-side filter on all views and exports
  • Search: Real-time URL filter
  • Canonical filter: Toggle to show only canonical pages
  • SVG visualisation: Client-side tree generation and download
  • Stats recalculation: All stats computed from filtered pages

Known Limitations

Current limitations

  • No JavaScript execution: SPA / JS-rendered navigation not followed
  • No caching: Every crawl fetches all pages fresh
  • Sequential requests: No parallel page fetching (politeness by design)
  • PageRank approximation: linkCount used as outgoing link proxy, not actual link graph
  • No authentication: Password-protected pages not crawlable
  • No cookie handling: Session-dependent pages may return empty content
  • Memory-bound: All pages are held in memory during the crawl; sites with more than 500 pages may strain the Raspberry Pi 5's RAM
  • No incremental crawling: Full re-crawl required for updates

Technical implementation: server.js (WebsiteCrawler class) · crawler.js (frontend) | Production system running on Raspberry Pi 5 · Node.js · Port 5000