Technical documentation of the server-side web crawler. Covers architecture, URL normalisation pipeline, deduplication, internal PageRank engine, adaptive priority scoring, orphan detection and all export formats.
The Web Crawler is a lightweight but feature-rich server-side crawling engine built entirely in Node.js without browser automation. Starting from a given base URL it recursively discovers internal links, analyses each page and produces a comprehensive dataset covering SEO signals, content quality, canonical state, link structure and sitemap data.
The crawler integrates with an existing sitemap.xml if present, uses it as a seed list and enriches crawl results with original priority and lastmod values. After crawling, a multi-stage post-processing pipeline runs duplicate detection, internal PageRank calculation, orphan detection and priority scoring before the final result is returned.
Capabilities at a glance
Max pages: 250 (configurable per request)
Concurrency: Sequential with 200 ms politeness delay
Redirect handling: Up to 5 hops (301/302/303/307/308)
robots.txt: Full parsing with wildcard support; compliance is optional
Sitemap integration: Reads existing sitemap.xml as seed
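The wildcard support listed for robots.txt can be illustrated with a small matcher. This is only a sketch under assumptions: the function names are hypothetical, it handles the common `*` and `$` pattern syntax for Disallow rules, and it ignores Allow rules and rule precedence, which the crawler may treat differently.

```javascript
// Hypothetical helper: converts one robots.txt path pattern into a RegExp.
// '*' matches any character sequence, a trailing '$' anchors the end of the path.
function robotsPatternToRegex(pattern) {
  const anchored = pattern.endsWith('$');
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split('*')
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&')) // escape regex metacharacters
    .join('.*');
  return new RegExp('^' + escaped + (anchored ? '$' : ''));
}

// Returns true if any non-empty Disallow pattern matches the given path.
function isDisallowed(path, disallowPatterns) {
  return disallowPatterns.some(
    (pattern) => pattern !== '' && robotsPatternToRegex(pattern).test(path)
  );
}

console.log(isDisallowed('/private/file.html', ['/private/']));
console.log(isDisallowed('/docs/a.pdf', ['/*.pdf$']));
```

An empty Disallow value is skipped here because it conventionally means "nothing is disallowed".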
URL Normalisation Pipeline
Every URL discovered during crawling passes through a multi-step normalisation pipeline before being added to the queue. This prevents duplicate visits caused by trailing slashes, index files, tracking parameters or inconsistent query string ordering.
Normalisation steps
Trailing slash after query string: Removed (e.g. ?tid=2/ → ?tid=2)
Double slashes: Collapsed to single slash
index.php / index.html: Replaced with /
Trailing slash enforcement: Added to directory-style paths
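The four steps listed above can be sketched with the built-in URL class. The function name, the step order and the rule for what counts as a "directory-style" path (last segment without a file extension) are assumptions made for illustration, not the crawler's actual code.

```javascript
// Hypothetical helper illustrating the listed normalisation steps.
function normaliseUrl(rawUrl, baseUrl) {
  const url = baseUrl ? new URL(rawUrl, baseUrl) : new URL(rawUrl);

  // Step 1: remove a trailing slash after the query string (?tid=2/ -> ?tid=2).
  url.search = url.search.replace(/\/+$/, '');

  // Step 2: collapse double slashes in the path.
  let path = url.pathname.replace(/\/{2,}/g, '/');

  // Step 3: replace index.php / index.html with "/".
  path = path.replace(/\/index\.(php|html)$/i, '/');

  // Step 4: add a trailing slash to directory-style paths.
  // Assumption: a path is directory-style if its last segment has no dot.
  if (!path.endsWith('/') && !/\.[^/]+$/.test(path)) {
    path += '/';
  }

  url.pathname = path;
  return url.toString();
}

console.log(normaliseUrl('https://example.com//blog//index.html'));
// -> https://example.com/blog/
console.log(normaliseUrl('https://example.com/forum?tid=2/'));
// -> https://example.com/forum/?tid=2
```

Tracking-parameter removal and query string ordering are not shown because the step list does not specify how they are handled.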
Page Analysis
Each crawled page is parsed with Cheerio and a comprehensive set of signals is extracted. Navigation, header and footer elements are removed before text analysis to avoid inflating word counts with boilerplate content.
SEO signals
Title: Text content, length tracked
Meta description: Presence and content
Meta robots: noindex detection
X-Robots-Tag header: noindex / none detection
Canonical tag: Extracted, normalised, compared to page URL
H1 / H2 / H3 count: All heading levels counted
Image count: All <img> tags
Internal link count: All <a href> tags
External link count: Links to other domains
Content quality signals
Word count: Words >2 chars after nav/header/footer removal
Text/HTML ratio: Body text length ÷ raw HTML length
Last modified: Last-Modified HTTP header, falling back to article:modified_time, then dateModified
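The content quality signals can be sketched as plain functions over already extracted text. The function and parameter names are hypothetical, and the sketch assumes that nav/header/footer removal has already happened (for example with Cheerio) before `bodyText` is passed in.

```javascript
// Hypothetical helper: word count and text/HTML ratio as described above.
function contentSignals(bodyText, rawHtml) {
  // Only words longer than 2 characters are counted.
  const words = bodyText.split(/\s+/).filter((word) => word.length > 2);
  const textHtmlRatio = rawHtml.length > 0 ? bodyText.length / rawHtml.length : 0;
  return { wordCount: words.length, textHtmlRatio };
}

// Hypothetical helper: first available value in the documented fallback order.
function resolveLastModified(httpHeader, articleModifiedTime, dateModified) {
  return [httpHeader, articleModifiedTime, dateModified].find((value) => Boolean(value)) || null;
}

console.log(contentSignals('An example of a page', '<p>An example of a page</p>'));
console.log(resolveLastModified(undefined, '2024-05-01T10:00:00Z', '2024-04-01'));
```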
Sitemap Integration
Before crawling begins, the crawler fetches the existing sitemap.xml from the target domain's root. The sitemap URLs are added to the crawl queue with depth=null (meaning: present in sitemap but not yet reached via hyperlinks). This ensures that sitemap-only pages are always checked even if they are not linked from the crawled pages.
The original sitemap's priority and lastmod values are stored per URL and used later in the hybrid priority calculation. Pages found both in the sitemap and via crawling receive the BOTH source flag; those found only via crawling get CRAWLED; those only in the sitemap get a null depth and SITEMAP_ONLY treatment in the priority engine.
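A minimal sketch of the source flag assignment, assuming each page record carries a `depth` that stays null when the URL was only seeded from the sitemap, and that the original sitemap values are kept in a Map keyed by URL. Both the record shape and the function name are illustrative assumptions.

```javascript
// Hypothetical helper: derives the source flag from sitemap membership and depth.
function sourceFlag(page, sitemap) {
  const inSitemap = sitemap.has(page.url);
  const reachedByLinks = page.depth !== null; // null = never reached via hyperlinks
  if (inSitemap && reachedByLinks) return 'BOTH';
  if (inSitemap) return 'SITEMAP_ONLY';
  return 'CRAWLED';
}

// Original priority and lastmod values stored per URL (example data).
const sitemap = new Map([
  ['https://example.com/', { priority: 1.0, lastmod: '2024-01-10' }],
  ['https://example.com/landing/', { priority: 0.5, lastmod: '2023-11-02' }],
]);

const pages = [
  { url: 'https://example.com/', depth: 0 },
  { url: 'https://example.com/blog/', depth: 1 },
  { url: 'https://example.com/landing/', depth: null },
];

for (const page of pages) {
  console.log(page.url, sourceFlag(page, sitemap));
}
```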
Duplicate Detection
After all pages are crawled, a path-normalised deduplication pass runs over the full results. The normalisation key is derived from the URL pathname, lowercased and with trailing slashes stripped. The first occurrence keeps isDuplicate=false; all subsequent occurrences for the same key are marked isDuplicate=true with a duplicateOf reference.
This detects URL variants such as /blog/page1 vs /blog/page1?ref=share, or /page/ vs /page/index.html. Duplicate pages are excluded from all frontend display views, CSV/TXT/JSON exports and the generated sitemap.xml.
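The deduplication pass described above might look roughly like the following. The isDuplicate and duplicateOf fields come from the description; the function name and the use of the built-in URL class to obtain the pathname are assumptions.

```javascript
// Hypothetical sketch of the path-normalised deduplication pass.
function markDuplicates(pages) {
  const firstSeen = new Map(); // normalisation key -> URL of the first occurrence
  return pages.map((page) => {
    // Key: pathname, lowercased, trailing slashes stripped (query string ignored).
    const key = new URL(page.url).pathname.toLowerCase().replace(/\/+$/, '');
    if (firstSeen.has(key)) {
      return { ...page, isDuplicate: true, duplicateOf: firstSeen.get(key) };
    }
    firstSeen.set(key, page.url);
    return { ...page, isDuplicate: false };
  });
}

const deduped = markDuplicates([
  { url: 'https://example.com/blog/page1' },
  { url: 'https://example.com/blog/page1?ref=share' },
  { url: 'https://example.com/Blog/Page1/' },
]);
console.log(deduped.map((page) => page.isDuplicate)); // [ false, true, true ]
```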
Internal PageRank Engine
After crawling and deduplication, a simplified internal PageRank calculation runs over all crawled pages. The algorithm uses 25 iterations with a damping factor of 0.85, consistent with the standard PageRank formulation. Outgoing link counts are approximated from each page's linkCount field.
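The raw PageRank values are normalised to a 0.1–1.0 scale. The normalised value (pageRank) is then used as an additive bonus or penalty in the final priority engine, computed as (pageRank − 0.5) × 0.2: pages with pageRank=1.0 receive +0.1, pages at 0.5 receive ±0, and pages at the 0.1 floor receive −0.08. The full −0.1 would apply only at pageRank=0.0, which lies outside the normalised scale.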
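A simplified version of this calculation can be sketched as follows. The iteration count, damping factor and the use of linkCount as the outgoing link proxy come from the description; the `incomingFrom` field, the function name and the min-max scaling onto 0.1–1.0 are assumptions made for illustration.

```javascript
// Hypothetical sketch: 25 iterations, damping 0.85, linkCount as out-degree proxy.
// Assumption: each page lists the URLs that link to it in `incomingFrom`.
function internalPageRank(pages, iterations = 25, damping = 0.85) {
  const n = pages.length;
  if (n === 0) return new Map();

  const byUrl = new Map(pages.map((page) => [page.url, page]));
  let rank = new Map(pages.map((page) => [page.url, 1 / n]));

  for (let i = 0; i < iterations; i++) {
    const next = new Map();
    for (const page of pages) {
      let sum = 0;
      for (const sourceUrl of page.incomingFrom) {
        const source = byUrl.get(sourceUrl);
        if (!source) continue; // ignore links from pages outside the crawl
        sum += rank.get(sourceUrl) / Math.max(source.linkCount, 1);
      }
      next.set(page.url, (1 - damping) / n + damping * sum);
    }
    rank = next;
  }

  // Assumed normalisation: min-max scaling onto the 0.1–1.0 range.
  const values = [...rank.values()];
  const min = Math.min(...values);
  const max = Math.max(...values);
  const normalised = new Map();
  for (const [url, value] of rank) {
    // If all pages rank equally, use the neutral value 0.5 (assumption).
    normalised.set(url, max === min ? 0.5 : 0.1 + 0.9 * ((value - min) / (max - min)));
  }
  return normalised;
}

const ranks = internalPageRank([
  { url: 'home', linkCount: 2, incomingFrom: ['a', 'b'] },
  { url: 'a', linkCount: 1, incomingFrom: ['home'] },
  { url: 'b', linkCount: 1, incomingFrom: ['home'] },
]);
console.log(ranks);
```

Because linkCount is only a proxy for the real out-degree, the raw values do not necessarily sum to 1 as they would over an exact link graph.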
Adaptive Priority Engine
The final page priority (used in the generated sitemap.xml) is computed in three stages. The result is clamped to 0.1–1.0 and rounded to one decimal place. The root URL always receives priority 1.0.
Stage 1 – Depth-based base priority
Depth 0 (root): 1.0
Depth 1: 0.8
Depth 2: 0.6
Depth 3: 0.5
Depth 4+: 0.3
Sitemap-only (depth=null): Median site depth used
Stage 2 – Hybrid weighting
BOTH (crawled + sitemap): 70% crawler + 30% original priority
SITEMAP_ONLY: 50% crawler + 50% original priority
CRAWLED only: 100% crawler priority
Stage 3 – Final priority signals
PageRank bonus: (pageRank − 0.5) × 0.2 → at most ±0.1
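The three stages can be combined in a short sketch. The depth table, weights, bonus formula, clamp and rounding follow the description above; the page record shape (`depth`, `source`, `sitemapPriority`, `pageRank`), the rounding of the median depth, treating depth 0 as the root and the fallback when no original priority exists are assumptions.

```javascript
const DEPTH_PRIORITY = [1.0, 0.8, 0.6, 0.5]; // depth 0..3, deeper pages get 0.3
const CRAWLER_WEIGHT = { BOTH: 0.7, SITEMAP_ONLY: 0.5, CRAWLED: 1.0 };

// Stage 1: depth-based base priority; sitemap-only pages use the median site depth.
function basePriority(depth, medianDepth) {
  const effectiveDepth = depth === null ? Math.round(medianDepth) : depth;
  return effectiveDepth < DEPTH_PRIORITY.length ? DEPTH_PRIORITY[effectiveDepth] : 0.3;
}

function finalPriority(page, medianDepth) {
  if (page.depth === 0) return 1.0; // assumption: depth 0 identifies the root URL

  const crawler = basePriority(page.depth, medianDepth);

  // Stage 2: hybrid weighting of crawler priority and original sitemap priority.
  // Assumption: without an original priority, the crawler value is used for both parts.
  const original = page.sitemapPriority ?? crawler;
  const weight = CRAWLER_WEIGHT[page.source];
  const hybrid = weight * crawler + (1 - weight) * original;

  // Stage 3: PageRank bonus.
  const bonus = (page.pageRank - 0.5) * 0.2;

  // Clamp to 0.1–1.0 and round to one decimal place.
  const clamped = Math.min(1.0, Math.max(0.1, hybrid + bonus));
  return Math.round(clamped * 10) / 10;
}

console.log(finalPriority({ depth: 1, source: 'BOTH', sitemapPriority: 0.5, pageRank: 1.0 }, 2)); // 0.8
console.log(finalPriority({ depth: 5, source: 'CRAWLED', pageRank: 0.1 }, 2)); // 0.2
```

In the first call the hybrid value is 0.7 × 0.8 + 0.3 × 0.5 = 0.71, the bonus adds 0.1, and 0.81 rounds to 0.8. Orphan penalties are not part of this sketch.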
Orphan Detection
After the PageRank pass, every page is classified for link isolation. Three categories are distinguished to avoid false positives for pages that are legitimately only in the sitemap.
True orphan
Condition: 0 incoming links, not in original sitemap, no crawled parent path
Priority penalty: −0.2
Severity: Critical – page may be inaccessible
Sitemap orphan
Condition: 0 incoming links but present in original sitemap
Priority penalty: −0.1
Severity: Warning – page exists but internal linking missing
Near-orphan
Condition: Exactly 1 incoming link AND depth >2
Priority penalty: −0.1
Severity: Notice – weakly linked, may be hard to discover
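The three categories translate into a small classification function. The conditions and penalties are taken from the lists above; the field names (`incomingLinks`, `inSitemap`, `hasCrawledParent`) and the type labels are hypothetical.

```javascript
// Hypothetical sketch of the orphan classification.
function classifyOrphan(page) {
  const { incomingLinks, inSitemap, hasCrawledParent, depth } = page;

  if (incomingLinks === 0 && !inSitemap && !hasCrawledParent) {
    return { type: 'TRUE_ORPHAN', penalty: -0.2, severity: 'critical' };
  }
  if (incomingLinks === 0 && inSitemap) {
    return { type: 'SITEMAP_ORPHAN', penalty: -0.1, severity: 'warning' };
  }
  if (incomingLinks === 1 && depth !== null && depth > 2) {
    return { type: 'NEAR_ORPHAN', penalty: -0.1, severity: 'notice' };
  }
  return { type: null, penalty: 0, severity: null };
}

console.log(classifyOrphan({ incomingLinks: 0, inSitemap: true, hasCrawledParent: false, depth: null }));
console.log(classifyOrphan({ incomingLinks: 1, inSitemap: false, hasCrawledParent: true, depth: 3 }));
```

A page with no incoming links that is absent from the sitemap but has a crawled parent path falls through to the unclassified case, which follows from the stated true-orphan condition.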
Legal Page Protection
Pages matching known legal/trust URL patterns (impressum, imprint, datenschutz, privacy, terms, agb, contact, about, kontakt, ueber-uns) are never filtered or penalised. Their priority floor is 0.3, changefreq is forced to yearly, and their qualityScore is clamped to a minimum of 0 regardless of thin content signals. This ensures trust signals for search engines are never accidentally removed from generated sitemaps.
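A minimal sketch of this protection, using the patterns named above. Matching by substring on the lowercased path is an assumption, as are the function names and the page record shape; the real matcher may be stricter.

```javascript
const LEGAL_PATTERNS = [
  'impressum', 'imprint', 'datenschutz', 'privacy', 'terms',
  'agb', 'contact', 'about', 'kontakt', 'ueber-uns',
];

// Assumption: a page counts as legal if its path contains one of the patterns.
function isLegalPage(url) {
  const path = new URL(url).pathname.toLowerCase();
  return LEGAL_PATTERNS.some((pattern) => path.includes(pattern));
}

// Applies the priority floor, forced changefreq and qualityScore minimum.
function protectLegalPage(page) {
  if (!isLegalPage(page.url)) return page;
  return {
    ...page,
    priority: Math.max(page.priority, 0.3),
    changefreq: 'yearly',
    qualityScore: Math.max(page.qualityScore, 0),
  };
}

console.log(protectLegalPage({
  url: 'https://example.com/impressum/',
  priority: 0.1,
  changefreq: 'weekly',
  qualityScore: -3,
}));
```

Plain substring matching can over-match (for example "agb" inside an unrelated path segment), so a production matcher would probably compare whole path segments.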
Export Formats
All exports apply the same filter: noindex pages and pages marked isDuplicate are excluded. The totalPages count in the JSON export reflects the filtered set.
JSON
Content: Full crawl dataset including all analysis fields
totalPages: Reflects deduplicated count
Use case: Integration, further processing, archiving
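The export filter can be sketched as below. Only totalPages and isDuplicate are named in the description; the `noindex` flag, the `pages` key and the function name are assumptions.

```javascript
// Hypothetical sketch: filter noindex and duplicate pages before exporting.
function buildJsonExport(allPages) {
  const exported = allPages.filter((page) => !page.noindex && !page.isDuplicate);
  return {
    totalPages: exported.length, // count of the filtered set, not of all crawled pages
    pages: exported,
  };
}

const exportData = buildJsonExport([
  { url: 'https://example.com/', noindex: false, isDuplicate: false },
  { url: 'https://example.com/?ref=share', noindex: false, isDuplicate: true },
  { url: 'https://example.com/internal/', noindex: true, isDuplicate: false },
]);
console.log(JSON.stringify(exportData, null, 2));
```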
Site Structure Visualisation
After a completed crawl, the frontend can generate an SVG tree diagram of the entire site structure. The tree is built from all crawled pages, grouped by URL path segments into a hierarchical node structure. Layout is computed with a recursive space-partitioning algorithm that distributes nodes proportionally based on subtree width.
The SVG is generated entirely client-side without external libraries and can be downloaded as a standalone file. Node colours indicate depth: root in red, first-level categories in purple, deeper levels in blue. Each node shows page title, path, H1/link/image counts and a child-count badge.
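The tree construction and proportional layout might be approximated like this. It is a sketch under assumptions: node shape, function names and the definition of subtree width as the number of leaves are illustrative, and SVG rendering is left out.

```javascript
// Groups URLs by path segment into a hierarchical node structure.
function buildTree(urls) {
  const root = { segment: '/', children: new Map() };
  for (const url of urls) {
    const segments = new URL(url).pathname.split('/').filter(Boolean);
    let node = root;
    for (const segment of segments) {
      if (!node.children.has(segment)) {
        node.children.set(segment, { segment, children: new Map() });
      }
      node = node.children.get(segment);
    }
    node.url = url;
  }
  return root;
}

// Assumption: subtree width = number of leaf nodes below (or at) the node.
function subtreeWidth(node) {
  if (node.children.size === 0) return 1;
  let width = 0;
  for (const child of node.children.values()) width += subtreeWidth(child);
  return width;
}

// Recursively partitions the interval [left, right] among the children,
// proportionally to their subtree width; each node sits in the middle of its interval.
function layoutTree(node, left, right, depth = 0, out = []) {
  out.push({ segment: node.segment, x: (left + right) / 2, depth });
  const total = subtreeWidth(node);
  let cursor = left;
  for (const child of node.children.values()) {
    const share = ((right - left) * subtreeWidth(child)) / total;
    layoutTree(child, cursor, cursor + share, depth + 1, out);
    cursor += share;
  }
  return out;
}

const tree = buildTree([
  'https://example.com/',
  'https://example.com/blog/',
  'https://example.com/blog/a',
  'https://example.com/blog/b',
  'https://example.com/about',
]);
console.log(layoutTree(tree, 0, 300));
```

With a canvas width of 300, the "blog" branch holds two of the three leaves and therefore receives two thirds of the horizontal space.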
Backend Architecture
Server
Runtime: Node.js (CommonJS)
Framework: Express
HTTP client: Axios (redirects, timeouts, status validation)
HTML parser: Cheerio
Port: 5000
CORS: ai-ready-check.de + llmshub.de allowed
Network
Main page timeout: 10 seconds
robots.txt timeout: 5 seconds
Sitemap fetch timeout: 5 seconds
Max redirects: 5 hops
Politeness delay: 200 ms between requests
User-Agent: Mozilla/5.0 (compatible; Crawler/1.0)
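The network settings above could be expressed as a request options object plus a sequential loop with the politeness delay. The option keys follow Axios request config naming (timeout, maxRedirects, headers); the constant names, the injected `fetchPage` callback and the loop itself are assumptions, not the crawler's actual code.

```javascript
// Request options for main page fetches, mirroring the listed values.
const PAGE_REQUEST_OPTIONS = {
  timeout: 10000, // 10 seconds
  maxRedirects: 5,
  headers: { 'User-Agent': 'Mozilla/5.0 (compatible; Crawler/1.0)' },
};
const POLITENESS_DELAY_MS = 200;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetches URLs one at a time, waiting between requests.
// `fetchPage` is injected so the sketch stays independent of the HTTP client.
async function crawlSequentially(urls, fetchPage, delayMs = POLITENESS_DELAY_MS) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs); // no delay before the first request
    results.push(await fetchPage(urls[i], PAGE_REQUEST_OPTIONS));
  }
  return results;
}
```

With Axios, `fetchPage` could be as simple as `(url, options) => axios.get(url, options)`.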
API endpoints
POST /api/webcrawl/start: Start crawl, returns full result
GET /api/webcrawl/health: Health check
GET /api/fetch-sitemap?url=: CORS proxy for sitemap fetching
Security: Origin-restricted, rate limited
Frontend
JavaScript: Vanilla ES6+, no framework
Deduplication: Client-side filter on all views and exports
Search: Real-time URL filter
Canonical filter: Toggle to show only canonical pages
SVG visualisation: Client-side tree generation and download
Stats recalculation: All stats computed from filtered pages
Known Limitations
Current limitations
No JavaScript execution: SPA / JS-rendered navigation not followed
No caching: Every crawl fetches all pages fresh
Sequential requests: No parallel page fetching (politeness by design)
PageRank approximation: linkCount used as outgoing link proxy, not actual link graph
No authentication: Password-protected pages not crawlable
No cookie handling: Session-dependent pages may return empty content
Memory-bound: All pages are held in memory during a crawl, so large sites (more than 500 pages) may strain the Raspberry Pi 5's RAM
No incremental crawling: Full re-crawl required for updates
Technical implementation: server.js (WebsiteCrawler class) · crawler.js (frontend)
Production system running on Raspberry Pi 5 · Node.js · Port 5000