Challenge Framing
Prime Lands content lived across dynamic, JavaScript-rendered pages with no retrieval-ready structure, which made answer quality brittle and hard to ground with generic prompting alone.
Grounded domain intelligence for a high-context real estate search experience.
This project sits at the intersection of web data acquisition, retrieval design, and answer trust. The system was designed to make public real estate content queryable without losing source traceability or domain precision.
Overview
Built a domain-specific intelligence workflow that crawls dynamic real estate pages, converts them into structured corpora, indexes multiple chunk variants, and serves evidence-backed answers through RAG, CAG, and CRAG service paths.
Approach
I built a config-driven ingestion-to-answering pipeline with browser-based crawling, multi-strategy chunking, vector indexing, semantic caching, and confidence-triggered corrective retrieval.
Project Highlights
Retrieval engineering, context design, semantic caching, corrective retrieval, and experimentation-oriented AI system architecture.
Core Stack
Key Features
Playwright-based crawler captures JavaScript-rendered property content with domain filtering and polite traversal.
Multiple chunking strategies stay first-class so retrieval quality can be tuned by corpus behavior instead of assumption.
LCEL pipelines combine retriever, formatter, prompt, and model layers to keep answers evidence-backed.
CRAG expands search breadth when heuristics indicate low-confidence context.
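The crawl-scoping idea behind the first feature can be sketched in plain Python. This is a minimal illustration, not the project's crawler: `in_scope`, `plan_crawl`, the `get_links` callable, the `delay_s` pause, and the `example.com` host are all hypothetical names chosen for the example, and the real system renders pages with Playwright rather than a lookup function.

```python
import time
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def in_scope(url: str, allowed_host: str) -> bool:
    """True when the URL is http(s) and its host is allowed_host or a subdomain of it."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return parsed.scheme in {"http", "https"} and (
        host == allowed_host or host.endswith("." + allowed_host)
    )


def plan_crawl(start_url, get_links, allowed_host, max_depth=2, delay_s=0.0):
    """Breadth-first traversal returning (url, depth) pairs for in-scope pages.

    get_links(url) is a hypothetical callable that returns the hrefs found on
    a rendered page; a real crawler would wrap a browser session here.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # keep the page, but do not follow its links
        time.sleep(delay_s)  # assumed politeness pause between page loads
        for href in get_links(url):
            # Resolve relative links and drop #fragments to avoid revisits.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen and in_scope(absolute, allowed_host):
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited
```

Recording the depth next to each URL keeps that metadata available for the later conversion and chunking steps.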
Each layer stays explicit so reviewers can quickly understand where ingestion, orchestration, persistence, and model-serving responsibilities live.
Browser automation captures rendered pages and converts them into structured markdown and JSONL artifacts.
Chunk variants and metadata enrich the corpus before embedding and persistence.
RAG, CAG, and CRAG services coordinate retrieval, cache reuse, confidence checks, and answer generation.
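To make the JSONL artifact layer concrete, here is a small sketch of one possible record shape. The `PageRecord` fields and the helper names are assumptions for illustration; the source only states that rendered pages become markdown and JSONL with metadata such as source URL and depth.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PageRecord:
    # Hypothetical schema: one crawled page per JSONL line.
    url: str
    depth: int
    title: str
    markdown: str


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "".join(json.dumps(asdict(r), ensure_ascii=False) + "\n" for r in records)


def from_jsonl(text):
    """Parse JSONL back into PageRecord objects, skipping blank lines."""
    return [PageRecord(**json.loads(line)) for line in text.split("\n") if line.strip()]
```

Because JSON escapes the newlines inside the markdown body, each page stays on a single line and the file can be streamed record by record.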
The pipeline section keeps the most important engineering steps visible without collapsing them into generic bullet lists.
Collect rendered pages, remove noisy DOM sections, and preserve useful metadata such as source URL and depth.
Convert HTML into markdown and JSONL so the corpus is readable to both developers and downstream chunkers.
Generate retrieval units across five chunking strategies and embed them into a Qdrant collection.
Route queries through standard RAG, semantic cache lookup, or corrective retrieval depending on context quality.
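The chunking step can be illustrated with two of the five strategies. The sketch below covers only fixed and sliding windows over whitespace-split words and tags every unit with its strategy. The function names, the word-based sizing, and the default sizes are assumptions, and the embedding and Qdrant upsert are deliberately left out.

```python
def fixed_chunks(words, size):
    """Non-overlapping windows of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def sliding_chunks(words, size, overlap):
    """Windows of `size` words where consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def build_units(text, source_url, size=50, overlap=10):
    """Return retrieval units for both strategies, each carrying strategy metadata."""
    words = text.split()
    variants = (
        ("fixed", fixed_chunks(words, size)),
        ("sliding", sliding_chunks(words, size, overlap)),
    )
    units = []
    for strategy, pieces in variants:
        for index, piece in enumerate(pieces):
            units.append({
                "strategy": strategy,
                "index": index,
                "source_url": source_url,
                "text": " ".join(piece),
            })
    return units
```

Keeping the strategy name on every unit is what allows retrieval runs to be compared per strategy later, rather than committing to one segmentation up front.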
This timeline keeps the implementation story concise: what was framed first, what was hardened next, and what ultimately made the project production-ready.
Defined crawling boundaries, content extraction rules, and artifact formats for dynamic real estate pages.
Implemented multiple chunking patterns to evaluate recall, context richness, and retrieval footprint.
Layered semantic caching and CRAG on top of the base RAG service to improve latency and recovery from weak context.
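The semantic caching layer can be sketched as a nearest-neighbor lookup over question embeddings. Everything here is illustrative: `SemanticCache`, the 0.9 similarity threshold, and the bag-of-words `toy_embed` over a tiny fixed vocabulary are assumptions standing in for a real embedding model and the project's actual cache.

```python
import math

# Toy vocabulary so the example runs without an embedding model.
VOCAB = ["price", "location", "payment", "plan", "land"]


def toy_embed(text):
    """Count vocabulary terms; a stand-in for a real embedding function."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (question_vector, answer)

    def put(self, question, answer):
        self.entries.append((self.embed(question), answer))

    def get(self, question):
        """Return the cached answer of the most similar question, or None on a miss."""
        if not self.entries:
            return None
        query = self.embed(question)
        score, answer = max(
            ((cosine(query, vec), ans) for vec, ans in self.entries),
            key=lambda pair: pair[0],
        )
        return answer if score >= self.threshold else None
```

Seeding such a cache with known FAQ questions means a reworded version of a seeded question can be answered without a retrieval or generation call, while anything below the threshold falls through to the normal RAG path.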
Each challenge below is tied to a concrete design choice and a specific outcome.
Challenge
Solution
Used Playwright rendering waits, traversal rules, and content cleanup before conversion.
Outcome
Created a stable corpus instead of brittle static scrape output.
Challenge
Solution
Stored multiple chunk representations with strategy metadata rather than forcing one segmentation approach.
Outcome
Made retrieval tuning evidence-driven instead of guess-driven.
Challenge
Solution
Added a CRAG confidence heuristic that expands retrieval when overlap, richness, and diversity are weak.
Outcome
Improved robustness without introducing a separate reranking service.
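One way to read the CRAG confidence heuristic is as a weighted blend of three signals. The definitions below are assumptions, since the source names overlap, richness, and diversity but does not define them: overlap is the share of query terms found in the retrieved text, richness is the retrieved word count relative to a target length, and diversity is the share of distinct source URLs. The weights and `target_words` are placeholders too.

```python
def confidence(query, chunks, target_words=200, weights=(0.5, 0.3, 0.2)):
    """Blend overlap, richness, and diversity into a score in [0, 1].

    Each chunk is a dict with "text" and "source_url" keys (assumed shape).
    """
    if not chunks:
        return 0.0
    query_terms = set(query.lower().split())
    context_terms = set()
    total_words = 0
    for chunk in chunks:
        words = chunk["text"].lower().split()
        context_terms.update(words)
        total_words += len(words)
    # Overlap: how many of the query's terms appear anywhere in the context.
    overlap = len(query_terms & context_terms) / len(query_terms) if query_terms else 0.0
    # Richness: amount of retrieved text relative to a target, capped at 1.
    richness = min(1.0, total_words / target_words)
    # Diversity: fraction of chunks that come from distinct sources.
    diversity = len({c["source_url"] for c in chunks}) / len(chunks)
    w_overlap, w_richness, w_diversity = weights
    return w_overlap * overlap + w_richness * richness + w_diversity * diversity
```

Because the score uses only the query and the retrieved chunks, it can run inline in the service without a separate reranking model.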
The emphasis here is signal, not decoration: key numbers, verifiable outcomes, and the context needed to interpret them responsibly.
Retrieval Units: 691. Persisted across chunk artifacts for repeatable evaluation.
Chunk Strategies: 5. Semantic, fixed, sliding, parent-child, and late chunking.
FAQ Cache Seeds: 48. Preloaded prompts for low-latency domain responses.
Adaptive Retrieval: 4 -> 8. CRAG expands top-k when confidence falls below threshold.
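The 4 -> 8 expansion can be sketched as a single retry with a wider top-k. The `search` and `score` callables, the function name, and the 0.6 threshold are hypothetical; only the base and expanded k values come from the figures above.

```python
def retrieve_adaptive(query, search, score, base_k=4, expanded_k=8, threshold=0.6):
    """Retrieve with base_k first and widen to expanded_k when confidence is low.

    search(query, k) returns a list of chunks; score(query, chunks) returns a
    confidence in [0, 1]. Both are supplied by the caller (assumed interfaces).
    Returns (chunks, k_used).
    """
    chunks = search(query, base_k)
    if score(query, chunks) >= threshold:
        return chunks, base_k
    # Low confidence: run one broader search instead of answering on weak context.
    return search(query, expanded_k), expanded_k
```

A single bounded expansion keeps worst-case latency predictable: at most two searches per query.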
Key Results
Research + Business Impact
Shows how fragmented property content can become an assistant-ready experience without losing citation quality.
Demonstrates retrieval experimentation, config-driven orchestration, and a clean path from prototype to service packaging.