Module 01

Grounded domain intelligence for a high-context real estate search experience.

Prime Lands Real Estate Intelligence Platform

This project sits at the intersection of web data acquisition, retrieval design, and answer trust. The system was designed to make public real estate content queryable without losing source traceability or domain precision.

Overview

Built a domain-specific intelligence workflow that crawls dynamic real estate pages, converts them into structured corpora, indexes multiple chunk variants, and serves evidence-backed answers through RAG, CAG, and CRAG service paths.

Problem

Prime Lands content lived across dynamic, JavaScript-rendered pages with no retrieval-ready structure, which made answer quality brittle and hard to ground with generic prompting alone.

Approach

I built a config-driven ingestion-to-answering pipeline with browser-based crawling, multi-strategy chunking, vector indexing, semantic caching, and confidence-triggered corrective retrieval.

Project Overview

A modern engineering case study, structured for both recruiters and builders

Project Highlights

  • Five chunking strategies to compare retrieval coverage and context preservation.
  • Evidence-aware answer generation with source URL traceability.
  • Confidence-triggered CRAG flow when first-pass retrieval is weak.
  • Semantic FAQ and history cache to reduce repeated inference cost.

Tech Stack

Built with tools chosen for reliability and iteration speed

The stack supports retrieval engineering, context design, semantic caching, corrective retrieval, and experiment-oriented AI system architecture.

Core Stack

  • Python
  • LangChain
  • Qdrant
  • Playwright
  • OpenAI
  • Supabase

Key Features

Dynamic site ingestion

A Playwright-based crawler captures JavaScript-rendered property content with domain filtering and polite traversal.
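
The traversal rules can be illustrated with a minimal sketch. It covers only the domain filter and the depth-limited frontier, using the standard library. The names `in_domain`, `crawl`, and `fetch_links` are hypothetical; in the real crawler the link-fetching step would presumably be backed by a Playwright page render, which is stubbed out here as a callable.

```python
import time
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlparse


def in_domain(url: str, allowed_host: str) -> bool:
    """True when the URL is http(s) and its host is the allowed host or a subdomain."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    return parts.scheme in ("http", "https") and (
        host == allowed_host or host.endswith("." + allowed_host)
    )


def crawl(
    seed: str,
    allowed_host: str,
    fetch_links: Callable[[str], Iterable[str]],  # assumed: returns hrefs found on a page
    max_depth: int = 2,
    delay_s: float = 0.0,
) -> list[tuple[str, int]]:
    """Breadth-first traversal returning (url, depth) pairs in visit order."""
    seen = {seed}
    queue = deque([(seed, 0)])
    visited: list[tuple[str, int]] = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # record the page but do not follow its links
        for href in fetch_links(url):
            # Resolve relative links and drop #fragments so pages are not revisited.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen and in_domain(link, allowed_host):
                seen.add(link)
                queue.append((link, depth + 1))
        time.sleep(delay_s)  # polite pause between page fetches
    return visited
```

Keeping the fetch step behind a callable lets the traversal logic be tested against an in-memory link map without launching a browser.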

Retrieval experimentation

Multiple chunking strategies stay first-class so retrieval quality can be tuned by corpus behavior instead of assumption.
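
As an illustration of two of the simpler strategies, the sketch below implements fixed-size and sliding-window chunking over whitespace-separated words. This is an assumption made for brevity: the project lists tiktoken in its stack, so the real chunkers likely count tokens rather than words. The function names and sizes are hypothetical.

```python
def fixed_chunks(words: list[str], size: int) -> list[list[str]]:
    """Split words into consecutive, non-overlapping windows of at most `size` words."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [words[i:i + size] for i in range(0, len(words), size)]


def sliding_chunks(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split words into windows of `size` words that share `overlap` words with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```

Running both over the same page and comparing chunk counts and boundary behavior is one simple way to see how overlap trades index size for context preservation.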

Grounded answer generation

LCEL pipelines combine retriever, formatter, prompt, and model layers to keep answers evidence-backed.
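
The formatter and prompt steps can be sketched in plain Python. This is not the LCEL API. It is a stand-in that shows how retrieved chunks might be numbered and paired with their source URLs before reaching the model. The `Doc` class, the function names, and the prompt wording are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    source_url: str


def format_evidence(docs: list[Doc]) -> str:
    """Render retrieved chunks as numbered evidence blocks with their source URLs."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        blocks.append(f"[{i}] {doc.text.strip()}\nSource: {doc.source_url}")
    return "\n\n".join(blocks)


def build_prompt(question: str, docs: list[Doc]) -> str:
    """Combine instructions, formatted evidence, and the question into one prompt string."""
    return (
        "Answer using only the numbered evidence below and cite sources as [n].\n"
        "If the evidence is insufficient, say so.\n\n"
        f"Evidence:\n{format_evidence(docs)}\n\n"
        f"Question: {question}"
    )
```

Numbering each block gives the model a stable handle for citations, and keeping the URL next to the text means the answer can always be traced back to a page.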

Corrective retrieval

CRAG expands search breadth when heuristics indicate low-confidence context.

Architecture

System architecture designed as a readable engineering story

Each layer stays explicit so reviewers can quickly understand where ingestion, orchestration, persistence, and model-serving responsibilities live.

01

Acquisition

Browser automation captures rendered pages and converts them into structured markdown and JSONL artifacts.

Playwright, BeautifulSoup4, markdownify

02

Knowledge Layer

Chunk variants and metadata enrich the corpus before embedding and persistence.

PyYAML, tiktoken, Qdrant

03

Answering Layer

RAG, CAG, and CRAG services coordinate retrieval, cache reuse, confidence checks, and answer generation.

LangChain Core, LCEL, OpenAI, OpenRouter

AI Pipeline

Pipeline stages broken down as a readable execution path

The pipeline section keeps the most important engineering steps visible without collapsing them into generic bullet lists.

01

Crawl

Collect rendered pages, remove noisy DOM sections, and preserve useful metadata such as source URL and depth.

Playwright, BeautifulSoup4

02

Structure

Convert HTML into markdown and JSONL so the corpus is readable to both developers and downstream chunkers.
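
A JSONL artifact stores one page record per line, which keeps the corpus easy to stream and diff. The sketch below shows a minimal writer and reader using only the standard library. The record fields are left to the caller: keys such as a source URL, crawl depth, and markdown body would fit the metadata described above, but the actual schema is an assumption.

```python
import json
from typing import Iterable, Iterator, TextIO


def write_jsonl(records: Iterable[dict], fh: TextIO) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    for record in records:
        # ensure_ascii=False keeps non-ASCII page text readable in the file.
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


def read_jsonl(fh: TextIO) -> Iterator[dict]:
    """Yield one record per non-empty line."""
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)
```

Because each line is independent, downstream chunkers can process pages one at a time without loading the whole corpus into memory.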

markdownify, JSONL

03

Chunk + Embed

Generate retrieval units across five chunking strategies and embed them into a Qdrant collection.

Qdrant, OpenAI embeddings, PyYAML

04

Serve

Route queries through standard RAG, semantic cache lookup, or corrective retrieval depending on context quality.

LCEL, OpenAI, OpenRouter

Timeline

A case-study flow that explains how the system took shape

This timeline keeps the implementation story concise: what was framed first, what was hardened next, and what ultimately made the project production-ready.

Phase 01

Domain ingestion design

Defined crawling boundaries, content extraction rules, and artifact formats for dynamic real estate pages.

Phase 02

Retrieval strategy comparison

Implemented multiple chunking patterns to evaluate recall, context richness, and retrieval footprint.

Phase 03

Adaptive answering

Layered semantic caching and CRAG on top of the base RAG service to improve latency and recovery from weak context.
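
The caching layer can be sketched as a nearest-neighbor lookup over stored question embeddings. The class below is a minimal in-memory illustration, not the project's implementation: the `SemanticCache` name, the `embed` callable, and the 0.9 similarity threshold are all assumptions, and a real deployment would likely keep vectors in a store such as Qdrant instead of a Python list.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Return a stored answer when a new question is close enough to a cached one."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self.embed = embed          # assumed: maps text to a fixed-length vector
        self.threshold = threshold  # assumed cut-off; tune per corpus
        self.entries: list[tuple[Vector, str]] = []

    def put(self, question: str, answer: str) -> None:
        self.entries.append((self.embed(question), answer))

    def get(self, question: str) -> Optional[str]:
        query_vec = self.embed(question)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        # A miss returns None so the caller can fall through to retrieval.
        return best_answer if best_score >= self.threshold else None
```

Seeding the cache with FAQ pairs at startup and adding answered questions over time gives the two layers described here, FAQ and history, from the same structure.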

Challenges

Technical constraints, decisions, and the reasoning behind them

This section is intentionally recruiter-friendly and engineer-friendly at the same time: each challenge is tied to a concrete design choice and a specific outcome.

Challenge

Crawling a JavaScript-rendered website without losing visible content.

Solution

Used Playwright rendering waits, traversal rules, and content cleanup before conversion.

Outcome

Created a stable corpus instead of brittle static scrape output.

Challenge

Preserving retrieval quality across heterogeneous page shapes.

Solution

Stored multiple chunk representations with strategy metadata rather than forcing one segmentation approach.
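
To show what a chunk record with strategy metadata might look like, the sketch below builds parent-child records: small child chunks are what get matched at query time, and each one points back to a larger parent span that supplies context. The field names, ID format, and word-based sizes are assumptions, not the project's schema.

```python
def parent_child_records(
    text: str,
    source_url: str,
    parent_size: int = 200,
    child_size: int = 50,
) -> list[dict]:
    """Split text into parent spans, then split each parent into child chunks."""
    words = text.split()
    records = []
    for p_idx, p_start in enumerate(range(0, len(words), parent_size)):
        parent_words = words[p_start:p_start + parent_size]
        parent_id = f"{source_url}#parent-{p_idx}"
        for c_idx, c_start in enumerate(range(0, len(parent_words), child_size)):
            records.append({
                "id": f"{parent_id}/child-{c_idx}",
                "strategy": "parent_child",  # lets one collection hold several strategies
                "source_url": source_url,
                "parent_id": parent_id,
                "text": " ".join(parent_words[c_start:c_start + child_size]),
                "parent_text": " ".join(parent_words),
            })
    return records
```

Tagging every record with its strategy means retrieval can be filtered or compared per strategy without maintaining separate indexes.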

Outcome

Made retrieval tuning evidence-driven instead of guess-driven.

Challenge

Handling weak first-pass retrieval on ambiguous queries.

Solution

Added a CRAG confidence heuristic that expands retrieval when overlap, richness, and diversity are weak.
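
A minimal sketch of such a heuristic follows. It scores the first-pass results on the three signals named above and widens retrieval from 4 to 8 results when the score is low. The exact signal definitions, the equal weighting, the 0.6 threshold, and the function names are assumptions. Only the three signal names and the 4 to 8 expansion come from the project description.

```python
from typing import Callable

Doc = dict  # assumed shape: {"text": str, "source_url": str}


def confidence(query: str, docs: list[Doc], min_words: int = 40) -> float:
    """Average of three signals in [0, 1]: term overlap, content richness, source diversity."""
    if not docs:
        return 0.0
    query_terms = set(query.lower().split())
    doc_terms = set(" ".join(d["text"] for d in docs).lower().split())
    # Overlap: share of query terms that appear anywhere in the retrieved text.
    overlap = len(query_terms & doc_terms) / len(query_terms) if query_terms else 0.0
    # Richness: average chunk length relative to a minimum useful length.
    avg_words = sum(len(d["text"].split()) for d in docs) / len(docs)
    richness = min(1.0, avg_words / min_words)
    # Diversity: share of results that come from distinct source pages.
    diversity = len({d["source_url"] for d in docs}) / len(docs)
    return (overlap + richness + diversity) / 3


def corrective_retrieve(
    query: str,
    retrieve: Callable[[str, int], list[Doc]],  # assumed: (query, k) -> top-k docs
    k: int = 4,
    expanded_k: int = 8,
    threshold: float = 0.6,
) -> tuple[list[Doc], bool]:
    """Return (docs, expanded) where expanded is True if the wider search was used."""
    docs = retrieve(query, k)
    if confidence(query, docs) >= threshold:
        return docs, False
    return retrieve(query, expanded_k), True
```

Because the check uses only cheap string and set operations, it adds little latency to the first pass, which is consistent with avoiding a separate reranking service.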

Outcome

Improved robustness without introducing a separate reranking service.

Results

Metrics and outcomes presented for quick technical review

The emphasis here is signal, not decoration: key numbers, verifiable outcomes, and the context needed to interpret them responsibly.

Retrieval Units

691

Persisted across chunk artifacts for repeatable evaluation.

Chunk Strategies

5

Semantic, fixed, sliding, parent-child, and late chunking.

FAQ Cache Seeds

48

Preloaded prompts for low-latency domain responses.

Adaptive Retrieval

4 -> 8

CRAG expands top-k when confidence falls below threshold.

Key Results

  • Built an end-to-end domain assistant workflow from crawl to grounded answer generation.
  • Persisted 691 retrieval units and enabled repeatable chunk strategy evaluation.
  • Reduced repeated query cost with semantic FAQ and history cache layers.
  • Improved answer trust through evidence URL formatting and source-backed outputs.

Research + Business Impact

Business value

Shows how fragmented property content can become an assistant-ready experience without losing citation quality.

Engineering value

Demonstrates retrieval experimentation, config-driven orchestration, and a clean path from prototype to service packaging.