Prime Lands Real Estate Intelligence Platform

This project sits at the intersection of web data acquisition, retrieval design, and answer trust. The system was designed to make public real estate content queryable without losing source traceability or domain precision.

Project Overview

What I built, the problem, and the solution

Built a domain-specific intelligence workflow that crawls dynamic real estate pages, converts them into structured corpora, indexes multiple chunk variants, and serves evidence-backed answers through three service paths: retrieval-augmented generation (RAG), cache-augmented generation (CAG), and corrective RAG (CRAG).

Challenge Framing

Prime Lands content lived across dynamic, JavaScript-rendered pages with no retrieval-ready structure, which made answer quality brittle and hard to ground with generic prompting alone.

Solution Strategy

I built a config-driven ingestion-to-answering pipeline with browser-based crawling, multi-strategy chunking, vector indexing, semantic caching, and confidence-triggered corrective retrieval.

Project Highlights

Five chunking strategies to compare retrieval coverage and context preservation.
Evidence-aware answer generation with source URL traceability.
Confidence-triggered CRAG flow when first-pass retrieval is weak.
Semantic FAQ and history cache to reduce repeated inference cost.

Gallery

Product screens and workflow snapshots

Selected screens show the visible product experience and the operational surfaces behind each project.

Crawler and corpus stage

Browser-based ingestion transforms rendered site pages into markdown and JSONL artifacts ready for indexing.

Retrieval strategy matrix

The corpus is indexed through multiple chunk profiles so recall and context fidelity can be compared directly.

Adaptive answering flow

Queries move across cache, retriever, and corrective stages based on similarity and confidence signals.

Tech Stack

Built with tools chosen for reliability and iteration speed

Retrieval engineering, context design, semantic caching, corrective retrieval, and experimentation-oriented AI system architecture.

AI / ML

Python
LangChain
OpenAI

Data

Qdrant
Supabase

DevOps

Playwright

Key Features

Dynamic site ingestion

Playwright-based crawler captures JavaScript-rendered property content with domain filtering and polite traversal.

Retrieval experimentation

Multiple chunking strategies stay first-class, so retrieval quality can be tuned against observed corpus behavior rather than assumption.

Grounded answer generation

LCEL pipelines combine retriever, formatter, prompt, and model layers to keep answers evidence-backed.
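The formatter and prompt steps of such a chain can be illustrated in plain Python. This is a minimal sketch, not the project's LCEL code: the function names, the `text` and `source_url` keys, and the prompt wording are all assumptions made for illustration.

```python
def format_evidence(docs: list[dict]) -> str:
    """Number each retrieved chunk and attach its source URL (hypothetical keys)."""
    blocks = []
    for index, doc in enumerate(docs, start=1):
        blocks.append(f"[{index}] {doc['text'].strip()}\nSource: {doc['source_url']}")
    return "\n\n".join(blocks)


def build_prompt(question: str, docs: list[dict]) -> str:
    """Combine the question with numbered evidence so answers can cite sources."""
    return (
        "Answer using only the evidence below and cite sources by number.\n\n"
        f"Evidence:\n{format_evidence(docs)}\n\n"
        f"Question: {question}"
    )


# Usage with made-up documents and placeholder URLs.
example_docs = [
    {"text": "Plot sizes start at 10 perches. ", "source_url": "https://example.com/a"},
    {"text": "Payment plans are available.", "source_url": "https://example.com/b"},
]
prompt = build_prompt("What plot sizes are offered?", example_docs)
```

Keeping the source URL next to each numbered chunk is one way to let the model's citations map back to a traceable page.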

Corrective retrieval

CRAG widens retrieval, raising top-k from 4 to 8, when confidence heuristics flag weak context.

Architecture

System architecture designed as a readable engineering story

Each layer stays explicit so reviewers can quickly understand where interface, orchestration, persistence, and service responsibilities live.

Acquisition

Browser automation captures rendered pages and converts them into structured markdown and JSONL artifacts.

Playwright, BeautifulSoup4, markdownify

Knowledge Layer

Chunk variants and metadata enrich the corpus before embedding and persistence.

PyYAML, tiktoken, Qdrant

Answering Layer

RAG, CAG, and CRAG services coordinate retrieval, cache reuse, confidence checks, and answer generation.

LangChain Core, LCEL, OpenAI, OpenRouter

System Flow

Key stages broken down as a readable execution path

The pipeline section keeps the most important engineering steps visible without collapsing them into generic bullet lists.

Crawl

Collect rendered pages, remove noisy DOM sections, and preserve useful metadata such as source URL and depth.

Playwright, BeautifulSoup4
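The traversal rules for this stage can be sketched independently of the browser. The example below is a hedged illustration using only the standard library: `crawl`, `in_domain`, and the `fetch_links` callback are hypothetical names, and in a real run the callback would presumably render the page with Playwright and return its links.

```python
import time
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlparse


def in_domain(url: str, allowed_host: str) -> bool:
    """True when the URL's host is the allowed host or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == allowed_host or host.endswith("." + allowed_host)


def crawl(
    start_url: str,
    allowed_host: str,
    fetch_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    delay_seconds: float = 0.0,
) -> list[dict]:
    """Breadth-first traversal that records each visited URL with its depth."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append({"url": url, "depth": depth})
        if depth >= max_depth:
            continue  # keep the page, but do not follow its links
        for href in fetch_links(url):
            # Resolve relative links and drop #fragments so pages dedupe cleanly.
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute not in seen and in_domain(absolute, allowed_host):
                seen.add(absolute)
                queue.append((absolute, depth + 1))
        time.sleep(delay_seconds)  # polite pause between page loads
    return visited
```

Passing the link extractor in as a callback keeps the domain filter and depth limit testable without launching a browser.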

Structure

Convert HTML into markdown and JSONL so the corpus is readable to both developers and downstream chunkers.

markdownify, JSONL
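To make the JSONL artifact concrete, here is a minimal sketch of writing and reading one record per line. The field names (`url`, `depth`, `markdown`) are assumptions for illustration, not the project's actual schema, and the markdown string stands in for converted page content.

```python
import json
from io import StringIO
from typing import Iterable, TextIO


def write_jsonl(records: Iterable[dict], handle: TextIO) -> int:
    """Write one JSON object per line and return how many were written."""
    count = 0
    for record in records:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


def read_jsonl(handle: TextIO) -> list[dict]:
    """Parse every non-empty line back into a dict."""
    return [json.loads(line) for line in handle if line.strip()]


# Usage with an in-memory buffer; a real run would open a file instead.
buffer = StringIO()
pages = [{"url": "https://example.com/a", "depth": 1, "markdown": "# Sample page"}]
write_jsonl(pages, buffer)
buffer.seek(0)
round_tripped = read_jsonl(buffer)
```

Because each line is a complete JSON document, downstream chunkers can stream the corpus without loading it all at once.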

Chunk + Embed

Generate retrieval units across five chunking strategies and embed them into a Qdrant collection.

Qdrant, OpenAI embeddings, PyYAML
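Two of the five strategies, fixed and sliding, are simple enough to sketch directly. The example assumes whitespace-split tokens for readability (a tokenizer such as tiktoken would supply real token boundaries), and the function names and sizes are hypothetical.

```python
def fixed_chunks(tokens: list[str], size: int) -> list[list[str]]:
    """Split tokens into consecutive, non-overlapping windows of `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def sliding_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


tokens = "a b c d e f g".split()
fixed = fixed_chunks(tokens, size=3)
sliding = sliding_chunks(tokens, size=4, overlap=2)
```

Fixed windows minimize the number of stored units, while sliding windows trade a larger index for context that survives chunk boundaries.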

Serve

Route queries through standard RAG, semantic cache lookup, or corrective retrieval depending on context quality.

LCEL, OpenAI, OpenRouter
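The routing decision at this stage can be sketched as a single function with pluggable parts. Everything here is illustrative: `route_query`, its callbacks, and the 0.5 threshold are assumptions, while the 4 and 8 top-k values follow the adaptive retrieval figures reported in the results.

```python
from typing import Callable, Optional


def route_query(
    question: str,
    cache_lookup: Callable[[str], Optional[str]],
    retrieve: Callable[[str, int], list],
    confidence: Callable[[str, list], float],
    generate: Callable[[str, list], str],
    threshold: float = 0.5,  # assumed cutoff, not taken from the project
    base_k: int = 4,
    expanded_k: int = 8,
) -> dict:
    """Try the cache first, then standard retrieval, then a wider retry."""
    cached = cache_lookup(question)
    if cached is not None:
        return {"path": "cag", "answer": cached, "k": 0}

    docs = retrieve(question, base_k)
    path, k = "rag", base_k
    if confidence(question, docs) < threshold:
        # Weak first-pass context: retrieve again with a larger top-k.
        docs = retrieve(question, expanded_k)
        path, k = "crag", expanded_k
    return {"path": path, "answer": generate(question, docs), "k": k}
```

Returning the chosen path alongside the answer makes it easy to log how often each route is taken.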

Timeline

A case-study flow that explains how the system took shape

This timeline keeps the implementation story concise: what was framed first, what was hardened next, and what ultimately made the project production-ready.

Phase 01

Domain ingestion design

Defined crawling boundaries, content extraction rules, and artifact formats for dynamic real estate pages.

Phase 02

Retrieval strategy comparison

Implemented multiple chunking patterns to evaluate recall, context richness, and retrieval footprint.

Phase 03

Adaptive answering

Layered semantic caching and CRAG on top of the base RAG service to improve latency and recovery from weak context.
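A semantic cache of this kind can be sketched as a nearest-neighbor lookup over stored question embeddings. This is a simplified illustration: the `SemanticCache` class, the 0.9 similarity threshold, and the injected `embed` function are all assumptions, and a real deployment would use model embeddings rather than the toy vectors shown in the usage.

```python
import math
from typing import Callable, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a stored answer when a new question is close enough to an old one."""

    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.9):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Sequence[float], str]] = []

    def put(self, question: str, answer: str) -> None:
        self.entries.append((self.embed(question), answer))

    def get(self, question: str) -> Optional[str]:
        if not self.entries:
            return None
        vector = self.embed(question)
        # Pick the stored entry with the highest similarity to the query.
        score, answer = max(
            ((cosine(vector, stored), ans) for stored, ans in self.entries),
            key=lambda pair: pair[0],
        )
        return answer if score >= self.threshold else None


# Usage with a toy bag-of-words embedding over a tiny fixed vocabulary.
VOCAB = ["price", "location", "payment", "plan"]


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


cache = SemanticCache(toy_embed)
cache.put("payment plan options", "Seeded FAQ answer about payment plans.")
```

Seeding the cache with FAQ entries up front means common questions can skip retrieval and generation entirely.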

Challenges

Technical constraints, decisions, and the reasoning behind them

Each challenge is tied to a concrete design choice and a specific outcome.

Solution

Used Playwright rendering waits, traversal rules, and content cleanup before conversion.

Outcome

Created a stable corpus instead of brittle static scrape output.

Solution

Stored multiple chunk representations with strategy metadata rather than forcing one segmentation approach.

Outcome

Made retrieval tuning evidence-driven instead of guess-driven.
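Storing several representations side by side mostly comes down to tagging each unit with its strategy. The sketch below illustrates this for parent-child chunking; the record fields, ID format, and word-based sizes are hypothetical choices, not the project's schema.

```python
def parent_child_records(
    text: str,
    source_url: str,
    parent_size: int = 40,
    child_size: int = 10,
) -> list[dict]:
    """Emit large parent chunks plus small child chunks that point back to them."""
    words = text.split()
    records = []
    for p_index, p_start in enumerate(range(0, len(words), parent_size)):
        parent_words = words[p_start:p_start + parent_size]
        parent_id = f"parent-{p_index}"
        records.append({
            "id": parent_id,
            "strategy": "parent_child",
            "role": "parent",
            "parent_id": None,
            "source_url": source_url,
            "text": " ".join(parent_words),
        })
        for c_index, c_start in enumerate(range(0, len(parent_words), child_size)):
            records.append({
                "id": f"{parent_id}-child-{c_index}",
                "strategy": "parent_child",
                "role": "child",
                "parent_id": parent_id,  # lets retrieval swap a child hit for its parent
                "source_url": source_url,
                "text": " ".join(parent_words[c_start:c_start + child_size]),
            })
    return records
```

With a `strategy` field on every record, results from different chunking approaches can live in one store and still be filtered and compared separately.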

Solution

Added a CRAG confidence heuristic that expands retrieval when overlap, richness, and diversity are weak.

Outcome

Improved robustness without introducing a separate reranking service.
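One way such a heuristic could be scored is shown below. The definitions of overlap, richness, and diversity, the equal weighting, and the 200-word target are all assumptions made for this sketch; the project's actual formulas are not documented here.

```python
import re


def _terms(text: str) -> set[str]:
    """Lowercased alphanumeric terms in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def crag_confidence(question: str, docs: list[dict], target_words: int = 200) -> float:
    """Average three signals in [0, 1]: term overlap, context richness, source diversity."""
    if not docs:
        return 0.0
    query_terms = _terms(question)
    context_terms = set().union(*(_terms(doc["text"]) for doc in docs))
    # Overlap: share of query terms that appear anywhere in the retrieved context.
    overlap = len(query_terms & context_terms) / len(query_terms) if query_terms else 0.0
    # Richness: retrieved word count relative to a target, capped at 1.
    total_words = sum(len(doc["text"].split()) for doc in docs)
    richness = min(1.0, total_words / target_words)
    # Diversity: share of retrieved chunks that come from distinct source URLs.
    diversity = len({doc["source_url"] for doc in docs}) / len(docs)
    return (overlap + richness + diversity) / 3
```

A score like this relies only on the retrieved text and metadata, which is consistent with avoiding a separate reranking service.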

Results

Metrics and outcomes presented for quick technical review

The emphasis here is signal, not decoration: key numbers, verifiable outcomes, and the context needed to interpret them responsibly.

691

Retrieval Units

Persisted across chunk artifacts for repeatable evaluation.

5

Chunk Strategies

Semantic, fixed, sliding, parent-child, and late chunking.

FAQ Cache Seeds

Preloaded prompts for low-latency domain responses.

4 -> 8

Adaptive Retrieval

CRAG expands top-k when confidence falls below threshold.

Key Results

Built an end-to-end domain assistant workflow from crawl to grounded answer generation.
Persisted 691 retrieval units and enabled repeatable chunk strategy evaluation.
Reduced repeated query cost with semantic FAQ and history cache layers.
Improved answer trust through evidence URL formatting and source-backed outputs.

Business Impact

Business value

Shows how fragmented property content can become an assistant-ready experience without losing citation quality.

Engineering value

Demonstrates retrieval experimentation, config-driven orchestration, and a clean path from prototype to service packaging.
