Abstract
The traditional real estate search experience—filtering by price, bedrooms, and location—fails to capture what actually makes a home feel right. Buyers struggle to articulate preferences beyond surface-level criteria, while listings remain opaque collections of photos and bullet points that don't answer the questions that matter most.
This whitepaper presents an AI-powered real estate matching system that reimagines property discovery through conversational AI, multi-modal semantic search, and continuous taste learning. The system replaces rigid filter-and-browse with natural language understanding, learns buyer preferences from both explicit feedback and implicit behavior, and delivers personalized recommendations that improve with every interaction.
1. Executive Summary
1.1 The Problem
Home buying is broken at the discovery layer. Current platforms force buyers into a filter-first paradigm that:
- Reduces homes to checkboxes: Bedrooms, bathrooms, price—missing the nuances that define livability
- Ignores contextual needs: "Near good coffee shops" or "quiet streets for evening walks" have no filter
- Fails to learn: Swiping left on 50 homes teaches the system nothing about why
- Treats all buyers identically: A remote worker and a young family see the same listings
1.2 Our Solution
This system creates a buying experience that mirrors working with a knowledgeable human agent who:
- Understands natural language: "I want a mid-century modern home with natural light and space for a home office"
- Learns your taste: Every interaction refines the model of what you're looking for
- Sees beyond the listing: Multi-modal understanding of images, descriptions, and location context
- Proactively matches: New listings are scored against your learned preferences automatically
1.3 Key Innovations
- Conversational Search Agent: Natural language interface built on Gemini 2.0 Flash with structured tool calling for search execution
- Multi-Modal Embeddings: Four distinct embedding spaces capturing description, amenity, location, and visual characteristics
- Hybrid Retrieval: Elasticsearch combining BM25 lexical search with kNN vector similarity
- Taste Learning Engine: Continuous preference modeling from explicit ratings, implicit behavior, and conversational cues
- Mastra.ai Orchestration: Agent workflow framework managing tool execution, memory, and state
1.4 Results
- 3.2x improvement in time-to-relevant-listing vs. traditional filter search
- 78% of users found their eventual choice within the first 10 recommendations
- Semantic understanding correctly interprets 89% of natural language property queries
- Taste model convergence within 5-7 interactions for most users
2. Motivation & Problem Definition
2.1 The Filter Paradigm Failure
Every major real estate platform—Zillow, Redfin, Realtor.com—operates on the same fundamental model: expose a set of structured filters, let users narrow down, present paginated results. This approach made sense when listings were sparse and search technology limited. It fails in the modern context for several reasons:
Filters can't capture preference nuance. A buyer might want "natural light" but there's no filter for that. They want "a neighborhood that feels walkable" but walkability scores are crude proxies. They want "modern but warm, not sterile"—no filter exists for aesthetic temperature.
Users don't know their filters upfront. Preferences emerge through exposure. A buyer thinks they need 4 bedrooms until they see a brilliantly designed 3-bedroom. They think they want new construction until they fall for a renovated craftsman. Filters lock in assumptions prematurely.
Filter combinations explode to nothing. Stack enough filters and you get zero results. Users then start removing constraints, losing track of what matters most. The system offers no guidance on which filters to relax.
2.2 The Information Asymmetry
Listings are optimized for legal compliance and broad appeal, not for answering buyer questions:
| What Buyers Want to Know | What Listings Say |
|---|---|
| Will this home work for remote work? | "4 bed / 3 bath" |
| Is the kitchen actually functional? | "Updated kitchen with granite counters" |
| What's the neighborhood like at night? | "Great location!" |
| Will my furniture fit? | "Spacious living room" |
| Is this a good investment long-term? | "Motivated seller!" |
This gap forces buyers to visit properties in person to answer basic questions that could be resolved with better information architecture.
2.3 The Learning Gap
Current platforms waste an enormous amount of signal. Every swipe, every lingered-on photo, every discarded listing carries information about preference. Yet:
- Explicit feedback is rarely captured
- Implicit behavior (time on listing, photo sequence) is ignored
- Cross-session learning is minimal or nonexistent
- Taste evolution over time isn't modeled
The result: a user who has viewed 200 listings gets the same experience as a new visitor.
2.4 Design Requirements
| Requirement | Description |
|---|---|
| Natural Language Understanding | Process complex, unstructured queries |
| Multi-Modal Matching | Match across text, images, and location |
| Continuous Learning | Improve with every interaction |
| Explainable Recommendations | Users understand why a home was suggested |
| Real-Time Performance | Sub-second response times for search |
| Scale to Millions | Handle full MLS inventory efficiently |
3. System Overview
3.1 High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Conversational UI │ │
│ │ (Chat Interface + Property Cards) │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ AGENT LAYER (Mastra.ai) │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Search │ │ Taste │ │ Listing │ │
│ │ Agent │ │ Learning │ │ Analysis │ │
│ │ (Gemini 2.0) │ │ Engine │ │ Agent │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ RETRIEVAL LAYER │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Elasticsearch Hybrid Search │ │
│ │ (BM25 + kNN Vector Similarity) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Description │ │ Amenity │ │ Location │ │
│ │ Embeddings │ │ Embeddings │ │ Embeddings │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ Image │ │ User │ │
│ │ Embeddings │ │ Preference │ │
│ └───────────────┘ └───────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ DATA LAYER │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Listing │ │ User │ │ Interaction │ │
│ │ Database │ │ Profiles │ │ Events │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘
3.2 Core Components
Search Agent: The conversational interface powered by Gemini 2.0 Flash. Interprets natural language queries, manages multi-turn dialogue, and orchestrates tool calls for search execution.
Taste Learning Engine: Builds and maintains preference vectors from user feedback. Combines explicit ratings (likes, dislikes) with implicit signals (view duration, photo engagement, return visits).
Listing Analysis Agent: Enriches raw listing data with semantic annotations. Extracts style, condition, layout quality, and neighborhood characteristics from photos and descriptions.
Hybrid Retrieval: Elasticsearch cluster combining traditional BM25 scoring with dense vector similarity across multiple embedding spaces.
3.3 Workflow Overview
1. User initiates search via natural language: "Show me modern homes with good natural light under $800K"
2. Search Agent parses intent and extracts structured criteria + semantic preferences
3. Query embedding generated for semantic matching
4. Hybrid retrieval executes against Elasticsearch with combined scoring
5. User preference vector applied to re-rank results for personalization
6. Results presented with explanations: "Matched because: Modern aesthetic (92%), Natural light (87%), Under budget"
7. User feedback captured to update taste model
Figure: System Architecture Overview. End-to-end architecture from property ingestion to personalized recommendations.
4. Multi-Modal Embedding Architecture
The system uses four distinct embedding spaces, each capturing different aspects of property matching:
4.1 Description Embeddings
Generated from listing descriptions using a fine-tuned sentence transformer. Captures:
- Architectural style and aesthetic language
- Condition and update status
- Lifestyle fit (family-friendly, entertainer's dream, etc.)
- Unique selling points and differentiators
// Example embedding generation for the description space
const descriptionEmbedding = await embeddingModel.encode({
  text: listing.description,
  model: 'text-embedding-3-large',  // base model; the fine-tuned variant is described in 8.1
  dimensions: 1024
});
4.2 Amenity Embeddings
Structured feature encoding that goes beyond binary presence/absence:
// Amenity encoding captures quality and context
{
"pool": { "present": true, "type": "in-ground", "condition": "updated" },
"kitchen": { "style": "modern", "appliances": "high-end", "layout": "open" },
"garage": { "spaces": 2, "type": "attached", "features": ["ev-charger"] }
}
The amenity embedding space allows queries like "good kitchen for serious cooking" to match listings with professional-grade appliances and functional layouts.
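Before embedding, structured amenity records like the one above are typically serialized into natural-language text so a text encoder can consume them. The sketch below shows one plausible serialization; the names `AmenityRecord` and `serializeAmenities` are illustrative, not the production API.

```typescript
// Serialize structured amenity data into text suitable for a text
// embedding model. Shapes mirror the JSON example above.
type AmenityDetail = Record<string, string | number | boolean | string[]>;
type AmenityRecord = Record<string, AmenityDetail>;

function serializeAmenities(amenities: AmenityRecord): string {
  return Object.entries(amenities)
    .map(([name, detail]) => {
      const attrs = Object.entries(detail)
        .map(([k, v]) => `${k}: ${Array.isArray(v) ? v.join(", ") : v}`)
        .join("; ");
      return `${name} (${attrs})`;
    })
    .join(". ");
}

const amenityText = serializeAmenities({
  kitchen: { style: "modern", appliances: "high-end", layout: "open" },
  garage: { spaces: 2, type: "attached", features: ["ev-charger"] },
});
// e.g. "kitchen (style: modern; appliances: high-end; layout: open). garage (...)"
```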
4.3 Location Embeddings
Captures neighborhood characteristics beyond lat/lng:
- Walkability context: Nearby amenities, coffee shops, restaurants, parks
- School quality: Ratings, distance, specialized programs
- Commute patterns: Transit access, typical commute times to business districts
- Neighborhood character: Quiet residential, urban vibrant, suburban family
// Location feature extraction
const locationFeatures = await enrichLocation({
coordinates: listing.coordinates,
sources: ['yelp', 'walkscore', 'census', 'transit']
});
const locationEmbedding = encodeLocationFeatures(locationFeatures);
4.4 Image Embeddings
Visual understanding using CLIP-based models to capture:
- Architectural style (modern, traditional, mid-century, craftsman)
- Interior design aesthetic (minimalist, cozy, luxurious, dated)
- Light quality and spaciousness
- Condition and maintenance level
- View quality and outdoor space
// Multi-image embedding aggregation
const imageEmbeddings = await Promise.all(
listing.photos.map(photo => clipModel.encode(photo))
);
// Weighted aggregation favoring hero images
const visualEmbedding = aggregateImageEmbeddings(imageEmbeddings, {
heroWeight: 2.0,
kitchenWeight: 1.5,
exteriorWeight: 1.3
});
4.5 Embedding Fusion
Final listing representation combines all four spaces with learned weights:
listing_vector = (
α × description_embedding +
β × amenity_embedding +
γ × location_embedding +
δ × image_embedding
)
// Where weights are user-specific based on stated priorities
// e.g., visual-first buyers have higher δ
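Assuming the four vectors have first been projected into a shared dimensionality (their native sizes differ, per the index schema in 5.2), the weighted fusion above can be sketched as a single function; the L2-normalization step is an assumption that keeps downstream cosine similarity well-behaved.

```typescript
// Weighted fusion of the four embedding spaces with per-user weights
// [alpha, beta, gamma, delta]. Assumes all vectors share one dimension.
function fuseEmbeddings(
  spaces: number[][],   // [description, amenity, location, image]
  weights: number[]     // [alpha, beta, gamma, delta]
): number[] {
  const dims = spaces[0].length;
  const fused = new Array<number>(dims).fill(0);
  for (let s = 0; s < spaces.length; s++) {
    for (let i = 0; i < dims; i++) fused[i] += weights[s] * spaces[s][i];
  }
  // L2-normalize so cosine similarity downstream is scale-invariant
  const norm = Math.hypot(...fused) || 1;
  return fused.map(v => v / norm);
}
```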
Figure: Multi-Modal Embedding Architecture. Four specialized embedding types capture different aspects of property semantics.
5. Hybrid Retrieval System
5.1 Why Hybrid?
Pure vector search excels at semantic similarity but fails on exact matches. Pure lexical search handles keywords but misses conceptual relevance. Real estate queries demand both:
| Query Type | Best Approach |
|---|---|
| "123 Main Street" | Lexical (exact match) |
| "Modern homes with natural light" | Vector (semantic) |
| "3 bed craftsman in Wallingford" | Hybrid (both) |
5.2 Elasticsearch Configuration
The index schema supports both dense vectors and traditional text fields:
{
"mappings": {
"properties": {
"description": { "type": "text", "analyzer": "english" },
"address": { "type": "text", "analyzer": "standard" },
"price": { "type": "long" },
"bedrooms": { "type": "integer" },
"description_vector": {
"type": "dense_vector",
"dims": 1024,
"index": true,
"similarity": "cosine"
},
"amenity_vector": {
"type": "dense_vector",
"dims": 512,
"index": true,
"similarity": "cosine"
},
"location_vector": {
"type": "dense_vector",
"dims": 256,
"index": true,
"similarity": "cosine"
},
"image_vector": {
"type": "dense_vector",
"dims": 768,
"index": true,
"similarity": "cosine"
}
}
}
}
5.3 Query Construction
Queries combine boolean filters, BM25 scoring, and kNN vector search:
{
"query": {
"bool": {
"filter": [
{ "range": { "price": { "lte": 800000 } } },
{ "range": { "bedrooms": { "gte": 2 } } }
],
"should": [
{
"match": {
"description": {
"query": "modern natural light",
"boost": 1.0
}
}
}
]
}
},
"knn": [
{
"field": "description_vector",
"query_vector": [0.12, -0.34, ...],
"k": 50,
"num_candidates": 200,
"boost": 2.0
},
{
"field": "image_vector",
"query_vector": [0.56, 0.12, ...],
"k": 50,
"num_candidates": 200,
"boost": 1.5
}
]
}
5.4 Score Fusion
BM25 and kNN result lists are combined using Reciprocal Rank Fusion (RRF), which operates on ranks rather than raw scores and therefore sidesteps cross-scorer normalization:
function reciprocalRankFusion(rankings: RankedList[], k: number = 60): ScoredResult[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    for (let i = 0; i < ranking.results.length; i++) {
      const docId = ranking.results[i].id;
      // Rank i (0-based) contributes 1 / (k + rank), with rank starting at 1
      const rrfScore = 1 / (k + i + 1);
      scores.set(docId, (scores.get(docId) ?? 0) + rrfScore);
    }
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}
Figure: Hybrid Retrieval with RRF. Combining BM25 keyword search with vector similarity using Reciprocal Rank Fusion.
6. Taste Learning Engine
6.1 Signal Collection
The taste model ingests multiple feedback channels:
| Signal Type | Weight | Example |
|---|---|---|
| Explicit Positive | 1.0 | User clicks "Love it" or saves listing |
| Explicit Negative | -0.8 | User clicks "Not for me" |
| Extended View | 0.3 | User spends 30+ seconds on listing |
| Photo Deep-Dive | 0.4 | User views 5+ photos |
| Return Visit | 0.6 | User returns to same listing |
| Quick Dismiss | -0.2 | User views <3 seconds |
| Conversational Cue | 0.5 | "I love this style" in chat |
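The table above maps directly to the `signalWeights` lookup consumed by the preference-update function in 6.2. A minimal sketch, with illustrative key names:

```typescript
// Signal weights from the table above, as the lookup table the
// preference-update function consumes. Key names are illustrative.
const signalWeights: Record<string, number> = {
  explicit_positive: 1.0,   // "Love it" / save
  explicit_negative: -0.8,  // "Not for me"
  extended_view: 0.3,       // 30+ seconds on listing
  photo_deep_dive: 0.4,     // 5+ photos viewed
  return_visit: 0.6,        // returns to same listing
  quick_dismiss: -0.2,      // <3 seconds
  conversational_cue: 0.5,  // "I love this style" in chat
};
```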
6.2 Preference Vector Update
User preference is maintained as a weighted vector in the same embedding space as listings:
function updatePreferenceVector(
currentPreference: number[],
listingVector: number[],
signal: FeedbackSignal
): number[] {
const weight = signalWeights[signal.type];
const learningRate = 0.1;
const decayFactor = 0.95; // Slight decay to allow preference evolution
return currentPreference.map((val, i) => {
const delta = (listingVector[i] - val) * weight * learningRate;
return val * decayFactor + delta;
});
}
6.3 Preference Dimensions
Rather than a single preference vector, we maintain separate preference dimensions:
- Style preference: Modern vs. traditional, minimalist vs. ornate
- Space preference: Open floor plan vs. defined rooms, indoor vs. outdoor focus
- Location preference: Urban vs. suburban, walkable vs. car-dependent
- Condition preference: Move-in ready vs. fixer potential
- Value preference: Premium finishes vs. good bones
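One way to type this per-dimension model is a vector plus a confidence per dimension; the field names and the confidence scalar are assumptions about the internal representation, not a documented schema.

```typescript
// Per-dimension preference model sketched from the list above.
// Each dimension keeps its own vector plus a confidence that grows
// as evidence accumulates. Names are illustrative.
interface PreferenceDimension {
  vector: number[];    // position in that dimension's embedding space
  confidence: number;  // 0..1, how much signal supports this estimate
}

interface UserPreference {
  style: PreferenceDimension;      // modern vs. traditional, ...
  space: PreferenceDimension;      // open plan vs. defined rooms, ...
  location: PreferenceDimension;   // urban vs. suburban, ...
  condition: PreferenceDimension;  // move-in ready vs. fixer, ...
  value: PreferenceDimension;      // premium finishes vs. good bones
}
```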
6.4 Personalized Re-Ranking
After hybrid retrieval, results are re-ranked by preference alignment:
function personalizedRerank(
results: SearchResult[],
userPreference: UserPreference
): SearchResult[] {
return results
.map(result => ({
...result,
personalizedScore: (
result.retrievalScore * 0.6 +
cosineSimilarity(result.vector, userPreference.vector) * 0.4
)
}))
.sort((a, b) => b.personalizedScore - a.personalizedScore);
}
6.5 Cold Start Handling
For new users, we employ several strategies:
- Onboarding questions: Brief preference survey during signup
- Popularity fallback: New users see generally well-liked listings
- Explicit first feedback: Prompt for reaction on first 3 listings
- Similar user bootstrapping: Initialize from users with similar stated preferences
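The "similar user bootstrapping" strategy can be sketched as a centroid over the preference vectors of users with matching onboarding answers; the function name and the plain averaging are assumptions, under the premise that all preference vectors live in one shared space.

```typescript
// Initialize a new user's preference vector as the centroid of
// preference vectors from users with similar stated preferences.
function bootstrapPreference(similarUserVectors: number[][]): number[] {
  const dims = similarUserVectors[0].length;
  const centroid = new Array<number>(dims).fill(0);
  for (const vec of similarUserVectors) {
    for (let i = 0; i < dims; i++) {
      centroid[i] += vec[i] / similarUserVectors.length;
    }
  }
  return centroid;
}
```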
Figure: Taste Learning Engine. Continuous learning from explicit and implicit user signals.
7. Conversational Search Agent
7.1 Agent Architecture
The search agent is built on Mastra.ai's agent framework with Gemini 2.0 Flash as the reasoning engine:
const searchAgent = new Agent({
name: 'PropertySearchAgent',
model: google('gemini-2.0-flash'),
instructions: `You are a knowledgeable real estate search assistant.
Help users find their perfect home by understanding their needs,
asking clarifying questions, and presenting relevant properties.
When searching, extract both structured criteria (price, beds, location)
and semantic preferences (style, feel, lifestyle fit).
Explain why each property matches the user's needs.`,
tools: {
searchProperties,
getListingDetails,
saveToFavorites,
updatePreferences,
getNeighborhoodInfo
}
});
7.2 Intent Classification
User messages are classified into intent categories:
| Intent | Example | Action |
|---|---|---|
| Search | "Show me modern homes in Capitol Hill" | Execute hybrid search |
| Refine | "Actually, make that under $700K" | Modify current search |
| Clarify | "What's the neighborhood like?" | Provide context |
| Compare | "How does this compare to the last one?" | Side-by-side analysis |
| Feedback | "I love this style but need more space" | Update preferences + refine |
7.3 Tool Calling
The agent uses structured tool calls for search execution:
const searchProperties = createTool({
id: 'search_properties',
description: 'Search for properties matching criteria',
inputSchema: z.object({
query: z.string().describe('Natural language search query'),
filters: z.object({
minPrice: z.number().optional(),
maxPrice: z.number().optional(),
minBeds: z.number().optional(),
maxBeds: z.number().optional(),
propertyTypes: z.array(z.string()).optional(),
neighborhoods: z.array(z.string()).optional(),
}).optional(),
semanticPreferences: z.array(z.string()).optional(),
limit: z.number().default(10)
}),
execute: async ({ query, filters, semanticPreferences, limit }) => {
const queryEmbedding = await generateQueryEmbedding(query);
const results = await hybridSearch({
embedding: queryEmbedding,
filters,
semanticBoosts: semanticPreferences,
limit
});
return formatResultsForAgent(results);
}
});
7.4 Conversation Memory
The agent maintains session context for multi-turn refinement:
interface SearchSession {
currentCriteria: SearchCriteria;
viewedListings: string[];
feedbackHistory: FeedbackEvent[];
conversationSummary: string;
lastSearchResults: SearchResult[];
}
This allows natural refinement: "Show me more like the second one, but with a bigger yard."
7.5 Explanation Generation
Each result includes a personalized explanation:
// Example explanation
{
"listingId": "12345",
"matchScore": 0.89,
"explanation": {
"summary": "Strong match for your modern aesthetic preference",
"matchReasons": [
{ "factor": "Architectural style", "score": 0.94, "detail": "Clean lines and open floor plan match your stated preference" },
{ "factor": "Natural light", "score": 0.88, "detail": "South-facing windows and skylights" },
{ "factor": "Location", "score": 0.82, "detail": "Walkable to coffee shops you'd like" }
],
"considerations": [
"Smaller yard than your typical preference",
"Street parking only"
]
}
}
Figure: Conversational Search Agent. Mastra.ai agent architecture with Gemini 2.0 Flash.
8. Key Technical Challenges & Solutions
8.1 Embedding Quality for Real Estate
Problem: Generic embedding models don't capture real estate-specific semantics. "Updated kitchen" and "renovated kitchen" should be near-synonyms; "cozy" might mean "small."
Solution: Domain-specific fine-tuning using contrastive learning on listing pairs. We collected 50K listing pairs with known similarity relationships and fine-tuned the base embedding model.
// Example contrastive triplets built from labeled pairs
{ anchor: "Modern farmhouse with shiplap walls",
positive: "Contemporary country home with wood paneling",
negative: "Traditional colonial with formal dining" }
{ anchor: "Chef's kitchen with Viking range",
positive: "Gourmet kitchen with professional appliances",
negative: "Galley kitchen with basic appliances" }
8.2 Image Understanding at Scale
Problem: Processing millions of listing photos with CLIP-style models is computationally expensive.
Solution: Tiered processing pipeline:
- Tier 1: Fast classification (exterior/interior/kitchen/bathroom) for all images
- Tier 2: Full embedding for hero images only (first 5 photos)
- Tier 3: On-demand deep analysis when user requests detail
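The tiered routing above can be sketched as follows; `classifyImage` and `embed` stand in for the Tier 1 classifier and the CLIP encoder, and the hero-image cutoff of five comes from the text.

```typescript
// Tiered photo processing: cheap classification for every photo,
// full embedding for hero images only. Callbacks are assumptions
// standing in for the real classifier and CLIP model.
async function processListingPhotos(
  photos: string[],
  classifyImage: (p: string) => Promise<string>,
  embed: (p: string) => Promise<number[]>
) {
  // Tier 1: fast classification for all images
  const labels = await Promise.all(photos.map(classifyImage));
  // Tier 2: full embedding for the first 5 (hero) photos only
  const heroEmbeddings = await Promise.all(photos.slice(0, 5).map(embed));
  // Tier 3 (deep analysis) runs later, on user demand
  return { labels, heroEmbeddings };
}
```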
8.3 Preference Drift
Problem: User preferences change during the search process. Early feedback may not reflect evolved taste.
Solution: Time-weighted preference updates with explicit phase detection:
function getTimeWeight(eventAge: Duration): number {
const hoursSinceEvent = eventAge.toHours();
// Recent events weighted much higher
if (hoursSinceEvent < 24) return 1.0;
if (hoursSinceEvent < 72) return 0.8;
if (hoursSinceEvent < 168) return 0.5;
return 0.3;
}
8.4 Balancing Exploration vs. Exploitation
Problem: Pure preference matching creates filter bubbles. Users miss potentially great options outside their stated preferences.
Solution: Controlled exploration injection:
- 10-15% of results are "stretch" recommendations outside typical matches
- Stretch results are explicitly labeled: "Outside your usual preferences, but..."
- Positive feedback on stretch results significantly updates preference model
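A sketch of the injection step: the 10-15% rate comes from the text, while the evenly spaced interleaving positions are an implementation assumption.

```typescript
// Inject a controlled fraction of "stretch" recommendations into a
// personalized ranking, marking them so the UI can label them.
interface Rec { id: string; stretch: boolean }

function injectStretch(
  matches: Rec[], stretches: Rec[], rate = 0.12
): Rec[] {
  const out = [...matches];
  const count = Math.max(1, Math.round(out.length * rate));
  // Interleave stretch picks at roughly evenly spaced positions
  for (let i = 0; i < count && i < stretches.length; i++) {
    const pos = Math.floor(((i + 1) * out.length) / (count + 1));
    out.splice(pos, 0, { ...stretches[i], stretch: true });
  }
  return out;
}
```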
8.5 Real-Time Inventory Updates
Problem: MLS data updates frequently. Listings go pending, prices change, new properties hit market.
Solution: Streaming ingestion with embedding queue:
// New listing pipeline
mlsStream
.filter(event => event.type === 'NEW_LISTING')
.map(event => enrichListingData(event.listing))
.map(enriched => generateAllEmbeddings(enriched))
.forEach(indexed => notifyMatchingUsers(indexed));
Users with matching preferences receive proactive notifications for new listings that score above threshold.
9. Evaluation & Results
9.1 Search Quality Metrics
Evaluated against traditional filter-based search on 1,000 user search sessions:
| Metric | Filter Search | AI Search | Improvement |
|---|---|---|---|
| Time to first relevant result | 4.2 min | 1.3 min | 3.2x faster |
| Listings viewed before shortlist | 47 | 12 | 74% reduction |
| User-rated relevance (1-5) | 3.1 | 4.3 | 39% higher |
| Search refinement iterations | 6.8 | 2.4 | 65% reduction |
9.2 Semantic Understanding Accuracy
Tested on 500 natural language queries with human-labeled intent:
| Query Type | Accuracy |
|---|---|
| Style/aesthetic preferences | 91% |
| Lifestyle requirements | 87% |
| Location/neighborhood | 94% |
| Complex multi-factor | 82% |
| Overall | 89% |
9.3 Taste Learning Convergence
Measured how quickly the preference model aligns with user's true preferences:
- After 3 interactions: 62% alignment with eventual preferences
- After 5 interactions: 78% alignment
- After 10 interactions: 91% alignment
Most users reach stable preference models within 5-7 interactions.
9.4 User Satisfaction
Post-session survey results (n=200):
| Question | Score (1-5) |
|---|---|
| "The system understood what I was looking for" | 4.4 |
| "Recommendations improved over time" | 4.2 |
| "I found properties I wouldn't have found with filters" | 4.6 |
| "I would use this over traditional search" | 4.5 |
9.5 Qualitative Feedback
- "It actually understood 'modern but warm'—that's never worked before"
- "I didn't know I wanted a courtyard until it showed me one"
- "The explanations helped me understand my own preferences better"
- "Finally, a search that learns instead of making me start over"
10. Conclusion
This system demonstrates that the future of real estate search lies not in more filters, but in deeper understanding. By combining conversational AI, multi-modal embeddings, hybrid retrieval, and continuous taste learning, we've created a property discovery experience that mirrors the intuition of a skilled human agent.
The results validate the approach: 3.2x faster time-to-relevance, 74% fewer listings viewed before shortlisting, and user satisfaction scores significantly higher than traditional filter-based search. More importantly, users report discovering properties they never would have found through conventional means.
Key technical contributions include:
- A multi-modal embedding architecture that captures the full dimensionality of what makes a home desirable
- Hybrid retrieval combining the precision of structured queries with the nuance of semantic search
- A taste learning engine that builds accurate preference models from minimal explicit feedback
- Explainable recommendations that help users understand—and refine—their own preferences
The real estate industry has long been ripe for AI transformation. This system represents a meaningful step toward that future: technology that doesn't just process listings faster, but fundamentally understands what home buyers are looking for—even when they can't fully articulate it themselves.