Building a scalable real estate data ETL pipeline
Context
While working with a government agency client, I faced an interesting data engineering challenge. They needed to collect and analyze comprehensive data about real estate agencies and their properties from a proprietary Multiple Listing Service (MLS) system. The scope? Hundreds of thousands of property records that needed to be extracted, transformed, and analyzed while carefully respecting API rate limits and handling potential failures.
Note: Out of respect for API terms and client confidentiality, I've anonymized the specific platform and client details. However, the technical challenge is common in enterprise data integration: working with rate-limited APIs, handling large datasets efficiently, and maintaining data integrity throughout the ETL process.
What made this particularly interesting was the balance between aggressive data collection and being a good API citizen – I needed to get this done efficiently but without overwhelming the source system.
What I Built
I developed a resilient ETL (Extract, Transform, Load) pipeline that could:
- Extract property data from the MLS API in configurable batches
- Process and deduplicate agency information on the fly
- Transform the raw data into clean, analysis-ready CSV format
- Handle interruptions gracefully with built-in checkpointing
The system was designed to be both robust and considerate – implementing smart rate limiting, token management, and data integrity checks throughout the pipeline.
Technical Breakdown
Stack & Tools
- TypeScript/Node.js: Core implementation
- Axios: HTTP client for API interactions
- File System (fs): For data persistence and checkpointing
- Path: Cross-platform file path handling
Key Architecture Decisions
- Chunked Data Processing
import fs from "fs";

const batchSize = 100;
const maxPropertiesPerFile = 70000;

async function fetchAllProperties() {
  // Create the output directory if it doesn't exist
  if (!fs.existsSync(outputDir)) {
    fs.mkdirSync(outputDir);
  }

  // Find the last file I was working on
  let currentFileIndex = 0;
  let allProperties: any[] = [];

  // ... chunked processing logic
}
I implemented a chunked processing system that:
- Fetches data in small batches (100 properties)
- Splits output into manageable files (~70K properties each)
- Enables resume-ability if the process fails
This approach proved crucial when dealing with the full dataset of nearly 700,000 properties.
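To make the chunking concrete, here is a minimal sketch of the file-rotation step, written to match the constants above. The flushChunk and maybeRotate helpers, and the properties-N.json naming, are illustrative assumptions for this sketch rather than the exact production code (whether chunks land as JSON or CSV is an implementation detail; JSON keeps the sketch simple).

import fs from "fs";
import path from "path";

const maxPropertiesPerFile = 70000;

// Hypothetical helper: write the current chunk to its own file so no single
// file (or in-memory array) grows without bound.
function flushChunk(outputDir: string, fileIndex: number, chunk: any[]): void {
  const filePath = path.join(outputDir, `properties-${fileIndex}.json`);
  fs.writeFileSync(filePath, JSON.stringify(chunk));
  console.log(`Wrote ${chunk.length} properties to ${filePath}`);
}

// Inside the batch loop: rotate to a fresh file once the ~70K cap is reached.
function maybeRotate(
  outputDir: string,
  fileIndex: number,
  chunk: any[],
): { fileIndex: number; chunk: any[] } {
  if (chunk.length >= maxPropertiesPerFile) {
    flushChunk(outputDir, fileIndex, chunk);
    return { fileIndex: fileIndex + 1, chunk: [] };
  }
  return { fileIndex, chunk };
}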
- Resilient Error Handling
// Inside the paginated fetch loop:
try {
  const response = await axios.post<SearchResponse>(
    API_ENDPOINT,
    requestPayload,
    { headers },
  );
} catch (error: any) {
  if (error?.response?.status === 401) {
    console.error(
      "Token expired! Please update the authorization token and run again",
    );
    return;
  }
  // For other errors, wait and retry the same batch
  await delay(5000);
  continue;
}
The system handles:
- Token expiration gracefully
- Network failures with automatic retries
- Rate limiting through intelligent delays (the delay helper and the surrounding fetch loop are sketched below)
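The delay helper isn't shown in the snippets above, and the continue statement only makes sense inside the paginated fetch loop. Here is a simplified skeleton of both; the loop shape and the fetchLoop name are illustrative, not the exact production code.

// Promise-based sleep used for both retries and rate limiting.
const delay = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

const batchSize = 100; // as configured earlier

async function fetchLoop(total: number): Promise<void> {
  let skip = 0;
  while (skip < total) {
    try {
      // ...await axios.post<SearchResponse>(...) for the current batch...
      // ...append the results, then advance the cursor...
      skip += batchSize;
    } catch (error: any) {
      if (error?.response?.status === 401) {
        console.error(
          "Token expired! Please update the authorization token and run again",
        );
        return;
      }
      // Transient failure: wait, then retry the same batch.
      await delay(5000);
      continue;
    }
  }
}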
- Smart Deduplication
const agencyMap = new Map<string, Agency>();

// During property processing
properties.forEach((property) => {
  const { agency } = property;
  if (agency?.email) {
    agencyMap.set(agency.email, {
      email: agency.email,
      name: agency.name,
      phone: agency.phone,
      websiteUrl: agency.websiteUrl,
    });
  }
});
I used the agency email as a unique key to deduplicate agency information across properties, ensuring consistent records while keeping memory usage low.
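For completeness, the Agency shape and a CSV export step might look roughly like the sketch below. The field list mirrors the snippet above, while the writeAgenciesCsv helper and its quoting rules are assumptions for illustration.

import fs from "fs";

interface Agency {
  email: string;
  name?: string;
  phone?: string;
  websiteUrl?: string;
}

// Hypothetical export step: turn the deduplicated map into an
// analysis-ready CSV file. Quoting is kept minimal for the sketch.
function writeAgenciesCsv(agencyMap: Map<string, Agency>, filePath: string): void {
  const header = "email,name,phone,websiteUrl";
  const rows = Array.from(agencyMap.values()).map((a) =>
    [a.email, a.name, a.phone, a.websiteUrl]
      .map((value) => `"${(value ?? "").replace(/"/g, '""')}"`)
      .join(","),
  );
  fs.writeFileSync(filePath, [header, ...rows].join("\n"));
}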
Edge Cases Handled
- Partial File Completion: The system tracks progress and can resume from the last successful batch
- API Token Management: Graceful handling of token expiration with clear error messages
- Data Quality: Handling of missing or malformed data without breaking the pipeline
- Process Interruption: File-based checkpointing enables resume-ability (see the sketch after this list)
- Memory Management: Chunked, file-based processing keeps memory usage bounded even on very large datasets
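Here is a rough sketch of how file-based checkpointing can drive the resume logic, assuming the chunk files follow a predictable naming scheme. The findResumePoint helper and the file pattern are illustrative, not the exact production code.

import fs from "fs";
import path from "path";

// Hypothetical resume helper: count how many chunk files already exist and
// derive how many properties have been fetched so far, so the loop can pick
// up from the last completed batch instead of starting over.
function findResumePoint(
  outputDir: string,
  maxPropertiesPerFile: number,
): { currentFileIndex: number; skip: number } {
  if (!fs.existsSync(outputDir)) {
    return { currentFileIndex: 0, skip: 0 };
  }
  const chunkFiles = fs
    .readdirSync(outputDir)
    .filter((name) => /^properties-\d+\.json$/.test(name));

  // Completed files account for maxPropertiesPerFile records each;
  // the next file index is simply the count of existing files.
  // Note: a partially written last chunk would need extra handling; omitted here.
  return {
    currentFileIndex: chunkFiles.length,
    skip: chunkFiles.length * maxPropertiesPerFile,
  };
}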
What I Learned
- Batch Processing Trade-offs
Initially, I tried processing all data in memory. This worked fine during testing with small datasets but quickly became problematic when dealing with the full dataset. Breaking the data into chunks with file-based checkpointing proved more reliable, though slightly slower.
The key insight? Sometimes trading raw speed for reliability is the right choice, especially when dealing with large-scale data extraction.
- Rate Limiting Strategy
await delay(Math.random() * 1000); // Random delay between 0-1 seconds
Instead of fixed delays, I used random delays between requests to avoid the predictable patterns that might trigger API defenses. This simple change made the script's request pattern look much more like natural traffic.
- Data Integrity > Speed
Using a Map for deduplication was more efficient than array-based filtering, especially when dealing with tens of thousands of records. The slight memory overhead was worth it for the guaranteed uniqueness and O(1) lookup times.
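For contrast, an array-based version of the same deduplication would look roughly like this. The snippet is illustrative rather than code from the project, but it shows why every insert turns into a full scan of the list.

// Array-based alternative (for contrast): every insert scans the whole list,
// so deduplication degrades toward O(n^2) as the agency list grows.
function dedupeWithArray(properties: any[]): any[] {
  const agencies: any[] = [];
  properties.forEach((property) => {
    const { agency } = property;
    if (agency?.email && !agencies.some((a) => a.email === agency.email)) {
      agencies.push(agency);
    }
  });
  return agencies;
}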
- Progress Visibility
console.log(
  `Saved ${skip + allProperties.length}/${total} properties (${(
    ((skip + allProperties.length) / total) *
    100
  ).toFixed(2)}%)`,
);
Adding detailed progress logging helped track long-running processes and identify bottlenecks. This became invaluable when the client asked for status updates or when debugging issues.
What's Next?
- Performance Optimization
  - Implement parallel processing with worker threads
  - Investigate streaming CSV generation for lower memory usage (a possible approach is sketched after this list)
  - Add batch size auto-tuning based on API response times
- Data Validation
  - Add JSON schema validation for API responses
  - Implement data quality scoring
  - Add automated anomaly detection
- Monitoring & Observability
  - Add proper metrics collection
  - Implement real-time progress tracking
  - Create a dashboard for pipeline status
- Configuration Management
  - Move hardcoded values to configuration files
  - Add environment-specific settings
  - Implement feature flags for different processing modes
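On the streaming CSV idea above, one possible direction is to append rows to a write stream as each batch arrives instead of accumulating them in memory first. This is a sketch of that exploration, not something the current pipeline does, and the field names (id, price, agency.email) are placeholders rather than the real schema.

import fs from "fs";

// Possible future direction: stream rows to disk batch by batch,
// instead of accumulating up to ~70K records in memory before writing.
const csvStream = fs.createWriteStream("properties.csv", { flags: "a" });

function appendBatch(batch: any[]): void {
  for (const property of batch) {
    const row = [property.id, property.price, property.agency?.email]
      .map((value) => `"${String(value ?? "").replace(/"/g, '""')}"`)
      .join(",");
    csvStream.write(row + "\n");
  }
}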
Key Takeaways
This project reinforced the importance of building data pipelines that are not just functional, but also resilient and maintainable. The extra time spent on error handling and progress tracking paid off many times over during the actual data collection phase.
It also highlighted how technical challenges often require balancing competing concerns – in this case, speed vs. reliability, and thoroughness vs. API courtesy.
The end result was a robust system that successfully processed nearly 700,000 property records, extracting valuable insights for our client while maintaining data integrity and system reliability throughout the process.
Note: This case study has been anonymized to protect client confidentiality while preserving the technical insights and learning opportunities from the project.