Building a scalable real estate data ETL pipeline
Context
While working with a government agency client, I faced an interesting data engineering challenge. They needed to collect and analyze comprehensive data about real estate agencies and their properties from a proprietary Multiple Listing Service (MLS) system. The scope? Hundreds of thousands of property records that needed to be extracted, transformed, and analyzed while carefully respecting API rate limits and handling potential failures.
Note: Out of respect for API terms and client confidentiality, I've anonymized the specific platform and client details. However, the technical challenge is common in enterprise data integration: working with rate-limited APIs, handling large datasets efficiently, and maintaining data integrity throughout the ETL process.
What made this particularly interesting was the balance between aggressive data collection and being a good API citizen – I needed to get this done efficiently but without overwhelming the source system.
What I Built
I developed a resilient ETL (Extract, Transform, Load) pipeline that could:
- Extract property data from the MLS API in configurable batches
- Process and deduplicate agency information on the fly
- Transform the raw data into clean, analysis-ready CSV format
- Handle interruptions gracefully with built-in checkpointing
The system was designed to be both robust and considerate – implementing smart rate limiting, token management, and data integrity checks throughout the pipeline.
Technical Breakdown
Stack & Tools
- TypeScript/Node.js: Core implementation
- Axios: HTTP client for API interactions
- File System (fs): For data persistence and checkpointing
- Path: Cross-platform file path handling
Key Architecture Decisions
- Chunked Data Processing
import fs from "fs";

const batchSize = 100;
const maxPropertiesPerFile = 70000;

async function fetchAllProperties() {
  // Create the output directory if it doesn't exist
  if (!fs.existsSync(outputDir)) {
    fs.mkdirSync(outputDir);
  }

  // Find the last file I was working on
  let currentFileIndex = 0;
  let allProperties: any[] = [];

  // ... chunked processing logic
}
I implemented a chunked processing system that:
- Fetches data in small batches (100 properties)
- Splits output into manageable files (~70K properties each)
- Enables resume-ability if the process fails
This approach proved crucial when dealing with the full dataset of nearly 700,000 properties.
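To make the chunking concrete, here is a minimal sketch of the file-rotation step, written to match the constants above. The flushChunk and maybeRotate helpers, and the properties-N.json naming, are illustrative assumptions for this sketch rather than the exact production code (whether chunks land as JSON or CSV is an implementation detail; JSON keeps the sketch simple).

import fs from "fs";
import path from "path";

const maxPropertiesPerFile = 70000;

// Hypothetical helper: write the current chunk to its own file so no single
// file (or in-memory array) grows without bound.
function flushChunk(outputDir: string, fileIndex: number, chunk: any[]): void {
  const filePath = path.join(outputDir, `properties-${fileIndex}.json`);
  fs.writeFileSync(filePath, JSON.stringify(chunk));
  console.log(`Wrote ${chunk.length} properties to ${filePath}`);
}

// Inside the batch loop: rotate to a fresh file once the ~70K cap is reached.
function maybeRotate(
  outputDir: string,
  fileIndex: number,
  chunk: any[],
): { fileIndex: number; chunk: any[] } {
  if (chunk.length >= maxPropertiesPerFile) {
    flushChunk(outputDir, fileIndex, chunk);
    return { fileIndex: fileIndex + 1, chunk: [] };
  }
  return { fileIndex, chunk };
}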
- Resilient Error Handling
// Inside the paginated fetch loop:
try {
  const response = await axios.post<SearchResponse>(
    API_ENDPOINT,
    requestPayload,
    { headers },
  );
} catch (error: any) {
  if (error?.response?.status === 401) {
    console.error(
      "Token expired! Please update the authorization token and run again",
    );
    return;
  }
  // For other errors, wait and retry the same batch
  await delay(5000);
  continue;
}
The system handles:
- Token expiration gracefully
- Network failures with automatic retries
- Rate limiting through intelligent delays (the delay helper and the surrounding fetch loop are sketched below)
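The delay helper isn't shown in the snippets above, and the continue statement only makes sense inside the paginated fetch loop. Here is a simplified skeleton of both; the loop shape and the fetchLoop name are illustrative, not the exact production code.

// Promise-based sleep used for both retries and rate limiting.
const delay = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

const batchSize = 100; // as configured earlier

async function fetchLoop(total: number): Promise<void> {
  let skip = 0;
  while (skip < total) {
    try {
      // ...await axios.post<SearchResponse>(...) for the current batch...
      // ...append the results, then advance the cursor...
      skip += batchSize;
    } catch (error: any) {
      if (error?.response?.status === 401) {
        console.error(
          "Token expired! Please update the authorization token and run again",
        );
        return;
      }
      // Transient failure: wait, then retry the same batch.
      await delay(5000);
      continue;
    }
  }
}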
- Smart Deduplication
const agencyMap = new Map<string, Agency>();

// During property processing
properties.forEach((property) => {
  const { agency } = property;
  if (agency?.email) {
    agencyMap.set(agency.email, {
      email: agency.email,
      name: agency.name,
      phone: agency.phone,
      websiteUrl: agency.websiteUrl,
    });
  }
});
I used the agency email as a unique key to deduplicate agency information across properties, ensuring consistent records while keeping memory usage low.
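For completeness, the Agency shape and a CSV export step might look roughly like the sketch below. The field list mirrors the snippet above, while the writeAgenciesCsv helper and its quoting rules are assumptions for illustration.

import fs from "fs";

interface Agency {
  email: string;
  name?: string;
  phone?: string;
  websiteUrl?: string;
}

// Hypothetical export step: turn the deduplicated map into an
// analysis-ready CSV file. Quoting is kept minimal for the sketch.
function writeAgenciesCsv(agencyMap: Map<string, Agency>, filePath: string): void {
  const header = "email,name,phone,websiteUrl";
  const rows = Array.from(agencyMap.values()).map((a) =>
    [a.email, a.name, a.phone, a.websiteUrl]
      .map((value) => `"${(value ?? "").replace(/"/g, '""')}"`)
      .join(","),
  );
  fs.writeFileSync(filePath, [header, ...rows].join("\n"));
}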
Edge Cases Handled
- Partial File Completion: The system tracks progress and can resume from the last successful batch
- API Token Management: Graceful handling of token expiration with clear error messages
- Data Quality: Handling of missing or malformed data without breaking the pipeline
- Process Interruption: File-based checkpointing enables resume-ability (see the sketch after this list)
- Memory Management: Chunked, file-based processing keeps memory usage bounded even on very large datasets
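Here is a rough sketch of how file-based checkpointing can drive the resume logic, assuming the chunk files follow a predictable naming scheme. The findResumePoint helper and the file pattern are illustrative, not the exact production code.

import fs from "fs";
import path from "path";

// Hypothetical resume helper: count how many chunk files already exist and
// derive how many properties have been fetched so far, so the loop can pick
// up from the last completed batch instead of starting over.
function findResumePoint(
  outputDir: string,
  maxPropertiesPerFile: number,
): { currentFileIndex: number; skip: number } {
  if (!fs.existsSync(outputDir)) {
    return { currentFileIndex: 0, skip: 0 };
  }
  const chunkFiles = fs
    .readdirSync(outputDir)
    .filter((name) => /^properties-\d+\.json$/.test(name));

  // Completed files account for maxPropertiesPerFile records each;
  // the next file index is simply the count of existing files.
  // Note: a partially written last chunk would need extra handling; omitted here.
  return {
    currentFileIndex: chunkFiles.length,
    skip: chunkFiles.length * maxPropertiesPerFile,
  };
}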
What I Learned
- Batch Processing Trade-offs
Initially, I tried processing all data in memory. This worked fine during testing with small datasets but quickly became problematic when dealing with the full dataset. Breaking the data into chunks with file-based checkpointing proved more reliable, though slightly slower.
The key insight? Sometimes trading raw speed for reliability is the right choice, especially when dealing with large-scale data extraction.
- Rate Limiting Strategy
await delay(Math.random() * 1000); // Random delay between 0-1 seconds
Instead of fixed delays, I used random delays between requests to avoid the predictable patterns that might trigger API defenses. This simple change made the script's request pattern look much more like natural traffic.
- Data Integrity > Speed
Using a Map for deduplication was more efficient than array-based filtering, especially when dealing with tens of thousands of records. The slight memory overhead was worth it for the guaranteed uniqueness and O(1) lookup times.
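For contrast, an array-based version of the same deduplication would look roughly like this. The snippet is illustrative rather than code from the project, but it shows why every insert turns into a full scan of the list.

// Array-based alternative (for contrast): every insert scans the whole list,
// so deduplication degrades toward O(n^2) as the agency list grows.
function dedupeWithArray(properties: any[]): any[] {
  const agencies: any[] = [];
  properties.forEach((property) => {
    const { agency } = property;
    if (agency?.email && !agencies.some((a) => a.email === agency.email)) {
      agencies.push(agency);
    }
  });
  return agencies;
}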
- Progress Visibility
console.log(
  `Saved ${skip + allProperties.length}/${total} properties (${(
    ((skip + allProperties.length) / total) *
    100
  ).toFixed(2)}%)`,
);
Adding detailed progress logging helped track long-running processes and identify bottlenecks. This became invaluable when the client asked for status updates or when debugging issues.
What's Next?
- Performance Optimization
  - Implement parallel processing with worker threads
  - Investigate streaming CSV generation for lower memory usage (a possible approach is sketched after this list)
  - Add batch size auto-tuning based on API response times
- Data Validation
  - Add JSON schema validation for API responses
  - Implement data quality scoring
  - Add automated anomaly detection
- Monitoring & Observability
  - Add proper metrics collection
  - Implement real-time progress tracking
  - Create a dashboard for pipeline status
- Configuration Management
  - Move hardcoded values to configuration files
  - Add environment-specific settings
  - Implement feature flags for different processing modes
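On the streaming CSV idea above, one possible direction is to append rows to a write stream as each batch arrives instead of accumulating them in memory first. This is a sketch of that exploration, not something the current pipeline does, and the field names (id, price, agency.email) are placeholders rather than the real schema.

import fs from "fs";

// Possible future direction: stream rows to disk batch by batch,
// instead of accumulating up to ~70K records in memory before writing.
const csvStream = fs.createWriteStream("properties.csv", { flags: "a" });

function appendBatch(batch: any[]): void {
  for (const property of batch) {
    const row = [property.id, property.price, property.agency?.email]
      .map((value) => `"${String(value ?? "").replace(/"/g, '""')}"`)
      .join(",");
    csvStream.write(row + "\n");
  }
}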
Key Takeaways
This project reinforced the importance of building data pipelines that are not just functional, but also resilient and maintainable. The extra time spent on error handling and progress tracking paid off many times over during the actual data collection phase.
It also highlighted how technical challenges often require balancing competing concerns – in this case, speed vs. reliability, and thoroughness vs. API courtesy.
The end result was a robust system that successfully processed nearly 700,000 property records, extracting valuable insights for our client while maintaining data integrity and system reliability throughout the process.
Note: This case study has been anonymized to protect client confidentiality while preserving the technical insights and learning opportunities from the project.