Migrating a large Excel use case to our platform
Context
A pretty cool moment for our platform was securing a partnership with a major Dutch government agency in the infrastructure sector. The task was large: migrate highly complex compliance data for over 150 of their clients from intricate Excel spreadsheets into our live web application. Each client's data was provided by multiple assessors, included numerous evidence files, and was critical to their operations.
To be honest, this was not just a "technical task"; it was a high-stakes project to prove our platform's capability, solidify a key partnership, and save hundreds of hours of manual work. A successful migration would give both our team and our partner peace of mind, and it would finally let them use our platform with their own data, as intended.
For confidentiality, I’ll refer to the client as the "Agency," but the core challenge is a common one in the enterprise world: bridging the gap between legacy data formats (like Excel) and modern web platforms, at scale.
What I Built
I developed an end-to-end, automated data migration pipeline that could:
- Parse Complex Excel Files: Ingest and interpret multi-sheet .xlsm files containing nested data structures and varied formatting.
- Programmatically Upload Files: Reverse-engineer our platform's S3 file upload mechanism to handle evidence files (images, PDFs, etc.) just like a real user would.
- Automate Form Submissions: Use our platform's GraphQL API to create workflows, fill out complex forms, and associate the uploaded evidence.
- Run in Multiple Environments: Operate seamlessly across testing and production environments with different configurations and IDs.
- Benchmark Performance: Track and report on execution time to estimate the total duration for the full-scale migration.
Deep Dive: Parsing and Mapping Complex Excel Data
The Real-World Data Model
The Agency provided a directory of folders, each representing a client (or "team"). Inside each folder:
- A .xlsm Excel file with multiple sheets ("Home", "Data", "Import")
- A Documents/ subfolder with evidence files (PDFs, images, spreadsheets, etc.)
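As a rough sketch, walking that layout looks something like this (assuming the xlsx / SheetJS package; the real script's helpers and library choice may differ):
import * as fs from "fs";
import * as path from "path";
import * as XLSX from "xlsx"; // SheetJS, assumed here for illustration

// Walk the root directory: every subfolder is one client ("team").
function readTeamWorkbooks(rootDir: string) {
  return fs
    .readdirSync(rootDir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .flatMap((entry) => {
      const teamDir = path.join(rootDir, entry.name);
      // Each team folder is expected to contain exactly one .xlsm file.
      const xlsmFile = fs.readdirSync(teamDir).find((f) => f.endsWith(".xlsm"));
      if (!xlsmFile) return [];
      const workbook = XLSX.readFile(path.join(teamDir, xlsmFile));
      return [{ teamName: entry.name, teamDir, workbook }];
    });
}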
Each client had three assessors ("Undertaking", "Initial Approval", "Oversight"), each with their own set of answers and evidence. The data model I built looked like this:
interface OrganizationData {
organizationSlug: string;
teams: Team[];
}
interface Team {
name: string;
uniqueId: string;
type_of_organization: string;
small: boolean;
assessors: Assessor[];
}
interface Assessor {
name: string;
evidence: string[];
backdatedCreatedAt?: string;
formData: any[];
}
Excel Parsing: Edge Cases and Lessons
Parsing the Excel files was far from trivial. Here are some of the real-world challenges I faced:
- Non-Uniform Sheets: Not every file had the same sheet names or structure. Some had missing or extra sheets, requiring defensive code.
- Merged Cells and Inconsistent Data: Some data was in merged cells, or in different columns depending on the client. I had to write logic to detect and adapt to these variations.
- Dutch Date Formats: Dates were in Dutch (e.g., "12-mrt-24"). I wrote a custom parser to handle these, including edge cases like multiple dates in a single cell.
- Reference Number Mapping: Each row in the "Data" sheet corresponded to a question, identified by a reference number (e.g., "5.2.4"). I built a mapping system to find the right row for each reference, even if the order varied.
- Assessor-Specific Logic: The meaning of each column changed depending on the assessor. For "Undertaking", a value in column V meant "present"; for "Oversight", the same column could mean something else, and the value itself might be a code ("O1", "O2", etc.) that needed to be mapped to a score.
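To make the last two points concrete, here is a simplified sketch of the lookup-and-map step. The codes "O1"/"O2" and column V come from the files themselves; the scores they map to are placeholders, not the Agency's real scheme:
// One parsed row from the "Data" sheet (columns reduced for the sketch).
interface DataRow {
  reference: string; // e.g. "5.2.4"
  V: string;
}

// Placeholder mapping: the real Oversight code-to-score table is more involved.
const OVERSIGHT_SCORES: Record<string, number> = { O1: 1, O2: 2, O3: 3 };

function extractAnswer(rows: DataRow[], reference: string, assessor: string) {
  // Find the row for this question by reference number, wherever it sits.
  const row = rows.find((r) => r.reference === reference);
  if (!row) return null;
  if (assessor === "Undertaking") {
    // For Undertaking, any value in column V simply means "present".
    return row.V.trim() ? "present" : "absent";
  }
  // For Oversight, column V holds a code ("O1", "O2", ...) mapped to a score.
  return OVERSIGHT_SCORES[row.V.trim()] ?? null;
}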
Example: Dutch Date Parsing
// Simplified version: assumes 2-digit years are 20xx; for a range or multiple
// dates in one cell (e.g. "12/13-mrt-24") it keeps the last date.
const DUTCH_MONTHS = ["jan", "feb", "mrt", "apr", "mei", "jun", "jul", "aug", "sep", "okt", "nov", "dec"];
function parseDutchDate(dateStr: string): Date | null {
  const match = dateStr.trim().match(/(\d{1,2})-([a-z]{3})-(\d{2})\s*$/i);
  const month = match ? DUTCH_MONTHS.indexOf(match[2].toLowerCase()) : -1;
  return month >= 0 ? new Date(2000 + Number(match![3]), month, Number(match![1])) : null;
}
Example: Multi-Assessor Data Extraction
const assessors: Assessor[] = [
{
name: "Undertaking",
evidence: evidenceFiles,
formData: createAssessorFormData(
dataSheet,
importSheet,
"Undertaking",
referenceNumbers,
),
},
// ... Initial Approval, Oversight
];
Example: Evidence File Discovery
const evidenceFiles = fs
.readdirSync(documentsPath)
.filter((file) =>
[
".pdf",
".xlsx",
".xls",
".doc",
".docx",
".webp",
".jpg",
".jpeg",
".gif",
".csv",
".txt",
].some((ext) => file.endsWith(ext)),
)
.map((file) => path.join(teamDir, "Documents", file));
Automating File Uploads: S3, MIME Types, and Error Handling
Reverse-Engineering the Upload Flow
The platform used a three-step S3 upload process, but there was no documentation. I used browser DevTools to:
- Watch the GraphQL mutation that returned a signed S3 URL and required fields
- Observe the PUT request to S3, including required headers and form fields
- See the follow-up mutation that finalized the upload
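Stitched together, the flow looked roughly like the sketch below. The mutation names and the graphqlRequest helper are placeholders standing in for the platform's real (undocumented) API:
import * as fs from "fs";
import * as path from "path";

// CREATE_UPLOAD_MUTATION, FINALIZE_UPLOAD_MUTATION and graphqlRequest are
// placeholders; the real mutations were discovered via browser DevTools.
async function uploadEvidence(filePath: string): Promise<string> {
  const contentType = getMimeType(filePath);

  // Step 1: ask the API for a signed S3 URL for this file.
  const { url, fileId } = await graphqlRequest(CREATE_UPLOAD_MUTATION, {
    fileName: path.basename(filePath),
    contentType,
  });

  // Step 2: PUT the raw bytes straight to S3 with the headers it expects.
  await fetch(url, {
    method: "PUT",
    headers: { "Content-Type": contentType },
    body: await fs.promises.readFile(filePath),
  });

  // Step 3: tell the platform the upload finished so it can link the file.
  await graphqlRequest(FINALIZE_UPLOAD_MUTATION, { fileId });
  return fileId;
}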
Handling MIME Types and File Validation
Uploading files at scale meant handling a wide variety of file types and edge cases:
- MIME Type Detection: I built a function to map file extensions to MIME types, defaulting to application/octet-stream for unknowns.
- File Size Limits: The script checked for files over 100 MB and skipped or flagged them.
- Empty Files: Zero-byte files were detected and logged.
- Throttling: To avoid rate limits and S3 errors, uploads were throttled with a small delay between each file.
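The size and empty-file checks were simple but caught real problems early. A minimal sketch (the 100 MB limit comes from the list above; the helper name is mine):
import * as fs from "fs";

const MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024; // 100 MB

// Returns true if the file should be uploaded; logs and skips it otherwise.
function validateEvidenceFile(filePath: string): boolean {
  const { size } = fs.statSync(filePath);
  if (size === 0) {
    console.warn(`Skipping zero-byte file: ${filePath}`);
    return false;
  }
  if (size > MAX_FILE_SIZE_BYTES) {
    console.warn(`Flagging oversized file (${size} bytes): ${filePath}`);
    return false;
  }
  return true;
}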
Example: MIME Type Detection
// Abridged map; the full script covers every extension we expected to see.
const mimeTypes: Record<string, string> = { ".pdf": "application/pdf", ".jpg": "image/jpeg", ".csv": "text/csv" };

function getMimeType(filePath: string): string {
  const extension = path.extname(filePath).toLowerCase();
  return mimeTypes[extension] || "application/octet-stream";
}
Example: Throttled Uploads
async function throttledUpload(
  file: string,
  delayMs: number = 10,
): Promise<void> {
  // uploadEvidence is the three-step S3 flow sketched above
  await uploadEvidence(file);
  // small pause between files to stay under rate limits
  await new Promise((resolve) => setTimeout(resolve, delayMs));
}
Example: Error Handling
try {
  await fs.promises.access(filePath, fs.constants.R_OK);
} catch (error) {
  const message = error instanceof Error ? error.message : String(error);
  throw new Error(`File not accessible: ${filePath}. Error: ${message}`);
}
Dynamic, Multi-Environment Automation
Test vs. Production: No Hardcoding Allowed
The migration had to run on both a test account and the real production environment. Each had different tenant IDs, workflow IDs, and team IDs. To make the script portable:
- All IDs were looked up dynamically by name or slug at runtime
- .env files were used for environment-specific configuration
- The script could be pointed at any organization, workflow, or team with a simple config change
Example: Dynamic Workflow Lookup
const availableWorkflows = await availableWorkflowsForUser(tenantId);
const workflowCollectionId = availableWorkflows[0].workflowCollectionId;
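Everything environment-specific lived in .env files. A minimal sketch of how that configuration might be loaded; the variable names are illustrative, not the script's real keys:
import "dotenv/config"; // loads the .env file for the chosen environment

// Illustrative keys; the real script reads its own set of variables.
const config = {
  apiBaseUrl: process.env.API_BASE_URL ?? "http://localhost:3000/graphql",
  organizationSlug: process.env.ORGANIZATION_SLUG ?? "",
  isProduction: process.env.TARGET_ENV === "production",
};

if (!config.organizationSlug) {
  throw new Error("ORGANIZATION_SLUG must be set for this run");
}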
Performance Benchmarking and Reporting
Why It Mattered
With hundreds of clients and thousands of files, knowing how long the migration would take was critical for planning and client communication. I built a timing metrics module to:
- Track start and end times for each team and assessor
- Calculate average time per client
- Output a detailed timing report as JSON
Example: Timing Metrics
const metrics = {
startTime: Date.now(),
clientsProcessed: 0,
clientTimings: [],
endTime: null,
};
// ... see code for full implementation
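From those raw numbers, the report itself is straightforward to derive. A sketch of the average and JSON output, assuming each clientTimings entry carries a name and a durationMs (field names are mine):
import * as fs from "fs";

function writeTimingReport(outPath: string): void {
  const endTime = Date.now();
  const totalMs = metrics.clientTimings.reduce((sum, t) => sum + t.durationMs, 0);
  const report = {
    ...metrics,
    endTime,
    totalDurationMs: endTime - metrics.startTime,
    averageMsPerClient:
      metrics.clientsProcessed > 0
        ? Math.round(totalMs / metrics.clientsProcessed)
        : 0,
  };
  fs.writeFileSync(outPath, JSON.stringify(report, null, 2));
}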
Lessons Learned: Requirements, Communication, and Scope
The Hidden Complexity of "Obvious" Requirements
One of the biggest surprises was how much detail was hidden in the "obvious" parts of the requirements. For example:
- The CSM team would describe a field as "just a date", but the Excel files had multiple date formats, sometimes in Dutch, sometimes with multiple dates in one cell.
- The mapping between Excel columns and form fields was not always 1:1; sometimes it depended on the assessor, or on the value of another cell.
- Some evidence files were missing, duplicated, or in unexpected formats.
The Importance of Asking (and Re-asking) Questions
I learned to:
- Never assume the requirements are fully known up front
- Ask for sample data and edge cases
- Walk through the process with the CSM team, step by step
- Document every mapping and transformation rule
Scope Creep and Change Management
Even after the initial migration logic was built, new requirements kept emerging:
- "Can we add a new field for improvement suggestions?"
- "Some clients have more than three assessors, can we support that?"
- "We need to backdate the creation date for Oversight submissions."
I built the script to be as flexible as possible, but also learned to communicate clearly about what changes would require rework.
Real-World Impact and Next Steps
What This Enabled
- Massive Time Savings: What would have taken weeks of manual data entry was completed in hours.
- Data Quality: Automated validation and error logging caught issues that would have been missed by hand.
- Repeatability: The script can be re-run for new data drops, or adapted for other clients with similar needs.
What I'd Improve Next
- Dry Run Mode: Simulate the migration and output a report of what would be changed, without making any API calls.
- Better Error Reporting: Aggregate errors and warnings into a single report for easier review.
- UI for CSMs: Build a simple web interface so non-engineers can run migrations themselves.
- Parallelization: Carefully parallelize uploads and submissions to further speed up the process, while respecting rate limits.
Key Takeaways
This project taught me a handful of realities about enterprise-scale data migration: it was a mix of meticulous planning, technical reverse-engineering, and constant communication. It showed me that automation, even in the form of a purpose-built script, can provide immense value, strengthen partnerships, and serve as a powerful sales and retention tool. And it definitely showed me that sometimes the most complex technical challenges are best solved by first asking the right human questions.