Migrating a large Excel use case to our platform
Context
A pretty cool moment for our platform was securing a partnership with a major Dutch government agency in the infrastructure sector. The task was large: migrate highly complex compliance data for over 150 of their clients from intricate Excel spreadsheets into our live web application. Each client's data was provided by multiple assessors, included numerous evidence files, and was critical to their operations.
To be honest, this was not just a "technical task"; it was a high-stakes project to prove our platform's capability, solidify a key partnership, and save hundreds of hours of manual work. A successful migration would give both our team and our partner peace of mind, and it would finally let them use our platform with their own data, as intended.
For confidentiality, I’ll refer to the client as the "Agency," but the core challenge is a common one in the enterprise world: bridging the gap between legacy data formats (like Excel) and modern web platforms, at scale.
What I Built
I developed an end-to-end, automated data migration pipeline that could:
- Parse Complex Excel Files: Ingest and interpret multi-sheet .xlsm files containing nested data structures and varied formatting.
- Programmatically Upload Files: Reverse-engineer our platform's S3 file upload mechanism to handle evidence files (images, PDFs, etc.) just like a real user would.
- Automate Form Submissions: Use our platform's GraphQL API to create workflows, fill out complex forms, and associate the uploaded evidence.
- Run in Multiple Environments: Operate seamlessly across testing and production environments with different configurations and IDs.
- Benchmark Performance: Track and report on execution time to estimate the total duration for the full-scale migration.
Deep Dive: Parsing and Mapping Complex Excel Data
The Real-World Data Model
The Agency provided a directory of folders, each representing a client (or "team"). Inside each folder:
- A .xlsm Excel file with multiple sheets ("Home", "Data", "Import")
- A Documents/ subfolder with evidence files (PDFs, images, spreadsheets, etc.)
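As a rough sketch, walking that layout looks something like this (assuming the xlsx / SheetJS package; the real script's helpers and library choice may differ):
import * as fs from "fs";
import * as path from "path";
import * as XLSX from "xlsx"; // SheetJS, assumed here for illustration

// Walk the root directory: every subfolder is one client ("team").
function readTeamWorkbooks(rootDir: string) {
  return fs
    .readdirSync(rootDir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .flatMap((entry) => {
      const teamDir = path.join(rootDir, entry.name);
      // Each team folder is expected to contain exactly one .xlsm file.
      const xlsmFile = fs.readdirSync(teamDir).find((f) => f.endsWith(".xlsm"));
      if (!xlsmFile) return [];
      const workbook = XLSX.readFile(path.join(teamDir, xlsmFile));
      return [{ teamName: entry.name, teamDir, workbook }];
    });
}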
Each client had three assessors ("Undertaking", "Initial Approval", "Oversight"), each with their own set of answers and evidence. The data model I built looked like this:
interface OrganizationData {
organizationSlug: string;
teams: Team[];
}
interface Team {
name: string;
uniqueId: string;
type_of_organization: string;
small: boolean;
assessors: Assessor[];
}
interface Assessor {
name: string;
evidence: string[];
backdatedCreatedAt?: string;
formData: any[];
}
Excel Parsing: Edge Cases and Lessons
Parsing the Excel files was far from trivial. Here are some of the real-world challenges I faced:
- Non-Uniform Sheets: Not every file had the same sheet names or structure. Some had missing or extra sheets, requiring defensive code.
- Merged Cells and Inconsistent Data: Some data was in merged cells, or in different columns depending on the client. I had to write logic to detect and adapt to these variations.
- Dutch Date Formats: Dates were in Dutch (e.g., "12-mrt-24"). I wrote a custom parser to handle these, including edge cases like multiple dates in a single cell.
- Reference Number Mapping: Each row in the "Data" sheet corresponded to a question, identified by a reference number (e.g., "5.2.4"). I built a mapping system to find the right row for each reference, even if the order varied.
- Assessor-Specific Logic: The meaning of each column changed depending on the assessor. For "Undertaking", a value in column V meant "present"; for "Oversight", the same column could mean something else, and the value itself might be a code ("O1", "O2", etc.) that needed to be mapped to a score.
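To make the last two points concrete, here is a simplified sketch of the lookup-and-map step. The codes "O1"/"O2" and column V come from the files themselves; the scores they map to are placeholders, not the Agency's real scheme:
// One parsed row from the "Data" sheet (columns reduced for the sketch).
interface DataRow {
  reference: string; // e.g. "5.2.4"
  V: string;
}

// Placeholder mapping: the real Oversight code-to-score table is more involved.
const OVERSIGHT_SCORES: Record<string, number> = { O1: 1, O2: 2, O3: 3 };

function extractAnswer(rows: DataRow[], reference: string, assessor: string) {
  // Find the row for this question by reference number, wherever it sits.
  const row = rows.find((r) => r.reference === reference);
  if (!row) return null;
  if (assessor === "Undertaking") {
    // For Undertaking, any value in column V simply means "present".
    return row.V.trim() ? "present" : "absent";
  }
  // For Oversight, column V holds a code ("O1", "O2", ...) mapped to a score.
  return OVERSIGHT_SCORES[row.V.trim()] ?? null;
}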
Example: Dutch Date Parsing
// Simplified version: assumes 2-digit years are 20xx; for a range or multiple
// dates in one cell (e.g. "12/13-mrt-24") it keeps the last date.
const DUTCH_MONTHS = ["jan", "feb", "mrt", "apr", "mei", "jun", "jul", "aug", "sep", "okt", "nov", "dec"];
function parseDutchDate(dateStr: string): Date | null {
  const match = dateStr.trim().match(/(\d{1,2})-([a-z]{3})-(\d{2})\s*$/i);
  const month = match ? DUTCH_MONTHS.indexOf(match[2].toLowerCase()) : -1;
  return month >= 0 ? new Date(2000 + Number(match![3]), month, Number(match![1])) : null;
}
Example: Multi-Assessor Data Extraction
const assessors: Assessor[] = [
{
name: "Undertaking",
evidence: evidenceFiles,
formData: createAssessorFormData(
dataSheet,
importSheet,
"Undertaking",
referenceNumbers,
),
},
// ... Initial Approval, Oversight
];
Example: Evidence File Discovery
const evidenceFiles = fs
.readdirSync(documentsPath)
.filter((file) =>
[
".pdf",
".xlsx",
".xls",
".doc",
".docx",
".webp",
".jpg",
".jpeg",
".gif",
".csv",
".txt",
].some((ext) => file.endsWith(ext)),
)
.map((file) => path.join(teamDir, "Documents", file));
Automating File Uploads: S3, MIME Types, and Error Handling
Reverse-Engineering the Upload Flow
The platform used a three-step S3 upload process, but there was no documentation. I used browser DevTools to:
- Watch the GraphQL mutation that returned a signed S3 URL and required fields
- Observe the PUT request to S3, including required headers and form fields
- See the follow-up mutation that finalized the upload
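Stitched together, the flow looked roughly like the sketch below. The mutation names and the graphqlRequest helper are placeholders standing in for the platform's real (undocumented) API:
import * as fs from "fs";
import * as path from "path";

// CREATE_UPLOAD_MUTATION, FINALIZE_UPLOAD_MUTATION and graphqlRequest are
// placeholders; the real mutations were discovered via browser DevTools.
async function uploadEvidence(filePath: string): Promise<string> {
  const contentType = getMimeType(filePath);

  // Step 1: ask the API for a signed S3 URL for this file.
  const { url, fileId } = await graphqlRequest(CREATE_UPLOAD_MUTATION, {
    fileName: path.basename(filePath),
    contentType,
  });

  // Step 2: PUT the raw bytes straight to S3 with the headers it expects.
  await fetch(url, {
    method: "PUT",
    headers: { "Content-Type": contentType },
    body: await fs.promises.readFile(filePath),
  });

  // Step 3: tell the platform the upload finished so it can link the file.
  await graphqlRequest(FINALIZE_UPLOAD_MUTATION, { fileId });
  return fileId;
}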
Handling MIME Types and File Validation
Uploading files at scale meant handling a wide variety of file types and edge cases:
- MIME Type Detection: I built a function to map file extensions to MIME types, defaulting to application/octet-stream for unknowns.
- File Size Limits: The script checked for files over 100 MB and skipped or flagged them.
- Empty Files: Zero-byte files were detected and logged.
- Throttling: To avoid rate limits and S3 errors, uploads were throttled with a small delay between each file.
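The size and empty-file checks were simple but caught real problems early. A minimal sketch (the 100 MB limit comes from the list above; the helper name is mine):
import * as fs from "fs";

const MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024; // 100 MB

// Returns true if the file should be uploaded; logs and skips it otherwise.
function validateEvidenceFile(filePath: string): boolean {
  const { size } = fs.statSync(filePath);
  if (size === 0) {
    console.warn(`Skipping zero-byte file: ${filePath}`);
    return false;
  }
  if (size > MAX_FILE_SIZE_BYTES) {
    console.warn(`Flagging oversized file (${size} bytes): ${filePath}`);
    return false;
  }
  return true;
}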
Example: MIME Type Detection
// Abridged map; the full script covers every extension we expected to see.
const mimeTypes: Record<string, string> = { ".pdf": "application/pdf", ".jpg": "image/jpeg", ".csv": "text/csv" };

function getMimeType(filePath: string): string {
  const extension = path.extname(filePath).toLowerCase();
  return mimeTypes[extension] || "application/octet-stream";
}
Example: Throttled Uploads
async function throttledUpload(
  file: string,
  delayMs: number = 10,
): Promise<void> {
  // uploadEvidence is the three-step S3 flow sketched above
  await uploadEvidence(file);
  // small pause between files to stay under rate limits
  await new Promise((resolve) => setTimeout(resolve, delayMs));
}
Example: Error Handling
try {
  await fs.promises.access(filePath, fs.constants.R_OK);
} catch (error) {
  const message = error instanceof Error ? error.message : String(error);
  throw new Error(`File not accessible: ${filePath}. Error: ${message}`);
}
Dynamic, Multi-Environment Automation
Test vs. Production: No Hardcoding Allowed
The migration had to run on both a test account and the real production environment. Each had different tenant IDs, workflow IDs, and team IDs. To make the script portable:
- All IDs were looked up dynamically by name or slug at runtime
- .env files were used for environment-specific configuration
- The script could be pointed at any organization, workflow, or team with a simple config change
Example: Dynamic Workflow Lookup
const availableWorkflows = await availableWorkflowsForUser(tenantId);
const workflowCollectionId = availableWorkflows[0].workflowCollectionId;
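Everything environment-specific lived in .env files. A minimal sketch of how that configuration might be loaded; the variable names are illustrative, not the script's real keys:
import "dotenv/config"; // loads the .env file for the chosen environment

// Illustrative keys; the real script reads its own set of variables.
const config = {
  apiBaseUrl: process.env.API_BASE_URL ?? "http://localhost:3000/graphql",
  organizationSlug: process.env.ORGANIZATION_SLUG ?? "",
  isProduction: process.env.TARGET_ENV === "production",
};

if (!config.organizationSlug) {
  throw new Error("ORGANIZATION_SLUG must be set for this run");
}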
Performance Benchmarking and Reporting
Why It Mattered
With hundreds of clients and thousands of files, knowing how long the migration would take was critical for planning and client communication. I built a timing metrics module to:
- Track start and end times for each team and assessor
- Calculate average time per client
- Output a detailed timing report as JSON
Example: Timing Metrics
const metrics = {
startTime: Date.now(),
clientsProcessed: 0,
clientTimings: [],
endTime: null,
};
// ... see code for full implementation
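From those raw numbers, the report itself is straightforward to derive. A sketch of the average and JSON output, assuming each clientTimings entry carries a name and a durationMs (field names are mine):
import * as fs from "fs";

function writeTimingReport(outPath: string): void {
  const endTime = Date.now();
  const totalMs = metrics.clientTimings.reduce((sum, t) => sum + t.durationMs, 0);
  const report = {
    ...metrics,
    endTime,
    totalDurationMs: endTime - metrics.startTime,
    averageMsPerClient:
      metrics.clientsProcessed > 0
        ? Math.round(totalMs / metrics.clientsProcessed)
        : 0,
  };
  fs.writeFileSync(outPath, JSON.stringify(report, null, 2));
}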
Lessons Learned: Requirements, Communication, and Scope
The Hidden Complexity of "Obvious" Requirements
One of the biggest surprises was how much detail was hidden in the "obvious" parts of the requirements. For example:
- The CSM team would describe a field as "just a date", but the Excel files had multiple date formats, sometimes in Dutch, sometimes with multiple dates in one cell.
- The mapping between Excel columns and form fields was not always 1:1; sometimes it depended on the assessor, or on the value of another cell.
- Some evidence files were missing, duplicated, or in unexpected formats.
The Importance of Asking (and Re-asking) Questions
I learned to:
- Never assume the requirements are fully known up front
- Ask for sample data and edge cases
- Walk through the process with the CSM team, step by step
- Document every mapping and transformation rule
Scope Creep and Change Management
Even after the initial migration logic was built, new requirements kept emerging:
- "Can we add a new field for improvement suggestions?"
- "Some clients have more than three assessors, can we support that?"
- "We need to backdate the creation date for Oversight submissions."
I built the script to be as flexible as possible, but also learned to communicate clearly about what changes would require rework.
Real-World Impact and Next Steps
What This Enabled
- Massive Time Savings: What would have taken weeks of manual data entry was completed in hours.
- Data Quality: Automated validation and error logging caught issues that would have been missed by hand.
- Repeatability: The script can be re-run for new data drops, or adapted for other clients with similar needs.
What I'd Improve Next
- Dry Run Mode: Simulate the migration and output a report of what would be changed, without making any API calls.
- Better Error Reporting: Aggregate errors and warnings into a single report for easier review.
- UI for CSMs: Build a simple web interface so non-engineers can run migrations themselves.
- Parallelization: Carefully parallelize uploads and submissions to further speed up the process, while respecting rate limits.
Key Takeaways
This project taught me a handful of realities about enterprise-scale data migration: it was a mix of meticulous planning, technical reverse-engineering, and constant communication. It showed me that automation, even in the form of a purpose-built script, can provide immense value, strengthen partnerships, and serve as a powerful sales and retention tool. And it definitely showed me that sometimes the most complex technical challenges are best solved by first asking the right human questions.