Scaling a Data Pipeline: Lessons from Migrating Our Investment Data Infrastructure

The Starting Point

When we first built our data pipeline for scraping investment intelligence from sources like Crunchbase, LinkedIn, and SimilarWeb, we optimized for simplicity. Our initial architecture served us well for the first few customers:

Original Architecture:

  • SSH into a VM and manage multiple tmux sessions
  • Run bash scripts for each scraper
  • Download scraped data from S3 to local machines
  • Process data in Jupyter notebooks on laptops
  • Upload processed data back to S3
  • Run manual scripts to import into Postgres (AWS Aurora)

This approach was straightforward and allowed us to iterate quickly. However, as our customer base grew and data volumes increased, we encountered several limitations that taught us valuable lessons about building scalable data infrastructure.

Key Challenges and What We Learned

1. Processing Bottlenecks

Challenge: Our data grew too large to process on individual laptops. What once took minutes started taking hours, and eventually became impossible as datasets exceeded local memory.

Lesson: Design for scale from the beginning. Even if you start small, architect your system to handle 10x or 100x your current data volume.

2. Lack of Replayability

Challenge: We initially built our pipeline as ETL (Extract, Transform, Load), with data cleaning happening at multiple stages. When we needed to reprocess historical data with updated logic, we couldn't easily replay the transformations.

Lesson: ELT (Extract, Load, Transform) provides much more flexibility. Keep raw data intact and version your transformation logic separately.
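
To make that concrete, here is a minimal Polars sketch of the pattern: raw extracts sit untouched in S3, and every transformation is a versioned function you can rerun at any time. The bucket prefix and column names are illustrative, not our actual schema.

```python
import polars as pl

RAW_PREFIX = "s3://example-raw-bucket/crunchbase/"  # immutable raw extracts (placeholder bucket)

def transform_v2(raw: pl.LazyFrame) -> pl.LazyFrame:
    """Cleaning logic lives in version-controlled code, not baked into the data."""
    return (
        raw.filter(pl.col("company_name").is_not_null())
        .with_columns(pl.col("scraped_at").str.to_datetime())
    )

def rebuild_curated() -> None:
    # Replaying history with new logic is just: re-read raw, re-apply the
    # current transform, rewrite the derived dataset.
    raw = pl.scan_parquet(RAW_PREFIX + "*.parquet")  # recent Polars can scan S3 paths directly
    transform_v2(raw).collect().write_parquet("companies_curated.parquet")
```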

3. Missing Observability

Challenge: We had no real-time job statistics, error monitoring, or alerting. Failures were discovered manually, often hours or days later.

Lesson: Observability isn't optional. Build monitoring and alerting into your pipeline from day one.

4. Manual Everything

Challenge: Job scheduling meant SSHing into VMs and running scripts manually. Rotating proxy providers required editing bash files. This approach didn't scale with team growth.

Lesson: Automation pays dividends. The time invested in proper scheduling and configuration management is quickly recouped.

The New Architecture

After 3-4 months of careful migration, here's what we built:

Core Infrastructure Changes

1. Scrapy Framework

  • Replaced custom bash scripts with Scrapy
  • Proper rotating proxy integration (sketch below)
  • Advanced fingerprinting and anti-detection measures
  • Simplified codebase that new engineers could understand quickly
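
As a rough illustration (not our production spider), here is how little code the proxy routing takes in Scrapy; the proxy endpoint, target URL, and selector are placeholders.

```python
import scrapy

PROXY_URL = "http://user:pass@proxy.example.com:8000"  # placeholder rotating-proxy endpoint

class CompanySpider(scrapy.Spider):
    name = "companies"
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,  # spread requests out
        "RETRY_TIMES": 3,       # retry transient proxy errors and bans
    }

    def start_requests(self):
        for url in ["https://example.com/companies/acme"]:  # placeholder target
            # Scrapy's built-in HttpProxyMiddleware routes any request that
            # carries meta["proxy"] through that proxy.
            yield scrapy.Request(url, meta={"proxy": PROXY_URL})

    def parse(self, response):
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
```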

2. ELT with Proper Data Lifecycle

  • Terraform-managed S3 buckets with lifecycle policies (sketch below)
  • Raw data preservation for replayability
  • Clear separation between raw and processed data
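
We manage the buckets in Terraform, but the shape of a lifecycle policy is easier to show in a small boto3 sketch; the bucket name, prefixes, and retention periods here are purely illustrative.

```python
import boto3

s3 = boto3.client("s3")

# Rough equivalent of the Terraform lifecycle rules: raw data is kept (cheaply,
# in cold storage), processed data can expire because it is rebuildable from raw.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-pipeline-data",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            },
            {
                "ID": "expire-processed",
                "Filter": {"Prefix": "processed/"},
                "Status": "Enabled",
                "Expiration": {"Days": 365},
            },
        ]
    },
)
```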

3. Scalable Processing

  • Migrated from Pandas to Polars for better performance (sketch below)
  • Moved processing from local Jupyter notebooks to AWS Batch
  • Implemented proper job queuing and resource allocation
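
The heart of the Polars win is its lazy API: you describe the whole query first, and only the needed rows and columns are read when it runs inside a Batch container. A rough sketch, with made-up paths and columns:

```python
import polars as pl

# Build a lazy query plan; nothing is read until collect().
top_domains = (
    pl.scan_parquet("data/similarweb/*.parquet")  # placeholder path and columns
    .filter(pl.col("country") == "US")
    .group_by("domain")
    .agg(pl.col("visits").sum().alias("total_visits"))
    .sort("total_visits", descending=True)
    .head(100)
)

# Polars optimises the plan (projection and predicate pushdown) and can also
# execute it in streaming mode for larger-than-memory inputs.
result = top_domains.collect()
```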

4. Modern Data Warehouse

  • S3 as our data lake
  • AWS Athena for SQL queries on S3 data (query sketch below)
  • Eliminated the bottleneck of loading everything into Postgres (Aurora)
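
Querying the lake from code is a couple of boto3 calls; the database, table, and results bucket in this sketch are placeholders.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # placeholder region

def run_query(sql: str) -> str:
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "investment_lake"},                    # placeholder database
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},   # placeholder bucket
    )
    query_id = started["QueryExecutionId"]
    # Poll until Athena finishes; results are written as CSV to the output location.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(2)

print(run_query("SELECT domain, SUM(visits) AS visits FROM web_traffic GROUP BY domain LIMIT 10"))
```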

Implementation Details

Processing Migration:
Instead of downloading data locally, we now run Polars-based processing jobs on AWS Batch (job-submission sketch after the list below). This gives us:

  • Automatic scaling based on workload
  • Cost efficiency (pay only for compute used)
  • No local machine limitations
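
Kicking off a processing run is now a single Batch API call instead of an SSH session; the queue, job definition, and script names below are placeholders.

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")  # placeholder region

def submit_processing_job(date: str) -> str:
    response = batch.submit_job(
        jobName=f"polars-processing-{date}",
        jobQueue="scraper-processing-queue",   # placeholder queue
        jobDefinition="polars-processing:3",   # placeholder job definition
        containerOverrides={
            # The container runs the same Polars script for any date partition;
            # Batch handles instance selection, scaling, and retries.
            "command": ["python", "process.py", "--date", date],
        },
    )
    return response["jobId"]

print(submit_processing_job("2024-01-15"))
```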

Monitoring Stack:

  • CloudWatch for infrastructure metrics
  • Custom dashboards for job statistics
  • Automated alerts for failures or anomalies (sketch below)
  • Historical job performance tracking
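
As a rough sketch of the job-statistics side, each spider can publish custom metrics to CloudWatch and an alarm can page on error spikes; every name, threshold, and ARN here is a placeholder.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

def report_job_stats(spider: str, items_scraped: int, errors: int) -> None:
    cloudwatch.put_metric_data(
        Namespace="ScraperPipeline",  # placeholder namespace
        MetricData=[
            {"MetricName": "ItemsScraped", "Value": items_scraped, "Unit": "Count",
             "Dimensions": [{"Name": "Spider", "Value": spider}]},
            {"MetricName": "Errors", "Value": errors, "Unit": "Count",
             "Dimensions": [{"Name": "Spider", "Value": spider}]},
        ],
    )

# One-off setup: alert when a spider's hourly error count spikes.
cloudwatch.put_metric_alarm(
    AlarmName="companies-spider-errors",
    Namespace="ScraperPipeline",
    MetricName="Errors",
    Dimensions=[{"Name": "Spider", "Value": "companies"}],
    Statistic="Sum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=50,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:pipeline-alerts"],  # placeholder SNS topic
)
```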

Developer Experience:

  • Infrastructure as Code with Terraform
  • Standardized development environment
  • Clear documentation and runbooks
  • New engineer onboarding reduced from weeks to days

Results and Benefits

The migration delivered significant improvements:

  1. Performance: Historical data queries that took hours now complete in minutes
  2. Reliability: Automated error recovery and monitoring reduced incidents by 90%
  3. Scalability: Can handle 50x our previous data volume without architecture changes
  4. Team Efficiency: Engineers spend time on features, not infrastructure firefighting
  5. Cost Optimization: Better resource utilization reduced our AWS bill by 40%

Key Takeaways

  1. Start simple, but plan for scale - Our initial architecture was fine for proving the concept; we should have planned the migration path earlier
  2. Invest in observability early - The lack of monitoring made debugging and optimization much harder than necessary
  3. Frameworks over custom code - Scrapy gave us battle-tested patterns and reduced our maintenance burden significantly
  4. Process data where it lives - Moving computation to the data (AWS Batch) rather than data to computation (laptops) was transformative
  5. Version everything - From infrastructure (Terraform) to transformation logic, version control enables confidence in changes

Looking Forward

This migration taught us that technical debt isn't just about code quality—it's about architecture decisions that limit your growth. By investing 3-4 months in a proper rebuild, we've created a platform that will scale with our business for years to come.

For teams facing similar challenges: don't wait until the pain is unbearable. Start planning your migration when you first feel the constraints. The peace of mind from a well-architected system is worth every hour invested.