Scaling a Data Pipeline: Lessons from Migrating Our Investment Data Infrastructure

The Starting Point

When we first built our data pipeline for scraping investment intelligence from sources like Crunchbase, LinkedIn, and SimilarWeb, we optimized for simplicity. Our initial architecture served us well for the first few customers:

Original Architecture:

  • SSH into a VM and manage multiple tmux sessions
  • Run bash scripts for each scraper
  • Download scraped data from S3 to local machines
  • Process data in Jupyter notebooks on laptops
  • Upload processed data back to S3
  • Run manual scripts to import into Postgres (AWS Aurora)

This approach was straightforward and allowed us to iterate quickly. However, as our customer base grew and data volumes increased, we encountered several limitations that taught us valuable lessons about building scalable data infrastructure.

Key Challenges and What We Learned

1. Processing Bottlenecks

Challenge: Our data grew too large to process on individual laptops. What once took minutes started taking hours, and eventually became impossible as datasets exceeded local memory.

Lesson: Design for scale from the beginning. Even if you start small, architect your system to handle 10x or 100x your current data volume.

2. Lack of Replayability

Challenge: We initially built our pipeline as ETL (Extract, Transform, Load), with data cleaning happening at multiple stages. When we needed to reprocess historical data with updated logic, we couldn't easily replay the transformations.

Lesson: ELT (Extract, Load, Transform) provides much more flexibility. Keep raw data intact and version your transformation logic separately.
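
To make that concrete, here is a minimal Polars sketch of the pattern: raw extracts sit untouched in S3, and every transformation is a versioned function you can rerun at any time. The bucket prefix and column names are illustrative, not our actual schema.

```python
import polars as pl

RAW_PREFIX = "s3://example-raw-bucket/crunchbase/"  # immutable raw extracts (placeholder bucket)

def transform_v2(raw: pl.LazyFrame) -> pl.LazyFrame:
    """Cleaning logic lives in version-controlled code, not baked into the data."""
    return (
        raw.filter(pl.col("company_name").is_not_null())
        .with_columns(pl.col("scraped_at").str.to_datetime())
    )

def rebuild_curated() -> None:
    # Replaying history with new logic is just: re-read raw, re-apply the
    # current transform, rewrite the derived dataset.
    raw = pl.scan_parquet(RAW_PREFIX + "*.parquet")  # recent Polars can scan S3 paths directly
    transform_v2(raw).collect().write_parquet("companies_curated.parquet")
```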

3. Missing Observability

Challenge: We had no real-time job statistics, error monitoring, or alerting. Failures were discovered manually, often hours or days later.

Lesson: Observability isn't optional. Build monitoring and alerting into your pipeline from day one.

4. Manual Everything

Challenge: Job scheduling meant SSHing into VMs and running scripts manually. Rotating proxy providers required editing bash files. This approach didn't scale with team growth.

Lesson: Automation pays dividends. The time invested in proper scheduling and configuration management is quickly recouped.

The New Architecture

After 3-4 months of careful migration, here's what we built:

Core Infrastructure Changes

1. Scrapy Framework

  • Replaced custom bash scripts with Scrapy
  • Proper rotating proxy integration (sketch below)
  • Advanced fingerprinting and anti-detection measures
  • Simplified codebase that new engineers could understand quickly
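
As a rough illustration (not our production spider), here is how little code the proxy routing takes in Scrapy; the proxy endpoint, target URL, and selector are placeholders.

```python
import scrapy

PROXY_URL = "http://user:pass@proxy.example.com:8000"  # placeholder rotating-proxy endpoint

class CompanySpider(scrapy.Spider):
    name = "companies"
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,  # spread requests out
        "RETRY_TIMES": 3,       # retry transient proxy errors and bans
    }

    def start_requests(self):
        for url in ["https://example.com/companies/acme"]:  # placeholder target
            # Scrapy's built-in HttpProxyMiddleware routes any request that
            # carries meta["proxy"] through that proxy.
            yield scrapy.Request(url, meta={"proxy": PROXY_URL})

    def parse(self, response):
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
```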

2. ELT with Proper Data Lifecycle

  • Terraform-managed S3 buckets with lifecycle policies (sketch below)
  • Raw data preservation for replayability
  • Clear separation between raw and processed data
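
We manage the buckets in Terraform, but the shape of a lifecycle policy is easier to show in a small boto3 sketch; the bucket name, prefixes, and retention periods here are purely illustrative.

```python
import boto3

s3 = boto3.client("s3")

# Rough equivalent of the Terraform lifecycle rules: raw data is kept (cheaply,
# in cold storage), processed data can expire because it is rebuildable from raw.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-pipeline-data",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            },
            {
                "ID": "expire-processed",
                "Filter": {"Prefix": "processed/"},
                "Status": "Enabled",
                "Expiration": {"Days": 365},
            },
        ]
    },
)
```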

3. Scalable Processing

  • Migrated from Pandas to Polars for better performance (sketch below)
  • Moved processing from local Jupyter notebooks to AWS Batch
  • Implemented proper job queuing and resource allocation
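
The heart of the Polars win is its lazy API: you describe the whole query first, and only the needed rows and columns are read when it runs inside a Batch container. A rough sketch, with made-up paths and columns:

```python
import polars as pl

# Build a lazy query plan; nothing is read until collect().
top_domains = (
    pl.scan_parquet("data/similarweb/*.parquet")  # placeholder path and columns
    .filter(pl.col("country") == "US")
    .group_by("domain")
    .agg(pl.col("visits").sum().alias("total_visits"))
    .sort("total_visits", descending=True)
    .head(100)
)

# Polars optimises the plan (projection and predicate pushdown) and can also
# execute it in streaming mode for larger-than-memory inputs.
result = top_domains.collect()
```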

4. Modern Data Warehouse

  • S3 as our data lake
  • AWS Athena for SQL queries on S3 data (query sketch below)
  • Eliminated the bottleneck of loading everything into Postgres (Aurora)
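
Querying the lake from code is a couple of boto3 calls; the database, table, and results bucket in this sketch are placeholders.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # placeholder region

def run_query(sql: str) -> str:
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "investment_lake"},                    # placeholder database
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},   # placeholder bucket
    )
    query_id = started["QueryExecutionId"]
    # Poll until Athena finishes; results are written as CSV to the output location.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(2)

print(run_query("SELECT domain, SUM(visits) AS visits FROM web_traffic GROUP BY domain LIMIT 10"))
```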

Implementation Details

Processing Migration:
Instead of downloading data locally, we now run Polars-based processing jobs on AWS Batch (job-submission sketch after the list below). This gives us:

  • Automatic scaling based on workload
  • Cost efficiency (pay only for compute used)
  • No local machine limitations
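
Kicking off a processing run is now a single Batch API call instead of an SSH session; the queue, job definition, and script names below are placeholders.

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")  # placeholder region

def submit_processing_job(date: str) -> str:
    response = batch.submit_job(
        jobName=f"polars-processing-{date}",
        jobQueue="scraper-processing-queue",   # placeholder queue
        jobDefinition="polars-processing:3",   # placeholder job definition
        containerOverrides={
            # The container runs the same Polars script for any date partition;
            # Batch handles instance selection, scaling, and retries.
            "command": ["python", "process.py", "--date", date],
        },
    )
    return response["jobId"]

print(submit_processing_job("2024-01-15"))
```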

Monitoring Stack:

  • CloudWatch for infrastructure metrics
  • Custom dashboards for job statistics
  • Automated alerts for failures or anomalies (sketch below)
  • Historical job performance tracking
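
As a rough sketch of the job-statistics side, each spider can publish custom metrics to CloudWatch and an alarm can page on error spikes; every name, threshold, and ARN here is a placeholder.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

def report_job_stats(spider: str, items_scraped: int, errors: int) -> None:
    cloudwatch.put_metric_data(
        Namespace="ScraperPipeline",  # placeholder namespace
        MetricData=[
            {"MetricName": "ItemsScraped", "Value": items_scraped, "Unit": "Count",
             "Dimensions": [{"Name": "Spider", "Value": spider}]},
            {"MetricName": "Errors", "Value": errors, "Unit": "Count",
             "Dimensions": [{"Name": "Spider", "Value": spider}]},
        ],
    )

# One-off setup: alert when a spider's hourly error count spikes.
cloudwatch.put_metric_alarm(
    AlarmName="companies-spider-errors",
    Namespace="ScraperPipeline",
    MetricName="Errors",
    Dimensions=[{"Name": "Spider", "Value": "companies"}],
    Statistic="Sum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=50,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:pipeline-alerts"],  # placeholder SNS topic
)
```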

Developer Experience:

  • Infrastructure as Code with Terraform
  • Standardized development environment
  • Clear documentation and runbooks
  • New engineer onboarding reduced from weeks to days

Results and Benefits

The migration delivered significant improvements:

  1. Performance: Historical data queries that took hours now complete in minutes
  2. Reliability: Automated error recovery and monitoring reduced incidents by 90%
  3. Scalability: Can handle 50x our previous data volume without architecture changes
  4. Team Efficiency: Engineers spend time on features, not infrastructure firefighting
  5. Cost Optimization: Better resource utilization reduced our AWS bill by 40%

Key Takeaways

  1. Start simple, but plan for scale - Our initial architecture was fine for proving the concept; we should have planned the migration path earlier
  2. Invest in observability early - The lack of monitoring made debugging and optimization much harder than necessary
  3. Frameworks over custom code - Scrapy gave us battle-tested patterns and reduced our maintenance burden significantly
  4. Process data where it lives - Moving computation to the data (AWS Batch) rather than data to computation (laptops) was transformative
  5. Version everything - From infrastructure (Terraform) to transformation logic, version control enables confidence in changes

Looking Forward

This migration taught us that technical debt isn't just about code quality—it's about architecture decisions that limit your growth. By investing 3-4 months in a proper rebuild, we've created a platform that will scale with our business for years to come.

For teams facing similar challenges: don't wait until the pain is unbearable. Start planning your migration when you first feel the constraints. The peace of mind from a well-architected system is worth every hour invested.