Lessons learned from processing millions of Celery tasks without everything catching fire

When we started building Sugar Connect (our SugarCRM integration platform), we had a few hundred users syncing data between SugarCRM and their other business tools. Fast forward two years, and we're handling tens of thousands of active users, processing millions of tasks daily, and orchestrating complex sync workflows.

Here's how we scaled—and all the things that broke along the way.

The Initial Architecture

Our stack was straightforward:

  • Django REST API running on GKE
  • Celery workers for async processing
  • RabbitMQ for message queuing
  • Redis for caching and Celery results
  • PostgreSQL for our main database

Everything worked great at small scale. Then growth happened.

Database Pooling: The First Bottleneck

Around 5,000 users, our database started choking. Celery workers were opening connections like there was no tomorrow. With default settings, each worker process maintained its own connection:

# What we had (bad)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'HOST': 'cloudsql-proxy',
        'PORT': '5432',
        # No pooling configuration
    }
}

With 50 workers × 4 processes each = 200 connections just from workers. Add the API servers, and we were hitting Cloud SQL connection limits.

The fix was PgBouncer: putting a connection pooler between the workers and Cloud SQL let hundreds of worker processes share a much smaller pool of actual server connections.
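
On the Django side, the only change is that workers connect to PgBouncer instead of Cloud SQL directly. A minimal sketch, assuming PgBouncer runs in transaction pooling mode (the host, port, and pooling mode here are illustrative, not our exact setup):

# What we moved to (workers talk to PgBouncer, not Cloud SQL)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'HOST': 'pgbouncer',                   # PgBouncer service in the cluster
        'PORT': '6432',                        # PgBouncer's default port
        'CONN_MAX_AGE': 0,                     # let PgBouncer own connection reuse
        'DISABLE_SERVER_SIDE_CURSORS': True,   # required with transaction pooling
    }
}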

RabbitMQ Horizontal Scaling

Our next crisis hit around 10,000 users. RabbitMQ started dropping messages during peak sync times. A single RabbitMQ instance couldn't handle the message throughput.

We moved to a clustered setup, but clustering introduced problems of its own: mirroring every queue across nodes killed throughput. We had to carefully tune which queues actually needed HA:

# celery_config.py
task_queues = {
    'critical': {
        'exchange': 'critical',
        'routing_key': 'critical',
        # Mirrored across the cluster via a RabbitMQ HA policy, e.g.:
        #   rabbitmqctl set_policy ha-critical "^critical$" '{"ha-mode":"all"}'
    },
    'bulk': {
        'exchange': 'bulk',
        'routing_key': 'bulk',
        # No mirroring - we can afford to lose these
    },
}
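
Getting tasks onto the right queue is then just routing configuration. A small sketch, with made-up task module paths (Celery's task_routes accepts glob patterns):

# celery_config.py (continued)
task_routes = {
    'sugar_connect.tasks.sync_account': {'queue': 'critical'},
    'sugar_connect.tasks.update_sync_status': {'queue': 'critical'},
    'sugar_connect.tasks.bulk_*': {'queue': 'bulk'},   # work we can afford to lose
}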

Redis Scaling and Split Brain

Redis was easier—until it wasn't. We started with a single Redis instance for both cache and Celery results. At scale, this became a single point of failure.

We split Redis by function:

# settings.py
CACHES = {
    'default': {
        'BACKEND': 'django_redis.cache.RedisCache',
        'LOCATION': 'redis://redis-cache:6379/0',
    }
}

# Separate Redis for Celery results, reached through Sentinel so failover is automatic
CELERY_RESULT_BACKEND = 'sentinel://redis-celery:26379/0'
CELERY_RESULT_BACKEND_TRANSPORT_OPTIONS = {
    'master_name': 'redis-celery-master',
    'sentinel_kwargs': {'password': REDIS_PASSWORD},
}

Redis Sentinel handled failover, but we learned the hard way about split-brain scenarios when network partitions happened in GKE.
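
Failover only works because clients ask Sentinel for the current master instead of hard-coding an address. A minimal redis-py sketch of that discovery step (hostnames and timeouts are illustrative):

from redis.sentinel import Sentinel

sentinel = Sentinel([('redis-celery', 26379)], socket_timeout=0.5)
master = sentinel.master_for('redis-celery-master', password=REDIS_PASSWORD)
master.ping()  # resolves to whichever node Sentinel currently considers the master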

Read Replicas and Task Routing

With millions of tasks running daily, read queries started overwhelming our primary database. We added read replicas, but Django's database routers have no built-in way of knowing that a query is running inside a Celery task, so by default everything still hit the primary.

We built a custom database router:

class CeleryDatabaseRouter:
    def db_for_read(self, model, **hints):
        instance = hints.get('instance')
        # _in_celery_task is a flag set on model instances while they're being
        # processed inside a Celery task
        if getattr(instance, '_in_celery_task', False):
            # Route read-heavy models to the replica
            if instance.__class__.__name__ in READ_HEAVY_MODELS:
                return 'replica'
        return 'default'

    def db_for_write(self, model, **hints):
        return 'default'  # writes always go to the primary
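
Wiring it in is a settings change plus a second database alias. A sketch, with the dotted path and replica host as placeholders (READ_HEAVY_MODELS is just a set of model names kept alongside the router):

# settings.py
DATABASE_ROUTERS = ['sugar_connect.db_routers.CeleryDatabaseRouter']  # hypothetical path

DATABASES['replica'] = {
    **DATABASES['default'],
    'HOST': 'cloudsql-proxy-replica',  # hypothetical replica endpoint
}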

Task Orchestration Nightmares

Our sync workflows involved complex task chains:

@app.task
def sync_account(account_id):
    chain = (
        fetch_sugar_data.s(account_id) |
        transform_data.s() |
        group([
            sync_contacts.s(),
            sync_opportunities.s(),
            sync_activities.s()
        ]) |
        update_sync_status.s(account_id)
    )
    chain.apply_async()

At scale, these chains created cascading failures. One stuck task would block entire sync operations. We had to implement circuit breakers and timeouts at every level.
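
The timeout half of that is mostly declarative in Celery. A rough sketch of the pattern (the limits, the sugar_client helper, and the mark_sync_failed cleanup task are illustrative, not our exact code):

from celery.exceptions import SoftTimeLimitExceeded

@app.task(
    bind=True,
    soft_time_limit=120,               # raises SoftTimeLimitExceeded inside the task
    time_limit=180,                    # hard kill if the soft limit is ignored
    autoretry_for=(ConnectionError,),  # retry transient network failures
    retry_backoff=True,
    max_retries=3,
)
def fetch_sugar_data(self, account_id):
    try:
        return sugar_client.fetch_account(account_id)  # hypothetical API client
    except SoftTimeLimitExceeded:
        # Fail the sync explicitly instead of leaving the rest of the chain stuck
        mark_sync_failed.delay(account_id)
        raise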

Race Conditions and Deadlocks

The real fun began with concurrent syncs. Users would trigger multiple syncs, causing race conditions:

# The problem
@app.task
def update_contact(contact_id):
    contact = Contact.objects.get(id=contact_id)
    contact.last_synced = timezone.now()
    contact.save()  # Race condition!

We moved to explicit locking:

@app.task
def update_contact(contact_id):
    with transaction.atomic():
        contact = Contact.objects.select_for_update().get(id=contact_id)
        contact.last_synced = timezone.now()
        contact.save()

But this created deadlocks when tasks accessed resources in different orders. We standardized locking order and added deadlock retry logic.
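
The retry side can lean on Celery's built-in retries: Postgres surfaces a deadlock as an OperationalError, so the task just backs off and tries again. A sketch, with the retry counts as assumptions:

from django.db.utils import OperationalError

@app.task(
    bind=True,
    autoretry_for=(OperationalError,),  # deadlocks surface as OperationalError
    retry_backoff=True,                 # exponential backoff between attempts
    max_retries=5,
)
def update_contact(self, contact_id):
    with transaction.atomic():
        contact = Contact.objects.select_for_update().get(id=contact_id)
        contact.last_synced = timezone.now()
        contact.save()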

Long Migration Timeouts

Database migrations became a nightmare. Adding an index to a large table would time out:

# migrations/0042_add_sync_index.py
class Migration(migrations.Migration):
    # CREATE INDEX CONCURRENTLY can't run inside a transaction block
    atomic = False

    operations = [
        migrations.RunSQL(
            "CREATE INDEX CONCURRENTLY idx_sync_status ON sync_records(status);",
            reverse_sql="DROP INDEX CONCURRENTLY IF EXISTS idx_sync_status;",
        ),
    ]

We learned to:

  • Always use CONCURRENTLY for index creation
  • Run migrations during low-traffic windows
  • Split large migrations into smaller chunks (see the sketch below)
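
"Smaller chunks" in practice usually meant turning one giant data backfill into many short transactions. A sketch of the idea (file name, app label, model, and batch size are illustrative):

# migrations/0043_backfill_sync_status.py
from django.db import migrations


def backfill_sync_status(apps, schema_editor):
    SyncRecord = apps.get_model('sync', 'SyncRecord')  # hypothetical app label
    batch_size = 5000
    while True:
        pks = list(
            SyncRecord.objects.filter(status__isnull=True)
            .values_list('pk', flat=True)[:batch_size]
        )
        if not pks:
            break
        # Each UPDATE is its own short transaction instead of one huge one
        SyncRecord.objects.filter(pk__in=pks).update(status='pending')


class Migration(migrations.Migration):
    atomic = False  # so each batch commits on its own

    operations = [
        migrations.RunPython(backfill_sync_status, migrations.RunPython.noop),
    ]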

GKE Scaling Practices

Our GKE setup evolved to handle the load:

# worker-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: celery-workers
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: celery-workers
  minReplicas: 10
  maxReplicas: 100
  metrics:
  - type: External
    external:
      metric:
        name: rabbitmq_queue_messages_ready
        selector:
          matchLabels:
            queue: "default"
      target:
        type: AverageValue
        averageValue: "30"  # Tasks per worker

We used node pools for different workload types:

# High-memory pool for data processing tasks
gcloud container node-pools create processing-pool \
  --cluster=sugar-connect \
  --machine-type=n2-highmem-4 \
  --enable-autoscaling \
  --num-nodes=3 \
  --min-nodes=3 \
  --max-nodes=20

# Standard pool for API and light tasks  
gcloud container node-pools create standard-pool \
  --cluster=sugar-connect \
  --machine-type=n2-standard-4 \
  --enable-autoscaling \
  --num-nodes=5 \
  --min-nodes=5 \
  --max-nodes=50

Current State

Today, Sugar Connect handles:

  • 50,000+ active users
  • 10M+ tasks processed daily
  • 500+ requests per second during peak
  • 99.9% uptime (those three 9s were hard-earned)

The architecture isn't perfect, but it scales predictably. We can handle traffic spikes, long-running migrations don't cause downtime, and our engineers can actually sleep at night.

Key Takeaways

  1. Default Celery settings don't scale - Connection pooling, custom routing, and careful timeout tuning are mandatory
  2. Horizontal scaling introduces complexity - Distributed systems mean distributed problems
  3. Database operations in tasks are tricky - Race conditions and deadlocks will find you
  4. GKE autoscaling needs proper metrics - CPU/memory aren't enough; use queue depth
  5. Test your failure modes - That RabbitMQ cluster failover? Make sure it actually works

The journey from hundreds to tens of thousands of users taught us that scaling isn't just about adding more servers. It's about understanding where your system breaks and building resilience at every layer.