El blog de Trespams


Celery, Redis and Django

Disclaimer: this is my first post in English. It is a free translation of the original Catalan post.

In previous posts I have written about Celery and Django Celery, a system for managing queues and tasks in Python and Django.

The Celery documentation recommends RabbitMQ as the message broker, that is, the application that receives the tasks our application sends and distributes them among the workers we have configured in our system.

Once a worker has finished a task, it leaves the result (if we have configured it to do so) in the results backend. Usually this is the same service as the message broker, so RabbitMQ acts both as message broker and as results backend.

The architecture of Celery is very powerful in the sense that it lets us scale up and down and replace parts to fit the application to our needs. We may have applications that need some sort of message or task distribution but do not need the complexity or the system requirements of RabbitMQ. With Celery we can even use a database as the message broker and save the results there too, and we can replace the serialization routines and the results storage. So, although the documentation has a preferred configuration, we can adapt it to the needs of our application.

In this post I'll present a configuration that sits in the middle: less complex than a RabbitMQ solution but powerful enough to fit most application needs.

What's the problem we'll try to solve?

We have an application that we want to run on a small server or on a shared server, and it needs some sort of distributed tasks or periodic tasks that we want to manage inside the application itself. We would like:

  • Minimum complexity to install and configure the task system.
  • We'd like not to have a dedicated broker.
  • We'd like to monitor what's happening in our application.
  • We'd like to manage our system easily.
  • We'd like to debug our application and run everything on a local server before running it with a distributed task configuration.
  • We'd like our task broker to have very low memory requirements.

We can imagine lots of scenarios where these requirements would fit: a news aggregator, an e-commerce application that needs to send invoices, a small document management system that performs some sort of format conversion. That is, systems that need a short response time for the user and can do the heavy work asynchronously, and where the reliability of the task system is not critical for the application.

For that kind of application Celery with RabbitMQ is overkill, so we're going to put it on a bit of a diet.

The broker

We want the distribution of tasks to be powerful and flexible, but without the complexity of RabbitMQ. So what we will do is install Redis, a NoSQL database that works in a similar way to memcached.

Redis is very fast and consumes few machine resources, it can periodically persist its data, and it is generic enough to be used in our applications for other purposes besides task management. A [presentation by Simon Willison](http://simonwillison.net/static/2010/redis-tutorial/) summarizes the possibilities of this database very well.

Redis has, however, an important requirement that we must be aware of: in its standard configuration all the data has to fit in memory, and changes are synchronized to disk periodically. So we must monitor our application to make sure Redis does not grow unnoticed until it consumes all the available memory.

Installing Redis

Redis is packaged in the major Linux distributions, and on Debian-based distros it is enough to type

sudo apt-get install redis-server

As we're going to use Redis from Celery, we must also install the Python API

pip install redis

inside our virtualenv (I suppose everybody is using virtualenv ...)

On a brand new installation on Ubuntu 10.10, Redis consumes 3271B of virtual memory and 1516B of resident memory in a single process.

In a production environment we would surely want to configure some parameters:

  • bind, to bind the Redis instance to an IP address.
  • loglevel, which is verbose in the default configuration; in production notice or warning would be enough.

On Ubuntu the Redis configuration file is /etc/redis/redis.conf. It is extensively documented, which lets us adapt it to our needs.

The results storage

As mentioned, Celery also lets us define where to store the results. Redis is a general-purpose database, so in addition to acting as the task broker it can store the results of the tasks. As pointed out before, we have to monitor Redis if we plan to store a lot of data or if we store big results, because Redis keeps the whole database in memory.

Usually in a task/queue system we want to keep the results just long enough to see that everything is going well; after that we don't need them anymore. That is, the results do not necessarily have to remain in the database, and how long we need to keep them depends greatly on our application.

Let me explain. We use Celery in a B2C application to update the information we have about hotels. We launch the update periodically, and each task is able to launch other tasks. Once the information is received, it is processed. So the results only need to stay in the database for as long as a worker needs to process them; after that we can delete them. As the process is quite fast, it is much simpler to make the results expire after 60 seconds than to write the code to delete them.

If we need to create a task to send an invoice, we do not need to save the invoice in Redis; we just need to update our database to mark the invoice as sent once the worker has finished generating the PDF and the mail has been sent.
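To make the invoice case concrete, here is a minimal sketch. Everything in it is hypothetical: the function name send_invoice and the three helpers passed in (render_pdf, send_mail, mark_sent) are stand-ins for whatever your project uses. In a real project the function would be registered with Celery's task decorator; here it is a plain function so it can run without a broker.

```python
def send_invoice(invoice_id, render_pdf, send_mail, mark_sent):
    """Generate and mail one invoice, then flag it as sent.

    The three callables are hypothetical helpers supplied by the project.
    In a Celery project this function would be registered as a task; it is
    kept as a plain function here so it runs without a broker.
    """
    pdf = render_pdf(invoice_id)   # heavy work, done inside the worker
    send_mail(invoice_id, pdf)     # side effect: the customer receives the PDF
    mark_sent(invoice_id)          # the state lives in our own database
    return invoice_id              # tiny result: the PDF never goes to Redis


if __name__ == "__main__":
    # Stand-in helpers, just to show the call.
    outbox = []
    done = []
    send_invoice(
        42,
        render_pdf=lambda i: "pdf-%d" % i,
        send_mail=lambda i, pdf: outbox.append((i, pdf)),
        mark_sent=done.append,
    )
    print(outbox, done)
```

The point of the design is the return value: the worker hands back an identifier, not the document, so the results backend stays small.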

So if we want to maintain our low memory requirements, we have to tune our application so that it does not store a lot of information in the Redis database.

Using Redis both as broker and as results database helps us reach our goal of reusing the technology, and we can also use Redis as a cache backend for Django and to [store sessions](https://bitbucket.org/dpaccoud/django-redis-sessions/src).

Our settings.py

First of all we have to add djcelery to our INSTALLED_APPS, and then we have to configure Redis as the broker and as the results backend.


import djcelery
djcelery.setup_loader()

The Redis host configured in these settings is the virtual image on which I have installed a fresh Ubuntu and which runs Redis, since in this post I'd like to emulate a simple production environment with two servers. As you can see, there is no password protection and Redis runs on its default port.
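As a rough sketch, the Redis-related part of settings.py could look like the following, assuming the setting names used by Celery 2.2. The host address is a placeholder of mine (192.0.2.10 is a documentation-only address, not the one used for this post), the database numbers are assumptions, and REDIS_SERVER is just a helper variable. Only BROKER_VHOST, REDIS_DB, the 10-second expiry and the scheduler class are taken from the post itself.

```python
# Sketch of the Redis-related part of settings.py (Celery 2.2 style names).
# REDIS_SERVER is a hypothetical placeholder, not the address used in the post.
REDIS_SERVER = "192.0.2.10"

# Broker: Redis on its default port, no password.
BROKER_BACKEND = "redis"
BROKER_HOST = REDIS_SERVER
BROKER_PORT = 6379
BROKER_VHOST = "0"   # Redis database number used for the task queue (assumed)

# Results backend: the same Redis server.
CELERY_RESULT_BACKEND = "redis"
REDIS_HOST = REDIS_SERVER
REDIS_PORT = 6379
REDIS_DB = 0         # may equal BROKER_VHOST or be a separate database

# Drop task results after 10 seconds.
CELERY_TASK_RESULT_EXPIRES = 10

# Keep the periodic task schedule in the Django database.
CELERYBEAT_SCHEDULER = "djcelery.schedulers.DatabaseScheduler"
```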

Note that in BROKER_VHOST we have to configure the database Redis will use for the broker. It can be the same one as REDIS_DB, but we could choose to keep the results and the task communication in different databases. CELERY_TASK_RESULT_EXPIRES is just 10 seconds, time enough for our purposes. CELERYBEAT_SCHEDULER is configured to let us create periodic tasks from our Django application. As this needs new database tables, we need to run syncdb to create the tables that the scheduler needs.

Development mode

In development one of the first goals is to make sure everything works properly, so we don't need the noise that the broker and the storage add to our development process. Celery has a special configuration for this, with which our application uses neither the workers nor the broker and runs as an ordinary application: it just invokes the tasks synchronously.
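The name of the setting is not given above. One that behaves as described, as far as I know, is CELERY_ALWAYS_EAGER, so the sketch below assumes that is the one meant. Tying it to Django's DEBUG flag and adding CELERY_EAGER_PROPAGATES_EXCEPTIONS are my own choices, not something the post prescribes.

```python
# Sketch for a development settings.py; tying these to DEBUG is an assumption.
DEBUG = True

# Run tasks locally and synchronously instead of sending them to the broker.
CELERY_ALWAYS_EAGER = DEBUG

# Re-raise exceptions from eagerly executed tasks so errors surface at once.
CELERY_EAGER_PROPAGATES_EXCEPTIONS = DEBUG
```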

Let's start the workers

When you start with Celery it's important to have a global view of what's happening in your application. I have found that Terminator is a good tool for running console commands, since it lets us split the terminal and see everything at once.

So let's open a console in our application environment and run

python manage.py celeryd -E -B --loglevel=INFO -n w1.d820

This will run a worker with the default number of processes, which depends on the number of CPUs available on our server. We have configured the worker to send monitoring events (-E) and to run an additional process that deals with the periodic tasks (-B).

It's important to note that only one worker can have the -B parameter, so perhaps it is better to make this fact more visible and run the periodic task process with a dedicated command

python manage.py celerybeat --loglevel=INFO

When celerybeat runs as a standalone process it informs us about its configuration

[2011-04-03 11:16:46,808: WARNING/MainProcess] celerybeat v2.2.5 is starting.
[2011-04-03 11:16:46,863: WARNING/MainProcess] __    -    ... __   -        _
Configuration ->
    . broker -> redis://@
    . loader -> djcelery.loaders.DjangoLoader
    . scheduler -> djcelery.schedulers.DatabaseScheduler

As we want to monitor the tasks and have more than one worker, it is important to name them. This can be done with the -n parameter. I like to use the worker number and the server name, in the example the name of my laptop.

Running a second worker is as easy as:

python manage.py celeryd -E --loglevel=INFO -n w2.d820

 -------------- celery@w2.d820 v2.2.5
---- **** -----
--- * ***  * -- [Configuration]
-- * - **** ---   . broker:      redis://@
- ** ----------   . loader:      djcelery.loaders.DjangoLoader
- ** ----------   . logfile:     [stderr]@WARNING
- ** ----------   . concurrency: 2
- ** ----------   . events:      ON
- *** --- * ---   . beat:        OFF
-- ******* ----
--- ***** ----- [Queues]
--------------   . celery:      exchange:celery (direct) binding:celery

As we can see, we have not added the -B parameter and Celery informs us that the beat process is off.

We can increase or decrease the number of processes the worker starts with the --concurrency parameter. The right number is a matter of testing and seeing.

python manage.py celeryd -E --concurrency=10 -n w3.d820

Monitoring: what's happening in my application?

If we have added logging to our application we can check the output of each worker, but perhaps we have no logs, or our workers are distributed across different servers. Celery provides some monitoring tools that are good to know. To use them the first step is to start the monitoring service:

python manage.py celerymon

celerymon 2.2.5 is starting.
Configuration ->
    . broker -> redis://@
    . webserver -> http://localhost:8989
celerymon has started.

As we can see, celerymon has started a web server on port 8989. We can connect to that server and see the registered workers and tasks. The information is rather raw, but it could be enough

[{"heartbeats": [1301824037.784225],"hostname": "w1.d820"}, 
{"heartbeats": [1301824018.90294], "hostname": "w2.d820"}]
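Raw as it is, this JSON is easy to consume from a script. As a rough sketch, the function below flags workers whose latest heartbeat is too old. The name stale_workers and the 60-second default are assumptions of mine, not part of celerymon, and the payload is passed in as a string, so how you fetch it from the server is up to you.

```python
import json
import time


def stale_workers(payload, now=None, max_age=60.0):
    """Return hostnames whose newest heartbeat is older than max_age seconds."""
    now = time.time() if now is None else now
    stale = []
    for worker in json.loads(payload):
        beats = worker.get("heartbeats") or []
        # A worker that has never sent a heartbeat is treated as stale too.
        if not beats or now - max(beats) > max_age:
            stale.append(worker["hostname"])
    return stale


sample = (
    '[{"heartbeats": [1301824037.784225], "hostname": "w1.d820"},'
    ' {"heartbeats": [1301824018.90294], "hostname": "w2.d820"}]'
)

# Checked about 30 seconds after w1's heartbeat, with a 40-second tolerance.
print(stale_workers(sample, now=1301824067.78, max_age=40.0))  # prints ['w2.d820']
```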

As we have configured the DatabaseScheduler we can see the tasks in the Django application itself, but there is another tool, in console mode, that gives us nearly real-time information: celeryev.

With python manage.py celeryev we start a console application that shows which tasks are being processed; we can see the result of each task and even revoke a task. If we want more control over monitoring, Celery provides an API to get the information, and of course you can look at the source code of celeryev.

It's also important to monitor the Redis server. With

sudo tail -n 100 -f /var/log/redis/redis-server.log

we'll see what's happening on the Redis side:

==> /var/log/redis/redis-server.log <==
[677] 03 Apr 09:56:10 - Accepted
[677] 03 Apr 09:56:10 - Client closed connection
[677] 03 Apr 09:56:10 - Client closed connection
[677] 03 Apr 09:56:10 - Accepted
[677] 03 Apr 09:56:10 - Client closed connection
[677] 03 Apr 09:56:10 - Client closed connection
[677] 03 Apr 09:56:10 - Accepted

Redis also provides a client console, redis-cli, that we can use to get more information and to carry out a lot of management tasks. Some useful commands are:

  • KEYS * shows us the active keys.
  • DBSIZE tells us the number of keys in the active database.
  • INFO gives us a lot of information about our database; it's really useful for checking the memory consumption.
  • FLUSHDB cleans the active database, removing all its keys.
  • MONITOR shows us in real time what's happening: which commands are being executed, and the keys and information being stored in the database.
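The same INFO data can be read from Python with the redis package we installed earlier. The sketch below is only an illustration: memory_alert, check_server and the 64 MB limit are names and values I made up, while used_memory is a field reported by INFO. The check itself works on a plain dictionary, so it can be tried without a running server.

```python
def memory_alert(info, limit_bytes):
    """Return a warning string if Redis uses more than limit_bytes, else None.

    `info` is the dictionary returned by redis-py's Redis.info(), which
    mirrors the output of the INFO command.
    """
    used = int(info["used_memory"])
    if used > limit_bytes:
        return "Redis is using %d bytes (limit %d)" % (used, limit_bytes)
    return None


def check_server(host="localhost", port=6379, db=0,
                 limit_bytes=64 * 1024 * 1024):
    """Query a live server; needs the redis package and a reachable Redis."""
    import redis  # imported here so memory_alert can be used on its own

    client = redis.Redis(host=host, port=port, db=db)
    return memory_alert(client.info(), limit_bytes)
```

Calling check_server() from a cron job or a periodic task would be one way to notice memory growth before it becomes a problem.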

To summarize

With Django, Celery and Redis we have a simple and scalable task distribution system with very small server requirements.

We can use Redis as a broker, as a data store and for other jobs in our Django application: as another database, as a session store, as a cache server.

If we want to split the work into tasks we have to

  • Develop our application thinking in terms of tasks and asynchronous processes.
  • Install and configure Redis.
  • Run the workers.
  • Run celerybeat if we have periodic tasks.
  • Run the monitor.

And of course we have to monitor the whole application. Enjoy!
