Commit 65cc54e: Blogged about Pinterest Deploys (parent 7d68b2e)
---
layout: blog
title: A better deploy strategy
---

{{ page.title }}
================
At Pinterest we like to deploy as often as possible. We want to deploy even more frequently, but we realize that every deploy currently impacts our users with intermittent 500 errors.
Our Setup
---------
Pinterest has a fleet of appservers that run our front-end Django app. Each appserver runs about 40 Python processes in Tornado WSGI containers, each listening on its own port. Each appserver has an nginx instance in front of the Tornado processes. It is configured something like this:

{% highlight nginx %}
upstream app_servers {
    fair;
    server localhost:8000 max_fails=1 fail_timeout=0s;
    server localhost:8001 max_fails=1 fail_timeout=0s;
    server localhost:8002 max_fails=1 fail_timeout=0s;
    server localhost:8003 max_fails=1 fail_timeout=0s;
}

server {
    listen 80;
    server_name pinterest.com;
    location / {
        proxy_pass http://app_servers;
        include /etc/nginx/proxy.conf;
    }
}
{% endhighlight %}
The pool of Python processes is defined by the nginx upstream module. We use the [fair][1] module to distribute load, preferring the least-busy servers that are known to be up.
The Old Way
-----------
Our old deploy system was to kill processes and let [Supervisord][3] restart them, which automatically loaded the new version of our code. This had the advantage of being fast (about 3 seconds) as well as very simple. Unfortunately, it killed all in-flight requests, giving those users 500 errors. The restarted processes were also 'cold', i.e. nothing was internally cached (Jinja templates, py files). To deploy more frequently we needed a way to avoid giving users 500s and cold processes.
The New Way
-----------
Our solution is to restart processes a handful at a time. For each process we take the following steps:
### Disabling the Process
We use iptables to redirect traffic to a known unused port on the appserver. nginx immediately recognizes that there is no longer a process listening on that port and distributes the traffic to the live ports instead. We do this with the following command:
```
iptables -t nat -I OUTPUT --src 0/0 --dst 127.0.0.1 -p tcp --dport {tornado_port} -j REDIRECT --to-ports 9
```
In English, this command means "at the nat level, redirect traffic from anywhere destined for localhost on port {tornado_port} to [port 9 (the discard port)][2]".
### Restart the Process
The first step is to wait for in-flight traffic to finish. In practice this takes about a second, but we wait 5 seconds to be sure all requests have completed. Even though new traffic is redirected away from the port by the previous step, requests already in progress complete as normal.
After the wait period, the process is killed and we wait for it to be restarted by [Supervisord][3].
### Warm Up the Process
To warm up the process we make a handful of HTTP requests to the newly restarted port. We address it as `127.0.0.2:{port_number}` to avoid being hit by our iptables rules: by default in Ubuntu the loopback interface responds to the whole range `127.0.0.0` through `127.255.255.255`, but our iptables rules only apply to the address that nginx is using.
We are trying to accomplish the following goals when warming up our processes:
- Parse/compile/load into memory our most commonly used py files.
- Cache our Jinja templates. Rendering a Jinja template is much slower the first time than on subsequent renders.
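A minimal sketch of such a warm-up pass, assuming a hypothetical helper and an illustrative list of paths (not our actual hot pages):

```python
import urllib.error
import urllib.request

# Illustrative warm-up targets; not our actual list of hot pages.
WARMUP_PATHS = ["/", "/about/", "/popular/"]

def warm_up(port, host="127.0.0.2", paths=WARMUP_PATHS, timeout=5):
    """Prime a freshly restarted process by requesting a few pages
    directly, bypassing the iptables redirect via a secondary
    loopback address."""
    statuses = []
    for path in paths:
        url = "http://%s:%d%s" % (host, port, path)
        try:
            statuses.append(urllib.request.urlopen(url, timeout=timeout).getcode())
        except urllib.error.URLError:
            statuses.append(None)  # a failed warm-up request is non-fatal
    return statuses
```

Failures are recorded rather than raised: a cold process that misses one warm-up hit should still be re-enabled.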
Why not cache everything? Because caching everything would lead to much higher memory usage on our appservers, forcing us to run fewer Python processes per box.
### Re-enable Ports
To re-enable a port we just delete the iptables rule, by re-running the disable command with `-I` replaced by `-D`.
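Putting the steps together, here is a sketch of the per-process cycle in Python. The `run` callable stands in for shelling out, and the Supervisord program name `tornado-%d` is an assumption; this illustrates the ordering, not our actual deploy script:

```python
import time

# Rule templates mirroring the iptables commands above.
DISABLE = ("iptables -t nat -I OUTPUT --src 0/0 --dst 127.0.0.1 "
           "-p tcp --dport %d -j REDIRECT --to-ports 9")
ENABLE = ("iptables -t nat -D OUTPUT --src 0/0 --dst 127.0.0.1 "
          "-p tcp --dport %d -j REDIRECT --to-ports 9")

def restart_one(port, run, warm_up, drain_seconds=5):
    """Cycle a single process: disable -> drain -> restart -> warm -> enable."""
    run(DISABLE % port)                             # redirect new traffic away
    time.sleep(drain_seconds)                       # let in-flight requests finish
    run("supervisorctl restart tornado-%d" % port)  # Supervisord brings it back up
    warm_up(port)                                   # prime caches via 127.0.0.2
    run(ENABLE % port)                              # delete the redirect rule

def rolling_restart(ports, run, warm_up, batch_size=4, drain_seconds=5):
    """Restart the fleet a handful of ports at a time."""
    for i in range(0, len(ports), batch_size):
        for port in ports[i:i + batch_size]:
            restart_one(port, run, warm_up, drain_seconds)
```

Injecting `run` keeps the ordering testable without actually touching iptables or Supervisord.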
### Results
Our own internal metrics show very little user impact on a successful deploy. Users keep using Pinterest as normal through the deploy process. This has allowed us to do faster, more frequent deploys to our app servers.
[1]: http://wiki.nginx.org/HttpUpstreamFairModule
[2]: http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers#Well-known_ports
[3]: http://supervisord.org/
