Nginx: Mutual (two-way) SSL authentication for upstream HTTPS servers



Nginx is a really good, high-performance reverse proxy server that supports mutual authentication for incoming requests, but it doesn't support it for upstream/backend servers. In most deployments where nginx is used as a reverse proxy, it also acts as the SSL termination point, and upstream requests are routed over either non-SSL or one-way SSL connections.

Recently, I was working on a prototype API gateway + reverse proxy that manages SSL certificates for different upstream backends and routes incoming requests based on routing rules. Some of the upstream backends require their own client certificates with mutual authentication.

After spending some time on Google searches and questions posted on Stack Overflow, I realized that nginx doesn't support mutual/two-way authentication for upstreams, so I decided to implement the support myself.

I modified ngx_http_proxy_module.c to add the following two new config parameters, which let you specify the client certificate PEM and client certificate key files. I have submitted my patch to the nginx development community here.

proxy_ssl_client_certificate
proxy_ssl_client_certificate_key

NOTE: Make sure you do not populate proxy_ssl_trusted_certificate. Nginx already supports that directive for one-way upstream SSL, and it takes precedence over my new parameters.

1. Apply the patch below to your nginx source code and recompile. I have verified this patch against nginx-1.4.7.
Download the patch here.

2.  Configure your nginx.conf as below.  See the comments for more details.

location / {
    default_type application/json;

    # verify the upstream server's certificate
    proxy_ssl_verify on;
    proxy_ssl_verify_depth 3;

    # client certificate presented to the upstream server
    proxy_ssl_client_certificate /etc/upstream-a.pem;

    # private key matching the client certificate above
    proxy_ssl_client_certificate_key /etc/upstream-a.key;

    # configure based on your security requirements
    proxy_ssl_ciphers ALL;
    proxy_ssl_protocols SSLv3 TLSv1 TLSv1.1 TLSv1.2;

    # must match the server name in the upstream server's certificate
    proxy_ssl_name "abc.company.com";
    proxy_pass https://abc.company.com:9900;
}
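Before wiring the certificate into nginx, it can be worth checking that the upstream really accepts it. Here is a minimal Go sketch (not part of the patch) that performs the same kind of mutual-TLS handshake the patched proxy would make, reusing the certificate paths and hostname from the config above:

package main

import (
    "crypto/tls"
    "fmt"
    "log"
)

func main() {
    // Client certificate/key pair -- same paths as the nginx config above.
    cert, err := tls.LoadX509KeyPair("/etc/upstream-a.pem", "/etc/upstream-a.key")
    if err != nil {
        log.Fatal(err)
    }
    conf := &tls.Config{
        Certificates: []tls.Certificate{cert},
        ServerName:   "abc.company.com", // must match proxy_ssl_name
    }
    // If the upstream requires a client certificate, this handshake only
    // succeeds when the certificate above is accepted.
    conn, err := tls.Dial("tcp", "abc.company.com:9900", conf)
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    fmt.Println("mutual TLS handshake succeeded")
}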

Hope this helps. Let me know if you have any questions.

Go vs Rust: Productivity vs Performance

Recently, I have been spending some time learning both Go and Rust, and I am really excited about how these languages are evolving differently to solve different problems.

I think Rust will attract developers from C, C++, and Fortran, and will be used for developing high-performance systems such as games, browsers, telco servers, and distributed computing systems, as well as low-level, CPU-efficient embedded/micro devices.

Go seems aimed at Python, Ruby, and Java developers and will be used for enterprise applications, mobile apps, and application servers.

With my 10+ years of experience writing C++ applications for telecom service providers, where latency and throughput are very important, I really like Rust: it simplifies C++, eliminates memory corruption, improves compile times significantly, and is claimed to be blazingly fast.


Today I came across Computer Language Benchmarks Game results comparing Rust and Go, via a blog I was reading. These are microbenchmarks and give only a rough idea of how these languages perform on specific algorithm implementations.


Program              Lang   CPU secs   Elapsed secs   Memory KB   Code B   ≈ CPU Load
binary-trees         Rust      22.05           5.97     228,196      788   96% 87% 95% 93%
                     Go        68.41          18.42     266,624      814   94% 94% 92% 93%
pidigits             Rust       1.73           1.73       5,308     1297   1% 100% 0% 0%
                     Go         3.76           3.52       3,668      674   4% 48% 7% 51%
spectral-norm        Rust       7.87           2.06       5,112     1020   95% 96% 96% 96%
                     Go        15.70           3.96       1,816      536   99% 99% 99% 99%
reverse-complement   Rust       0.74           0.49     257,160     2015   12% 82% 41% 18%
                     Go         0.93           0.77     250,564     1243   10% 69% 11% 35%
fasta                Rust       4.99           5.00       4,812     1224   1% 0% 100% 0%
                     Go         7.26           7.27       1,028     1036   1% 1% 1% 100%
regex-dna            Rust      35.26          11.95     228,296      763   65% 66% 87% 78%
                     Go        47.40          16.27     543,868      789   86% 64% 64% 78%
fannkuch-redux       Rust      50.01          12.81       7,072     1180   99% 100% 97% 95%
                     Go        67.16          16.86       1,032      900   100% 100% 100% 100%
mandelbrot           Rust      20.14           5.09      56,824     1290   98% 99% 100% 99%
                     Go        25.44           6.39      32,276      894   100% 100% 100% 100%
n-body               Rust      20.99          20.99       4,824     1371   1% 0% 0% 100%
                     Go        22.95          22.95       1,036     1310   0% 0% 100% 1%
k-nucleotide         Rust      26.06           9.76     152,536     2113   42% 83% 43% 99%
                     Go        30.93           8.42     251,024     1399   98% 91% 90% 90%

To get comparative data across all these benchmarks, I averaged the results over the ten programs for each language.
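The averaging itself is just an arithmetic mean over the ten programs; here is a minimal Go sketch of the computation, with the elapsed-seconds column hard-coded from the table above:

package main

import "fmt"

func mean(xs []float64) float64 {
    sum := 0.0
    for _, x := range xs {
        sum += x
    }
    return sum / float64(len(xs))
}

func main() {
    // Elapsed seconds per benchmark, in table order (binary-trees ... k-nucleotide).
    rust := []float64{5.97, 1.73, 2.06, 0.49, 5.00, 11.95, 12.81, 5.09, 20.99, 9.76}
    golang := []float64{18.42, 3.52, 3.96, 0.77, 7.27, 16.27, 16.86, 6.39, 22.95, 8.42}
    rm, gm := mean(rust), mean(golang)
    fmt.Printf("Rust: %.3f s, Go: %.3f s\n", rm, gm)      // 7.585, 10.483
    fmt.Printf("Rust is %.1f%% faster\n", (gm-rm)/gm*100) // ~27.6% by elapsed time
}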

Results: average elapsed seconds and code size for each language:

Language   Avg elapsed (secs)   Avg code (B)
Rust             7.585              1306.1
Go              10.483               959.5


By these microbenchmarks, Rust is roughly 28% faster than Go (7.585s vs 10.483s average elapsed time) but requires roughly 36% more code (1306 vs 960 bytes).


Given that computers keep getting faster and cheaper while software grows more complex and more expensive to maintain, I would use Go for an enterprise application.

Which language would you choose and for what kinds of applications?  



Fibonacci(50): Rust slower than Go?

In my previous blog post "Fibonacci(50) performance: Java > C > C++ > D > Go > Terra (Lua) > LuaJIT (Lua)", I compared language performance using the Fibonacci algorithm, but the Rust language was not included.

The Rust language is claimed to be "a systems programming language that runs blazingly fast, prevents almost all crashes*, and eliminates data races".

I thought I would run the same Fibonacci algorithm in Rust to see just how blazingly fast it is.

You can find the fib.rs source code in lang-compare on my GitHub. I compared it against fib.go and fib.c.
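Given that fib(50) takes tens of seconds, the benchmark is clearly the naive doubly-recursive Fibonacci. As a sketch of what fib.go presumably looks like (not necessarily line-for-line the repo version):

package main

import (
    "fmt"
    "os"
    "strconv"
)

// Naive doubly-recursive Fibonacci: exponential time, which is
// exactly what makes fib(50) a pure CPU benchmark.
func fib(n uint64) uint64 {
    if n < 2 {
        return n
    }
    return fib(n-1) + fib(n-2)
}

func main() {
    n, _ := strconv.Atoi(os.Args[1])
    fmt.Println("LANGUAGE GO:", fib(uint64(n)))
}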

FIBONACCI - 50
Language C (gcc-4.9.1):
gcc-4.9 -O3 fib.c
/usr/bin/time -lp ./a.out 50
LANGUAGE C: 12586269025
real        52.87
user        52.83
sys          0.01
    557056  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
       163  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
       497  involuntary context switches

Language Go (go1.4beta1 darwin/amd64):
go build fib.go
/usr/bin/time -lp ./fib 50
LANGUAGE GO: 12586269025
real        96.56
user        96.47
sys          0.06
   1232896  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
       300  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
      9010  voluntary context switches
      1338  involuntary context switches

Language Rust (0.13.0-nightly (45cbdec41 2014-11-07 00:02:18 +0000)):
/usr/local/bin/rustc --opt-level 3 fib.rs
/usr/bin/time -lp ./fib 50
LANGUAGE Rust: 12586269025
real       101.53
user       101.44
sys          0.01
   1040384  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
       229  page reclaims
        46  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         1  voluntary context switches
       914  involuntary context switches

Surprisingly, the results come out as C (52.87s) > Go (96.56s) > Rust (101.53s) in real (wall-clock) time, meaning C is the fastest and Rust is the slowest.

Language   Real (secs)   Max resident (KB)
C              52.87            544
Go             96.56          1,204
Rust          101.53          1,016




NOTE: Is there a better way to read a command-line argument and convert it into an integer? I found it very painful to figure out.

E.g.
In C language:
long n = atol(argv[1]);

In Go language:
n, _ := strconv.Atoi(os.Args[1])

In Rust language:
let args = os::args();
let args = args.as_slice();
let n: u64 = from_str::<u64>(args[1].as_slice()).unwrap();



Fibonacci(50) performance: Java > C > C++ > D > Go > Terra (Lua) > LuaJIT (Lua)

Recently, I have been spending some time learning the D and Go languages. D is an evolution of C++, whereas Go is claimed to be an evolution of C, though I think it is really Google's attempt to reduce its dependency on Java.

I really like the simplicity of Go, where the learning curve is very short, but accepting it as a systems-level programming language would be a tough sell. I think it is aimed more at Java/Python developers building enterprise software than at hardware/device-driver/embedded programming, but you never know; Google has been very successful selling Android on low-powered smart devices.

I ran some performance benchmarks for the C, C++, D, and Go languages using the Fibonacci algorithm.

Results: (see updated results at the end)

For Fibonacci(25): C++ >= Go > C >= LuaJIT > D > Terra > Java 1.6

For Fibonacci(50): Java > C > C++ > D (ldc) > D (dmd) > Go > Terra > LuaJIT

Now, surprisingly, Java outperformed C/C++ for Fibonacci(50), which hurts my ego :) !!

FIBONACCI-25
Language   % C++ Speed   Compiler/VM                       Flags
C++           100.0000   Apple LLVM version 6.0            -O3
GO            100.0000   go version go1.3.3 darwin/amd64
C              77.7778   Apple LLVM version 6.0            -O3
LUA            77.7778   LuaJIT 2.0.3
D              63.6364   dmd                               -m64 -O -inline -noboundscheck
D              63.6364   ldc                               -m64 -O -inline
LUA            43.7500   Terra
JAVA 1.6       43.4783   1.6.0_65-b14-462-11M4609

FIBONACCI-50
Language   % C++ Speed   Compiler/VM                       Flags
JAVA 1.6      169.6710   1.6.0_65-b14-462-11M4609
C             101.9846   Apple LLVM version 6.0            -O3
C++           100.0000   Apple LLVM version 6.0            -O3
D              92.6376   ldc                               -m64 -O -inline
D              81.7197   dmd                               -m64 -O -inline -noboundscheck
GO             76.7760   go version go1.3.3 darwin/amd64
LUA            43.9684   Terra
LUA            38.9649   LuaJIT 2.0.3





For source code and results, check out my GitHub project.

Update:
It seems Clang on macOS has some issue. I re-ran the tests on my Linux virtual machine with GNU g++, and there g++ outperforms Java. My ego is intact :)

$g++ --version
g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
$g++ -O3 fib.cpp
$time ./a.out 50
real 0m47.991s
user 0m47.981s
sys 0m0.000s

$java -version
java version "1.7.0_51"
$javac fib.java
$time java fib 50
real 0m51.897s
user 0m51.815s
sys 0m0.113s

Fibonacci numbers: LuaJIT vs Terra

Recently I came across the Terra language, a new low-level system programming language that is designed to interoperate seamlessly with the Lua programming language.

I thought I would compare the performance of LuaJIT and Terra. We all know LuaJIT is extremely fast, thanks to Mike Pall.

You can find the fib.lua source code in lang-compare on my GitHub.

FIBONACCI - 25

>time luajit-2.0.3 fib.lua 25

Running LUA-JIT 2.0.3 test
LANGUAGE LUA: 75025
real 0m0.009s
user 0m0.002s
sys 0m0.006s

Running fib.lua with Terra:
>time terra  fib.lua 25

Running Terra test
LANGUAGE LUA: 75025
real 0m0.016s
user 0m0.005s
sys 0m0.010s

Here LuaJIT is ~77% faster than Terra on the same fib.lua file (0.009s vs 0.016s real), though at such tiny runtimes, startup cost dominates the measurement.

FIBONACCI - 50

>time luajit-2.0.3 fib.lua 50
Running LUA-JIT test
LANGUAGE LUA: 12586269025
real 3m8.326s
user 3m8.157s
sys 0m0.045s

>time terra  fib.lua 50
Running Terra test
LANGUAGE LUA: 12586269025
real 2m46.895s
user 2m46.734s
sys 0m0.043s

Here LuaJIT is ~13% slower than Terra (188.3s vs 166.9s real). I ran the same tests multiple times and the results were consistent: Terra outperforms LuaJIT when the benchmark runs for a longer duration.



Apple Push Notification Service (APNS) Simulator

The APNS simulator implements the APNS specs for both simple and enhanced push notifications.

Prerequisites:
Once you have downloaded/installed LuaJIT and luarocks, the other dependencies can be installed using luarocks, e.g.:

luarocks install copas
luarocks install LuaSec
luarocks install LuaLogging
Usage: apns-sim.lua -t ssl_enabled [ -k ssl_key -c ssl_cert ] [ -s server -p port -l loglevel ]

Here ssl_key and ssl_cert are mandatory if ssl_enabled is set to true.

ssl_enabled : default false
server : default 127.0.0.1
port : default 8080
loglevel : default 'warn'

E.g. for an SSL connection:
lua apns-sim.lua -t true -k ./key.pem -c ./cert.pem

For a non-SSL connection:
lua apns-sim.lua

When a client connects to this simulator and sends a push notification, you will see log entries on the console:
Wed Oct 15 09:16:13 2014 INFO Received client connection  from '127.0.0.1:53444':
Wed Oct 15 09:16:13 2014 INFO Received notification: command=1; id=21; expiry=1413382573; token=adf3b210e7adf35f540f45b2697760d9d41081569dc4509ee98bb4d4c92a72ae; payload={"aps":{"alert":{"body":"Hello World"}}}
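For reference, the log above corresponds to the classic "enhanced" APNS binary frame (command=1). Here is a minimal Go client sketch (not part of the simulator) that sends the same notification to the simulator's non-SSL defaults, with the token and payload copied from the log:

package main

import (
    "bytes"
    "encoding/binary"
    "encoding/hex"
    "log"
    "net"
)

func main() {
    // Token and payload copied from the simulator log above.
    token, _ := hex.DecodeString("adf3b210e7adf35f540f45b2697760d9d41081569dc4509ee98bb4d4c92a72ae")
    payload := []byte(`{"aps":{"alert":{"body":"Hello World"}}}`)

    // Enhanced (command=1) binary frame:
    // command(1) | identifier(4) | expiry(4) | token len(2) | token | payload len(2) | payload
    var frame bytes.Buffer
    frame.WriteByte(1)                                           // command: enhanced notification
    binary.Write(&frame, binary.BigEndian, uint32(21))           // identifier
    binary.Write(&frame, binary.BigEndian, uint32(1413382573))   // expiry (epoch seconds)
    binary.Write(&frame, binary.BigEndian, uint16(len(token)))   // token length (32)
    frame.Write(token)
    binary.Write(&frame, binary.BigEndian, uint16(len(payload))) // payload length
    frame.Write(payload)

    // Plain TCP to the simulator's non-SSL defaults (127.0.0.1:8080).
    conn, err := net.Dial("tcp", "127.0.0.1:8080")
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    if _, err := conn.Write(frame.Bytes()); err != nil {
        log.Fatal(err)
    }
}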

Improving performance of C# Binding for ZeroMQ (clrzmq)

In one of my projects, I used ZeroMQ for inter-process communication; it is extremely fast, allows async I/O, supports different messaging patterns, and is available on multiple platforms/languages.

I used the following three messaging patterns:

  1. Publish/Subscribe: a client subscribes to specific types of messages; when the server reads these messages from the hardware, it publishes them to the subscribed clients.
  2. Request/Response: a client sends a request to the server, which executes it, interacts with the hardware, and returns the response. E.g. a client can request to open a serial port or play audio.
  3. Push/Pull: all clients push their logs to the central logging server, which pulls the messages and writes them to a file.

As the development was done in C# on a Windows Embedded environment, I used clrzmq, a C# binding for ZeroMQ. Based on my initial performance tests, I realized that clrzmq was taking a lot more CPU than I expected.

I used Redgate's ANTS Performance Profiler for .NET, which gives a detailed breakdown of how many CPU cycles are spent in each function and how many times each is called.

What I found is that ZmqSocket.Receive() spent its time as follows:
  • SpinWait: 17.1%
  • Stopwatch.GetElapsedDateTimeTicks: 4.1%
  • Stopwatch.StartNew: 2.4%
  • Receive: 73.3%
Within that, the Receive() function spent 64.4% of its time in SocketProxy.Receive():
  • ErrorProxy.get_ShouldTryAgain: 5.1%
  • SocketProxy.Receive: 64.4%
And the CPU usage inside SocketProxy.Receive():
  • DisposableIntPtr.Dispose: 11.1%
  • ZmqMsgT.Init: 7.1%
  • ZmqMsgT.Close: 5.8%
  • SocketProxy.RetryIfInterrupted: 20.8%
See the attached picture, where SocketProxy.Receive() averages 13,142.42 CPU ticks per request:
Average CPU ticks per request = 191,196,025 / 14,548 = 13,142.42

As part of the optimization, I used a pre-allocated raw buffer to send and receive data instead of a ZmqMsg object, and moved the Stopwatch and SpinWait code into a limited scope that only runs when a timeout is defined and longer than a certain value, as sketched below.
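Here is a language-neutral sketch of those two ideas (reusing one pre-allocated buffer and paying the timing cost only when a timeout is actually in play), written in Go rather than the actual C#; recvFn is a hypothetical stand-in for the raw socket receive:

package main

import (
    "fmt"
    "time"
)

// recvFn is a hypothetical stand-in for a raw, non-blocking socket receive
// into a caller-supplied buffer; retry=true means "would block, try again".
type recvFn func(buf []byte) (n int, retry bool)

// receive reuses one pre-allocated buffer across calls and only touches the
// clock when a finite timeout is set -- the two ideas behind the patch.
func receive(recv recvFn, buf []byte, timeout time.Duration) (int, bool) {
    if timeout <= 0 {
        // Fast path: no Stopwatch/SpinWait equivalent at all.
        n, retry := recv(buf)
        return n, !retry
    }
    deadline := time.Now().Add(timeout) // timing scoped to the slow path only
    for {
        if n, retry := recv(buf); !retry {
            return n, true
        }
        if time.Now().After(deadline) {
            return 0, false // timed out
        }
    }
}

func main() {
    buf := make([]byte, 64*1024) // allocated once, reused for every receive
    fake := func(b []byte) (int, bool) { return copy(b, "hello"), false }
    n, ok := receive(fake, buf, 0)
    fmt.Println(ok, string(buf[:n]))
}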

After these optimizations, SocketProxy.Receive() uses only 2,696.54 CPU ticks per request, almost 1/5 of the original usage. See the attached picture below.
Average CPU ticks per request = 5,132,725,522 / 1,903,448 = 2,696.54

Here is the GitHub link for the optimized ZeroMQ library.

I am happy to say that my patch was accepted by the clrzmq author and merged into the mainline clrzmq library.