Pythian Blog: Technical Track

Comparing CPU throughput of Azure and AWS EC2

After observing CPU core sharing with Amazon Web Services EC2, I thought it would be interesting to see if Microsoft Azure platform exhibits the same behavior.

Signing up for Azure’s 30-day trial gives $200 in credit to use over the next 30-day period: more than enough for this kind of testing. Creating a new virtual machine, using the “quick create” option with Oracle Linux, and choosing a 4-core “A3” standard instance.

I must say I like the machine naming into built-in “clouadpp.net” DNS that Azure uses: no mucking around with IP addresses. The VM provisioning definitely takes longer than AWS, though no more than a few minutes. And speaking of IP addresses, both start with 191.236. addresses assigned to Microsoft’s Brazilian subsidiary through the Latin American LACNIC registry, due to the lack of north american IP addresses.

Checking out the CPU specs as reported to the OS:

[azureuser@marc-cpu ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor       : 0
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.990
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
processor       : 1
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.990
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
processor       : 2
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.990
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
processor       : 3
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.990
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4

2.2GHz rather than 2.6GHz, but otherwise the same family and architecture as the E5-2670 under AWS. Identified as a single-socket, 4-core processor, without hyperthreads at all.

Running the tests

[azureuser@marc-cpu ~]$ taskset -pc 0 $$
pid 1588's current affinity list: 0-3
pid 1588's new affinity list: 0
[azureuser@marc-cpu ~]$ dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 36.9319 s, 58.8 MB/s
[azureuser@marc-cpu ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 72.8379 s, 29.8 MB/s
2170552320 bytes (2.2 GB) copied, 73.6173 s, 29.5 MB/s

Pretty low; that’s half the throughput we saw on AWS, albeit with a slower clock speed here.

[azureuser@marc-cpu ~]$ taskset -pc 0,1 $$
pid 1588's current affinity list: 0
pid 1588's new affinity list: 0,1
[azureuser@marc-cpu ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
[azureuser@marc-cpu ~]$ 2170552320 bytes (2.2 GB) copied, 36.4285 s, 59.6 MB/s
2170552320 bytes (2.2 GB) copied, 36.7957 s, 59.0 MB/s
[azureuser@marc-cpu ~]$ taskset -pc 0,2 $$
pid 1588's current affinity list: 0,1
pid 1588's new affinity list: 0,2
[azureuser@marc-cpu ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
[azureuser@marc-cpu ~]$ 2170552320 bytes (2.2 GB) copied, 36.3998 s, 59.6 MB/s
2170552320 bytes (2.2 GB) copied, 36.776 s, 59.0 MB/s

Pretty consistent results, so no core sharing, but running considerably slower than we saw with AWS.

Kicking off 20 runs in a rows:

[azureuser@marc-cpu ~]$ taskset -pc 0-3 $$
pid 1588's current affinity list: 0,2
pid 1588's new affinity list: 0-3
[azureuser@marc-cpu ~]$ for run in {1..20}; do
>  for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2>> output | gzip -c > /dev/null & done
> wait
> done
...
[azureuser@marc-cpu ~]$ cat output | awk '/copied/ {print $8}' | sort | uniq -c
      1 59.1
      4 59.2
      1 59.3
      2 59.4
      2 59.5
      8 59.6
     12 59.7
      7 59.8
      3 59.9

We get very consistent results, between 59.1 and 59.9 mB/sec

Results from “top” while running:

cat > ~/.toprc <<-EOF
RCfile for "top with windows"           # shameless braggin'
Id:a, Mode_altscr=0, Mode_irixps=1, Delay_time=3.000, Curwin=0
Def     fieldscur=AEHIOQTWKNMbcdfgjplrsuvyzX
        winflags=25913, sortindx=10, maxtasks=2
        summclr=1, msgsclr=1, headclr=3, taskclr=1
Job     fieldscur=ABcefgjlrstuvyzMKNHIWOPQDX
        winflags=62777, sortindx=0, maxtasks=0
        summclr=6, msgsclr=6, headclr=7, taskclr=6
Mem     fieldscur=ANOPQRSTUVbcdefgjlmyzWHIKX
        winflags=62777, sortindx=13, maxtasks=0
        summclr=5, msgsclr=5, headclr=4, taskclr=5
Usr     fieldscur=ABDECGfhijlopqrstuvyzMKNWX
        winflags=62777, sortindx=4, maxtasks=0
        summclr=3, msgsclr=3, headclr=2, taskclr=3
EOF
[azureuser@marc-cpu ~]$  top -b -n20 -U azureuser
...
top - 14:38:41 up 2 min,  2 users,  load average: 2.27, 0.78, 0.28
Tasks: 205 total,   3 running, 202 sleeping,   0 stopped,   0 zombie
Cpu0  : 95.4%us,  4.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 94.4%us,  5.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.5%si,  0.0%st
Cpu3  :  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1606 azureuse  20   0  4292  800  400 R 97.0  0.0   0:03.49 gzip
 1604 azureuse  20   0  4292  796  400 R 96.7  0.0   0:03.50 gzip
top - 14:38:44 up 2 min,  2 users,  load average: 2.25, 0.80, 0.29
Tasks: 205 total,   3 running, 202 sleeping,   0 stopped,   0 zombie
Cpu0  : 94.4%us,  5.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.5%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 72.3%us,  3.9%sy,  0.0%ni, 23.4%id,  0.0%wa,  0.0%hi,  0.4%si,  0.0%st
Cpu3  : 12.0%us,  0.7%sy,  0.0%ni, 85.6%id,  1.4%wa,  0.0%hi,  0.4%si,  0.0%st
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1604 azureuse  20   0  4292  796  400 R 96.8  0.0   0:06.42 gzip
 1606 azureuse  20   0  4292  800  400 R 96.4  0.0   0:06.40 gzip
top - 14:38:47 up 2 min,  2 users,  load average: 2.25, 0.80, 0.29
Tasks: 205 total,   3 running, 202 sleeping,   0 stopped,   0 zombie
Cpu0  : 94.9%us,  5.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  9.7%us,  0.3%sy,  0.0%ni, 89.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu2  : 51.8%us,  2.8%sy,  0.0%ni, 45.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 17.9%us,  1.4%sy,  0.0%ni, 80.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1604 azureuse  20   0  4292  796  400 R 96.5  0.0   0:09.34 gzip
 1606 azureuse  20   0  4292  800  400 R 95.5  0.0   0:09.29 gzip

It’s using full CPUs and all from gzip, so no large system overhead here. Also, “%st”, time reported “stolen” by the hypervisor, is zero. We’re simply getting half the throughput of AWS.

Basic instances

In addition to standard instances, Microsoft makes available basic instances, which claim to offer “similar machine configurations as the Standard tier of instances offered today (Extra Small [A0] to Extra Large [A4]). These instances will cost up to 27% less than the corresponding instances in use today (which will now be called “Standard”) and do not include load balancing or auto-scaling, which are included in Standard” (https://azure.microsoft.com/blog/2014/03/31/microsoft-azure-innovation-quality-and-price/)

Having a look at throughput here, by creating a basic A3 instance “marc-cpu-basic” that otherwise matches exactly marc-cpu created earlier.

[azureuser@marc-cpu-basic ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor       : 0
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.993
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
processor       : 1
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.993
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
processor       : 2
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.993
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
processor       : 3
model name      : Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
cpu MHz         : 2199.993
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4

CPU specs are identical to marc-cpu. Running the same tests:

[azureuser@marc-cpu-basic ~]$ taskset -pc 0 $$
pid 1566's current affinity list: 0-3
pid 1566's new affinity list: 0
[azureuser@marc-cpu-basic ~]$  dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 54.6678 s, 39.7 MB/s
for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
[azureuser@marc-cpu-basic ~]$ for i in {1..2}; do (dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null &) done
2170552320 bytes (2.2 GB) copied, 107.73 s, 20.1 MB/s
2170552320 bytes (2.2 GB) copied, 107.846 s, 20.1 MB/s

Now that’s very slow: even with the identical stated CPU specs as marc-cpu, marc-cpu-basic comes in with 33% less throughput.

Doing 20 runs in a rows:

[azureuser@marc-cpu-basic ~]$ taskset -pc 0-3 $$
pid 1566's current affinity list: 0
pid 1566's new affinity list: 0-3
[azureuser@marc-cpu-basic ~]$ for run in {1..20}; do
> for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2>> output | gzip -c > /dev/null & done
> wait
> done
...
[azureuser@marc-cpu-basic ~]$ cat output | awk '/copied/ {print $8}' | sort | uniq -c
      4 40.4
     15 40.5
     14 40.6
      7 40.7

Very consistent results, but consistently slow. They do show that cores aren’t being shared, but throughput is lower than even a shared core under AWS.

Wrapping up

Comparison chart

Under this simple gzip test, we are testing CPU integer performance. The Azure standard instance got half the throughput of the equivalent AWS instance, in spite of a clock speed only 15% slower. But the throughput was consistent: no drops when running on adjacent cores. The basic instance was a further 33% slower than a standard instance, in spite of having the same CPU configuration.

Under Azure, we simply aren’t getting a full physical core’s worth of throughput. Perhaps the hypervisor is capping throughput, and capping even lower for basic instances? Or maybe the actual CPU is different than the E5-2660 reported? For integer CPU-bound workloads like our gzip test, we would need to purchase at least twice as much capacity under Azure than AWS, making Azure considerably more expensive as a platform.

No Comments Yet

Let us know what you think

Subscribe by email