Virtual CPUs with Amazon Web Services

Some months ago, Amazon Web Services changed the way they measure CPU capacity on their EC2 compute platform. In addition to the old ECUs, there is a new unit to measure compute capacity: vCPUs. The instance type page defines a vCPU as “a hyperthreaded core for M3, C3, R3, HS1, G2, and I2.” The definition is a bit ambiguous: is a vCPU a dedicated CPU core (which has two hyperthreads on the E5-2670 v2 platform being used), or is it half a core, i.e. a single hyperthread?

I decided to test this out for myself by setting up one of the new-generation m3.xlarge instances (with thanks to Christo for technical assistance). It is stated to have 4 vCPUs running on an E5-2670 v2 processor at 2.5GHz, on the Ivy Bridge-EP microarchitecture (or sometimes 2.6GHz in the case of xlarge instances).

Investigating for ourselves

I’m going to use paravirtualized Amazon Linux 64-bit for simplicity:

$ ec2-describe-images ami-fb8e9292 -H
Type ImageID Name Owner State Accessibility ProductCodes Architecture ImageType KernelId RamdiskId Platform RootDeviceType VirtualizationType Hypervisor
IMAGE ami-fb8e9292 amazon/amzn-ami-pv-2014.03.1.x86_64-ebs amazon available public x86_64 machine aki-919dcaf8 ebs paravirtual xen
BLOCKDEVICEMAPPING /dev/sda1 snap-b047276d 8

Launching the instance:

$ ec2-run-instances ami-fb8e9292 -k marc-aws --instance-type m3.xlarge --availability-zone us-east-1d
RESERVATION r-cde66bb3 462281317311 default
INSTANCE i-b5f5a2e6 ami-fb8e9292 pending marc-aws 0 m3.xlarge 2014-06-16T20:23:48+0000 us-east-1d aki-919dcaf8 monitoring-disabled ebs paravirtual xen sg-5fc61437 default

The instance is up and running within a few minutes:

$ ec2-describe-instances i-b5f5a2e6 -H
Type ReservationID Owner Groups Platform
RESERVATION r-cde66bb3 462281317311 default
INSTANCE i-b5f5a2e6 ami-fb8e9292 ec2-54-242-182-88.compute-1.amazonaws.com ip-10-145-209-67.ec2.internal running marc-aws 0 m3.xlarge 2014-06-16T20:23:48+0000 us-east-1d aki-919dcaf8 monitoring-disabled 54.242.182.88 10.145.209.67 ebs paravirtual xen sg-5fc61437 default
BLOCKDEVICE /dev/sda1 vol-1633ed53 2014-06-16T20:23:52.000Z true

After logging in as ec2-user, let's first see what /proc/cpuinfo says:

[ec2-user@ip-10-7-160-199 ~]$ egrep '(processor|model name|cpu MHz|physical id|siblings|core id|cpu cores)' /proc/cpuinfo
processor : 0
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz : 2599.998
physical id : 0
siblings : 4
core id : 0
cpu cores : 1
processor : 1
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz : 2599.998
physical id : 0
siblings : 4
core id : 0
cpu cores : 1
processor : 2
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz : 2599.998
physical id : 0
siblings : 4
core id : 0
cpu cores : 1
processor : 3
model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
cpu MHz : 2599.998
physical id : 0
siblings : 4
core id : 0
cpu cores : 1

Looks like I got some of the slightly faster 2.6GHz CPUs. /proc/cpuinfo shows four processors, each with physical id 0 and core id 0. In other words, one single-core processor with 4 threads. We know that the E5-2670 v2 is actually a 10-core processor, so the topology we see at the OS level doesn't correspond to the underlying hardware.
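A quicker way to summarize the topology the guest kernel believes it has is lscpu from util-linux. This wasn't part of the original run, but given the /proc/cpuinfo output above it should report one socket with one core and four threads per core:

# Summarize the guest-visible CPU topology (util-linux's lscpu).
# Based on the /proc/cpuinfo above, expect: CPU(s): 4,
# Thread(s) per core: 4, Core(s) per socket: 1, Socket(s): 1.
lscpu | egrep 'Thread|Core|Socket|^CPU\(s\)'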

Nevertheless, we'll proceed with a few simple tests. I'm going to run "gzip", an integer-compute-intensive compression test, on 2.2GB of zeroes from /dev/zero. By using synthetic input and discarding output, we avoid any effects of disk I/O. I'm going to combine this test with taskset commands to impose processor affinity on the process.
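To make these runs easier to repeat, the pipeline can be wrapped in a small helper function; this is a sketch of my own (the function is hypothetical, not part of the measured runs below), using exactly the same dd | gzip pipeline and taskset pinning:

# Hypothetical helper: run N parallel gzip workers pinned to a CPU list.
# dd's stderr is filtered so only the throughput line is printed.
gzip_bench() {
  local cpus=$1 workers=$2
  for i in $(seq 1 "$workers"); do
    # Pin the whole dd | gzip pipeline (not just dd) to the given CPUs,
    # roughly equivalent to running "taskset -pc $cpus $$" before the test.
    taskset -c "$cpus" bash -c \
      'dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2) | gzip -c > /dev/null' &
  done
  wait
}
# Examples: gzip_bench 0 1; gzip_bench 0,1 2; gzip_bench 0,2 2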

A simple test

The simplest case: a single thread, on processor 0:

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0 $$
pid 1531's current affinity list: 0-3
pid 1531's new affinity list: 0
[ec2-user@ip-10-7-160-199 ~]$ dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null
2170552320 bytes (2.2 GB) copied, 17.8837 s, 121 MB/s

With a single processor, we can process 121 MB/s. Let's try running two gzips at once; sharing a single processor, we should see about half the throughput each.

[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 35.8279 s, 60.6 MB/s
2170552320 bytes (2.2 GB) copied, 35.8666 s, 60.5 MB/s

Sharing those cores

Now, let’s make things more interesting: two threads, on adjacent processors. If they are truly dedicated CPU cores, we should get a full 121 MB/s each. If our processors are in fact hyperthreads, we’ll see throughput drop.

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0,1 $$
pid 1531's current affinity list: 0
pid 1531's new affinity list: 0,1
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 27.1704 s, 79.9 MB/s
2170552320 bytes (2.2 GB) copied, 27.1687 s, 79.9 MB/s

We have our answer: throughput has dropped by a third, to 79.9 MB/s, showing that processors 0 and 1 are hyperthreads sharing a single core. (Note that hyperthreading is still giving a performance benefit here: 79.9 MB/s each on a shared core is better than the 60.5 MB/s each we saw when sharing a single hyperthread.)

Now the exact same test, but this time on non-adjacent processors 0 and 2:

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0,2 $$
pid 1531's current affinity list: 0,1
pid 1531's new affinity list: 0,2
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 17.8967 s, 121 MB/s
2170552320 bytes (2.2 GB) copied, 17.8982 s, 121 MB/s

All the way back up to full speed, showing that processors 0 and 2 sit on dedicated cores.
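Putting the three two-process configurations side by side, using the hypothetical helper sketched earlier (the per-process figures are the ones measured above):

# Same three runs via the gzip_bench helper from earlier:
gzip_bench 0   2   # both workers on one vCPU (one hyperthread): ~60.5 MB/s each
gzip_bench 0,1 2   # adjacent vCPUs, one shared physical core:   ~79.9 MB/s each
gzip_bench 0,2 2   # non-adjacent vCPUs, two dedicated cores:    ~121 MB/s each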

What does this all mean? Let's go back to Amazon's vCPU definition:

Each vCPU is a hyperthreaded core

As our tests have shown, a vCPU is most definitely not a dedicated core. It's half of a shared core: a single hyperthread.

A side effect: inconsistent performance

There’s another issue at play here too: the shared-core behavior is hidden from the operating system. Going back to /proc/cpuinfo:

[ec2-user@ip-10-7-160-199 ~]$ grep 'core id' /proc/cpuinfo
core id : 0
core id : 0
core id : 0
core id : 0
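
The same picture is exported through sysfs, which is what the scheduler's sibling detection is built from. A quick check (not part of the original output, and the topology directory may not be populated on every paravirtual kernel):

# thread_siblings_list names the hyperthreads sharing a core with each CPU;
# with every vCPU reporting core id 0, all four should land in one group.
for c in /sys/devices/system/cpu/cpu[0-3]; do
  echo "$c: $(cat $c/topology/thread_siblings_list 2>/dev/null)"
done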

This means the OS scheduler has no way of knowing which processors share a core, and cannot schedule around it. Let's go back to our two-thread test, but instead of restricting it to two specific processors, we'll let it run on any of them.

[ec2-user@ip-10-7-160-199 ~]$ taskset -pc 0-3 $$
pid 1531's current affinity list: 0,2
pid 1531's new affinity list: 0-3
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 18.041 s, 120 MB/s
2170552320 bytes (2.2 GB) copied, 18.0451 s, 120 MB/s
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 21.2189 s, 102 MB/s
2170552320 bytes (2.2 GB) copied, 21.2215 s, 102 MB/s
[ec2-user@ip-10-7-160-199 ~]$ for i in {1..2}; do dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2 ) | gzip -c > /dev/null & done
2170552320 bytes (2.2 GB) copied, 26.2199 s, 82.8 MB/s
2170552320 bytes (2.2 GB) copied, 26.22 s, 82.8 MB/s

We see throughput varying between 82.8 MB/s and 120 MB/s for the exact same workload. To get more detail on what's happening, we'll configure top to take ten batch-mode samples at 3-second intervals, with per-processor usage information:

[ec2-user@ip-10-7-160-199 ~]$ cat > ~/.toprc <<-EOF
RCfile for "top with windows" # shameless braggin'
Id:a, Mode_altscr=0, Mode_irixps=1, Delay_time=3.000, Curwin=0
Def fieldscur=AEHIOQTWKNMbcdfgjplrsuvyzX
winflags=25913, sortindx=10, maxtasks=2
summclr=1, msgsclr=1, headclr=3, taskclr=1
Job fieldscur=ABcefgjlrstuvyzMKNHIWOPQDX
winflags=62777, sortindx=0, maxtasks=0
summclr=6, msgsclr=6, headclr=7, taskclr=6
Mem fieldscur=ANOPQRSTUVbcdefgjlmyzWHIKX
winflags=62777, sortindx=13, maxtasks=0
summclr=5, msgsclr=5, headclr=4, taskclr=5
Usr fieldscur=ABDECGfhijlopqrstuvyzMKNWX
winflags=62777, sortindx=4, maxtasks=0
summclr=3, msgsclr=3, headclr=2, taskclr=3
EOF
[ec2-user@ip-10-7-160-199 ~]$ top -b -n10 -U ec2-user
top - 21:07:50 up 43 min, 2 users, load average: 0.55, 0.45, 0.36
Tasks: 86 total, 4 running, 82 sleeping, 0 stopped, 0 zombie
Cpu0 : 96.7%us, 3.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 1.4%sy, 0.0%ni, 97.9%id, 0.0%wa, 0.3%hi, 0.0%si, 0.3%st
Cpu2 : 96.0%us, 4.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 1.0%sy, 0.0%ni, 97.9%id, 0.0%wa, 0.7%hi, 0.0%si, 0.3%st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1766 ec2-user 20 0 4444 608 400 R 99.7 0.0 0:06.08 gzip
1768 ec2-user 20 0 4444 608 400 R 99.7 0.0 0:06.08 gzip

In this sample, two non-adjacent CPUs (0 and 2) are in use. But 3 seconds later, the processes are running on adjacent CPUs (0 and 1):

top - 21:07:53 up 43 min, 2 users, load average: 0.55, 0.45, 0.36
Tasks: 86 total, 4 running, 82 sleeping, 0 stopped, 0 zombie
Cpu0 : 96.3%us, 3.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 96.0%us, 3.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.3%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.3%hi, 0.0%si, 0.3%st
Cpu3 : 0.3%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.3%st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1766 ec2-user 20 0 4444 608 400 R 99.7 0.0 0:09.08 gzip
1768 ec2-user 20 0 4444 608 400 R 99.7 0.0 0:09.08 gzip

Although the usage percentages look the same in both samples, we saw earlier that throughput drops by a third when a core is shared, so overall throughput varies as the processes are moved between processors.

This type of situation arises with compute-intensive workloads, when there are fewer runnable processes than CPU threads. If only AWS reported correct core IDs to the guest, this problem wouldn't happen: the OS scheduler would avoid putting processes on sibling threads of the same core unless it had no other choice.
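In the meantime, a workaround for compute-bound workloads is to pin busy processes to non-sibling vCPUs by hand. A sketch, assuming the sharing pattern observed above (0 and 1 on one core, 0 and 2 on separate cores):

# Manual affinity workaround (sketch): start each worker on a vCPU that
# the tests above showed to be a separate physical core (0 and 2 here).
taskset -c 0 bash -c 'dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2) | gzip -c > /dev/null' &
taskset -c 2 bash -c 'dd if=/dev/zero bs=1M count=2070 2> >(grep bytes >&2) | gzip -c > /dev/null' &
wait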

Here's a summary of the results (throughput per gzip process):

Configuration                                    Throughput per process
Single process, pinned to CPU 0                  121 MB/s
Two processes, same vCPU (CPU 0)                 60.5 MB/s
Two processes, adjacent vCPUs (CPUs 0,1)         79.9 MB/s
Two processes, non-adjacent vCPUs (CPUs 0,2)     121 MB/s
Two processes, unpinned (CPUs 0-3)               82.8-120 MB/s

Summing up

Over the course of the testing I’ve learned two things:

  • A vCPU in an AWS environment actually represents only half a physical core. So if you're looking for compute capacity equivalent to, say, an 8-core server, you would need a so-called 4xlarge EC2 instance with 16 vCPUs. Take this into account in your costing models (see the quick sizing sketch after this list).
  • The mislabeling of the CPU threads as separate single-core processors can result in performance variability as processes are switched between threads. This is something the AWS and/or Xen teams should be able to fix in the kernel.
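
A quick back-of-the-envelope check of the first point (a sketch; the only assumption is the two-hyperthreads-per-core layout demonstrated above):

# vCPUs needed = physical cores wanted x 2 hyperthreads per core
cores_wanted=8
echo "vCPUs needed: $((cores_wanted * 2))"   # 16, i.e. a 4xlarge instance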

Readers: what has been your experience with CPU performance in AWS? If any of you has access to a physical machine running E5-2670 processors, it would be interesting to see how the simple gzip test runs.
