Why does E3 SSD dedi seem faster than Ryzen 7950X NVMe 2 vCore VPS at `sha256sum -c`?
Yesterday I uploaded a 4.6 GB Chromebook backup to storage. I uploaded the same file to both an E3 dedicated server and to a 2 vCore Ryzen 7950X VPS.
When I checked the integrity of the uploaded files, I was surprised that running sha256sum -c seemed to take longer on the Ryzen 7950X VPS than on the E3 dedi. For the E3 dedi the wall clock "real" time was 22.655 seconds, and, for the Ryzen, 30.658 seconds.
The E3 dedi has SSD Raid 10, and the Ryzen 7950X 2 vCore VPS has NVMe Raid 10.
The E3 is running Ubuntu 22.04.4 LTS. The Ryzen is running Debian 12.7.
Geekbench and fio scores from Yabs on both machines are shown below, along with both machines' time results.
Note that both fio and Geekbench scores are significantly higher on the Ryzen 2 vCore VPS than on the E3 Dedi.
I guess the sha256sum -c execution time might depend on how many threads are being used by sha256sum. From both machines' top results, also shown below, it seems like sha256sum might be single-threaded.
I haven't seen any steal on the Ryzen VPS, and I was told that it was on a new node.
In the time results shown below, please note that "real", "user", and "sys" each differ.
I decided to also try time sha256sum -c with another, bigger 17 GB file. For the 17 GB file, the wall clock "real" time was 1m18.146s on the E3 dedi and 2m16.960s on the Ryzen. So, again, the E3 Dedi seemed to beat the 2 vCore Ryzen 7950X.
It really seems like I must be missing something basic here! It doesn't seem sensible that the E3 would be faster than the Ryzen for sha256sum -c. But what explains the time differences? Why does the E3 seem faster?
Both processors have integrated graphics. The E3 graphics are enabled on the bare metal, and the Ryzen graphics are passed through into, and enabled on, the Ryzen VPS. But are the graphics processors even involved in sha256sum?
Assuming these results are not from way out in left field, and assuming there is no easy explanation that I am missing, does anybody here have experience measuring the time spent in the various operations performed by the sha256sum program, or know the detailed E3 and Ryzen architectural specifications? Why does the E3 seem faster than the Ryzen at sha256sum -c?
E3 Dedi
OS: Ubuntu 22.04.4 LTS
fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/mapper/vg0-root):
---------------------------------
Block Size | 4k (IOPS) | 64k (IOPS)
------ | --- ---- | ---- ----
Read | 199.08 MB/s (49.7k) | 180.73 MB/s (2.8k)
Write | 199.61 MB/s (49.9k) | 181.68 MB/s (2.8k)
Total | 398.69 MB/s (99.6k) | 362.42 MB/s (5.6k)
| |
Block Size | 512k (IOPS) | 1m (IOPS)
------ | --- ---- | ---- ----
Read | 301.57 MB/s (589) | 323.23 MB/s (315)
Write | 317.60 MB/s (620) | 344.75 MB/s (336)
Total | 619.17 MB/s (1.2k) | 667.98 MB/s (651)
Geekbench 6 Benchmark Test:
---------------------------------
Test | Value
|
Single Core | 1344
Multi Core | 4430
Full Test | https://browser.geekbench.com/v6/cpu/6406219
top - 23:16:48 up 1 day, 5:59, 2 users, load average: 1.21, 0.93, 0.90
Tasks: 216 total, 2 running, 214 sleeping, 0 stopped, 0 zombie
%Cpu0 : 4.7 us, 0.3 sy, 0.0 ni, 94.3 id, 0.3 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu1 : 1.3 us, 2.3 sy, 0.0 ni, 96.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 10.6 us, 1.7 sy, 0.0 ni, 87.1 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu3 : 93.0 us, 1.7 sy, 0.0 ni, 5.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 2.0 us, 2.3 sy, 0.0 ni, 95.4 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu5 : 4.0 us, 1.0 sy, 0.0 ni, 94.4 id, 0.3 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu6 : 1.0 us, 1.0 sy, 0.0 ni, 97.3 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu7 : 6.9 us, 2.3 sy, 0.0 ni, 90.5 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
MiB Mem : 64084.4 total, 27018.1 free, 7788.8 used, 29277.5 buff/cache
MiB Swap: 4096.0 total, 4096.0 free, 0.0 used. 55571.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
58585 root 20 0 5792 1056 968 R 100.0 0.0 0:24.90 sha256sum
root@E3-Dedi:~# time sha256sum -c chronos-20240904.tgz.cpt.SHA256
chronos-20240904.tgz.cpt: OK
real 0m22.655s
user 0m22.090s
sys 0m0.564s
root@E3-Dedi:~#
root@E3-Dedi:~# time sha256sum -c Documents.tgz.cpt.SHA256
Documents.tgz.cpt: OK
real 1m18.146s
user 1m16.245s
sys 0m1.892s
root@E3-Dedi:~#
Ryzen 7950X 2 vCore VPS
OS: Debian 12.7
fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/sda1):
---------------------------------
Block Size | 4k (IOPS) | 64k (IOPS)
------ | --- ---- | ---- ----
Read | 331.18 MB/s (82.7k) | 1.68 GB/s (26.3k)
Write | 332.05 MB/s (83.0k) | 1.69 GB/s (26.5k)
Total | 663.23 MB/s (165.8k) | 3.38 GB/s (52.8k)
| |
Block Size | 512k (IOPS) | 1m (IOPS)
------ | --- ---- | ---- ----
Read | 4.98 GB/s (9.7k) | 4.53 GB/s (4.4k)
Write | 5.24 GB/s (10.2k) | 4.83 GB/s (4.7k)
Total | 10.23 GB/s (19.9k) | 9.36 GB/s (9.1k)
Geekbench 6 Benchmark Test:
---------------------------------
Test | Value
|
Single Core | 2534
Multi Core | 4414
Full Test | https://browser.geekbench.com/v6/cpu/7541417
top - 23:15:06 up 4 days, 2:11, 2 users, load average: 0.44, 0.21, 0.09
Tasks: 91 total, 2 running, 89 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.0 us, 0.7 sy, 0.0 ni, 98.7 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
%Cpu1 : 94.6 us, 4.4 sy, 0.0 ni, 0.0 id, 1.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 3915.5 total, 191.6 free, 329.2 used, 3616.6 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 3586.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23848 root 20 0 5484 904 812 R 99.0 0.0 0:25.57 sha256sum
root@Ryzen-2-vCore-VPS:~# time sha256sum -c chronos-20240904.tgz.cpt.SHA256
chronos-20240904.tgz.cpt: OK
real 0m30.658s
user 0m9.694s
sys 0m0.668s
root@Ryzen-2-vCore-VPS:~#
root@Ryzen-2-vCore-VPS:~# time sha256sum -c Documents.tgz.cpt.SHA256
Documents.tgz.cpt: OK
real 2m16.960s
user 0m34.364s
sys 0m1.840s
root@Ryzen-2-vCore-VPS:~#
I hope everyone gets the servers they want!
Comments
Stupid guess: maybe 64GB RAM vs 4GB RAM?
@sh97 Interesting guess! I completely missed that possibility! Notice, though, that the top result for the Ryzen shows 191.6 MiB free. Since that much is free during the sha256sum -c execution, I am guessing that the process might not be memory constrained. But, still, you make a good point that I completely missed. Thank you so much!
The time difference may be caused by I/O caching.
To isolate the two effects, time two runs back to back:
First run timing reflects uncached I/O plus compute.
Second run timing reflects cached I/O plus compute.
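A minimal sketch of that comparison, using the 4.6 GB file from the first post (two identical timed runs back to back):
# first run has to hit the disk; the second can be served from the page cache if there is enough free RAM
time sha256sum -c chronos-20240904.tgz.cpt.SHA256
time sha256sum -c chronos-20240904.tgz.cpt.SHA256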
@sh97 Two additional ideas in response to your kind suggestion about RAM:
I could rerun the tests with a smaller file, say 1 GB (a quick way to generate one is sketched below).
I could add a swap file to the VPS and see if the numbers change.
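For the first idea, a throwaway 1 GB test file and its checksum could be generated like this (the file name here is arbitrary):
dd if=/dev/urandom of=test1G.bin bs=1M count=1024   # roughly 1 GB of random data
sha256sum test1G.bin > test1G.bin.SHA256
time sha256sum -c test1G.bin.SHA256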
Guessing some of the cryptography extensions aren't passing through the virt layer cleanly
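One way to check that guess from inside the VPS (and on the dedi, for comparison) is to see whether the CPU's SHA extensions are advertised to the guest at all; whether a given sha256sum build actually uses them is a separate question:
grep -m1 -o sha_ni /proc/cpuinfo || echo "sha_ni not advertised"   # prints sha_ni if the flag is exposed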
Just now remembering that the E3 is running ext4 with LVM, whereas the Ryzen is running xfs. I haven't yet studied up on the effect of these filesystem differences. Ideas, please?
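A quick way to record exactly what each root filesystem is and how it is mounted, so the two machines can be compared side by side:
findmnt -no SOURCE,FSTYPE,OPTIONS /   # device, filesystem type and mount options for /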
Are you sure both sha256sum executables are identical, or at least compiled with identical compile flags?
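An easy first check for that is to compare the coreutils version and the libraries the binary links against on both machines:
sha256sum --version | head -n 1   # coreutils version
ldd "$(command -v sha256sum)"     # libraries the binary is linked against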
Another excellent point that I completely missed! The executables could be different or could be compiled with different optimizations.
The sha256sum command is single-threaded. Virtualization overhead on the VPS, as well as OS and kernel differences, can also contribute to the performance difference. The Xeon E3 also has 4 times the number of "vcores", and they aren't shared between multiple users on the same server; simple as that.
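A quick way to confirm the single-threaded claim, assuming only one sha256sum process is running, is to look at its thread count from a second terminal while the check runs:
grep Threads /proc/"$(pgrep -x sha256sum)"/status   # "Threads: 1" means single-threaded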
Reboot plus two successive runs.
Another reboot plus another two successive runs.
Yet another reboot plus yet another two successive runs.
One more double try.
Why are the times on the first set of reruns here so different? The wall clock "real" time was 0m11.104s for the first run of the first set and 0m27.489s for the second run of the first set.
We might expect the second run of each set (the cached run) to be faster if caching is a significant factor. That's not what happened.
Indeed, in the first run of the first set of reruns here, the Ryzen VPS, at 11.104s, was the fastest of all the runs, including the original E3 and Ryzen runs (22.655s and 30.658s, respectively).
Having the Ryzen take 11 seconds for something the E3 does in 22 seconds seems reasonable. But the Ryzen doesn't run that fast consistently in these tests.
More ideas, please?
The sha256sum -c run is a CPU-intensive task. So is the Geekbench 6 test. For whatever it is worth, the Geekbench 6 test doesn't seem to show as much variation as the sha256sum -c task.
I haven't yet compared these yabs scores either with other guys' tests of the 7950X or with previous tests on 7950X bare metal servers I had.
One is a VPS with virtualization overhead and whatnot; the other runs directly on the hardware with no virtualization.
Run it on a Ryzen with no virtualization and you will see the difference.
An enticing answer!
Here's a guess: it would be a consistent 11 seconds.
I'm surprised that the variation under virtualization is so high! All the way from 11 seconds to 35 seconds for the same job? And, I really don't think the Node is oversold. I really don't think the Neighbors are problematic, either.
I've been talking with two different providers about getting a bare metal Ryzen from them. So we will see what happens.
Best wishes!
What I find strange is that real time differs so much from user+sys time. In a compute-bound single-threaded application you would usually expect real time to be roughly the same as user+sys time (user time is the time the application uses the CPU to do the computation; sys time is the time the kernel actively does something, like file system overhead). I would expect real time to differ if either the application doesn't get CPU time when it wants the CPU, or the system is waiting for I/O, but the top snapshot doesn't seem to show that.
Maybe you could additionally run vmstat 1 in parallel and show that output. Using /usr/bin/time -v (instead of just time) would also include a bit more information.
@cmeerw Wow! Thanks for introducing me to GNU time and vmstat! What follows might be the output you requested. I have to study up a lot to understand the output. If you want something more or something different, please let me know. Thanks again for helping!
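For reference, a minimal way to set up what cmeerw describes (assuming GNU time is installed as /usr/bin/time; the log file name is arbitrary):
vmstat 1 > vmstat.log &   # sample once per second into a log file
/usr/bin/time -v sha256sum -c chronos-20240904.tgz.cpt.SHA256
kill %1                   # stop the background vmstat afterwards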
The vmstat output shows CPU steal time of around 25% (st column).
Oops, sorry: the ~25% is actually the CPU wait time (wa column); there's a slight shift in the column alignment of the vmstat output.
So in this case it's spending half the time waiting for I/O (not sure why the top snapshot didn't show that).
I am assuming the E3, with 64 GB RAM, has all the file contents already cached, so it doesn't need to do any I/O.
You could try re-running the tests on the E3, but clearing the cache beforehand; e.g., see https://unix.stackexchange.com/questions/87908/how-do-you-empty-the-buffers-and-cache-on-a-linux-system
I would expect user and system times to remain roughly the same, but real time to go way up (as it will also have to wait for I/O now).
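A minimal sketch of that, following the drop_caches approach from the linked answer (run as root on the E3; file name from the earlier runs):
sync                                # flush dirty pages to disk first
echo 3 > /proc/sys/vm/drop_caches   # drop the page cache, dentries and inodes
time sha256sum -c chronos-20240904.tgz.cpt.SHA256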
@tmntwitw @cmeerw I just want to say that I really appreciate you guys commenting here and on my other threads. I need to learn more, and comments from you guys and others like you are super helpful to me. So thanks again! Much appreciated! Thanks also to LES, our platform, which makes Low End Learning possible!
The numbers in the I/O wait ("wa") second-to-last column are mostly in the twenties. So why do you say "half"? Shouldn't we say "spending a quarter of the time waiting for I/O"?
Also, is there something about sha256sum -c that would make it, as compared with other programs, more likely to succumb to or even create I/O wait issues?
I suppose I could run fio with vmstat 1 and see whether I/O wait issues also plague fio?
Interesting!
Thanks again, @cmeerw!
Those are percentages of total CPU time. You have two cores, but one of them is idle (the idle column shows roughly 50 %), and the remaining 50 % is split into roughly 24 % user time, 2 % system time, and 24 % wait. So one core is only doing real work half the time and is waiting for I/O the other half of the time. This is also consistent with what time tells you above: "Percent of CPU this job got: 52%" (although that percentage is based on single cores, so for multi-threaded applications it could go higher than 100 %).
So yes, this can get confusing - and other Unixes might do things differently. If I remember correctly, on HP-UX a load average of 1 meant that all CPU cores were just kept busy, whereas on Linux it means that only 1 CPU core is kept busy.
Most Unix utilities are single-threaded and do synchronous I/O, so you will see similar things. Ideally, they could do asynchronous I/O, so they could tell the OS to read some data while they are doing real work - and once they are done with a block of data, the next block is already available to continue doing their processing.
BTW, another thing you could do is use dd to read the file (with a bigger block size) and see if that reduces I/O wait (sha256sum seems to read in chunks of only 32 kB). Unfortunately, you can't do that with the -c option, but you can still just try piping the file through dd into sha256sum; see the sketch a little further down.
Okay, got it from your saying that the percentages are of total CPU time and thus need to be doubled because there are two CPUs. Thank you!
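A minimal version of the pipeline cmeerw is suggesting, using the 4.6 GB file from the earlier runs (the 1M block size is just an example):
time dd if=chronos-20240904.tgz.cpt bs=1M | sha256sum   # prints the hash followed by "-"
cat chronos-20240904.tgz.cpt.SHA256                     # compare against the stored hash by eye
Because dd and sha256sum are separate processes connected by a pipe, dd can be reading the next chunk while sha256sum is still hashing the previous one, which is roughly the overlap of I/O and compute described above.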
@cmeerw Now it's getting to be almost too much fun!
Did another run to test consistency.
So, chunk size is the answer, at least on the Ryzen VPS?
I can see how the E3 could just keep on reading because the E3 has plenty of memory. But the VPS, even without much memory, seems to work fast when we use a 1M block size with dd. So it seems that the chunk size issue is at least partly independent of overall memory size. Maybe the root of the issue is not the memory size but, instead, the increased number of read operations when the chunk size is small? How is the chunk size issue (memory size or number of operations, or maybe both) avoided on the E3 Dedi?
Thanks yet again!
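If it helps, one way to see the read size directly (assuming strace is installed) is to trace the read() calls; the third argument of each call is the buffer size:
strace -e trace=read -o /tmp/sha256sum-reads.txt sha256sum chronos-20240904.tgz.cpt   # noticeably slower under strace
head -n 20 /tmp/sha256sum-reads.txt   # expect lines like: read(3, "..."..., 32768) = 32768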
From the OP:
Now, on the Ryzen 2 vCore VPS, using dd to increase the block size for sha256sum.
This run looks quite a bit faster than the E3 Dedi.
But this one is almost twice as long. And not much faster than the E3 Dedi. Why the inconsistency?
2 0 0 129352 1876 3752600 0 0 99928 0 896 737 33 2 50 16 0
What's this about the 16 in the wait column?
I added some swap to the Ryzen KVM VPS.
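For anyone curious, a typical way to do that looks like the following; the 2G size and the /swapfile path are just placeholders, not necessarily what was used here:
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
swapon --show   # confirm the swap is active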
I wanted to see whether another file I/O program would show I/O wait like sha256sum. So, just for some quick fun, I ran a yabs, which calls fio. I ran vmstat while the Yabs was running. Note that the interval for this vmstat was 6 seconds instead of 1 second. This time I ran vmstat in the background instead of in a separate terminal. Here's the Yabs result, and I will post the vmstat results in a moment.
6 1 2 572 138928 1876 3746624 0 0 49 240391 844 1028 2 4 51 43 0
7 2 0 572 138928 1876 3746624 0 0 785761 335577 42194 20746 4 21 52 22 0
Note that the time interval is 6 seconds; as mentioned above, vmstat ran in the background this time.
Looks like these two lines, 6 and 7, show significant I/O wait: 43% and 22%. So it seems that both fio and sha256sum experience significant I/O wait. Why?
The whole point of fio is to do I/O at full speed, so it really is just waiting for the OS to do that I/O.
The other option would be to just not show that column at all (like some other Unixes do) and count it as "CPU idle" (as the CPU itself isn't doing anything). But then you would probably wonder why the system shows as mostly idle when there is work to do and you have an I/O bottleneck.