Can C# catch up with C?
The modern programming community is split into two camps: those who love memory-managed programming languages and those who do not. The two camps argue vehemently, squabbling over the advantages of this or that aspect of programming. Languages with unmanaged memory are presented as faster and offering more control, while managed-memory languages are considered more convenient for development, their lag in execution speed and memory consumption dismissed as insignificant. In this article we will check whether that is actually true. The old-school side will be represented by a mastodon of the development world: C.
The latest-generation side will be represented by C#.
The article is for informational purposes only and does not claim to be a comprehensive comparison. No exhaustive testing will be carried out, but the tests given here can be repeated by any developer on their own machine.
Details
Both languages participate in their latest LTS versions at the time of writing.
C = gcc (Ubuntu 13.2.0-23ubuntu4) 13.2.0
C# = C# 12, .NET 8.0
A machine running the Linux operating system will be used for the comparison:
Operating System: Ubuntu 24.04.1 LTS
Kernel: Linux 6.8.0-48-generic
Architecture: x86-64
CPU
*-cpu
description: CPU
product: AMD Ryzen 7 3800X 8-Core Processor
vendor: Advanced Micro Devices [AMD]
physical id: 15
bus info: cpu@0
version: 23.113.0
serial: Unknown
slot: AM4
size: 2200MHz
capacity: 4558MHz
width: 64 bits
clock: 100MHz
capabilities: lm fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp x86-64 constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es cpufreq
configuration: cores=8 enabledcores=8 microcode=141561889 threads=16
Memory
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.
Handle 0x000F, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: None
Maximum Capacity: 128 GB
Error Information Handle: 0x000E
Number Of Devices: 4
Handle 0x0017, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x000F
Error Information Handle: 0x0016
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: Unknown
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL A
Type: Unknown
Type Detail: Unknown
Handle 0x0019, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x000F
Error Information Handle: 0x0018
Total Width: 64 bits
Data Width: 64 bits
Size: 16 GB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL A
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 3200 MT/s
Manufacturer: Unknown
Serial Number: 12030387
Asset Tag: Not Specified
Part Number: PSD416G320081
Rank: 1
Configured Memory Speed: 3200 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 6, Hex 0x02
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 16 GB
Cache Size: None
Logical Size: None
Handle 0x001C, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x000F
Error Information Handle: 0x001B
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: Unknown
Set: None
Locator: DIMM 0
Bank Locator: P0 CHANNEL B
Type: Unknown
Type Detail: Unknown
Handle 0x001E, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x000F
Error Information Handle: 0x001D
Total Width: 64 bits
Data Width: 64 bits
Size: 16 GB
Form Factor: DIMM
Set: None
Locator: DIMM 1
Bank Locator: P0 CHANNEL B
Type: DDR4
Type Detail: Synchronous Unbuffered (Unregistered)
Speed: 3200 MT/s
Manufacturer: Unknown
Serial Number: 120304DD
Asset Tag: Not Specified
Part Number: PSD416G320081
Rank: 1
Configured Memory Speed: 3200 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: Unknown
Module Manufacturer ID: Bank 6, Hex 0x02
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 16 GB
Cache Size: None
Logical Size: None
Since the main difference between the two languages is managed memory, we will look at how they handle memory; specifically, at the speed of writing to RAM.
There are many tests that could expose the difference, but for our purposes we will fill a contiguous 1 GB block of memory.
In C this will be a contiguous block of unmanaged memory obtained with malloc; in C# we will consider both an array on the managed heap and a block of unmanaged memory in the process address space.
C# allows us to work with unmanaged memory.
What could cause the difference in the execution time of this operation?
The code we compare will eventually turn into processor instructions that the processor executes. With C, we understand that the compiler can optimize the code we write. With C#, the situation is more complicated: under normal circumstances the code is first compiled to the CIL intermediate language, which is then compiled just in time (JIT) into the set of instructions that actually execute. The code can be optimized at both stages.
It is the comparison of these optimizations of two programming languages that is interesting to us.
However, in addition to these optimizations, the execution time of our code can be influenced by a large number of factors, for example, the implementation features of the processor itself.
Test No. 1
First, let's look at the situation without optimizations.
We will write iteratively, 1 byte at a time. The code is slightly more complex than the test strictly requires; this is done so that its timings can be compared with the other results in this article.
Let's run the C code first, compiling it without asking the compiler to apply any optimizations (for example, gcc seq_write.c -o seq_write, assuming the source file is named seq_write.c).
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <stddef.h>

#define MEMSIZE (1l << 30)
#define CLOCK_IN_MS (CLOCKS_PER_SEC / 1000)
#define ITERATIONS 10

int main(int argc, char **argv)
{
    const size_t mem_size = MEMSIZE;
    const size_t cache_line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    clock_t start_clock;
    long diff_ms = 0;
    char *mem, *arr, *stop_addr, *ix_line;
    ptrdiff_t ix_char = 0;
    const char c = 1;
    int iter = 0;
    const int iter_count = ITERATIONS;

    printf("memsize=%zxh sizeof(size_t)=%zx cache_line=%zu\n",
           mem_size, sizeof(mem_size), cache_line_size);

    if (!(mem = malloc(mem_size + cache_line_size))) {
        fprintf(stderr, "unable to allocate memory\n");
        return -1;
    }
    /* Align the working pointer to the start of a cache line. */
    arr = mem + cache_line_size - (long)mem % cache_line_size;
    stop_addr = arr + mem_size;

    for (iter = 0; iter < iter_count; ++iter) {
        start_clock = clock();
        /* Walk the block line by line, writing each byte in order. */
        for (ix_line = arr; ix_line < stop_addr; ix_line += cache_line_size) {
            for (ix_char = 0; ix_char < cache_line_size; ++ix_char) {
                *(ix_line + ix_char) = c;
            }
        }
        diff_ms = (clock() - start_clock) / CLOCK_IN_MS;
        printf("iter=%d seq time=%ld\n", iter, diff_ms);
    }

    free(mem);
    return 0;
}
Results:
Average time: 2700 ms
iter=0 seq time=2177
iter=1 seq time=2765
iter=2 seq time=2765
iter=3 seq time=2797
iter=4 seq time=2781
iter=5 seq time=2743
iter=6 seq time=2791
iter=7 seq time=2743
iter=8 seq time=2695
iter=9 seq time=2739
The steady-state time is somewhat longer than the stated average, since the unusually fast first iteration drags the average down.
Now let's look at C# and an array on the heap
using System.Diagnostics;

const int typicalIterationsCount = 10;
const int arraySize = 1073741824;
const int lineLength = 64;
const int linesCount = arraySize / lineLength;

var tmpArray = new bool[arraySize];
for (var iteration = 0; iteration < typicalIterationsCount; ++iteration)
{
    var watch = new Stopwatch();
    watch.Start();
    for (long i = 0; i < linesCount; ++i)
    {
        for (long j = 0; j < lineLength; ++j)
        {
            tmpArray[i * lineLength + j] = true;
        }
    }
    watch.Stop();
    // Allocate a fresh array so the next iteration writes to untouched pages.
    tmpArray = new bool[arraySize];
    Console.WriteLine($"iter={iteration} seq time={watch.ElapsedMilliseconds}");
}
Results:
Average time: 446 ms
iter=0 seq time=764
iter=1 seq time=766
iter=2 seq time=362
iter=3 seq time=362
iter=4 seq time=369
iter=5 seq time=362
iter=6 seq time=364
iter=7 seq time=372
iter=8 seq time=368
iter=9 seq time=370
In fact, the steady-state time is lower than the stated average, since the slow first two iterations contribute disproportionately; with more iterations the average would decrease.
Now let's look at unmanaged memory in C#.
To work with pointers in C#, you need to mark a block of code with the unsafe keyword, and also add a property to the .csproj file stating that the assembly contains such code.
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
  </PropertyGroup>
  <PropertyGroup>
    <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
  </PropertyGroup>
</Project>
using System.Diagnostics;
using System.Runtime.InteropServices;

unsafe
{
    const int typicalIterationsCount = 10;
    const int arraySize = 1073741824;
    const int lineLength = 64;
    const int linesCount = arraySize / lineLength;

    for (var iteration = 0; iteration < typicalIterationsCount; ++iteration)
    {
        // A fresh native block each iteration, so every pass writes to untouched pages.
        bool* buffer = (bool*)NativeMemory.Alloc((nuint)arraySize, (nuint)sizeof(bool));
        var writePtr = buffer;
        var watch = new Stopwatch();
        watch.Start();
        for (long i = 0; i < linesCount; ++i)
        {
            for (long j = 0; j < lineLength; ++j)
            {
                *writePtr = true;
                ++writePtr;
            }
        }
        watch.Stop();
        NativeMemory.Free(buffer);
        Console.WriteLine($"iter={iteration} seq time={watch.ElapsedMilliseconds}");
    }
}
Results:
Average time: 691 ms
iter=0 seq time=696
iter=1 seq time=704
iter=2 seq time=694
iter=3 seq time=689
iter=4 seq time=686
iter=5 seq time=696
iter=6 seq time=684
iter=7 seq time=692
iter=8 seq time=685
iter=9 seq time=688
With no special optimizations applied, C lost the speed race: roughly 7 times slower than C# writing to a heap array and roughly 4 times slower than C# writing to unmanaged memory (comparing steady-state iterations). The results are already interesting.
Test No. 2
Now let's compile the C code with the maximum possible optimizations, passing gcc the flags "-Wall -O4" (gcc accepts -O4 but treats anything above -O3 as -O3, its highest standard optimization level).
Results:
Average time: 118 ms
iter=0 seq time=448
iter=1 seq time=81
iter=2 seq time=82
iter=3 seq time=83
iter=4 seq time=82
iter=5 seq time=82
iter=6 seq time=82
iter=7 seq time=81
iter=8 seq time=81
iter=9 seq time=82
The average is higher than the steady-state time because the long first iteration contributes heavily. The first pass is slow because the operating system actually backs the allocation with physical pages only when they are first written.
As expected, the optimized C code shows impressive results.
But so far these results are only impressive against unoptimized C# code.
Let's try to enable optimizations in C# when working with an array on the heap.
To do this, add a property to the .csproj file that turns on compiler optimizations (note that building with the Release configuration, dotnet build -c Release, enables them by default):
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
  </PropertyGroup>
  <PropertyGroup>
    <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
  </PropertyGroup>
  <PropertyGroup>
    <Optimize>true</Optimize>
  </PropertyGroup>
</Project>
Results:
Average time: 603 ms
iter=0 seq time=953
iter=1 seq time=948
iter=2 seq time=515
iter=3 seq time=522
iter=4 seq time=520
iter=5 seq time=517
iter=6 seq time=516
iter=7 seq time=520
iter=8 seq time=507
iter=9 seq time=510
Let's try to use optimizations in C# when working with unmanaged memory
Results:
Average time: 694 ms
iter=0 seq time=690
iter=1 seq time=687
iter=2 seq time=686
iter=3 seq time=694
iter=4 seq time=691
iter=5 seq time=702
iter=6 seq time=697
iter=7 seq time=704
iter=8 seq time=695
iter=9 seq time=695
As you can see, telling the C# compiler that the code should be optimized does not improve the results.
Maybe it's a matter of JIT compilation? Recent versions of .NET support ahead-of-time (AOT) compilation.
Test No. 3
Let's try to compile the C# code natively for our machine (via dotnet publish).
The resulting binary does not require the dotnet runtime to execute.
For this, the .csproj must contain a section enabling native publication:
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
  </PropertyGroup>
  <PropertyGroup>
    <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
  </PropertyGroup>
  <PropertyGroup>
    <Optimize>true</Optimize>
  </PropertyGroup>
  <PropertyGroup>
    <PublishAot>true</PublishAot>
    <OptimizationPreference>Speed</OptimizationPreference>
  </PropertyGroup>
</Project>
Results for arrays on the heap:
Average time: 548 ms
iter=0 seq time=932
iter=1 seq time=905
iter=2 seq time=453
iter=3 seq time=450
iter=4 seq time=453
iter=5 seq time=464
iter=6 seq time=452
iter=7 seq time=459
iter=8 seq time=452
iter=9 seq time=456
The first two iterations again greatly influence the result.
Results for unmanaged memory:
Average time: 827 ms
iter=0 seq time=822
iter=1 seq time=822
iter=2 seq time=828
iter=3 seq time=829
iter=4 seq time=826
iter=5 seq time=828
iter=6 seq time=827
iter=7 seq time=829
iter=8 seq time=831
iter=9 seq time=826
There is no performance gain either.
Conclusion
In sequential writes to RAM, C# loses to C by roughly a factor of 8. This is because the C compiler's optimizations outclass the optimizations C# code receives on its way to machine code. However, these optimizations are useless when writing to memory non-sequentially, as the next test will show. Extraneous factors, such as the physical implementation of the processor, affect many operations more strongly than the difference between programs written in the two languages.
A little theory
The central element of a modern computer is the processor. The processor's caches are organized into cache lines: contiguous chunks of memory (64 bytes on this CPU) into which the data the processor works with is loaded. Loading a cache line is a fairly expensive operation, so such loads should be minimized where possible. We can assume that when filling a block of RAM with sequential writes, the number of cache-line loads and subsequent write-backs to RAM is minimal, and that with non-sequential writes, where each successive write touches a different cache line, it is maximal.
So let's do the following test.
Test No. 4
Let's look at C code that writes to memory non-sequentially.
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <stddef.h>

#define MEMSIZE (1l << 30)
#define CLOCK_IN_MS (CLOCKS_PER_SEC / 1000)
#define ITERATIONS 10

int main(int argc, char **argv)
{
    const size_t mem_size = MEMSIZE;
    const size_t cache_line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    clock_t start_clock;
    long diff_ms = 0;
    char *mem, *arr, *stop_addr, *ix_line;
    ptrdiff_t ix_char = 0;
    const char c = 1;
    int iter = 0;
    const int iter_count = ITERATIONS;

    printf("memsize=%zxh sizeof(size_t)=%zx cache_line=%zu\n",
           mem_size, sizeof(mem_size), cache_line_size);

    if (!(mem = malloc(mem_size + cache_line_size))) {
        fprintf(stderr, "unable to allocate memory\n");
        return -1;
    }
    /* Align the working pointer to the start of a cache line. */
    arr = mem + cache_line_size - (long)mem % cache_line_size;
    stop_addr = arr + mem_size;

    for (iter = 0; iter < iter_count; ++iter) {
        start_clock = clock();
        /* Loops swapped vs. the sequential test: for each byte offset,
           jump from cache line to cache line across the whole block. */
        for (ix_char = 0; ix_char < cache_line_size; ++ix_char) {
            for (ix_line = arr; ix_line < stop_addr; ix_line += cache_line_size) {
                *(ix_line + ix_char) = c;
            }
        }
        diff_ms = (clock() - start_clock) / CLOCK_IN_MS;
        printf("iter=%d unseq time=%ld\n", iter, diff_ms);
    }

    free(mem);
    return 0;
}
Average time: 5188 ms
iter=0 unseq time=5521
iter=1 unseq time=5122
iter=2 unseq time=5110
iter=3 unseq time=5160
iter=4 unseq time=5130
iter=5 unseq time=5124
iter=6 unseq time=5170
iter=7 unseq time=5181
iter=8 unseq time=5195
iter=9 unseq time=5163
Average time of the optimized version: 5735 ms
iter=0 unseq time=6067
iter=1 unseq time=5694
iter=2 unseq time=5704
iter=3 unseq time=5695
iter=4 unseq time=5692
iter=5 unseq time=5695
iter=6 unseq time=5707
iter=7 unseq time=5698
iter=8 unseq time=5704
iter=9 unseq time=5691
Non-sequential access in C#: array on the heap
using System.Diagnostics;

const int typicalIterationsCount = 10;
const int arraySize = 1073741824;
const int lineLength = 64;
const int linesCount = arraySize / lineLength;

var tmpArray = new bool[arraySize];
for (var iteration = 0; iteration < typicalIterationsCount; ++iteration)
{
    var watch = new Stopwatch();
    watch.Start();
    for (long i = 0; i < lineLength; ++i)
    {
        var currentLineStart = 0;
        // Jump a whole cache line at each step, writing one byte per line.
        for (long j = 0; j < linesCount; ++j)
        {
            tmpArray[currentLineStart + i] = true;
            currentLineStart += lineLength;
        }
    }
    watch.Stop();
    Console.WriteLine($"iter={iteration} seq time={watch.ElapsedMilliseconds}");
}
Results:
Average time: 5647 ms
iter=0 seq time=5969
iter=1 seq time=5637
iter=2 seq time=5568
iter=3 seq time=5618
iter=4 seq time=5568
iter=5 seq time=5617
iter=6 seq time=5623
iter=7 seq time=5637
iter=8 seq time=5626
iter=9 seq time=5608
Non-sequential access in C#: unmanaged memory
using System.Diagnostics;
using System.Runtime.InteropServices;

unsafe
{
    const int typicalIterationsCount = 10;
    const int arraySize = 1073741824;
    const int lineLength = 64;
    const int linesCount = arraySize / lineLength;

    for (var iteration = 0; iteration < typicalIterationsCount; ++iteration)
    {
        bool* buffer = (bool*)NativeMemory.Alloc((nuint)arraySize, (nuint)sizeof(bool));
        var watch = new Stopwatch();
        watch.Start();
        for (long i = 0; i < lineLength; ++i)
        {
            // Start at offset i and jump a whole cache line at each step.
            var writePtr = buffer + i;
            for (long j = 0; j < linesCount; ++j)
            {
                *writePtr = true;
                writePtr += lineLength;
            }
        }
        watch.Stop();
        NativeMemory.Free(buffer);
        Console.WriteLine($"iter={iteration} seq time={watch.ElapsedMilliseconds}");
    }
}
Results:
Average time: 6145 ms
iter=0 seq time=6166
iter=1 seq time=6160
iter=2 seq time=6142
iter=3 seq time=6135
iter=4 seq time=6152
iter=5 seq time=6130
iter=6 seq time=6120
iter=7 seq time=6160
iter=8 seq time=6138
iter=9 seq time=6142
The implementations were deliberately chosen so that differences in arithmetic operations would not affect the execution time.
PS: This is my first attempt at writing an article like this, so please don't judge the rough edges too harshly.