Topic: Meat and potatoes / fish and chips machine code (Read 2519 times)

Weaver · « **on:** July 19, 2020, 09:50:35 AM »

I read a minor rant by Linus Torvalds about how he hates Intel AVX512. I agree with him wholeheartedly, although not quite for the same reasons: I hate the fact that Intel is making developers’ lives a misery by fragmenting the product range into processors that have different AVX512 modules and processors that have no support at all. What are we supposed to do to handle all that complexity? Why the hell bother?

I definitely agree with Linus’ point about getting back to concentrating on meat and potatoes (or is it fish and chips?) everyday integer code, plain arithmetic and logic. We need a war on the cost of fighting our way through the forest of everyday tasks that is logic, string processing and conditional jumps.

My wish list :

more execution ports
better string operations - use some imagination- more innovative insns can be thought up - study real code
fast pcmpestri and pcmpistri for AVX2 with YMM regs (assuming they aren’t supported already);
AMD made the loop and jrcxz instructions fast again; Intel should finally do the same, although we will then have to tell compilers some day that they can dare to use them once again, and it will take a decade for things to filter through
a double compare instruction cond = ( x >= lo && x <= hi ); yes I know you can do x - lo <= hi - lo if you use unsigned arithmetic, but that isn’t always so easy, and in any case it’s not fast enough: it alters at least one register doing the subtraction, and it isn’t one clock but two subtracts and the jmp
want a CMOVcc with an imm
combined cond moves (like CMOVcc) which do not use the flags register but combine a comparison and a conditional move together into one instruction with four operands- two for the comparison, then a dest and a source. Doesn’t use the flags so no waiting on the flags nor do the flags become a bottleneck preventing ILP.
as above but with a conditional jump, that is a comparison and a jmp combined into one, like generalised jcxz but blazingly fast and it doesn’t use the flags
whatever happened to bitfield operations, like VAX insv / extv? Wasn’t there a pair of similar-sounding insns on the first 386dx processors which were then withdrawn, is that right?
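The unsigned-subtract trick mentioned in the double-compare item can be sketched roughly as follows. Python integers never wrap, so this emulates 64-bit unsigned arithmetic with a mask; the names `MASK64`, `in_range_naive` and `in_range_unsigned` are made up for illustration, and the trick assumes lo <= hi.

```python
MASK64 = (1 << 64) - 1  # assumption: emulate 64-bit unsigned register wraparound


def in_range_naive(x, lo, hi):
    """Two comparisons, as in cond = (x >= lo && x <= hi)."""
    return lo <= x <= hi


def in_range_unsigned(x, lo, hi):
    """One unsigned comparison after two subtractions; valid when lo <= hi.

    If x < lo, (x - lo) wraps round to a huge unsigned value, so the
    single unsigned compare rejects it without a separate lower-bound test.
    """
    return ((x - lo) & MASK64) <= ((hi - lo) & MASK64)
```

A typical use would be a character-class test such as `in_range_unsigned(ord(c), ord("0"), ord("9"))`. When lo and hi are constants, hi - lo can be folded at compile time, which leaves one subtract and one compare-and-branch.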

I’m going to keep on expanding the list. I can keep coming up with more and more as I look around. I’m thinking about using the XMM/YMM registers for _single_ plain 64-bit and 32-bit integers, with greater ease of to-ing and fro-ing between those xMM registers and the traditional integer register set. That would relieve register pressure on the integer registers by allowing them to be spilled into the xMM registers (and their halves too, to make better use of the space), as an alternative to spilling registers to the stack.

Other thoughts - I wonder if there is a market for 128-bit scalar integer operations? I also wonder about the costs and benefits of going up to say 32 or 64 integer registers. (HOW to handle the transition when an o/s has to save all the registers and know about the changes to stack frame layout caused by that. - It was only like going to AVX2 YMM and then AVX512 ZMM regs I suppose - it has been handled before?)

I wonder if it’s time for a completely new faster byte encoding of x86-64 v2.0 instructions - so assembler source compatible but binary incompatible and you need a fast byte-stream retranslator when you load a trad encoding exe into memory so old bytes get transformed in ram on the fly, JIT fashion, by an o/s exe loader. Should be done in such a fashion as to be able to split up the workload onto multiple cores, split the code of the exe up into n portions. We need to get rid of all the REX prefixes and the decoder complexity needs to go in the bin, thus more speed, less power consumption. And get rid of the code size penalty for using r8-r15. If you were going for some of these other wish-list items, such as lots more new instructions, or - even more so - in the case of a change to far more regs then new flexibility brought by new encodings would be a blessing.

I didn’t notice the date on the article so this might be very old news: Linus has bought himself an AMD Ryzen so for the first time in 15 years has forsaken Intel. Good for him, and he says it’s three times faster than his previous box. I don’t know how much of his build jobs is CPU-bound, how much RAM-i/o bound and how much disk-bound. I wonder if Linus has a parallel make - I was looking round for one some while back and the only one I could find cost a fortune.

That makes me think about Microsoft’s MSIL: what is their compiler from MSIL into machine code called (as in C# to machine code), and how does it work? I forget.

burakkucat · « **Reply #1 on:** July 19, 2020, 06:29:16 PM »

GNU "make" has the -j (job server) flag and so the regular kernel builds, instigated from "The Cattery", all occur with a -j8 flag and argument.

A few months ago an experiment was performed with one of the build systems that I routinely use (AMD processor, 8 cpus). A series of builds, using the same target, were performed starting with -j1 and ending with -j64. A table of the maximum number of "make" jobs versus the elapsed time for the build was drawn up.

| Maximum number of jobs | Log (base 10) maximum number of jobs | Elapsed time (seconds) | Log (base 10) elapsed time (seconds) |
|---|---|---|---|
| 1 | 0.000000000 | 6647 | 3.822625679 |
| 2 | 0.301029995 | 3628 | 3.559667278 |
| 4 | 0.602059991 | 2136 | 3.329601248 |
| 8 | 0.903089987 | 1338 | 3.126456113 |
| 16 | 1.204119983 | 1380 | 3.139879086 |
| 24 | 1.380211242 | 1397 | 3.145196406 |
| 32 | 1.505149978 | 1388 | 3.142389466 |
| 40 | 1.602059991 | 1402 | 3.146748014 |
| 48 | 1.681241237 | 1400 | 3.146128036 |
| 56 | 1.748188027 | 1398 | 3.145507171 |
| 64 | 1.806179974 | 1402 | 3.146748014 |

The results were plotted in four separate views; linear - linear, linear - logarithmic, logarithmic - linear and logarithmic - logarithmic. A minimum in the elapsed time was observed when the maximum number of "make" jobs was equal to the number of cpus.
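
As a rough aid to reading those figures, the speedup and parallel efficiency relative to the -j1 run can be worked out from the elapsed times. This is only a sketch: the numbers are copied from the post, the names `ELAPSED`, `CPUS` and `speedup_table` are illustrative, and "efficiency" here simply means speedup divided by the number of cpus that could be kept busy.

```python
# Elapsed build times in seconds, keyed by the -j value, copied from the post.
ELAPSED = {1: 6647, 2: 3628, 4: 2136, 8: 1338, 16: 1380, 24: 1397,
           32: 1388, 40: 1402, 48: 1400, 56: 1398, 64: 1402}

CPUS = 8  # number of cpus reported for the build machine


def speedup_table(elapsed, cpus):
    """Return rows of (jobs, speedup vs -j1, efficiency per usable cpu)."""
    base = elapsed[1]
    rows = []
    for jobs in sorted(elapsed):
        speedup = base / elapsed[jobs]
        # Jobs beyond the cpu count cannot all run at once, so cap the divisor.
        efficiency = speedup / min(jobs, cpus)
        rows.append((jobs, speedup, efficiency))
    return rows


if __name__ == "__main__":
    for jobs, speedup, efficiency in speedup_table(ELAPSED, CPUS):
        print(f"-j{jobs:<3} speedup {speedup:5.2f}  efficiency {efficiency:4.0%}")
```

On these figures -j8 gives a speedup of a little under 5x over -j1, and going past the cpu count costs only about 3 to 5 percent in elapsed time.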

I think Linus routinely builds kernels using a -j28 flag and argument to "make".

Weaver · « **Reply #2 on:** July 21, 2020, 10:41:43 AM »

Assuming there is limitless (!) RAM available, I would have thought that the number of jobs should be something like 2 * num of cores, but there’s the speed of RAM to consider, whether hyperthreading is in use, and how effective file system caching is in practice. I’m pleased to see that -j16 is not too bad though.

I’m glad that Linus has seen some serious improvement, although I don’t know how much of that was due to overall system improvement, how much to the baseline comparison with the previous system, and how much is raw CPU. I feel as if Intel has had a good part of a decade of stagnation. I ask myself how we get to ambitious single-core breakthroughs: how to get every integer instruction, including the neglected ones, down to the minimum possible cycle count, ideally sub-1-cycle. How do we get to seriously ambitious targets? What about single core 10GHz at current cycle counts (so no cheating by increasing the number of cycles per insn) to be reached in one decade with appropriate cooling - a JFK moonshot target ?

niemand · « **Reply #3 on:** August 04, 2020, 12:31:26 PM »

Quote from: Weaver on July 21, 2020, 10:41:43 AM

What about single core 10GHz at current cycle counts (so no cheating by increasing the number of cycles per insn) to be reached in one decade with appropriate cooling - a JFK moonshot target ?

The world record is 8.7 GHz and was set years ago so with lower feature size and fewer cores on a die to reduce heat that's feasible even without more exotic things like use of graphene.

It would, however, require focus on that specific target. Maybe Intel / AMD would do it for the e-peen value in the lab?

Fastest stock clock production CPU ever was a 5.5 GHz RISC CPU from IBM. RISC CPUs could I'm sure go to higher clocks than CISC simply due to the far lower transistor count required.

Alex Atkin UK · « **Reply #4 on:** August 05, 2020, 12:06:21 PM »

It will be interesting to see how fast Apple push ARM now they are migrating desktops across to it. I've not really seen any mention of what ARM is capable of when given desktop-level cooling.

Weaver · « **Reply #5 on:** August 06, 2020, 12:06:00 AM »

I’ve always been a big fan of CISC. Having to fetch in a ton of code to get anything done is a limitation on RISC’s performance and CISC can always have its microcode implementations improved later on by adding more dedicated hardware while software stays unbroken.

What were you saying about ARM and Apple, Alex? ARM is still really slow compared to x86 because the current implementations afaik don’t have the same levels of ILP, or am I out of date on that too? With Intel boxes having four-way ILP commonly and the AMD Ryzen now having five-way, that’s staggeringly fast for meat and potatoes important stuff. Agner Fog gives the Ryzen a wonderful write-up btw.

I don’t know enough about ARM AArch64 - I’m wondering if a register-register move costs 1 clock or zero clocks like on Intel. Zero clock operations such as reg-reg move and addressing mode calculation as part of a load/store and the macrofusion of cmp or sub + jmp into one instruction, so losing one of the two instructions in the pair; those are all very impressive developments and I don’t know if ARM has any equivalents of zero-clock ops or macrofusion.

flilot · « **Reply #6 on:** August 06, 2020, 01:16:56 AM »

Quote from: Weaver on August 06, 2020, 12:06:00 AM

What were you saying about ARM and Apple, Alex? ARM is still really slow compared to x86 because the current implementations afaik don’t have the same levels of ILP, or am I out of date on that too? With Intel boxes having four-way ILP commonly and the AMD Ryzen now having five-way, that’s staggeringly fast for meat and potatoes important stuff. Agner Fog gives the Ryzen a wonderful write-up btw.

I'd take a look at this:

https://youtu.be/GEZhD3J89ZE?t=5210 (from 1:26:50 in the video if the link doesn't take you to that timeframe directly).
What Apple is doing with their own ARM based silicon is astounding, and the performance is incredible - desktop class. The video shows you Pro apps running natively on macOS on their own ARM based silicon. It's likely to make your jaw drop once you realise the implications for Intel and the like.

Alex Atkin UK · « **Reply #7 on:** August 06, 2020, 06:36:30 AM »

Quote

When we make bold changes it's for one simple but powerful reason, so we can make much better products

Sorry but that made me laugh, considering the stupid hardware mistakes and deliberate right to repair breaking changes they have made in the past. Making it so third parties cannot repair their devices does not a better product make!

More like:

Quote

When we make bold changes it's for one simple but powerful reason, so we can make more profit

Which honestly, they're a business, fair enough, but don't bullshit about it!

What rubs me up the wrong way about this stuff is that for decades, Apple have been allowed to do anti-competitive things, because they are the underdog. A lot of their software integrations, Microsoft were simply not allowed to do. I understand why, but it seems flawed to me as all it's done is stagnate Windows innovation and allow Apple to develop more compelling solutions.

I mean sure, Microsoft made a lot of mistakes too, but it's hardly surprising when they were constantly walking on eggshells with the likes of the EU, who fined them whenever they tried to do something like what Apple has done regarding integrating everything in the OS.

flilot · « **Reply #8 on:** August 06, 2020, 04:46:05 PM »

Quote from: Alex Atkin UK on August 06, 2020, 06:36:30 AM

Sorry but that made me laugh, considering the stupid hardware mistakes and deliberate right to repair breaking changes they have made in the past. Making it so third parties cannot repair their devices does not a better product make!

Indeed. I don't listen to the waffle of these rich polished Americans, they are so over dramatic and arrogant.
Unfortunately you have to get through the waffle to see the demonstrations in the video, which I still maintain are compelling. Desktop class ARM based silicon is going to create a shift, and Intel need to pull their socks up before they are relegated to third in the desktop CPU market.

Weaver · « **Reply #9 on:** October 02, 2020, 05:21:29 PM »

A couple of questions for Burakkucat as I realise I don’t understand some things in his very helpful earlier post. When you wrote "8 cpus" does that mean separate physical CPUs? Or 8 cores, or 8 threads including AMD’s 2-way hyperthreading (although I don’t think they call it that and that’s perhaps a buzzword confined to Intel only) per core?

Out of interest, is that a hosted machine? Sounds delicious.

Also, one other thing. Back there, where I wrote:
double compare instruction cond = ( x >= lo && x <= hi ); yes I know you can do x - lo <= hi - lo
I should have noted that the first two subtract instructions can be done in parallel, so they only count in time as one operation and then there is the compare+jump fused micro-op insn for the ‘<=’ for a second operation. So the relative advantage is not as straightforward as I made it out to be if the ‘hi - lo’ operation cannot be done at compile-time.

burakkucat · « **Reply #10 on:** October 02, 2020, 05:28:39 PM »

Hmm . . . I'll have to ask. (As I don't really understand the technology.)

No, not hosted. It is a physical system, owned by one of my colleagues who is based in New Jersey, USA.

Weaver · « **Reply #11 on:** October 02, 2020, 06:52:12 PM »

As I’m sure you know ‘cores’ means multiple processors inside one physical chip. Apologies most sincerely if the following War and Peace explanation of hyperthreading is all old hat to you:

Hyperthreading is where a core within a CPU can run two operating system threads simultaneously in hardware, albeit with each thread having to stop and wait for the other thread again and again, because processor components are shared and so there is constant competition for the use of them. In hyperthreading, each hardware thread is a ‘hardware struct’ (if you like) which contains a complete record of the current state of that hyperthread ‘side’ of the processor nanosecond by nanosecond as it executes its one associated thread. There will be two such ‘hw-structs’ per core if hyperthreading is in use. The ‘struct’ will contain: all of the registers including the program counter or instruction pointer, whichever you call it; the flags register(s); and the stack pointer(s), but most of the critical hardware will have to be shared, hence the constant contention delays. The o/s will maintain (at least) one stack in RAM per thread, so there will be at least two stacks per core in a hyperthreading system.

There will be separate per-thread instruction decoders and addressing mode decoding units, but I’m not sure about branch prediction units, I would very much hope there is one per thread. ALUs and hardware components such as multiply units I would expect to be shared between the two threads of a core. I have no idea what happens in particular micro architectures about register renaming and sharing of underlying physical registers between threads.

Micro ops will be generated and will get poured into execution queues in such a way that if one thread is stalled for some reason, such as waiting on a memory fetch or waiting on the execution of a long-winded operation such as a DIV, then the other thread may be able to get its micro-ops executed to fill the dead time.

This is not a subject I know anything much about so please don’t take any of this as gospel; I’m just speculating based on what little I’ve read.

In contrast, separate cores will be just like physically separate chips in that they will have a complete set of independent hardware resources so they won’t have to compete for anything, the one obvious exception to this being the RAM though as some layers of the RAM hierarchy will be independent per-core possibly and some will be system wide with contention between cores so there will be bottleneck delays accessing the RAM in some situations (and these delays can be horrific).

The motivation behind hyperthreading is to keep the processor’s execution units busy as much as possible by making use of the dead time: whenever one thread is stuck waiting on something, the other thread will proceed instead if it can. Typically the combined speed improvement is only something like 5-10%, maybe 15% if you’re really lucky, depending on the particular application.

In some situations, hyperthreading can be a disaster. One example is where a thread enters a loop where it is just wasting time, either counting down, or waiting on a shared resource, or polling some hardware, or waiting for some interrupt. Spinlocks are one perfect example. Such code is evil in any event because it heats up the processor doing nothing, and the resulting temperature rise reduces clock rate in CPUs that have dynamic clock rate adjustment, sometimes termed ‘turbo’ clock rate boost. This is bad in every system, but in a hyperthreaded system the time-wasting looping thread can run flat out, preventing the other thread on that core from running or at least wasting 50% of the time when the good thread could be doing its useful work. Some processors have a ‘hint’ instruction that should be placed in such time-wasting loops, which tells the processor hardware to just go to sleep and either stop the clock or reduce the clock rate so as to reduce the wasteful heat generation, or in a hyperthreaded system to give the other thread the whole of the execution time, or almost all of it, leaving just enough so that the looping thread can poll and check for completion. If memory serves, the Intel x86 architecture added a PAUSE hint instruction with the introduction of SSE2 (or was it with the Pentium 4), which told the CPU that it was in a time-wasting loop. [The encoding of the instruction was very cleverly chosen, in such a way that the chosen byte sequence would harmlessly just do nothing on older processors, meaning that it could be deployed straightaway without having to wait for older machines to go out of use and with no worrying about backwards compatibility. Was it something like REP NOP ?]

Anyway, because occasionally hyperthreading can be horrible, operating systems try to be intelligent about how o/s threads are assigned to hardware hyperthreads on different cores. Even so, hyperthreading can in my experience be disabled, which is done by something in the BIOS settings. It is very much worth trying to carefully benchmark your most important workloads with hyperthreading enabled and with it disabled in the BIOS settings to see which is the fastest.

I have had many Intel machines that feature hyperthreading; one server had two physical CPU chips in it (not two cores, two socketed chips) and each had two threads, making four hardware threads in total. It was running Windows Server 2003, and I don’t know but I hope that that operating system had a scheduler that really understood hyperthreads and had a wise, sophisticated plan for thread allocation and evil loop management. That box died from a lightning strike, I think.

Sincerest apologies for this tome. It might be useful to someone, who knows.

burakkucat · « **Reply #12 on:** October 02, 2020, 08:47:33 PM »

Quote from: Weaver on October 02, 2020, 06:52:12 PM

Sincerest apologies for this tome. It might be useful to someone, who knows.

It is useful to me, thank you.

burakkucat · « **Reply #13 on:** October 03, 2020, 03:17:19 PM »

Perhaps the following will provide enlightenment --

Code:

[bcat ~]$ cat /proc/cpuinfo
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 1
model name	: AMD EPYC Processor (with IBPB)
stepping	: 2
microcode	: 0x1000065
cpu MHz		: 3493.436
cache size	: 512 KB
physical id	: 0
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat
bogomips	: 6986.87
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 1
model name	: AMD EPYC Processor (with IBPB)
stepping	: 2
microcode	: 0x1000065
cpu MHz		: 3493.436
cache size	: 512 KB
physical id	: 1
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat
bogomips	: 6986.87
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 1
model name	: AMD EPYC Processor (with IBPB)
stepping	: 2
microcode	: 0x1000065
cpu MHz		: 3493.436
cache size	: 512 KB
physical id	: 2
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat
bogomips	: 6986.87
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 1
model name	: AMD EPYC Processor (with IBPB)
stepping	: 2
microcode	: 0x1000065
cpu MHz		: 3493.436
cache size	: 512 KB
physical id	: 3
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat
bogomips	: 6986.87
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management:

processor	: 4
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 1
model name	: AMD EPYC Processor (with IBPB)
stepping	: 2
microcode	: 0x1000065
cpu MHz		: 3493.436
cache size	: 512 KB
physical id	: 4
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 4
initial apicid	: 4
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat
bogomips	: 6986.87
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management:

processor	: 5
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 1
model name	: AMD EPYC Processor (with IBPB)
stepping	: 2
microcode	: 0x1000065
cpu MHz		: 3493.436
cache size	: 512 KB
physical id	: 5
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 5
initial apicid	: 5
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat
bogomips	: 6986.87
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management:

processor	: 6
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 1
model name	: AMD EPYC Processor (with IBPB)
stepping	: 2
microcode	: 0x1000065
cpu MHz		: 3493.436
cache size	: 512 KB
physical id	: 6
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 6
initial apicid	: 6
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat
bogomips	: 6986.87
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management:

processor	: 7
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 1
model name	: AMD EPYC Processor (with IBPB)
stepping	: 2
microcode	: 0x1000065
cpu MHz		: 3493.436
cache size	: 512 KB
physical id	: 7
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 7
initial apicid	: 7
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat
bogomips	: 6986.87
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management:

[bcat ~]$

Lists processors numbered from 0 to 7.
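
For the earlier question about whether "8 cpus" means sockets, cores or hardware threads, the `physical id`, `core id` and `flags` fields of a listing like this can be tallied mechanically. The following is a minimal sketch under assumptions: the function name `summarise_cpuinfo` is made up, it treats distinct `physical id` values as packages and distinct (`physical id`, `core id`) pairs as cores, and it relies on `physical id` appearing before `core id` within each processor entry, as it does here.

```python
def summarise_cpuinfo(text):
    """Count logical cpus, packages and cores in /proc/cpuinfo-style text."""
    logical = 0
    packages = set()
    cores = set()
    hypervisor = False
    physical_id = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
            physical_id = None  # start of a new entry
        elif key == "physical id":
            physical_id = value
            packages.add(value)
        elif key == "core id":
            cores.add((physical_id, value))
        elif key == "flags":
            # The flag is set when the kernel is running as a guest.
            hypervisor = hypervisor or "hypervisor" in value.split()
    return {"logical_cpus": logical, "packages": len(packages),
            "cores": len(cores), "hypervisor": hypervisor}
```

On a Linux box this can be fed with `summarise_cpuinfo(open("/proc/cpuinfo").read())`. Applied to a listing like the one in this post it would report 8 logical cpus, 8 packages and 8 cores with the hypervisor flag set, which suggests (without proving) that the operating system is seeing eight single-core virtual cpus presented by a hypervisor, so the listing by itself does not reveal the underlying physical core or SMT layout.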
