Kitz ADSL Broadband Information

Topic: Meat and potatoes / fish and chips machine code

Weaver

  • Addicted Kitizen
  • *****
  • Posts: 8964
  • Retd sw dev; A&A; 4 × 7km ADSL2; IPv6; Firebrick
Meat and potatoes / fish and chips machine code
« on: July 19, 2020, 09:50:35 AM »

I read a minor rant  ;)  by Linus Torvalds about how he hates Intel AVX512. I agree with him wholeheartedly, although not quite for the same reasons - I hate the fact that they are making developers' lives a misery by fragmenting the product range into processors that have different AVX512 modules and processors that have no support at all. What are we supposed to do to handle all that complexity? Why the hell bother?

I definitely agree with Linus' point about getting back to concentrating on meat and potatoes (or is it fish and chips?) everyday integer code, plain arithmetic and logic. We need a war on the cost of fighting our way through the forest of everyday tasks that is logic, string processing and conditional jumps.

My wish list :
  • more execution ports
  • better string operations - use some imagination - more innovative insns can be thought up - study real code
  • fast pcmpestri and pcmpistri for AVX2 with YMM regs (assuming they aren't supported already);
  • AMD made the loop and jrcxz instructions fast again; Intel should finally do the same, although we will then have to tell compilers that they can dare to use them once again, and it will take a decade for things to filter through
  • a double compare instruction: cond = ( x >= lo && x <= hi ); yes, I know you can do x - lo <= hi - lo if you use unsigned arithmetic, but that isn't always so easy, and in any case it's not fast enough: it alters at least one register doing the subtraction, and it isn't one clock but two subtracts and the jmp
  • want a CMOVcc with an imm
  • combined cond moves (like CMOVcc) which do not use the flags register but combine a comparison and a conditional move together into one instruction with four operands - two for the comparison, then a dest and a source. It doesn't use the flags, so there is no waiting on the flags, nor do the flags become a bottleneck preventing ILP.
  • as above but with a conditional jump, that is a comparison and a jmp combined into one, like a generalised jcxz but blazingly fast, and it doesn't use the flags
  • whatever happened to bitfield operations, like VAX insv / extv? Wasn't there a pair of similar-sounding insns on the first 386dx processors which were then withdrawn, is that right?
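
For anyone who hasn't met the unsigned-arithmetic range check mentioned in the list above, here is a minimal sketch in C. The function names are made up for illustration, and the trick as written assumes lo <= hi and 32-bit integers; it is not claimed to be what any particular compiler emits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Plain form: two comparisons, as in cond = (x >= lo && x <= hi). */
bool in_range_naive(int32_t x, int32_t lo, int32_t hi)
{
    return x >= lo && x <= hi;
}

/* Unsigned-subtraction form: one comparison after two subtractions.
   Only valid when lo <= hi. If x < lo, the unsigned difference wraps
   round to a large value that is always greater than hi - lo, so a
   single unsigned compare covers both bounds. */
bool in_range_unsigned(int32_t x, int32_t lo, int32_t hi)
{
    assert(lo <= hi); /* precondition of the trick */
    return (uint32_t)x - (uint32_t)lo <= (uint32_t)hi - (uint32_t)lo;
}
```

When lo and hi are compile-time constants the hi - lo subtraction folds away, which is the easy case; with all three values in registers you still pay for both subtractions and clobber at least one register, which is the complaint in the list.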
I'm going to keep on expanding the list. I can keep coming up with more and more as I look around. I'm thinking about using the XMM/YMM registers for _single_ plain 64-bit and 32-bit integers, with greater ease of to-ing and fro-ing between those xMM registers and the traditional integer register set, to relieve register pressure on the integer registers by allowing them to be spilled into the xMM registers (and their halves too, to make better use of the space), as an alternative to spilling registers to the stack.

Other thoughts - I wonder if there is a market for 128-bit scalar integer operations? I also wonder about the costs and benefits of going up to say 32 or 64 integer registers. (How to handle the transition, when an o/s has to save all the registers and know about the changes to stack frame layout caused by that? It would only be like going to AVX2 YMM and then AVX512 ZMM regs I suppose - it has been handled before.)

I wonder if it's time for a completely new, faster byte encoding of x86-64 v2.0 instructions - so assembler-source compatible but binary incompatible, and you need a fast byte-stream retranslator when you load a trad-encoding exe into memory, so old bytes get transformed in RAM on the fly, JIT fashion, by an o/s exe loader. It should be done in such a fashion as to be able to split the workload across multiple cores, by splitting the code of the exe into n portions. We need to get rid of all the REX prefixes, and the decoder complexity needs to go in the bin, thus more speed and less power consumption. And get rid of the code size penalty for using r8-r15. If you were going for some of these other wish-list items, such as lots more new instructions or, even more so, a change to far more regs, then the new flexibility brought by new encodings would be a blessing.

I didn't notice the date on the article so this might be very old news: Linus has bought himself an AMD Ryzen so for the first time in 15 years has forsaken Intel. Good for him, and he says it's three times faster than his previous box. I don't know how much of his build jobs is CPU-bound, how much RAM-i/o bound and how much disk-bound. I wonder if Linus has a parallel make - I was looking round for one some while back and the only one I could find cost a fortune.

Makes me think about - what is it - Microsoft's compiler from MSIL into machine code, as in C# to machine code - how does it work? I forget.

burakkucat

  • Global Moderator
  • Senior Kitizen
  • *
  • Posts: 30477
  • Over the Rainbow Bridge
    • The ELRepo Project
Re: Meat and potatoes / fish and chips machine code
« Reply #1 on: July 19, 2020, 06:29:16 PM »

GNU "make" has the -j (job server) flag and so the regular kernel builds, instigated from "The Cattery", all occur with a -j8 flag and argument.

A few months ago an experiment was performed with one of the build systems that I routinely use (AMD processor, 8 cpus). A series of builds, using the same target, were performed starting with -j1 and ending with -j64. A table of the maximum number of "make" jobs versus the elapsed time for the build was drawn up.

Maximum   Log (base 10)      Elapsed      Log (base 10)
number    maximum number     time         elapsed
of jobs   of jobs            (seconds)    time (seconds)

 1        0.000000000        6647         3.822625679
 2        0.301029995        3628         3.559667278
 4        0.602059991        2136         3.329601248
 8        0.903089987        1338         3.126456113
16        1.204119983        1380         3.139879086
24        1.380211242        1397         3.145196406
32        1.505149978        1388         3.142389466
40        1.602059991        1402         3.146748014
48        1.681241237        1400         3.146128036
56        1.748188027        1398         3.145507171
64        1.806179974        1402         3.146748014

The results were plotted in four separate views; linear - linear, linear - logarithmic, logarithmic - linear and logarithmic - logarithmic. A minimum in the elapsed time was observed when the maximum number of "make" jobs was equal to the number of cpus.
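
As a rough illustration of what the table implies, the sketch below works out speedup (elapsed time at -j1 divided by elapsed time at -jN) and efficiency (speedup divided by N) from the figures above. The struct and function names are invented for this example; only the job counts and elapsed times come from the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One row of the table: the make -j value and the elapsed time in seconds. */
struct build_run {
    int jobs;
    double seconds;
};

const struct build_run runs[] = {
    {1, 6647},  {2, 3628},  {4, 2136},  {8, 1338},  {16, 1380}, {24, 1397},
    {32, 1388}, {40, 1402}, {48, 1400}, {56, 1398}, {64, 1402},
};
const size_t run_count = sizeof runs / sizeof runs[0];

/* Speedup relative to the single-job build. */
double speedup(double t1, double tn)
{
    return t1 / tn;
}

/* Speedup per job; 1.0 would be perfect linear scaling. */
double efficiency(double t1, double tn, int jobs)
{
    return speedup(t1, tn) / jobs;
}

/* Index of the row with the smallest elapsed time. */
size_t fastest_run(const struct build_run *r, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; ++i)
        if (r[i].seconds < r[best].seconds)
            best = i;
    return best;
}

/* Print one line per row; assumes r[0] is the -j1 baseline. */
void print_scaling(const struct build_run *r, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        printf("-j%-3d %6.0f s  speedup %5.2f  efficiency %4.2f\n",
               r[i].jobs, r[i].seconds,
               speedup(r[0].seconds, r[i].seconds),
               efficiency(r[0].seconds, r[i].seconds, r[i].jobs));
}
```

Calling print_scaling(runs, run_count) from a main() prints the whole series. On these figures the -j8 build is about 4.97 times faster than -j1 (6647 / 1338), roughly 62% efficiency on 8 cpus, and every run from -j16 upwards lands within about 5% of the -j8 time.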

I think Linus routinely builds kernels using a -j28 flag and argument to "make".
:cat:  100% Linux and, previously, Unix. Co-founder of the ELRepo Project.


Weaver

Re: Meat and potatoes / fish and chips machine code
« Reply #2 on: July 21, 2020, 10:41:43 AM »

Assuming there is limitless (!) RAM available, I would have thought that the number of jobs should be something like 2 * num of cores, but there's the speed of RAM to consider, hyper-threading or not, and how effective file system caching is in practice. I'm pleased to see that -j16 is not too bad though.

I'm glad that Linus has seen some serious improvement, although I don't know how much of that was due to system improvement, how much due to the baseline comparison with the previous system, and how much is raw CPU. I feel as if Intel has had a good part of a decade of stagnation. I ask myself how we get to ambitious single-core breakthroughs. How do we get every integer instruction, including the neglected ones, down to sub-1-cycle performance? How do we get to seriously ambitious targets? What about single core 10GHz at current cycle counts (so no cheating by increasing the number of cycles per insn) to be reached in one decade with appropriate cooling - a JFK moonshot target?
« Last Edit: July 21, 2020, 11:13:42 AM by Weaver »

CarlT

  • Kitizen
  • ****
  • Posts: 1673
  • Next generation network design and deployment
Re: Meat and potatoes / fish and chips machine code
« Reply #3 on: August 04, 2020, 12:31:26 PM »

Quote
What about single core 10GHz at current cycle counts (so no cheating by increasing the number of cycles per insn) to be reached in one decade with appropriate cooling - a JFK moonshot target ?

The world record is 8.7 GHz and was set years ago, so with a smaller feature size and fewer cores on a die to reduce heat, that's feasible even without more exotic things like the use of graphene.

It would, however, require focus on that specific target. Maybe Intel / AMD would do it for the e-peen value in the lab?

Fastest stock clock production CPU ever was a 5.5 GHz RISC CPU from IBM. RISC CPUs could I'm sure go to higher clocks than CISC simply due to the far lower transistor count required.
WiFi: Nighthawk® AX12 RAX120 - 5Gb uplink
Routing: pfSense VM - 10Gb in and indeed out
Switching: 2 * Mikrotik CRS305-1G-4S-IN, 10Gb uplinks, various cheap and cheerful
Exchange: Wakefield
ISP: BT Full Fibre 900. Zen Full Fibre 900. Zoom, zoom.

Alex Atkin UK

  • Kitizen
  • ****
  • Posts: 1531
    • My Broadband History
Re: Meat and potatoes / fish and chips machine code
« Reply #4 on: August 05, 2020, 12:06:21 PM »

It will be interesting to see how fast Apple push ARM now they are migrating desktops across to it.  Not really seen any mention of what ARM is capable of when given desktop-level cooling.
Exchange: INTAKE (ECI) ISP/Modems: Zen (Home Hub 5A running OpenWrt) + Plusnet (VMG-3925-B10B) + Three (Huawei B535-232)
Router: pfSense (i5-7200U) WiFi: Ubiquiti nanoHD

Weaver

Re: Meat and potatoes / fish and chips machine code
« Reply #5 on: August 06, 2020, 12:06:00 AM »

I've always been a big fan of CISC. Having to fetch in a ton of code to get anything done is a limitation on RISC's performance, and CISC can always have its microcode implementations improved later on by adding more dedicated hardware while software stays unbroken.

What were you saying about ARM and Apple, Alex? ARM is still really slow compared to x86 because the current implementations afaik don't have the same levels of ILP, or am I out of date on that too? With Intel boxes commonly having four-way ILP and the AMD Ryzen now having five-way, that's staggeringly fast for the important meat and potatoes stuff. Agner Fog gives the Ryzen a wonderful write-up btw.

I don't know enough about ARM AArch64 - I'm wondering if a register-register move costs 1 clock or zero clocks like on Intel. Zero-clock operations such as reg-reg moves and addressing mode calculation as part of a load/store, and the macrofusion of cmp or sub + jmp into one instruction, so losing one of the two instructions in the pair: those are all very impressive developments, and I don't know if ARM has any equivalents of zero-clock ops or macrofusion.

flilot

  • Member
  • **
  • Posts: 47
Re: Meat and potatoes / fish and chips machine code
« Reply #6 on: August 06, 2020, 01:16:56 AM »

Quote
What were you saying about ARM and Apple, Alex? ARM is still really slow compared to x86 because the current implementations afaik don't have the same levels of ILP, or am I out of date on that too? With Intel boxes commonly having four-way ILP and the AMD Ryzen now having five-way, that's staggeringly fast for the important meat and potatoes stuff. Agner Fog gives the Ryzen a wonderful write-up btw.

I'd take a look at this:
https://youtu.be/GEZhD3J89ZE?t=5210 (from 1:26:50 in the video if the link doesn't take you to that timeframe directly).
What Apple is doing with their own ARM based silicon is astounding, and the performance is incredible - desktop class. The video shows you Pro apps running natively on macOS on their own ARM based silicon. It's likely to make your jaw drop once you realise the implications for Intel and the like.
Carl
____________________________
vodafone Gigafast 100/100 FTTP
FRITZ!Box 7530 Router | Calix 801Gv2 GigaPoint ONT

Alex Atkin UK

Re: Meat and potatoes / fish and chips machine code
« Reply #7 on: August 06, 2020, 06:36:30 AM »

Quote
When we make bold changes it's for one simple but powerful reason, so we can make much better products

Sorry but that made me laugh, considering the stupid hardware mistakes and deliberate right to repair breaking changes they have made in the past.  Making it so third parties cannot repair their devices does not a better product make!

More like:
Quote
When we make bold changes it's for one simple but powerful reason, so we can make more profit

Which honestly, they're a business, fair enough, but don't bullshit about it!

What rubs me up the wrong way about this stuff is that for decades, Apple were allowed to do anti-competitive things, because they were the underdog.  A lot of their software integrations, Microsoft were simply not allowed to do.  I understand why, but it seems flawed to me as all it's done is stagnate Windows innovation and allow Apple to develop more compelling solutions.

I mean sure, Microsoft made a lot of mistakes too, but it's hardly surprising when they were constantly walking on eggshells with the likes of the EU, who fined them whenever they tried to do something like what Apple has done regarding integrating everything in the OS.
« Last Edit: August 06, 2020, 06:41:59 AM by Alex Atkin UK »

flilot

Re: Meat and potatoes / fish and chips machine code
« Reply #8 on: August 06, 2020, 04:46:05 PM »

Quote
Sorry but that made me laugh, considering the stupid hardware mistakes and deliberate right to repair breaking changes they have made in the past.  Making it so third parties cannot repair their devices does not a better product make!

Indeed. I don't listen to the waffle of these rich polished Americans, they are so over-dramatic and arrogant.
Unfortunately you have to get through the waffle to see the demonstrations in the video, which I still maintain are compelling.  Desktop-class ARM-based silicon is going to create a shift, and Intel need to pull their socks up before they are relegated to third in the desktop CPU market.