Kitz ADSL Broadband Information
Topic: Meat and potatoes / fish and chips machine code

Weaver

  • Addicted Kitizen
  • *****
  • Posts: 8853
  • Retd sw dev; A&A; 4 × 7km ADSL2; IPv6; Firebrick
Meat and potatoes / fish and chips machine code
« on: July 19, 2020, 09:50:35 AM »

I read a minor rant  ;)  by Linus Torvalds about how he hates Intel AVX512. I agree with him wholeheartedly, although not for quite the same reasons - I hate the fact that they are making developers' lives a misery by fragmenting the product range into processors that support different subsets of AVX512, or indeed have no support at all. What are we supposed to do to handle all that complexity? Why the hell bother?

I definitely agree with Linus' point about getting back to concentrating on meat and potatoes (or is it fish and chips?) everyday integer code, plain arithmetic and logic. We need a war on the cost of fighting our way through the forest of everyday tasks that is logic, string processing and conditional jumps.

My wish list :
  • more execution ports
  • better string operations - use some imagination - more innovative insns can be thought up - study real code
  • fast pcmpestri and pcmpistri for AVX2 with YMM regs (assuming they aren't supported already);
  • AMD made the loop and jrcxz instructions fast again; Intel should finally do the same, although some day we will then have to tell compilers that they can dare to use them once again, and it will take a decade for things to filter through
  • a double compare instruction cond = ( x >= lo && x <= hi ); yes I know you can do x - lo <= hi - lo if you use unsigned arithmetic, but that isn't always so easy and in any case it's not fast enough: it alters at least one register doing the subtraction, and it isn't one clock but two subtracts and the jmp
  • want a CMOVcc with an imm
  • combined cond moves (like CMOVcc) which do not use the flags register but combine a comparison and a conditional move together into one instruction with four operands - two for the comparison, then a dest and a source. Doesn't use the flags, so no waiting on the flags, nor do the flags become a bottleneck preventing ILP.
  • as above but with a conditional jump, that is a comparison and a jmp combined into one, like a generalised jcxz but blazingly fast, and it doesn't use the flags
  • whatever happened to bitfield operations, like VAX insv / extv? Wasn't there a pair of similar-sounding insns on the first 386dx processors which were then withdrawn, is that right?
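For anyone who has not met the unsigned-arithmetic trick mentioned in the double compare item, here is a minimal sketch of it. It is written in Python only for illustration, with an explicit mask standing in for 32-bit register wraparound; the function names and the 32-bit width are assumptions made for this example, not anything from a real compiler or library.

```python
# Sketch of the "x - lo <= hi - lo" range check using unsigned arithmetic.
# Python integers do not wrap, so a mask emulates a 32-bit register.
# The names and the 32-bit width are illustrative assumptions.

MASK32 = 0xFFFFFFFF  # keep only the low 32 bits, like an unsigned register


def in_range_two_compares(x, lo, hi):
    """The obvious form: two comparisons (and typically two branches)."""
    return lo <= x <= hi


def in_range_one_compare(x, lo, hi):
    """One unsigned comparison after two subtractions.

    Requires lo <= hi, and x, lo, hi must all fit in 32 bits
    (signed or unsigned, but all interpreted the same way).
    If x < lo, the subtraction wraps to a huge unsigned value,
    which is then larger than hi - lo, so the test fails as desired.
    """
    return ((x - lo) & MASK32) <= ((hi - lo) & MASK32)
```

When lo and hi are constants, hi - lo can be folded at compile time, which leaves one subtract and one unsigned compare; that is the cost the wish-list item complains about, since the subtract still clobbers a register.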
I'm going to keep on expanding the list. I can keep coming up with more and more as I look around. I'm thinking about using the XMM/YMM registers for _single_ plain 64-bit and 32-bit integers, with greater ease of to-ing and fro-ing between those xMM registers and the traditional integer register set, to relieve register pressure on the integer registers by allowing them to be spilled into the xMM registers (and their halves too, to make better use of the space), as an alternative to spilling registers to the stack.

Other thoughts - I wonder if there is a market for 128-bit scalar integer operations? I also wonder about the costs and benefits of going up to say 32 or 64 integer registers. (HOW to handle the transition when an o/s has to save all the registers and know about the changes to stack frame layout caused by that. - It was only like going to AVX2 YMM and then AVX512 ZMM regs I suppose - it has been handled before?)

I wonder if it's time for a completely new faster byte encoding of x86-64 v2.0 instructions - so assembler source compatible but binary incompatible and you need a fast byte-stream retranslator when you load a trad encoding exe into memory so old bytes get transformed in ram on the fly, JIT fashion, by an o/s exe loader. Should be done in such a fashion as to be able to split up the workload onto multiple cores, split the code of the exe up into n portions. We need to get rid of all the REX prefixes and the decoder complexity needs to go in the bin, thus more speed, less power consumption. And get rid of the code size penalty for using r8-r15. If you were going for some of these other wish-list items, such as lots more new instructions, or - even more so - in the case of a change to far more regs then new flexibility brought by new encodings would be a blessing.

I didn't notice the date on the article so this might be very old news: Linus has bought himself an AMD Ryzen so for the first time in 15 years has forsaken Intel. Good for him, and he says it's three times faster than his previous box. I don't know how much of his build jobs is CPU-bound, how much RAM-i/o bound and how much disk-bound. I wonder if Linus has a parallel make - I was looking round for one some while back and the only one I could find cost a fortune.

Makes me think about - what is it - Microsoft's compiler from MSIL into machine code - as in C# to machine code - how does it work, I forget?

burakkucat

  • Global Moderator
  • Senior Kitizen
  • *
  • Posts: 30038
  • Over the Rainbow Bridge
    • The ELRepo Project
Re: Meat and potatoes / fish and chips machine code
« Reply #1 on: July 19, 2020, 06:29:16 PM »

GNU "make" has the -j (job server) flag and so the regular kernel builds, instigated from "The Cattery", all occur with a -j8 flag and argument.

A few months ago an experiment was performed with one of the build systems that I routinely use (AMD processor, 8 cpus). A series of builds, using the same target, were performed starting with -j1 and ending with -j64. A table of the maximum number of "make" jobs versus the elapsed time for the build was drawn up.

Maximum   Log (base 10)      Elapsed      Log (base 10)
number    maximum number     time         elapsed
of jobs   of jobs            (seconds)    time (seconds)

 1        0.000000000        6647         3.822625679
 2        0.301029995        3628         3.559667278
 4        0.602059991        2136         3.329601248
 8        0.903089987        1338         3.126456113
16        1.204119983        1380         3.139879086
24        1.380211242        1397         3.145196406
32        1.505149978        1388         3.142389466
40        1.602059991        1402         3.146748014
48        1.681241237        1400         3.146128036
56        1.748188027        1398         3.145507171
64        1.806179974        1402         3.146748014

The results were plotted in four separate views; linear - linear, linear - logarithmic, logarithmic - linear and logarithmic - logarithmic. A minimum in the elapsed time was observed when the maximum number of "make" jobs was equal to the number of cpus.
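As a rough way of reading the table, the sketch below computes the speedup relative to the -j1 build and a simple efficiency figure for each row. The timings are copied from the table and the 8-cpu count comes from the description of the build system; the function names and the definition of efficiency (speedup divided by the number of jobs that can actually run at once) are assumptions made for this illustration.

```python
# Speedup and efficiency derived from the build timings in the table.
# RESULTS is copied from the table; NUM_CPUS is the stated cpu count.
# The helper names and the efficiency definition are illustrative only.

NUM_CPUS = 8

# (maximum number of make jobs, elapsed time in seconds)
RESULTS = [
    (1, 6647), (2, 3628), (4, 2136), (8, 1338), (16, 1380), (24, 1397),
    (32, 1388), (40, 1402), (48, 1400), (56, 1398), (64, 1402),
]


def speedup_rows(results, num_cpus=NUM_CPUS):
    """Return (jobs, seconds, speedup, efficiency) for each table row.

    Speedup is measured against the first row, assumed to be -j1.
    Efficiency divides speedup by min(jobs, num_cpus), because jobs
    beyond the cpu count cannot all run at the same time.
    """
    baseline = results[0][1]
    rows = []
    for jobs, seconds in results:
        speedup = baseline / seconds
        efficiency = speedup / min(jobs, num_cpus)
        rows.append((jobs, seconds, speedup, efficiency))
    return rows


def fastest(results):
    """Return the (jobs, seconds) row with the shortest elapsed time."""
    return min(results, key=lambda row: row[1])


if __name__ == "__main__":
    for jobs, seconds, speedup, efficiency in speedup_rows(RESULTS):
        print(f"-j{jobs:<3} {seconds:>5} s  speedup {speedup:5.2f}  "
              f"efficiency {efficiency:4.2f}")
    print("fastest:", fastest(RESULTS))
```

On these figures the -j8 build is roughly five times faster than -j1, so the scaling is well short of linear even at the optimum, and the rows beyond -j8 cost only a few percent extra.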

I think Linus routinely builds kernels using a -j28 flag and argument to "make".
:cat:  100% Linux and, previously, Unix. Co-founder of the ELRepo Project.


Weaver

  • Addicted Kitizen
  • *****
  • Posts: 8853
  • Retd sw dev; A&A; 4 × 7km ADSL2; IPv6; Firebrick
Re: Meat and potatoes / fish and chips machine code
« Reply #2 on: July 21, 2020, 10:41:43 AM »

Assuming there is limitless (!) RAM available, I would have thought that the number of jobs should be something like 2 * num of cores, but there's the speed of RAM to consider, hyper threading or not, and how effective file system caching is in practice. I'm pleased to see that -j16 is not too bad though.

I'm glad that Linus has seen some serious improvement, although I don't know how much of that was due to system improvement, how much due to the baseline comparison with the previous system and how much is raw CPU. I feel as if Intel has had a good part of a decade of stagnation. I ask myself how we get to ambitious single core breakthroughs. How do we get every integer instruction, including the neglected ones, down to a minimal cycle count or sub 1-cycle performance? How do we get to seriously ambitious targets? What about single core 10GHz at current cycle counts (so no cheating by increasing the number of cycles per insn) to be reached in one decade with appropriate cooling - a JFK moonshot target ?
« Last Edit: July 21, 2020, 11:13:42 AM by Weaver »

CarlT

  • Kitizen
  • ****
  • Posts: 1637
  • Next generation network design and deployment
Re: Meat and potatoes / fish and chips machine code
« Reply #3 on: Today at 12:31:26 PM »

What about single core 10GHz at current cycle counts (so no cheating by increasing the number of cycles per insn) to be reached in one decade with appropriate cooling - a JFK moonshot target ?

The world record is 8.7 GHz and was set years ago so with lower feature size and fewer cores on a die to reduce heat that's feasible even without more exotic things like use of graphene.

It would, however, require focus on that specific target. Maybe Intel / AMD would do it for the e-peen value in the lab?

Fastest stock clock production CPU ever was a 5.5 GHz RISC CPU from IBM. RISC CPUs could I'm sure go to higher clocks than CISC simply due to the far lower transistor count required.
WiFi: Nighthawk® AX12 RAX120 - 5Gb uplink
Routing: pfSense VM - 10Gb in and indeed out
Switching: 2 * Mikrotik CRS305-1G-4S-IN, 10Gb uplinks, various cheap and cheerful
Exchange: Wakefield
ISP: BT Full Fibre 900. Zen Full Fibre 900. Zoom, zoom.