readme revision 76866
1299990SadrianFirst up, let me say I don't like writing in assembler. It is not portable, 2299990Sadriandependant on the particular CPU architecture release and is generally a pig 3299990Sadrianto debug and get right. Having said that, the x86 architecture is probably 4299990Sadrianthe most important for speed due to number of boxes and since 5299990Sadrianit appears to be the worst architecture to to get 6299990Sadriangood C compilers for. So due to this, I have lowered myself to do 7299990Sadrianassembler for the inner DES routines in libdes :-). 8299990Sadrian 9299990SadrianThe file to implement in assembler is des_enc.c. Replace the following 10299990Sadrian4 functions 11299990Sadriandes_encrypt1(DES_LONG data[2],des_key_schedule ks, int encrypt); 12299990Sadriandes_encrypt2(DES_LONG data[2],des_key_schedule ks, int encrypt); 13299990Sadriandes_encrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3); 14299990Sadriandes_decrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3); 15299990Sadrian 16299990SadrianThey encrypt/decrypt the 64 bits held in 'data' using 17299990Sadrianthe 'ks' key schedules. The only difference between the 4 functions is that 18299990Sadriandes_encrypt2() does not perform IP() or FP() on the data (this is an 19299990Sadrianoptimization for when doing triple DES and des_encrypt3() and des_decrypt3() 20299990Sadrianperform triple des. The triple DES routines are in here because it does 21299990Sadrianmake a big difference to have them located near the des_encrypt2 function 22299990Sadrianat link time.. 23299990Sadrian 24299990SadrianNow as we all know, there are lots of different operating systems running on 25299990Sadrianx86 boxes, and unfortunately they normally try to make sure their assembler 26299990Sadrianformating is not the same as the other peoples. 27299990SadrianThe 4 main formats I know of are 28299990SadrianMicrosoft Windows 95/Windows NT 29299990SadrianElf Includes Linux and FreeBSD(?). 30299990Sadriana.out The older Linux. 31299990SadrianSolaris Same as Elf but different comments :-(. 32299990Sadrian 33299990SadrianNow I was not overly keen to write 4 different copies of the same code, 34299990Sadrianso I wrote a few perl routines to output the correct assembler, given 35299990Sadriana target assembler type. This code is ugly and is just a hack. 36299990SadrianThe libraries are x86unix.pl and x86ms.pl. 37299990Sadriandes586.pl, des686.pl and des-som[23].pl are the programs to actually 38299990Sadriangenerate the assembler. 39299990Sadrian 40299990SadrianSo to generate elf assembler 41299990Sadrianperl des-som3.pl elf >dx86-elf.s 42299990SadrianFor Windows 95/NT 43299990Sadrianperl des-som2.pl win32 >win32.asm 44299990Sadrian 45299990Sadrian[ update 4 Jan 1996 ] 46299990SadrianI have added another way to do things. 47299990Sadrianperl des-som3.pl cpp >dx86-cpp.s 48299990Sadriangenerates a file that will be included by dx86unix.cpp when it is compiled. 49299990SadrianTo build for elf, a.out, solaris, bsdi etc, 50299990Sadriancc -E -DELF asm/dx86unix.cpp | as -o asm/dx86-elf.o 51299990Sadriancc -E -DSOL asm/dx86unix.cpp | as -o asm/dx86-sol.o 52299990Sadriancc -E -DOUT asm/dx86unix.cpp | as -o asm/dx86-out.o 53299990Sadriancc -E -DBSDI asm/dx86unix.cpp | as -o asm/dx86bsdi.o 54299990SadrianThis was done to cut down the number of files in the distribution. 55299990Sadrian 56299990SadrianNow the ugly part. I acquired my copy of Intels 57299990Sadrian"Optimization's For Intel's 32-Bit Processors" and found a few interesting 58299990Sadrianthings. First, the aim of the exersize is to 'extract' one byte at a time 59299990Sadrianfrom a word and do an array lookup. This involves getting the byte from 60299990Sadrianthe 4 locations in the word and moving it to a new word and doing the lookup. 61299990SadrianThe most obvious way to do this is 62299990Sadrianxor eax, eax # clear word 63299990Sadrianmovb al, cl # get low byte 64299990Sadrianxor edi DWORD PTR 0x100+des_SP[eax] # xor in word 65299990Sadrianmovb al, ch # get next byte 66299990Sadrianxor edi DWORD PTR 0x300+des_SP[eax] # xor in word 67299990Sadrianshr ecx 16 68299990Sadrianwhich seems ok. For the pentium, this system appears to be the best. 69299990SadrianOne has to do instruction interleaving to keep both functional units 70299990Sadrianoperating, but it is basically very efficient. 71299990Sadrian 72299990SadrianNow the crunch. When a full register is used after a partial write, eg. 73299990Sadrianmov al, cl 74299990Sadrianxor edi, DWORD PTR 0x100+des_SP[eax] 75299990Sadrian386 - 1 cycle stall 76299990Sadrian486 - 1 cycle stall 77299990Sadrian586 - 0 cycle stall 78299990Sadrian686 - at least 7 cycle stall (page 22 of the above mentioned document). 79299990Sadrian 80299990SadrianSo the technique that produces the best results on a pentium, according to 81299990Sadrianthe documentation, will produce hideous results on a pentium pro. 82299990Sadrian 83299990SadrianTo get around this, des686.pl will generate code that is not as fast on 84299990Sadriana pentium, should be very good on a pentium pro. 85299990Sadrianmov eax, ecx # copy word 86299990Sadrianshr ecx, 8 # line up next byte 87299990Sadrianand eax, 0fch # mask byte 88299990Sadrianxor edi DWORD PTR 0x100+des_SP[eax] # xor in array lookup 89299990Sadrianmov eax, ecx # get word 90299990Sadrianshr ecx 8 # line up next byte 91299990Sadrianand eax, 0fch # mask byte 92299990Sadrianxor edi DWORD PTR 0x300+des_SP[eax] # xor in array lookup 93299990Sadrian 94299990SadrianDue to the execution units in the pentium, this actually works quite well. 95299990SadrianFor a pentium pro it should be very good. This is the type of output 96299990SadrianVisual C++ generates. 97299990Sadrian 98299990SadrianThere is a third option. instead of using 99299990Sadrianmov al, ch 100299990Sadrianwhich is bad on the pentium pro, one may be able to use 101299990Sadrianmovzx eax, ch 102299990Sadrianwhich may not incur the partial write penalty. On the pentium, 103299990Sadrianthis instruction takes 4 cycles so is not worth using but on the 104299990Sadrianpentium pro it appears it may be worth while. I need access to one to 105299990Sadrianexperiment :-). 106299990Sadrian 107299990Sadrianeric (20 Oct 1996) 108299990Sadrian 109299990Sadrian22 Nov 1996 - I have asked people to run the 2 different version on pentium 110299990Sadrianpros and it appears that the intel documentation is wrong. The 111299990Sadrianmov al,bh is still faster on a pentium pro, so just use the des586.pl 112299990Sadrianinstall des686.pl 113299990Sadrian 114299990Sadrian3 Dec 1996 - I added des_encrypt3/des_decrypt3 because I have moved these 115299990Sadrianfunctions into des_enc.c because it does make a massive performance 116299990Sadriandifference on some boxes to have the functions code located close to 117299990Sadrianthe des_encrypt2() function. 118299990Sadrian 119299990Sadrian9 Jan 1997 - des-som2.pl is now the correct perl script to use for 120299990Sadrianpentiums. It contains an inner loop from 121299990SadrianSvend Olaf Mikkelsen <svolaf@inet.uni-c.dk> which does raw ecb DES calls at 122299990Sadrian273,000 per second. He had a previous version at 250,000 and the best 123299990SadrianI was able to get was 203,000. The content has not changed, this is all 124299990Sadriandue to instruction sequencing (and actual instructions choice) which is able 125299990Sadrianto keep both functional units of the pentium going. 126299990SadrianWe may have lost the ugly register usage restrictions when x86 went 32 bit 127299990Sadrianbut for the pentium it has been replaced by evil instruction ordering tricks. 128299990Sadrian 129299990Sadrian13 Jan 1997 - des-som3.pl, more optimizations from Svend Olaf. 130299990Sadrianraw DES at 281,000 per second on a pentium 100. 131299990Sadrian 132299990Sadrian