readme revision 76866
1299990SadrianFirst up, let me say I don't like writing in assembler.  It is not portable,
2299990Sadriandependant on the particular CPU architecture release and is generally a pig
3299990Sadrianto debug and get right.  Having said that, the x86 architecture is probably
4299990Sadrianthe most important for speed due to number of boxes and since
5299990Sadrianit appears to be the worst architecture to to get
6299990Sadriangood C compilers for.  So due to this, I have lowered myself to do
7299990Sadrianassembler for the inner DES routines in libdes :-).
8299990Sadrian
9299990SadrianThe file to implement in assembler is des_enc.c.  Replace the following
10299990Sadrian4 functions
11299990Sadriandes_encrypt1(DES_LONG data[2],des_key_schedule ks, int encrypt);
12299990Sadriandes_encrypt2(DES_LONG data[2],des_key_schedule ks, int encrypt);
13299990Sadriandes_encrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);
14299990Sadriandes_decrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);
15299990Sadrian
16299990SadrianThey encrypt/decrypt the 64 bits held in 'data' using
17299990Sadrianthe 'ks' key schedules.   The only difference between the 4 functions is that
18299990Sadriandes_encrypt2() does not perform IP() or FP() on the data (this is an
19299990Sadrianoptimization for when doing triple DES and des_encrypt3() and des_decrypt3()
20299990Sadrianperform triple des.  The triple DES routines are in here because it does
21299990Sadrianmake a big difference to have them located near the des_encrypt2 function
22299990Sadrianat link time..
23299990Sadrian
24299990SadrianNow as we all know, there are lots of different operating systems running on
25299990Sadrianx86 boxes, and unfortunately they normally try to make sure their assembler
26299990Sadrianformating is not the same as the other peoples.
27299990SadrianThe 4 main formats I know of are
28299990SadrianMicrosoft	Windows 95/Windows NT
29299990SadrianElf		Includes Linux and FreeBSD(?).
30299990Sadriana.out		The older Linux.
31299990SadrianSolaris		Same as Elf but different comments :-(.
32299990Sadrian
33299990SadrianNow I was not overly keen to write 4 different copies of the same code,
34299990Sadrianso I wrote a few perl routines to output the correct assembler, given
35299990Sadriana target assembler type.  This code is ugly and is just a hack.
36299990SadrianThe libraries are x86unix.pl and x86ms.pl.
37299990Sadriandes586.pl, des686.pl and des-som[23].pl are the programs to actually
38299990Sadriangenerate the assembler.
39299990Sadrian
40299990SadrianSo to generate elf assembler
41299990Sadrianperl des-som3.pl elf >dx86-elf.s
42299990SadrianFor Windows 95/NT
43299990Sadrianperl des-som2.pl win32 >win32.asm
44299990Sadrian
45299990Sadrian[ update 4 Jan 1996 ]
46299990SadrianI have added another way to do things.
47299990Sadrianperl des-som3.pl cpp >dx86-cpp.s
48299990Sadriangenerates a file that will be included by dx86unix.cpp when it is compiled.
49299990SadrianTo build for elf, a.out, solaris, bsdi etc,
50299990Sadriancc -E -DELF asm/dx86unix.cpp | as -o asm/dx86-elf.o
51299990Sadriancc -E -DSOL asm/dx86unix.cpp | as -o asm/dx86-sol.o
52299990Sadriancc -E -DOUT asm/dx86unix.cpp | as -o asm/dx86-out.o
53299990Sadriancc -E -DBSDI asm/dx86unix.cpp | as -o asm/dx86bsdi.o
54299990SadrianThis was done to cut down the number of files in the distribution.
55299990Sadrian
56299990SadrianNow the ugly part.  I acquired my copy of Intels
57299990Sadrian"Optimization's For Intel's 32-Bit Processors" and found a few interesting
58299990Sadrianthings.  First, the aim of the exersize is to 'extract' one byte at a time
59299990Sadrianfrom a word and do an array lookup.  This involves getting the byte from
60299990Sadrianthe 4 locations in the word and moving it to a new word and doing the lookup.
61299990SadrianThe most obvious way to do this is
62299990Sadrianxor	eax,	eax				# clear word
63299990Sadrianmovb	al,	cl				# get low byte
64299990Sadrianxor	edi	DWORD PTR 0x100+des_SP[eax] 	# xor in word
65299990Sadrianmovb	al,	ch				# get next byte
66299990Sadrianxor	edi	DWORD PTR 0x300+des_SP[eax] 	# xor in word
67299990Sadrianshr	ecx	16
68299990Sadrianwhich seems ok.  For the pentium, this system appears to be the best.
69299990SadrianOne has to do instruction interleaving to keep both functional units
70299990Sadrianoperating, but it is basically very efficient.
71299990Sadrian
72299990SadrianNow the crunch.  When a full register is used after a partial write, eg.
73299990Sadrianmov	al,	cl
74299990Sadrianxor	edi,	DWORD PTR 0x100+des_SP[eax]
75299990Sadrian386	- 1 cycle stall
76299990Sadrian486	- 1 cycle stall
77299990Sadrian586	- 0 cycle stall
78299990Sadrian686	- at least 7 cycle stall (page 22 of the above mentioned document).
79299990Sadrian
80299990SadrianSo the technique that produces the best results on a pentium, according to
81299990Sadrianthe documentation, will produce hideous results on a pentium pro.
82299990Sadrian
83299990SadrianTo get around this, des686.pl will generate code that is not as fast on
84299990Sadriana pentium, should be very good on a pentium pro.
85299990Sadrianmov	eax,	ecx				# copy word 
86299990Sadrianshr	ecx,	8				# line up next byte
87299990Sadrianand	eax,	0fch				# mask byte
88299990Sadrianxor	edi	DWORD PTR 0x100+des_SP[eax] 	# xor in array lookup
89299990Sadrianmov	eax,	ecx				# get word
90299990Sadrianshr	ecx	8				# line up next byte
91299990Sadrianand	eax,	0fch				# mask byte
92299990Sadrianxor	edi	DWORD PTR 0x300+des_SP[eax] 	# xor in array lookup
93299990Sadrian
94299990SadrianDue to the execution units in the pentium, this actually works quite well.
95299990SadrianFor a pentium pro it should be very good.  This is the type of output
96299990SadrianVisual C++ generates.
97299990Sadrian
98299990SadrianThere is a third option.  instead of using
99299990Sadrianmov	al,	ch
100299990Sadrianwhich is bad on the pentium pro, one may be able to use
101299990Sadrianmovzx	eax,	ch
102299990Sadrianwhich may not incur the partial write penalty.  On the pentium,
103299990Sadrianthis instruction takes 4 cycles so is not worth using but on the
104299990Sadrianpentium pro it appears it may be worth while.  I need access to one to
105299990Sadrianexperiment :-).
106299990Sadrian
107299990Sadrianeric (20 Oct 1996)
108299990Sadrian
109299990Sadrian22 Nov 1996 - I have asked people to run the 2 different version on pentium
110299990Sadrianpros and it appears that the intel documentation is wrong.  The
111299990Sadrianmov al,bh is still faster on a pentium pro, so just use the des586.pl
112299990Sadrianinstall des686.pl
113299990Sadrian
114299990Sadrian3 Dec 1996 - I added des_encrypt3/des_decrypt3 because I have moved these
115299990Sadrianfunctions into des_enc.c because it does make a massive performance
116299990Sadriandifference on some boxes to have the functions code located close to
117299990Sadrianthe des_encrypt2() function.
118299990Sadrian
119299990Sadrian9 Jan 1997 - des-som2.pl is now the correct perl script to use for
120299990Sadrianpentiums.  It contains an inner loop from
121299990SadrianSvend Olaf Mikkelsen <svolaf@inet.uni-c.dk> which does raw ecb DES calls at
122299990Sadrian273,000 per second.  He had a previous version at 250,000 and the best
123299990SadrianI was able to get was 203,000.  The content has not changed, this is all
124299990Sadriandue to instruction sequencing (and actual instructions choice) which is able
125299990Sadrianto keep both functional units of the pentium going.
126299990SadrianWe may have lost the ugly register usage restrictions when x86 went 32 bit
127299990Sadrianbut for the pentium it has been replaced by evil instruction ordering tricks.
128299990Sadrian
129299990Sadrian13 Jan 1997 - des-som3.pl, more optimizations from Svend Olaf.
130299990Sadrianraw DES at 281,000 per second on a pentium 100.
131299990Sadrian
132299990Sadrian