NameDateSize

..13-May-201382

alpha.asmH A D22-Nov-20121.7 KiB

common.cH A D13-May-201357 KiB

divrem1div.cH A D22-Nov-20121.1 KiB

divrem1inv.cH A D22-Nov-20121.1 KiB

divrem2div.cH A D22-Nov-20121 KiB

divrem2inv.cH A D22-Nov-20121,018

freq.cH A D22-Nov-201223.6 KiB

gcdext_double.cH A D22-Nov-2012983

gcdext_single.cH A D22-Nov-2012995

gcdextod.cH A D22-Nov-20121,010

gcdextos.cH A D22-Nov-20121,022

hppa.asmH A D22-Nov-20121.1 KiB

hppa2.asmH A D22-Nov-20121.1 KiB

hppa2w.asmH A D22-Nov-20121.1 KiB

ia64.asmH A D22-Nov-20121.1 KiB

jacbase1.cH A D22-Nov-2012953

jacbase2.cH A D22-Nov-2012953

jacbase3.cH A D22-Nov-2012953

Makefile.amH A D22-Nov-20125.6 KiB

Makefile.inH A D13-May-201345.9 KiB

many.plH A D13-May-201339.2 KiB

mod_1_div.cH A D22-Nov-20121.2 KiB

mod_1_inv.cH A D22-Nov-20121.2 KiB

modlinv.cH A D22-Nov-20127.2 KiB

noop.cH A D22-Nov-20121.2 KiB

pentium.asmH A D22-Nov-20121.4 KiB

powerpc.asmH A D13-May-20131.1 KiB

powerpc64.asmH A D22-Nov-20121.1 KiB

powm_mod.cH A D22-Nov-2012936

powm_redc.cH A D22-Nov-20121 KiB

pre_divrem_1.cH A D22-Nov-2012967

READMEH A D13-May-201318.7 KiB

set_strb.cH A D22-Nov-20121.3 KiB

set_strp.cH A D22-Nov-20121.2 KiB

set_strs.cH A D22-Nov-20121.2 KiB

sparcv9.asmH A D22-Nov-20121.1 KiB

speed-ext.cH A D22-Nov-20126.8 KiB

speed.cH A D22-Nov-201239.3 KiB

speed.hH A D13-May-2013100.8 KiB

time.cH A D22-Nov-201245.1 KiB

tuneup.cH A D13-May-201365.2 KiB

x86_64.asmH A D22-Nov-20121.4 KiB

README

1Copyright 2000, 2001, 2002, 2004 Free Software Foundation, Inc.
2
3This file is part of the GNU MP Library.
4
5The GNU MP Library is free software; you can redistribute it and/or modify
6it under the terms of the GNU Lesser General Public License as published by
7the Free Software Foundation; either version 3 of the License, or (at your
8option) any later version.
9
10The GNU MP Library is distributed in the hope that it will be useful, but
11WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
12or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
13License for more details.
14
15You should have received a copy of the GNU Lesser General Public License
16along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.
17
18
19
20
21
22               GMP SPEED MEASURING AND PARAMETER TUNING
23
24
25The programs in this directory are for knowledgeable users who want to
26measure GMP routines on their machine, and perhaps tweak some settings or
27identify things that can be improved.
28
29The programs here are tools, not ready to run solutions.  Nothing is built
30in a normal "make all", but various Makefile targets described below exist.
31
32Relatively few systems and CPUs have been tested, so be sure to verify that
33results are sensible before relying on them.
34
35
36
37
38MISCELLANEOUS NOTES
39
40--enable-assert
41
42    Don't configure with --enable-assert, since the extra code added by
43    assertion checking may influence measurements.
44
45Direct mapped caches
46
47    Some effort has been made to accommodate CPUs with direct mapped caches,
48    by putting data blocks more or less contiguously on the stack.  But this
49    will depend on TMP_ALLOC using alloca, and even then it may or may not
50    be enough.
51
52FreeBSD 4.2 i486 getrusage
53
54    This getrusage seems to be a bit doubtful, it looks like it's
55    microsecond accurate, but sometimes ru_utime remains unchanged after a
56    time of many microseconds has elapsed.  It'd be good to detect this in
57    the time.c initializations, but for now the suggestion is to pretend it
58    doesn't exist.
59
60        ./configure ac_cv_func_getrusage=no
61
62NetBSD 1.4.1 m68k macintosh time base
63
64    On this system it's been found getrusage often goes backwards, making it
65    unusable (time.c getrusage_backwards_p detects this).  gettimeofday
66    sometimes doesn't update atomically when it crosses a 1 second boundary.
67    Not sure what to do about this.  Expect possible intermittent failures.
68
69SCO OpenUNIX 8 /etc/hw
70
71    /etc/hw takes about a second to return the cpu frequency, which suggests
72    perhaps it's measuring each time it runs.  If this is annoying when
73    running the speed program repeatedly then set a GMP_CPU_FREQUENCY
74    environment variable (see TIME BASE section below).
75
76Timing on GNU/Linux
77
78    On Linux, timing currently uses the cycle counter. This is unreliable,
79    since the counter is not saved and restored at context switches (unlike
80    FreeBSD and Solaris where the cycle counter is "virtualized").
81
82    Using the clock_gettime method with CLOCK_PROCESS_CPUTIME_ID (posix) or
83    CLOCK_VIRTUAL (BSD) should be more reliable. To get clock_gettime
84    with glibc, one has to link with -lrt (which also drags in the pthreads
85    threading library). configure.in must be hacked to detect this and
86    arrange proper linking. Something like
87
88      old_LIBS="$LIBS"
89      AC_SEARCH_LIBS(clock_gettime, rt, [AC_DEFINE(HAVE_CLOCK_GETTIME)])
90      TUNE_LIBS="$LIBS"
91      LIBS="$old_LIBS"
92
93      AC_SUBST(TUNE_LIBS)
94
95    might work.
96
97Low resolution timebase
98
99    Parameter tuning can be very time consuming if the only timebase
100    available is a 10 millisecond clock tick, to the point of being
101    unusable.  This is currently the case on VAX and ARM systems.
102
103
104
105
106PARAMETER TUNING
107
108The "tuneup" program runs some tests designed to find the best settings for
109various thresholds, like MUL_TOOM22_THRESHOLD.  Its output can be put
110into gmp-mparam.h.  The program is built and run with
111
112        make tune
113
114If the thresholds indicated are grossly different from the values in the
115selected gmp-mparam.h then there may be a performance boost in applicable
116size ranges by changing gmp-mparam.h accordingly.
117
118Be sure to do a full reconfigure and rebuild to get any newly set thresholds
119to take effect.  A partial rebuild is enough sometimes, but a fresh
120configure and make is certain to be correct.
121
122If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of
123the mpn subdirectories then the values from "make tune" should be similar.
124But check that the configured CPU is right and there are no machine specific
125effects causing a difference.
126
127It's hoped the compiler and options used won't have too much effect on
128thresholds, since for most CPUs they ultimately come down to comparisons
129between assembler subroutines.  Missing out on the longlong.h macros by not
130using gcc will probably have an effect.
131
132Some thresholds produced by the tune program are merely single values chosen
133from what's a range of sizes where two algorithms are pretty much the same
134speed.  When this happens the program is likely to give somewhat different
135values on successive runs.  This is noticeable on the toom3 thresholds for
136instance.
137
138
139
140
141SPEED PROGRAM
142
143The "speed" program can be used for measuring and comparing various
144routines, and producing tables of data or gnuplot graphs.  Compile it with
145
146	make speed
147
148(Or on DOS systems "make speed.exe".)
149
150Here are some examples of how to use it.  Check the code for all the
151options.
152
153Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05
154(whichever is greater).
155
156        ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n
157	gnuplot foo.gnuplot
158
159Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and
160showing under mpn_lshift the difference between it and mpn_add_n.
161
162	./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1
163
164Using option -c for times in cycles is interesting but normally only
165necessary when looking carefully at assembler subroutines.  You might think
166it would always give an integer value, but this doesn't happen in practice,
167probably due to overheads in the time measurements.
168
169In the free-form output the "#" symbol against a measurement means the
170corresponding routine is fastest at that size.  This is a convenient visual
171cue when comparing different routines.  The graph data files <name>.data
172don't get this since it would upset gnuplot or other data viewers.
173
174
175
176
177TIME BASE
178
179The time measuring method is determined in time.c, based on what the
180configured host has available.  A cycle counter is preferred, possibly
181supplemented by another method if the counter has a limited range.  A
182microsecond accurate getrusage() or gettimeofday() will work quite well too.
183
184The cycle counters (except possibly on alpha) and gettimeofday() will depend
185on the machine being otherwise idle, or rather on other jobs not stealing
186CPU time from the measuring program.  Short routines (those that complete
187within a timeslice) should work even on a busy machine.
188
189Some trouble is taken by speed_measure() in common.c to avoid ill effects
190from sporadic interrupts, or other intermittent things (like cron waking up
191every minute).  But generally an idle machine will be necessary to be
192certain of consistent results.
193
194The CPU frequency is needed to convert between cycles and seconds, or for
195when a cycle counter is supplemented by getrusage() etc.  The speed program
196will convert as necessary according to the output format requested.  The
197tune program will work with either cycles or seconds.
198
199freq.c knows how to get the frequency on some systems, or can measure a
200cycle counter against gettimeofday() or getrusage(), but when that fails, or
201needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be
202used (in Hertz).  For example in "bash" on a 650 MHz machine,
203
204	export GMP_CPU_FREQUENCY=650e6
205
206A high precision time base makes it possible to get accurate measurements in
207a shorter time.
208
209
210
211
212EXAMPLE COMPARISONS - VARIOUS
213
214Here are some ideas for things that can be done with the speed program.
215
216There's always going to be a certain amount of overhead in the time
217measurements, due to reading the time base, and in the loop that runs a
218routine enough times to get a reading of the desired precision.  Noop
219functions taking various arguments are available to measure this.  The
220"overhead" printed by the speed program each time in its intro is the "noop"
221routine, but note that this is just for information, it isn't deducted from
222the times printed or anything.
223
224	./speed -s 1 noop noop_wxs noop_wxys
225
226To see how many cycles per limb a routine is taking, look at the time
227increase when the size increments, using option -D.  This avoids fixed
228overheads in the measuring.  Also, remember many of the assembler routines
229have unrolled loops, so it might be necessary to compare times at, say, 16,
23032, 48, 64 etc to see what the unrolled part is taking, as opposed to any
231finishing off.
232
233        ./speed -s 16-64 -t 16 -C -D mpn_add_n
234
235The -C option on its own gives cycles per limb, but is really only useful at
236big sizes where fixed overheads are small compared to the code doing the
237real work.  Remember of course memory caching and/or page swapping will
238affect results at large sizes.
239
240        ./speed -s 500000 -C mpn_add_n
241
242Once a calculation stops fitting in the CPU data cache, it's going to start
243taking longer.  Exactly where this happens depends on the cache priming in
244the measuring routines, and on what sort of "least recently used" the
245hardware does.  Here's an example for a CPU with a 16kbyte L1 data cache and
24632-bit limb, showing a suddenly steeper curve for mpn_add_n at about 2000
247limbs.
248
249        ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n
250	gnuplot foo.gnuplot
251
252When a routine has an unrolled loop for, say, multiples of 8 limbs and then
253an ordinary loop for the remainder, it can happen that it's actually faster
254to do an operation on, say, 8 limbs than it is on 7 limbs.  The following
255draws a graph of mpn_sub_n, to see whether times smoothly increase with
256size.
257
258        ./speed -s 1-100 -c -P foo mpn_sub_n
259	gnuplot foo.gnuplot
260
261If mpn_lshift and mpn_rshift have special case code for shifts by 1, it
262ought to be faster (or at least not slower) than shifting by, say, 2 bits.
263
264        ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2
265
266An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and
267if the lshift isn't faster there's an obvious improvement that's possible.
268
269        ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self
270
271On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the
272destination is one of the sources is faster than a separate destination.
273Here's an example to see this.  ".1" selects dst==src1 for mpn_add_n (and
274mpn_sub_n), for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL.
275
276        ./speed -s 1-200 -c mpn_add_n mpn_add_n.1
277
278The gmp manual points out that divisions by powers of two should be done
279using a right shift because it'll be significantly faster than an actual
280division.  The following shows by what factor mpn_rshift is faster than
281mpn_divrem_1, using division by 32 as an example.
282
283        ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32
284
285
286
287
288EXAMPLE COMPARISONS - MULTIPLICATION
289
290mul_basecase takes a ".<r>" parameter which is the first (larger) size
291parameter.  For example to show speeds for 20x1 up to 20x15 in cycles,
292
293        ./speed -s 1-15 -c mpn_mul_basecase.20
294
295mul_basecase with no parameter does an NxN multiply, so for example to show
296speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20, in cycles,
297
298        ./speed -s 1-20 -c mpn_mul_basecase
299
300sqr_basecase is implemented by a "triangular" method on most CPUs, making it
301up to twice as fast as mul_basecase.  In practice loop overheads and the
302products on the diagonal mean it falls short of this.  Here's an example
303running the two and showing by what factor an NxN mul_basecase is slower
304than an NxN sqr_basecase.  (Some versions of sqr_basecase only allow sizes
305below SQR_TOOM2_THRESHOLD, so if it crashes at that point don't worry.)
306
307        ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase
308
309The technique described above with -CD for showing the time difference in
310cycles per limb between two size operations can be done on an NxN
311mul_basecase using -E to change the basis for the size increment to N*N.
312For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16
313doing 256 limbs.  The following therefore shows the per crossproduct speed
314of mul_basecase and sqr_basecase at around 20x20 limbs.
315
316        ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase
317
318Of course sqr_basecase isn't really doing NxN crossproducts, but it can be
319interesting to compare it to mul_basecase as if it was.  For sqr_basecase
320the -F option can be used to base the deltas on N*(N+1)/2 operations, which
321is the triangular products sqr_basecase does.  For example,
322
323        ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase
324
325Both -E and -F are preliminary and might change.  A consistent approach to
326using them when claiming certain per crossproduct or per triangularproduct
327speeds hasn't really been established, but the increment between speeds in
328the range karatsuba will call seems sensible, that being k to k/2.  For
329instance, if the karatsuba threshold was 20 for the multiply and 30 for the
330square,
331
332        ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase
333        ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase
334
335
336
337EXAMPLE COMPARISONS - MALLOC
338
339The gmp manual recommends application programs avoid excessive initializing
340and clearing of mpz_t variables (and mpq_t and mpf_t too).  Every new
341variable will at a minimum go through an init, a realloc for its first
342store, and finally a clear.  Quite how long that takes depends on the C
343library.  The following compares an mpz_init/realloc/clear to a 10 limb
344mpz_add.  Don't be surprised if the mallocing is quite slow.
345
346        ./speed -s 10 -c mpz_init_realloc_clear mpz_add
347
348On some systems malloc and free are much slower when dynamic linked.  The
349speed-dynamic program can be used to see this.  For example the following
350measures malloc/free, first static then dynamic.
351
352        ./speed -s 10 -c malloc_free
353        ./speed-dynamic -s 10 -c malloc_free
354
355Of course a real world program has big problems if it's doing so many
356mallocs and frees that it gets slowed down by a dynamic linked malloc.
357
358
359
360
361
362EXAMPLE COMPARISONS - STRING CONVERSIONS
363
364mpn_get_str does a binary to string conversion.  The base is specified with
365a ".<r>" parameter, or decimal by default.  Power of 2 bases are much faster
366than general bases.  The following compares decimal and hex for instance.
367
368        ./speed -s 1-20 -c mpn_get_str mpn_get_str.16
369
370Smaller bases need more divisions to split a given size number, and so are
371slower.  The following compares base 3 and base 9.  On small operands 9 will
372be nearly twice as fast, though at bigger sizes this reduces since in the
373current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit
374limbs) and those divisions come to dominate.
375
376        ./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9
377
378mpn_set_str does a string to binary conversion.  The base is specified with
379a ".<r>" parameter, or decimal by default.  Power of 2 bases are faster than
380general bases on large conversions.
381
382	./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10
383
384mpn_set_str also has some special case code for decimal which is a bit
385faster than the general case, basically by giving the compiler a chance to
386optimize some multiplications by 10.
387
388	./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11
389
390
391
392
393EXAMPLE COMPARISONS - GCDs
394
395mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both
396x and y are single limbs.  This isn't tuned currently, but a value can be
397established by a measurement like
398
399	./speed -s 10-32 mpn_gcd_1.10
400
401This runs src[0] from 10 to 32 bits, and y fixed at 10 bits.  If the div
402threshold is high, say 31 so it's effectively disabled then a 32x10 bit gcd
403is done by nibbling away at the 32-bit operands bit-by-bit.  When the
404threshold is small, say 1 bit, then an initial x%y is done to reduce it to a
40510x10 bit operation.
406
407The threshold in mpn/generic/gcd_1.c or the various assembler
408implementations can be tweaked up or down until there's no more speedups on
409interesting combinations of sizes.  Note that this affects only a 1x1 limb
410operation and so isn't very important.  (An Nx1 limb operation always does
411an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.)
412
413
414
415
416SPEED PROGRAM EXTENSIONS
417
418Potentially lots of things could be made available in the program, but it's
419been left at only the things that have actually been wanted and are likely
420to be reasonably useful in the future.
421
422Extensions should be fairly easy to make though.  speed-ext.c is an example,
423in a style that should suit one-off tests, or new code fragments under
424development.
425
426many.pl is a script for generating a new speed program supplemented with
427alternate versions of the standard routines.  It can be used for measuring
428experimental code, or for comparing different implementations that exist
429within a CPU family.
430
431
432
433
434THRESHOLD EXAMINING
435
436The speed program can be used to examine the speeds of different algorithms
437to check the tune program has done the right thing.  For example to examine
438the karatsuba multiply threshold,
439
440	./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n
441
442When examining the toom3 threshold, remember it depends on the karatsuba
443threshold, so the right karatsuba threshold needs to be compiled into the
444library first.  The tune program uses specially recompiled versions of
445mpn/mul_n.c etc for this reason, but the speed program simply uses the
446normal libgmp.la.
447
448Note further that the various routines may recurse into themselves on sizes
449far enough above applicable thresholds.  For example, mpn_kara_mul_n will
450recurse into itself on sizes greater than twice the compiled-in
451MUL_TOOM22_THRESHOLD.
452
453When doing the above comparison between mul_basecase and kara_mul_n what's
454probably of interest is mul_basecase versus a kara_mul_n that does one level
455of Karatsuba then calls to mul_basecase, but this only happens on sizes less
456than twice the compiled MUL_TOOM22_THRESHOLD.  A larger value for that
457setting can be compiled-in to avoid the problem if necessary.  The same
458applies to toom3 and DC, though in a trickier fashion.
459
460There are some upper limits on some of the thresholds, arising from arrays
461dimensioned according to a threshold (mpn_mul_n), or asm code with certain
462sized displacements (some x86 versions of sqr_basecase).  So putting huge
463values for the thresholds, even just for testing, may fail.
464
465
466
467
468FUTURE
469
470Make a program to check the time base is working properly, for small and
471large measurements.  Make it able to test each available method, including
472perhaps the apparent resolution of each.
473
474Make a general mechanism for specifying operand overlap, and a syntax like
475maybe "mpn_add_n.dst=src2" to select it.  Some measuring routines do this
476sort of thing with the "r" parameter currently.
477
478
479
480----------------
481Local variables:
482mode: text
483fill-column: 76
484End:
485