-
Notifications
You must be signed in to change notification settings - Fork 8
/
TODO
121 lines (113 loc) · 6.2 KB
/
TODO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
ToDo's (see also TODO.sp):
Table of contents:
1) efficiency/memory
2) interface
3) documentation
4) installation
5) bugs
6) others
1) efficiency/memory
- use a random sigma value of 64 bits by default
- try the mpn/generic/{sb,dc,mu}_bdiv_qr.c functions in GMP >= 4.3.0 for REDC
- the conversion from NTT primes to mpz_t in function mpzspv_to_mpzv()
(file mpzspv.c) is quadratic. A faster conversion is possible with a
product tree (already done for the mpz_t -> NTT conversion).
- even worse, mpzspm_init seems to be cubic in the input size (because the CRT
algorithm used is quadratic in sp_num). We should use a subquadratic CRT.
- the "Reducing G * H" step is faster in NTT than with KS. This is probably
due to the fact that some transforms are cached in the NTT mode.
- the "Reducing G * H" step can be improved as follows: first compute
D = GH*I mod (x^d+1) where d = deg(F), and I = 1/F mod (x^d+1); then
compute E = D*F mod (x^d-1); finally compute T = (GH-E)/2 mod (x^d+1).
T equals the Montgomery product GH/(x^d+1) mod F.
See the paper "Fast convolution meets Montgomery" by Preda Mihailescu
(Mathematics of Computation).
- slowdown in stage 1 with REDC between a 58672-digit number and a
58688-digit number [reported by [email protected], 29 Aug 2007]
(((2003663613*2^195000-2)/(2*23*173*3863))/1954173900202379)/3612632846010637
((2003663613*2^195000-2)/(2*23*173*3863))/1954173900202379
with B1=1000 on an Opteron (44.2s for c58672, 67.5s for c58688).
The culprit seems to be the REDC routine in mpmod.c: indeed, in case the
modulus has n limbs, but the most significant one has only a few bits,
the product (called x in REDC) has only 2n-1 limbs, and we never call
Mulders's short product in ecm_redc_n (however the else-code using full
products seem faster in that case).
For c58672, if one replaces if (xn == 2 * n) in mpmod.c/REDC by
if (xn >= 2 * n - 1), the time of stage 1 grows from 44s to 64s, whereas
ecm_redc_n should be faster...
This problem is still present in 6.2, ecm_redc_n should be better tuned,
in particular the choice k=0.75*n in ecm_mul_lo_n() is far from optimal.
- in Brent-Suyama's extension, the evaluation of a polynomial of degree k
over N consecutive values is currently done using a O(k N) algorithm
[table of differences]. One can do O(N/k M(k)), cf Section 3 from
"Linear recurrences with polynomial coefficients and application to
integer factorization and Cartier-Manin operator", by Alin Bostan,
Pierrick Gaudry and Eric Schost, SIAM journal on computing,
vol. 36, no. 6, pp. 1777 - 1806, 2007.
It is not clear if this result also applies to ECM, but at least it
should word for P-1 and P+1.
- why restrict the use of mpn_mul_fft to Fermat numbers? We could use it
for any cofactor of 2^(n*BITS_PER_MP_LIMB)+1, as long as
mpn_fft_next_size (n, mpn_fft_best_k (n, S1 == S2)) == n.
- use mpres in step 2 (Target: 7.0)
- write a mpn version of add3 and duplicate
- rewrite entire mpmod.c to be based on mpn_* functions, not mpz_*
- take relative speed of multiplying/squaring into account in PRAC
(DN: couldn't get any significant speed increase)
- use/implement a mpn_mul_hi_n routine for use in mpn_REDC
- use mpn_addmul_2, mpn_addmul_4 in the basecase REDC [for machines
where it exists]. ASM code should perhaps be moved into GMP.
- try McLaughlin's algorithm for Montgomery's modular multiplication
(http://www.ams.org/mcom/0000-000-00/S0025-5718-03-01543-6/home.html)
- consider Colin Percival's generalized DWT for multiplication modulo
k*a^n+b, where k*a*b is highly composite. May belong to GMP rather than
GMP-ECM.
- implement assembly code (redc.asm) for other architectures
- allow composite d2, or better use the S1+S2 idea from the P+-1 algorithm
of Montgomery and Kruppa.
- init mpz_t's with correct amount of memory allocated to avoid reallocs.
Check for reallocs with GMP's memory interface routines. (Partly done.)
- try sliding window multiplication for ECM stage 1 (Target: 7.0)
- choose Brent/Suyama polynomial according to B2/k and not B2!
- Adjust estimated memory to take into account -treefile and NTT
(done but improvement possible)
- when GWNUM is used, lower the default B2 (James Wanless, 17 Mar 2006,
james at grok.ltd.uk)
- implement enhanced standard continuation? With graph cover algorithm?
- parallel/distributed stage 2?
- add curve selection for torsion group of order 8 or 16, see Montgomery's
thesis (request of Peter-Lawrence Montgomery)
- Torbj"orn Granlund suggested faster code for mpn_mod_1(), used extensively
in NTT. See
https://sympa.inria.fr/sympa/arc/ecm-discuss/2008-05/msg00000.html
2) interface
- from Mark Rodenkirch <[email protected]> 08 April 2011: print messages like
"Step 1: 1500000/100000000" with a command-line option (or with -v)
https://sympa.inria.fr/sympa/arc/ecm-discuss/2011-04/msg00000.html
- with -resume, print %time for THIS RUN instead of total run?
[suggested by SleepHound <[email protected]>]
Add CPUTIME=... in the save file, to take into account the total cpu time
spend so far (in seconds). George Woltman agrees for that change. It won't
hurt prime95/mprime -> will be added for his next version.
- when resuming, print the *initial* x0 for P-1/P+1?
- [from Jakub Pawlewicz <[email protected]>] add an option -stage1time t
to tell the step 1 time, when done by another program. PZ: or better
have it in resume file? (Target: 6.1. Command line option done)
3) documentation
4) installation
- check for __builtin_constant_p and __builtin_expect at configure time
- [suggested by Peter Montgomery] add the possibility to compile a "fat"
binary, which automatically selects the best mulredc assembly code
depending on the cpuid [see TODO.fat]
- [suggested by Thomas Kunz, who did port GMP-ECM to the PS3, i.e., to the
Cell architecture]: several changes to make it easier to port GMP-ECM to
specific architectures. Cf TODO.kunz.
5) bugs
6) others
- add point counting algorithm? SEA implementation exists for Pari/GP,
use that?
- let user specify previous factoring work, compute distribution of
candidate factors, compute probability of/est. time to finding a factor
with given parameters.
- re-write in C++? Lots of work, but would make parts of the code much
cleaner.