Subversion Repositories Games.Chess Giants

Rev 33 → Rev 108
 #include "data.h"
 #include "epdglue.h"
+#if (CPUS > 1)
-/* modified 04/30/14 */
+/* modified 11/04/15 */
 /*
  *******************************************************************************
  *                                                                             *
- *   Thread() is the driver for the threaded parallel search in Crafty.  The   *
+ *   Split() is the driver for the threaded parallel search in Crafty.  The    *
  *   basic idea is that whenever we notice that one (or more) threads are in   *
- *   their idle loop, we drop into thread, from Search(), and begin a new      *
- *   parallel search from this node.  This is simply a problem of copying the  *
- *   search state space for each thread working at this node, then sending     *
- *   everyone off to SearchParallel() to search this node in parallel.         *
+ *   their idle loop, we drop into Split(), from Search(), and begin a new     *
+ *   parallel search from this node.  This is simply a problem of establishing *
+ *   a new split point, and then letting each thread join this split point and *
+ *   copy whatever data they need.                                             *
+ *                                                                             *
+ *   This is generation II of Split().  The primary differences address two    *
+ *   specific performance-robbing issues:  (1) excessive waiting for a split   *
+ *   to be done, and (2) excessive waiting on specific locks.  Generation II   *
+ *   addresses both of these to significantly improve performance.             *
+ *                                                                             *
+ *   The main difference between Gen I and Gen II is the effort required to    *
+ *   split the tree and which thread(s) expend this effort.  In generation I,  *
+ *   the parent thread was responsible for allocating a split block for each   *
+ *   child thread, and then copying the necessary data from the parent split   *
+ *   block to these child split blocks.  When all of this was completed, the   *
+ *   child processes were released to start the parallel search after being    *
+ *   held while the split / copy operations were done.  In the generation II   *
+ *   Split() we now simply allocate a new split block for THIS thread, flag    *
+ *   the parent split block as joinable, and then go directly to ThreadWait()  *
+ *   which will drop us back in to the search immediately.  The idle threads   *
+ *   are continually looping on Join() which will jump them right into this    *
+ *   split block letting them do ALL the work of allocating a split block,     *
+ *   filling it in, and then copying the data to their local split block.      *
+ *   This distributes the split overhead among all the threads that split,     *
+ *   rather than this thread having to do all the work while the other threads *
+ *   sit idle.                                                                 *
+ *                                                                             *
+ *   Generation II is also much more lightweight, in that it copies only the   *
+ *   bare minimum from parent to child.  Generation I safely copied far too    *
+ *   much since this code was being changed regularly, but that is no longer   *
+ *   necessary overhead.                                                       *
+ *                                                                             *
+ *   Generation II has a zero contention split algorithm.  In the past, when a *
+ *   thread became idle, it posted a global split request and any thread that  *
+ *   was at an appropriate point would try to split.  But while it was doing   *
+ *   the splitting, other threads that were also willing to split would "get   *
+ *   in line" because Crafty used a global lock to prevent two threads from    *
+ *   attempting to split at the same instant in time.  They got in line, and   *
+ *   waited for the original splitter to release the lock, but now they have   *
+ *   no idle threads to split with.  A waste of time.  Now we allow ANY thread *
+ *   to attempt to split at the current ply.  When we do what might be called  *
+ *   a "gratuitous split" the only restriction is that if we have existing     *
+ *   "gratuitous split points" (split blocks that are joinable but have not    *
+ *   yet been joined), then we limit the number of such splits (per thread) to *
+ *   avoid excessive overhead.                                                 *
+ *                                                                             *
+ *   Generation II takes another idea from DTS, the idea of "late-join".  The  *
+ *   idea is fairly simple.  If, when a thread becomes idle, there are already *
+ *   other split points being searched in parallel, then we will try to join   *
+ *   one of them rather than waiting for someone to ask us to help.  We use    *
+ *   some simple criteria:  (1) The split point must be joinable, which simply *
+ *   means that no processor has exited the split point yet (which would let   *
+ *   us know there is no more work here and a join would be futile);  (2) We   *
+ *   compute an "interest" value which is a simple formula based on depth at   *
+ *   the split point, and the number of moves already searched.  It seems less *
+ *   risky to choose a split point with max depth AND minimum moves already    *
+ *   searched so that there is plenty to do.  This was quite simple to add     *
+ *   after the rest of the Generation II rewrite.  In fact, this is now THE    *
+ *   way threads join a split point, period, which further simplifies the code *
+ *   and improves efficiency.  IE, a thread can split when idle threads are    *
+ *   noticed, or if it is far enough away from the tips to make the cost       *
+ *   negligible.  At that point any idle thread(s) can join immediately, those *
+ *   that become idle later can join when they are ready.                      *
  *                                                                             *
  *   There are a number of settable options via the command-line or .craftyrc  *
  *   initialization file.  Here's a concise explanation for each option and an *
  *   occasional suggestion for testing/tuning.                                 *
+ *                                                                             *
+ *   smp_affinity (command = smpaffinity=<n>) is used to enable or disable     *
+ *      processor affinity.  -1 disables affinity and lets threads run on any  *
+ *      available core.  If you use an integer <n> then thread zero will bind  *
+ *      itself to cpu <n> and each additional thread will bind to the next     *
+ *      higher cpu number.  This is useful if you try to run two copies of     *
+ *      crafty on the same machine, now you can cause one to bind to the first *
+ *      <n> cores, and the second to the last <n> cores.  For the first        *
+ *      instance of Crafty, you would use smpaffinity=0, and for the second    *
+ *      smpaffinity=8, assuming you are running 8 threads per copy on a 16 cpu *
+ *      machine.  If you get this wrong, you can have more than one thread on  *
+ *      the same cpu which will significantly impact performance.              *
  *                                                                             *
  *   smp_max_threads (command = smpmt=n) sets the total number of allowable    *
  *      threads for the search.  The default is one (1) as Crafty does not     *
  *      assume it should use all available resources.  For optimal performance *
  *      this should be set to the number of physical cores your machine has,   *
  *      which does NOT include hyperthreaded cores.                            *
  *                                                                             *
  *   smp_split_group (command = smpgroup=n) sets the maximum number of threads *
- *      at any single split point, with the exception of split points where    *
- *      ply <= 4 where ALL threads are allowed to split together ignoring this *
- *      limit.                                                                 *
+ *      at any single split point, with the exception of split points fairly   *
+ *      close to the root where ALL threads are allowed to split together,     *
+ *      ignoring this limit.  Note that this is ignored in the first 1/2 of    *
+ *      the tree (the nodes closer to the root).  There it is actually good to *
+ *      split and get all active threads involved.                             *
  *                                                                             *
  *   smp_min_split_depth (command = smpmin=n) avoids splitting when remaining  *
  *      depth < n.  This is used to balance splitting overhead cost against    *
- *      the speed gain the parallel search produes.  The default is currently  *
+ *      the speed gain the parallel search produces.  The default is currently *
  *      5 (which could change with future generations of Intel hardware) but   *
  *      values between 4 and 8 will work.  Larger values allow somewhat fewer  *
  *      splits, which reduces overhead, but it also increases the percentage   *
  *      of the time where a thread is waiting on work.                         *
  *                                                                             *
- *   smp_split_nodes (command = smpnodes=n) will not allow a thread to split   *
- *      until it has searched at least N nodes, to prevent excessive splitting *
- *      out near the tips in endgames.  The default is 2000.  Larger values    *
- *      will reduce splitting overhead but will increase the amount of time    *
- *      a thread is waiting on work.                                           *
- *                                                                             *
- *   smp_split_at_root (command=smproot=0 or 1) enables (1) or disables (0)    *
+ *   smp_split_at_root (command = smproot=0 or 1) enables (1) or disables (0)  *
  *      splitting the tree at the root.  This defaults to 1 which produces the *
  *      best performance by a significant margin.  But it can be disabled if   *
  *      you are playing with code changes.                                     *
  *                                                                             *
+ *   smp_gratuitous_depth (command = smpgd=<n>) controls "gratuitous splits"   *
+ *      which are splits that are done without any idle threads.  This sets a  *
+ *      depth limit (remaining depth) that must be present before such a split *
+ *      can be done.  Making this number larger will reduce the number of      *
+ *      these splits.  Making it too small will increase overhead slightly and *
+ *      increase split block usage significantly.                              *
+ *                                                                             *
+ *   smp_gratuitous_limit (command = smpgl=<n>) limits the number of these     *
+ *      splits that a thread can do.  Once a thread has this number of         *
+ *      unjoined split points, it will not be allowed to split any more until  *
+ *      one or more threads join at least one of the existing split points.    *
+ *      In the smp search statistics, where you see output that looks like     *
+ *      this:                                                                  *
+ *                                                                             *
+ *        splits=m(n) ...                                                      *
+ *                                                                             *
+ *      m is the total splits done, n is the number of "wasted splits" which   *
+ *      are basically gratuitous splits where no thread joined before this     *
+ *      split point was completed and deallocated.                             *
- *   The best way to tune any of these parameters is to run SEVERAL test cases *
- *   (positions) with max threads set.  Run each set of positions several      *
- *   times with each parameter change you want to try (do them ONE at a time   *
- *   for obvious reasons).  Pick the value with the smallest overall search    *
- *   time.  The more cores you use, the more times you should run each test,   *
- *   since parallel search is highly non-deterministic and you need several    *
- *   runs to get a reasonable average.                                         *
+ *                                                                             *
+ *   The best way to tune all of these parameters is to use the "autotune"     *
+ *   command (see autotune.c and help autotune) which will automatically run   *
+ *   tests and optimize the parameters.  More details are in the autotune.c    *
+ *   source file.                                                              *
  *                                                                             *
  *   A few basic "rules of the road" for anyone interested in changing or      *
  *   adding to any of this code.                                               *
  *                                                                             *
  *   1.  If, at any time, you want to modify your private split block, no lock *
  *       is required.                                                          *
  *                                                                             *
  *   2.  If, at any time, you want to modify another split block, such as the  *
  *       parent split block shared move list, you must acquire the lock in the *
- *       split block first.  IE (tree->parent->lock to lock the parent split   *
- *       block during NextMove() and such).  Note that this ONLY applies to    *
- *       search-related data, NOT the SMP data within the split block (parent  *
- *       pointer, sibling pointers, number of processors working here, etc).   *
- *       Modifying those falls under the next lock below.                      *
+ *       split block first.  IE tree->parent->lock to lock the parent split    *
+ *       block during NextMove() and such.                                     *
  *                                                                             *
- *   3.  If you want to modify any SMP-related data, such as allocating a      *
- *       split block, or telling sibling processes to stop, etc, then you must *
- *       acquire the global "smp_lock" lock first.  This prevents any sort of  *
- *       race condition that could corrupt the split block tree organization.  *
- *       This applies to ANY smp-related variable from the simple smp_idle     *
- *       variable through the various sibling chains and such.                 *
+ *   3.  If you want to modify any SMP-related data that spans multiple split  *
+ *       blocks, such as telling sibling processes to stop, etc, then you must *
+ *       acquire the global "lock_smp" lock first.  This prevents a deadlock   *
+ *       caused by two different threads walking the split block chain from    *
+ *       different directions, and acquiring the split block locks in          *
+ *       different orders, which could cause a catastrophic deadlock to occur. *
+ *       This is an infrequent event so the overhead is not significant.       *
  *                                                                             *
  *   4.  If you want to do any sort of I/O operation, you must acquire the     *
  *       "lock_io" lock first.  Since threads share descriptors, there are     *
  *       lots of potential race conditions, from the simple tty-output getting *
  *       interlaced from different threads, to outright data corruption in the *
  *       book or log files.                                                    *
- *                                                                             *
- *   5.  There is a specific "lock_split" that is used to prevent a lot of     *
- *       queued waiting to do splits, but it is unlikely that lock would ever  *
- *       be used anywhere other than where it is currently accessed in         *
- *       Search() at the parallel split initiation code.                       *
- *                                                                             *
- *   6.  If you want to alter the root move list, you must first acquire       *
- *       lock_root, since the root move list is shared and multiple threads    *
- *       can attempt to modify it at the same time.  Overlooking this can      *
- *       result in a corrupted root move list with one or more moves missing   *
- *       or duplicated.                                                        *
  *                                                                             *
  *   Some of the bugs caused by failing to acquire the correct lock will only  *
  *   occur infrequently, and they are extremely difficult to find.  Some only  *
  *   show up in a public game where everyone is watching, to cause maximum     *
  *   embarrassment and cause the program to do something extremely stupid.     *
  *                                                                             *
  *******************************************************************************
  */
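The late-join criteria described in the Rev 108 comment (a "joinable" flag plus an "interest" score that favors deep split points with few moves already searched) can be sketched in isolation. This is a minimal illustration using hypothetical stand-in types, not Crafty's real TREE/split-block structures; the `2 * depth - moves_searched` weighting mirrors the formula that appears later in Join().

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for split blocks -- hypothetical types,
   not Crafty's real structures. */
typedef struct {
  int joinable;        /* 1 if no thread has exited this split point yet */
  int depth;           /* remaining depth at the split point */
  int moves_searched;  /* moves already searched at the split point */
} SplitBlock;

/* Interest heuristic from the comment above: prefer deep split points
   with few moves already searched. */
static int interest(const SplitBlock *b) {
  return 2 * b->depth - b->moves_searched;
}

/* Scan all blocks and return the most interesting joinable one, or NULL
   if nothing is joinable (the caller would then ask for new splits). */
static SplitBlock *pick_join_target(SplitBlock *blocks, size_t n) {
  SplitBlock *best = NULL;
  int best_interest = -999999;
  for (size_t i = 0; i < n; i++) {
    if (!blocks[i].joinable)
      continue;
    int v = interest(&blocks[i]);
    if (v > best_interest) {
      best_interest = v;
      best = &blocks[i];
    }
  }
  return best;
}
```

A joinable block at depth 10 with 2 moves searched (interest 18) beats one at depth 8 with none searched (interest 16), matching the "max depth AND minimum moves already searched" preference.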
-int Thread(TREE * RESTRICT tree) {
-  int proc, tsplit = 0, nblocks = 0;
+int Split(TREE * RESTRICT tree) {
+  TREE *child;
+  int tid, tstart, tend;
 
 /*
  ************************************************************
  *                                                          *
- *  First, we make sure that there are idle threads. It is  *
- *  possible that we get here after they have all been      *
- *  claimed by another thread.  In that case we return.     *
+ *  Here we prepare to split the tree.  All we really do in *
+ *  the Generation II threading is grab a split block for   *
+ *  this thread, then flag the parent as "joinable" and     *
+ *  then jump right to ThreadWait() to resume where we left *
+ *  off, with the expectation (but not a requirement) that  *
+ *  other threads will join us to help.                     *
+ *                                                          *
+ *  Idle threads are sitting in ThreadWait() repeatedly     *
+ *  calling Join() to find them a split point, which we are *
+ *  fixing to provide.  They will then join as quickly as   *
+ *  they can, and other threads that become idle later can  *
+ *  also join without any further splitting needed.         *
+ *                                                          *
+ *  If we are unable to allocate a split block, we simply   *
+ *  abort this attempted split and return to the search     *
+ *  since other threads will also split quickly.            *
  *                                                          *
  ************************************************************
  */
-  Lock(lock_smp);
-  for (proc = 0; proc < smp_max_threads && !thread[proc].idle; proc++);
-  if (proc == smp_max_threads) {
-    smp_split = 0;
-    Unlock(lock_smp);
-    return 0;
-  }
+  tstart = ReadClock();
+  tree->nprocs = 0;
+  for (tid = 0; tid < (int) smp_max_threads; tid++) // Pierre-Marie Baty -- added type cast
+    tree->siblings[tid] = 0;
+  child = GetBlock(tree, tree->thread_id);
+  if (!child)
+    return 0;
+  CopyFromParent(child);
+  thread[tree->thread_id].tree = child;
+  tree->joined = 0;
+  tree->joinable = 1;
+  parallel_splits++;
+  smp_split = 0;
+  tend = ReadClock();
+  thread[tree->thread_id].idle += tend - tstart;
 /*
  ************************************************************
  *                                                          *
- *  Now we start copying the current "state" from this      *
- *  thread into new thread blocks so that the threads can   *
- *  search this position without interfering with each      *
- *  other.  As we copy, we link those blocks as siblings of *
- *  the parent node, and point each sibling back to the     *
- *  parent so we can unwind this confusion as the threads   *
- *  complete their work.                                    *
+ *  We have successfully created a split point, which means *
+ *  we are done.  The instant we set the "joinable" flag,   *
+ *  idle threads may begin to join in at this split point   *
+ *  to help.  Since this thread may finish before any or    *
+ *  all of the other parallel threads, this thread is sent  *
+ *  to ThreadWait() which will immediately send it to       *
+ *  SearchMoveList() like the other threads; however, going *
+ *  to ThreadWait() allows this thread to join others if it *
+ *  runs out of work to do.  We do pass ThreadWait() the    *
+ *  address of the parent split block, so that if this      *
+ *  thread becomes idle, and this thread block shows no     *
+ *  threads are still busy, then this thread can return to  *
+ *  here and then back up into the previous ply as it       *
+ *  should.  Note that no other thread can back up to the   *
+ *  previous ply since their recursive call stacks are not  *
+ *  set for that, while this call stack will bring us back  *
+ *  to this point where we return to the normal search,     *
+ *  which we just completed.                                *
  *                                                          *
+ ************************************************************
+ */
+  ThreadWait(tree->thread_id, tree);
+  if (!tree->joined)
+    parallel_splits_wasted++;
+  return 1;
+}
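Rule 3 in the header comment above requires the single global lock_smp for any operation that spans multiple split blocks, precisely so that two threads can never acquire per-block locks in opposite orders and deadlock. A minimal pthread sketch of that discipline, with hypothetical miniature types rather than Crafty's real code:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical miniature of rule 3: each block has its own lock for its
   own data, but any operation that touches a *chain* of blocks takes the
   single global lock instead of the per-block locks, so lock-order
   inversion between two chain walkers is impossible. */
typedef struct Block {
  pthread_mutex_t lock;   /* protects this block's own fields only */
  struct Block *parent;
  int stop;               /* "stop searching" flag */
} Block;

static pthread_mutex_t lock_smp = PTHREAD_MUTEX_INITIALIZER;

/* Walk from a leaf block up to the root, setting the stop flag at every
   level.  Because this spans multiple blocks, it runs under the global
   lock only -- it never touches the per-block locks. */
static void stop_chain(Block *b) {
  pthread_mutex_lock(&lock_smp);
  for (; b != NULL; b = b->parent)
    b->stop = 1;
  pthread_mutex_unlock(&lock_smp);
}
```

The trade-off noted in the comment holds here too: cross-block operations serialize on one lock, but they are infrequent, so the contention cost is small compared to the deadlock risk it removes.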
-
 
244
 
-
 
245
/* modified 12/16/15 */
-
 
246
/*
-
 
247
 *******************************************************************************
-
 
248
 *                                                                             *
-
 
249
 *   Join() is called just when we enter the usual spin-loop waiting for work. *
132
 *  CopyToChild() allocates a split block optimally (see    *
250
 *   We take a quick look at all active split blocks to see if any look        *
-
 
251
 *   "joinable".  If so, we compute an "interest" value, which will be defined *
-
 
252
 *   below.  We then join the most interesting split point directly. This      *
-
 
253
 *   split point might have been created specifically for this thread to join, *
-
 
254
 *   or it might be one that was already active when this thread became idle,  *
133
 *  that code for an explanation) and then copies anything  *
255
 *   which allows us to join that existing split point and not request a new   *
134
 *  necessary for the parallel search.                      *
256
 *   split operation, saving time.                                             *
-
 
257
 *                                                                             *
-
 
258
 *******************************************************************************
-
 
259
 */
-
 
260
int Join(int64_t tid) {
-
 
261
  TREE *tree, *join_block, *child;
-
 
262
  int interest, best_interest, current, pass = 0;
-
 
263
 
-
 
264
/*
-
 
265
 ************************************************************
135
 *                                                          *
266
 *                                                          *
136
 *  Special case 1:  In the loop to allocate a split block  *
-
 
137
 *  and copy current data to the new child thread, we skip  *
-
 
138
 *  over the current splitting thread's data.  After the    *
-
 
139
 *  loop we then explicitly allocate a block for this       *
267
 *  First we pass over ALL split blocks, looking for those  *
140
- *  thread so that it will be included in the thread group  *
- *  (remember that this thread is the one that has to back  *
- *  up through the split block, so it will always call      *
- *  ThreadWait() telling it to do so.  There was a peculiar *
- *  condition that caused an inefficiency.  If we are at    *
- *  a depth where we honor the smp_split_group limit, it is *
- *  possible that on occasion, there could be so many idle  *
- *  processors that we would not include ourself.  For      *
- *  example, suppose we are thread 6, and when we enter     *
- *  Thread() to do the split, threads 0-5 are idle, and     *
- *  smp_split_group = 6.  We would pick up the first 6 idle *
- *  threads and sic them on this split point, but we would  *
- *  fail to include ourself.  We would hit ThreadWait()     *
- *  with no work to do, and then split somewhere else.      *
- *  Unfortunately, THIS thread is the one that needs to     *
- *  wait for all the other threads to complete so that it   *
- *  can back up to the previous ply.  But if it is busy on  *
- *  another split point, it won't be able to return to this *
- *  point until it is idle.  Which means progress at this   *
- *  split point is held up.  We now forcefully include the  *
- *  current thread, and only include smp_split_group - 1    *
- *  other threads to work here with this thread.            *
- *                                                          *
+ *  flagged as "joinable" (which means they are actually    *
+ *  active split points and that no processor at that split *
+ *  point has run out of work (there is no point in joining *
+ *  a split point with no remaining work) and no fail high  *
+ *  has been found which would raise the "stop" flag.) This *
+ *  is "racy" because we do not acquire any locks, which    *
+ *  means that the busy threads continue working, and there *
+ *  is a small probability that the split point will        *
+ *  disappear while we are in this loop.  To resolve the    *
+ *  potential race, after we find the most attractive split *
+ *  point, we acquire the lock for that split block and     *
+ *  test again, but this time if the block is joinable, we  *
+ *  can safely join under control of the lock, which is not *
+ *  held for very long at all.  If the block is not         *
+ *  joinable once we acquire the lock, we abort joining     *
+ *  since it is futile.  Note that if this happens, we will *
+ *  try to find an existing split point we can join three   *
+ *  times before we exit, setting split to 1 to ask other   *
+ *  threads to produce more candidate split points.         *
+ *                                                          *
+ ************************************************************
+ */
+  for (pass = 0; pass < 3; pass++) {
+    best_interest = -999999;
+    join_block = 0;
+    for (current = 0; current <= (int) smp_max_threads * 64; current++) { // Pierre-Marie Baty -- added type cast
+      tree = block[current];
+      if (tree->joinable && (tree->ply <= tree->depth / 2 ||
+              tree->nprocs < (int) smp_split_group)) { // Pierre-Marie Baty -- added type cast
+        interest = tree->depth * 2 - tree->searched[0];
+        if (interest > best_interest) {
+          best_interest = interest;
+          join_block = tree;
+        }
+      }
+    }
-/*
- ************************************************************
- *                                                          *
- *  Special case 2:  it is possible that there are no split *
- *  blocks available.  In that case, CopyToChild() will     *
- *  terminate Crafty with a log-file error message telling  *
- *  the user to re-compile with a larger number of split    *
- *  blocks (the probability of this actually happening is   *
- *  incredibly small, but it is non-zero.  We use a VERY    *
- *  liberal estimate for the number of split blocks to      *
- *  allocate to avoid this.)                                *
- *                                                          *
- ************************************************************
- */
+/*
+ ************************************************************
+ *                                                          *
+ *  Now we acquire the lock for this split block, and then  *
+ *  check to see if the block is still flagged as joinable. *
+ *  If so, we set things up, and then we get pretty tricky  *
+ *  as we then release the lock, and then copy the data     *
+ *  from the parent to our split block.  There is a chance  *
+ *  that while we are copying this data, the split point    *
+ *  gets completed by other threads.  Which would leave an  *
+ *  apparent race condition exposed where we start copying  *
+ *  data here, the split point is completed, the parent     *
+ *  block is released and then reacquired and we continue   *
+ *  if nothing has happened here, getting data copied from  *
+ *  two different positions.                                *
+ *                                                          *
+ *  Fortunately, we linked this new split block to the old  *
+ *  (original parent).  If that split block is released, we *
+ *  will discover this because that operation will also set *
+ *  our "stop" flag which will prevent us from using this   *
+ *  data and breaking things.  We allow threads to copy     *
+ *  this data without any lock protection to eliminate a    *
+ *  serialization (each node would copy the data serially,  *
+ *  rather than all at once) with the only consequence to   *
+ *  this being the overhead of copying and then throwing    *
+ *  the data away, which can happen on occasion even if we  *
+ *  used a lock for protection, since once we release the   *
+ *  lock it still takes time to get into the search and we  *
+ *  could STILL find that this split block has already been *
+ *  completed, once again.  Less contention and serial      *
+ *  computing improves performance.                         *
+ *                                                          *
+ ************************************************************
+ */
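The Rev 108 "most attractive split point" scan above can be sketched stand-alone.  This is a minimal illustration of the same scoring rule (`depth * 2` minus moves already searched, best score wins); `SplitCandidate` and `PickSplit` are hypothetical names for this sketch, not Crafty's actual types or API.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative mirror of the Rev 108 selection loop: each joinable
   candidate is scored as depth * 2 - moves_already_searched, and the
   highest score wins.  Names here are assumptions, not Crafty's. */
typedef struct {
    int joinable;  /* nonzero if this split point accepts helpers */
    int depth;     /* remaining depth at the split point          */
    int searched;  /* moves already searched at that ply          */
} SplitCandidate;

int PickSplit(const SplitCandidate *c, int n) {
    int best = -1, best_interest = -999999, i, interest;

    for (i = 0; i < n; i++) {
        if (!c[i].joinable)
            continue;                        /* skip dead split points */
        interest = c[i].depth * 2 - c[i].searched;
        if (interest > best_interest) {
            best_interest = interest;
            best = i;                        /* most attractive so far */
        }
    }
    return best;                             /* -1 if nothing joinable */
}
```

Deeper split points with little work already done score highest, which matches the comment's goal of joining where the most useful work remains.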
-  thread[tree->thread_id].tree = 0;
-  tree->nprocs = 0;
-  for (proc = 0; proc < smp_max_threads; proc++) {
-    tree->siblings[proc] = 0;
-    if (thread[proc].idle) {
-      CopyToChild(tree, proc);
-      nblocks++;
-      if (nblocks >= smp_split_group && tree->ply > tree->depth / 2)
-        break;
-    }
-  }
-  CopyToChild(tree, tree->thread_id);
-  nblocks++;
-/*
- ************************************************************
- *                                                          *
- *  Everything is set.  Now we can stuff the address of the *
- *  thread blocks into thread[i].tree so that those idle    *
- *  threads can begin the parallel search.  We also flag    *
- *  them as "not idle" so that a split immediately after we *
- *  exit won't try to pick them up again, breaking things.  *
- *                                                          *
- ************************************************************
- */
-  parallel_splits++;
+    if (join_block) {
+      Lock(join_block->lock);
+      if (join_block->joinable) {
+        child = GetBlock(join_block, (int) tid); // Pierre-Marie Baty -- added type cast
+        Unlock(join_block->lock);
+        if (child) {
+          CopyFromParent(child);
+          thread[tid].tree = child;
+          parallel_joins++;
+          return 1;
+        }
+      } else {
+        Unlock(join_block->lock);
+        break;
+      }
+    }
+  }
+/*
+ ************************************************************
+ *                                                          *
+ *  We did not acquire a split point to join, so we set     *
+ *  smp_split to 1 to ask busy threads to create joinable   *
+ *  split points.                                           *
+ *                                                          *
+ ************************************************************
+ */
+  smp_split = 1;
+  return 0;
+}
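The lock-confirm step in the Rev 108 join path above (racy scan first, then re-test `joinable` under the block's lock before committing) can be sketched with POSIX threads.  `SplitBlock` and `TryJoin` are illustrative stand-ins for Crafty's split block and its `Lock()`/`Unlock()` macros, not the real code.

```c
#include <assert.h>
#include <pthread.h>

/* Sketch of the lock-confirm step: the lock-free scan found a joinable
   block, so we take that block's lock and test joinable again before
   committing.  Names are assumptions, not Crafty's. */
typedef struct {
    pthread_mutex_t lock;
    int joinable;   /* owner clears this when the split point is done */
    int nprocs;     /* helpers currently attached                     */
} SplitBlock;

int TryJoin(SplitBlock *b) {
    int joined = 0;

    pthread_mutex_lock(&b->lock);
    if (b->joinable) {   /* re-test: the racy scan may be stale */
        b->nprocs++;
        joined = 1;
    }
    pthread_mutex_unlock(&b->lock);
    return joined;       /* 0 means abort the join, it is futile */
}
```

The lock is held only for the test-and-attach, which is why the comment can say it is "not held for very long at all."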
+/* modified 11/04/15 */
+/*
+ *******************************************************************************
+ *                                                                             *
+ *   ThreadInit() is called after a process is created.  Its main task is to   *
+ *   initialize the process local memory so that it will fault in and be       *
+ *   allocated on the local node rather than the node where the original       *
+ *   (first) process was running.  All threads will hang here via a custom     *
+ *   WaitForAllThreadsInitialized() procedure so that all the local thread     *
+ *   blocks are usable before the search actually begins.                      *
+ *                                                                             *
+ *******************************************************************************
+ */
+void *STDCALL ThreadInit(void *t) {
+  int tid = (int64_t) t;
+
+  ThreadAffinity(tid);
+#  if !defined(UNIX)
+  ThreadMalloc((uint64_t) tid);
+#  endif
-  for (proc = 0; proc < smp_max_threads; proc++)
+  thread[tid].blocks = 0xffffffffffffffffull;
+  Lock(lock_smp);
+  initialized_threads++;
+  Unlock(lock_smp);
+  WaitForAllThreadsInitialized();
+  ThreadWait(tid, (TREE *) 0);
+  Lock(lock_smp);
+  smp_threads--;
+  Unlock(lock_smp);
+  return 0;
+}
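The startup handshake in ThreadInit() above — bump a counter under a lock, then park until every thread has checked in — can be sketched as a simple spin barrier.  This is a minimal illustration under POSIX threads; the names and the spin-style barrier are assumptions, not Crafty's actual WaitForAllThreadsInitialized() implementation.

```c
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4

/* Illustrative startup barrier: each thread increments a shared counter
   under a lock, then spins until all threads have arrived, so no work
   starts before every thread's local data exists. */
static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile int initialized_threads = 0;

static void *InitOneThread(void *arg) {
    (void) arg;
    pthread_mutex_lock(&init_lock);
    initialized_threads++;             /* this thread's data is ready */
    pthread_mutex_unlock(&init_lock);
    while (initialized_threads < NTHREADS)
        ;                              /* park until everyone arrives */
    return NULL;
}

int RunInitBarrier(void) {
    pthread_t tid[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, InitOneThread, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return initialized_threads;        /* NTHREADS after the barrier */
}
```

The first-touch detail the comment describes (memory faulting into the local NUMA node) comes for free once each thread initializes its own blocks after this barrier.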
-    if (tree->siblings[proc]) {
-      thread[proc].tree = tree->siblings[proc];
-      thread[proc].idle = 0;
-    }
+/* modified 01/08/15 */
+/*
+ *******************************************************************************
+ *                                                                             *
+ *   ThreadAffinity() is called to "pin" a thread to a specific processor.  It *
+ *   is a "noop" (no-operation) if Crafty was not compiled with -DAFFINITY, or *
+ *   if smp_affinity is negative (smpaffinity=-1 disables affinity).  It       *
+ *   simply sets the affinity for the current thread to the requested CPU and  *
+ *   returns.  NOTE:  If hyperthreading is enabled, there is no guarantee that *
+ *   this will work as expected and pin one thread per physical core.  It      *
+ *   depends on how the O/S numbers the SMT cores.                             *
+ *                                                                             *
+ *******************************************************************************
+ */
+void ThreadAffinity(int cpu) {
+#  if defined(AFFINITY)
+  cpu_set_t cpuset;
+  pthread_t current_thread = pthread_self();
+
+  if (smp_affinity >= 0) {
+    CPU_ZERO(&cpuset);
+    CPU_SET(cpu + smp_affinity, &cpuset);
+    pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
+  }
+#  endif
+}
+/* modified 11/04/15 */
+/*
+ *******************************************************************************
+ *                                                                             *
+ *   ThreadSplit() is used to determine if we should split at the current ply. *
+ *   There are some basic constraints on when splits can be done, such as the  *
+ *   depth remaining in the search (don't split too near the tips), and have   *
+ *   we searched at least one move to get a score or bound (YBW condition).    *
+ *                                                                             *
+ *   If those conditions are satisfied, AND either a thread has requested a    *
+ *   split OR we are far enough away from the tips of the tree to justify a    *
+ *   "gratuitous split" then we return "success."  A "gratuitous split" is a   *
+ *   split done without any idle threads.  Since splits are not free, we only  *
+ *   do this well away from tips to limit overhead.  We do this so that when a *
+ *   thread becomes idle, it will find these split points immediately and not  *
+ *   have to wait for a split after the fact.                                  *
+ *                                                                             *
+ *******************************************************************************
+ */
+int ThreadSplit(TREE *RESTRICT tree, int ply, int depth, int alpha, int o_alpha, // Pierre-Marie Baty -- missing keyword
+    int done) {
+  TREE *used;
+  int64_t tblocks;
+  int temp, unused = 0;
+
-/*
- ************************************************************
- *                                                          *
- *  One final test before leaving.  We test all threads to  *
- *  determine if any are still idle (most likely due to the *
- *  smp_split_group limit, but a thread could go idle after *
- *  our loop to pick up idle threads).  If so we set split  *
- *  back to one (actually a temporary split variable since  *
- *  we don't want to remove the split=-1 state until just   *
- *  before we exit to avoid any races.)                     *
- *                                                          *
- ************************************************************
- */
-  for (proc = 0; proc < smp_max_threads; proc++)
-    if (thread[proc].idle)
-      tsplit = 1;
-/*
- ************************************************************
- *                                                          *
- *  After the threads are kicked off to start the parallel  *
- *  search using the idle threads, this thread has to be    *
- *  inserted as well.  However, since it is possible that   *
- *  this thread may finish before any or all of the other   *
- *  parallel threads, this thread is sent to ThreadWait()   *
- *  which will immediately send it to SearchParallel() like *
- *  the other threads.  Going to ThreadWait() allows this   *
- *  thread to join others if it runs out of work to do.  We *
- *  do pass ThreadWait() the address of the parent thread   *
- *  block, so that if this thread becomes idle, and this    *
- *  thread block shows no threads are still busy, then this *
- *  thread can return to here and then back up into the     *
- *  previous ply as it should.  Note that no other thread   *
- *  can back up to the previous ply since their recursive   *
- *  call stacks are not set for that.                       *
- *                                                          *
- ************************************************************
- */
-  smp_split = tsplit;
-  Unlock(lock_smp);
-  ThreadWait(tree->thread_id, tree);
+/*
+ ************************************************************
+ *                                                          *
+ *  First, we see if we meet the basic criteria to create a *
+ *  split point, that being that we must not be too far     *
+ *  from the root (smp_min_split_depth).                    *
+ *                                                          *
+ ************************************************************
+ */
+  if (depth < (int) smp_min_split_depth) // Pierre-Marie Baty -- added type cast
+    return 0;
+/*
+ ************************************************************
+ *                                                          *
+ *  If smp_split is NOT set, we are checking to see if it   *
+ *  is acceptable to do a gratuitous split here.            *
+ *                                                          *
+ *  (1) if we are too far from the root we do not do        *
+ *      gratuitous splits to avoid the overhead.            *
+ *                                                          *
+ *  (2) if we have searched more than one move at this ply, *
+ *      we don't do any further tests to see if a           *
+ *      gratuitous split is acceptable, since we have       *
+ *      previously done this test at this ply and decided   *
+ *      one should not be done.  That condition has likely  *
+ *      not changed.                                        *
+ *                                                          *
+ *  (3) if we have pre-existing gratuitous split points for *
+ *      this thread, we make certain we don't create more   *
+ *      than the gratuitous split limit as excessive splits *
+ *      just add to the overhead with no benefit.           *
+ *                                                          *
+ ************************************************************
+ */
+  if (!smp_split) {
+    if (depth < (int) smp_gratuitous_depth || done > 1) // Pierre-Marie Baty -- added type cast
+      return 0;
+    tblocks = ~thread[tree->thread_id].blocks;
+    while (tblocks) {
+      temp = LSB(tblocks);
+      used = block[temp + tree->thread_id * 64 + 1];
+      if (used->joinable && !used->joined)
+        unused++;
+      Clear(temp, tblocks);
+    }
+    if (unused > (int) smp_gratuitous_limit) // Pierre-Marie Baty -- added type cast
+      return 0;
+  }
+/*
+ ************************************************************
+ *                                                          *
+ *  If smp_split IS set, we are checking to see if it is    *
+ *  acceptable to do a split because there are idle threads *
+ *  that need work to do.                                   *
+ *                                                          *
+ *  The only reason this would be false is if we have a     *
+ *  pre-existing split point that is joinable but has not   *
+ *  been joined. If one exists, there is no need to split   *
+ *  again as there is already an accessible split point.    *
+ *  Otherwise, if we are at the root and we are either not  *
+ *  allowed to split at the root, or we have additional     *
+ *  root moves that have to be searched one at a time using *
+ *  all available threads we also can not split here.       *
+ *                                                          *
+ ************************************************************
+ */
+  else {
+    if (ply == 1 && (!smp_split_at_root || !NextRootMoveParallel() ||
+            alpha == o_alpha))
+      return 0;
+    tblocks = ~thread[tree->thread_id].blocks;
+    while (tblocks) {
+      temp = LSB(tblocks);
+      used = block[temp + tree->thread_id * 64 + 1];
+      if (used->joinable && !used->joined)
+        unused++;
+      Clear(temp, tblocks);
+    }
+    if (unused > (int) smp_gratuitous_limit) // Pierre-Marie Baty -- added type cast
+      return 0;
+  }
   return 1;
 }
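The bookkeeping ThreadSplit() does above — walking a 64-bit mask with `LSB()`/`Clear()` to count blocks that are joinable but never joined — can be sketched with a portable lowest-set-bit loop.  `CountUnjoined` and the arrays are illustrative assumptions, not Crafty's data structures.

```c
#include <assert.h>
#include <stdint.h>

/* Portable find-lowest-set-bit; caller guarantees x != 0.  Crafty's LSB()
   macro typically maps to a hardware bit-scan instead. */
static int LowestSetBit(uint64_t x) {
    int i = 0;

    while (!(x & 1)) {
        x >>= 1;
        i++;
    }
    return i;
}

/* Count blocks that are joinable but not yet joined, walking one bit per
   in-use block the way the LSB()/Clear() loop does. */
int CountUnjoined(uint64_t in_use, const int *joinable, const int *joined) {
    int count = 0;

    while (in_use) {
        int bit = LowestSetBit(in_use);

        if (joinable[bit] && !joined[bit])
            count++;              /* an accessible split point exists */
        in_use &= in_use - 1;     /* clear the lowest set bit */
    }
    return count;
}
```

If this count already exceeds the gratuitous limit, creating yet another split point would only add overhead, which is exactly the early-return in both branches above.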
-/* modified 04/30/14 */
-/*
- *******************************************************************************
- *                                                                             *
- *   CopyToChild() is used to copy data from a parent thread to a particular   *
- *   child thread.  This only copies the appropriate parts of the TREE         *
- *   structure to avoid burning memory bandwidth by copying everything.        *
- *                                                                             *
- *******************************************************************************
- */
+/* modified 11/04/15 */
+/*
+ *******************************************************************************
+ *                                                                             *
+ *   ThreadStop() is called from SearchMoveList() when it detects a beta       *
+ *   cutoff (fail high) at a node that is being searched in parallel.  We need *
+ *   to stop all threads here, and since this threading algorithm is recursive *
+ *   it may be necessary to stop other threads that are helping search this    *
+ *   branch further down into the tree.  This function simply sets appropriate *
+ *   tree->stop variables to 1, which will stop those particular threads       *
+ *   instantly and return them to the idle loop in ThreadWait().               *
+ *                                                                             *
+ *******************************************************************************
+ */
+void ThreadStop(TREE * RESTRICT tree) {
+  int proc;
+
+  Lock(tree->lock);
+  tree->stop = 1;
+  tree->joinable = 0;
+  for (proc = 0; proc < (int) smp_max_threads; proc++) // Pierre-Marie Baty -- added type cast
+    if (tree->siblings[proc])
+      ThreadStop(tree->siblings[proc]);
+  Unlock(tree->lock);
+}
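The recursive fan-out ThreadStop() performs can be sketched without the rest of the engine: raising `stop` on one split block must also stop every sibling block helping below it.  `StopBlock`/`StopTree` are illustrative names, and the per-block locking is omitted here for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HELPERS 4

/* Illustrative split-block tree: each block records helpers working below
   it, mirroring the siblings[] array walked by ThreadStop(). */
typedef struct StopBlock {
    int stop;
    struct StopBlock *siblings[MAX_HELPERS];
} StopBlock;

void StopTree(StopBlock *b) {
    int i;

    b->stop = 1;                        /* halt the search at this block */
    for (i = 0; i < MAX_HELPERS; i++)
        if (b->siblings[i])
            StopTree(b->siblings[i]);   /* and everything helping below */
}
```

Because the split tree is recursive, a fail high near the root cascades down and returns every helper to the idle loop.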
+/* modified 11/04/15 */
+/*
+ *******************************************************************************
+ *                                                                             *
+ *   ThreadTrace() is a debugging tool that simply walks the split block tree  *
+ *   and displays interesting data to help debug the parallel search whenever  *
+ *   changes break things.                                                     *
+ *                                                                             *
+ *******************************************************************************
+ */
-void CopyToChild(TREE * RESTRICT parent, int thread) {
-  int i, j, /*max, */ply = parent->ply;
-  unsigned int max_; // Pierre-Marie Baty -- should be unsigned
-  TREE *child;
-  static int warnings = 0;
-  int first = thread * MAX_BLOCKS_PER_CPU + 1;
-  int last = first + MAX_BLOCKS_PER_CPU;
+void ThreadTrace(TREE * RESTRICT tree, int depth, int brief) {
+  int proc, i;
+
+  Lock(tree->lock);
+  Lock(lock_io);
+  if (!brief) {
+    for (i = 0; i < 4 * depth; i++)
+      Print(4095, " ");
+    depth++;
+    Print(4095, "block[%d]  thread=%d  ply=%d  nprocs=%d  ",
+        FindBlockID(tree), tree->thread_id, tree->ply, tree->nprocs);
+    Print(4095, "joined=%d  joinable=%d  stop=%d  nodes=%d", tree->joined,
+        tree->joinable, tree->stop, tree->nodes_searched);
+    Print(4095, "  parent=%d\n", FindBlockID(tree->parent));
+  } else {
+    if (tree->nprocs > 1) {
+      for (i = 0; i < 4 * depth; i++)
+        Print(4095, " ");
+      depth++;
+      Print(4095, "(ply %d)", tree->ply);
+    }
+  }
+  if (tree->nprocs) {
+    if (!brief) {
+      for (i = 0; i < 4 * depth; i++)
+        Print(4095, " ");
+      Print(4095, "          parent=%d  sibling threads=",
+          FindBlockID(tree->parent));
+      for (proc = 0; proc < (int) smp_max_threads; proc++) // Pierre-Marie Baty -- added type cast
+        if (tree->siblings[proc])
+          Print(4095, " %d(%d)", proc, FindBlockID(tree->siblings[proc]));
+      Print(4095, "\n");
+    } else {
+      if (tree->nprocs > 1) {
+        Print(4095, " helping= ");
+        for (proc = 0; proc < (int) smp_max_threads; proc++) // Pierre-Marie Baty -- added type cast
+          if (tree->siblings[proc]) {
+            if (proc == tree->thread_id)
+              Print(4095, "[");
+            Print(4095, "%d", proc);
+            if (proc == tree->thread_id)
+              Print(4095, "]");
+            Print(4095, " ");
+          }
+        Print(4095, "\n");
+      }
+    }
+  }
+  Unlock(lock_io);
+  for (proc = 0; proc < (int) smp_max_threads; proc++) // Pierre-Marie Baty -- added type cast
+    if (tree->siblings[proc])
+      ThreadTrace(tree->siblings[proc], depth, brief);
+  Unlock(tree->lock);
+}
+/* modified 11/04/15 */
+/*
+ *******************************************************************************
+ *                                                                             *
+ *   ThreadWait() is the idle loop for the N threads that are created at the   *
+ *   beginning when Crafty searches.  Threads are "parked" here waiting on a   *
+ *   pointer to something they should search (a parameter block built in the   *
+ *   function Split() in this case.  When this pointer becomes non-zero, each  *
+ *   thread "parked" here will immediately call SearchMoveList() and begin the *
+ *   parallel search as directed.                                              *
+ *                                                                             *
+ *   Upon entry, all threads except for the "master" will arrive here with a   *
+ *   value of zero (0) in the waiting parameter below.  This indicates that    *
+ *   they will search and then be done.  The "master" will arrive here with a  *
+ *   pointer to the parent split block in "waiting" which says I will sit here *
+ *   waiting for work OR when the waiting split block has no threads working   *
+ *   on it, at which point I should return which will let me "unsplit" here    *
+ *   and clean things up.  The call to here in Split() passes this block       *
+ *   address while threads that are helping get here with a zero.              *
+ *                                                                             *
+ *******************************************************************************
+ */
-  int maxb = smp_max_threads * MAX_BLOCKS_PER_CPU + 1;
-/*
- ************************************************************
- *                                                          *
- *  One NUMA-related trick is that we first try to allocate *
- *  a split block in the thread's local memory.  Each       *
- *  thread has a group of split blocks that were first      *
- *  touched by the correct CPU so that the split blocks     *
- *  page faulted into local memory for that specific        *
- *  processor.  If we can't find an optimal-placed block,   *
- *  the second pass will find the first available block.    *
- *  If none can be found, we return as we can not split     *
- *  until other split blocks are freed up.                  *
- *                                                          *
- ************************************************************
- */
-  for (i = first; i < last && block[i]->used; i++);
-  if (i >= last) {
-    if (++warnings < 6)
-      Print(128,
-          "WARNING.  optimal SMP block cannot be allocated, thread %d\n",
-          thread);
-    for (i = 1; i < maxb && block[i]->used; i++);
-    if (i >= maxb) {
-      Print(128, "ERROR.  no SMP block can be allocated\n");
-      exit(1);
-    }
-  }
+int ThreadWait(int tid, TREE * RESTRICT waiting) {
+  int value, tstart, tend;
+
+/*
+ ************************************************************
+ *                                                          *
+ *  When we reach this point, one of three possible         *
+ *  conditions is true (1) we already have work to do, as   *
+ *  we are the "master thread" and we have already split    *
+ *  the tree, we are coming here to join in;  (2) we are    *
+ *  the master, and we are waiting on our split point to    *
+ *  complete, so we come here to join and help currently    *
+ *  active threads;  (3) we have no work to do, so we will  *
+ *  spin until Join() locates a split point we can join to  *
+ *  help out.                                               *
+ *                                                          *
+ *  Note that when we get here, the parent already has a    *
+ *  split block and does not need to call Join(), it simply *
+ *  falls through the while spin loop below because its     *
+ *  "tree" pointer is already non-zero.                     *
+ *                                                          *
+ ************************************************************
+ */
+  while (1) {
+    tstart = ReadClock();
+    while (!thread[tid].tree && (!waiting || waiting->nprocs) && !Join(tid) &&
+        !thread[tid].terminate)
+      Pause();
+    tend = ReadClock();
+    if (!thread[tid].tree)
+      thread[tid].tree = waiting;
+    thread[tid].idle += tend - tstart;
+    if (thread[tid].tree == waiting || thread[tid].terminate)
+      return 0;
+/*
+ ************************************************************
+ *                                                          *
+ *  Once we get here, we have a good split point, so we are *
+ *  ready to participate in a parallel search.  Once we     *
+ *  return from SearchMoveList() we copy our results back   *
+ *  to the parent via CopyToParent() before we look for a   *
+ *  new split point.  If we are a parent, we will slip out  *
+ *  of the spin loop at the top and return to the normal    *
+ *  serial search to finish up here.                        *
+ *                                                          *
+ *  When we return from SearchMoveList(), we need to        *
+ *  decrement the "nprocs" value since there is now one     *
+ *  less thread working at this split point.                *
+ *                                                          *
+ *  Note:  CopyToParent() marks the current split block as  *
+ *  unused once the copy is completed, so we don't have to  *
+ *  do anything about that here.                            *
+ *                                                          *
+ ************************************************************
+ */
+    value =
+        SearchMoveList(thread[tid].tree, thread[tid].tree->ply,
+        thread[tid].tree->depth, thread[tid].tree->wtm,
+        thread[tid].tree->alpha, thread[tid].tree->beta,
+        thread[tid].tree->searched, thread[tid].tree->in_check, 0, parallel);
+    tstart = ReadClock();
+    Lock(thread[tid].tree->parent->lock);
+    thread[tid].tree->parent->joinable = 0;
+    CopyToParent((TREE *) thread[tid].tree->parent, thread[tid].tree, value);
+    thread[tid].tree->parent->nprocs--;
+    thread[tid].tree->parent->siblings[tid] = 0;
+    Unlock(thread[tid].tree->parent->lock);
+    thread[tid].tree = 0;
+    tend = ReadClock();
+    thread[tid].idle += tend - tstart;
+  }
+}
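The parking idea in ThreadWait() above — spin on a pointer until the master publishes work, then process it — can be sketched with POSIX threads.  The `volatile`-qualified pointer keeps the spin loop honest; in real Crafty the helper calls SearchMoveList() here.  `Helper`, `HelperLoop`, and `RunHelper` are illustrative names, not Crafty's.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Illustrative work handoff: a helper parks until its work pointer becomes
   non-NULL, then performs the work and stores the result. */
typedef struct {
    int *volatile work;   /* NULL while the helper is parked */
    int result;
} Helper;

static void *HelperLoop(void *arg) {
    Helper *h = (Helper *) arg;

    while (!h->work)
        ;                          /* park: stand-in for the Pause() spin */
    h->result = *h->work * 2;      /* stand-in for the parallel search */
    return NULL;
}

int RunHelper(void) {
    static int work_item = 21;
    Helper h = { NULL, 0 };
    pthread_t tid;

    pthread_create(&tid, NULL, HelperLoop, &h);
    h.work = &work_item;           /* publish work; the helper wakes up */
    pthread_join(tid, NULL);
    return h.result;
}
```

Crafty additionally accumulates the time spent in the spin into a per-thread idle counter, which the code above measures with ReadClock() around the inner while loop.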

/* modified 11/04/15 */
/*
 *******************************************************************************
 *                                                                             *
 *   CopyFromParent() is used to copy data from a parent thread to a child     *
 *   thread.  This only copies the appropriate parts of the TREE structure to  *
 *   avoid burning memory bandwidth by copying everything.                     *
 *                                                                             *
 *******************************************************************************
 */
void CopyFromParent(TREE * RESTRICT child) {
  TREE *parent = child->parent;
  int i, ply;

/*
 ************************************************************
 *                                                          *
 *  We have allocated a split block.  Now we copy the tree  *
 *  search state from the parent block to the child in      *
 *  preparation for starting the parallel search.           *
 *                                                          *
 ************************************************************
 */
  ply = parent->ply;
  child->ply = ply;
  child->position = parent->position;
  for (i = 0; i <= rep_index + parent->ply; i++)
    child->rep_list[i] = parent->rep_list[i];
  for (i = ply - 1; i < MAXPLY; i++)
    child->killers[i] = parent->killers[i];
  for (i = ply - 1; i <= ply; i++) {
    child->curmv[i] = parent->curmv[i];
    child->pv[i] = parent->pv[i];
  }
  child->in_check = parent->in_check;
  child->last[ply] = child->move_list;
  child->status[ply] = parent->status[ply];
  child->status[1] = parent->status[1];
  child->save_hash_key[ply] = parent->save_hash_key[ply];
  child->save_pawn_hash_key[ply] = parent->save_pawn_hash_key[ply];
  child->nodes_searched = 0;
  child->fail_highs = 0;
  child->fail_high_first_move = 0;
  child->evaluations = 0;
  child->egtb_probes = 0;
  child->egtb_hits = 0;
  child->extensions_done = 0;
  child->qchecks_done = 0;
  child->moves_fpruned = 0;
  child->moves_mpruned = 0;
  for (i = 0; i < 16; i++) {
    child->LMR_done[i] = 0;
    child->null_done[i] = 0;
  }
  child->alpha = parent->alpha;
  child->beta = parent->beta;
  child->value = parent->value;
  child->wtm = parent->wtm;
  child->depth = parent->depth;
  child->searched = parent->searched;
  strcpy(child->root_move_text, parent->root_move_text);
  strcpy(child->remaining_moves_text, parent->remaining_moves_text);
}
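CopyFromParent() deliberately copies only the slices of the parent's state that the child can reach (ply - 1 and deeper) and zeroes the statistics counters instead of copying them. A minimal sketch of the same idea, using a hypothetical MiniTree type rather than Crafty's real TREE:

```c
#include <assert.h>

#define MAXPLY 64

/* Hypothetical, much-simplified stand-in for Crafty's TREE block. */
typedef struct MiniTree {
    int ply;
    int killers[MAXPLY];
    int pv[MAXPLY];
    long nodes_searched;
} MiniTree;

/* Copy only what the child can use: the killer/pv slices from
 * ply - 1 onward plus a few scalars.  Statistics counters are
 * zeroed rather than copied, as CopyFromParent() does, so the
 * child accumulates its own totals for a later merge. */
void mini_copy_from_parent(MiniTree *child, const MiniTree *parent) {
    int i, ply = parent->ply;

    child->ply = ply;
    for (i = ply - 1; i < MAXPLY; i++)
        child->killers[i] = parent->killers[i];
    for (i = ply - 1; i <= ply; i++)
        child->pv[i] = parent->pv[i];
    child->nodes_searched = 0;
}
```

Copying whole arrays from index 0 would be correct too; restricting the copy to the reachable slice is purely a memory-bandwidth optimization.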

/* modified 11/04/15 */
/*
 *******************************************************************************
 *                                                                             *
 *   CopyToParent() is used to copy data from a child thread to a parent       *
 *   thread.  This only copies the appropriate parts of the TREE structure to  *
 *   avoid burning memory bandwidth by copying everything.                     *
 *                                                                             *
 *******************************************************************************
 */
void CopyToParent(TREE * RESTRICT parent, TREE * RESTRICT child, int value) {
  int i, ply = parent->ply, which;

/*
 ************************************************************
 *                                                          *
 *  The only concern here is to make sure that the info is  *
 *  only copied to the parent if our score is > than the    *
 *  parent value, and that we were not stopped for any      *
 *  reason which could produce a partial score that is      *
 *  worthless and dangerous to use.                         *
 *                                                          *
 *  One important special case.  If we get here with the    *
 *  thread->stop flag set, and ply is 1, then we need to    *
 *  clear the "this move has been searched" flag in the ply *
 *  1 move list since we did not complete the search.  If   *
 *  we fail to do this, then a move being searched in       *
 *  parallel at the root will be "lost" for this iteration  *
 *  and won't be searched again until the next iteration.   *
 *                                                          *
 *  In any case, we add our statistical counters to the     *
 *  parent's totals no matter whether we finished or not    *
 *  since the total nodes searched and such should consider *
 *  everything searched, not just the "useful stuff."       *
 *                                                          *
 *  After we finish copying everything, we mark this split  *
 *  block as free in the split block bitmap.                *
 *                                                          *
 ************************************************************
 */
  if (child->nodes_searched && !child->stop && value > parent->value &&
      !abort_search) {
    parent->pv[ply] = child->pv[ply];
    parent->value = value;
    parent->cutmove = child->curmv[ply];
  }
  if (child->stop && ply == 1)
    for (which = 0; which < n_root_moves; which++)
      if (root_moves[which].move == child->curmv[ply]) {
        root_moves[which].status &= 7;
        break;
      }
  parent->nodes_searched += child->nodes_searched;
  parent->fail_highs += child->fail_highs;
  parent->fail_high_first_move += child->fail_high_first_move;
  parent->evaluations += child->evaluations;
  parent->egtb_probes += child->egtb_probes;
  parent->egtb_hits += child->egtb_hits;
  parent->extensions_done += child->extensions_done;
  parent->qchecks_done += child->qchecks_done;
  parent->moves_fpruned += child->moves_fpruned;
  parent->moves_mpruned += child->moves_mpruned;
  for (i = 1; i < 16; i++) {
    parent->LMR_done[i] += child->LMR_done[i];
    parent->null_done[i] += child->null_done[i];
  }
  which = FindBlockID(child) - 64 * child->thread_id - 1;
  Set(which, thread[child->thread_id].blocks);
}
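The `status &= 7` line above strips the higher status bits while preserving the low three, which is how an aborted root move is handed back to the current iteration instead of being "lost". A tiny illustration of that masking, with a hypothetical flag layout (the real meaning of each bit lives in Crafty's root-move code):

```c
#include <assert.h>

/* Hypothetical flag layout for a root move's status word: the low
 * three bits are bookkeeping that must survive, and a higher bit
 * marks "already searched this iteration". */
#define ROOT_KEEP_MASK 7u   /* bits 0-2: preserved by status &= 7 */

/* Hand an aborted root move back to the iteration by stripping
 * everything above the low three bits, as CopyToParent() does. */
unsigned int unsearch(unsigned int status) {
    return status & ROOT_KEEP_MASK;
}
```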

/* modified 11/04/15 */
/*
 *******************************************************************************
 *                                                                             *
 *   GetBlock() is used to allocate a split block and fill in only SMP-        *
 *   critical information.  The child process will copy the rest of the split  *
 *   block information as needed.                                              *
 *                                                                             *
 *   When we arrive here, the parent split block must be locked since we are   *
 *   going to change data in that block as well as copy data from that block   *
 *   to the current split block.  The only exception is when this is the       *
 *   original split operation, since this is done "out of sight" of other      *
 *   threads, which means no locks are needed until after the "joinable" flag  *
 *   is set, which exposes this split point to other threads instantly.        *
 *                                                                             *
 *******************************************************************************
 */
TREE *GetBlock(TREE * /*RESTRICT*/ parent, int tid) { // Pierre-Marie Baty -- keyword not present in declaration
  TREE *child;
  static int warnings = 0;
  int i, unused;

/*
 ************************************************************
 *                                                          *
 *  One NUMA-related trick is that we only allocate a split *
 *  block in the thread's local memory.  Each thread has a  *
 *  group of split blocks that were first touched by the    *
 *  correct CPU so that the split blocks page-faulted into  *
 *  local memory for that specific processor.  If we can't  *
 *  find an optimally-placed block, we return a zero which  *
 *  will prevent this thread from joining the split point.  *
 *  This is highly unlikely, as it would mean the current   *
 *  thread has 64 active split blocks where it is waiting   *
 *  on other threads to complete the last bit of work at    *
 *  each.                                                   *
 *                                                          *
 *  Here we use a simple 64 bit bit-map per thread that     *
 *  indicates which blocks are free (1) and which blocks    *
 *  are used (0).  We simply use LSB() to find the          *
 *  rightmost one-bit that is set, and that is the local    *
 *  block number.  We convert that to a global block number *
 *  with the simple computation:                            *
 *                                                          *
 *     global = local + 64 * tid + 1                        *
 *                                                          *
 *  Each thread has exactly 64 split blocks, and block 0    *
 *  is the "master block" that never gets allocated or      *
 *  freed.  Once we find a free block for the current       *
 *  thread, we zero that bit so that the block won't be     *
 *  used again until it is released.                        *
 *                                                          *
 ************************************************************
 */
  if (thread[tid].blocks) {
    unused = LSB(thread[tid].blocks);
    Clear(unused, thread[tid].blocks);
    Set(unused, thread[tid].max_blocks);
  } else {
    if (++warnings < 6)
      Print(2048,
          "WARNING.  local SMP block cannot be allocated, thread %d\n", tid);
    return 0;
  }
  child = block[unused + tid * 64 + 1];
/*
 ************************************************************
 *                                                          *
 *  Found a split block.  Now we need to fill in only the   *
 *  critical information that can't be delayed due to race  *
 *  conditions.  When we get here, the parent split block   *
 *  must be locked, which lets us safely update the number  *
 *  of processors working here, etc., without any ugly race *
 *  conditions that would corrupt this critical data.       *
 *                                                          *
 ************************************************************
 */
  for (i = 0; i < (int) smp_max_threads; i++) // Pierre-Marie Baty -- added type cast
    child->siblings[i] = 0;
  child->nprocs = 0;
  child->stop = 0;
  child->joinable = 0;
  child->joined = 0;
  child->parent = parent;
  child->thread_id = tid;
  parent->nprocs++;
  parent->siblings[tid] = child;
  parent->joined = 1;
  return child;
}
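GetBlock()'s allocator is just a 64-bit free bitmap per thread: find the rightmost one-bit, clear it, and map the local index to a global block number with global = local + 64 * tid + 1. A self-contained sketch of that scheme, with a portable loop standing in for the LSB() intrinsic:

```c
#include <assert.h>
#include <stdint.h>

/* Portable fallback for LSB(); Crafty uses a hardware bit-scan
 * intrinsic where one is available. */
static int lsb64(uint64_t x) {
    int i;

    for (i = 0; i < 64; i++)
        if (x & ((uint64_t) 1 << i))
            return i;
    return -1;
}

/* Allocate one local split block: bit i set means local block i is
 * free.  Clear the rightmost free bit and convert the local index
 * to a global block number with global = local + 64 * tid + 1.
 * Returns -1 when all 64 local blocks are busy (the case where
 * GetBlock() returns zero and the thread skips the split point). */
int alloc_block(uint64_t *bitmap, int tid) {
    int local;

    if (*bitmap == 0)
        return -1;
    local = lsb64(*bitmap);
    *bitmap &= ~((uint64_t) 1 << local);  /* mark used */
    return local + 64 * tid + 1;
}
```

The "+ 1" in the mapping skips each thread's block 0, the master block that is never allocated or freed.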

/*
 *******************************************************************************
 *                                                                             *
 *   WaitForAllThreadsInitialized() waits until all smp_max_threads are        *
 *   initialized.  We have to initialize each thread and malloc() its split    *
 *   blocks before we start the actual parallel search.  Otherwise we will see *
 *   invalid memory accesses and crash instantly.                              *
 *                                                                             *
 *******************************************************************************
 */
void WaitForAllThreadsInitialized(void) {
  while (initialized_threads < (int) smp_max_threads); /* Do nothing */ // Pierre-Marie Baty -- added type cast
}

#  if !defined (UNIX)
/* modified 01/17/09 */
/*
 *******************************************************************************
 *                                                                             *
 *   ThreadMalloc() is called from the ThreadInit() function.  It malloc's the *
 *   ...                                                                       *
 *                                                                             *
 *******************************************************************************
 */
extern void *WinMalloc(size_t, int);
void ThreadMalloc(int64_t tid) {
  int i;

  for (i = (int) tid * 64 + 1; i < (int) tid * 64 + 65; i++) { // Pierre-Marie Baty -- added type casts
    if (block[i] == NULL)
      block[i] =
          (TREE *) ((~(size_t) 127) & (127 + (size_t) WinMalloc(sizeof(TREE) +
                  127, (int) tid))); // Pierre-Marie Baty -- added type cast
    block[i]->parent = NULL;
    LockInit(block[i]->lock);
  }
}
#  endif
#endif

/* modified 01/17/09 */
/*
 *******************************************************************************
 *                                                                             *
 *   ThreadStop() is called from SearchParallel() when it detects a beta       *
 *   cutoff (fail high) at a node that is being searched in parallel.  We need *
 *   to stop all threads here, and since this threading algorithm is recursive *
 *   it may be necessary to stop other threads that are helping search this    *
 *   branch further down into the tree.  This function simply sets tree->stop  *
 *   to 1, which will stop that particular thread instantly and return it to   *
 *   the idle loop in ThreadWait().                                            *
 *                                                                             *
 *******************************************************************************
 */
void ThreadStop(TREE * RESTRICT tree) {
  int proc;

  Lock(tree->lock);
  tree->stop = 1;
  for (proc = 0; proc < smp_max_threads; proc++)
    if (tree->siblings[proc])
      ThreadStop(tree->siblings[proc]);
  Unlock(tree->lock);
}

/* modified 04/23/14 */
/*
 *******************************************************************************
 *                                                                             *
 *   ThreadWait() is the idle loop for the N threads that are created at the   *
 *   beginning when Crafty searches.  Threads are "parked" here waiting on a   *
 *   pointer to something they should search (a parameter block built in the   *
 *   function Thread() in this case).  When this pointer becomes non-zero,     *
 *   each thread "parked" here will immediately call SearchParallel() and      *
 *   begin the parallel search as directed.                                    *
 *                                                                             *
 *******************************************************************************
 */
int ThreadWait(int64_t tid, TREE * RESTRICT waiting) {
  int value, tstart, tend;

/*
 ************************************************************
 *                                                          *
 *  The N initial threads enter here and are kept penned    *
 *  here forever.  However, once a thread leaves here it    *
 *  may well re-enter ThreadWait() from the top while it    *
 *  waits for a parallel search to complete.  While it      *
 *  waits here, it can also join in to help other busy      *
 *  threads search their subtrees as well.                  *
 *                                                          *
 ************************************************************
 */
  while (1) {
    tstart = ReadClock();
    Lock(lock_split);
    Lock(lock_smp);
    smp_idle++;
    if (!waiting && smp_split == 0)
      smp_split = 1;
    if (!thread[tid].tree)
      thread[tid].idle = 1;
    Unlock(lock_smp);
    Unlock(lock_split);
/*
 ************************************************************
 *                                                          *
 *  We can exit if our thread[i] is non-zero, or if we are  *
 *  waiting on others to finish a block that *we* have to   *
 *  return through.  When the busy count on such a block    *
 *  hits zero, we return immediately, which unwinds the     *
 *  search as it should be.                                 *
 *                                                          *
 ************************************************************
 */
    while (!thread[tid].tree && (!waiting || waiting->nprocs))
      Pause();
    thread[tid].idle = 0;
    tend = ReadClock();
    Lock(lock_smp);
    idle_time += tend - tstart;
    if (!thread[tid].tree)
      thread[tid].tree = waiting;
/*
 ************************************************************
 *                                                          *
 *  We either have work to do, or threads we were waiting   *
 *  on have finished their work.                            *
 *                                                          *
 ************************************************************
 */
    smp_idle--;
    Unlock(lock_smp);
/*
 ************************************************************
 *                                                          *
 *  If we are waiting on a block and the busy count is now  *
 *  zero, we simply return to finish up the bookkeeping at  *
 *  that point.                                             *
 *                                                          *
 ************************************************************
 */
    if (thread[tid].tree == waiting || thread[tid].tree == (TREE *) - 1)
      return 0;
/*
 ************************************************************
 *                                                          *
 *  Else our thread[i] pointer is non-zero, meaning we have *
 *  been assigned something to search.  Hi-ho, hi-ho, it's  *
 *  off to work we go...                                    *
 *                                                          *
 ************************************************************
 */
    value =
        SearchParallel(thread[tid].tree, thread[tid].tree->alpha,
        thread[tid].tree->beta, thread[tid].tree->value,
        thread[tid].tree->side, thread[tid].tree->depth,
        thread[tid].tree->ply);
    Lock(thread[tid].tree->parent->lock);
    CopyToParent((TREE *) thread[tid].tree->parent, thread[tid].tree, value);
    thread[tid].tree->parent->nprocs--;
    thread[tid].tree->parent->siblings[tid] = 0;
    Unlock(thread[tid].tree->parent->lock);
    thread[tid].tree = 0;
  }
}
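ThreadStop() above propagates the stop flag recursively through the sibling pointers, so every helper below the fail-high split point quits, no matter how deep the recursive splits go. The same walk in miniature, with a hypothetical Node type and the per-node locking omitted:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 4

/* Hypothetical miniature of a split-point tree. */
typedef struct Node {
    int stop;
    struct Node *siblings[MAX_THREADS];
} Node;

/* ThreadStop() in miniature: set the stop flag here, then recurse
 * into every helper still attached below this split point.  The
 * real code holds tree->lock around the loop so the sibling list
 * cannot change while it is being walked. */
void stop_all(Node *n) {
    int i;

    n->stop = 1;
    for (i = 0; i < MAX_THREADS; i++)
        if (n->siblings[i])
            stop_all(n->siblings[i]);
}
```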