#include "chess.h"
#include "data.h"
#include "epdglue.h"
/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   Split() is the driver for the threaded parallel search in Crafty.  The    *
 *   basic idea is that whenever we notice that one (or more) threads are in   *
 *   their idle loop, we drop into Split(), from Search(), and begin a new     *
 *   parallel search from this node.  This is simply a problem of establishing *
 *   a new split point, and then letting each thread join this split point and *
 *   copy whatever data they need.                                             *
 *                                                                             *
 *   This is generation II of Split().  The primary differences address two    *
 *   specific performance-robbing issues.  (1) Excessive waiting for a split   *
 *   to be done, and (2) excessive waiting on specific locks.  Generation II   *
 *   addresses both of these to significantly improve performance.             *
 *                                                                             *
 *   The main difference between Gen I and Gen II is the effort required to    *
 *   split the tree and which thread(s) expend this effort.  In generation I,  *
 *   the parent thread was responsible for allocating a split block for each   *
 *   child thread, and then copying the necessary data from the parent split   *
 *   block to these child split blocks.  When all of this was completed, the   *
 *   child processes were released to start the parallel search after being    *
 *   held while the split / copy operations were done.  In the generation II   *
 *   Split() we now simply allocate a new split block for THIS thread, flag    *
 *   the parent split block as joinable, and then go directly to ThreadWait()  *
 *   which will drop us back in to the search immediately.  The idle threads   *
 *   are continually looping on Join() which will jump them right into this    *
 *   split block letting them do ALL the work of allocating a split block,     *
 *   filling it in, and then copying the data to their local split block.      *
 *   This distributes the split overhead among all the threads that split,     *
 *   rather than this thread having to do all the work while the other threads *
 *   sit idle.                                                                 *
 *                                                                             *
 *   Generation II is also much more lightweight, in that it copies only the   *
 *   bare minimum from parent to child.  Generation I safely copied far too    *
 *   much since this code was being changed regularly, but that is no longer   *
 *   necessary overhead.                                                       *
 *                                                                             *
 *   Generation II has a zero contention split algorithm.  In the past, when a *
 *   thread became idle, it posted a global split request and any thread that  *
 *   was at an appropriate point would try to split.  But while it was doing   *
 *   the splitting, other threads that were also willing to split would "get   *
 *   in line" because Crafty used a global lock to prevent two threads from    *
 *   attempting to split at the same instant in time.  They got in line, and   *
 *   waited for the original splitter to release the lock, but by then they    *
 *   had no idle threads left to split with.  A waste of time.  Now we allow   *
 *   ANY thread to attempt to split at the current ply.  When we do what might *
 *   be called a "gratuitous split" the only restriction is that if we have    *
 *   existing "gratuitous split points" (split blocks that are joinable but    *
 *   have not yet been joined), then we limit the number of such splits (per   *
 *   thread) to avoid excessive overhead.                                      *
 *                                                                             *
 *   Generation II takes another idea from DTS, the idea of "late-join".  The  *
 *   idea is fairly simple.  If, when a thread becomes idle, there are already *
 *   other split points being searched in parallel, then we will try to join   *
 *   one of them rather than waiting for someone to ask us to help.  We use    *
 *   some simple criteria:  (1) The split point must be joinable, which simply *
 *   means that no processor has exited the split point yet (which would let   *
 *   us know there is no more work here and a join would be futile);  (2) We   *
 *   compute an "interest" value which is a simple formula based on depth at   *
 *   the split point, and the number of moves already searched.  It seems less *
 *   risky to choose a split point with max depth AND minimum moves already    *
 *   searched so that there is plenty to do.  This was quite simple to add     *
 *   after the rest of the Generation II rewrite.  In fact, this is now THE    *
 *   way threads join a split point, period, which further simplifies the code *
 *   and improves efficiency.  IE, a thread can split when idle threads are    *
 *   noticed, or if it is far enough away from the tips to make the cost       *
 *   negligible.  At that point any idle thread(s) can join immediately, those *
 *   that become idle later can join when they are ready.                      *
 *                                                                             *
 *   There are a number of settable options via the command-line or .craftyrc  *
 *   initialization file.  Here's a concise explanation for each option and an *
 *   occasional suggestion for testing/tuning.                                 *
 *                                                                             *
 *   smp_affinity (command = smpaffinity=<n>) is used to enable or disable     *
 *      processor affinity.  -1 disables affinity and lets threads run on any  *
 *      available core.  If you use an integer <n> then thread zero will bind  *
 *      itself to cpu <n> and each additional thread will bind to the next     *
 *      higher cpu number.  This is useful if you try to run two copies of     *
 *      crafty on the same machine, now you can cause one to bind to the first *
 *      <n> cores, and the second to the last <n> cores.  For the first        *
 *      instance of Crafty, you would use smpaffinity=0, and for the second    *
 *      smpaffinity=8, assuming you are running 8 threads per copy on a 16 cpu *
 *      machine.  If you get this wrong, you can have more than one thread on  *
 *      the same cpu which will significantly impact performance.              *
 *                                                                             *
 *   smp_max_threads (command = smpmt=n) sets the total number of allowable    *
 *      threads for the search.  The default is one (1) as Crafty does not     *
 *      assume it should use all available resources.  For optimal performance *
 *      this should be set to the number of physical cores your machine has,   *
 *      which does NOT include hyperthreaded cores.                            *
 *                                                                             *
 *   smp_split_group (command = smpgroup=n) sets the maximum number of threads *
 *      at any single split point, with the exception of split points fairly   *
 *      close to the root where ALL threads are allowed to split together,     *
 *      ignoring this limit.  Note that this is ignored in the first 1/2 of    *
 *      the tree (the nodes closer to the root).  There it is actually good to *
 *      split and get all active threads involved.                             *
 *                                                                             *
 *   smp_min_split_depth (command = smpmin=n) avoids splitting when remaining  *
 *      depth < n.  This is used to balance splitting overhead cost against    *
 *      the speed gain the parallel search produces.  The default is currently *
 *      5 (which could change with future generations of Intel hardware) but   *
 *      values between 4 and 8 will work.  Larger values allow somewhat fewer  *
 *      splits, which reduces overhead, but it also increases the percentage   *
 *      of the time where a thread is waiting on work.                         *
 *                                                                             *
 *   smp_split_at_root (command = smproot=0 or 1) enables (1) or disables (0)  *
 *      splitting the tree at the root.  This defaults to 1 which produces the *
 *      best performance by a significant margin.  But it can be disabled if   *
 *      you are playing with code changes.                                     *
 *                                                                             *
 *   smp_gratuitous_depth (command = smpgd=<n>) controls "gratuitous splits"   *
 *      which are splits that are done without any idle threads.  This sets a  *
 *      depth limit (remaining depth) that must be present before such a split *
 *      can be done.  Making this number larger will reduce the number of      *
 *      these splits.  Making it too small will increase overhead slightly and *
 *      increase split block usage significantly.                              *
 *                                                                             *
 *   smp_gratuitous_limit (command = smpgl=<n>) limits the number of these     *
 *      splits that a thread can do.  Once a thread has this number of         *
 *      unjoined split points, it will not be allowed to split any more until  *
 *      one or more threads join at least one of the existing split points.    *
 *      In the smp search statistics, where you see output that looks like     *
 *      this:                                                                  *
 *                                                                             *
 *        splits=m(n) ...                                                      *
 *                                                                             *
 *      m is the total splits done, n is the number of "wasted splits" which   *
 *      are basically gratuitous splits where no thread joined before this     *
 *      split point was completed and deallocated.                             *
 *                                                                             *
 *   The best way to tune all of these parameters is to use the "autotune"     *
 *   command (see autotune.c and help autotune) which will automatically run   *
 *   tests and optimize the parameters.  More details are in the autotune.c    *
 *   source file.                                                              *
 *                                                                             *
 *   A few basic "rules of the road" for anyone interested in changing or      *
 *   adding to any of this code.                                               *
 *                                                                             *
 *   1.  If, at any time, you want to modify your private split block, no lock *
 *       is required.                                                          *
 *                                                                             *
 *   2.  If, at any time, you want to modify another split block, such as the  *
 *       parent split block shared move list, you must acquire the lock in the *
 *       split block first.  IE tree->parent->lock to lock the parent split    *
 *       block during NextMove() and such.                                     *
 *                                                                             *
 *   3.  If you want to modify any SMP-related data that spans multiple split  *
 *       blocks, such as telling sibling threads to stop, etc., then you must  *
 *       acquire the global "lock_smp" lock first.  This prevents a deadlock   *
 *       caused by two different threads walking the split block chain from    *
 *       different directions, and acquiring the split block locks in          *
 *       different orders, which could cause a catastrophic deadlock to occur. *
 *       This is an infrequent event so the overhead is not significant.       *
 *                                                                             *
 *   4.  If you want to do any sort of I/O operation, you must acquire the     *
 *       "lock_io" lock first.  Since threads share descriptors, there are     *
 *       lots of potential race conditions, from the simple tty-output getting *
 *       interlaced from different threads, to outright data corruption in the *
 *       book or log files.                                                    *
 *                                                                             *
 *   Some of the bugs caused by failing to acquire the correct lock will only  *
 *   occur infrequently, and they are extremely difficult to find.  Some only  *
 *   show up in a public game where everyone is watching, causing maximum      *
 *   embarrassment when the program does something extremely stupid.           *
 *                                                                             *
 *******************************************************************************
 */
int Split(TREE * RESTRICT tree) {
  TREE *child;
  int tid, tstart, tend;

/*
 ************************************************************
 *                                                          *
 *  Here we prepare to split the tree.  All we really do in *
 *  the Generation II threading is grab a split block for   *
 *  this thread, then flag the parent as "joinable" and     *
 *  then jump right to ThreadWait() to resume where we left *
 *  off, with the expectation (but not a requirement) that  *
 *  other threads will join us to help.                     *
 *                                                          *
 *  Idle threads are sitting in ThreadWait() repeatedly     *
 *  calling Join() to find them a split point, which we are *
 *  about to provide.  They will then join as quickly as    *
 *  they can, and other threads that become idle later can  *
 *  also join without any further splitting needed.         *
 *                                                          *
 *  If we are unable to allocate a split block, we simply   *
 *  abort this attempted split and return to the search     *
 *  since other threads will also split quickly.            *
 *                                                          *
 ************************************************************
 */
  tstart = ReadClock();
  tree->nprocs = 0;
  for (tid = 0; tid < smp_max_threads; tid++)
    tree->siblings[tid] = 0;
  child = GetBlock(tree, tree->thread_id);
  if (!child)
    return 0;
  CopyFromParent(child);
  thread[tree->thread_id].tree = child;
  tree->joined = 0;
  tree->joinable = 1;
  parallel_splits++;
  smp_split = 0;
  tend = ReadClock();
  thread[tree->thread_id].idle += tend - tstart;
/*
 ************************************************************
 *                                                          *
 *  We have successfully created a split point, which means *
 *  we are done.  The instant we set the "joinable" flag,   *
 *  idle threads may begin to join in at this split point   *
 *  to help.  Since this thread may finish before any or    *
 *  all of the other parallel threads, this thread is sent  *
 *  to ThreadWait() which will immediately send it to       *
 *  SearchMoveList() like the other threads; however, going *
 *  to ThreadWait() allows this thread to join others if it *
 *  runs out of work to do.  We do pass ThreadWait() the    *
 *  address of the parent split block, so that if this      *
 *  thread becomes idle, and that split block shows no      *
 *  threads are still busy, then this thread can return to  *
 *  here and then back up into the previous ply as it       *
 *  should.  Note that no other thread can back up to the   *
 *  previous ply since their recursive call stacks are not  *
 *  set for that, while this call stack will bring us back  *
 *  to this point where we return to the normal search,     *
 *  which we just completed.                                *
 *                                                          *
 ************************************************************
 */
  ThreadWait(tree->thread_id, tree);
  if (!tree->joined)
    parallel_splits_wasted++;
  return 1;
}

/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   Join() is called just when we enter the usual spin-loop waiting for work. *
 *   We take a quick look at all active split blocks to see if any look        *
 *   "joinable".  If so, we compute an "interest" value, which will be defined *
 *   below.  We then join the most interesting split point directly.  This     *
 *   split point might have been created specifically for this thread to join, *
 *   or it might be one that was already active when this thread became idle,  *
 *   which allows us to join that existing split point and not request a new   *
 *   split operation, saving time.                                             *
 *                                                                             *
 *******************************************************************************
 */
int Join(int64_t tid) {
  TREE *tree, *join_block, *child;
  int interest, best_interest, current, pass = 0;

/*
 ************************************************************
 *                                                          *
 *  First we pass over ALL split blocks, looking for those  *
 *  flagged as "joinable", which means they are actually    *
 *  active split points, no processor at that split point   *
 *  has run out of work (there is no point in joining a     *
 *  split point with no remaining work), and no fail high   *
 *  has been found there (which would raise the "stop"      *
 *  flag).  This is "racy" because we do not acquire any    *
 *  locks, which means that the busy threads continue       *
 *  working, and there is a small probability that the      *
 *  split point will disappear while we are in this loop.   *
 *  To resolve the potential race, after we find the most   *
 *  attractive split point, we acquire the lock for that    *
 *  split block and test again, but this time if the block  *
 *  is joinable, we can safely join under control of the    *
 *  lock, which is not held for very long at all.  If the   *
 *  block is not joinable once we acquire the lock, we      *
 *  abort joining since it is futile.  Note that if this    *
 *  happens, we will try to find an existing split point we *
 *  can join three times before we exit, setting smp_split  *
 *  to 1 to ask other threads to produce more candidate     *
 *  split points.                                           *
 *                                                          *
 *  Special case:  We don't want to join a split point that *
 *  was created by this thread.  While it works, it can add *
 *  overhead since we can encounter a later split point     *
 *  that originated at the current split point, and we      *
 *  would continue searching even though most of the work   *
 *  has already been completed.  The hash table would help  *
 *  avoid most (if not all) of this overhead, but there is  *
 *  no good reason to take the chance of this happening.    *
 *                                                          *
 ************************************************************
 */
  for (pass = 0; pass < 3; pass++) {
    best_interest = -999999;
    join_block = 0;
    for (current = 0; current <= smp_max_threads * 64; current++) {
      tree = block[current];
      if (tree->joinable && (tree->ply <= tree->depth / 2 ||
              tree->nprocs < (int) smp_split_group) && tree->thread_id != tid) { // Pierre-Marie Baty -- added type cast
        interest = tree->depth * 2 - tree->searched[0];
        if (interest > best_interest) {
          best_interest = interest;
          join_block = tree;
        }
      }
    }
/*
 ************************************************************
 *                                                          *
 *  Now we acquire the lock for this split block, and then  *
 *  check to see if the block is still flagged as joinable. *
 *  If so, we set things up, and then we get pretty tricky  *
 *  as we then release the lock, and then copy the data     *
 *  from the parent to our split block.  There is a chance  *
 *  that while we are copying this data, the split point    *
 *  gets completed by other threads, which would leave an   *
 *  apparent race condition exposed:  we start copying data *
 *  here, the split point is completed, the parent block is *
 *  released and then reacquired, and we continue as if     *
 *  nothing has happened here, getting data copied from two *
 *  different positions.                                    *
 *                                                          *
 *  Fortunately, we linked this new split block to the old  *
 *  (original parent).  If that split block is released, we *
 *  will discover this because that operation will also set *
 *  our "stop" flag, which will prevent us from using this  *
 *  data and breaking things.  We allow threads to copy     *
 *  this data without any lock protection to eliminate a    *
 *  serialization (each node would copy the data serially,  *
 *  rather than all at once), with the only consequence     *
 *  being the overhead of copying and then throwing the     *
 *  data away.  That can happen on occasion even if we      *
 *  used a lock for protection, since once we release the   *
 *  lock it still takes time to get into the search and we  *
 *  could STILL find that this split block has already been *
 *  completed, once again.  Less contention and less        *
 *  serialization improve performance.                      *
 *                                                          *
 ************************************************************
 */
    if (join_block) {
      Lock(join_block->lock);
      if (join_block->joinable) {
        child = GetBlock(join_block, (int) tid); // Pierre-Marie Baty -- added type cast
        Unlock(join_block->lock);
        if (child) {
          CopyFromParent(child);
          thread[tid].tree = child;
          parallel_joins++;
          return 1;
        }
      } else {
        Unlock(join_block->lock);
        break;
      }
    }
  }
/*
 ************************************************************
 *                                                          *
 *  We did not acquire a split point to join, so we set     *
 *  smp_split to 1 to ask busy threads to create joinable   *
 *  split points.                                           *
 *                                                          *
 ************************************************************
 */
  smp_split = 1;
  return 0;
}

/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   ThreadAffinity() is called to "pin" a thread to a specific processor.  It *
 *   is a "noop" (no-operation) if Crafty was not compiled with -DAFFINITY, or *
 *   if smp_affinity is negative (smpaffinity=-1 disables affinity).  It       *
 *   simply sets the affinity for the current thread to the requested CPU and  *
 *   returns.  NOTE:  If hyperthreading is enabled, there is no guarantee that *
 *   this will work as expected and pin one thread per physical core.  It      *
 *   depends on how the O/S numbers the SMT cores.                             *
 *                                                                             *
 *******************************************************************************
 */
void ThreadAffinity(int cpu) {
#if defined(AFFINITY)
  cpu_set_t cpuset;
  pthread_t current_thread = pthread_self();

  if (smp_affinity >= 0) {
    CPU_ZERO(&cpuset);
    CPU_SET(smp_affinity_increment * (cpu + smp_affinity), &cpuset);
    pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
  }
#endif
}

/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   ThreadInit() is called after a process is created.  Its main task is to   *
 *   initialize the process local memory so that it will fault in and be       *
 *   allocated on the local node rather than the node where the original       *
 *   (first) process was running.  All threads will hang here via a custom     *
 *   WaitForAllThreadsInitialized() procedure so that all the local thread     *
 *   blocks are usable before the search actually begins.                      *
 *                                                                             *
 *******************************************************************************
 */
void *STDCALL ThreadInit(void *t) {
  int tid = (int64_t) t;

  ThreadAffinity(tid);
#if !defined(UNIX)
  ThreadMalloc((uint64_t) tid);
#endif
  thread[tid].blocks = 0xffffffffffffffffull;
  Lock(lock_smp);
  initialized_threads++;
  Unlock(lock_smp);
  WaitForAllThreadsInitialized();
  ThreadWait(tid, (TREE *) 0);
  Lock(lock_smp);
  smp_threads--;
  Unlock(lock_smp);
  return 0;
}

/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   ThreadSplit() is used to determine if we should split at the current ply. *
 *   There are some basic constraints on when splits can be done, such as the  *
 *   depth remaining in the search (don't split too near the tips), and        *
 *   whether we have searched at least one move to get a score or bound (YBW   *
 *   condition).                                                               *
 *                                                                             *
 *   If those conditions are satisfied, AND either a thread has requested a    *
 *   split OR we are far enough away from the tips of the tree to justify a    *
 *   "gratuitous split", then we return "success."  A "gratuitous split" is a  *
 *   split done without any idle threads.  Since splits are not free, we only  *
 *   do this well away from the tips to limit overhead.  We do this so that    *
 *   when a thread becomes idle, it will find these split points immediately   *
 *   and not have to wait for a split after the fact.                          *
 *                                                                             *
 *******************************************************************************
 */
int ThreadSplit(TREE *RESTRICT tree, int ply, int depth, int alpha, int o_alpha, // Pierre-Marie Baty -- missing RESTRICT keyword
    int done) {
  TREE *used;
  int64_t tblocks;
  int temp, unused = 0;

/*
 ************************************************************
 *                                                          *
 *  First, we see if we meet the basic criteria to create a *
 *  split point, that being that we must not be too far     *
 *  from the root (smp_min_split_depth).                    *
 *                                                          *
 ************************************************************
 */
  if (depth < (int) smp_min_split_depth) // Pierre-Marie Baty -- added type cast
    return 0;
/*
 ************************************************************
 *                                                          *
 *  If smp_split is NOT set, we are checking to see if it   *
 *  is acceptable to do a gratuitous split here.            *
 *                                                          *
 *  (1) if we are too far from the root we do not do        *
 *      gratuitous splits to avoid the overhead.            *
 *                                                          *
 *  (2) if we have searched more than one move at this ply, *
 *      we don't do any further tests to see if a           *
 *      gratuitous split is acceptable, since we have       *
 *      previously done this test at this ply and decided   *
 *      one should not be done.  That condition has likely  *
 *      not changed.                                        *
 *                                                          *
 *  (3) if we have pre-existing gratuitous split points for *
 *      this thread, we make certain we don't create more   *
 *      than the gratuitous split limit as excessive splits *
 *      just add to the overhead with no benefit.           *
 *                                                          *
 ************************************************************
 */
  if (!smp_split) {
    if (depth < (int) smp_gratuitous_depth || done > 1) // Pierre-Marie Baty -- added type cast
      return 0;
    tblocks = ~thread[tree->thread_id].blocks;
    while (tblocks) {
      temp = LSB(tblocks);
      used = block[temp + tree->thread_id * 64 + 1];
      if (used->joinable && !used->joined)
        unused++;
      Clear(temp, tblocks);
    }
    if (unused > (int) smp_gratuitous_limit) // Pierre-Marie Baty -- added type cast
      return 0;
  }
/*
 ************************************************************
 *                                                          *
 *  If smp_split IS set, we are checking to see if it is    *
 *  acceptable to do a split because there are idle threads *
 *  that need work to do.                                   *
 *                                                          *
 *  The only reason this would be false is if we have a     *
 *  pre-existing split point that is joinable but has not   *
 *  been joined.  If one exists, there is no need to split  *
 *  again as there is already an accessible split point.    *
 *  Otherwise, if we are at the root and we are either not  *
 *  allowed to split at the root, or we have additional     *
 *  root moves that have to be searched one at a time using *
 *  all available threads, we also cannot split here.       *
 *                                                          *
 ************************************************************
 */
  else {
    if (ply == 1 && (!smp_split_at_root || !NextRootMoveParallel() ||
            alpha == o_alpha))
      return 0;
    tblocks = ~thread[tree->thread_id].blocks;
    while (tblocks) {
      temp = LSB(tblocks);
      used = block[temp + tree->thread_id * 64 + 1];
      if (used->joinable && !used->joined)
        unused++;
      Clear(temp, tblocks);
    }
    if (unused > (int) smp_gratuitous_limit) // Pierre-Marie Baty -- added type cast
      return 0;
  }
  return 1;
}

/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   ThreadStop() is called from SearchMoveList() when it detects a beta       *
 *   cutoff (fail high) at a node that is being searched in parallel.  We need *
 *   to stop all threads here, and since this threading algorithm is recursive *
 *   it may be necessary to stop other threads that are helping search this    *
 *   branch further down into the tree.  This function simply sets appropriate *
 *   tree->stop variables to 1, which will stop those particular threads       *
 *   instantly and return them to the idle loop in ThreadWait().               *
 *                                                                             *
 *******************************************************************************
 */
void ThreadStop(TREE * RESTRICT tree) {
  int proc;

  Lock(tree->lock);
  tree->stop = 1;
  tree->joinable = 0;
  for (proc = 0; proc < smp_max_threads; proc++)
    if (tree->siblings[proc])
      ThreadStop(tree->siblings[proc]);
  Unlock(tree->lock);
}

/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   ThreadTrace() is a debugging tool that simply walks the split block tree  *
 *   and displays interesting data to help debug the parallel search whenever  *
 *   changes break things.                                                     *
 *                                                                             *
 *******************************************************************************
 */
void ThreadTrace(TREE * RESTRICT tree, int depth, int brief) {
  int proc, i;

  Lock(tree->lock);
  Lock(lock_io);
  if (!brief) {
    for (i = 0; i < 4 * depth; i++)
      Print(4095, " ");
    depth++;
    Print(4095, "block[%d] thread=%d ply=%d nprocs=%d ",
        FindBlockID(tree), tree->thread_id, tree->ply, tree->nprocs);
    Print(4095, "joined=%d joinable=%d stop=%d nodes=%d", tree->joined,
        tree->joinable, tree->stop, tree->nodes_searched);
    Print(4095, " parent=%d\n", FindBlockID(tree->parent));
  } else {
    if (tree->nprocs > 1) {
      for (i = 0; i < 4 * depth; i++)
        Print(4095, " ");
      depth++;
      Print(4095, "(ply %d)", tree->ply);
    }
  }
  if (tree->nprocs) {
    if (!brief) {
      for (i = 0; i < 4 * depth; i++)
        Print(4095, " ");
      Print(4095, " parent=%d sibling threads=",
          FindBlockID(tree->parent));
      for (proc = 0; proc < smp_max_threads; proc++)
        if (tree->siblings[proc])
          Print(4095, " %d(%d)", proc, FindBlockID(tree->siblings[proc]));
      Print(4095, "\n");
    } else {
      if (tree->nprocs > 1) {
        Print(4095, " helping= ");
        for (proc = 0; proc < smp_max_threads; proc++)
          if (tree->siblings[proc]) {
            if (proc == tree->thread_id)
              Print(4095, "[");
            Print(4095, "%d", proc);
            if (proc == tree->thread_id)
              Print(4095, "]");
            Print(4095, " ");
          }
        Print(4095, "\n");
      }
    }
  }
  Unlock(lock_io);
  for (proc = 0; proc < smp_max_threads; proc++)
    if (tree->siblings[proc])
      ThreadTrace(tree->siblings[proc], depth, brief);
  Unlock(tree->lock);
}

/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   ThreadWait() is the idle loop for the N threads that are created at the   *
 *   beginning when Crafty searches.  Threads are "parked" here waiting on a   *
 *   pointer to something they should search (a parameter block built in the   *
 *   function Split() in this case).  When this pointer becomes non-zero, each *
 *   thread "parked" here will immediately call SearchMoveList() and begin the *
 *   parallel search as directed.                                              *
 *                                                                             *
 *   Upon entry, all threads except for the "master" will arrive here with a   *
 *   value of zero (0) in the waiting parameter below.  This indicates that    *
 *   they will search and then be done.  The "master" will arrive here with a  *
 *   pointer to the parent split block in "waiting", which says "I will sit    *
 *   here waiting for work OR until the waiting split block has no threads     *
 *   working on it, at which point I should return," letting me "unsplit"      *
 *   here and clean things up.  The call to here in Split() passes this block  *
 *   address, while threads that are helping get here with a zero.             *
 *                                                                             *
 *******************************************************************************
 */
int ThreadWait(int tid, TREE * RESTRICT waiting) {
  int value, tstart, tend;

/*
 ************************************************************
 *                                                          *
 *  When we reach this point, one of three possible         *
 *  conditions is true:  (1) we already have work to do, as *
 *  we are the "master thread" and we have already split    *
 *  the tree, so we are coming here to join in;  (2) we are *
 *  the master, and we are waiting on our split point to    *
 *  complete, so we come here to join and help currently    *
 *  active threads;  (3) we have no work to do, so we will  *
 *  spin until Join() locates a split point we can join to  *
 *  help out.                                               *
 *                                                          *
 *  Note that when we get here, the parent already has a    *
 *  split block and does not need to call Join(), it simply *
 *  falls through the while spin loop below because its     *
 *  "tree" pointer is already non-zero.                     *
 *                                                          *
 ************************************************************
 */
  while (FOREVER) {
    tstart = ReadClock();
    while (!thread[tid].tree && (!waiting || waiting->nprocs) && !Join(tid) &&
        !thread[tid].terminate);
    tend = ReadClock();
    if (!thread[tid].tree)
      thread[tid].tree = waiting;
    thread[tid].idle += tend - tstart;
    if (thread[tid].tree == waiting || thread[tid].terminate)
      return 0;
/*
 ************************************************************
 *                                                          *
 *  Once we get here, we have a good split point, so we are *
 *  ready to participate in a parallel search.  Once we     *
 *  return from SearchMoveList() we copy our results back   *
 *  to the parent via CopyToParent() before we look for a   *
 *  new split point.  If we are a parent, we will slip out  *
 *  of the spin loop at the top and return to the normal    *
 *  serial search to finish up here.                        *
 *                                                          *
 *  When we return from SearchMoveList(), we need to        *
 *  decrement the "nprocs" value since there is now one     *
 *  less thread working at this split point.                *
 *                                                          *
 *  Note:  CopyToParent() marks the current split block as  *
 *  unused once the copy is completed, so we don't have to  *
 *  do anything about that here.                            *
 *                                                          *
 ************************************************************
 */
    value =
        SearchMoveList(thread[tid].tree, thread[tid].tree->ply,
        thread[tid].tree->depth, thread[tid].tree->wtm,
        thread[tid].tree->alpha, thread[tid].tree->beta,
        thread[tid].tree->searched, thread[tid].tree->in_check, 0, parallel);
    tstart = ReadClock();
    Lock(thread[tid].tree->parent->lock);
    thread[tid].tree->parent->joinable = 0;
    CopyToParent((TREE *) thread[tid].tree->parent, thread[tid].tree, value);
    thread[tid].tree->parent->nprocs--;
    thread[tid].tree->parent->siblings[tid] = 0;
    Unlock(thread[tid].tree->parent->lock);
    thread[tid].tree = 0;
    tend = ReadClock();
    thread[tid].idle += tend - tstart;
  }
}

/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   CopyFromParent() is used to copy data from a parent thread to a child     *
 *   thread.  This only copies the appropriate parts of the TREE structure to  *
 *   avoid burning memory bandwidth by copying everything.                     *
 *                                                                             *
 *******************************************************************************
 */
void CopyFromParent(TREE * RESTRICT child) {
  TREE *parent = child->parent;
  int i, ply;

/*
 ************************************************************
 *                                                          *
 *  We have allocated a split block.  Now we copy the tree  *
 *  search state from the parent block to the child in      *
 *  preparation for starting the parallel search.           *
 *                                                          *
 ************************************************************
 */
  ply = parent->ply;
  child->ply = ply;
  child->position = parent->position;
  for (i = 0; i <= rep_index + parent->ply; i++)
    child->rep_list[i] = parent->rep_list[i];
  for (i = ply - 1; i < MAXPLY; i++)
    child->killers[i] = parent->killers[i];
  for (i = 0; i < 4096; i++) {
    child->counter_move[i] = parent->counter_move[i];
    child->move_pair[i] = parent->move_pair[i];
  }
  for (i = ply - 1; i <= ply; i++) {
    child->curmv[i] = parent->curmv[i];
    child->pv[i] = parent->pv[i];
  }
  child->in_check = parent->in_check;
  child->last[ply] = child->move_list;
  child->status[ply] = parent->status[ply];
  child->status[1] = parent->status[1];
  child->save_hash_key[ply] = parent->save_hash_key[ply];
  child->save_pawn_hash_key[ply] = parent->save_pawn_hash_key[ply];
  child->nodes_searched = 0;
  child->fail_highs = 0;
  child->fail_high_first_move = 0;
  child->evaluations = 0;
  child->egtb_probes = 0;
  child->egtb_hits = 0;
  child->extensions_done = 0;
  child->qchecks_done = 0;
  child->moves_fpruned = 0;
  child->moves_mpruned = 0;
  for (i = 0; i < 16; i++) {
    child->LMR_done[i] = 0;
    child->null_done[i] = 0;
  }
  child->alpha = parent->alpha;
  child->beta = parent->beta;
  child->value = parent->value;
  child->wtm = parent->wtm;
  child->depth = parent->depth;
  child->searched = parent->searched;
  strcpy(child->root_move_text, parent->root_move_text);
  strcpy(child->remaining_moves_text, parent->remaining_moves_text);
}

/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   CopyToParent() is used to copy data from a child thread to a parent       *
 *   thread.  This only copies the appropriate parts of the TREE structure to  *
 *   avoid burning memory bandwidth by copying everything.                     *
 *                                                                             *
 *******************************************************************************
 */
void CopyToParent(TREE * RESTRICT parent, TREE * RESTRICT child, int value) {
  int i, ply = parent->ply, which;

/*
 ************************************************************
 *                                                          *
 *  The only concern here is to make sure that the info is  *
 *  only copied to the parent if our score is greater than  *
 *  the parent value, and that we were not stopped for any  *
 *  reason, which could produce a partial score that is     *
 *  worthless and dangerous to use.                         *
 *                                                          *
 *  One important special case.  If we get here with the    *
 *  thread->stop flag set, and ply is 1, then we need to    *
 *  clear the "this move has been searched" flag in the ply *
 *  1 move list since we did not complete the search.  If   *
 *  we fail to do this, then a move being searched in       *
 *  parallel at the root will be "lost" for this iteration  *
 *  and won't be searched again until the next iteration.   *
 *                                                          *
 *  In any case, we add our statistical counters to the     *
 *  parent's totals no matter whether we finished or not,   *
 *  since the total nodes searched and such should consider *
 *  everything searched, not just the "useful stuff."       *
 *                                                          *
 *  After we finish copying everything, we mark this split  *
 *  block as free in the split block bitmap.                *
 *                                                          *
 ************************************************************
 */
  if (child->nodes_searched && !child->stop && value > parent->value &&
      !abort_search) {
    parent->pv[ply] = child->pv[ply];
    parent->value = value;
    parent->cutmove = child->curmv[ply];
    for (i = 0; i < 4096; i++) {
      parent->counter_move[i] = child->counter_move[i];
      parent->move_pair[i] = child->move_pair[i];
    }
  }
  if (child->stop && ply == 1)
    for (which = 0; which < n_root_moves; which++)
      if (root_moves[which].move == child->curmv[ply]) {
        root_moves[which].status &= 7;
        break;
      }
  parent->nodes_searched += child->nodes_searched;
  parent->fail_highs += child->fail_highs;
  parent->fail_high_first_move += child->fail_high_first_move;
  parent->evaluations += child->evaluations;
  parent->egtb_probes += child->egtb_probes;
  parent->egtb_hits += child->egtb_hits;
  parent->extensions_done += child->extensions_done;
  parent->qchecks_done += child->qchecks_done;
  parent->moves_fpruned += child->moves_fpruned;
  parent->moves_mpruned += child->moves_mpruned;
  for (i = 1; i < 16; i++) {
    parent->LMR_done[i] += child->LMR_done[i];
    parent->null_done[i] += child->null_done[i];
  }
  which = FindBlockID(child) - 64 * child->thread_id - 1;
  Set(which, thread[child->thread_id].blocks);
}

/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   GetBlock() is used to allocate a split block and fill in only SMP-        *
 *   critical information.  The child process will copy the rest of the split  *
 *   block information as needed.                                              *
 *                                                                             *
 *   When we arrive here, the parent split block must be locked since we are   *
 *   going to change data in that block as well as copy data from that block   *
 *   to the current split block.  The only exception is when this is the       *
 *   original split operation, since this is done "out of sight" of other      *
 *   threads, which means no locks are needed until after the "joinable" flag  *
 *   is set, which exposes this split point to other threads instantly.        *
 *                                                                             *
 *******************************************************************************
 */
TREE *GetBlock(TREE * RESTRICT parent, int tid) {
  TREE *child;
  static int warnings = 0;
  int i, unused;

/*
 ************************************************************
 *                                                          *
 *  One NUMA-related trick is that we only allocate a split *
 *  block in the thread's local memory.  Each thread has a  *
 *  group of split blocks that were first touched by the    *
 *  correct CPU so that the split blocks page-faulted into  *
 *  local memory for that specific processor.  If we can't  *
 *  find an optimally-placed block, we return a zero which  *
 *  will prevent this thread from joining the split point.  *
 *  This is highly unlikely, as it would mean the current   *
 *  thread has 64 active split blocks where it is waiting   *
 *  on other threads to complete the last bit of work at    *
 *  each.                                                   *
 *                                                          *
 *  Here we use a simple 64-bit bit-map per thread that     *
 *  indicates which blocks are free (1) and which blocks    *
 *  are used (0).  We simply use LSB() to find the          *
 *  rightmost one-bit set and that is the local block       *
 *  number.  We convert that to a global block number by    *
 *  doing the simple computation:                           *
 *                                                          *
 *     global = local + 64 * tid + 1                        *
 *                                                          *
 *  Each thread has exactly 64 split blocks, and block 0    *
 *  is the "master block" that never gets allocated or      *
 *  freed.  Once we find a free block for the current       *
 *  thread, we zero that bit so that the block won't be     *
 *  used again until it is released.                        *
 *                                                          *
 ************************************************************
 */
  if (thread[tid].blocks) {
    unused = LSB(thread[tid].blocks);
    Clear(unused, thread[tid].blocks);
    Set(unused, thread[tid].max_blocks);
  } else {
    if (++warnings < 6)
      Print(2048,
          "WARNING. local SMP block cannot be allocated, thread %d\n", tid);
    return 0;
  }
  child = block[unused + tid * 64 + 1];
/*
 ************************************************************
 *                                                          *
 *  Found a split block.  Now we need to fill in only the   *
 *  critical information that can't be delayed due to race  *
 *  conditions.  When we get here, the parent split block   *
 *  must be locked, which lets us safely update the number  *
 *  of processors working here, etc., without any ugly race *
 *  conditions that would corrupt this critical data.       *
 *                                                          *
 ************************************************************
 */
  for (i = 0; i < smp_max_threads; i++)
    child->siblings[i] = 0;
  child->nprocs = 0;
  child->stop = 0;
  child->joinable = 0;
  child->joined = 0;
  child->parent = parent;
  child->thread_id = tid;
  parent->nprocs++;
  parent->siblings[tid] = child;
  parent->joined = 1;
  return child;
}

| 961 | /* | ||
| 962 |  ******************************************************************************* | ||
| 963 |  *                                                                             * | ||
| 108 | pmbaty | 964 |  *   WaitForAllThreadsInitialized() waits until all smp_max_threads are        * | 
| 965 |  *   initialized.  We have to initialize each thread and malloc() its split    * | ||
| 966 |  *   blocks before we start the actual parallel search.  Otherwise we will see * | ||
| 967 |  *   invalid memory accesses and crash instantly.                              * | ||
| 33 | pmbaty | 968 |  *                                                                             * | 
| 969 |  ******************************************************************************* | ||
| 970 |  */ | ||
| 108 | pmbaty | 971 | void WaitForAllThreadsInitialized(void) { | 
| 154 | pmbaty | 972 | while (initialized_threads < smp_max_threads); /* Do nothing */ | 
| 33 | pmbaty | 973 | } | 

#if !defined (UNIX)
/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   ThreadMalloc() is called from the ThreadInit() function.  It malloc's the *
 *   split blocks in the local memory for the processor associated with the    *
 *   specific thread that is calling this code.                                *
 *                                                                             *
 *******************************************************************************
 */
extern void *WinMalloc(size_t, int);
void ThreadMalloc(int64_t tid) {
  int i;

  for (i = (int) (tid * 64 + 1); i < (int) (tid * 64 + 65); i++) {  // Pierre-Marie Baty -- added type casts
    if (block[i] == NULL)
      block[i] =
          (TREE *) ((~(size_t) 127) & (127 + (size_t) WinMalloc(sizeof(TREE) +
                  127, (int) tid)));  // Pierre-Marie Baty -- added type cast
    block[i]->parent = NULL;
    LockInit(block[i]->lock);
  }
}
#endif