Subversion Repositories Games.Chess Giants

Rev

Rev 33 | Go to most recent revision | Blame | Compare with Previous | Last modification | View Log | Download | RSS feed

  1. #include "chess.h"
  2. #include "data.h"
  3. #include "epdglue.h"
  4. #if (CPUS > 1)
  5. /* modified 11/04/15 */
  6. /*
  7.  *******************************************************************************
  8.  *                                                                             *
  9.  *   Split() is the driver for the threaded parallel search in Crafty.  The    *
  10.  *   basic idea is that whenever we notice that one (or more) threads are in   *
  11.  *   their idle loop, we drop into Split(), from Search(), and begin a new     *
  12.  *   parallel search from this node.  This is simply a problem of establishing *
  13.  *   a new split point, and then letting each thread join this split point and *
  14.  *   copy whatever data they need.                                             *
  15.  *                                                                             *
  16.  *   This is generation II of Split().  The primary differences address two    *
  17.  *   specific performance-robbing issues.  (1) Excessive waiting for a split   *
  18.  *   to be done, and (b) excessive waiting on specific locks.  Generation II   *
  19.  *   addresses both of these to significantly improve performance.             *
  20.  *                                                                             *
  21.  *   The main difference between Gen I and Gen II is the effort required to    *
  22.  *   split the tree and which thread(s) expend this effort.  In generation I,  *
  23.  *   the parent thread was responsible for allocating a split block for each   *
  24.  *   child thread, and then copying the necessary data from the parent split   *
  25.  *   block to these child split blocks.  When all of this was completed, the   *
  26.  *   child processes were released to start the parallel search after being    *
  27.  *   held while the split / copy operations were done.  In the generation II   *
  28.  *   Split() we now simply allocate a new split block for THIS thread, flag    *
  29.  *   the parent split block as joinable, and then go directly to ThreadWait()  *
  30.  *   which will drop us back in to the search immediately.  The idle threads   *
  31.  *   are continually looping on Join() which will jump them right into this    *
  32.  *   split block letting them do ALL the work of allocating a split block,     *
  33.  *   filling it in, and then copying the data to their local split block.      *
  34.  *   This distributes the split overhead among all the threads that split,     *
  35.  *   rather than this thread having to do all the work while the other threads *
  36.  *   sit idle.                                                                 *
  37.  *                                                                             *
  38.  *   Generation II is also much more lightweight, in that it copies only the   *
  39.  *   bare minimum from parent to child.  Generation I safely copied far too    *
  40.  *   much since this code was being changed regularly, but that is no longer   *
  41.  *   necessary overhead.                                                       *
  42.  *                                                                             *
  43.  *   Generation II has a zero contention split algorithm.  In the past, when a *
  44.  *   thread became idle, it posted a global split request and any thread that  *
  45.  *   was at an appropriate point would try to split.  But while it was doing   *
  46.  *   the splitting, other threads that were also willing to split would "get   *
  47.  *   in line" because Crafty used a global lock to prevent two threads from    *
  48.  *   attempting to split at the same instant in time.  They got in line, and   *
  49.  *   waited for the original splitter to release the lock, but now they have   *
  50.  *   no idle threads to split with.  A waste of time.  Now we allow ANY thread *
  51.  *   to attempt to split at the current ply.  When we do what might be called  *
  52.  *   a "gratuitous split" the only restriction is that if we have existing     *
  53.  *   "gratuitous split points" (split blocks that are joinable but have not    *
  54.  *   yet been joined), then we limit the number of such splits (per thread) to *
  55.  *   avoid excessive overhead.                                                 *
  56.  *                                                                             *
  57.  *   Generation II takes another idea from DTS, the idea of "late-join".  The  *
  58.  *   idea is fairly simple.  If, when a thread becomes idle, there are already *
  59.  *   other split points being searched in parallel, then we will try to join   *
  60.  *   one of them rather than waiting for someone to ask us to help.  We use    *
  61.  *   some simple criteria:  (1) The split point must be joinable, which simply *
  62.  *   means that no processor has exited the split point yet (which would let   *
  63.  *   us know there is no more work here and a join would be futile);  (2) We   *
  64.  *   compute an "interest" value which is a simple formula based on depth at   *
  65.  *   the split point, and the number of moves already searched.  It seems less *
  66.  *   risky to choose a split point with max depth AND minimum moves already    *
  67.  *   searched so that there is plenty to do.  This was quite simple to add     *
  68.  *   after the rest of the Generation II rewrite.  In fact, this is now THE    *
  69.  *   way threads join a split point, period, which further simplifies the code *
  70.  *   and improves efficiency.  IE, a thread can split when idle threads are    *
  71.  *   noticed, or if it is far enough away from the tips to make the cost       *
  72.  *   negligible.  At that point any idle thread(s) can join immediately, those *
  73.  *   that become idle later can join when they are ready.                      *
  74.  *                                                                             *
  75.  *   There are a number of settable options via the command-line or .craftyrc  *
  76.  *   initialization file.  Here's a concise explanation for each option and an *
  77.  *   occasional suggestion for testing/tuning.                                 *
  78.  *                                                                             *
  79.  *   smp_affinity (command = smpaffinity=<n> is used to enable or disable      *
  80.  *      processor affinity.  -1 disables affinity and lets threads run on any  *
  81.  *      available core.  If you use an integer <n> then thread zero will bind  *
  82.  *      itself to cpu <n> and each additional thread will bind to the next     *
  83.  *      higher cpu number.  This is useful if you try to run two copies of     *
  84.  *      crafty on the same machine, now you can cause one to bind to the first *
  85.  *      <n> cores, and the second to the last <n> cores.  For the first        *
  86.  *      instance of Crafty, you would use smpaffinity=0, and for the second    *
  87.  *      smpaffinity=8, assuming you are running 8 threads per copy on a 16 cpu *
  88.  *      machine.  If you get this wrong, you can have more than one thread on  *
  89.  *      the same cpu which will significantly impact performance.              *
  90.  *                                                                             *
  91.  *   smp_max_threads (command = smpmt=n) sets the total number of allowable    *
  92.  *      threads for the search.  The default is one (1) as Crafty does not     *
  93.  *      assume it should use all available resources.  For optimal performance *
  94.  *      this should be set to the number of physical cores your machine has,   *
  95.  *      which does NOT include hyperthreaded cores.                            *
  96.  *                                                                             *
  97.  *   smp_split_group (command = smpgroup=n) sets the maximum number of threads *
  98.  *      at any single split point, with the exception of split points fairly   *
  99.  *      close to the root where ALL threads are allowed to split together,     *
  100.  *      ignoring this limit.  Note that this is ignored in the first 1/2 of    *
  101.  *      the tree (the nodes closer to the root).  There it is actually good to *
  102.  *      split and get all active threads involved.                             *
  103.  *                                                                             *
  104.  *   smp_min_split_depth (command = smpmin=n) avoids splitting when remaining  *
  105.  *      depth < n.  This is used to balance splitting overhead cost against    *
  106.  *      the speed gain the parallel search produces.  The default is currently *
  107.  *      5 (which could change with future generations of Intel hardware) but   *
  108.  *      values between 4 and 8 will work.  Larger values allow somewhat fewer  *
  109.  *      splits, which reduces overhead, but it also increases the percentage   *
  110.  *      of the time where a thread is waiting on work.                         *
  111.  *                                                                             *
  112.  *   smp_split_at_root (command = smproot=0 or 1) enables (1) or disables (0)  *
  113.  *      splitting the tree at the root.  This defaults to 1 which produces the *
  114.  *      best performance by a signficiant margin.  But it can be disabled if   *
  115.  *      you are playing with code changes.                                     *
  116.  *                                                                             *
  117.  *   smp_gratuitous_depth (command = smpgd=<n>) controls " gratuitous splits"  *
  118.  *      which are splits that are done without any idle threads.  This sets a  *
  119.  *      depth limit (remaining depth) that must be present before such a split *
  120.  *      can be done.  Making this number larger will reduce the number of      *
  121.  *      these splits.  Making it too small will increase overhead slightly and *
  122.  *      increase split block usage significantly.                              *
  123.  *                                                                             *
  124.  *   smp_gratuitous_limit (command = smpgl=<n>) limits the number of these     *
  125.  *      splits that a thread can do.  Once a thread has this number of         *
  126.  *      unjoined split points, it will not be allowed to split any more until  *
  127.  *      one or more threads join at least one of the existing split points.    *
  128.  *      In the smp search statistics, where you see output that looks like     *
  129.  *      this:                                                                  *
  130.  *                                                                             *
  131.  *        splits=m(n) ...                                                      *
  132.  *                                                                             *
  133.  *      m is the total splits done, n is the number of "wasted splits" which   *
  134.  *      are basically gratuitous splits where no thread joined before this     *
  135.  *      split point was completed and deallocated.                             *
  136.  *                                                                             *
  137.  *   The best way to tune all of these paramenters is to use the "autotune"    *
  138.  *   command (see autotune.c and help autotune) which will automatically run   *
  139.  *   tests and optimize the parameters.  More details are in the autotune.c    *
  140.  *   source file.                                                              *
  141.  *                                                                             *
  142.  *   A few basic "rules of the road" for anyone interested in changing or      *
  143.  *   adding to any of this code.                                               *
  144.  *                                                                             *
  145.  *   1.  If, at any time, you want to modify your private split block, no lock *
  146.  *       is required.                                                          *
  147.  *                                                                             *
  148.  *   2.  If, at any time, you want to modify another split block, such as the  *
  149.  *       parent split block shared move list, you must acquire the lock in the *
  150.  *       split block first.  IE tree->parent->lock to lock the parent split    *
  151.  *       block during NextMove() and such.                                     *
  152.  *                                                                             *
  153.  *   3.  If you want to modify any SMP-related data that spans multiple split  *
  154.  *       blocks, such as telling sibling processes to stop, etc, then you must *
  155.  *       acquire the global "lock_smp" lock first.  This prevents a deadlock   *
  156.  *       caused by two different threads walking the split block chain from    *
  157.  *       different directions, and acquiring the split block locks in          *
  158.  *       different orders, which could cause a catastrophic deadlock to occur. *
  159.  *       This is an infrequent event so the overhead is not significant.       *
  160.  *                                                                             *
  161.  *   4.  If you want to do any sort of I/O operation, you must acquire the     *
  162.  *       "lock_io" lock first.  Since threads share descriptors, there are     *
  163.  *       lots of potential race conditions, from the simple tty-output getting *
  164.  *       interlaced from different threads, to outright data corruption in the *
  165.  *       book or log files.                                                    *
  166.  *                                                                             *
  167.  *   Some of the bugs caused by failing to acquire the correct lock will only  *
  168.  *   occur infrequently, and they are extremely difficult to find.  Some only  *
  169.  *   show up in a public game where everyone is watching, to cause maximum     *
  170.  *   embarassment and causes the program to do something extremely stupid.     *
  171.  *                                                                             *
  172.  *******************************************************************************
  173.  */
  174. int Split(TREE * RESTRICT tree) {
  175.   TREE *child;
  176.   int tid, tstart, tend;
  177.  
  178. /*
  179.  ************************************************************
  180.  *                                                          *
  181.  *  Here we prepare to split the tree.  All we really do in *
  182.  *  the Generation II threading is grab a split block for   *
  183.  *  this thread, then flag the parent as "joinable" and     *
  184.  *  then jump right to ThreadWait() to resume where we left *
  185.  *  off, with the expectation (but not a requirement) that  *
  186.  *  other threads will join us to help.                     *
  187.  *                                                          *
  188.  *  Idle threads are sitting in ThreadWait() repeatedly     *
  189.  *  calling Join() to find them a split point, which we are *
  190.  *  fixing to provide.  They will then join as quickly as   *
  191.  *  they can, and other threads that become idle later can  *
  192.  *  also join without any further splitting needed.         *
  193.  *                                                          *
  194.  *  If we are unable to allocate a split block, we simply   *
  195.  *  abort this attempted split and return to the search     *
  196.  *  since other threads will also split quickly.            *
  197.  *                                                          *
  198.  ************************************************************
  199.  */
  200.   tstart = ReadClock();
  201.   tree->nprocs = 0;
  202.   for (tid = 0; tid < (int) smp_max_threads; tid++) // Pierre-Marie Baty -- added type cast
  203.     tree->siblings[tid] = 0;
  204.   child = GetBlock(tree, tree->thread_id);
  205.   if (!child)
  206.     return 0;
  207.   CopyFromParent(child);
  208.   thread[tree->thread_id].tree = child;
  209.   tree->joined = 0;
  210.   tree->joinable = 1;
  211.   parallel_splits++;
  212.   smp_split = 0;
  213.   tend = ReadClock();
  214.   thread[tree->thread_id].idle += tend - tstart;
  215. /*
  216.  ************************************************************
  217.  *                                                          *
  218.  *  We have successfully created a split point, which means *
  219.  *  we are done.  The instant we set the "joinable" flag,   *
  220.  *  idle threads may begin to join in at this split point   *
  221.  *  to help.  Since this thread may finish before any or    *
  222.  *  all of the other parallel threads, this thread is sent  *
  223.  *  to ThreadWait() which will immediately send it to       *
  224.  *  SearchMoveList() like the other threads; however, going *
  225.  *  to ThreadWait() allows this thread to join others if it *
  226.  *  runs out of work to do.  We do pass ThreadWait() the    *
  227.  *  address of the parent split block, so that if this      *
  228.  *  thread becomes idle, and this thread block shows no     *
  229.  *  threads are still busy, then this thread can return to  *
  230.  *  here and then back up into the previous ply as it       *
  231.  *  should.  Note that no other thread can back up to the   *
  232.  *  previous ply since their recursive call stacks are not  *
  233.  *  set for that, while this call stack will bring us back  *
  234.  *  to this point where we return to the normal search,     *
  235.  *  which we just completed.                                *
  236.  *                                                          *
  237.  ************************************************************
  238.  */
  239.   ThreadWait(tree->thread_id, tree);
  240.   if (!tree->joined)
  241.     parallel_splits_wasted++;
  242.   return 1;
  243. }
  244.  
  245. /* modified 12/16/15 */
  246. /*
  247.  *******************************************************************************
  248.  *                                                                             *
  249.  *   Join() is called just when we enter the usual spin-loop waiting for work. *
  250.  *   We take a quick look at all active split blocks to see if any look        *
  251.  *   "joinable".  If so, we compute an "interest" value, which will be defined *
  252.  *   below.  We then join the most interesting split point directly. This      *
  253.  *   split point might have been created specifically for this thread to join, *
  254.  *   or it might be one that was already active when this thread became idle,  *
  255.  *   which allows us to join that existing split point and not request a new   *
  256.  *   split operation, saving time.                                             *
  257.  *                                                                             *
  258.  *******************************************************************************
  259.  */
  260. int Join(int64_t tid) {
  261.   TREE *tree, *join_block, *child;
  262.   int interest, best_interest, current, pass = 0;
  263.  
  264. /*
  265.  ************************************************************
  266.  *                                                          *
  267.  *  First we pass over ALL split blocks, looking for those  *
  268.  *  flagged as "joinable" (which means they are actually    *
  269.  *  active split points and that no processor at that split *
  270.  *  point has run out of work (there is no point in joining *
  271.  *  a split point with no remaining work) and no fail high  *
  272.  *  has been found which would raise the "stop" flag.) This *
  273.  *  is "racy" because we do not acquire any locks, which    *
  274.  *  means that the busy threads continue working, and there *
  275.  *  is a small probability that the split point will        *
  276.  *  disappear while we are in this loop.  To resolve the    *
  277.  *  potential race, after we find the most attractive split *
  278.  *  point, we acquire the lock for that split block and     *
  279.  *  test again, but this time if the block is joinable, we  *
  280.  *  can safely join under control of the lock, which is not *
  281.  *  held for very long at all.  If the block is not         *
  282.  *  joinable once we acquire the lock, we abort joining     *
  283.  *  since it is futile.  Note that if this happens, we will *
  284.  *  try to find an existing split point we can join three   *
  285.  *  times before we exit, setting split to 1 to ask other   *
  286.  *  threads to produce more candidate split points.         *
  287.  *                                                          *
  288.  ************************************************************
  289.  */
  290.   for (pass = 0; pass < 3; pass++) {
  291.     best_interest = -999999;
  292.     join_block = 0;
  293.     for (current = 0; current <= (int) smp_max_threads * 64; current++) { // Pierre-Marie Baty -- added type cast
  294.       tree = block[current];
  295.       if (tree->joinable && (tree->ply <= tree->depth / 2 ||
  296.               tree->nprocs < (int) smp_split_group)) { // Pierre-Marie Baty -- added type cast
  297.         interest = tree->depth * 2 - tree->searched[0];
  298.         if (interest > best_interest) {
  299.           best_interest = interest;
  300.           join_block = tree;
  301.         }
  302.       }
  303.     }
  304. /*
  305.  ************************************************************
  306.  *                                                          *
  307.  *  Now we acquire the lock for this split block, and then  *
  308.  *  check to see if the block is still flagged as joinable. *
  309.  *  If so, we set things up, and then we get pretty tricky  *
  310.  *  as we then release the lock, and then copy the data     *
  311.  *  from the parent to our split block.  There is a chance  *
  312.  *  that while we are copying this data, the split point    *
  313.  *  gets completed by other threads.  Which would leave an  *
  314.  *  apparent race condition exposed where we start copying  *
  315.  *  data here, the split point is completed, the parent     *
  316.  *  block is released and then reacquired and we continue   *
  317.  *  if nothing has happened here, getting data copied from  *
  318.  *  two different positions.                                *
  319.  *                                                          *
  320.  *  Fortunately, we linked this new split block to the old  *
  321.  *  (original parent).  If that split block is released, we *
  322.  *  will discover this because that operation will also set *
  323.  *  our "stop" flag which will prevent us from using this   *
  324.  *  data and breaking things.  We allow threads to copy     *
  325.  *  this data without any lock protection to eliminate a    *
  326.  *  serialization (each node would copy the data serially,  *
  327.  *  rather than all at once) with the only consequence to   *
  328.  *  this being the overhead of copying and then throwing    *
  329.  *  the data away, which can happen on occasion even if we  *
  330.  *  used a lock for protection, since once we release the   *
  331.  *  lock it still takes time to get into the search and we  *
  332.  *  could STILL find that this split block has already been *
  333.  *  completed, once again.  Less contention and serial      *
  334.  *  computing improves performance.                         *
  335.  *                                                          *
  336.  ************************************************************
  337.  */
  338.     if (join_block) {
  339.       Lock(join_block->lock);
  340.       if (join_block->joinable) {
  341.         child = GetBlock(join_block, (int) tid); // Pierre-Marie Baty -- added type cast
  342.         Unlock(join_block->lock);
  343.         if (child) {
  344.           CopyFromParent(child);
  345.           thread[tid].tree = child;
  346.           parallel_joins++;
  347.           return 1;
  348.         }
  349.       } else {
  350.         Unlock(join_block->lock);
  351.         break;
  352.       }
  353.     }
  354.   }
  355. /*
  356.  ************************************************************
  357.  *                                                          *
  358.  *  We did not acquire a split point to join, so we set     *
  359.  *  smp_split to 1 to ask busy threads to create joinable   *
  360.  *  split points.                                           *
  361.  *                                                          *
  362.  ************************************************************
  363.  */
  364.   smp_split = 1;
  365.   return 0;
  366. }
  367.  
  368. /* modified 11/04/15 */
  369. /*
  370.  *******************************************************************************
  371.  *                                                                             *
  372.  *   ThreadInit() is called after a process is created.  Its main task is to   *
  373.  *   initialize the process local memory so that it will fault in and be       *
  374.  *   allocated on the local node rather than the node where the original       *
  375.  *   (first) process was running.  All threads will hang here via a custom     *
  376.  *   WaitForALlThreadsInitialized() procedure so that all the local thread     *
  377.  *   blocks are usable before the search actually begins.                      *
  378.  *                                                                             *
  379.  *******************************************************************************
  380.  */
  381. void *STDCALL ThreadInit(void *t) {
  382.   int tid = (int64_t) t;
  383.  
  384.   ThreadAffinity(tid);
  385. #  if !defined(UNIX)
  386.   ThreadMalloc((uint64_t) tid);
  387. #  endif
  388.   thread[tid].blocks = 0xffffffffffffffffull;
  389.   Lock(lock_smp);
  390.   initialized_threads++;
  391.   Unlock(lock_smp);
  392.   WaitForAllThreadsInitialized();
  393.   ThreadWait(tid, (TREE *) 0);
  394.   Lock(lock_smp);
  395.   smp_threads--;
  396.   Unlock(lock_smp);
  397.   return 0;
  398. }
  399.  
  400. /* modified 01/08/15 */
  401. /*
  402.  *******************************************************************************
  403.  *                                                                             *
  404.  *   ThreadAffinity() is called to "pin" a thread to a specific processor.  It *
  405.  *   is a "noop" (no-operation) if Crafty was not compiled with -DAFFINITY, or *
  406.  *   if smp_affinity is negative (smpaffinity=-1 disables affinity).  It       *
  407.  *   simply sets the affinity for the current thread to the requested CPU and  *
  408.  *   returns.  NOTE:  If hyperthreading is enabled, there is no guarantee that *
  409.  *   this will work as expected and pin one thread per physical core.  It      *
  410.  *   depends on how the O/S numbers the SMT cores.                             *
  411.  *                                                                             *
  412.  *******************************************************************************
  413.  */
  414. void ThreadAffinity(int cpu) {
  415. #  if defined(AFFINITY)
  416.   cpu_set_t cpuset;
  417.   pthread_t current_thread = pthread_self();
  418.  
  419.   if (smp_affinity >= 0) {
  420.     CPU_ZERO(&cpuset);
  421.     CPU_SET(cpu + smp_affinity, &cpuset);
  422.     pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
  423.   }
  424. #  endif
  425. }
  426.  
  427. /* modified 11/04/15 */
  428. /*
  429.  *******************************************************************************
  430.  *                                                                             *
  431.  *   ThreadSplit() is used to determine if we should split at the current ply. *
  432.  *   There are some basic constraints on when splits can be done, such as the  *
  433.  *   depth remaining in the search (don't split to near the tips), and have we *
  434.  *   searched at least one move to get a score or bound (YBW condition).       *
  435.  *                                                                             *
  436.  *   If those conditions are satisfied, AND either a thread has requested a    *
  437.  *   split OR we are far enough away from the tips of the tree to justify a    *
  438.  *   "gratuitout split" then we return "success."  A "gratuitout split" is a   *
  439.  *   split done without any idle threads.  Since splits are not free, we only  *
  440.  *   do this well away from tips to limit overhead.  We do this so that when a *
  441.  *   thread becomes idle, it will find these split points immediately and not  *
  442.  *   have to wait for a split after the fact.                                  *
  443.  *                                                                             *
  444.  *******************************************************************************
  445.  */
  446. int ThreadSplit(TREE *RESTRICT tree, int ply, int depth, int alpha, int o_alpha, // Pierre-Marie Baty -- missing keyword
  447.     int done) {
  448.   TREE *used;
  449.   int64_t tblocks;
  450.   int temp, unused = 0;
  451.  
  452. /*
  453.  ************************************************************
  454.  *                                                          *
  455.  *  First, we see if we meet the basic criteria to create a *
  456.  *  split point, that being that we must not be too far     *
  457.  *  from the root (smp_min_split_depth).                    *
  458.  *                                                          *
  459.  ************************************************************
  460.  */
  461.   if (depth < (int) smp_min_split_depth) // Pierre-Marie Baty -- added type cast
  462.     return 0;
  463. /*
  464.  ************************************************************
  465.  *                                                          *
  466.  *  If smp_split is NOT set, we are checking to see if it   *
  467.  *  is acceptable to do a gratuitous split here.            *
  468.  *                                                          *
  469.  *  (1) if we are too far from the root we do not do        *
  470.  *      gratuitous splits to avoid the overhead.            *
  471.  *                                                          *
  472.  *  (2) if we have searched more than one move at this ply, *
  473.  *      we don't do any further tests to see if a           *
  474.  *      gratuitous split is acceptable, since we have       *
  475.  *      previously done this test at this ply and decided   *
  476.  *      one should not be done.  That condition has likely  *
  477.  *      not changed.                                        *
  478.  *                                                          *
  479.  *  (3) if we have pre-existing gratuitous split points for *
  480.  *      this thread, we make certain we don't create more   *
  481.  *      than the gratuitous split limit as excessive splits *
  482.  *      just add to the overhead with no benefit.           *
  483.  *                                                          *
  484.  ************************************************************
  485.  */
  486.   if (!smp_split) {
  487.     if (depth < (int) smp_gratuitous_depth || done > 1) // Pierre-Marie Baty -- added type cast
  488.       return 0;
  489.     tblocks = ~thread[tree->thread_id].blocks;
  490.     while (tblocks) {
  491.       temp = LSB(tblocks);
  492.       used = block[temp + tree->thread_id * 64 + 1];
  493.       if (used->joinable && !used->joined)
  494.         unused++;
  495.       Clear(temp, tblocks);
  496.     }
  497.     if (unused > (int) smp_gratuitous_limit) // Pierre-Marie Baty -- added type cast
  498.       return 0;
  499.   }
  500. /*
  501.  ************************************************************
  502.  *                                                          *
  503.  *  If smp_split IS set, we are checking to see if it is    *
  504.  *  acceptable to do a split because there are idle threads *
  505.  *  that need work to do.                                   *
  506.  *                                                          *
  507.  *  The only reason this would be false is if we have a     *
  508.  *  pre-existing split point that is joinable but has not   *
  509.  *  been joined. If one exists, there is no need to split   *
  510.  *  again as there is already an accessible split point.    *
  511.  *  Otherwise, if we are at the root and we are either not  *
  512.  *  allowed to split at the root, or we have additional     *
  513.  *  root moves that have to be searched one at a time using *
  514.  *  all available threads we also can not split here.       *
  515.  *                                                          *
  516.  ************************************************************
  517.  */
  518.   else {
  519.     if (ply == 1 && (!smp_split_at_root || !NextRootMoveParallel() ||
  520.             alpha == o_alpha))
  521.       return 0;
  522.     tblocks = ~thread[tree->thread_id].blocks;
  523.     while (tblocks) {
  524.       temp = LSB(tblocks);
  525.       used = block[temp + tree->thread_id * 64 + 1];
  526.       if (used->joinable && !used->joined)
  527.         unused++;
  528.       Clear(temp, tblocks);
  529.     }
  530.     if (unused > (int) smp_gratuitous_limit) // Pierre-Marie Baty -- added type cast
  531.       return 0;
  532.   }
  533.   return 1;
  534. }
  535.  
  536. /* modified 11/04/15 */
  537. /*
  538.  *******************************************************************************
  539.  *                                                                             *
  540.  *   ThreadStop() is called from SearchMoveList() when it detects a beta       *
  541.  *   cutoff (fail high) at a node that is being searched in parallel.  We need *
  542.  *   to stop all threads here, and since this threading algorithm is recursive *
  543.  *   it may be necessary to stop other threads that are helping search this    *
  544.  *   branch further down into the tree.  This function simply sets appropriate *
  545.  *   tree->stop variables to 1, which will stop those particular threads       *
  546.  *   instantly and return them to the idle loop in ThreadWait().               *
  547.  *                                                                             *
  548.  *******************************************************************************
  549.  */
  550. void ThreadStop(TREE * RESTRICT tree) {
  551.   int proc;
  552.  
  553.   Lock(tree->lock);
  554.   tree->stop = 1;
  555.   tree->joinable = 0;
  556.   for (proc = 0; proc < (int) smp_max_threads; proc++) // Pierre-Marie Baty -- added type cast
  557.     if (tree->siblings[proc])
  558.       ThreadStop(tree->siblings[proc]);
  559.   Unlock(tree->lock);
  560. }
  561.  
  562. /* modified 11/04/15 */
  563. /*
  564.  *******************************************************************************
  565.  *                                                                             *
  566.  *   ThreadTrace() is a debugging tool that simply walks the split block tree  *
  567.  *   and displays interesting data to help debug the parallel search whenever  *
  568.  *   changes break things.                                                     *
  569.  *                                                                             *
  570.  *******************************************************************************
  571.  */
  572. void ThreadTrace(TREE * RESTRICT tree, int depth, int brief) {
  573.   int proc, i;
  574.  
  575.   Lock(tree->lock);
  576.   Lock(lock_io);
  577.   if (!brief) {
  578.     for (i = 0; i < 4 * depth; i++)
  579.       Print(4095, " ");
  580.     depth++;
  581.     Print(4095, "block[%d]  thread=%d  ply=%d  nprocs=%d  ",
  582.         FindBlockID(tree), tree->thread_id, tree->ply, tree->nprocs);
  583.     Print(4095, "joined=%d  joinable=%d  stop=%d  nodes=%d", tree->joined,
  584.         tree->joinable, tree->stop, tree->nodes_searched);
  585.     Print(4095, "  parent=%d\n", FindBlockID(tree->parent));
  586.   } else {
  587.     if (tree->nprocs > 1) {
  588.       for (i = 0; i < 4 * depth; i++)
  589.         Print(4095, " ");
  590.       depth++;
  591.       Print(4095, "(ply %d)", tree->ply);
  592.     }
  593.   }
  594.   if (tree->nprocs) {
  595.     if (!brief) {
  596.       for (i = 0; i < 4 * depth; i++)
  597.         Print(4095, " ");
  598.       Print(4095, "          parent=%d  sibling threads=",
  599.           FindBlockID(tree->parent));
  600.       for (proc = 0; proc < (int) smp_max_threads; proc++) // Pierre-Marie Baty -- added type cast
  601.         if (tree->siblings[proc])
  602.           Print(4095, " %d(%d)", proc, FindBlockID(tree->siblings[proc]));
  603.       Print(4095, "\n");
  604.     } else {
  605.       if (tree->nprocs > 1) {
  606.         Print(4095, " helping= ");
  607.         for (proc = 0; proc < (int) smp_max_threads; proc++) // Pierre-Marie Baty -- added type cast
  608.           if (tree->siblings[proc]) {
  609.             if (proc == tree->thread_id)
  610.               Print(4095, "[");
  611.             Print(4095, "%d", proc);
  612.             if (proc == tree->thread_id)
  613.               Print(4095, "]");
  614.             Print(4095, " ");
  615.           }
  616.         Print(4095, "\n");
  617.       }
  618.     }
  619.   }
  620.   Unlock(lock_io);
  621.   for (proc = 0; proc < (int) smp_max_threads; proc++) // Pierre-Marie Baty -- added type cast
  622.     if (tree->siblings[proc])
  623.       ThreadTrace(tree->siblings[proc], depth, brief);
  624.   Unlock(tree->lock);
  625. }
  626.  
  627. /* modified 11/04/15 */
  628. /*
  629.  *******************************************************************************
  630.  *                                                                             *
  631.  *   ThreadWait() is the idle loop for the N threads that are created at the   *
  632.  *   beginning when Crafty searches.  Threads are "parked" here waiting on a   *
  633.  *   pointer to something they should search (a parameter block built in the   *
  634.  *   function Split() in this case.  When this pointer becomes non-zero, each  *
  635.  *   thread "parked" here will immediately call SearchMoveList() and begin the *
  636.  *   parallel search as directed.                                              *
  637.  *                                                                             *
  638.  *   Upon entry, all threads except for the "master" will arrive here with a   *
  639.  *   value of zero (0) in the waiting parameter below.  This indicates that    *
  640.  *   they will search and them be done.  The "master" will arrive here with a  *
  641.  *   pointer to the parent split block in "waiting" which says I will sit here *
  642.  *   waiting for work OR when the waiting split block has no threads working   *
  643.  *   on it, at which point I should return which will let me "unsplit" here    *
  644.  *   and clean things up.  The call to here in Split() passes this block       *
  645.  *   address while threads that are helping get here with a zero.              *
  646.  *                                                                             *
  647.  *******************************************************************************
  648.  */
  649. int ThreadWait(int tid, TREE * RESTRICT waiting) {
  650.   int value, tstart, tend;
  651.  
  652. /*
  653.  ************************************************************
  654.  *                                                          *
  655.  *  When we reach this point, one of three possible         *
  656.  *  conditions is true (1) we already have work to do, as   *
  657.  *  we are the "master thread" and we have already split    *
  658.  *  the tree, we are coming here to join in;  (2) we are    *
  659.  *  the master, and we are waiting on our split point to    *
  660.  *  complete, so we come here to join and help currently    *
  661.  *  active threads;  (3) we have no work to do, so we will  *
  662.  *  spin until Join() locates a split pont we can join to   *
  663.  *  help out.                                               *
  664.  *                                                          *
  665.  *  Note that when we get here, the parent already has a    *
  666.  *  split block and does not need to call Join(), it simply *
  667.  *  falls through the while spin loop below because its     *
  668.  *  "tree" pointer is already non-zero.                     *
  669.  *                                                          *
  670.  ************************************************************
  671.  */
  672.   while (1) {
  673.     tstart = ReadClock();
  674.     while (!thread[tid].tree && (!waiting || waiting->nprocs) && !Join(tid) &&
  675.         !thread[tid].terminate)
  676.       Pause();
  677.     tend = ReadClock();
  678.     if (!thread[tid].tree)
  679.       thread[tid].tree = waiting;
  680.     thread[tid].idle += tend - tstart;
  681.     if (thread[tid].tree == waiting || thread[tid].terminate)
  682.       return 0;
  683. /*
  684.  ************************************************************
  685.  *                                                          *
  686.  *  Once we get here, we have a good split point, so we are *
  687.  *  ready to participate in a parallel search.  Once we     *
  688.  *  return from SearchMoveList() we copy our results back   *
  689.  *  to the parent via CopyToParent() before we look for a   *
  690.  *  new split point.  If we are a parent, we will slip out  *
  691.  *  of the spin loop at the top and return to the normal    *
  692.  *  serial search to finish up here.                        *
  693.  *                                                          *
  694.  *  When we return from SearchMoveList(), we need to        *
  695.  *  decrement the "nprocs" value since there is now one     *
  696.  *  less thread working at this split point.                *
  697.  *                                                          *
  698.  *  Note:  CopyToParent() marks the current split block as  *
  699.  *  unused once the copy is completed, so we don't have to  *
  700.  *  do anything about that here.                            *
  701.  *                                                          *
  702.  ************************************************************
  703.  */
  704.     value =
  705.         SearchMoveList(thread[tid].tree, thread[tid].tree->ply,
  706.         thread[tid].tree->depth, thread[tid].tree->wtm,
  707.         thread[tid].tree->alpha, thread[tid].tree->beta,
  708.         thread[tid].tree->searched, thread[tid].tree->in_check, 0, parallel);
  709.     tstart = ReadClock();
  710.     Lock(thread[tid].tree->parent->lock);
  711.     thread[tid].tree->parent->joinable = 0;
  712.     CopyToParent((TREE *) thread[tid].tree->parent, thread[tid].tree, value);
  713.     thread[tid].tree->parent->nprocs--;
  714.     thread[tid].tree->parent->siblings[tid] = 0;
  715.     Unlock(thread[tid].tree->parent->lock);
  716.     thread[tid].tree = 0;
  717.     tend = ReadClock();
  718.     thread[tid].idle += tend - tstart;
  719.   }
  720. }
  721.  
  722. /* modified 11/04/15 */
  723. /*
  724.  *******************************************************************************
  725.  *                                                                             *
  726.  *   CopyFromParent() is used to copy data from a parent thread to a child     *
  727.  *   thread.  This only copies the appropriate parts of the TREE structure to  *
  728.  *   avoid burning memory bandwidth by copying everything.                     *
  729.  *                                                                             *
  730.  *******************************************************************************
  731.  */
  732. void CopyFromParent(TREE * RESTRICT child) {
  733.   TREE *parent = child->parent;
  734.   int i, ply;
  735.  
  736. /*
  737.  ************************************************************
  738.  *                                                          *
  739.  *  We have allocated a split block.  Now we copy the tree  *
  740.  *  search state from the parent block to the child in      *
  741.  *  preparation for starting the parallel search.           *
  742.  *                                                          *
  743.  ************************************************************
  744.  */
  745.   ply = parent->ply;
  746.   child->ply = ply;
  747.   child->position = parent->position;
  748.   for (i = 0; i <= rep_index + parent->ply; i++)
  749.     child->rep_list[i] = parent->rep_list[i];
  750.   for (i = ply - 1; i < MAXPLY; i++)
  751.     child->killers[i] = parent->killers[i];
  752.   for (i = ply - 1; i <= ply; i++) {
  753.     child->curmv[i] = parent->curmv[i];
  754.     child->pv[i] = parent->pv[i];
  755.   }
  756.   child->in_check = parent->in_check;
  757.   child->last[ply] = child->move_list;
  758.   child->status[ply] = parent->status[ply];
  759.   child->status[1] = parent->status[1];
  760.   child->save_hash_key[ply] = parent->save_hash_key[ply];
  761.   child->save_pawn_hash_key[ply] = parent->save_pawn_hash_key[ply];
  762.   child->nodes_searched = 0;
  763.   child->fail_highs = 0;
  764.   child->fail_high_first_move = 0;
  765.   child->evaluations = 0;
  766.   child->egtb_probes = 0;
  767.   child->egtb_hits = 0;
  768.   child->extensions_done = 0;
  769.   child->qchecks_done = 0;
  770.   child->moves_fpruned = 0;
  771.   child->moves_mpruned = 0;
  772.   for (i = 0; i < 16; i++) {
  773.     child->LMR_done[i] = 0;
  774.     child->null_done[i] = 0;
  775.   }
  776.   child->alpha = parent->alpha;
  777.   child->beta = parent->beta;
  778.   child->value = parent->value;
  779.   child->wtm = parent->wtm;
  780.   child->depth = parent->depth;
  781.   child->searched = parent->searched;
  782.   strcpy(child->root_move_text, parent->root_move_text);
  783.   strcpy(child->remaining_moves_text, parent->remaining_moves_text);
  784. }
  785.  
  786. /* modified 11/04/15 */
  787. /*
  788.  *******************************************************************************
  789.  *                                                                             *
  790.  *   CopyToParent() is used to copy data from a child thread to a parent       *
  791.  *   thread.  This only copies the appropriate parts of the TREE structure to  *
  792.  *   avoid burning memory bandwidth by copying everything.                     *
  793.  *                                                                             *
  794.  *******************************************************************************
  795.  */
  796. void CopyToParent(TREE * RESTRICT parent, TREE * RESTRICT child, int value) {
  797.   int i, ply = parent->ply, which;
  798.  
  799. /*
  800.  ************************************************************
  801.  *                                                          *
  802.  *  The only concern here is to make sure that the info is  *
  803.  *  only copied to the parent if our score is > than the    *
  804.  *  parent value, and that we were not stopped for any      *
  805.  *  reason which could produce a partial score that is      *
  806.  *  worthless and dangerous to use.                         *
  807.  *                                                          *
  808.  *  One important special case.  If we get here with the    *
  809.  *  thread->stop flag set, and ply is 1, then we need to    *
  810.  *  clear the "this move has been searched" flag in the ply *
  811.  *  1 move list since we did not complete the search.  If   *
  812.  *  we fail to do this, then a move being searched in       *
  813.  *  parallel at the root will be "lost" for this iteration  *
  814.  *  and won't be searched again until the next iteration.   *
  815.  *                                                          *
  816.  *  In any case, we add our statistical counters to the     *
  817.  *  parent's totals no matter whether we finished or not    *
  818.  *  since the total nodes searched and such should consider *
  819.  *  everything searched, not just the "useful stuff."       *
  820.  *                                                          *
  821.  *  After we finish copying everything, we mark this split  *
  822.  *  block as free in the split block bitmap.                *
  823.  *                                                          *
  824.  ************************************************************
  825.  */
  826.   if (child->nodes_searched && !child->stop && value > parent->value &&
  827.       !abort_search) {
  828.     parent->pv[ply] = child->pv[ply];
  829.     parent->value = value;
  830.     parent->cutmove = child->curmv[ply];
  831.   }
  832.   if (child->stop && ply == 1)
  833.     for (which = 0; which < n_root_moves; which++)
  834.       if (root_moves[which].move == child->curmv[ply]) {
  835.         root_moves[which].status &= 7;
  836.         break;
  837.       }
  838.   parent->nodes_searched += child->nodes_searched;
  839.   parent->fail_highs += child->fail_highs;
  840.   parent->fail_high_first_move += child->fail_high_first_move;
  841.   parent->evaluations += child->evaluations;
  842.   parent->egtb_probes += child->egtb_probes;
  843.   parent->egtb_hits += child->egtb_hits;
  844.   parent->extensions_done += child->extensions_done;
  845.   parent->qchecks_done += child->qchecks_done;
  846.   parent->moves_fpruned += child->moves_fpruned;
  847.   parent->moves_mpruned += child->moves_mpruned;
  848.   for (i = 1; i < 16; i++) {
  849.     parent->LMR_done[i] += child->LMR_done[i];
  850.     parent->null_done[i] += child->null_done[i];
  851.   }
  852.   which = FindBlockID(child) - 64 * child->thread_id - 1;
  853.   Set(which, thread[child->thread_id].blocks);
  854. }
  855.  
  856. /* modified 11/04/15 */
  857. /*
  858.  *******************************************************************************
  859.  *                                                                             *
  860.  *   GetBlock() is used to allocate a split block and fill in only SMP-        *
  861.  *   critical information.  The child process will copy the rest of the split  *
  862.  *   block information as needed.                                              *
  863.  *                                                                             *
  864.  *   When we arrive here, the parent split block must be locked since we are   *
  865.  *   going to change data in that block as well as copy data from that block   *
  866.  *   the current split block.  The only exception is when this is the original *
  867.  *   split operation, since this is done "out of sight" of other threads which *
  868.  *   means no locks are needed until after the "joinable" flag is set, which   *
  869.  *   exposes this split point to other threads instantly.                      *
  870.  *                                                                             *
  871.  *******************************************************************************
  872.  */
  873. TREE *GetBlock(TREE * /*RESTRICT*/ parent, int tid) { // Pierre-Marie Baty -- keyword not present in declaration
  874.   TREE *child;
  875.   static int warnings = 0;
  876.   int i, unused;
  877. /*
  878.  ************************************************************
  879.  *                                                          *
  880.  *  One NUMA-related trick is that we only allocate a split *
  881.  *  block in the thread's local memory.  Each thread has a  *
  882.  *  group of split blocks that were first touched by the    *
  883.  *  correct CPU so that the split blocks page-faulted into  *
  884.  *  local memory for that specific processor.  If we can't  *
  885.  *  find an optimally-placed block, we return a zero which  *
  886.  *  will prevent this thread from joining the split point.  *
  887.  *  This is highly unlikely as it would mean the current    *
  888.  *  thread has 64 active split blocks where it is waiting   *
  889.  *  on other threads to complete the last bit of work at    *
  890.  *  each.  This is extremely unlikely.                      *
  891.  *                                                          *
  892.  *  Here we use a simple 64 bit bit-map per thread that     *
  893.  *  indicates which blocks are free (1) and which blocks    *
  894.  *  used (0).  We simply use LSB() to find the rightmost    *
  895.  *  one-bit set and that is the local block number.  We     *
  896.  *  convert that to a global block number by doing the      *
  897.  *  simple computation:                                     *
  898.  *                                                          *
  899.  *     global = local + 64 * tid + 1                        *
  900.  *                                                          *
  901.  *  Each thread has exactly 64 split blocks, and block 0    *
  902.  *  is the "master block" that never gets allocated or      *
  903.  *  freed.  Once we find a free block for the current       *
  904.  *  thread, we zero that bit so that the block won't be     *
  905.  *  used again until it is released.                        *
  906.  *                                                          *
  907.  ************************************************************
  908.  */
  909.   if (thread[tid].blocks) {
  910.     unused = LSB(thread[tid].blocks);
  911.     Clear(unused, thread[tid].blocks);
  912.     Set(unused, thread[tid].max_blocks);
  913.   } else {
  914.     if (++warnings < 6)
  915.       Print(2048,
  916.           "WARNING.  local SMP block cannot be allocated, thread %d\n", tid);
  917.     return 0;
  918.   }
  919.   child = block[unused + tid * 64 + 1];
  920. /*
  921.  ************************************************************
  922.  *                                                          *
  923.  *  Found a split block.  Now we need to fill in only the   *
  924.  *  critical information that can't be delayed due to race  *
  925.  *  conditions.  When we get here, the parent split block   *
  926.  *  must be locked, that lets us safely update the number   *
  927.  *  of processors working here, etc, without any ugly race  *
  928.  *  conditions that would corrupt this critical data.       *
  929.  *                                                          *
  930.  ************************************************************
  931.  */
  932.   for (i = 0; i < (int) smp_max_threads; i++) // Pierre-Marie Baty -- added type cast
  933.     child->siblings[i] = 0;
  934.   child->nprocs = 0;
  935.   child->stop = 0;
  936.   child->joinable = 0;
  937.   child->joined = 0;
  938.   child->parent = parent;
  939.   child->thread_id = tid;
  940.   parent->nprocs++;
  941.   parent->siblings[tid] = child;
  942.   parent->joined = 1;
  943.   return child;
  944. }
  945.  
  946. /*
  947.  *******************************************************************************
  948.  *                                                                             *
  949.  *   WaitForAllThreadsInitialized() waits until all smp_max_threads are        *
  950.  *   initialized.  We have to initialize each thread and malloc() its split    *
  951.  *   blocks before we start the actual parallel search.  Otherwise we will see *
  952.  *   invalid memory accesses and crash instantly.                              *
  953.  *                                                                             *
  954.  *******************************************************************************
  955.  */
  956. void WaitForAllThreadsInitialized(void) {
  957.   while (initialized_threads < (int) smp_max_threads); /* Do nothing */ // Pierre-Marie Baty -- added type cast
  958. }
  959.  
  960. #  if !defined (UNIX)
  961. /* modified 01/17/09 */
  962. /*
  963.  *******************************************************************************
  964.  *                                                                             *
  965.  *   ThreadMalloc() is called from the ThreadInit() function.  It malloc's the *
  966.  *   split blocks in the local memory for the processor associated with the    *
  967.  *   specific thread that is calling this code.                                *
  968.  *                                                                             *
  969.  *******************************************************************************
  970.  */
  971. extern void *WinMalloc(size_t, int);
  972. void ThreadMalloc(int64_t tid) {
  973.   int i;
  974.  
  975.   for (i = (int) tid * 64 + 1; i < (int) tid * 64 + 65; i++) { // Pierre-Marie Baty -- added type casts
  976.     if (block[i] == NULL)
  977.       block[i] =
  978.           (TREE *) ((~(size_t) 127) & (127 + (size_t) WinMalloc(sizeof(TREE) +
  979.                   127, (int) tid))); // Pierre-Marie Baty -- added type cast
  980.     block[i]->parent = NULL;
  981.     LockInit(block[i]->lock);
  982.   }
  983. }
  984. #  endif
  985. #endif
  986.