#include "chess.h"
#include "data.h"
#include "epdglue.h"
/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   Split() is the driver for the threaded parallel search in Crafty.  The    *
 *   basic idea is that whenever we notice that one (or more) threads are in   *
 *   their idle loop, we drop into Split(), from Search(), and begin a new     *
 *   parallel search from this node.  This is simply a problem of establishing *
 *   a new split point, and then letting each thread join this split point and *
 *   copy whatever data they need.                                             *
 *                                                                             *
 *   This is generation II of Split().  The primary differences address two    *
 *   specific performance-robbing issues.  (1) Excessive waiting for a split   *
 *   to be done, and (2) excessive waiting on specific locks.  Generation II   *
 *   addresses both of these to significantly improve performance.             *
 *                                                                             *
 *   The main difference between Gen I and Gen II is the effort required to    *
 *   split the tree and which thread(s) expend this effort.  In generation I,  *
 *   the parent thread was responsible for allocating a split block for each   *
 *   child thread, and then copying the necessary data from the parent split   *
 *   block to these child split blocks.  When all of this was completed, the   *
 *   child processes were released to start the parallel search after being    *
 *   held while the split / copy operations were done.  In the generation II   *
 *   Split() we now simply allocate a new split block for THIS thread, flag    *
 *   the parent split block as joinable, and then go directly to ThreadWait()  *
 *   which will drop us back into the search immediately.  The idle threads    *
 *   are continually looping on Join() which will jump them right into this    *
 *   split block letting them do ALL the work of allocating a split block,     *
 *   filling it in, and then copying the data to their local split block.      *
 *   This distributes the split overhead among all the threads that split,     *
 *   rather than this thread having to do all the work while the other threads *
 *   sit idle.                                                                 *
 *                                                                             *
 *   Generation II is also much more lightweight, in that it copies only the   *
 *   bare minimum from parent to child.  Generation I safely copied far too    *
 *   much since this code was being changed regularly, but that is no longer   *
 *   necessary overhead.                                                       *
 *                                                                             *
 *   Generation II has a zero contention split algorithm.  In the past, when a *
 *   thread became idle, it posted a global split request and any thread that  *
 *   was at an appropriate point would try to split.  But while it was doing   *
 *   the splitting, other threads that were also willing to split would "get   *
 *   in line" because Crafty used a global lock to prevent two threads from    *
 *   attempting to split at the same instant in time.  They got in line, and   *
 *   waited for the original splitter to release the lock, but by then they    *
 *   had no idle threads to split with.  A waste of time.  Now we allow ANY    *
 *   thread to attempt to split at the current ply.  When we do what might be  *
 *   called a "gratuitous split", the only restriction is that if we have      *
 *   existing "gratuitous split points" (split blocks that are joinable but    *
 *   have not yet been joined), then we limit the number of such splits (per   *
 *   thread) to avoid excessive overhead.                                      *
 *                                                                             *
 *   Generation II takes another idea from DTS, the idea of "late-join".  The  *
 *   idea is fairly simple.  If, when a thread becomes idle, there are already *
 *   other split points being searched in parallel, then we will try to join   *
 *   one of them rather than waiting for someone to ask us to help.  We use    *
 *   some simple criteria:  (1) The split point must be joinable, which simply *
 *   means that no processor has exited the split point yet (which would let   *
 *   us know there is no more work here and a join would be futile);  (2) We   *
 *   compute an "interest" value which is a simple formula based on depth at   *
 *   the split point, and the number of moves already searched.  It seems less *
 *   risky to choose a split point with max depth AND minimum moves already    *
 *   searched so that there is plenty to do.  This was quite simple to add     *
 *   after the rest of the Generation II rewrite.  In fact, this is now THE    *
 *   way threads join a split point, period, which further simplifies the code *
 *   and improves efficiency.  IE, a thread can split when idle threads are    *
 *   noticed, or if it is far enough away from the tips to make the cost       *
 *   negligible.  At that point any idle thread(s) can join immediately, and   *
 *   those that become idle later can join when they are ready.                *
 *                                                                             *
 *   There are a number of settable options via the command-line or .craftyrc  *
 *   initialization file.  Here's a concise explanation for each option and an *
 *   occasional suggestion for testing/tuning.                                 *
 *                                                                             *
 *   smp_affinity (command = smpaffinity=<n>) is used to enable or disable     *
 *      processor affinity.  -1 disables affinity and lets threads run on any  *
 *      available core.  If you use an integer <n> then thread zero will bind  *
 *      itself to cpu <n> and each additional thread will bind to the next     *
 *      higher cpu number.  This is useful if you try to run two copies of     *
 *      Crafty on the same machine; now you can cause one to bind to the first *
 *      <n> cores, and the second to the last <n> cores.  For the first        *
 *      instance of Crafty, you would use smpaffinity=0, and for the second    *
 *      smpaffinity=8, assuming you are running 8 threads per copy on a 16 cpu *
 *      machine.  If you get this wrong, you can have more than one thread on  *
 *      the same cpu which will significantly impact performance.              *
 *                                                                             *
 *   smp_max_threads (command = smpmt=n) sets the total number of allowable    *
 *      threads for the search.  The default is one (1) as Crafty does not     *
 *      assume it should use all available resources.  For optimal performance *
 *      this should be set to the number of physical cores your machine has,   *
 *      which does NOT include hyperthreaded cores.                            *
 *                                                                             *
 *   smp_split_group (command = smpgroup=n) sets the maximum number of threads *
 *      at any single split point, with the exception of split points fairly   *
 *      close to the root where ALL threads are allowed to split together,     *
 *      ignoring this limit.  Note that this is ignored in the first 1/2 of    *
 *      the tree (the nodes closer to the root).  There it is actually good to *
 *      split and get all active threads involved.                             *
 *                                                                             *
 *   smp_min_split_depth (command = smpmin=n) avoids splitting when remaining  *
 *      depth < n.  This is used to balance splitting overhead cost against    *
 *      the speed gain the parallel search produces.  The default is currently *
 *      5 (which could change with future generations of Intel hardware) but   *
 *      values between 4 and 8 will work.  Larger values allow somewhat fewer  *
 *      splits, which reduces overhead, but it also increases the percentage   *
 *      of the time where a thread is waiting on work.                         *
 *                                                                             *
 *   smp_split_at_root (command = smproot=0 or 1) enables (1) or disables (0)  *
 *      splitting the tree at the root.  This defaults to 1 which produces the *
 *      best performance by a significant margin.  But it can be disabled if   *
 *      you are playing with code changes.                                     *
 *                                                                             *
 *   smp_gratuitous_depth (command = smpgd=<n>) controls "gratuitous splits"   *
 *      which are splits that are done without any idle threads.  This sets a  *
 *      depth limit (remaining depth) that must be present before such a split *
 *      can be done.  Making this number larger will reduce the number of      *
 *      these splits.  Making it too small will increase overhead slightly and *
 *      increase split block usage significantly.                              *
 *                                                                             *
 *   smp_gratuitous_limit (command = smpgl=<n>) limits the number of these     *
 *      splits that a thread can do.  Once a thread has this number of         *
 *      unjoined split points, it will not be allowed to split any more until  *
 *      one or more threads join at least one of the existing split points.    *
 *      In the smp search statistics, where you see output that looks like     *
 *      this:                                                                  *
 *                                                                             *
 *        splits=m(n) ...                                                      *
 *                                                                             *
 *      m is the total splits done, n is the number of "wasted splits" which   *
 *      are basically gratuitous splits where no thread joined before this     *
 *      split point was completed and deallocated.                             *
 *                                                                             *
 *   The best way to tune all of these parameters is to use the "autotune"     *
 *   command (see autotune.c and help autotune) which will automatically run   *
 *   tests and optimize the parameters.  More details are in the autotune.c    *
 *   source file.                                                              *
 *                                                                             *
 *   A few basic "rules of the road" for anyone interested in changing or      *
 *   adding to any of this code.                                               *
 *                                                                             *
 *   1.  If, at any time, you want to modify your private split block, no lock *
 *       is required.                                                          *
 *                                                                             *
 *   2.  If, at any time, you want to modify another split block, such as the  *
 *       parent split block shared move list, you must acquire the lock in the *
 *       split block first.  IE tree->parent->lock to lock the parent split    *
 *       block during NextMove() and such.                                     *
 *                                                                             *
 *   3.  If you want to modify any SMP-related data that spans multiple split  *
 *       blocks, such as telling sibling processes to stop, etc, then you must *
 *       acquire the global "lock_smp" lock first.  This prevents two threads  *
 *       from walking the split block chain from different directions,         *
 *       acquiring the split block locks in different orders, and causing a    *
 *       catastrophic deadlock.                                                *
 *       This is an infrequent event so the overhead is not significant.       *
 *                                                                             *
 *   4.  If you want to do any sort of I/O operation, you must acquire the     *
 *       "lock_io" lock first.  Since threads share descriptors, there are     *
 *       lots of potential race conditions, from the simple tty-output getting *
 *       interlaced from different threads, to outright data corruption in the *
 *       book or log files.                                                    *
 *                                                                             *
 *   Some of the bugs caused by failing to acquire the correct lock will only  *
 *   occur infrequently, and they are extremely difficult to find.  Some only  *
 *   show up in a public game where everyone is watching, causing maximum      *
 *   embarrassment when the program does something extremely stupid.           *
 *                                                                             *
 *******************************************************************************
 */
int Split(TREE * RESTRICT tree) {
  TREE *child;
  int tid, tstart, tend;

/*
 ************************************************************
 *                                                          *
 *  Here we prepare to split the tree.  All we really do in *
 *  the Generation II threading is grab a split block for   *
 *  this thread, then flag the parent as "joinable" and     *
 *  then jump right to ThreadWait() to resume where we left *
 *  off, with the expectation (but not a requirement) that  *
 *  other threads will join us to help.                     *
 *                                                          *
 *  Idle threads are sitting in ThreadWait() repeatedly     *
 *  calling Join() to find them a split point, which we are *
 *  about to provide.  They will then join as quickly as    *
 *  they can, and other threads that become idle later can  *
 *  also join without any further splitting needed.         *
 *                                                          *
 *  If we are unable to allocate a split block, we simply   *
 *  abort this attempted split and return to the search     *
 *  since other threads will also split quickly.            *
 *                                                          *
 ************************************************************
 */
  tstart = ReadClock();
  tree->nprocs = 0;
  for (tid = 0; tid < smp_max_threads; tid++)
    tree->siblings[tid] = 0;
  child = GetBlock(tree, tree->thread_id);
  if (!child)
    return 0;
  CopyFromParent(child);
  thread[tree->thread_id].tree = child;
  tree->joined = 0;
  tree->joinable = 1;
  parallel_splits++;
  smp_split = 0;
  tend = ReadClock();
  thread[tree->thread_id].idle += tend - tstart;
/*
 ************************************************************
 *                                                          *
 *  We have successfully created a split point, which means *
 *  we are done.  The instant we set the "joinable" flag,   *
 *  idle threads may begin to join in at this split point   *
 *  to help.  Since this thread may finish before any or    *
 *  all of the other parallel threads, this thread is sent  *
 *  to ThreadWait() which will immediately send it to       *
 *  SearchMoveList() like the other threads; however, going *
 *  to ThreadWait() allows this thread to join others if it *
 *  runs out of work to do.  We do pass ThreadWait() the    *
 *  address of the parent split block, so that if this      *
 *  thread becomes idle, and the split block shows no       *
 *  threads are still busy, then this thread can return to  *
 *  here and then back up into the previous ply as it       *
 *  should.  Note that no other thread can back up to the   *
 *  previous ply since their recursive call stacks are not  *
 *  set for that, while this call stack will bring us back  *
 *  to this point where we return to the normal search,     *
 *  which we just completed.                                *
 *                                                          *
 ************************************************************
 */
  ThreadWait(tree->thread_id, tree);
  if (!tree->joined)
    parallel_splits_wasted++;
  return 1;
}

/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   Join() is called just when we enter the usual spin-loop waiting for work. *
 *   We take a quick look at all active split blocks to see if any look        *
 *   "joinable".  If so, we compute an "interest" value, which will be defined *
 *   below.  We then join the most interesting split point directly.  This     *
 *   split point might have been created specifically for this thread to join, *
 *   or it might be one that was already active when this thread became idle,  *
 *   which allows us to join that existing split point and not request a new   *
 *   split operation, saving time.                                             *
 *                                                                             *
 *******************************************************************************
 */
int Join(int64_t tid) {
  TREE *tree, *join_block, *child;
  int interest, best_interest, current, pass = 0;

/*
 ************************************************************
 *                                                          *
 *  First we pass over ALL split blocks, looking for those  *
 *  flagged as "joinable" (which means they are actually    *
 *  active split points and that no processor at that split *
 *  point has run out of work (there is no point in joining *
 *  a split point with no remaining work) and no fail high  *
 *  has been found which would raise the "stop" flag).      *
 *  This is "racy" because we do not acquire any locks,     *
 *  which means that the busy threads continue working, and *
 *  there is a small probability that the split point will  *
 *  disappear while we are in this loop.  To resolve the    *
 *  potential race, after we find the most attractive split *
 *  point, we acquire the lock for that split block and     *
 *  test again, but this time if the block is joinable, we  *
 *  can safely join under control of the lock, which is not *
 *  held for very long at all.  If the block is not         *
 *  joinable once we acquire the lock, we abort joining     *
 *  since it is futile.  Note that if this happens, we will *
 *  try to find an existing split point we can join three   *
 *  times before we exit, setting smp_split to 1 to ask     *
 *  other threads to produce more candidate split points.   *
 *                                                          *
 *  Special case:  We don't want to join a split point that *
 *  was created by this thread.  While it works, it can add *
 *  overhead since we can encounter a later split point     *
 *  that originated at the current split point, and we      *
 *  would continue searching even though most of the work   *
 *  has already been completed.  The hash table would help  *
 *  avoid most (if not all) of this overhead, but there is  *
 *  no good reason to take the chance of this happening.    *
 *                                                          *
 ************************************************************
 */
  for (pass = 0; pass < 3; pass++) {
    best_interest = -999999;
    join_block = 0;
    for (current = 0; current <= smp_max_threads * 64; current++) {
      tree = block[current];
      if (tree->joinable && (tree->ply <= tree->depth / 2 ||
              tree->nprocs < smp_split_group) && tree->thread_id != tid) {
        interest = tree->depth * 2 - tree->searched[0];
        if (interest > best_interest) {
          best_interest = interest;
          join_block = tree;
        }
      }
    }
/*
 ************************************************************
 *                                                          *
 *  Now we acquire the lock for this split block, and then  *
 *  check to see if the block is still flagged as joinable. *
 *  If so, we set things up, and then we get pretty tricky  *
 *  as we then release the lock, and then copy the data     *
 *  from the parent to our split block.  There is a chance  *
 *  that while we are copying this data, the split point    *
 *  gets completed by other threads.  That would leave an   *
 *  apparent race condition exposed where we start copying  *
 *  data here, the split point is completed, the parent     *
 *  block is released and then reacquired, and we continue  *
 *  as if nothing has happened here, getting data copied    *
 *  from two different positions.                           *
 *                                                          *
 *  Fortunately, we linked this new split block to the old  *
 *  (original parent).  If that split block is released, we *
 *  will discover this because that operation will also set *
 *  our "stop" flag which will prevent us from using this   *
 *  data and breaking things.  We allow threads to copy     *
 *  this data without any lock protection to eliminate a    *
 *  serialization (each node would copy the data serially,  *
 *  rather than all at once) with the only consequence      *
 *  being the overhead of copying and then throwing the     *
 *  data away, which can happen on occasion even if we      *
 *  used a lock for protection, since once we release the   *
 *  lock it still takes time to get into the search and we  *
 *  could STILL find that this split block has already been *
 *  completed, once again.  Less contention and less        *
 *  serialization improve performance.                      *
 *                                                          *
 ************************************************************
 */
    if (join_block) {
      Lock(join_block->lock);
      if (join_block->joinable) {
        child = GetBlock(join_block, tid);
        Unlock(join_block->lock);
        if (child) {
          CopyFromParent(child);
          thread[tid].tree = child;
          parallel_joins++;
          return 1;
        }
      } else {
        Unlock(join_block->lock);
        break;
      }
    }
  }
363
/*
364
 ************************************************************
365
 *                                                          *
108 pmbaty 366
 *  We did not acquire a split point to join, so we set     *
367
 *  smp_split to 1 to ask busy threads to create joinable   *
368
 *  split points.                                           *
33 pmbaty 369
 *                                                          *
370
 ************************************************************
371
 */
  smp_split = 1;
  return 0;
}

/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   ThreadAffinity() is called to "pin" a thread to a specific processor.  It *
 *   is a "noop" (no-operation) if Crafty was not compiled with -DAFFINITY, or *
 *   if smp_affinity is negative (smpaffinity=-1 disables affinity).  It       *
 *   simply sets the affinity for the current thread to the requested CPU and  *
 *   returns.  NOTE:  If hyperthreading is enabled, there is no guarantee that *
 *   this will work as expected and pin one thread per physical core.  It      *
 *   depends on how the O/S numbers the SMT cores.                             *
 *                                                                             *
 *******************************************************************************
 */
void ThreadAffinity(int cpu) {
#if defined(AFFINITY)
  cpu_set_t cpuset;
  pthread_t current_thread = pthread_self();

  if (smp_affinity >= 0) {
    CPU_ZERO(&cpuset);
    CPU_SET(smp_affinity_increment * (cpu + smp_affinity), &cpuset);
    pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
  }
#endif
}

/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   ThreadInit() is called after a thread is created.  Its main task is to    *
 *   initialize the thread's local memory so that it will fault in and be     *
 *   allocated on the local node rather than the node where the original      *
 *   (first) thread was running.  All threads will hang here via a custom     *
 *   WaitForAllThreadsInitialized() procedure so that all the local thread     *
 *   blocks are usable before the search actually begins.                      *
 *                                                                             *
 *******************************************************************************
 */
void *STDCALL ThreadInit(void *t) {
  int tid = (int64_t) t;

  ThreadAffinity(tid);
#if !defined(UNIX)
  ThreadMalloc((uint64_t) tid);
#endif
  thread[tid].blocks = 0xffffffffffffffffull;
  Lock(lock_smp);
  initialized_threads++;
  Unlock(lock_smp);
  WaitForAllThreadsInitialized();
  ThreadWait(tid, (TREE *) 0);
  Lock(lock_smp);
  smp_threads--;
  Unlock(lock_smp);
  return 0;
}

/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   ThreadSplit() is used to determine if we should split at the current ply. *
 *   There are some basic constraints on when splits can be done, such as the  *
 *   depth remaining in the search (don't split too near the tips), and        *
 *   whether we have searched at least one move to get a score or bound (the   *
 *   YBW condition).                                                           *
 *                                                                             *
 *   If those conditions are satisfied, AND either a thread has requested a    *
 *   split OR we are far enough away from the tips of the tree to justify a    *
 *   "gratuitous split", then we return "success".  A "gratuitous split" is a  *
 *   split done without any idle threads.  Since splits are not free, we only  *
 *   do this well away from the tips to limit overhead.  We do this so that    *
 *   when a thread becomes idle, it will find these split points immediately   *
 *   and not have to wait for a split after the fact.                          *
 *                                                                             *
 *******************************************************************************
 */
int ThreadSplit(TREE * tree, int ply, int depth, int alpha, int o_alpha,
    int done) {
  TREE *used;
  int64_t tblocks;
  int temp, unused = 0;

/*
 ************************************************************
 *                                                          *
 *  First, we see if we meet the basic criteria to create a *
 *  split point, that being that we must not be too far     *
 *  from the root (smp_min_split_depth).                    *
 *                                                          *
 ************************************************************
 */
  if (depth < smp_min_split_depth)
    return 0;
/*
 ************************************************************
 *                                                          *
 *  If smp_split is NOT set, we are checking to see if it   *
 *  is acceptable to do a gratuitous split here.            *
 *                                                          *
 *  (1) if we are too far from the root we do not do        *
 *      gratuitous splits to avoid the overhead.            *
 *                                                          *
 *  (2) if we have searched more than one move at this ply, *
 *      we don't do any further tests to see if a           *
 *      gratuitous split is acceptable, since we have       *
 *      previously done this test at this ply and decided   *
 *      one should not be done.  That condition has likely  *
 *      not changed.                                        *
 *                                                          *
 *  (3) if we have pre-existing gratuitous split points for *
 *      this thread, we make certain we don't create more   *
 *      than the gratuitous split limit as excessive splits *
 *      just add to the overhead with no benefit.           *
 *                                                          *
 ************************************************************
 */
  if (!smp_split) {
    if (depth < smp_gratuitous_depth || done > 1)
      return 0;
    tblocks = ~thread[tree->thread_id].blocks;
    while (tblocks) {
      temp = LSB(tblocks);
      used = block[temp + tree->thread_id * 64 + 1];
      if (used->joinable && !used->joined)
        unused++;
      Clear(temp, tblocks);
    }
    if (unused > smp_gratuitous_limit)
      return 0;
  }
/*
 ************************************************************
 *                                                          *
 *  If smp_split IS set, we are checking to see if it is    *
 *  acceptable to do a split because there are idle threads *
 *  that need work to do.                                   *
 *                                                          *
 *  The only reason this would be false is if we have a     *
 *  pre-existing split point that is joinable but has not   *
 *  been joined.  If one exists, there is no need to split  *
 *  again as there is already an accessible split point.    *
 *  Otherwise, if we are at the root and we are either not  *
 *  allowed to split at the root, or we have additional     *
 *  root moves that have to be searched one at a time using *
 *  all available threads, we also can not split here.      *
 *                                                          *
 ************************************************************
 */
  else {
    if (ply == 1 && (!smp_split_at_root || !NextRootMoveParallel() ||
            alpha == o_alpha))
      return 0;
    tblocks = ~thread[tree->thread_id].blocks;
    while (tblocks) {
      temp = LSB(tblocks);
      used = block[temp + tree->thread_id * 64 + 1];
      if (used->joinable && !used->joined)
        unused++;
      Clear(temp, tblocks);
    }
    if (unused > smp_gratuitous_limit)
      return 0;
  }
  return 1;
}
 
/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   ThreadStop() is called from SearchMoveList() when it detects a beta       *
 *   cutoff (fail high) at a node that is being searched in parallel.  We need *
 *   to stop all threads here, and since this threading algorithm is recursive *
 *   it may be necessary to stop other threads that are helping search this    *
 *   branch further down into the tree.  This function simply sets appropriate *
 *   tree->stop variables to 1, which will stop those particular threads       *
 *   instantly and return them to the idle loop in ThreadWait().               *
 *                                                                             *
 *******************************************************************************
 */
void ThreadStop(TREE * RESTRICT tree) {
  int proc;
 
  Lock(tree->lock);
  tree->stop = 1;
  tree->joinable = 0;
  for (proc = 0; proc < smp_max_threads; proc++)
    if (tree->siblings[proc])
      ThreadStop(tree->siblings[proc]);
  Unlock(tree->lock);
}
 
/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   ThreadTrace() is a debugging tool that simply walks the split block tree  *
 *   and displays interesting data to help debug the parallel search whenever  *
 *   changes break things.                                                     *
 *                                                                             *
 *******************************************************************************
 */
void ThreadTrace(TREE * RESTRICT tree, int depth, int brief) {
  int proc, i;
 
  Lock(tree->lock);
  Lock(lock_io);
  if (!brief) {
    for (i = 0; i < 4 * depth; i++)
      Print(4095, " ");
    depth++;
    Print(4095, "block[%d]  thread=%d  ply=%d  nprocs=%d  ",
        FindBlockID(tree), tree->thread_id, tree->ply, tree->nprocs);
    Print(4095, "joined=%d  joinable=%d  stop=%d  nodes=%d", tree->joined,
        tree->joinable, tree->stop, tree->nodes_searched);
    Print(4095, "  parent=%d\n", FindBlockID(tree->parent));
  } else {
    if (tree->nprocs > 1) {
      for (i = 0; i < 4 * depth; i++)
        Print(4095, " ");
      depth++;
      Print(4095, "(ply %d)", tree->ply);
    }
  }
  if (tree->nprocs) {
    if (!brief) {
      for (i = 0; i < 4 * depth; i++)
        Print(4095, " ");
      Print(4095, "          parent=%d  sibling threads=",
          FindBlockID(tree->parent));
      for (proc = 0; proc < smp_max_threads; proc++)
        if (tree->siblings[proc])
          Print(4095, " %d(%d)", proc, FindBlockID(tree->siblings[proc]));
      Print(4095, "\n");
    } else {
      if (tree->nprocs > 1) {
        Print(4095, " helping= ");
        for (proc = 0; proc < smp_max_threads; proc++)
          if (tree->siblings[proc]) {
            if (proc == tree->thread_id)
              Print(4095, "[");
            Print(4095, "%d", proc);
            if (proc == tree->thread_id)
              Print(4095, "]");
            Print(4095, " ");
          }
        Print(4095, "\n");
      }
    }
  }
  Unlock(lock_io);
  for (proc = 0; proc < smp_max_threads; proc++)
    if (tree->siblings[proc])
      ThreadTrace(tree->siblings[proc], depth, brief);
  Unlock(tree->lock);
}
 
/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   ThreadWait() is the idle loop for the N threads that are created at the   *
 *   beginning when Crafty searches.  Threads are "parked" here waiting on a   *
 *   pointer to something they should search (a parameter block built in the   *
 *   function Split() in this case).  When this pointer becomes non-zero, each *
 *   thread "parked" here will immediately call SearchMoveList() and begin the *
 *   parallel search as directed.                                              *
 *                                                                             *
 *   Upon entry, all threads except for the "master" will arrive here with a   *
 *   value of zero (0) in the waiting parameter below.  This indicates that    *
 *   they will search and then be done.  The "master" will arrive here with a  *
 *   pointer to the parent split block in "waiting" which says I will sit here *
 *   waiting for work OR until the waiting split block has no threads working  *
 *   on it, at which point I should return, which will let me "unsplit" here   *
 *   and clean things up.  The call to here in Split() passes this block       *
 *   address while threads that are helping get here with a zero.              *
 *                                                                             *
 *******************************************************************************
 */
int ThreadWait(int tid, TREE * RESTRICT waiting) {
  int value, tstart, tend;
 
/*
 ************************************************************
 *                                                          *
 *  When we reach this point, one of three possible         *
 *  conditions is true:  (1) we already have work to do, as *
 *  we are the "master thread" and we have already split    *
 *  the tree, so we are coming here to join in;  (2) we are *
 *  the master, and we are waiting on our split point to    *
 *  complete, so we come here to join and help currently    *
 *  active threads;  (3) we have no work to do, so we will  *
 *  spin until Join() locates a split point we can join to  *
 *  help out.                                               *
 *                                                          *
 *  Note that when we get here, the parent already has a    *
 *  split block and does not need to call Join(), it simply *
 *  falls through the while spin loop below because its     *
 *  "tree" pointer is already non-zero.                     *
 *                                                          *
 ************************************************************
 */
  while (FOREVER) {
    tstart = ReadClock();
    while (!thread[tid].tree && (!waiting || waiting->nprocs) && !Join(tid) &&
        !thread[tid].terminate);
    tend = ReadClock();
    if (!thread[tid].tree)
      thread[tid].tree = waiting;
    thread[tid].idle += tend - tstart;
    if (thread[tid].tree == waiting || thread[tid].terminate)
      return 0;
/*
 ************************************************************
 *                                                          *
 *  Once we get here, we have a good split point, so we are *
 *  ready to participate in a parallel search.  Once we     *
 *  return from SearchMoveList() we copy our results back   *
 *  to the parent via CopyToParent() before we look for a   *
 *  new split point.  If we are a parent, we will slip out  *
 *  of the spin loop at the top and return to the normal    *
 *  serial search to finish up here.                        *
 *                                                          *
 *  When we return from SearchMoveList(), we need to        *
 *  decrement the "nprocs" value since there is now one     *
 *  less thread working at this split point.                *
 *                                                          *
 *  Note:  CopyToParent() marks the current split block as  *
 *  unused once the copy is completed, so we don't have to  *
 *  do anything about that here.                            *
 *                                                          *
 ************************************************************
 */
    value =
        SearchMoveList(thread[tid].tree, thread[tid].tree->ply,
        thread[tid].tree->depth, thread[tid].tree->wtm,
        thread[tid].tree->alpha, thread[tid].tree->beta,
        thread[tid].tree->searched, thread[tid].tree->in_check, 0, parallel);
    tstart = ReadClock();
    Lock(thread[tid].tree->parent->lock);
    thread[tid].tree->parent->joinable = 0;
    CopyToParent((TREE *) thread[tid].tree->parent, thread[tid].tree, value);
    thread[tid].tree->parent->nprocs--;
    thread[tid].tree->parent->siblings[tid] = 0;
    Unlock(thread[tid].tree->parent->lock);
    thread[tid].tree = 0;
    tend = ReadClock();
    thread[tid].idle += tend - tstart;
  }
}
 
/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   CopyFromParent() is used to copy data from a parent thread to a child     *
 *   thread.  This only copies the appropriate parts of the TREE structure to  *
 *   avoid burning memory bandwidth by copying everything.                     *
 *                                                                             *
 *******************************************************************************
 */
void CopyFromParent(TREE * RESTRICT child) {
  TREE *parent = child->parent;
  int i, ply;
 
/*
 ************************************************************
 *                                                          *
 *  We have allocated a split block.  Now we copy the tree  *
 *  search state from the parent block to the child in      *
 *  preparation for starting the parallel search.           *
 *                                                          *
 ************************************************************
 */
  ply = parent->ply;
  child->ply = ply;
  child->position = parent->position;
  for (i = 0; i <= rep_index + parent->ply; i++)
    child->rep_list[i] = parent->rep_list[i];
  for (i = ply - 1; i < MAXPLY; i++)
    child->killers[i] = parent->killers[i];
  for (i = 0; i < 4096; i++) {
    child->counter_move[i] = parent->counter_move[i];
    child->move_pair[i] = parent->move_pair[i];
  }
  for (i = ply - 1; i <= ply; i++) {
    child->curmv[i] = parent->curmv[i];
    child->pv[i] = parent->pv[i];
  }
  child->in_check = parent->in_check;
  child->last[ply] = child->move_list;
  child->status[ply] = parent->status[ply];
  child->status[1] = parent->status[1];
  child->save_hash_key[ply] = parent->save_hash_key[ply];
  child->save_pawn_hash_key[ply] = parent->save_pawn_hash_key[ply];
  child->nodes_searched = 0;
  child->fail_highs = 0;
  child->fail_high_first_move = 0;
  child->evaluations = 0;
  child->egtb_probes = 0;
  child->egtb_hits = 0;
  child->extensions_done = 0;
  child->qchecks_done = 0;
  child->moves_fpruned = 0;
  child->moves_mpruned = 0;
  for (i = 0; i < 16; i++) {
    child->LMR_done[i] = 0;
    child->null_done[i] = 0;
  }
  child->alpha = parent->alpha;
  child->beta = parent->beta;
  child->value = parent->value;
  child->wtm = parent->wtm;
  child->depth = parent->depth;
  child->searched = parent->searched;
  strcpy(child->root_move_text, parent->root_move_text);
  strcpy(child->remaining_moves_text, parent->remaining_moves_text);
}
 
/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   CopyToParent() is used to copy data from a child thread to a parent       *
 *   thread.  This only copies the appropriate parts of the TREE structure to  *
 *   avoid burning memory bandwidth by copying everything.                     *
 *                                                                             *
 *******************************************************************************
 */
void CopyToParent(TREE * RESTRICT parent, TREE * RESTRICT child, int value) {
  int i, ply = parent->ply, which;
 
/*
 ************************************************************
 *                                                          *
 *  The only concern here is to make sure that the info is  *
 *  only copied to the parent if our score is better than   *
 *  the parent's value, and that we were not stopped for    *
 *  any reason, which could produce a partial score that is *
 *  worthless and dangerous to use.                         *
 *                                                          *
 *  One important special case.  If we get here with the    *
 *  thread->stop flag set, and ply is 1, then we need to    *
 *  clear the "this move has been searched" flag in the ply *
 *  1 move list since we did not complete the search.  If   *
 *  we fail to do this, then a move being searched in       *
 *  parallel at the root will be "lost" for this iteration  *
 *  and won't be searched again until the next iteration.   *
 *                                                          *
 *  In any case, we add our statistical counters to the     *
 *  parent's totals no matter whether we finished or not,   *
 *  since the total nodes searched and such should consider *
 *  everything searched, not just the "useful stuff."       *
 *                                                          *
 *  After we finish copying everything, we mark this split  *
 *  block as free in the split block bitmap.                *
 *                                                          *
 ************************************************************
 */
  if (child->nodes_searched && !child->stop && value > parent->value &&
      !abort_search) {
    parent->pv[ply] = child->pv[ply];
    parent->value = value;
    parent->cutmove = child->curmv[ply];
    for (i = 0; i < 4096; i++) {
      parent->counter_move[i] = child->counter_move[i];
      parent->move_pair[i] = child->move_pair[i];
    }
  }
  if (child->stop && ply == 1)
    for (which = 0; which < n_root_moves; which++)
      if (root_moves[which].move == child->curmv[ply]) {
        root_moves[which].status &= 7;
        break;
      }
  parent->nodes_searched += child->nodes_searched;
  parent->fail_highs += child->fail_highs;
  parent->fail_high_first_move += child->fail_high_first_move;
  parent->evaluations += child->evaluations;
  parent->egtb_probes += child->egtb_probes;
  parent->egtb_hits += child->egtb_hits;
  parent->extensions_done += child->extensions_done;
  parent->qchecks_done += child->qchecks_done;
  parent->moves_fpruned += child->moves_fpruned;
  parent->moves_mpruned += child->moves_mpruned;
  for (i = 1; i < 16; i++) {
    parent->LMR_done[i] += child->LMR_done[i];
    parent->null_done[i] += child->null_done[i];
  }
  which = FindBlockID(child) - 64 * child->thread_id - 1;
  Set(which, thread[child->thread_id].blocks);
}
 
/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   GetBlock() is used to allocate a split block and fill in only SMP-        *
 *   critical information.  The child process will copy the rest of the split  *
 *   block information as needed.                                              *
 *                                                                             *
 *   When we arrive here, the parent split block must be locked since we are   *
 *   going to change data in that block as well as copy data from that block   *
 *   to the current split block.  The only exception is when this is the       *
 *   original split operation, since this is done "out of sight" of other      *
 *   threads, which means no locks are needed until after the "joinable" flag  *
 *   is set, which exposes this split point to other threads instantly.        *
 *                                                                             *
 *******************************************************************************
 */
TREE *GetBlock(TREE * RESTRICT parent, int tid) {
  TREE *child;
  static int warnings = 0;
  int i, unused;
/*
 ************************************************************
 *                                                          *
 *  One NUMA-related trick is that we only allocate a split *
 *  block in the thread's local memory.  Each thread has a  *
 *  group of split blocks that were first touched by the    *
 *  correct CPU so that the split blocks page-faulted into  *
 *  local memory for that specific processor.  If we can't  *
 *  find an optimally-placed block, we return a zero which  *
 *  will prevent this thread from joining the split point.  *
 *  This is extremely unlikely, as it would mean the        *
 *  current thread has 64 active split blocks where it is   *
 *  waiting on other threads to complete the last bit of    *
 *  work at each.                                           *
 *                                                          *
 *  Here we use a simple 64 bit bit-map per thread that     *
 *  indicates which blocks are free (1) and which blocks    *
 *  are used (0).  We simply use LSB() to find the          *
 *  rightmost one-bit set and that is the local block       *
 *  number.  We convert that to a global block number by    *
 *  doing the simple computation:                           *
 *                                                          *
 *     global = local + 64 * tid + 1                        *
 *                                                          *
 *  Each thread has exactly 64 split blocks, and block 0    *
 *  is the "master block" that never gets allocated or      *
 *  freed.  Once we find a free block for the current       *
 *  thread, we zero that bit so that the block won't be     *
 *  used again until it is released.                        *
 *                                                          *
 ************************************************************
 */
  if (thread[tid].blocks) {
    unused = LSB(thread[tid].blocks);
    Clear(unused, thread[tid].blocks);
    Set(unused, thread[tid].max_blocks);
  } else {
    if (++warnings < 6)
      Print(2048,
          "WARNING.  local SMP block cannot be allocated, thread %d\n", tid);
    return 0;
  }
  child = block[unused + tid * 64 + 1];
/*
 ************************************************************
 *                                                          *
 *  Found a split block.  Now we need to fill in only the   *
 *  critical information that can't be delayed due to race  *
 *  conditions.  When we get here, the parent split block   *
 *  must be locked; that lets us safely update the number   *
 *  of processors working here, etc., without any ugly race *
 *  conditions that would corrupt this critical data.       *
 *                                                          *
 ************************************************************
 */
  for (i = 0; i < smp_max_threads; i++)
    child->siblings[i] = 0;
  child->nprocs = 0;
  child->stop = 0;
  child->joinable = 0;
  child->joined = 0;
  child->parent = parent;
  child->thread_id = tid;
  parent->nprocs++;
  parent->siblings[tid] = child;
  parent->joined = 1;
  return child;
}
 
/*
 *******************************************************************************
 *                                                                             *
 *   WaitForAllThreadsInitialized() waits until all smp_max_threads are        *
 *   initialized.  We have to initialize each thread and malloc() its split    *
 *   blocks before we start the actual parallel search.  Otherwise we will see *
 *   invalid memory accesses and crash instantly.                              *
 *                                                                             *
 *******************************************************************************
 */
void WaitForAllThreadsInitialized(void) {
  while (initialized_threads < smp_max_threads); /* Do nothing */
}
 
#if !defined (UNIX)
/* modified 08/03/16 */
/*
 *******************************************************************************
 *                                                                             *
 *   ThreadMalloc() is called from the ThreadInit() function.  It malloc's the *
 *   split blocks in the local memory for the processor associated with the    *
 *   specific thread that is calling this code.                                *
 *                                                                             *
 *******************************************************************************
 */
extern void *WinMalloc(size_t, int);
void ThreadMalloc(int64_t tid) {
  int i;
 
  for (i = tid * 64 + 1; i < tid * 64 + 65; i++) {
    if (block[i] == NULL)
      block[i] =
          (TREE *) ((~(size_t) 127) & (127 + (size_t) WinMalloc(sizeof(TREE) +
                  127, tid)));
    block[i]->parent = NULL;
    LockInit(block[i]->lock);
  }
}
#endif