#include "chess.h"
#include "data.h"
#include "epdglue.h"
#if (CPUS > 1)
/* modified 11/04/15 */
/*
*******************************************************************************
*                                                                             *
*   Split() is the driver for the threaded parallel search in Crafty.  The   *
*   basic idea is that whenever we notice that one (or more) threads are in  *
*   their idle loop, we drop into Split(), from Search(), and begin a new    *
*   parallel search from this node.  This is simply a problem of             *
*   establishing a new split point, and then letting each thread join this   *
*   split point and copy whatever data they need.                            *
*                                                                             *
*   This is generation II of Split().  The primary differences address two   *
*   specific performance-robbing issues:  (1) excessive waiting for a split  *
*   to be done, and (2) excessive waiting on specific locks.  Generation II  *
*   addresses both of these to significantly improve performance.            *
*                                                                             *
*   The main difference between Gen I and Gen II is the effort required to   *
*   split the tree and which thread(s) expend this effort.  In generation I, *
*   the parent thread was responsible for allocating a split block for each  *
*   child thread, and then copying the necessary data from the parent split  *
*   block to these child split blocks.  When all of this was completed, the  *
*   child processes were released to start the parallel search after being   *
*   held while the split / copy operations were done.  In the generation II  *
*   Split() we now simply allocate a new split block for THIS thread, flag   *
*   the parent split block as joinable, and then go directly to ThreadWait() *
*   which will drop us back in to the search immediately.  The idle threads  *
*   are continually looping on Join() which will jump them right into this   *
*   split block, letting them do ALL the work of allocating a split block,   *
*   filling it in, and then copying the data to their local split block.     *
*   This distributes the split overhead among all the threads that split,    *
*   rather than this thread having to do all the work while the other        *
*   threads sit idle.                                                        *
*                                                                             *
*   Generation II is also much more lightweight, in that it copies only the  *
*   bare minimum from parent to child.  Generation I safely copied far too   *
*   much since this code was being changed regularly, but that is no longer  *
*   necessary overhead.                                                      *
*                                                                             *
*   Generation II has a zero contention split algorithm.  In the past, when  *
*   a thread became idle, it posted a global split request and any thread    *
*   that was at an appropriate point would try to split.  But while it was   *
*   doing the splitting, other threads that were also willing to split       *
*   would "get in line" because Crafty used a global lock to prevent two     *
*   threads from attempting to split at the same instant in time.  They got  *
*   in line, and waited for the original splitter to release the lock, but   *
*   by then they had no idle threads to split with.  A waste of time.  Now   *
*   we allow ANY thread to attempt to split at the current ply.  When we do  *
*   what might be called a "gratuitous split", the only restriction is that  *
*   if we have existing "gratuitous split points" (split blocks that are     *
*   joinable but have not yet been joined), then we limit the number of      *
*   such splits (per thread) to avoid excessive overhead.                    *
*                                                                             *
*   Generation II takes another idea from DTS, the idea of "late-join".  The *
*   idea is fairly simple.  If, when a thread becomes idle, there are        *
*   already other split points being searched in parallel, then we will try  *
*   to join one of them rather than waiting for someone to ask us to help.   *
*   We use some simple criteria:  (1) the split point must be joinable,      *
*   which simply means that no processor has exited the split point yet      *
*   (which would let us know there is no more work here and a join would be  *
*   futile); (2) we compute an "interest" value, which is a simple formula   *
*   based on the depth at the split point and the number of moves already    *
*   searched.  It seems less risky to choose a split point with max depth    *
*   AND minimum moves already searched so that there is plenty to do.  This  *
*   was quite simple to add after the rest of the Generation II rewrite.     *
*   In fact, this is now THE way threads join a split point, period, which   *
*   further simplifies the code and improves efficiency.  IE a thread can    *
*   split when idle threads are noticed, or if it is far enough away from    *
*   the tips to make the cost negligible.  At that point any idle thread(s)  *
*   can join immediately, and those that become idle later can join when     *
*   they are ready.                                                          *
*                                                                             *
*   There are a number of settable options via the command-line or .craftyrc *
*   initialization file.  Here's a concise explanation for each option and   *
*   an occasional suggestion for testing/tuning.                             *
*                                                                             *
*   smp_affinity (command = smpaffinity=<n>) is used to enable or disable    *
*      processor affinity.  -1 disables affinity and lets threads run on     *
*      any available core.  If you use an integer <n>, then thread zero      *
*      will bind itself to cpu <n> and each additional thread will bind to   *
*      the next higher cpu number.  This is useful if you run two copies of  *
*      Crafty on the same machine; you can cause one to bind to the first    *
*      <n> cores, and the second to the last <n> cores.  For the first       *
*      instance of Crafty, you would use smpaffinity=0, and for the second   *
*      smpaffinity=8, assuming you are running 8 threads per copy on a 16    *
*      cpu machine.  If you get this wrong, you can have more than one       *
*      thread on the same cpu, which will significantly impact performance.  *
*                                                                             *
*   smp_max_threads (command = smpmt=n) sets the total number of allowable   *
*      threads for the search.  The default is one (1) as Crafty does not    *
*      assume it should use all available resources.  For optimal            *
*      performance this should be set to the number of physical cores your   *
*      machine has, which does NOT include hyperthreaded cores.              *
*                                                                             *
*   smp_split_group (command = smpgroup=n) sets the maximum number of        *
*      threads at any single split point, with the exception of split        *
*      points fairly close to the root where ALL threads are allowed to      *
*      split together, ignoring this limit.  Note that this is ignored in    *
*      the first 1/2 of the tree (the nodes closer to the root).  There it   *
*      is actually good to split and get all active threads involved.        *
*                                                                             *
*   smp_min_split_depth (command = smpmin=n) avoids splitting when the       *
*      remaining depth < n.  This is used to balance splitting overhead      *
*      cost against the speed gain the parallel search produces.  The        *
*      default is currently 5 (which could change with future generations    *
*      of Intel hardware) but values between 4 and 8 will work.  Larger      *
*      values allow somewhat fewer splits, which reduces overhead, but it    *
*      also increases the percentage of the time where a thread is waiting   *
*      on work.                                                              *
*                                                                             *
*   smp_split_at_root (command = smproot=0 or 1) enables (1) or disables     *
*      (0) splitting the tree at the root.  This defaults to 1, which        *
*      produces the best performance by a significant margin.  But it can    *
*      be disabled if you are playing with code changes.                     *
*                                                                             *
*   smp_gratuitous_depth (command = smpgd=<n>) controls "gratuitous splits", *
*      which are splits that are done without any idle threads.  This sets   *
*      a depth limit (remaining depth) that must be present before such a    *
*      split can be done.  Making this number larger will reduce the number  *
*      of these splits.  Making it too small will increase overhead          *
*      slightly and increase split block usage significantly.                *
*                                                                             *
*   smp_gratuitous_limit (command = smpgl=<n>) limits the number of these    *
*      splits that a thread can do.  Once a thread has this number of        *
*      unjoined split points, it will not be allowed to split any more       *
*      until one or more threads join at least one of the existing split     *
*      points.  In the smp search statistics, where you see output that      *
*      looks like this:                                                      *
*                                                                             *
*          splits=m(n) ...                                                   *
*                                                                             *
*      m is the total splits done, and n is the number of "wasted splits",   *
*      which are basically gratuitous splits where no thread joined before   *
*      this split point was completed and deallocated.                       *
*                                                                             *
*   The best way to tune all of these parameters is to use the "autotune"    *
*   command (see autotune.c and help autotune) which will automatically run  *
*   tests and optimize the parameters.  More details are in the autotune.c   *
*   source file.                                                             *
*                                                                             *
*   A few basic "rules of the road" for anyone interested in changing or     *
*   adding to any of this code.                                              *
*                                                                             *
*   1.  If, at any time, you want to modify your private split block, no     *
*       lock is required.                                                    *
*                                                                             *
*   2.  If, at any time, you want to modify another split block, such as     *
*       the parent split block's shared move list, you must acquire the      *
*       lock in that split block first, IE tree->parent->lock to lock the    *
*       parent split block during NextMove() and such.                       *
*                                                                             *
*   3.  If you want to modify any SMP-related data that spans multiple       *
*       split blocks, such as telling sibling processes to stop, etc, then   *
*       you must acquire the global "lock_smp" lock first.  This prevents a  *
*       deadlock caused by two different threads walking the split block     *
*       chain from different directions, and acquiring the split block       *
*       locks in different orders, which could cause a catastrophic          *
*       deadlock to occur.  This is an infrequent event so the overhead is   *
*       not significant.                                                     *
*                                                                             *
*   4.  If you want to do any sort of I/O operation, you must acquire the    *
*       "lock_io" lock first.  Since threads share descriptors, there are    *
*       lots of potential race conditions, from simple tty output getting    *
*       interlaced from different threads, to outright data corruption in    *
*       the book or log files.                                               *
*                                                                             *
*   Some of the bugs caused by failing to acquire the correct lock will      *
*   only occur infrequently, and they are extremely difficult to find.       *
*   Some only show up in a public game where everyone is watching, causing   *
*   maximum embarrassment when the program does something extremely stupid.  *
*                                                                             *
*******************************************************************************
*/
int Split(TREE * RESTRICT tree) {
  TREE *child;
  int tid, tstart, tend;

/*
 ************************************************************
 *                                                          *
 *  Here we prepare to split the tree.  All we really do in *
 *  the Generation II threading is grab a split block for   *
 *  this thread, then flag the parent as "joinable" and     *
 *  then jump right to ThreadWait() to resume where we left *
 *  off, with the expectation (but not a requirement) that  *
 *  other threads will join us to help.                     *
 *                                                          *
 *  Idle threads are sitting in ThreadWait() repeatedly     *
 *  calling Join() to find them a split point, which we     *
 *  are fixing to provide.  They will then join as quickly  *
 *  as they can, and other threads that become idle later   *
 *  can also join without any further splitting needed.     *
 *                                                          *
 *  If we are unable to allocate a split block, we simply   *
 *  abort this attempted split and return to the search     *
 *  since other threads will also split quickly.            *
 *                                                          *
 ************************************************************
 */
  tstart = ReadClock();
  tree->nprocs = 0;
  for (tid = 0; tid < (int) smp_max_threads; tid++) // Pierre-Marie Baty -- added type cast
    tree->siblings[tid] = 0;
  child = GetBlock(tree, tree->thread_id);
  if (!child)
    return 0;
  CopyFromParent(child);
  thread[tree->thread_id].tree = child;
  tree->joined = 0;
  tree->joinable = 1;
  parallel_splits++;
  smp_split = 0;
  tend = ReadClock();
  thread[tree->thread_id].idle += tend - tstart;
/*
 ************************************************************
 *                                                          *
 *  We have successfully created a split point, which means *
 *  we are done.  The instant we set the "joinable" flag,   *
 *  idle threads may begin to join in at this split point   *
 *  to help.  Since this thread may finish before any or    *
 *  all of the other parallel threads, this thread is sent  *
 *  to ThreadWait() which will immediately send it to       *
 *  SearchMoveList() like the other threads; however, going *
 *  to ThreadWait() allows this thread to join others if it *
 *  runs out of work to do.  We do pass ThreadWait() the    *
 *  address of the parent split block, so that if this      *
 *  thread becomes idle, and this split block shows no      *
 *  threads are still busy, then this thread can return to  *
 *  here and then back up into the previous ply as it       *
 *  should.  Note that no other thread can back up to the   *
 *  previous ply since their recursive call stacks are not  *
 *  set for that, while this call stack will bring us back  *
 *  to this point where we return to the normal search,     *
 *  which we just completed.                                *
 *                                                          *
 ************************************************************
 */
  ThreadWait(tree->thread_id, tree);
  if (!tree->joined)
    parallel_splits_wasted++;
  return 1;
}
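
/*
   A minimal sketch, not part of Crafty:  how the two counters updated in
   Split() above surface in the "splits=m(n)" smp statistic described in
   the header comment, where m is parallel_splits and n is
   parallel_splits_wasted.  The function name and format string here are
   assumptions for illustration only; the real display lives elsewhere.
 */
static void ShowSplitCounters(void) {
  Print(4095, "splits=%lu(%lu)\n", (unsigned long) parallel_splits,
      (unsigned long) parallel_splits_wasted);
}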

/* modified 12/16/15 */
/*
*******************************************************************************
*                                                                             *
*   Join() is called just when we enter the usual spin-loop waiting for      *
*   work.  We take a quick look at all active split blocks to see if any     *
*   look "joinable".  If so, we compute an "interest" value, which will be   *
*   defined below.  We then join the most interesting split point directly.  *
*   This split point might have been created specifically for this thread    *
*   to join, or it might be one that was already active when this thread     *
*   became idle, which allows us to join that existing split point and not   *
*   request a new split operation, saving time.                              *
*                                                                             *
*******************************************************************************
*/
int Join(int64_t tid) {
  TREE *tree, *join_block, *child;
  int interest, best_interest, current, pass = 0;

/*
 ************************************************************
 *                                                          *
 *  First we pass over ALL split blocks, looking for those  *
 *  flagged as "joinable", which means they are actually    *
 *  active split points, no processor at that split point   *
 *  has run out of work (there is no point in joining a     *
 *  split point with no remaining work), and no fail high   *
 *  has been found there that would raise the "stop" flag.  *
 *  This is "racy" because we do not acquire any locks,     *
 *  which means that the busy threads continue working, and *
 *  there is a small probability that the split point will  *
 *  disappear while we are in this loop.  To resolve the    *
 *  potential race, after we find the most attractive split *
 *  point, we acquire the lock for that split block and     *
 *  test again, but this time if the block is joinable, we  *
 *  can safely join under control of the lock, which is not *
 *  held for very long at all.  If the block is not         *
 *  joinable once we acquire the lock, we abort joining     *
 *  since it is futile.  Note that if this happens, we will *
 *  try to find an existing split point we can join three   *
 *  times before we exit, setting smp_split to 1 to ask     *
 *  other threads to produce more candidate split points.   *
 *                                                          *
 ************************************************************
 */
  for (pass = 0; pass < 3; pass++) {
    best_interest = -999999;
    join_block = 0;
    for (current = 0; current <= (int) smp_max_threads * 64; current++) { // Pierre-Marie Baty -- added type cast
      tree = block[current];
      if (tree->joinable && (tree->ply <= tree->depth / 2 ||
              tree->nprocs < (int) smp_split_group)) { // Pierre-Marie Baty -- added type cast
        interest = tree->depth * 2 - tree->searched[0];
        if (interest > best_interest) {
          best_interest = interest;
          join_block = tree;
        }
      }
    }
/*
 ************************************************************
 *                                                          *
 *  Now we acquire the lock for this split block, and then  *
 *  check to see if the block is still flagged as joinable. *
 *  If so, we set things up, and then we get pretty tricky  *
 *  as we then release the lock, and then copy the data     *
 *  from the parent to our split block.  There is a chance  *
 *  that while we are copying this data, the split point    *
 *  gets completed by other threads, which would leave an   *
 *  apparent race condition exposed where we start copying  *
 *  data here, the split point is completed, the parent     *
 *  block is released and then reacquired, and we continue  *
 *  as if nothing has happened here, getting data copied    *
 *  from two different positions.                           *
 *                                                          *
 *  Fortunately, we linked this new split block to the old  *
 *  (original parent).  If that split block is released, we *
 *  will discover this because that operation will also set *
 *  our "stop" flag, which will prevent us from using this  *
 *  data and breaking things.  We allow threads to copy     *
 *  this data without any lock protection to eliminate a    *
 *  serialization (each node would copy the data serially,  *
 *  rather than all at once) with the only consequence      *
 *  being the overhead of copying and then throwing the     *
 *  data away, which can happen on occasion even if we      *
 *  used a lock for protection, since once we release the   *
 *  lock it still takes time to get into the search and we  *
 *  could STILL find that this split block has already been *
 *  completed, once again.  Less contention and less        *
 *  serialization improve performance.                      *
 *                                                          *
 ************************************************************
 */
    if (join_block) {
      Lock(join_block->lock);
      if (join_block->joinable) {
        child = GetBlock(join_block, (int) tid); // Pierre-Marie Baty -- added type cast
        Unlock(join_block->lock);
        if (child) {
          CopyFromParent(child);
          thread[tid].tree = child;
          parallel_joins++;
          return 1;
        }
      } else {
        Unlock(join_block->lock);
        break;
      }
    }
  }
/*
 ************************************************************
 *                                                          *
 *  We did not acquire a split point to join, so we set     *
 *  smp_split to 1 to ask busy threads to create joinable   *
 *  split points.                                           *
 *                                                          *
 ************************************************************
 */
  smp_split = 1;
  return 0;
}
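
/*
   A worked example of the "interest" ranking in Join() above, pulled out
   as a standalone helper (a sketch, not a function the engine defines):
   interest = depth * 2 - moves already searched, so a split point with
   depth 20 and 1 move searched scores 39, while one with depth 12 and 5
   moves searched scores only 19; the deeper point with more remaining
   work wins.
 */
static int SplitPointInterest(TREE *tree) {
  return tree->depth * 2 - tree->searched[0];
}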

/* modified 11/04/15 */
/*
*******************************************************************************
*                                                                             *
*   ThreadInit() is called after a process is created.  Its main task is to  *
*   initialize the process local memory so that it will fault in and be      *
*   allocated on the local node rather than the node where the original      *
*   (first) process was running.  All threads will hang here via a custom    *
*   WaitForAllThreadsInitialized() procedure so that all the local thread    *
*   blocks are usable before the search actually begins.                     *
*                                                                             *
*******************************************************************************
*/
void *STDCALL ThreadInit(void *t) {
  int tid = (int64_t) t;

  ThreadAffinity(tid);
# if !defined(UNIX)
  ThreadMalloc((uint64_t) tid);
# endif
  thread[tid].blocks = 0xffffffffffffffffull;
  Lock(lock_smp);
  initialized_threads++;
  Unlock(lock_smp);
  WaitForAllThreadsInitialized();
  ThreadWait(tid, (TREE *) 0);
  Lock(lock_smp);
  smp_threads--;
  Unlock(lock_smp);
  return 0;
}

/* modified 01/08/15 */
/*
*******************************************************************************
*                                                                             *
*   ThreadAffinity() is called to "pin" a thread to a specific processor.    *
*   It is a "noop" (no-operation) if Crafty was not compiled with            *
*   -DAFFINITY, or if smp_affinity is negative (smpaffinity=-1 disables      *
*   affinity).  It simply sets the affinity for the current thread to the    *
*   requested CPU and returns.  NOTE:  If hyperthreading is enabled, there   *
*   is no guarantee that this will work as expected and pin one thread per   *
*   physical core.  It depends on how the O/S numbers the SMT cores.         *
*                                                                             *
*******************************************************************************
*/
void ThreadAffinity(int cpu) {
# if defined(AFFINITY)
  cpu_set_t cpuset;
  pthread_t current_thread = pthread_self();

  if (smp_affinity >= 0) {
    CPU_ZERO(&cpuset);
    CPU_SET(cpu + smp_affinity, &cpuset);
    pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
  }
# endif
}
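
/*
   A worked example of the cpu mapping ThreadAffinity() produces (an
   illustration only; affinity_base stands in for smp_affinity and the
   helper itself is not engine code):  each thread is pinned to cpu
   tid + smp_affinity, so with smpaffinity=8 and 8 threads, thread 0
   binds to cpu 8, thread 1 to cpu 9, ..., thread 7 to cpu 15, leaving
   cpus 0-7 for a second instance started with smpaffinity=0.
 */
static int BoundCpu(int tid, int affinity_base) {
  return tid + affinity_base;
}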

/* modified 11/04/15 */
/*
*******************************************************************************
*                                                                             *
*   ThreadSplit() is used to determine if we should split at the current     *
*   ply.  There are some basic constraints on when splits can be done, such  *
*   as the depth remaining in the search (don't split too near the tips),    *
*   and having searched at least one move to get a score or bound (the YBW   *
*   condition).                                                              *
*                                                                             *
*   If those conditions are satisfied, AND either a thread has requested a   *
*   split OR we are far enough away from the tips of the tree to justify a   *
*   "gratuitous split", then we return "success".  A "gratuitous split" is   *
*   a split done without any idle threads.  Since splits are not free, we    *
*   only do this well away from the tips to limit overhead.  We do this so   *
*   that when a thread becomes idle, it will find these split points         *
*   immediately and not have to wait for a split after the fact.             *
*                                                                             *
*******************************************************************************
*/
int ThreadSplit(TREE *RESTRICT tree, int ply, int depth, int alpha, int o_alpha, // Pierre-Marie Baty -- missing keyword
    int done) {
  TREE *used;
  int64_t tblocks;
  int temp, unused = 0;

/*
 ************************************************************
 *                                                          *
 *  First, we see if we meet the basic criteria to create a *
 *  split point, that being that we must not be too far     *
 *  from the root (smp_min_split_depth).                    *
 *                                                          *
 ************************************************************
 */
  if (depth < (int) smp_min_split_depth) // Pierre-Marie Baty -- added type cast
    return 0;
/*
 ************************************************************
 *                                                          *
 *  If smp_split is NOT set, we are checking to see if it   *
 *  is acceptable to do a gratuitous split here.            *
 *                                                          *
 *  (1) If we are too far from the root we do not do        *
 *      gratuitous splits, to avoid the overhead.           *
 *                                                          *
 *  (2) If we have searched more than one move at this ply, *
 *      we don't do any further tests to see if a           *
 *      gratuitous split is acceptable, since we have       *
 *      previously done this test at this ply and decided   *
 *      one should not be done.  That condition has likely  *
 *      not changed.                                        *
 *                                                          *
 *  (3) If we have pre-existing gratuitous split points for *
 *      this thread, we make certain we don't create more   *
 *      than the gratuitous split limit, as excessive       *
 *      splits just add to the overhead with no benefit.    *
 *                                                          *
 ************************************************************
 */
  if (!smp_split) {
    if (depth < (int) smp_gratuitous_depth || done > 1) // Pierre-Marie Baty -- added type cast
      return 0;
    tblocks = ~thread[tree->thread_id].blocks;
    while (tblocks) {
      temp = LSB(tblocks);
      used = block[temp + tree->thread_id * 64 + 1];
      if (used->joinable && !used->joined)
        unused++;
      Clear(temp, tblocks);
    }
    if (unused > (int) smp_gratuitous_limit) // Pierre-Marie Baty -- added type cast
      return 0;
  }
/*
 ************************************************************
 *                                                          *
 *  If smp_split IS set, we are checking to see if it is    *
 *  acceptable to do a split because there are idle threads *
 *  that need work to do.                                   *
 *                                                          *
 *  The only reason this would be false is if we have a     *
 *  pre-existing split point that is joinable but has not   *
 *  been joined.  If one exists, there is no need to split  *
 *  again, as there is already an accessible split point.   *
 *  Otherwise, if we are at the root and we are either not  *
 *  allowed to split at the root, or we have additional     *
 *  root moves that have to be searched one at a time using *
 *  all available threads, we also can not split here.      *
 *                                                          *
 ************************************************************
 */
  else {
    if (ply == 1 && (!smp_split_at_root || !NextRootMoveParallel() ||
            alpha == o_alpha))
      return 0;
    tblocks = ~thread[tree->thread_id].blocks;
    while (tblocks) {
      temp = LSB(tblocks);
      used = block[temp + tree->thread_id * 64 + 1];
      if (used->joinable && !used->joined)
        unused++;
      Clear(temp, tblocks);
    }
    if (unused > (int) smp_gratuitous_limit) // Pierre-Marie Baty -- added type cast
      return 0;
  }
  return 1;
}
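
/*
   The unjoined-split-point count used twice in ThreadSplit() above, shown
   as a standalone sketch (not a function Crafty defines):  a cleared bit
   in thread[tid].blocks marks a block in use, so we invert the map and
   count blocks that are joinable but not yet joined.  With
   smp_gratuitous_limit = 1, a thread already holding two such split
   points gets 0 back from ThreadSplit() until at least one is joined.
 */
static int UnjoinedSplitPoints(int tid) {
  TREE *used;
  int64_t tblocks = ~thread[tid].blocks;
  int temp, unjoined = 0;

  while (tblocks) {
    temp = LSB(tblocks);
    used = block[temp + tid * 64 + 1];
    if (used->joinable && !used->joined)
      unjoined++;
    Clear(temp, tblocks);
  }
  return unjoined;
}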

/* modified 11/04/15 */
/*
*******************************************************************************
*                                                                             *
*   ThreadStop() is called from SearchMoveList() when it detects a beta      *
*   cutoff (fail high) at a node that is being searched in parallel.  We     *
*   need to stop all threads here, and since this threading algorithm is     *
*   recursive, it may be necessary to stop other threads that are helping    *
*   search this branch further down into the tree.  This function simply     *
*   sets appropriate tree->stop variables to 1, which will stop those        *
*   particular threads instantly and return them to the idle loop in         *
*   ThreadWait().                                                            *
*                                                                             *
*******************************************************************************
*/
void ThreadStop(TREE * RESTRICT tree) {
  int proc;

  Lock(tree->lock);
  tree->stop = 1;
  tree->joinable = 0;
  for (proc = 0; proc < (int) smp_max_threads; proc++) // Pierre-Marie Baty -- added type cast
    if (tree->siblings[proc])
      ThreadStop(tree->siblings[proc]);
  Unlock(tree->lock);
}

/* modified 11/04/15 */
/*
*******************************************************************************
*                                                                             *
*   ThreadTrace() is a debugging tool that simply walks the split block      *
*   tree and displays interesting data to help debug the parallel search     *
*   whenever changes break things.                                           *
*                                                                             *
*******************************************************************************
*/
void ThreadTrace(TREE * RESTRICT tree, int depth, int brief) {
  int proc, i;

  Lock(tree->lock);
  Lock(lock_io);
  if (!brief) {
    for (i = 0; i < 4 * depth; i++)
      Print(4095, " ");
    depth++;
    Print(4095, "block[%d] thread=%d ply=%d nprocs=%d ",
        FindBlockID(tree), tree->thread_id, tree->ply, tree->nprocs);
    Print(4095, "joined=%d joinable=%d stop=%d nodes=%d", tree->joined,
        tree->joinable, tree->stop, tree->nodes_searched);
    Print(4095, " parent=%d\n", FindBlockID(tree->parent));
  } else {
    if (tree->nprocs > 1) {
      for (i = 0; i < 4 * depth; i++)
        Print(4095, " ");
      depth++;
      Print(4095, "(ply %d)", tree->ply);
    }
  }
  if (tree->nprocs) {
    if (!brief) {
      for (i = 0; i < 4 * depth; i++)
        Print(4095, " ");
      Print(4095, " parent=%d sibling threads=",
          FindBlockID(tree->parent));
      for (proc = 0; proc < (int) smp_max_threads; proc++) // Pierre-Marie Baty -- added type cast
        if (tree->siblings[proc])
          Print(4095, " %d(%d)", proc, FindBlockID(tree->siblings[proc]));
      Print(4095, "\n");
    } else {
      if (tree->nprocs > 1) {
        Print(4095, " helping= ");
        for (proc = 0; proc < (int) smp_max_threads; proc++) // Pierre-Marie Baty -- added type cast
          if (tree->siblings[proc]) {
            if (proc == tree->thread_id)
              Print(4095, "[");
            Print(4095, "%d", proc);
            if (proc == tree->thread_id)
              Print(4095, "]");
            Print(4095, " ");
          }
        Print(4095, "\n");
      }
    }
  }
  Unlock(lock_io);
  for (proc = 0; proc < (int) smp_max_threads; proc++) // Pierre-Marie Baty -- added type cast
    if (tree->siblings[proc])
      ThreadTrace(tree->siblings[proc], depth, brief);
  Unlock(tree->lock);
}

/* modified 11/04/15 */
/*
*******************************************************************************
*                                                                             *
*   ThreadWait() is the idle loop for the N threads that are created at the  *
*   beginning when Crafty searches.  Threads are "parked" here waiting on a  *
*   pointer to something they should search (a parameter block built in the  *
*   function Split() in this case).  When this pointer becomes non-zero,     *
*   each thread "parked" here will immediately call SearchMoveList() and     *
*   begin the parallel search as directed.                                   *
*                                                                             *
*   Upon entry, all threads except for the "master" will arrive here with a  *
*   value of zero (0) in the waiting parameter below.  This indicates that   *
*   they will search and then be done.  The "master" will arrive here with   *
*   a pointer to the parent split block in "waiting", which says "I will     *
*   sit here waiting for work OR until the waiting split block has no        *
*   threads working on it, at which point I should return", which lets it    *
*   "unsplit" here and clean things up.  The call to here in Split() passes  *
*   this block address, while threads that are helping get here with a       *
*   zero.                                                                    *
*                                                                             *
*******************************************************************************
*/
int ThreadWait(int tid, TREE * RESTRICT waiting) {
  int value, tstart, tend;

/*
 ************************************************************
 *                                                          *
 *  When we reach this point, one of three possible         *
 *  conditions is true:  (1) we already have work to do, as *
 *  we are the "master thread" and we have already split    *
 *  the tree, and we are coming here to join in; (2) we are *
 *  the master, and we are waiting on our split point to    *
 *  complete, so we come here to join and help currently    *
 *  active threads; (3) we have no work to do, so we will   *
 *  spin until Join() locates a split point we can join to  *
 *  help out.                                               *
 *                                                          *
 *  Note that when we get here, the parent already has a    *
 *  split block and does not need to call Join(), it simply *
 *  falls through the while spin loop below because its     *
 *  "tree" pointer is already non-zero.                     *
 *                                                          *
 ************************************************************
 */
  while (1) {
    tstart = ReadClock();
    while (!thread[tid].tree && (!waiting || waiting->nprocs) && !Join(tid) &&
        !thread[tid].terminate)
      Pause();
    tend = ReadClock();
    if (!thread[tid].tree)
      thread[tid].tree = waiting;
    thread[tid].idle += tend - tstart;
    if (thread[tid].tree == waiting || thread[tid].terminate)
      return 0;
/*
 ************************************************************
 *                                                          *
 *  Once we get here, we have a good split point, so we are *
 *  ready to participate in a parallel search.  Once we     *
 *  return from SearchMoveList() we copy our results back   *
 *  to the parent via CopyToParent() before we look for a   *
 *  new split point.  If we are a parent, we will slip out  *
 *  of the spin loop at the top and return to the normal    *
 *  serial search to finish up here.                        *
 *                                                          *
 *  When we return from SearchMoveList(), we need to        *
 *  decrement the "nprocs" value since there is now one     *
 *  less thread working at this split point.                *
 *                                                          *
 *  Note:  CopyToParent() marks the current split block as  *
 *  unused once the copy is completed, so we don't have to  *
 *  do anything about that here.                            *
 *                                                          *
 ************************************************************
 */
    value =
        SearchMoveList(thread[tid].tree, thread[tid].tree->ply,
        thread[tid].tree->depth, thread[tid].tree->wtm,
        thread[tid].tree->alpha, thread[tid].tree->beta,
        thread[tid].tree->searched, thread[tid].tree->in_check, 0, parallel);
    tstart = ReadClock();
    Lock(thread[tid].tree->parent->lock);
    thread[tid].tree->parent->joinable = 0;
    CopyToParent((TREE *) thread[tid].tree->parent, thread[tid].tree, value);
    thread[tid].tree->parent->nprocs--;
    thread[tid].tree->parent->siblings[tid] = 0;
    Unlock(thread[tid].tree->parent->lock);
    thread[tid].tree = 0;
    tend = ReadClock();
    thread[tid].idle += tend - tstart;
  }
}
| 721 | |||
| 722 | /* modified 11/04/15 */ |
||
| 33 | pmbaty | 723 | /* |
| 108 | pmbaty | 724 | ******************************************************************************* |
| 725 | * * |
||
| 726 | * CopyFromParent() is used to copy data from a parent thread to a child * |
||
| 727 | * thread. This only copies the appropriate parts of the TREE structure to * |
||
| 728 | * avoid burning memory bandwidth by copying everything. * |
||
| 729 | * * |
||
| 730 | ******************************************************************************* |
||
| 731 | */ |
||
| 732 | void CopyFromParent(TREE * RESTRICT child) { |
||
| 733 | TREE *parent = child->parent; |
||
| 734 | int i, ply; |
||
| 735 | |||
| 736 | /* |
||
| 33 | pmbaty | 737 | ************************************************************ |
| 738 | * * |
||
| 739 | * We have allocated a split block. Now we copy the tree * |
||
| 740 | * search state from the parent block to the child in * |
||
| 741 | * preparation for starting the parallel search. * |
||
| 742 | * * |
||
| 743 | ************************************************************ |
||
| 744 | */ |
||
| 108 | pmbaty | 745 | ply = parent->ply; |
| 33 | pmbaty | 746 | child->ply = ply; |
| 747 | child->position = parent->position; |
||
| 108 | pmbaty | 748 | for (i = 0; i <= rep_index + parent->ply; i++) |
| 33 | pmbaty | 749 | child->rep_list[i] = parent->rep_list[i]; |
| 108 | pmbaty | 750 | for (i = ply - 1; i < MAXPLY; i++) |
| 751 | child->killers[i] = parent->killers[i]; |
||
| 752 | for (i = ply - 1; i <= ply; i++) { |
||
| 33 | pmbaty | 753 | child->curmv[i] = parent->curmv[i]; |
| 108 | pmbaty | 754 | child->pv[i] = parent->pv[i]; |
| 33 | pmbaty | 755 | } |
| 108 | pmbaty | 756 | child->in_check = parent->in_check; |
| 757 | child->last[ply] = child->move_list; |
||
| 758 | child->status[ply] = parent->status[ply]; |
||
| 759 | child->status[1] = parent->status[1]; |
||
| 760 | child->save_hash_key[ply] = parent->save_hash_key[ply]; |
||
| 761 | child->save_pawn_hash_key[ply] = parent->save_pawn_hash_key[ply]; |
||
| 33 | pmbaty | 762 | child->nodes_searched = 0; |
| 763 | child->fail_highs = 0; |
||
| 764 | child->fail_high_first_move = 0; |
||
| 765 | child->evaluations = 0; |
||
| 766 | child->egtb_probes = 0; |
||
| 108 | pmbaty | 767 | child->egtb_hits = 0; |
| 33 | pmbaty | 768 | child->extensions_done = 0; |
| 769 | child->qchecks_done = 0; |
||
| 770 | child->moves_fpruned = 0; |
||
| 108 | pmbaty | 771 | child->moves_mpruned = 0; |
| 772 | for (i = 0; i < 16; i++) { |
||
| 773 | child->LMR_done[i] = 0; |
||
| 774 | child->null_done[i] = 0; |
||
| 775 | } |
||
| 33 | pmbaty | 776 | child->alpha = parent->alpha; |
| 777 | child->beta = parent->beta; |
||
| 778 | child->value = parent->value; |
||
| 108 | pmbaty | 779 | child->wtm = parent->wtm; |
| 33 | pmbaty | 780 | child->depth = parent->depth; |
| 108 | pmbaty | 781 | child->searched = parent->searched; |
| 782 | strcpy(child->root_move_text, parent->root_move_text); |
||
| 783 | strcpy(child->remaining_moves_text, parent->remaining_moves_text); |
||
| 33 | pmbaty | 784 | } |
| 785 | |||
| 108 | pmbaty | 786 | /* modified 11/04/15 */ |
| 33 | pmbaty | 787 | /* |
| 788 | ******************************************************************************* |
||
| 789 | * * |
||
| 790 | * CopyToParent() is used to copy data from a child thread to a parent * |
||
| 791 | * thread. This only copies the appropriate parts of the TREE structure to * |
||
| 792 | * avoid burning memory bandwidth by copying everything. * |
||
| 793 | * * |
||
| 794 | ******************************************************************************* |
||
| 795 | */ |
||
| 796 | void CopyToParent(TREE * RESTRICT parent, TREE * RESTRICT child, int value) { |
||
| 108 | pmbaty | 797 | int i, ply = parent->ply, which; |
| 33 | pmbaty | 798 | |
| 799 | /* |
||
| 800 | ************************************************************ |
||
| 801 | * * |
||
| 108 | pmbaty | 802 | * The only concern here is to make sure that the info is * |
| 803 | * only copied to the parent if our score is > than the * |
||
| 804 | * parent value, and that we were not stopped for any * |
||
| 805 | * reason which could produce a partial score that is * |
||
| 806 | * worthless and dangerous to use. * |
||
| 33 | pmbaty | 807 | * * |
| 108 | pmbaty | 808 | * One important special case. If we get here with the * |
| 809 | * thread->stop flag set, and ply is 1, then we need to * |
||
| 810 | * clear the "this move has been searched" flag in the ply * |
||
| 811 | * 1 move list since we did not complete the search. If * |
||
| 812 | * we fail to do this, then a move being searched in * |
||
| 813 | * parallel at the root will be "lost" for this iteration * |
||
| 814 | * and won't be searched again until the next iteration. * |
||
| 815 | * * |
||
| 33 | pmbaty | 816 | * In any case, we add our statistical counters to the * |
| 817 | * parent's totals no matter whether we finished or not * |
||
| 818 | * since the total nodes searched and such should consider * |
||
| 819 | * everything searched, not just the "useful stuff." * |
||
| 820 | * * |
||
| 108 | pmbaty | 821 | * After we finish copying everything, we mark this split * |
| 822 | * block as free in the split block bitmap. * |
||
| 823 | * * |
||
| 33 | pmbaty | 824 | ************************************************************ |
| 825 | */ |
||
| 826 | if (child->nodes_searched && !child->stop && value > parent->value && |
||
| 827 | !abort_search) { |
||
| 828 | parent->pv[ply] = child->pv[ply]; |
||
| 829 | parent->value = value; |
||
| 830 | parent->cutmove = child->curmv[ply]; |
||
| 831 | } |
||
| 108 | pmbaty | 832 | if (child->stop && ply == 1) |
| 833 | for (which = 0; which < n_root_moves; which++) |
||
| 834 | if (root_moves[which].move == child->curmv[ply]) { |
||
| 835 | root_moves[which].status &= 7; |
||
| 836 | break; |
||
| 837 | } |
||
| 33 | pmbaty | 838 | parent->nodes_searched += child->nodes_searched; |
| 839 | parent->fail_highs += child->fail_highs; |
||
| 840 | parent->fail_high_first_move += child->fail_high_first_move; |
||
| 841 | parent->evaluations += child->evaluations; |
||
| 842 | parent->egtb_probes += child->egtb_probes; |
||
| 108 | pmbaty | 843 | parent->egtb_hits += child->egtb_hits; |
| 33 | pmbaty | 844 | parent->extensions_done += child->extensions_done; |
| 845 | parent->qchecks_done += child->qchecks_done; |
||
| 846 | parent->moves_fpruned += child->moves_fpruned; |
||
| 108 | pmbaty | 847 | parent->moves_mpruned += child->moves_mpruned; |
| 848 | for (i = 1; i < 16; i++) { |
||
| 849 | parent->LMR_done[i] += child->LMR_done[i]; |
||
| 850 | parent->null_done[i] += child->null_done[i]; |
||
| 851 | } |
||
| 852 | which = FindBlockID(child) - 64 * child->thread_id - 1; |
||
| 853 | Set(which, thread[child->thread_id].blocks); |
||
| 33 | pmbaty | 854 | } |
| 855 | |||
| 108 | pmbaty | 856 | /* modified 11/04/15 */ |
| 33 | pmbaty | 857 | /* |
| 858 | ******************************************************************************* |
||
| 859 | * * |
||
| 108 | pmbaty | 860 | * GetBlock() is used to allocate a split block and fill in only SMP- * |
| 861 | * critical information. The child process will copy the rest of the split * |
||
| 862 | * block information as needed. * |
||
| 33 | pmbaty | 863 | * * |
| 108 | pmbaty | 864 | * When we arrive here, the parent split block must be locked since we are * |
| 865 | * going to change data in that block as well as copy data from that block * |
||
| 866 | * the current split block. The only exception is when this is the original * |
||
| 867 | * split operation, since this is done "out of sight" of other threads which * |
||
| 868 | * means no locks are needed until after the "joinable" flag is set, which * |
||
| 869 | * exposes this split point to other threads instantly. * |
||
| 870 | * * |
||
| 33 | pmbaty | 871 | ******************************************************************************* |
| 872 | */ |
||
| 108 | pmbaty | 873 | TREE *GetBlock(TREE * /*RESTRICT*/ parent, int tid) { // Pierre-Marie Baty -- keyword not present in declaration |
| 874 | TREE *child; |
||
| 875 | static int warnings = 0; |
||
| 876 | int i, unused; |
||
| 877 | /* |
||
| 878 | ************************************************************ |
||
| 879 | * * |
||
| 880 | * One NUMA-related trick is that we only allocate a split * |
||
| 881 | * block in the thread's local memory. Each thread has a * |
||
| 882 | * group of split blocks that were first touched by the * |
||
| 883 | * correct CPU so that the split blocks page-faulted into * |
||
| 884 | * local memory for that specific processor. If we can't * |
||
| 885 | * find an optimally-placed block, we return a zero which * |
||
| 886 | * will prevent this thread from joining the split point. * |
||
| 887 | * This is highly unlikely as it would mean the current * |
||
| 888 | * thread has 64 active split blocks where it is waiting * |
||
| 889 | * on other threads to complete the last bit of work at * |
||
| 890 | * each. This is extremely unlikely. * |
||
| 891 | * * |
||
| 892 | * Here we use a simple 64 bit bit-map per thread that * |
||
| 893 | * indicates which blocks are free (1) and which blocks * |
||
| 894 | * used (0). We simply use LSB() to find the rightmost * |
||
| 895 | * one-bit set and that is the local block number. We * |
||
| 896 | * convert that to a global block number by doing the * |
||
| 897 | * simple computation: * |
||
| 898 | * * |
||
| 899 | * global = local + 64 * tid + 1 * |
||
| 900 | * * |
||
| 901 | * Each thread has exactly 64 split blocks, and block 0 * |
||
| 902 | * is the "master block" that never gets allocated or * |
||
| 903 | * freed. Once we find a free block for the current * |
||
| 904 | * thread, we zero that bit so that the block won't be * |
||
| 905 | * used again until it is released. * |
||
| 906 | * * |
||
| 907 | ************************************************************ |
||
| 908 | */ |
||
| 909 | if (thread[tid].blocks) { |
||
| 910 | unused = LSB(thread[tid].blocks); |
||
| 911 | Clear(unused, thread[tid].blocks); |
||
| 912 | Set(unused, thread[tid].max_blocks); |
||
| 913 | } else { |
||
| 914 | if (++warnings < 6) |
||
| 915 | Print(2048, |
||
| 916 | "WARNING. local SMP block cannot be allocated, thread %d\n", tid); |
||
| 917 | return 0; |
||
| 918 | } |
||
| 919 | child = block[unused + tid * 64 + 1]; |
||
| 920 | /* |
||
| 921 | ************************************************************ |
||
| 922 | * * |
||
| 923 | * Found a split block. Now we need to fill in only the * |
||
| 924 | * critical information that can't be delayed due to race * |
||
| 925 | * conditions. When we get here, the parent split block * |
||
| 926 | * must be locked, that lets us safely update the number * |
||
| 927 | * of processors working here, etc, without any ugly race * |
||
| 928 | * conditions that would corrupt this critical data. * |
||
| 929 | * * |
||
| 930 | ************************************************************ |
||
| 931 | */ |
||
| 932 | for (i = 0; i < (int) smp_max_threads; i++) // Pierre-Marie Baty -- added type cast |
||
| 933 | child->siblings[i] = 0; |
||
| 934 | child->nprocs = 0; |
||
| 935 | child->stop = 0; |
||
| 936 | child->joinable = 0; |
||
| 937 | child->joined = 0; |
||
| 938 | child->parent = parent; |
||
| 939 | child->thread_id = tid; |
||
| 940 | parent->nprocs++; |
||
| 941 | parent->siblings[tid] = child; |
||
| 942 | parent->joined = 1; |
||
| 943 | return child; |
||
| 33 | pmbaty | 944 | } |
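
/*
   The local-to-global split block numbering used in GetBlock() above,
   shown as a standalone helper (illustrative; the engine computes this
   inline):  each thread owns 64 consecutive entries in block[], and
   block[0] is the never-allocated master block, so local block 5 of
   thread 2 maps to block[5 + 64 * 2 + 1] = block[134].
 */
static int GlobalBlockID(int local, int tid) {
  return local + 64 * tid + 1;
}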
| 945 | |||
| 946 | /* |
||
| 947 | ******************************************************************************* |
||
| 948 | * * |
||
| 108 | pmbaty | 949 | * WaitForAllThreadsInitialized() waits until all smp_max_threads are * |
| 950 | * initialized. We have to initialize each thread and malloc() its split * |
||
| 951 | * blocks before we start the actual parallel search. Otherwise we will see * |
||
| 952 | * invalid memory accesses and crash instantly. * |
||
| 33 | pmbaty | 953 | * * |
| 954 | ******************************************************************************* |
||
| 955 | */ |
||
| 108 | pmbaty | 956 | void WaitForAllThreadsInitialized(void) { |
| 957 | while (initialized_threads < (int) smp_max_threads); /* Do nothing */ // Pierre-Marie Baty -- added type cast |
||
| 33 | pmbaty | 958 | } |
| 959 | |||
| 108 | pmbaty | 960 | # if !defined (UNIX) |
| 33 | pmbaty | 961 | /* modified 01/17/09 */ |
| 962 | /* |
||
| 963 | ******************************************************************************* |
||
| 964 | * * |
||
| 965 | * ThreadMalloc() is called from the ThreadInit() function. It malloc's the * |
||
| 966 | * split blocks in the local memory for the processor associated with the * |
||
| 967 | * specific thread that is calling this code. * |
||
| 968 | * * |
||
| 969 | ******************************************************************************* |
||
| 970 | */ |
||
| 971 | extern void *WinMalloc(size_t, int); |
||
| 972 | void ThreadMalloc(int64_t tid) { |
||
| 108 | pmbaty | 973 | int i; |
| 33 | pmbaty | 974 | |
| 108 | pmbaty | 975 | for (i = (int) tid * 64 + 1; i < (int) tid * 64 + 65; i++) { // Pierre-Marie Baty -- added type casts |
| 33 | pmbaty | 976 | if (block[i] == NULL) |
| 977 | block[i] = |
||
| 978 | (TREE *) ((~(size_t) 127) & (127 + (size_t) WinMalloc(sizeof(TREE) + |
||
| 979 | 127, (int) tid))); // Pierre-Marie Baty -- added type cast |
||
| 980 | block[i]->parent = NULL; |
||
| 981 | LockInit(block[i]->lock); |
||
| 982 | } |
||
| 983 | } |
||
| 108 | pmbaty | 984 | # endif |
| 33 | pmbaty | 985 | #endif |