Get into GameDev:
cheat-codes for gamedev interviews
━ Multithreading ━
Introduction
Avoiding Synchronization Structures
Similar to patterns, synchronization structures are sometimes interpreted by students as the solution to the problem when they are often actually the problem themselves. Your goal should be to understand these structures (such as locks and semaphores), but then, in practice, you should strive to design code that does not need them.
Multithreaded programming is about how to design programs to not need locks.
I included the below diagram in the previous Fundamentals chapter, but I think it is worth revisiting now to further emphasize just how badly synchronization structures can ruin our performance. Hopefully by now you can appreciate just how horrible a cache miss is and how slow reading from main memory is... but look at the worst thing on this list! The context switch is even slower than reading from main RAM! A context switch is going to occur every time we wait on a lock, so the moral of the story is that we need to avoid waiting at all costs. In this chapter we will explore a number of these structures, including how to use them and how to avoid using them.
Synchronization Mechanisms
Mutex: “mutual exclusion locking mechanism”, waits if access not granted (example)
Lock Types:
Spinlock: “design alternative to the mutex locking mechanism”, repeats request if not granted
Condition variable: “a mutex with notifications” event-based-check paradigm (example)
Semaphore: “a CV with a count” signaling mechanism (“I am done, you can carry on”)
Other:
- I think of a mutex as NOT a 'lock'; instead it's an object (like a door) that you lock or unlock. You can actually have multiple lock objects operate on one mutex, such as when you want a read-lock and a write-lock Example. std::mutex has an internal unique lock that is accessed via mutexInstance.lock(). std::shared_mutex has an internal unique lock AND an internal shared lock, so that it alone can complete the earlier example. LINK
- a mutex cannot be used as a semaphore because the mutex can only be signaled by the thread that called the wait function
- “recursive” mutexes can be locked more than once by the owning thread, and must be unlocked the same number of times
- if a thread tries to lock a non-recursive mutex that it already owns, it deadlocks
- lock() blocks and only returns when it has the lock; try_lock() returns immediately and can either succeed or fail to obtain the lock.
- When a thread is blocked by a lock, it cannot do anything. The processor has to context switch (an expensive operation) recording and replacing register values, then allowing a different thread to start work on the processor.
- Ticket mutex asserts fairness: https://youtu.be/Zcqwb3CWqs4?t=1868
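A minimal sketch of the lock() vs try_lock() behaviors above (the mutex m and the worker function are hypothetical names):

#include <mutex>

std::mutex m;  // hypothetical mutex guarding some shared resource

void worker()
{
    if (m.try_lock())  // returns immediately: true if we got the lock
    {
        // ... touch the shared resource ...
        m.unlock();
    }
    else
    {
        // didn't get it; do other useful work instead of blocking
    }

    m.lock();          // blocks (possibly context switching) until acquired
    // ... touch the shared resource ...
    m.unlock();
}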
Lock Types:
- An exclusive or write lock: gives a process exclusive access for writing. While a write lock is in place, no other process can lock that part of the file for reading or writing.
- unique_lock<mutex> Example
- A shared or read lock: prohibits any other process from requesting a write lock on the specified part of the file. However, other processes can request read locks.
- shared_lock<mutex> Example
- lock_guard or scoped_lock: locks on construction, then unlocks in its destructor (see the sketch after this list).
- Fine-grain locks or “Striped Locks” (divide the control of an object across subdomains, each with their own lock, such as subarrays of an array): LINK
- Links: Shared vs Exclusive SO Shared vs Exclusive G4G
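A minimal sketch of the RAII locking described above (sharedValue and increment are hypothetical names):

#include <mutex>

std::mutex m;
int sharedValue = 0;  // hypothetical data guarded by m

void increment()
{
    std::lock_guard<std::mutex> guard(m);  // locks m on construction
    ++sharedValue;
}  // guard's destructor unlocks m, even if an exception is thrown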
Spinlock: “design alternative to the mutex locking mechanism”, repeats request if not granted
- A spinlock is kind of the opposite of a mutex: instead of sleeping while it waits, it retries the request repeatedly (see the sketch below)
- Spinning burns CPU, so spinlocks look expensive at a high level, but a mutex is actually often implemented using a short spinlock internally to manage its own state
- Sometimes it's better to use a spinlock than a mutex. For example, if 8 threads are editing the same array and each edit is brief, it is better to have a thread retry again and again for access than to have it wait, because a wait's context switch is very expensive.
- High-contention areas with short hold times can therefore perform better with spinlocks, because the alternative is a high frequency of context switches and waits
- Read LINK
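A minimal spinlock sketch built on std::atomic_flag (the Spinlock name is mine, not a standard type):

#include <atomic>

struct Spinlock
{
    std::atomic_flag flag = ATOMIC_FLAG_INIT;

    void lock()
    {
        // test_and_set returns the previous value: keep retrying
        // (burning CPU, no context switch) until the flag was clear
        while (flag.test_and_set(std::memory_order_acquire))
        {
        }
    }

    void unlock()
    {
        flag.clear(std::memory_order_release);
    }
};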
Condition variable: “a mutex with notifications” event-based-check paradigm (example)
- A mutex only allows you to wait until the lock is available; a condition variable allows you to wait until an arbitrary condition has changed.
- A synchronization primitive that can be used to block a thread, or multiple threads at the same time, until another thread both modifies a shared variable (the condition), and notifies the condition_variable.
- wait() requires a locked mutex. It atomically unlocks the mutex, then awaits notification, re-locking before it returns. LINK
- Steps to modify a CV’s variable (LINK; see the sketch at the end of this section)
- acquire a std::mutex (typically via std::lock_guard)
- perform the modification while the lock is held
- release lock
- execute notify_one or notify_all on the std::condition_variable
- notify_one / notify_all - notifies one/all waiting thread(s)
- The default condition_variable requires a unique_lock; condition_variable_any can be used with any lockable object LINK
- Spurious wake up - sometimes a wait function will exit before it should (a concession to performance), so always re-check that the condition is actually true before you exit the wait loop of a CV.
- cv.wait( locker, [&](){ return !q.empty(); } ); ← waits until notified, the lock is re-acquired, and q is not empty
- Thundering herd: https://en.wikipedia.org/wiki/Thundering_herd_problem
- Fixed by some prioritization, see “Lock Convoy”
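Putting the steps above together, a minimal producer/consumer sketch, assuming a shared std::queue named q as the condition:

#include <condition_variable>
#include <mutex>
#include <queue>

std::mutex m;
std::condition_variable cv;
std::queue<int> q;  // hypothetical shared queue; the "condition" is !q.empty()

void producer(int item)
{
    {
        std::lock_guard<std::mutex> guard(m);  // 1. acquire the mutex
        q.push(item);                          // 2. modify while the lock is held
    }                                          // 3. release the lock
    cv.notify_one();                           // 4. notify one waiting thread
}

int consumer()
{
    std::unique_lock<std::mutex> locker(m);
    cv.wait(locker, [&]() { return !q.empty(); });  // re-checks after spurious wakeups
    int item = q.front();
    q.pop();
    return item;  // locker unlocks m on destruction
}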
Semaphore: “a CV with a count” signaling mechanism (“I am done, you can carry on”)
- a binary semaphore can be used as a Mutex
- a thread that is waiting on a semaphore can be signaled by another thread.
- “Counting” Semaphores are integer-value semaphores
- no value restriction (besides maybe int_max)
- “Binary” Semaphores
- their value is restricted to 0 and 1
- the wait operation only succeeds when the semaphore is 1 (and sets it to 0)
- the signal operation sets the semaphore to 1 (it only has an effect when the semaphore is 0)
- V() = “up”=”signal” and P() = “down”=”wait”
- Ex. a soda machine: the vendor calls up to add a soda (and is notified when the soda count hits 0), while buyers call down to take sodas and must wait if they try to down when the count is already 0.
- Ex. Question and Answer
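A minimal sketch of the soda machine in C++20 using std::counting_semaphore (the capacity of 10 is an arbitrary assumption):

#include <semaphore>

std::counting_semaphore<10> sodas(0);  // hypothetical machine, starts empty

void vendor()         // "up" = "signal" = V()
{
    sodas.release();  // add a soda; wakes one waiting buyer, if any
}

void buyer()          // "down" = "wait" = P()
{
    sodas.acquire();  // take a soda, or block while the count is 0
}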
Other:
- Thread local storage: memory that is accessible only by its owning thread; each thread gets its own independent copy of the variable.
- Some excellent examples here: https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html
- Basically, thread locals can be used as intermediary counters to track progress prior to a merge. This is great because it avoids contended atomic operations (and the cache-line traffic they cause); see the sketch below.
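A minimal sketch of that pattern (g_total, t_localCount, and the function names are hypothetical):

#include <atomic>

std::atomic<long> g_total{0};        // merged into once per thread, not per event
thread_local long t_localCount = 0;  // private to each thread: zero contention

void countEvent()
{
    ++t_localCount;                  // plain increment, no atomic traffic
}

void flushThread()                   // call once when a thread finishes its batch
{
    g_total.fetch_add(t_localCount); // the only contended operation per thread
    t_localCount = 0;
}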
Practice Question: How would you optimize a structure for many readers but few writers?
You can use a shared mutex that differentiates accessor privileges (basically 2 locks).
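A minimal sketch with std::shared_mutex (C++17); data and the function names are hypothetical:

#include <shared_mutex>

std::shared_mutex rwMutex;
int data = 0;  // hypothetical structure with many readers, few writers

int reader()
{
    std::shared_lock<std::shared_mutex> readLock(rwMutex);   // many readers may hold this at once
    return data;
}

void writer(int v)
{
    std::unique_lock<std::shared_mutex> writeLock(rwMutex);  // excludes all readers and writers
    data = v;
}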
Atomicity
An operation is atomic if the data is aligned and the bus is at least as wide as the type being written or read. On x86 and x64, there is no guarantee that reads and writes larger than eight bytes are atomic. Atomic operations are useful because they complete before another thread can interrupt them; this can provide us with an alternative to locks. For example, if we have two threads that are incrementing a value, then we need a lock so that the threads can take turns obtaining ownership of the shared data before editing it. However, if we can ensure that our increment is atomic, then we don't need the lock, because by the time the second thread wants to edit the data, the first thread will be done editing it.
Practice Question: Read through each of the code segments and decide if they are atomic operations or not.
== A ==
DWORD* pData = (DWORD*)(pChar + 1);
*pData = 0;
== B ==
++g_globalCounter;
== C ==
x++;
== D ==
g_alignedGlobal = 0;
== E ==
DWORD local = g_alignedGlobal;
== A ==
// Not Atomic: not natively aligned
DWORD* pData = (DWORD*)(pChar + 1);
*pData = 0;
== B ==
// Not Atomic: three separate operations, more details here: https://stackoverflow.com/questions/48497636/understanding-the-difference-between-i-and-i-at-the-assembly-level
++g_globalCounter;
== C ==
// Not Atomic: like B, this is a read-modify-write (++i vs i++ changes only which value the expression yields, not atomicity)
x++;
== D ==
// Atomic write
g_alignedGlobal = 0;
== E ==
// Atomic read
DWORD local = g_alignedGlobal;
Atomic Operations
Which operations are atomic will vary based on the hardware you are running on, but typically you will have access to a few atomic APIs such as the ones below, implemented thanks to hardware support. These functions typically also act as sequentially consistent fences.
- Fetch-and-Add FAA
- Takes in an address and an increment
- Increments the value at address, and returns the original value that was stored at the address
- Fetch-and-Subtract
- Atomic-Exchange
- InterlockedIncrement ← works like ++i (returns the new value) but is atomic
- Compare-and-Swap CAS
- Takes in an address, a new value, and a comparison value
- Swaps the value at the address with the new value if the value found at the address equals the comparison value
- For CAS, beware of the ABA problem, wherein a value may be changed and then changed back before the comparison
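As a sketch of how CAS gets used, here is fetch-and-add built out of compare_exchange_weak (the function name is mine):

#include <atomic>

// Retry until no other thread changed the value between our read and our write.
int fetchAndAdd(std::atomic<int>& value, int increment)
{
    int expected = value.load();
    while (!value.compare_exchange_weak(expected, expected + increment))
    {
        // on failure, expected was refreshed with the current value; try again
    }
    return expected;  // the original value, just like FAA
}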
Atomic Types
Atomic types are types that promise to perform all or a subset of their operations as atomic operations. Atomic types are not necessarily lock free, since you can mark nearly any type as 'atomic' and large types fall back to an internal lock.
- std::atomic<T>::is_lock_free() checks if it has a lock-free implementation
- Java calls them synchronized sections, C++ calls them critical sections “CS’s”
- Critical sections must block system interrupts because the interrupt handler wouldn’t be able to handle the locks
- Atomics are not free, beware: contended atomic operations serialize access (and non-lock-free ones take a real lock). This means that objects which lean on them (such as shared_ptr's reference count) can become bottlenecks.
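A quick sketch of the is_lock_free() check from above (the Big struct is a made-up example of a type too large for lock-free hardware support):

#include <atomic>
#include <cstdio>

struct Big { char bytes[64]; };  // hypothetical large (but trivially copyable) type

int main()
{
    std::atomic<int> smallAtomic{0};
    std::atomic<Big> bigAtomic{};

    std::printf("int lock-free: %d\n", smallAtomic.is_lock_free());  // typically 1
    std::printf("Big lock-free: %d\n", bigAtomic.is_lock_free());    // typically 0 (internal lock)
}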
Compiler Reordering
The order of reads and writes may be changed by a compiler (breaking sequential consistency). To combat this problem, we can use read-acquire and write-release barriers or mark variables as volatile.
Processor Reordering: Same thing, but now with the processor. We must use fences to prevent this kind.
- LFence - ensures prior reads (aka Loads) are done
- SFence - ensures prior writes (aka Stores) are done
- MFence - ensure prior reads & writes are done
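On x86/x64 these map to compiler intrinsics; a minimal sketch (portable C++ would instead use std::atomic_thread_fence):

#include <emmintrin.h>  // SSE2 intrinsics (pulls in the SSE header for _mm_sfence)

void fences()
{
    _mm_lfence();  // LFENCE: prior loads complete before any later loads
    _mm_sfence();  // SFENCE: prior stores complete before any later stores
    _mm_mfence();  // MFENCE: all prior loads AND stores complete first
}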
Xbox guidelines: https://learn.microsoft.com/en-us/windows/win32/dxtecharts/lockless-programming
Atomic vs Locks (mutexes)
- To make lock-free data structures, first remove the locks, then re-design the data flow to use atomic APIs. LINK
- Locks are nice because they’re easier to think about, and a blocked thread can be context-switched out so the core stays useful while waiting, whereas atomics just busy-wait (spin-lock).
- Atomics (lock-free):
  - Leverage processor support like CAS and don't use locks at all
  - Don't wait; instead they keep trying until success (so-called busy-waiting)
  - Don't incur context-switching overhead, but don't free up CPU resources either
  - Faster if contention between threads is sufficiently low
  - A stalled thread does not stop others from making progress
- Locks (mutexes):
  - More OS-dependent, and perform differently on, for example, Windows and Linux
  - May prevent other threads from making progress
  - Suspend thread execution,
    - freeing up CPU resources for other tasks,
    - but incurring context-switching overhead when stopping/restarting the thread
  - With locks, we can make a tiered system of readers and writers
  - Much simpler than lockless data structures
Practice Question: What is the maximum value of a signed 32-bit integer?
An unsigned int-32 would have all 32 bits to represent its value, but we need to remember to subtract one because one combination of bits is used to represent 0. Therefore the max value of a uint-32 is 2^32-1.
But I asked for a signed int-32, which requires the first bit to represent the sign. Therefore the max value of a signed int-32 is 2^31-1.
Additional reading:
https://stackoverflow.com/questions/94591/what-is-the-maximum-value-for-an-int32
Multithreading Data Structures
In a previous chapter we explored common data structures and their tradeoffs. Now we will revisit the stack and list to explore how they can be converted into multithreaded structures. It is common in interviews to ask candidates to code a lockless list's insertion/deletion or a lockless stack's push/pop.
Lockless List Insertion
Insertion (to add newElem right after currElem): LINK
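A minimal sketch of single-CAS insertion (the Node layout and function name are my assumptions):

#include <atomic>

struct Node
{
    int value;
    std::atomic<Node*> next;
};

// Insert newElem right after currElem. The CAS only succeeds if currElem->next
// still equals what we read (i.e., nobody inserted or deleted in between);
// on failure, oldNext is refreshed and we retry.
void insertAfter(Node* currElem, Node* newElem)
{
    Node* oldNext = currElem->next.load();
    do
    {
        newElem->next.store(oldNext);
    } while (!currElem->next.compare_exchange_weak(oldNext, newElem));
}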
Lockless List Deletion
For deletion, one CAS is not enough (why? a single CAS on the previous node's next pointer can race with a concurrent insert after the dying node, losing the new element). Instead, we separate logical and physical deletion.
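A minimal sketch of the two-phase idea (after Harris's lock-free list; storing the mark in the pointer's low bit is a common trick, and all names here are assumptions):

#include <atomic>
#include <cstdint>

struct Node { int value; std::atomic<uintptr_t> next; };  // low bit of next = "deleted" mark

bool isMarked(uintptr_t p) { return p & 1; }

void removeAfter(Node* prev, Node* victim)
{
    // Phase 1 (logical delete): mark victim->next so no thread can
    // successfully CAS an insertion after a dying node.
    uintptr_t succ = victim->next.load();
    while (!isMarked(succ) &&
           !victim->next.compare_exchange_weak(succ, succ | 1))
    {
        // succ was refreshed; keep trying until the mark is set
    }

    // Phase 2 (physical delete): unlink the node. If this CAS fails,
    // another traversal will observe the mark and finish the unlink.
    uintptr_t expected = (uintptr_t)victim;
    prev->next.compare_exchange_strong(expected, succ & ~uintptr_t(1));
}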
Lockless Stack Push & Pop
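A minimal sketch of the classic CAS-based push and pop (a Treiber stack); the pop here is the naive, ABA-prone version the question below refers to:

#include <atomic>

struct Node { int value; Node* next; };

std::atomic<Node*> g_top{nullptr};

void push(Node* n)
{
    n->next = g_top.load();
    // On failure, n->next is refreshed with the current top and we retry.
    while (!g_top.compare_exchange_weak(n->next, n))
    {
    }
}

Node* pop()
{
    Node* top = g_top.load();
    // Swing top to top->next, unless another thread popped/pushed first.
    while (top && !g_top.compare_exchange_weak(top, top->next))
    {
    }
    return top;  // nullptr if the stack was empty
}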
Practice Question: So there are some issues with the pop implemented above. A single CAS is actually not enough to provide ABA problem protection. The ABA problem is described in detail here: wikipedia.org/wiki/ABA_problem. How could we augment the pop implementation to protect against ABA?
The pop could use hazard pointers, or pair the top pointer with a version counter that is bumped on every push/pop (so the CAS compares pointer + counter together), to prevent the ABA race.
Here is a lecture on the subject: https://www.youtube.com/watch?v=rLvf7Xxh2zg&ab_channel=JonasSkeppstedt
Additional Topics
False Sharing - When data is not well laid out for multiprocessing, a processor may unintentionally grab data that a second processor needs. Simultaneous updates of individual elements in the same cache line coming from different processors invalidate the entire cache line, even though these updates are logically independent of each other. Each update of an individual element of a cache line marks the line as invalid. Other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a more recent copy of the line from memory or elsewhere, even though the element accessed has not been modified. This is because cache coherency is maintained on a cache-line basis, and not for individual elements. As a result there will be an increase in interconnect traffic and overhead. Also, while the cache-line update is in progress, access to the elements in the line is inhibited. (Source Link)
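The usual fix is to pad/align hot per-thread data onto separate cache lines; a minimal sketch assuming a 64-byte line size:

#include <atomic>

// Without the alignas, both counters would share one cache line, and two
// threads each incrementing "their own" counter would invalidate each
// other's line on every write.
struct Counters
{
    alignas(64) std::atomic<int> threadA{0};  // gets its own cache line
    alignas(64) std::atomic<int> threadB{0};  // gets its own cache line
};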