CS3984 Computer Systems in Rust



Exercise 3

Submission

Submission deadline: 11:59 pm, Thursday, October 31st 2024.

Submission method: Run the included submit.py script in your root exercise repository. You will need Python and Requests installed. If you do not have Python and/or Requests, you can either:

This exercise contains two parts, Part 1 and Part 2.

Part 1

Summary

  1. Implement a serial Counting Codes solution in Rust.
  2. Implement a parallel version using scoped threads and an atomic variable.
  3. Implement a parallel version using regular threads and thread handles.
  4. Implement a parallel version using task parallelism with the rayon crate.
  5. Benchmark your implementations using hyperfine.

Details

  1. Clone the exercise repository.
  2. In the part-1-countingcodes folder, implement the programs described in the following sections, each in its corresponding file.

Serial Implementation

You will implement a single-threaded Rust solution for Counting Codes.

You will not need to come up with a solution to the problem; reference solutions in C and C++ are given in the reference folder. You should compile, test, and benchmark these binaries just as you would your Rust solutions.

$ gcc -O3 reference/countingcodes.c -o countingcodes_c
$ g++ -O3 reference/countingcodes.cpp -o countingcodes_cpp


The reference solution given uses a recursive, brute-force, backtracking algorithm. You do not need to modify the algorithm used; a 1-1 translation of the code will suffice.

Observe the following hints:

  • You will find it helpful to structure your Rust solution similarly to the C++ solution.
  • Use equivalent Rust types for the C++ data structures, e.g., Vec for std::vector (see the sketch below).
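
As a rough illustration of this structural mapping, below is a minimal, hypothetical sketch. The type names, field names, grid size, and constraint handling are placeholders rather than details taken from the reference solution; mirror the actual C++ code for those.

struct Solver {
    n: usize,
    b: Vec<Vec<u8>>,        // plays the role of the C++ grid B
    inrow: Vec<Vec<bool>>,  // plays the role of the C++ inrow table
}

impl Solver {
    // Corresponds to the C++ backtrack(i, j). Only a row-uniqueness check is
    // shown; take the real pruning logic from the reference solution.
    fn backtrack(&mut self, i: usize, j: usize) -> u64 {
        if i == self.n {
            return 1; // assumed base case: every cell has been filled
        }
        let (ni, nj) = if j + 1 == self.n { (i + 1, 0) } else { (i, j + 1) };
        let mut cnt = 0;
        for d in 1..=9usize {
            if !self.inrow[i][d] {
                // Place the digit, recurse, then undo, as in the reference.
                self.inrow[i][d] = true;
                self.b[i][j] = d as u8;
                cnt += self.backtrack(ni, nj);
                self.inrow[i][d] = false;
                self.b[i][j] = 0;
            }
        }
        cnt
    }
}

fn main() {
    let n = 2; // toy size chosen only so this sketch runs instantly
    let mut s = Solver {
        n,
        b: vec![vec![0; n]; n],
        inrow: vec![vec![false; 10]; n],
    };
    println!("{}", s.backtrack(0, 0));
}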

Scoped and Atomic

You will implement a multi-threaded Rust solution for Counting Codes. Read the following carefully:

  1. The provided algorithm computes the solution through brute-force:

    For each possible digit d from 1 to 9,

    for (int d = 1; d <= 9; d++) {
    
    
    

    the digit is placed onto the grid if possible,

    inrow[i][d] = true;
    B[i][j] = d;
    
    
    

    then the count with the placed digit is computed through recursion:

    cnt += backtrack(i, j+1);
    
    
    
  2. A simple multi-threaded solution would be to spawn a new thread for each recursive call.

    This requires cloning the data structures used (with the placed digit), then spawning a thread that takes ownership of the cloned data and solves that subproblem.

  3. Your solution needs to keep track of the total count generated by all of the threads spawned.

    Use an atomic integer for this purpose. The atomic can be safely shared and passed to all threads spawned.

  4. Your solution needs to know when all threads have finished brute-forcing their subtask.

    This means all threads should be spawned in a std::thread::scope. The scope should be created in your main function, and the first thread (running the task representing the entire problem) should be spawned inside that scope. Since a scope implicitly joins all threads spawned within it before returning, the value of the atomic integer after the scope ends will be the final count (the solution).

  5. Your solution should limit the number of threads spawned to avoid the overhead associated with creating too many threads.

    Use an integer to track the current recursion depth; fall back to a regular recursive call once the depth reaches a certain level (see the sketch after this list).
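
Below is a minimal sketch of this scoped-thread pattern, not a complete solution. The Vec<u8> state, the serial_backtrack helper, and the MAX_SPAWN_DEPTH cutoff are all placeholders standing in for your cloned data structures, your single-threaded routine, and a value you tune experimentally.

use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

const MAX_SPAWN_DEPTH: usize = 2; // placeholder cutoff; tune for your machine

fn backtrack<'scope, 'env>(
    scope: &'scope thread::Scope<'scope, 'env>,
    total: &'scope AtomicU64,
    state: Vec<u8>, // placeholder for the real cloned board/bookkeeping data
    depth: usize,
) {
    if depth >= MAX_SPAWN_DEPTH {
        // Past the cutoff, fall back to ordinary recursion and record the count.
        total.fetch_add(serial_backtrack(&state), Ordering::Relaxed);
        return;
    }
    for d in 1..=9u8 {
        // Clone the state with the candidate digit placed, then hand the clone
        // to a new scoped thread that owns it.
        let mut child = state.clone();
        child.push(d);
        scope.spawn(move || backtrack(scope, total, child, depth + 1));
    }
}

fn serial_backtrack(_state: &[u8]) -> u64 {
    1 // placeholder: your serial implementation's count for this subproblem
}

fn main() {
    let total = AtomicU64::new(0);
    thread::scope(|s| {
        let total = &total;
        // The first thread, representing the entire problem, is spawned here.
        s.spawn(move || backtrack(s, total, Vec::new(), 0));
    });
    // The scope has joined every spawned thread, so the atomic holds the answer.
    println!("{}", total.load(Ordering::Relaxed));
}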

Regular Threads

Implement a multi-threaded solution using regular threads; it should provide a speedup over the scoped-and-atomic version. Read the following carefully:

  1. Notice how after spawning off all tasks (or finishing all recursive calls), the thread currently in backtrack has no more work to do.
  2. Therefore, a simplification of your multi-threaded solution would be to spawn regular threads when spawning new tasks, placing their handles into a vector.
  3. After spawning all tasks (or finishing all recursive calls), the thread should join all handles to obtain the final counts from the tasks spawned.
  4. The backtrack function can then return the count directly, similarly to the serial implementation (a sketch follows this list).
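
One possible shape for this is sketched below; as before, the Vec<u8> state and serial_backtrack are placeholders for your real data structures and serial routine, and MAX_SPAWN_DEPTH is an assumed cutoff.

use std::thread;

const MAX_SPAWN_DEPTH: usize = 2; // placeholder cutoff; tune for your machine

fn backtrack(state: Vec<u8>, depth: usize) -> u64 {
    if depth >= MAX_SPAWN_DEPTH {
        return serial_backtrack(&state);
    }
    let mut handles = Vec::new();
    for d in 1..=9u8 {
        let mut child = state.clone();
        child.push(d);
        // Each subtask becomes a regular (non-scoped) thread; its handle will
        // yield that subtask's count when joined.
        handles.push(thread::spawn(move || backtrack(child, depth + 1)));
    }
    // With no work of its own left, this thread joins the handles and sums the
    // counts, so backtrack returns a count just like the serial version.
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn serial_backtrack(_state: &[u8]) -> u64 {
    1 // placeholder: your serial implementation's count for this subproblem
}

fn main() {
    println!("{}", backtrack(Vec::new(), 0));
}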

Task Parallelism

The rayon crate provides high-level abstractions for data parallelism, which we will use to parallelize the tasks the algorithm performs. This is primarily achieved with parallel iterators, which you will use in this version of your Counting Codes solution. Read the following carefully:

  1. In your Regular Threads implementation, a thread is spawned for each task, with the handle to each thread placed in a vector.
  2. This version should instead place the task itself into the vector.
  3. After all tasks are created, use a parallel iterator provided by rayon to run all tasks in parallel, summing the counts returned by the tasks (see the sketch after this list).
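
A hypothetical sketch of that shape follows, representing each task as a boxed closure (one reasonable choice among several) and assuming rayon is already listed as a dependency in Cargo.toml; the state vector and serial_backtrack remain placeholders.

use rayon::prelude::*;

const MAX_SPAWN_DEPTH: usize = 2; // placeholder cutoff; tune for your machine

type Task = Box<dyn FnOnce() -> u64 + Send>;

fn backtrack(state: Vec<u8>, depth: usize) -> u64 {
    if depth >= MAX_SPAWN_DEPTH {
        return serial_backtrack(&state);
    }
    // Collect the subtasks themselves instead of spawning a thread per subtask.
    let mut tasks: Vec<Task> = Vec::new();
    for d in 1..=9u8 {
        let mut child = state.clone();
        child.push(d);
        tasks.push(Box::new(move || backtrack(child, depth + 1)));
    }
    // rayon's parallel iterator runs the tasks on its thread pool and sums the
    // counts they return.
    tasks.into_par_iter().map(|task| task()).sum()
}

fn serial_backtrack(_state: &[u8]) -> u64 {
    1 // placeholder: your serial implementation's count for this subproblem
}

fn main() {
    println!("{}", backtrack(Vec::new(), 0));
}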

Testing

Test inputs and expected outputs are provided in the tests/ directory. For each test, the input is provided in the *.in file, and the expected output is provided in the *.ans file.

To run all the tests against a particular binary, run the provided test_countingcodes.py script.

Benchmark

You will create a writeup called writeup.md in the part-1-countingcodes directory. The performance metric you will measure is the execution time of the programs.

You will use the command-line tool hyperfine to perform the benchmark. A local copy of the tool on rlogin is available at ~cs3214/bin/hyperfine. Read the output of hyperfine --help or the README of the project to learn how to use the tool.

Your writeup should contain at least the following:

  • Statistics from the execution of all 4 implementations of your Counting Codes solution.
  • Analysis of the difference in execution time between your solutions. You may find the flamegraph utility helpful for supporting your claims.

Requirements

  • You must benchmark the release (optimized) version of your implementations.
  • You should benchmark with the 4x4_zero test, as it runs long enough to limit measurement noise yet is short enough not to require a coffee break each run.
  • Your serial implementation must be at most 1.5x slower than the provided C and C++ implementations.
  • Your slowest parallel implementation must be at least 2x faster than your serial implementation.

Part 2

You will implement a multi-producer, multi-consumer channel in safe Rust.

Details

  1. You should already have the exercise repository cloned.
  2. In the part-2-channel folder, fill in the todo!() statements.
  3. Implement a multi-consumer version of std::sync::mpsc, with a similar API. Read the documentation included for the corresponding function and types in std::sync::mpsc, as well as the tests to figure out the details of your implementation.

Testing

An integration test is provided in the tests directory. To run the tests, run cargo test (or if you have nextest installed, cargo nextest run).

Requirements and Hints

  • You must ensure you do not publicly expose any unnecessary (internal) fields or methods. Hint: See the documentation on visibility keywords.
  • Use a std::collections::VecDeque for efficient pushing and popping from both ends of the queue.
  • Hint: Use a mutex to protect access to the underlying shared queue.
  • The Receiver must not busy-wait when attempting to retrieve an item from the queue. Hint: Use a condition variable.
  • You cannot use the std::sync::mpsc library in your solution; use the VecDeque previously mentioned and standard library concurrency primitives (a sketch of how these pieces can fit together follows this list).
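
Putting those hints together, the sketch below shows one hypothetical way the shared internals could be wired up. The names and exact API are illustrative only; the real signatures must follow the provided skeleton and the std::sync::mpsc documentation, and disconnect handling (which std::sync::mpsc reports through Result-returning send and recv) is omitted here.

use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};

// Shared internals: the queue lives behind a Mutex, and a Condvar lets
// receivers block instead of busy-waiting. Fields stay private to the module.
struct Shared<T> {
    queue: Mutex<VecDeque<T>>,
    available: Condvar,
}

pub struct Sender<T> {
    shared: Arc<Shared<T>>,
}

pub struct Receiver<T> {
    shared: Arc<Shared<T>>,
}

pub fn channel<T>() -> (Sender<T>, Receiver<T>) {
    let shared = Arc::new(Shared {
        queue: Mutex::new(VecDeque::new()),
        available: Condvar::new(),
    });
    (Sender { shared: Arc::clone(&shared) }, Receiver { shared })
}

impl<T> Sender<T> {
    pub fn send(&self, value: T) {
        // Push under the lock, then wake one receiver that may be waiting.
        self.shared.queue.lock().unwrap().push_back(value);
        self.shared.available.notify_one();
    }
}

impl<T> Receiver<T> {
    pub fn recv(&self) -> T {
        let mut queue = self.shared.queue.lock().unwrap();
        loop {
            // Re-check the condition after every wakeup to guard against
            // spurious wakeups and races with other consumers.
            if let Some(value) = queue.pop_front() {
                return value;
            }
            queue = self.shared.available.wait(queue).unwrap();
        }
    }
}

// Cloning a Sender or Receiver only clones the Arc, so every handle shares
// the same queue (multi-producer, multi-consumer).
impl<T> Clone for Sender<T> {
    fn clone(&self) -> Self {
        Sender { shared: Arc::clone(&self.shared) }
    }
}

impl<T> Clone for Receiver<T> {
    fn clone(&self) -> Self {
        Receiver { shared: Arc::clone(&self.shared) }
    }
}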