User:Arash

Revision as of 05:53, 24 June 2011

Arash Sadrieh is working on developing GPU-based solvers for ASCEND. He is a PhD student at Murdoch University in Western Australia.

Development branch: arash:

Goals

GSOC-2011 Goals

Complete the current prototype.
Implement the batch multi-vector residual evaluator
Integrate the approach to QRCUDA
Integrate the QRCUDA into the ASCEND GUI.
Test the project with different hardware and software platforms.

Project Plan

Complete the current prototype.
- Clear step-by-step instructions allowing a new user to set up and test/use the solver
- General architecture improvement
- Move the initialization and shutdown tasks from the unit test to the “QRCUDA.c”.
- Fix the distillation case study, the current model is unsolvable.
- Optimise the CUDA code
  - Change the kernels' memory access pattern to coalesced access
  - Store the mapping information in fast texture or constant memory
  - Change the memory management model from standard (pageable) host memory to pinned (page-locked) memory. This makes memory transfers between host and device faster.
- Implement hybrid CPU/GPU based evaluation instead of GPU-based evaluation. By doing this, the CPU can be used for the small equation groups while the GPU is busy evaluating the large groups.
  - support for models containing 'external relations'
- Prepare a multi platform Makefile to compile and build BinCUDAs
- Complete the external functions in “btcudapl.cu”
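
The hybrid CPU/GPU item in the list above amounts to splitting the equation groups by size. The following is a minimal sketch in plain Python, not the ASCEND code: the function name partition_groups, the parameter gpu_threshold and its default value of 64 are all assumptions made for illustration.

```python
def partition_groups(group_sizes, gpu_threshold=64):
    """Split equation-group indices into (cpu, gpu) lists by group size.

    group_sizes   -- number of equations in each group
    gpu_threshold -- hypothetical cut-off: groups with at least this many
                     equations are assigned to the GPU, the rest to the CPU
    """
    cpu, gpu = [], []
    for index, size in enumerate(group_sizes):
        if size >= gpu_threshold:
            gpu.append(index)
        else:
            cpu.append(index)
    return cpu, gpu


# Example: five groups of very different sizes.
sizes = [3, 500, 12, 2048, 64]
cpu_groups, gpu_groups = partition_groups(sizes)
print("CPU groups:", cpu_groups)
print("GPU groups:", gpu_groups)
```

In a real solver the threshold would have to be tuned by benchmarking, since the break-even point depends on transfer cost and kernel launch overhead.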

Implement the batch multi-vector residual evaluator
- Define the heuristic formula for multi-vector residual evaluator
- Research all of the variations of Armijo's rule (Grippo et al., 1986)
- Convert current kernels from 2D kernels into 3D; the extra dimension is used for each input vector.
- Implement the heuristic formula in the kernels
- Implement a separate kernel that finds the lowest residual norm and returns its index
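
The two kernel-related items above (one extra dimension per input vector, then a search for the lowest residual norm) can be pictured with a CPU-only sketch. It is written in plain Python for readability. The toy residual function and the names residuals, batch_norms and index_of_lowest are assumptions, not part of ASCEND or QRCUDA.

```python
import math


def residuals(x):
    """Toy two-equation model standing in for a real set of relations (assumed)."""
    return [x[0] ** 2 + x[1] - 3.0, x[0] - x[1]]


def residual_norm(x):
    """Euclidean norm of the residual vector at point x."""
    return math.sqrt(sum(r * r for r in residuals(x)))


def batch_norms(vectors):
    """One residual norm per input vector.

    On a GPU, this loop over vectors would correspond to the extra
    kernel dimension mentioned in the plan.
    """
    return [residual_norm(v) for v in vectors]


def index_of_lowest(norms):
    """Index of the smallest norm (the first one, if there are ties)."""
    return min(range(len(norms)), key=norms.__getitem__)


candidates = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
norms = batch_norms(candidates)
print("norms:", norms)
print("best index:", index_of_lowest(norms))
```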

Integrate the approach to QRCUDA
- Add block evaluation feature to batch single-vector evaluator.
- Modify standard residual/gradient evaluator to use new single-vector evaluator.
- Integrate batch multi-vector evaluator into QRCUDA line search.
- Modify current line search algorithm to use the batch multi-vector evaluator.
- Benchmark the results.
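
To illustrate how a line search could consume a batch multi-vector evaluator, here is a hedged sketch of backtracking with the classic (monotone) Armijo sufficient-decrease condition, where all trial points are built before any is evaluated. It is not the QRCUDA line search and not the nonmonotone variant of Grippo et al.; the names armijo_batch and sum_of_squares and the parameter defaults are assumptions.

```python
def sum_of_squares(x):
    """Toy merit function (assumed); a solver would use a residual-based merit."""
    return sum(xi * xi for xi in x)


def armijo_batch(f, x, direction, slope, c=1e-4, beta=0.5, n_trials=8):
    """Return the largest step alpha = beta**k (k < n_trials) that satisfies

        f(x + alpha * p) <= f(x) + c * alpha * slope

    or None if no trial step qualifies. `slope` is the directional
    derivative grad f(x) . p and must be negative for a descent direction.
    """
    f0 = f(x)
    alphas = [beta ** k for k in range(n_trials)]
    # Build every trial point first so they could be evaluated as one batch.
    trials = [[xi + a * pi for xi, pi in zip(x, direction)] for a in alphas]
    values = [f(t) for t in trials]  # stand-in for a batched GPU evaluation
    for a, v in zip(alphas, values):
        if v <= f0 + c * a * slope:
            return a
    return None


# Example: start at (1, 1) with an overshooting descent direction.
x0 = [1.0, 1.0]
p = [-4.0, -4.0]
slope = -16.0  # gradient (2, 2) dotted with p
print("accepted step:", armijo_batch(sum_of_squares, x0, p, slope))
```

Evaluating all trial steps at once spends more function evaluations than sequential backtracking, which is only worthwhile when the batch costs about the same as a single evaluation.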

Integrate the QRCUDA into the ASCEND GUI.
- Fix the Bintoken unloading bug
- Fix Bintoken auto rebuild sensing feature in the PyGTK
- Add GUI menus and dialogs
  - Ensure all required user-configurable parameters are exposed through the solver API
  - Implement testing of CUDA hardware availability when the solver is first loaded; only make QRCUDA available if the tests succeed, and give the user feedback if they fail.

Test the project with different hardware and software platforms.
- testing of memory leakage and stability.

Progress

After 23-May
- The GPU memory management model was changed from standard to PINNED. This makes data transfer between host and device about twice as fast.
- Batch evaluator can now perform hybrid CPU/GPU evaluations so that the CPU can be used for small equation groups while the GPU is busy evaluating the large groups.
- The benchmark model was modified slightly so it is now solvable.
After 6-June
- Cleanup in the prototype
- The GPU init and shutdown methods were moved into QRCUDA.
- The dependency on the common makefile and headers (located in the SDK samples) was removed.
- The Linux version of BinCUDA's makefile was created (Windows and Mac OS versions are coming soon).
- A test case for QRCUDA was implemented.
- A new form was added to the main GUI that shows some information about the CUDA-enabled devices currently in the system (speed, number of cores, max memory, number of multiprocessors, ...).
- The BinCUDA unload bug was fixed during the cleanup.
After 16-June
- More clean-ups in the BinCUDAs.
- The active block evaluation mechanism was added to the batch evaluator.
- QRCUDA is now using GPU-based model evaluation for the residual evaluation in large blocks (the code was tested on arash:models/test/bintok/test2.a4c, more testing is required).
- QRCUDA was tested with arash:models/test/bintok/larg_distil.a4c and, after some bug fixes, the GPU evaluator results are now identical to those obtained from the standard calc_residuals method.

Ideas and Issues

A list of ideas and issues with the current implementation is provided as follows (comments and critiques are greatly appreciated):

In the batch evaluator (relman.c:relman_batch_eval), 60% of the total time is consumed by the rel_set_residual() calls. How can we optimize this function?
Can the solver provide cheap feedback to the user showing the degree of parallelism that was achieved during a particular model solution?
Sometimes QRSlv makes use of a Brent solver for blocks with a single equation. Is that the best approach when a GPU is available?
More large demonstration models are needed. Let's go and find some.

Installation

To run BinCUDA objects, the host machine must have an NVIDIA CUDA-enabled GPU card (preferably Fermi or a more recent architecture). The card must be able to perform 'double' floating-point calculations (compute_13+). In addition to the GPU hardware, the CUDA SDK and developer driver must be installed on the host machine, and BinCUDA's Makefile must be pointed at the SDK directory.
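
The double-precision requirement can be expressed as a simple comparison on the device's compute capability. The sketch below is plain Python and does not query any hardware. The helper names parse_arch and supports_double are hypothetical, and the parser assumes an architecture name of the form compute_XY with a one-digit minor version.

```python
def parse_arch(name):
    """Turn a name such as 'compute_13' into a (major, minor) tuple.

    Assumes the 'compute_' prefix followed only by digits, the last
    digit being the minor version.
    """
    prefix = "compute_"
    if not name.startswith(prefix):
        raise ValueError("unexpected architecture name: " + name)
    digits = name[len(prefix):]
    if len(digits) < 2 or not digits.isdigit():
        raise ValueError("unexpected architecture name: " + name)
    return int(digits[:-1]), int(digits[-1])


def supports_double(major, minor):
    """True if compute capability major.minor is 1.3 or later."""
    return (major, minor) >= (1, 3)


for arch in ("compute_11", "compute_13", "compute_20"):
    print(arch, "double precision:", supports_double(*parse_arch(arch)))
```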

Installing CUDA SDK on Linux

The following step-by-step instructions explain how to install the CUDA SDK on a 32-bit Ubuntu 10.04 machine. The installation process on other flavors of Linux is quite similar; however, the Ubuntu 10.04 32-bit file addresses should be replaced with those of the equivalent distribution files from the NVIDIA website.

1) In the terminal window issue
