ASCEND - User contributions [en]

User:Arash

2011-08-21T11:51:04Z

Arash: /* compiler_bincuda.qrcuda/mwcol/bcacol */

'''Arash Sadrieh''' is working on developing GPU-based solvers for ASCEND. He is a PhD student at Murdoch University in Western Australia.

Development branch: {{srcbranchdir|arash|}}

== Goals ==

GSOC-2011 Goals

* Complete the current prototype.
* Implement the batch multi-vector residual evaluator.
* Integrate the approach into QRCUDA.
* Integrate QRCUDA into the ASCEND GUI.
* Test the project with different hardware and software platforms.

== Project Plan ==
* Complete the current prototype.
** Clear step-by-step instructions allowing a new user to set up and test/use the solver
** General architecture improvement
** Move the initialization and shutdown tasks from the unit test to the “QRCUDA.c”.
** Fix the distillation case study; the current model is unsolvable.
** Optimise the CUDA code
*** Change the kernels' memory access pattern to coalesced access
*** Change the memory management model from standard (pageable) memory to pinned memory. This makes memory transfers between host and device faster.
** Implement hybrid CPU/GPU evaluation instead of GPU-only evaluation, so that the CPU can handle the small equation groups while the GPU is busy evaluating the large groups.
** Prepare a Makefile to compile and build BinCUDAs

* Implement the batch multi-vector residual evaluator
** Define the heuristic formula for multi-vector residual evaluator
** Research all of the variations of Armijo's rule (Grippo et al., 1986)
** Convert current kernels from 2D kernels into 3D; the extra dimension is used for each input vector.
** Implement the heuristic formula in the kernels
** Implement a separate kernel that finds the lowest residual norm and returns its index

* Integrate the approach into QRCUDA
** Add a block evaluation feature to the batch single-vector evaluator.
** Modify the standard residual/gradient evaluator to use the new single-vector evaluator.
** Integrate the batch multi-vector evaluator into the QRCUDA line search.
** Modify the current line search algorithm to use the batch multi-vector evaluator.
** Benchmark the results.

* Integrate QRCUDA into the ASCEND GUI.
** Fix the Bintoken unloading bug
** Fix the Bintoken auto-rebuild sensing feature in PyGTK
** Add GUI menus and dialogs
*** Ensure all required user-configurable parameters are exposed through the solver API
*** Implement testing of CUDA hardware availability when the solver is first loaded; only make QRCUDA available if the tests succeed, and give the user feedback if they fail.

* Test the project with different hardware and software platforms.
** Test for memory leaks and stability.
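
The planned kernel that finds the lowest residual norm among the evaluated vectors can be illustrated with a CPU reference in C. This is only a sketch and not the QRCUDA code: the function name <code>min_sqnorm_index</code> and the row-major layout (one row of residuals per input vector) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * CPU reference sketch (hypothetical name and data layout).
 * residuals holds n_vectors rows of n_rel residuals each, row-major.
 * Returns the index of the row with the smallest squared norm and,
 * if min_sqnorm is not NULL, stores that squared norm there.
 */
size_t min_sqnorm_index(const double *residuals, size_t n_vectors,
                        size_t n_rel, double *min_sqnorm)
{
    size_t best = 0;
    double best_val = 0.0;

    assert(residuals != NULL && n_vectors > 0);

    for (size_t v = 0; v < n_vectors; ++v) {
        double sum = 0.0;
        for (size_t r = 0; r < n_rel; ++r) {
            double e = residuals[v * n_rel + r];
            sum += e * e;               /* accumulate squared residuals */
        }
        if (v == 0 || sum < best_val) { /* keep the first minimum found */
            best = v;
            best_val = sum;
        }
    }
    if (min_sqnorm != NULL)
        *min_sqnorm = best_val;
    return best;
}
```

A reference like this is mainly useful for checking a GPU reduction in a unit test, since both should report the same index for the same residual data.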

== Progress ==
* After 23-May
** The GPU memory management model was changed from standard to pinned memory. This makes data transfer between host and device twice as fast.
** Batch evaluator can now perform hybrid CPU/GPU evaluations so that the CPU can be used for small equation groups while the GPU is busy evaluating the large groups.
** The benchmark model was modified slightly so it is now solvable in mass balance mode.
* After 6-June
** Cleanup in the prototype
** The GPU init and shutdown methods were moved into QRCUDA.
** The dependency on the common makefile and headers (located in the SDK samples) was removed.
** The Linux version of the BinCUDA makefile was created (Windows and Mac OS versions are coming soon).
** A test case for QRCUDA was implemented.
** A new form was added to the main GUI that shows some information about the CUDA-enabled devices currently in the system (speed, number of cores, max memory, number of multiprocessors, ...).
** The BinCUDA unload bug was fixed during the clean-up.
* After 16-June
** More clean-ups in the BinCUDAs.
** The active block evaluation mechanism was added to the batch evaluator.
** QRCUDA now uses GPU-based model evaluation for the residual evaluation in large blocks (the code was tested on {{srcbranch|arash|models/test/bintok/bincuda/test2.a4c}}; more testing is required).
** QRCUDA was tested with {{srcbranch|arash|models/test/bintok/bincuda/larg_distil.a4c}} and, after some bug fixes, the GPU evaluator results are now identical to those obtained from the standard calc_residuals method.
* After 26-June
** The test case was modified to solve the distillation model in both mass balance and energy balance modes.
** Performance analyses with valgrind and gprof.
** A bug in PyGTK was fixed so that the system is now re-analyzed after methods are executed.
** QRCUDA solved its first large model (31733 equations) in mass balance and energy balance modes; the results are identical to the QRSlv results. Both solvers converged and the self_test method was executed without any error ({{srcbranch|arash|models/test/bintok/bincuda/mwcolumn.a4c}}).
* After 6-July
** QRCUDA was integrated into PyGTK.
** ASCEND's standard parameter handling mechanism is now used in QRCUDA.
** Functionality was added to QRCUDA to report GPU block evaluation timing to PyGTK.
** An extensive search was carried out to create large, solvable models (larger than the current 30000). During this search, QRCUDA was tested with different models, and several bugs in QRCUDA were identified and fixed.
** The next step is to create GPU-based line search.
* After 16-July
** The heuristic formula for the multi-vector residual evaluator was defined (Armijo rule).
** Research on the different variations of the Armijo rule was completed, and I decided to use 0.5 as the coefficient. The main reason is that 0.5^N can be calculated with a combination of shift-left and divide operators, which has a great performance advantage over any other coefficient.
* After 26-July
** The evaluator kernels were converted from 2D kernels into 3D kernels (the extra dimension is used for the input vectors created with the Armijo rule).
** A parallel kernel was implemented to calculate the squared norm of the residuals.
*** The norm calculator was extended to calculate the minimum squared norm value for multi-vector evaluation.
** A unit test was created for testing the multi-vector evaluators.
* After 6-August
** A concurrent kernel launcher (streaming) was implemented for the residual evaluator kernels. In a model with ~80000 relations, the evaluators now execute 4x faster than in the previous version, which used a sequential kernel launcher.
** The multi-vector evaluators were tested and the results were identical to those of the normal CPU-based evaluators.
** The multi-vector evaluators were integrated into the line-search algorithm of QRCUDA.
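
The choice of 0.5 as the Armijo coefficient noted above can be illustrated with a short C sketch. It assumes that the trial vectors are formed as x + 0.5^k·dx for k = 0, 1, 2, ...; the function names and the row-major output layout are hypothetical and are not taken from the QRCUDA source.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 0.5^n using one left shift and one divide (n must be < 64). */
double armijo_step(unsigned n)
{
    assert(n < 64);
    return 1.0 / (double)(1ULL << n);
}

/*
 * Fills trial with n_trials candidate vectors of length len, row-major:
 *   trial[k*len + i] = x[i] + 0.5^k * dx[i]
 * (assumed form of the Armijo trial points; illustrative only).
 */
void armijo_trial_points(const double *x, const double *dx, size_t len,
                         unsigned n_trials, double *trial)
{
    assert(x != NULL && dx != NULL && trial != NULL);

    for (unsigned k = 0; k < n_trials; ++k) {
        double alpha = armijo_step(k);
        for (size_t i = 0; i < len; ++i)
            trial[(size_t)k * len + i] = x[i] + alpha * dx[i];
    }
}
```

Each row of <code>trial</code> would then correspond to one input vector along the extra kernel dimension, so all candidate step lengths can be evaluated in a single batch.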

== Ideas and Issues ==

The following ideas and issues relate to the current implementation (comments and critiques are greatly appreciated):

# In the batch evaluator (relman.c:relman_batch_eval), 60% of the total time is consumed in the rel_set_residual() calls. How can we optimize this function?
# Can the solver provide cheap feedback to the user showing the degree of parallelism that was achieved during a particular model solution?
# Sometimes QRSlv makes use of a Brent solver for blocks with a single equation. Is that the best approach when a GPU is available?
# More large demonstration models are needed. Let's go and find some.

== Installation ==

To run BinCUDA objects, the host machine must have an NVIDIA CUDA-enabled GPU card (preferably Fermi or a more recent architecture). The card must be able to perform 'double' floating-point calculations, i.e. compute capability 1.3 or later (compute_13+).
In addition to the GPU hardware, the CUDA SDK and developer driver must be installed on the host machine, and the BinCUDA Makefile must be pointed to the SDK directory.
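
The double-precision requirement above amounts to a simple comparison on the device's compute capability. The following C sketch shows that comparison only; the function name is hypothetical, and in a real CUDA program the two numbers would come from the <code>major</code> and <code>minor</code> fields of the <code>cudaDeviceProp</code> structure filled in by <code>cudaGetDeviceProperties</code>.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true if a device with compute capability major.minor
 * meets the compute_13+ requirement (hypothetical helper name).
 */
bool cc_supports_double(int major, int minor)
{
    assert(major >= 0 && minor >= 0);
    return major > 1 || (major == 1 && minor >= 3);
}
```

For example, a compute capability 1.2 device fails this check, while 1.3 and Fermi-class 2.0 devices pass.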

=== Installing CUDA SDK on Linux ===

The following step-by-step instructions describe installing the CUDA SDK on a 32-bit Ubuntu 10.04 machine. The installation process on other flavors of Linux is quite similar; however, the Ubuntu 10.04 32-bit file addresses used below should be replaced with those of the equivalent distribution files from the [http://developer.nvidia.com/cuda-downloads NVIDIA website].

1) In a terminal window, download the developer driver and make it executable:

<source lang=sh>
wget http://developer.download.nvidia.com/compute/cuda/3_2_prod/drivers/devdriver_3.2_linux_32_260.19.26.run
chmod +x ./devdriver_3.2_linux_32_260.19.26.run
</source>

2) Switch to a text console by pressing CTRL+ALT+F1, then stop the X Window System, install the driver and start X again:

<source lang=sh>
sudo /etc/init.d/gdm stop
sudo ./devdriver_3.2_linux_32_260.19.26.run
sudo /etc/init.d/gdm start
</source>

3) Once the X Window System has restarted with the new NVIDIA driver, you should be able to install the CUDA 3.2 toolkit and samples (using the default directory, i.e. /usr/local/cuda, is recommended):

<source lang=sh>
wget http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/cudatoolkit_3.2.16_linux_32_ubuntu10.04.run
chmod +x ./cudatoolkit_3.2.16_linux_32_ubuntu10.04.run
sudo ./cudatoolkit_3.2.16_linux_32_ubuntu10.04.run
</source>

4) Add /usr/local/cuda/bin to PATH and /usr/local/cuda/lib to LD_LIBRARY_PATH by appending this text to the ~/.bashrc file:

<source lang=sh>
PATH=$PATH:/usr/local/cuda/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib
export PATH
export LD_LIBRARY_PATH
</source>



=== BinCUDA Makefile settings ===

After installing the CUDA SDK, the CUDA_INSTALL_PATH variable in the Makefile ({{srcbranch|arash|ascend/bintokens/bincuda/Makefile}}) should be pointed to the SDK directory.

== Test models ==

A distillation column model, proposed by Ben Allan, was created to test the GPU-based bintokens.

=== Distillation Column Model ===

<source lang="a4c">REQUIRE "column.a4l";
MODEL larg_distil() REFINES test_demo_column();
demo IS_A
demo_column(['n_butane','n_pentane','n_hexane','n_heptane','n_octane','n_nonane','n_decane'],'n_decane',100,51);
METHODS
END larg_distil;
</source>

=== Number of Equations ===
The model originally has 128 unique equation symbolic forms and 19959 equation instances. The number of relations in the model can be adjusted by scaling the two parameters, 100 and 51, by a common multiplicative factor. For example, a factor of 5 is used in {{srcbranch|arash|models/test/bintok/bincuda/larg_distil.a4c}}:

<source lang="a4c">REQUIRE "column.a4l";
MODEL larg_distil() REFINES test_demo_column();
demo IS_A
demo_column(['n_butane','n_pentane','n_hexane','n_heptane','n_octane','n_nonane','n_decane'],'n_decane',500,255);
METHODS
END larg_distil;
</source>

Alternatively, multiple columns can be used instead of a single column ({{srcbranch|arash|models/test/bintok/bincuda/larg_distil_2.a4c}}):

<source lang="a4c">REQUIRE "column.a4l";
MODEL c5_10_demo_column() REFINES test_demo_column();
demo,demo2,demo3,demo4 IS_A
demo_column(['n_butane','n_pentane','n_hexane','n_heptane','n_octane','n_nonane','n_decane'],'n_decane',100,51);
METHODS
END c5_10_demo_column;</source>

== Running the tests ==

A CUnit test suite was prepared to test the QRCUDA solver and the generated CUDA model evaluator objects (i.e. BinCUDAs). The test suite code is located in test_bincuda.c ({{srcbranch|arash|ascend/compiler/test/test_bincuda.c}}) and contains six test functions: gen, satpnt, multivec, qrcuda, mwcol and bcacol.
You can run a test by executing "test/test compiler_bincuda.[test function name]" from the top-level ASCEND directory.
For more information about how QRCUDA and the BinCUDAs interact, please refer to {{srcbranch|arash|ascend/bintokens/bincuda/BinCUDA_Readme.txt}}.
To change the current benchmark model, change the macro DEF_FILENAMESTEM in the code. Please note that if your model uses any ASCEND-specific function (e.g. asc_ipow), the function must be defined in the btcudapl.cu file ({{srcbranch|arash|ascend/bintokens/bincuda/btcudapl.cu}}).

=== compiler_bincuda.gen ===
This test function outputs the CPU-based evaluation time, the GPU-based evaluation time and the number of equations in the model.
It generates the code in the "/tmp" directory, and the Makefile located in the same directory is responsible for building the shared binary object for the BinCUDAs. The CUDA build and compile commands are provided in the Makefile ({{srcbranch|arash|ascend/bintokens/bincuda/Makefile}}).

=== compiler_bincuda.satpnt ===

In the multi-vector residual evaluator, the model is evaluated concurrently for multiple input vectors. Because the GPU's parallel architecture is used, the evaluation time for multiple inputs can equal the evaluation time for a single input, up to a certain number of vectors. The "satpnt" test function is responsible for determining this ''saturation point'' for a specific model. We define the saturation point as the maximum number of vectors for which the computational time for concurrent residual evaluation is equal to the time measured for a single input vector evaluation.

Please note that this test function measures only the computational time; the time for data transfer between CPU and GPU is not included in the results.
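
The saturation point definition above can be sketched in C as a scan over measured timings. This is an illustration under stated assumptions, not the code of the "satpnt" test: the function name is hypothetical, <code>times[k]</code> is assumed to hold the evaluation time for k+1 vectors, and a relative tolerance is introduced because measured times are rarely exactly equal.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the largest vector count whose evaluation time stays within
 * (1 + rel_tol) times the single-vector time, stopping at the first
 * count that exceeds this limit. times[k] is the time for k+1 vectors.
 */
size_t saturation_point(const double *times, size_t n, double rel_tol)
{
    double limit;
    size_t sat = 1;

    assert(times != NULL && n > 0 && rel_tol >= 0.0);

    limit = times[0] * (1.0 + rel_tol);
    for (size_t k = 1; k < n; ++k) {
        if (times[k] > limit)
            break;          /* GPU is saturated beyond this count */
        sat = k + 1;
    }
    return sat;
}
```

With timings of 1.00, 1.02, 1.04, 1.50 and 2.10 (arbitrary units, made up for illustration) and a 5% tolerance, the function reports a saturation point of 3 vectors.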

=== compiler_bincuda.multivec ===

In the "multivec" test function, the results obtained from the multi-vector evaluators are verified against the standard CPU-based implementation provided in the ASCEND framework, and then the computational performance of the multi-vector evaluators is measured.
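
A verification step of this kind can be sketched in C as a comparison of two residual arrays. The helper below is hypothetical and is not the code used in test_bincuda.c; it simply reports the largest absolute difference between a GPU result and a CPU reference, which would be zero if the two are identical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the largest absolute element-wise difference between
 * two arrays of length n (hypothetical helper for illustration).
 */
double max_abs_diff(const double *a, const double *b, size_t n)
{
    double worst = 0.0;

    assert(a != NULL && b != NULL);

    for (size_t i = 0; i < n; ++i) {
        double d = a[i] - b[i];
        if (d < 0.0)
            d = -d;         /* absolute value without libm */
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

A test could then assert that the returned value is zero, or below a small tolerance if the CPU and GPU code paths are allowed to round differently.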

=== compiler_bincuda.qrcuda/mwcol/bcacol ===

These test functions solve {{srcbranch|arash|models/test/bintok/bincuda/larg_distil.a4c}}, {{srcbranch|arash|models/test/bintok/bincuda/mwcolumn.a4c}} and {{srcbranch|arash|models/test/bintok/bincuda/bcacolumn.a4c}}, respectively.
[[Category:GSOC2011]][[Category:ASCEND Contributors]]
