to download the latest version into your current directory. You are free to modify the code, and suggestions for improvement are highly welcome (via [email](https://www.hi-mainz.de//people/people/#addr149), [issue tracker](https://gitlab.rlp.net/pbotte/workload-manager/issues) or pull request).
## Usage
...
...
- with the `-ni` option: ` {VarName0} {VarName1} ... {VarNameN} {outputdir}{jobid}/outfile.txt`
- with the `-s` option, or when the execname contains at least one `{` character: ` ` (= empty)
#### Working Details
From the provided information, which includes:
- the input directory with its files (*Nf* := number of files),
- the variables with their names and ranges (*Ni* := number of steps in the range of variable *i*),
- the execname to execute (or, more advanced, a shell command line),

a [set of jobs is created](https://gitlab.rlp.net/pbotte/workload-manager/blob/cfd7f0ef41bb11c3b7a4fb806d8ed9b9f15aca4c/wkmgr.py#L159) with *Nf* × *N1* × ... × *Nn* entries, where *n* is the number of provided variables.
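For example, 10 input files combined with two variables of 3 and 5 range steps yield 10 × 3 × 5 = 150 jobs.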
Simplified, each job is executed as follows (reference to the source code: [line 1](https://gitlab.rlp.net/pbotte/workload-manager/blob/cfd7f0ef41bb11c3b7a4fb806d8ed9b9f15aca4c/wkmgr.py#L252) and [line 2](https://gitlab.rlp.net/pbotte/workload-manager/blob/cfd7f0ef41bb11c3b7a4fb806d8ed9b9f15aca4c/wkmgr.py#L258)):
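A simplified, hedged sketch of such an invocation (the actual command is constructed in the source lines linked above; the file names `stdout.txt`/`stderr.txt` and the `{inputfile}` placeholder are only illustrative):

```bash
# create the per-job subfolder and run the job,
# redirecting stdout and stderr into files inside that folder
mkdir -p {outputdir}{jobid}/
{execname} {inputfile} > {outputdir}{jobid}/stdout.txt 2> {outputdir}{jobid}/stderr.txt
```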
This basically means that stdout and stderr are redirected into files located in a per-job subfolder.
The number of jobs running in parallel equals the number of MPI ranks, i.e. the number of
processes (the `-n` option of `mpirun`/`srun`). If there are more jobs in the queue than processes
available, the jobs are distributed in a round-robin manner at runtime.
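For example, starting the workload manager with `mpirun -n 20` gives 20 processes: with 150 jobs in the queue, 20 run at any one time and the remaining jobs are handed out to the ranks as they become free.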
### First steps (aka hello world)
Complete [the installation steps](#installation) above first.
1. On HIMster 2 / Mogon 2, load the following module first
```bash
module load lang/Python/3.6.6-foss-2018b
```
to enable Python 3.6 and MPI4Py support. You can also add this line to your `~/.bashrc` configuration file so that the module is loaded automatically the next time you log in.
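For example, assuming `bash` is your login shell:

```bash
# append the module command to your shell startup file (run once)
echo "module load lang/Python/3.6.6-foss-2018b" >> ~/.bashrc
```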
2. Next, test the parameters for the workload-manager. To do so, run short tests (with the dry-run option) on the head node. More examples with different parameters can be found in the next chapter.
* On a head node run with
...
...
#and do some test runs like in the head node case.
```
3. Once you have found the right launcher arguments, submit the job interactively with
```bash
#load modules for demo analysis and MPI4Py
module purge
...
...
#### How to identify the number of processors available on a machine?
Two options:
1. Look up the information in the [cluster wiki](https://mogonwiki.zdv.uni-mainz.de/dokuwiki/nodes) before you ask for resources. Look for the column named "Cores".
2. The direct way:
- Identify the machine you reserved:
```bash
$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4576219 devel bash pbotte R 1:02 1 z0477
```
- Check the column "NODELIST" for the names of the reserved machines.
- `ssh` into these machines, run `cat /proc/cpuinfo` and count the number of processors. Or do it all at once:
```bash
ssh {REPLACE WITH A COMPUTER NAME, e.g. z0477} "cat /proc/cpuinfo | grep processor | wc -l"
```
Note that `/proc/cpuinfo` normally reports a number that is twice the number of physical cores, because it counts [Hyper-Threading](https://en.wikipedia.org/wiki/Hyper-threading) logical processors in the same way as physical ones. Generally speaking, better **use only the number of physical cores** in your jobs.

### Examples
#### Input / Output File Example
Task: Run the analysis binary for each input file in MyInputDirectory on 20 cores
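A hedged sketch of what such a call could look like (the exact `wkmgr.py` argument order and option names may differ from this illustration; consult the usage section above and do a dry run first):

```bash
# 20 MPI ranks -> up to 20 analysis jobs run in parallel (illustrative call)
mpirun -n 20 ./wkmgr.py ./analysis MyInputDirectory/
```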