## Installation
In your home directory run
```bash
git clone https://gitlab.rlp.net/pbotte/workload-manager.git
```
to download the latest version into your current directory. You are free to modify the code, and suggestions for improvement are highly welcome (via [email](https://www.hi-mainz.de//people/people/#addr149), the [issue tracker](https://gitlab.rlp.net/pbotte/workload-manager/issues) or a pull request).
## Usage
- with `-ni`-option: ` {VarName0} {VarName1} ... {VarNameN} {outputdir}{jobid}/outfile.txt`
- with `-s`-option OR execname containing at least one `{` character: ` ` (= empty)
#### Working Details
With the provided information, namely:
- the input directory with its files (*Nf* := number of files),
- the variables with their names and ranges (*Ni* := number of steps in the range of variable *i*), and
- the execname to execute (or, more advanced, a shell command line),

a [set of jobs is created](https://gitlab.rlp.net/pbotte/workload-manager/blob/cfd7f0ef41bb11c3b7a4fb806d8ed9b9f15aca4c/wkmgr.py#L159) with *Nf* × *N1* × ... × *Nn* entries, where *n* is the number of provided variables.
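As an illustration only (the real logic lives at the link above), the job list is essentially the Cartesian product of the input files and all variable values; the file names and variable names below are invented for this sketch:
```python
import itertools

# invented example input: Nf = 2 files, N1 = 3 and N2 = 2 variable steps
input_files = ["run1.dat", "run2.dat"]
var_ranges = {"Temp": [10, 20, 30], "Seed": [1, 2]}

# one job per combination of input file and variable values -> Nf * N1 * N2 = 12 entries
jobs = [
    {"input": f, **dict(zip(var_ranges, values))}
    for f, values in itertools.product(input_files,
                                       itertools.product(*var_ranges.values()))
]
print(len(jobs))  # 12
```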
Simplified, each job is executed as follows (see the source code at [line 252](https://gitlab.rlp.net/pbotte/workload-manager/blob/cfd7f0ef41bb11c3b7a4fb806d8ed9b9f15aca4c/wkmgr.py#L252) and [line 258](https://gitlab.rlp.net/pbotte/workload-manager/blob/cfd7f0ef41bb11c3b7a4fb806d8ed9b9f15aca4c/wkmgr.py#L258)):
```python
import os
import subprocess

# create a separate output folder for this job and redirect stdout/stderr into it
os.mkdir(outputdir + str(jobid))
bashCommand = execname + " > " + outputdir + str(jobid) + "/std_out.txt 2> " + outputdir + str(jobid) + "/err_out.txt"
subprocess.run(bashCommand, shell=True)
```
which means that stdout and stderr are redirected into files inside a per-job subfolder.
The number of jobs running in parallel equals the number of MPI ranks, i.e. the number of processes (the `-n` option of `mpirun`/`srun`). If there are more jobs in the queue than available processes, the jobs are distributed in a round-robin manner at runtime.
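A minimal sketch of such a round-robin mapping with MPI4Py is shown below; it is not the actual wkmgr.py scheduler, and the job list here is just a placeholder:
```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # id of this MPI process
size = comm.Get_size()   # number of processes, i.e. the -n value

jobs = list(range(10))   # placeholder for the generated job list

# round-robin: rank r handles jobs r, r+size, r+2*size, ...
for jobid in jobs[rank::size]:
    print(f"rank {rank} would run job {jobid}")
```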
### First steps (aka hello world)
First complete [the installation steps](#installation) above.
1. On HIMster 2 / Mogon 2, load the following module first
```bash
module load lang/Python/3.6.6-foss-2018b
```
to enable Python 3.6 and MPI4Py support. You can also add this line to your `~/.bashrc` configuration file to speed up the process when you log in again.
2. Next, test the parameters for the workload-manager. To do so, run short tests (with the dry-run option) on the head node. For more examples with different parameters, see the next chapter.
* On a head node, run with
```
#and do some test runs like in the head node case.
3. Once you have found the right launcher arguments, submit the job interactively with
```bash
#load modules for demo analysis and MPI4Py
module purge
module load math/SUNDIALS/2.7.0-intel-2018.03
module load lang/Python/3.6.6-foss-2018b
#run one of the examples provided in the git repository
srun -n 20 ~/workload-manager/wkmgr.py -v -i ~/workload-manager/examples/LGS/Run27_LaPalma_Profile_I50 ~/workload-manager/examples/LGS/PulsedLGS
```
Interactively in this context means that you first allocate resources and then execute one or several run steps with `srun` (see the sketch after this list).
4. Run your jobs via a batch script:
```bash
#!/bin/bash
#-----------------------------------------------------------------
# Example SLURM job script to run an MPI job on Mogon.
# This script requests two nodes with all cores. The job
# will have access to all the memory in the nodes.
#-----------------------------------------------------------------
#SBATCH -J myjob # Job name
#SBATCH -o myjob.%j.out # Specify stdout output file (%j expands to jobId)
#SBATCH -p devel # Queue name
#SBATCH -N 2 # Total number of nodes requested (32 cores/node)
#SBATCH -n 64 # Total number of tasks
#SBATCH -t 01:30:00 # Run time (hh:mm:ss)
#SBATCH -A m2_him_exp # Specify account
# Load all necessary modules if needed
# Loading modules in the script ensures a consistent environment.
module load math/SUNDIALS/2.7.0-intel-2018.03
module load lang/Python/3.6.6-foss-2018b
# Launch the executable
srun ~/workload-manager/wkmgr.py -i ~/workload-manager/examples/LGS/Run27_LaPalma_Profile_I50 ~/workload-manager/examples/LGS/PulsedLGS
```
Finally, save your script and submit it via
```bash
$ sbatch myjobscript
```
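The interactive workflow mentioned in step 3 could look roughly like the following sketch; the partition, node/task counts, time limit and account are only example values taken from the batch script above and have to be adapted to your project:
```bash
# allocate resources first (interactive mode) ...
salloc -p devel -N 1 -n 20 -t 00:30:00 -A m2_him_exp
# ... then launch one or more run steps inside the allocation
srun -n 20 ~/workload-manager/wkmgr.py -v -i ~/workload-manager/examples/LGS/Run27_LaPalma_Profile_I50 ~/workload-manager/examples/LGS/PulsedLGS
# release the allocation when done
exit
```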
### Examples and FAQ
#### How to identify the number of processors available on a machine?
Two options:
1. Look up the information in the [cluster wiki](https://mogonwiki.zdv.uni-mainz.de/dokuwiki/nodes) before you ask for resources. Look for the column named "Cores".
2. The direct way:
- Identify the machine you reserved:
```bash
$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4576219 devel bash pbotte R 1:02 1 z0477
```
- Check the reserved node names in the column "NODELIST".
- SSH into these machines, run `cat /proc/cpuinfo` and count the number of processors, or do it all at once:
```bash
ssh {REPLACE WITH A COMPUTER NAME, e.g. z0477} "cat /proc/cpuinfo | grep processor | wc -l"
```
Note that `/proc/cpuinfo` normally reports a number that is twice the number of physical cores, because it treats [Hyper-Threading](https://en.wikipedia.org/wiki/Hyper-threading) threads in the same way as normal processors. Generally speaking, it is better to **use only the number of physical cores** in your jobs.
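If `lscpu` is available on the node, the relation between sockets, physical cores and Hyper-Threading threads can also be read off directly, for example:
```bash
# sockets, cores per socket and threads per core of the current node
lscpu | grep -E '^(Socket|Core|Thread)'
```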
#### Input / Output File Example
Task: Run the analysis binary for each input file in MyInputDirectory on 20 cores.
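A possible invocation for this task could look like the following sketch; the binary name `./analysis` is hypothetical here, and the actual example in the repository may differ:
```bash
# 20 parallel processes, one job per file found in MyInputDirectory (sketch only)
srun -n 20 ~/workload-manager/wkmgr.py -i MyInputDirectory ./analysis
```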