Commit 3f45187f authored by Peter-Bernd Otte

Update README.md

parent cfd7f0ef
## Installation
In your home directory, run
```bash
git clone https://gitlab.rlp.net/pbotte/workload-manager.git
```
to download the latest version into your current directory. You are free to modify the code, and suggestions for improvement are highly welcome (via [email](https://www.hi-mainz.de//people/people/#addr149), the [issue tracker](https://gitlab.rlp.net/pbotte/workload-manager/issues) or a pull request).
## Usage
- with `-ni`-option: ` {VarName0} {VarName1} ... {VarNameN} {outputdir}{jobid}/outfile.txt`
- with `-s`-option OR execname containing at least one `{` character: ` ` (= empty)
#### Working Details
From the provided information, which includes:
- the input directory with its files (*Nf* := number of files),
- the variables with their names and ranges (*Ni* := number of steps in the range of variable *i*),
- the execname to execute (or, more advanced, a shell command line),

a [set of jobs is created](https://gitlab.rlp.net/pbotte/workload-manager/blob/cfd7f0ef41bb11c3b7a4fb806d8ed9b9f15aca4c/wkmgr.py#L159) with *Nf* × *N1* × ... × *Nn* entries, where *n* is the number of provided variables. For example, 10 input files and two variables with 4 and 5 steps result in 10 × 4 × 5 = 200 jobs.
Simplified, each job is executed as follows (see the source code, [line 252](https://gitlab.rlp.net/pbotte/workload-manager/blob/cfd7f0ef41bb11c3b7a4fb806d8ed9b9f15aca4c/wkmgr.py#L252) and [line 258](https://gitlab.rlp.net/pbotte/workload-manager/blob/cfd7f0ef41bb11c3b7a4fb806d8ed9b9f15aca4c/wkmgr.py#L258)):
```python
import os
import subprocess

# create a subfolder for this job and redirect its stdout/stderr into files inside it
os.mkdir(outputdir + str(jobid))
bashCommand = execname + " > " + outputdir + str(jobid) + "/std_out.txt 2> " + outputdir + str(jobid) + "/err_out.txt"
subprocess.run(bashCommand, shell=True)
```
which basically means that stdout and stderr are redirected into files inside a per-job subfolder.
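After a run you can therefore inspect every job individually. A minimal sketch of what this looks like on disk, assuming for illustration that the output directory was called `output/` and the first job got the id 0:
```bash
# list the files created for job 0 (directory name only an example)
ls output/0/            # std_out.txt, err_out.txt, plus whatever the analysis wrote

# look at the captured stdout and stderr of job 0
cat output/0/std_out.txt
cat output/0/err_out.txt
```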
The number of jobs running in parallel is equal to the number of MPI ranks, i.e. the number of processes (the `-n` option of `mpirun`/`srun`). If there are more jobs in the queue than processes available, the jobs are handed out in a round-robin manner at runtime; for example, with 200 jobs and `srun -n 20`, 20 jobs run concurrently and each rank works through roughly ten of them one after another (assuming jobs of similar runtime).
### First steps (aka hello world)
Complete [the installation steps](#installation) above first.
1. On HIMster 2 / Mogon 2, load the following module first
```bash
module load lang/Python/3.6.6-foss-2018b
```
to enable Python 3.6 and MPI4Py support. You can also add this line to your `~/.bashrc` configuration file to speed up the process when you log in again.
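If you do not want to type this at every login, one possible way (purely illustrative) is to append the line to your `~/.bashrc`:
```bash
# load the Python/MPI4Py module automatically at every login
echo 'module load lang/Python/3.6.6-foss-2018b' >> ~/.bashrc
```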
2. Next, test the parameters for the workload-manager. To do so, run short tests (with the dry-run option) on the head node. More examples with different parameters can be found in the next chapter.
* On a head node, run with
```
and do some test runs like in the head node case.
3. Once you have found the right launcher arguments, submit the job interactively with
```bash
#load modules for demo analysis and MPI4Py
module purge
module load math/SUNDIALS/2.7.0-intel-2018.03
module load lang/Python/3.6.6-foss-2018b
#run some example provided in the git repository
srun -n 20 ~/workload-manager/wkmgr.py -v -i ~/workload-manager/examples/LGS/Run27_LaPalma_Profile_I50 ~/workload-manager/examples/LGS/PulsedLGS
```
Interactive in this context means that you first allocate resources and later perform one or several run steps with `srun`, for example as sketched below.
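A minimal sketch of such an interactive session; the partition, node/task counts and time limit below are only example values, adapt them (and the account) to your needs:
```bash
# allocate resources interactively (all values are examples)
salloc -p devel -N 1 -n 20 -t 00:30:00 -A m2_him_exp

# inside the allocation, perform one or several run steps
srun -n 20 ~/workload-manager/wkmgr.py -v -i ~/workload-manager/examples/LGS/Run27_LaPalma_Profile_I50 ~/workload-manager/examples/LGS/PulsedLGS

# release the allocation when you are done
exit
```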
4. Alternatively, run your jobs via a batch script:
```bash
#!/bin/bash
#-----------------------------------------------------------------
# Example SLURM job script to run MPI Job on Mogon.
# This script requests two nodes with all cores. The job
# will have access to all the memory in the nodes.
#-----------------------------------------------------------------
#SBATCH -J myjob # Job name
#SBATCH -o myjob.%j.out # Specify stdout output file (%j expands to jobId)
#SBATCH -p devel # Queue name
#SBATCH -N 2 # Total number of nodes requested (32 cores/node)
#SBATCH -n 64 # Total number of tasks
#SBATCH -t 01:30:00 # Run time (hh:mm:ss)
#SBATCH -A m2_him_exp # Specify account
# Load all necessary modules if needed
# Loading modules in the script ensures a consistent environment.
module load math/SUNDIALS/2.7.0-intel-2018.03
module load lang/Python/3.6.6-foss-2018b
# Launch the executable
srun ~/workload-manager/wkmgr.py -i ~/workload-manager/examples/LGS/Run27_LaPalma_Profile_I50 ~/workload-manager/examples/LGS/PulsedLGS
```
Finally, save your script and submit via
```bash
$ sbatch myjobscript
```
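After submitting, you can check the job state and, once it runs, follow the stdout file defined in the script (`myjob.%j.out`, where `%j` expands to the job id). A short sketch, with a made-up job id:
```bash
# show your queued and running jobs
squeue -u $USER

# follow the job's stdout (replace 4576219 with your actual job id)
tail -f myjob.4576219.out
```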
### Examples and FAQ
- Hint for the editor: missing topic: How to identify the right number of cores
#### How to identify the number of processors available on a machine?
Two options:
1. Look up the information in the [cluster wiki](https://mogonwiki.zdv.uni-mainz.de/dokuwiki/nodes) before you ask for resources. Look for the column named "Cores".
2. The direct way
- Identify the machine(s) you reserved:
```bash
$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4576219 devel bash pbotte R 1:02 1 z0477
```
- Check the reserved node names in the column "NODELIST".
- ssh into these machines, run `cat /proc/cpuinfo` and count the number of processors. Or do it all at once:
```bash
ssh {REPLACE WITH A COMPUTER NAME, eg z0477} "cat /proc/cpuinfo | grep processor | wc -l"
```
Note that `/proc/cpuinfo` normally reports a number which is twice as high as the number of physical cores, because it treats [Hyper-Threading](https://en.wikipedia.org/wiki/Hyper-threading) threads in the same way as physical processors. Generally speaking, better **use only the number of cores** in your jobs.
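To see the physical core count directly, `lscpu` (available on standard Linux systems) reports sockets, cores and threads separately; the following one-liners are just one possible way to check:
```bash
# number of logical processors (includes Hyper-Threading threads)
grep -c ^processor /proc/cpuinfo

# sockets, cores per socket and threads per core at a glance
lscpu | grep -E '^(Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core)'
```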
#### Input / Output File Example
Task: Run the analysis binary for each input file in MyInputDirectory on 20 cores