# Workload Manager

## When to Use
Use is recommended if at least one of the following conditions is met:
- Single fast analysis step (e.g. each single analysis runs for only a minute)
- 1000s or more single analysis steps
- Easy usage of all cores in node-exclusive partitions (relevant for Mogon 2; such partitions do not exist on HIMster 2)


### Comparison with SLURM
- Queue-based, equal work distribution (in contrast to SLURM multiprog or [staskfarm](https://github.com/cmeesters/staskfarm) from [MogonWiki Node local scheduling](https://mogonwiki.zdv.uni-mainz.de/dokuwiki/node_local_scheduling))
- Usage of MPI:
  - large connected jobs (>200 cores) are preferred by the job manager
  - efficiently supports both node-local and multi-node usage
  - keeps the shell environment, also in multi-node situations (GNU parallel does so only node-locally)
- Usage of Python makes it simple for users to modify the code

- Only one disadvantage: the number of ranks is fixed during runtime -- in contrast to SLURM jobs.


## Installation

In your home directory, run
```bash
git clone https://gitlab.rlp.net/pbotte/workload-manager.git
```
to download the latest version into your current directory. You are free to modify the code, and suggestions for improvement are highly welcome (via [email](https://www.hi-mainz.de//people/people/#addr149), [issue tracker](https://gitlab.rlp.net/pbotte/workload-manager/issues) or pull request).

## Usage

### Command line parameters
```bash
wkmgr.py [-h] [-v] [-V] [-d DELAY] [-a [ARG_VARIABLE]] [-i INPUT_DIR]
         [-o OUTPUT_DIR] [-s] [-ni] [-n]
         execname ...
``` 

#### Arguments
Parameter |  | Description 
---  | --- | ---
execname | required | (positional argument) name of the executable to call OR a complete shell command line. If no placeholders are provided, `{inputdir}{inputfilename} {outputdir}{jobid}/outfile.txt` is appended.
-h, --help | optional | show this help message and exit
-v, --verbosity | optional | increase output verbosity
-V, --version | optional | show the current program version
-d DELAY, --delay DELAY | optional | time delay in ms between starts of consecutive jobs to help distribute the load on the cluster
-a [ARG_VARIABLE], --arg-variable [ARG_VARIABLE] | optional | add variables to use as execname argument. example: `-a theta,0,180,2.5 -a energy,0,10`
-i INPUT_DIR, --input-dir INPUT_DIR | optional | specifies the directory with files to process
-o OUTPUT_DIR, --output-dir OUTPUT_DIR | optional | output directory, default = output[datetime]
-s, --single-argument | optional | do not add the default placeholders to the execname
-ni, --no-inputfiles | optional | set this flag if no iteration over input files is requested
-n, --dry-run | optional | Do not launch workers, just build test settings.

#### Placeholders

Name of Placeholder | Description
--- | ---
{execname} | Execname provided in the arguments. Consists only of the first word if multiple statements are given
{inputdir} | Input directory provided in the arguments, default is "./"
{jobid} | Internal enumeration of jobs
{outputdir} | Root output directory of the jobs. Each job creates a subdirectory named after its job number and writes into it
{inputfilename} | Name of the input file to process. Value changes for each job.
{VarName} | Each variable given with a `-a VarName,...` statement is available as a placeholder

- Default placeholder string:
  - normal: ` {VarName0} {VarName1} ... {VarNameN} {inputdir}{inputfilename} {outputdir}{jobid}/outfile.txt`
  - with the `-ni` option: ` {VarName0} {VarName1} ... {VarNameN} {outputdir}{jobid}/outfile.txt`
  - with the `-s` option OR an execname containing at least one `{` character: ` ` (= empty)
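For illustration, the placeholder substitution can be pictured as a Python `str.format` call on the command template (a minimal sketch with made-up values; the actual wkmgr.py internals may differ):

```python
# Sketch of placeholder expansion (illustrative, not the actual wkmgr.py code).
# Each job fills the placeholders in the command template with its own values.
template = "{execname} {theta} {inputdir}{inputfilename} {outputdir}{jobid}/outfile.txt"

job = {
    "execname": "MyAnalysisBinary",  # first word of the execname argument
    "theta": 2.5,                    # a variable given via -a theta,...
    "inputdir": "./",                # default input directory
    "inputfilename": "run27.dat",    # changes for each job
    "outputdir": "output20190527/",  # root output directory
    "jobid": 0,                      # internal job enumeration
}

command = template.format(**job)
print(command)
# MyAnalysisBinary 2.5 ./run27.dat output20190527/0/outfile.txt
```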

#### Working Details

From the provided information, which includes:
- the input directory with its files (*Nf* := number of files)
- the variables with names and ranges (*Ni* := number of steps in the range of variable *i*)
- the execname to execute (or, more advanced, a complete shell command line)

a [set of jobs is created](https://gitlab.rlp.net/pbotte/workload-manager/blob/cfd7f0ef41bb11c3b7a4fb806d8ed9b9f15aca4c/wkmgr.py#L159) with *Nf* * *N1* * ... * *Nn* entries, where *n* is the number of provided variables.
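This Cartesian-product job count can be sketched with `itertools.product` (illustrative values, not taken from the repository):

```python
# Illustrative: the job list is the Cartesian product of the input files and
# all variable ranges, hence Nf * N1 * ... * Nn entries.
import itertools

input_files = ["a.dat", "b.dat"]  # Nf = 2
theta = [0.0, 2.5, 5.0]           # N1 = 3 (variable given via -a theta,...)
energy = [130, 132]               # N2 = 2 (variable given via -a energy,...)

jobs = list(itertools.product(input_files, theta, energy))
print(len(jobs))  # 2 * 3 * 2 = 12
```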
Simplified, each job is executed as follows (see the source code, [line 252](https://gitlab.rlp.net/pbotte/workload-manager/blob/cfd7f0ef41bb11c3b7a4fb806d8ed9b9f15aca4c/wkmgr.py#L252) and [line 258](https://gitlab.rlp.net/pbotte/workload-manager/blob/cfd7f0ef41bb11c3b7a4fb806d8ed9b9f15aca4c/wkmgr.py#L258)):
```python
os.mkdir(outputdir + str(jobid))
bashCommand = execname + " > " + outputdir + str(jobid) + "/std_out.txt 2> " + outputdir + str(jobid) + "/err_out.txt"
subprocess.run(bashCommand, shell=True)
```
which means that stdout and stderr are redirected into files inside the job's subdirectory.

The number of jobs running in parallel is equal to the number of MPI ranks, which is equivalent to the number of
processes (the `-n` option of `mpirun`/`srun`). If there are more jobs in the queue than processes available,
jobs are assigned in a round-robin manner during runtime.
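The round-robin distribution can be pictured like this (a simplified simulation, not the actual MPI implementation in wkmgr.py):

```python
# Simplified simulation of queue-based, round-robin work distribution:
# jobs are taken from a shared queue and handed to the ranks in turn.
from collections import deque

def distribute(jobs, n_ranks):
    """Assign queued jobs to ranks in a round-robin manner."""
    queue = deque(jobs)
    assignment = {rank: [] for rank in range(n_ranks)}
    rank = 0
    while queue:
        assignment[rank].append(queue.popleft())
        rank = (rank + 1) % n_ranks
    return assignment

print(distribute(list(range(7)), 3))
# {0: [0, 3, 6], 1: [1, 4], 2: [2, 5]}
```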


### First steps (aka hello world)

Complete [the installation steps](#installation) above first.

1. On HIMster 2 / Mogon 2, load the following module first
   ```bash
   module load lang/Python/3.6.6-foss-2018b
   ```
   to enable Python 3.6 and MPI4Py support. You can also add this line to your `~/.bashrc` configuration file so that it is executed automatically when you log in again.

2. Next, test the parameters for the workload manager. To do so, run short tests (with the dry-run option) on the head node. For more examples with different parameters, see the next chapter.
   * On a head node, run with
     ```bash
     ./wkmgr.py -n [YOUR EXECUTABLE]
     ```
   * Or reserve a dedicated node for this purpose first, e.g.
     ```bash
     salloc -p devel -A m2_him_exp -N 1 -t 1:30:00
     # or during a tutorial
     salloc -p parallel --reservation=himkurs -A m2_himkurs -N 1 -t 1:30:00
     ```
     and do some test runs as in the head-node case.

3. Once you have found the right launcher arguments, submit the job interactively with
   ```bash
   #load modules for demo analysis and MPI4Py
   module purge
   module load math/SUNDIALS/2.7.0-intel-2018.03
   module load lang/Python/3.6.6-foss-2018b

   #run some example provided in the git repository
   srun -n 20 ~/workload-manager/wkmgr.py -v -i ~/workload-manager/examples/LGS/Run27_LaPalma_Profile_I50 ~/workload-manager/examples/LGS/PulsedLGS
   ```
   Interactively in this context means that you first allocate resources and later perform one or several run steps with `srun`.
4. Run your jobs scripted:
   ```bash
   #!/bin/bash
   #-----------------------------------------------------------------
   # Example SLURM job script to run MPI Job on Mogon.
   # This script requests two nodes with all cores. The job
   # will have access to all the memory in the nodes.  
   #-----------------------------------------------------------------
   
   #SBATCH -J myjob              # Job name
   #SBATCH -o myjob.%j.out       # Specify stdout output file (%j expands to jobId)
   #SBATCH -p devel              # Queue name
   #SBATCH -N 2                  # Total number of nodes requested (32 cores/node)
   #SBATCH -n 64                 # Total number of tasks
   #SBATCH -t 01:30:00           # Run time (hh:mm:ss)
   #SBATCH -A m2_him_exp         # Specify account
  
   # Load all necessary modules if needed
   # Loading modules in the script ensures a consistent environment.
   module load math/SUNDIALS/2.7.0-intel-2018.03
   module load lang/Python/3.6.6-foss-2018b
  
   # Launch the executable
   srun ~/workload-manager/wkmgr.py -i ~/workload-manager/examples/LGS/Run27_LaPalma_Profile_I50 ~/workload-manager/examples/LGS/PulsedLGS
   ```
   Finally, save your script and submit via
   ```bash
   $ sbatch myjobscript
   ```

### Examples and FAQ

#### How to identify the number of processors available on a machine?

Two options:
1. Look up the information before you ask for resources in the [cluster wiki](https://mogonwiki.zdv.uni-mainz.de/dokuwiki/nodes). Look out for the column named "Cores".
2. The direct way
   - Identify the machine you reserved:
   ```bash
   $ squeue -u $USER
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   4576219     devel     bash   pbotte  R       1:02      1 z0477
   ```
   - check for the reserved computer names in the column "NODELIST"
   - ssh into these machines, run `cat /proc/cpuinfo` and count the number of processors. Or do it all at once:
   ```bash
   ssh {REPLACE WITH A COMPUTER NAME, eg z0477} "cat /proc/cpuinfo | grep processor | wc -l"
   ```

Note that `/proc/cpuinfo` normally reports a number twice as high as the number of physical cores, because it counts [Hyper-Threading](https://en.wikipedia.org/wiki/Hyper-threading) logical processors in the same way as physical ones. Generally speaking, better **use only the number of physical cores** in your jobs.
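As a quick cross-check from Python (a sketch; `os.cpu_count()` counts logical processors, i.e. including Hyper-Threading, just like `/proc/cpuinfo`):

```python
# Count logical processors (includes Hyper-Threading), matching what
# `grep -c processor /proc/cpuinfo` reports on Linux.
import os

print(os.cpu_count())
```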

#### Input / Output File Example
Task: run the analysis binary for each input file in `MyInputDir` on 20 cores:
```bash
srun -n 20 ~/workload-manager/wkmgr.py -i MyInputDir MyAnalysisBinary
```
Check for `output*` in the current directory for the output, or change this behaviour with the `-o` option.

Provide the `-s` option if you do not want the default placeholders (`{inputdir}{inputfilename} {outputdir}{jobid}/outfile.txt`) to be added to the execname:
```bash
srun -n 20 ~/workload-manager/wkmgr.py -s MyAnalysisBinary
```

#### Parameters Example
Task: run the analysis binary for theta from 0 to 180 degrees in 2.5-degree steps and energy from 130 to 150 MeV in 2 MeV steps, with no input files, on 20 cores:
```bash
srun -n 20 ~/workload-manager/wkmgr.py -a theta,0,180,2.5 -a energy,130,150,2 -ni MyAnalysisBinary {theta} {energy}
```
- Note: The `-s` option is automatically activated once you add a `{` character in your execname statement.
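To get a rough feeling for the resulting job count, one can count the steps per variable (a sketch assuming end-exclusive ranges; the actual range convention of `-a` in wkmgr.py may differ):

```python
# Hypothetical step count per variable, assuming end-exclusive ranges
# (i.e. start, start+step, ... while the value stays below stop).
def steps(start, stop, step):
    n = 0
    value = start
    while value < stop:
        n += 1
        value += step
    return n

n_theta = steps(0, 180, 2.5)   # -a theta,0,180,2.5
n_energy = steps(130, 150, 2)  # -a energy,130,150,2
print(n_theta * n_energy)      # 72 * 10 = 720 jobs (with -ni: no input files)
```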

#### Example Shell Environment
Your bash environment is also active within the workload-manager jobs. This enables you to do more sophisticated calls like:
```bash
srun -n 20 ~/workload-manager/wkmgr.py -a theta,0,180,2.5 -a energy,130,150,2 -ni "MyAnalysisBinaryPrepare && MyAnalysisBinary {theta} {energy} && MyAnalysisBinaryAfter"
```
- Note the quotation marks (`"`) in this statement.

#### Change Verbosity
Get more information with `-v` or even more with `-vv`. Sample output:
```bash
2019-05-27 16:27:14,842 rank15 INFO      Worker startet job.
```
which tells you the time, the rank/worker, the log level (debug/info/warning/error) and the message text.
Please note that output from the individual workers/ranks is not necessarily displayed in the right order, whereas the order within a rank is consistent.
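Such log lines can be reproduced with Python's `logging` module (a sketch; the exact format string used by wkmgr.py is an assumption, and `rank` is passed in as an illustrative `extra` field, not a standard logging attribute):

```python
# Sketch of a log format resembling the sample line above; the field
# widths and the `rank` extra attribute are assumptions.
import logging

formatter = logging.Formatter("%(asctime)s rank%(rank)s %(levelname)-9s %(message)s")
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logger = logging.getLogger("wkmgr-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Worker started job.", extra={"rank": 15})
```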

#### Dry-runs
Test everything before you do the calculation with
```bash
~/workload-manager/wkmgr.py -vv -n ...
```
to perform a dry-run with maximum verbosity and check the printed worklist.

## Full Loader
The full loader is currently under development. 
- To use the loader (untested on the cluster so far), replace `wkmgr.py` with `wkloader.py`.
- A handy command for debugging might be: `mpirun -n 4 ./wkmgr.py -vv -n [YOUR EXECUTABLE]`