Hyperctl¶
Hyperctl is a general tool for managing batches of jobs, including but not limited to training, testing and comparison runs. It is packaged within Hypernets and is intended to be useful at every stage of development.
Concepts¶
Job
A command line job that accepts parameters. Hyperctl provides a Python API for reading the job's parameters, so the command can execute a Python script that uses the API to obtain its parameters and complete the job.
Batch
A batch of jobs. All the status files and output files of jobs in the same batch are in the working directory of the batch.
Scheduler
Job scheduler, which schedules jobs in batch to run on appropriate machine resources and manages computing resources.
Backend
The backend that runs jobs. It can run jobs in standalone mode on the local machine, or on multiple remote nodes through the SSH protocol.
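For instance, the smallest possible job script just reads its parameters through Hyperctl's Python API (the same call used in the examples below):

from hypernets import hyperctl

# read the params assigned to this job as a dict, e.g. {"tol": 1}
job_params = hyperctl.get_job_params()
print(job_params)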
Quick start¶
Run batch using command line tool¶
After installing Hypernets, you can see the following description by typing hyperctl, which lists four subcommands: run, generate, batch and job:
$ hyperctl
usage: hyperctl [-h] [--log-level LOG_LEVEL] [-error] [-warn] [-info] [-debug] {run,generate,batch,job} ...
hyperctl command is used to manage jobs
positional arguments:
{run,generate,batch,job}
run run jobs
generate generate specific jobs json file
batch batch operations
job job operations
optional arguments:
-h, --help show this help message and exit
--log-level LOG_LEVEL
logging level, default is INFO
-error alias of "--log-level=ERROR"
-warn alias of "--log-level=WARN"
-info alias of "--log-level=INFO"
-debug alias of "--log-level=DEBUG"
As an example, let's use Hyperctl to tune the tol parameter of sklearn.linear_model.LogisticRegression. Create a job script ~/sklearn_iris_example.py with the following content:
import os
import pickle as pkl
from hypernets import hyperctl
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
job_params = hyperctl.get_job_params()  # read job params as a dict from hyperctl
tol = job_params['tol']
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8086)
lr = LogisticRegression(tol=tol)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(f"tol: {tol}, accuracy_score: {accuracy_score(y_pred, y_test)}")
# persist assets
report_dir = os.path.expanduser("~/report/")
os.makedirs(report_dir, exist_ok=True)
with open(os.path.join(report_dir, f"model_tol_{tol}.pkl"), 'wb') as f:
    pkl.dump(lr, f)
Hyperctl uses a JSON file to define the jobs. For example, create a file named batch.json that configures 2 jobs, setting the parameter tol to 1 and 100 respectively:
{
"name": "sklearn_iris_example",
"jobs": [{
"name": "tol_1",
"params": {
"tol": 1
},
"command": "python ~/sklearn_iris_example.py"
},
{
"name": "tol_100",
"params": {
"tol": 100
},
"command": "python ~/sklearn_iris_example.py"
}
]
}
Note
Make sure that the Python interpreter used by the job command has scikit-learn installed.
Run the batch with the command:
$ hyperctl run --config ./batch.json
After the batch finishes, view the output log files:
~/hyperctl-batches-working-dir/sklearn_iris_example/tol_1/stdout
----------------------------------------------------------------
tol: 1, accuracy_score: 0.9333333333333333
~/hyperctl-batches-working-dir/sklearn_iris_example/tol_100/stdout
------------------------------------------------------------------
tol: 100, accuracy_score: 0.36666666666666664
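Besides the logs, each job in the script above also pickles its fitted model under ~/report/. A minimal sketch for loading one back, using the file names from that script:

import os
import pickle as pkl

# file written by the tol_1 job in the script above
report_dir = os.path.expanduser("~/report/")
with open(os.path.join(report_dir, "model_tol_1.pkl"), 'rb') as f:
    lr = pkl.load(f)
print(lr)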
Run batch using API¶
Using the API to manage batches is more flexible than using the command line tool.
from hypernets.hyperctl.appliation import BatchApplication
from hypernets.hyperctl.batch import Batch
from hypernets.hyperctl.callbacks import ConsoleCallback  # used below; the import path may differ between hypernets versions
batch = Batch(name="remote-batch-example", data_dir="~/hyperctl/remote-batch-example", job_command="python ~/sklearn_iris_example.py")
batch.add_job(name='job1', params={"tol": 1})
batch.add_job(name='job2', params={"tol": 2})
batch.add_job(name='job3', params={"tol": 3})
backend_conf = {
"type": "remote",
"machines": [
{
"connection": { 'hostname': "172.20.30.105", 'username': "hyperctl", 'password': "hyperctl"} # modify to your host configuration
}, {
"connection": { 'hostname': "172.20.30.106", 'username': "hyperctl", 'password': "hyperctl"} # modify to your host configuration
}, {
"connection": { 'hostname': "172.20.30.107", 'username': "hyperctl", 'password': "hyperctl"} # modify to your host configuration
}
]
}
app = BatchApplication(batch, server_host="172.20.30.105", # modify to your host configuration
server_port=8061,
scheduler_exit_on_finish=True,
scheduler_interval=1000,
backend_conf=backend_conf,
independent_tmp=True,
scheduler_callbacks=[ConsoleCallback()])
app.start()
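The example above distributes jobs over SSH. To try the API on a single machine first, the local backend can be used instead; a minimal sketch, assuming BatchApplication accepts the same local backend config that is described in the references below:

from hypernets.hyperctl.appliation import BatchApplication
from hypernets.hyperctl.batch import Batch

batch = Batch(name="local-batch-example", data_dir="~/hyperctl/local-batch-example",
              job_command="python ~/sklearn_iris_example.py")
batch.add_job(name='job1', params={"tol": 1})
batch.add_job(name='job2', params={"tol": 100})

app = BatchApplication(batch,
                       scheduler_exit_on_finish=True,
                       backend_conf={"type": "local"})  # run jobs on this machine
app.start()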
Debug job in development stage¶
Job scripts have to be scheduled through a BatchApplication to run, so the script and the scheduler live in two separate processes. This is inconvenient when developing and debugging a job script. In that case, we can inject test parameters into the job and run the job script directly:
import os
import pickle as pkl
from hypernets import hyperctl
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
test_params = {  # define your test params
    "tol": 1
}
hyperctl.inject(params=test_params)  # inject test params; NOTE: remove this line when you are done debugging, otherwise the script cannot read params from BatchApplication

job_params = hyperctl.get_job_params()
tol = job_params['tol']
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8086)
lr = LogisticRegression(tol=tol)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(f"tol: {tol}, accuracy_score: {accuracy_score(y_pred, y_test)}")
# persist assets
report_dir = os.path.expanduser("~/report/")
os.makedirs(report_dir, exist_ok=True)
with open(os.path.join(report_dir, f"model_tol_{tol}.pkl"), 'wb') as f:
    pkl.dump(lr, f)
Now we can run or debug the job script directly. Note that the hyperctl.inject(params=test_params) call must be removed once the script needs to receive its params from BatchApplication again.
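To reduce the risk of forgetting to remove the injection, one option is to guard it with an environment variable. A sketch (HYPERCTL_DEBUG is just an illustrative name, not something Hyperctl defines):

import os
from hypernets import hyperctl

# inject only when explicitly debugging; otherwise the script still
# receives its params from BatchApplication as usual
if os.environ.get("HYPERCTL_DEBUG") == "1":
    hyperctl.inject(params={"tol": 1})

job_params = hyperctl.get_job_params()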
Generate jobs from template¶
Hyperctl can generate a batch of job configs by combining the parameter values listed in a configuration template; the generated file can then be used to run the batch.
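To make the combination logic concrete, here is a small sketch (not Hyperctl's actual implementation) that builds one job config per combination of the template params:

import itertools
import json

params = {"learning_rate": [0.1, 0.2], "max_depth": [3, 5]}
keys = list(params)

# cross product of all param value lists -> one job config per combination
jobs = [{"params": dict(zip(keys, values)), "command": "python3 cli.py"}
        for values in itertools.product(*(params[k] for k in keys))]
print(json.dumps(jobs, indent=2))  # 4 jobs, matching the generated file below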
Here is an example of how to use a template file to generate a batch config file. First, create a template file job-template.yml with the following content:
params:
  learning_rate: [0.1, 0.2]
  max_depth: [3, 5]
command: python3 cli.py
Then execute this command to generate the batch config file:
$ hyperctl generate --template ./job-template.yml --output ./batch.json
Here is the generated batch.json file:
{
"name": "eVqNV5Ut1",
"job": [{
"name": "eaqNV5Ut1",
"params": {
"learning_rate": 0.1,
"max_depth": 3
},
"command": "python3 cli.py"
}, {
"name": "ebqNV5Ut1",
"params": {
"learning_rate": 0.1,
"max_depth": 5
},
"command": "python3 cli.py"
}, {
"name": "ecqNV5Ut1",
"params": {
"learning_rate": 0.2,
"max_depth": 3
},
"command": "python3 cli.py"
}, {
"name": "edqNV5Ut1",
"params": {
"learning_rate": 0.2,
"max_depth": 5
},
"command": "python3 cli.py"
}]
}
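The generated file can then be run just like in the quick start:
$ hyperctl run --config ./batch.json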
Batch configuration file references¶
Examples¶
LocalBackend¶
{
"name": "local_backend_example",
"jobs": [
{
"name": "job1",
"params": {
"param1": 1
},
"command": "sleep 3"
}
],
"backend": {
"type": "local"
}
}
RemoteSSHBackend¶
{
"name": "local_backend_example",
"jobs": [
{
"name": "job1",
"params": {
"param1": 1
},
"command": "sleep 3"
}
],
"backend": {
"type": "remote",
"machines": [
{
"connection": {
"hostname": "host1",
"username": "hyperctl",
"password": "hyperctl"
}
}
]
},
"server": {
"host": "192.168.10.206"
}
}
Configuration references¶
BatchApplicationConfig¶
Field Name | Type | Description |
---|---|---|
name | str, required | Batch name; should be unique among batches. |
jobs | list[JobConfig], required | Jobs to run. |
backend | BackendConfig, optional | Platform the jobs run on; default is LocalBackendConfig. |
server | ServerConfig, optional | Server settings. |
scheduler | SchedulerConfig, optional | Scheduler settings. |
batches_data_dir | str, optional | Batches working directory, where the output files of batches are stored; Hyperctl creates a sub-directory named after each batch in this directory. Defaults to the environment variable HYPERCTL_BATCHES_DATA_DIR, or ~/hyperctl-batches-data-dir if it is not set. |
version | str, optional | If None, the currently running version is used; default is None. |
JobConfig¶
Field Name | Type | Description |
---|---|---|
name | str, optional | Unique within the batch; if null, a UUID is generated as the job name. Specifying a name is recommended: together with the batch name, it lets already-executed jobs be skipped when the batch is re-executed. |
params | dict, required | Job params; they can be obtained through the API hypernets.hyperctl.get_job_params. |
command | str, required | Command to run the job; if it executes a file, an absolute path or a path relative to {execution.working_dir} is recommended. |
working_dir | str, optional | Working directory to run the command in; default is {batches_data_dir}/{batch_name}/{job_name}. |
Note
A job writes its output files to {batches_data_dir}/{batch_name}/{job_name}, which usually contains:
- stdout: standard output
- stderr: standard error
- run.sh: shell script to run the job
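For example, to read a finished job's stdout programmatically (a sketch assuming the default batches_data_dir described above):

import os

# {batches_data_dir}/{batch_name}/{job_name}/stdout, using the defaults above
job_dir = os.path.expanduser("~/hyperctl-batches-data-dir/sklearn_iris_example/tol_1")
with open(os.path.join(job_dir, "stdout")) as f:
    print(f.read())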
LocalBackendConfig¶
Runs the batch in standalone mode; refer to the LocalBackend example above.
Field Name | Type | Description |
---|---|---|
type | "local" | |
environments | dict, optional | Environment variables to export for the job process. |
RemoteBackendConfig¶
Hyperctl supports running jobs in parallel on remote machines; this mode uses multiple machines to speed up the progress of the batch. It distributes jobs to remote nodes through the SSH protocol, so every remote node must run an SSH service and provide a connection account. Please refer to the RemoteSSHBackend example above.
Field Name | Type | Description |
---|---|---|
machines | list[RemoteMachineConfig], required | Connection and configuration information of the remote machines. |
RemoteMachineConfig¶
Field Name | Type | Description |
---|---|---|
connection | SSHConnectionConfig, required | Connection information for the remote machine. |
environments | dict, optional | Environment variables to export for the job process. |
SSHConnectionConfig¶
Field Name | Type | Description |
---|---|---|
hostname | str, required | IP or hostname of the remote machine. |
username | str, required | Username on the remote machine. |
password | str, required | Password for the remote machine. |
ServerConfig¶
Field Name | Type | Description |
---|---|---|
host | str, optional | Address the HTTP server binds to. With the remote backend this should be an IP address that is reachable from the remote machines; otherwise jobs will fail because they cannot reach the API server. Default is localhost. |
port | int, optional | HTTP server port; default is 8060. |
SchedulerConfig¶
Field Name | Type | Description |
---|---|---|
interval | int, optional | Scheduling interval in milliseconds; default is 5000. |
exit_on_finish | boolean, optional | Whether to exit the process when all jobs are finished; default is false. |
Job template configuration file references¶
Examples¶
Basic example¶
Refer to Generate jobs from template.
Configuration references¶
JobTemplateConfig¶
Field Name | Type | Description |
---|---|---|
name | str, required | Refer to BatchApplicationConfig.name. |
params | dict[str, list], required | Lists of candidate values per param, combined to generate the job configs. |
command | str, required | Refer to JobConfig.command. |
working_dir | str, optional | Refer to JobConfig.working_dir. |
backend | BackendConfig, optional | Refer to BatchApplicationConfig.backend. |
batches_data_dir | str, optional | Refer to BatchApplicationConfig.batches_data_dir. |
server | ServerConfig, optional | Refer to BatchApplicationConfig.server. |
scheduler | SchedulerConfig, optional | Refer to BatchApplicationConfig.scheduler. |
version | str, optional | Refer to BatchApplicationConfig.version. |