Runs Management - fedml run
FedML Run CLI Overview
Manage runs on the TensorOpera AI Platform
Usage: fedml run [OPTIONS] COMMAND [ARGS]...
Manage runs on the TensorOpera AI Platform.
Options:
-h, --help Show this message and exit.
-k, --api_key TEXT The user API key.
-v, --version TEXT Version of TensorOpera AI Platform. It should be dev,
test or release.
-pf, --platform TEXT The platform name at the TensorOpera AI Platform
(options: octopus, parrot, spider, beehive, falcon,
launch, default is falcon).
Commands:
list List runs from the TensorOpera AI Platform.
logs Get logs of run from the TensorOpera AI Platform.
status Get status of run from the TensorOpera AI Platform.
stop Stop a run from the TensorOpera AI Platform.
fedml run list [OPTIONS]
List runs from the TensorOpera AI Platform.
Options
Option | Description |
---|---|
--help or -h | Show this message and exit. |
--run_name or -r | Run name at the TensorOpera AI Platform. |
--run_id or -rid | Run id at the TensorOpera AI Platform. |
--api_key or -k | The user API key. |
--version or -v | Version of TensorOpera AI Platform. It should be dev, test or release. |
--platform or -pf | The platform name at the TensorOpera AI Platform (options: octopus, parrot, spider, beehive, falcon, launch, default is falcon). |
Example
List all runs on the TensorOpera AI Platform.
feml run list
Found the following matched runs.
+----------------------+---------------------+----------+---------------------+------------------+------+
| Run Name | Run ID | Status | Created | Spend Time(hour) | Cost |
+----------------------+---------------------+----------+---------------------+------------------+------+
| tight_ready | 1684458113152978944 | FINISHED | 2023-07-27 06:58:04 | 0.0333 | 2.0 |
| shorter_tax | 1684458685260238848 | FINISHED | 2023-07-27 07:00:20 | 0.0333 | 2.0 |
| swam_fellow | 1684500824392339456 | FINISHED | 2023-07-27 09:47:47 | 0.0333 | 2.0 |
| national_your | 1684753343311908864 | FINISHED | 2023-07-28 02:31:13 | 0.0333 | 2.0 |
+----------------------+---------------------+----------+---------------------+------------------+------+
List selected runs on the TensorOpera AI Platform.
fedml run list -r tight_ready
Found the following matched runs.
+-------------+---------------------+----------+---------------------+------------------+------+
| Run Name | Run ID | Status | Created | Spend Time(hour) | Cost |
+-------------+---------------------+----------+---------------------+------------------+------+
| tight_ready | 1684458113152978944 | FINISHED | 2023-07-27 06:58:04 | 0.0333 | 2.0 |
+-------------+---------------------+----------+---------------------+------------------+------+
fedml run logs [OPTIONS]
Get logs of run from the TensorOpera AI Platform.
Options
Option | Description |
---|---|
--help or -h | Show this message and exit. |
--run_id or -rid | Run id at the TensorOpera AI Platform. |
--need_all_logs or -a | Boolean value representing if all logs are needed. Default to True. |
--page_num or -pn | request page num for logs. --need_all_logs should be set to False if you want to use this option. |
--page_size or -ps | request page size for logs, --need_all_logs should be set to False if you want to use this option. |
--api_key or -k | The user API key. |
--version or -v | Version of TensorOpera AI Platform. It should be dev, test or release. |
--platform or -pf | The platform name at the TensorOpera AI Platform (options: octopus, parrot, spider, beehive, falcon, launch, default is falcon). |
Example
fedml run logs -rid 1716563514434392064
Logs summary info is as follows.
+---------------------+-----------------+---------------------------------------------------------------------------------------+
| Run ID | Total Log Lines | Log URL |
+---------------------+-----------------+---------------------------------------------------------------------------------------+
| 1716563514434392064 | 11 | https://s3.us-west-1.amazonaws.com/fedml/fedml-logs/fedml-run-1716563514434392064.log |
+---------------------+-----------------+---------------------------------------------------------------------------------------+
Logs URL for each device is as follows.
+---------------------+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
| Device ID | Device Name | Device Log URL |
+---------------------+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
| 1684824138201567232 | NVIDIA A100-SXM4-80GB:8 | https://s3.us-west-1.amazonaws.com/fedml/fedml-logs/fedml-run-1714535384211394560-edge-1684824138201567232%40user-214.log |
+---------------------+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
All logs is as follows.
[FedML-Client @device-id-1684824138201567232] [Mon, 23 Oct 2023 14:13:30 -0700] [INFO]-----GPU Machine scheduling successful-----
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:13:54 -0700] [INFO] [mlops_metrics.py:287:report_server_id_status] report_server_id_status. message_json = {"run_id": 1716563514434392064, "edge_id": 201649, "status": "STARTING"}
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:13:54 -0700] [INFO] [mlops_metrics.py:229:report_server_device_status_to_web_ui] report_server_device_status. msg = {'run_id': 1716563514434392064, 'edge_id': 201649, 'status': 'STARTING', 'role': 'normal', 'version': 'v1.0'}
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:13:54 -0700] [INFO] [mlops_metrics.py:229:report_server_device_status_to_web_ui] report_server_device_status. msg = {'run_id': 1716563514434392064, 'edge_id': 201649, 'status': 'STARTING', 'role': 'normal', 'version': 'v1.0'}
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:13:54 -0700] [INFO] [server_runner.py:502:run_impl] Detect all status of Edge ids: [1684824138201567232]
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:38:59 -0700] [ERROR] [server_runner.py:934:detect_edges_status] There are inactive edge devices. Inactivate edge id list is as follows. [1684824138201567232]
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:38:59 -0700] [INFO] [mlops_metrics.py:287:report_server_id_status] report_server_id_status. message_json = {"run_id": 1716563514434392064, "edge_id": 201649, "status": "FAILED", "server_id": 201649}
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:38:59 -0700] [INFO] [mlops_metrics.py:229:report_server_device_status_to_web_ui] report_server_device_status. msg = {'run_id': 1716563514434392064, 'edge_id': 201649, 'status': 'FAILED', 'role': 'normal', 'version': 'v1.0'}
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:38:59 -0700] [ERROR] [server_runner.py:1441:send_exit_train_with_exception_request_to_edges] exit_train_with_exception: send topic flserver_agent/1684824138201567232/exit_train_with_exception
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:38:59 -0700] [INFO] [mlops_metrics.py:158:common_broadcast_client_training_status] report_client_training_status. message_json = {"edge_id": 1684824138201567232, "run_id": 1716563514434392064, "status": "FAILED"}
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:38:59 -0700] [INFO] [server_runner.py:438:run] Release resources.
fedml run status [OPTIONS]
Get status of run from the TensorOpera AI Platform.
Options
Option | Description |
---|---|
--help or -h | Show this message and exit |
--run_name or -r | Run name at the TensorOpera AI Platform. |
--run_id or -rid | Run id at the TensorOpera AI Platform. |
--api_key or -k | The user API key. |
--version or -v | Version of TensorOpera AI Platform. It should be dev, test or release. |
--platform or -pf | The platform name at the TensorOpera AI Platform (options: octopus, parrot, spider, beehive, falcon, launch, default is falcon). |
Example
❯ fedml run status -r particular_determine
Found the following matched runs.
+----------------------+---------------------+----------+---------------------+------------------+------+
| Run Name | Run ID | Status | Created | Spend Time(hour) | Cost |
+----------------------+---------------------+----------+---------------------+------------------+------+
| particular_determine | 1684754107195330560 | FINISHED | 2023-07-28 02:34:15 | 0.0333 | 2.0 |
+----------------------+---------------------+----------+---------------------+------------------+------+
fedml run stop [OPTIONS]
Stop a run from the TensorOpera AI Platform.
Options
Option | Description |
---|---|
--help or -h | Show this message and exit. |
--run_id or -rid | Id of the run. |
--api_key or -k | The user API key. |
--version or -v | Version of TensorOpera AI Platform. It should be dev, test or release. |
--platform or -pf | The platform name at the TensorOpera AI Platform (options: octopus, parrot, spider, beehive, falcon, launch, default is falcon). |
Example
fedml run stop -rid 1716563514434392064
Run 1716563514434392064 is stopped successfully.