Runs Management - fedml run

FedML Run CLI Overview

Manage runs on the TensorOpera AI Platform

Usage: fedml run [OPTIONS] COMMAND [ARGS]...

  Manage runs on the TensorOpera AI Platform.

Options:
  -h, --help            Show this message and exit.
  -k, --api_key TEXT    The user API key.
  -v, --version TEXT    Version of TensorOpera AI Platform. It should be dev,
                        test or release.
  -pf, --platform TEXT  The platform name at the TensorOpera AI Platform
                        (options: octopus, parrot, spider, beehive, falcon,
                        launch, default is falcon).

Commands:
  list    List runs from the TensorOpera AI Platform.
  logs    Get logs of run from the TensorOpera AI Platform.
  status  Get status of run from the TensorOpera AI Platform.
  stop    Stop a run from the TensorOpera AI Platform.
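
The `fedml run` commands can also be called from a Python script. The sketch below is a hypothetical wrapper (`run_fedml` is not part of FedML). It uses `subprocess` to invoke `fedml run <command>` and appends the shared `-k`, `-v` and `-pf` options after the subcommand, which is where each subcommand's options table lists them. It assumes the `fedml` executable is on `PATH`. The `runner` parameter exists only so the call can be swapped out in tests.

```python
import subprocess
from typing import Callable, Optional, Sequence


def run_fedml(
    command: str,
    args: Sequence[str] = (),
    api_key: Optional[str] = None,
    version: Optional[str] = None,
    platform: Optional[str] = None,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Hypothetical helper: run `fedml run <command> ...` and return its stdout."""
    argv = ["fedml", "run", command, *args]
    # Shared options; any that are omitted fall back to the CLI's own defaults.
    if api_key is not None:
        argv += ["-k", api_key]
    if version is not None:
        argv += ["-v", version]  # dev, test or release
    if platform is not None:
        argv += ["-pf", platform]  # e.g. falcon (the default) or launch
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    result = runner(argv, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `run_fedml("list", ["-r", "tight_ready"], version="release")` runs `fedml run list -r tight_ready -v release`.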

`fedml run list [OPTIONS]`

List runs from the TensorOpera AI Platform.

Options

Option	Description
`--help` or `-h`	Show this message and exit.
`--run_name` or `-r`	Run name on the TensorOpera AI Platform.
`--run_id` or `-rid`	Run ID on the TensorOpera AI Platform.
`--api_key` or `-k`	The user API key.
`--version` or `-v`	Version of the TensorOpera AI Platform: dev, test, or release.
`--platform` or `-pf`	Platform name on the TensorOpera AI Platform. Options: octopus, parrot, spider, beehive, falcon, launch. Defaults to falcon.

Example

List all runs on the TensorOpera AI Platform.

fedml run list

Found the following matched runs.
+----------------------+---------------------+----------+---------------------+------------------+------+
|       Run Name       |        Run ID       |  Status  |       Created       | Spend Time(hour) | Cost |
+----------------------+---------------------+----------+---------------------+------------------+------+
|     tight_ready      | 1684458113152978944 | FINISHED | 2023-07-27 06:58:04 |      0.0333      | 2.0  |
|     shorter_tax      | 1684458685260238848 | FINISHED | 2023-07-27 07:00:20 |      0.0333      | 2.0  |
|     swam_fellow      | 1684500824392339456 | FINISHED | 2023-07-27 09:47:47 |      0.0333      | 2.0  |
|    national_your     | 1684753343311908864 | FINISHED | 2023-07-28 02:31:13 |      0.0333      | 2.0  |
+----------------------+---------------------+----------+---------------------+------------------+------+

List selected runs on the TensorOpera AI Platform.

fedml run list -r tight_ready

Found the following matched runs.
+-------------+---------------------+----------+---------------------+------------------+------+
|   Run Name  |        Run ID       |  Status  |       Created       | Spend Time(hour) | Cost |
+-------------+---------------------+----------+---------------------+------------------+------+
| tight_ready | 1684458113152978944 | FINISHED | 2023-07-27 06:58:04 |      0.0333      | 2.0  |
+-------------+---------------------+----------+---------------------+------------------+------+
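
When scripting against `fedml run list`, the printed table can be turned into records. The following is a minimal sketch, assuming the layout shown above (a header row and data rows separated by `|`, framed by `+---+` border lines); `parse_run_table` is a hypothetical helper, and the format is not a documented, stable interface.

```python
from typing import Dict, List


def parse_run_table(output: str) -> List[Dict[str, str]]:
    """Parse the ASCII run table into one dict per run, keyed by column header."""
    # Keep only table rows (lines starting with '|'); skip borders and prose lines.
    rows = [line for line in output.splitlines() if line.strip().startswith("|")]
    if not rows:
        return []

    def cells(line: str) -> List[str]:
        # Strip the outer '|' characters, then split and trim each cell.
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    header = cells(rows[0])
    return [dict(zip(header, cells(row))) for row in rows[1:]]
```

For instance, `[r["Run ID"] for r in parse_run_table(output) if r["Status"] == "FINISHED"]` collects the IDs of finished runs. The `fedml run status` output prints the same table layout.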

`fedml run logs [OPTIONS]`

Get logs of run from the TensorOpera AI Platform.

Options

Option	Description
`--help` or `-h`	Show this message and exit.
`--run_id` or `-rid`	Run ID on the TensorOpera AI Platform.
`--need_all_logs` or `-a`	Boolean value indicating whether all logs are needed. Defaults to True.
`--page_num` or `-pn`	Page number of the logs to request. To use this option, set `--need_all_logs` to False.
`--page_size` or `-ps`	Page size of the logs to request. To use this option, set `--need_all_logs` to False.
`--api_key` or `-k`	The user API key.
`--version` or `-v`	Version of the TensorOpera AI Platform: dev, test, or release.
`--platform` or `-pf`	Platform name on the TensorOpera AI Platform. Options: octopus, parrot, spider, beehive, falcon, launch. Defaults to falcon.
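
Because `--page_num` and `--page_size` only apply when `--need_all_logs` is False, a script that pages through logs has to turn that option off explicitly. The sketch below builds such a command line. `logs_args` is a hypothetical helper. Passing the literal value `False` to `-a` is an assumption based on the option being described as a boolean value. This page also does not say whether page numbers start at 0 or 1, so the helper passes them through unchanged.

```python
from typing import List, Optional


def logs_args(run_id: str, page_num: Optional[int] = None,
              page_size: Optional[int] = None) -> List[str]:
    """Hypothetical helper: build the argv for `fedml run logs`."""
    argv = ["fedml", "run", "logs", "-rid", run_id]
    if page_num is None and page_size is None:
        return argv  # default behaviour: fetch all logs
    # Paging requires --need_all_logs to be False (assumed to accept "False").
    argv += ["-a", "False"]
    if page_num is not None:
        argv += ["-pn", str(page_num)]
    if page_size is not None:
        argv += ["-ps", str(page_size)]
    return argv
```

The resulting list can be passed directly to `subprocess.run`.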

Example

fedml run logs -rid 1716563514434392064

Logs summary info is as follows.
+---------------------+-----------------+---------------------------------------------------------------------------------------+
|        Run ID       | Total Log Lines |                                        Log URL                                        |
+---------------------+-----------------+---------------------------------------------------------------------------------------+
| 1716563514434392064 |        11       | https://s3.us-west-1.amazonaws.com/fedml/fedml-logs/fedml-run-1716563514434392064.log |
+---------------------+-----------------+---------------------------------------------------------------------------------------+

Logs URL for each device is as follows.
+---------------------+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
|      Device ID      |     Device Name         |                                                       Device Log URL                                                      |
+---------------------+-------------------------+---------------------------------------------------------------------------------------------------------------------------+
| 1684824138201567232 | NVIDIA A100-SXM4-80GB:8 | https://s3.us-west-1.amazonaws.com/fedml/fedml-logs/fedml-run-1714535384211394560-edge-1684824138201567232%40user-214.log |
+---------------------+-------------------------+---------------------------------------------------------------------------------------------------------------------------+

All logs is as follows.
[FedML-Client @device-id-1684824138201567232] [Mon, 23 Oct 2023 14:13:30 -0700] [INFO]-----GPU Machine scheduling successful-----
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:13:54 -0700] [INFO] [mlops_metrics.py:287:report_server_id_status] report_server_id_status. message_json = {"run_id": 1716563514434392064, "edge_id": 201649, "status": "STARTING"}
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:13:54 -0700] [INFO] [mlops_metrics.py:229:report_server_device_status_to_web_ui] report_server_device_status. msg = {'run_id': 1716563514434392064, 'edge_id': 201649, 'status': 'STARTING', 'role': 'normal', 'version': 'v1.0'}
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:13:54 -0700] [INFO] [mlops_metrics.py:229:report_server_device_status_to_web_ui] report_server_device_status. msg = {'run_id': 1716563514434392064, 'edge_id': 201649, 'status': 'STARTING', 'role': 'normal', 'version': 'v1.0'}
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:13:54 -0700] [INFO] [server_runner.py:502:run_impl] Detect all status of Edge ids: [1684824138201567232]
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:38:59 -0700] [ERROR] [server_runner.py:934:detect_edges_status] There are inactive edge devices. Inactivate edge id list is as follows. [1684824138201567232]
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:38:59 -0700] [INFO] [mlops_metrics.py:287:report_server_id_status] report_server_id_status. message_json = {"run_id": 1716563514434392064, "edge_id": 201649, "status": "FAILED", "server_id": 201649}
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:38:59 -0700] [INFO] [mlops_metrics.py:229:report_server_device_status_to_web_ui] report_server_device_status. msg = {'run_id': 1716563514434392064, 'edge_id': 201649, 'status': 'FAILED', 'role': 'normal', 'version': 'v1.0'}
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:38:59 -0700] [ERROR] [server_runner.py:1441:send_exit_train_with_exception_request_to_edges] exit_train_with_exception: send topic flserver_agent/1684824138201567232/exit_train_with_exception
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:38:59 -0700] [INFO] [mlops_metrics.py:158:common_broadcast_client_training_status] report_client_training_status. message_json = {"edge_id": 1684824138201567232, "run_id": 1716563514434392064, "status": "FAILED"}
[FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:38:59 -0700] [INFO] [server_runner.py:438:run] Release resources.
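
In the example above, the failure cause (inactive edge devices) sits in an `[ERROR]` line among many `[INFO]` lines. The sketch below pulls such lines out of the `fedml run logs` output. It is a minimal sketch that assumes every log line follows the `[FedML-<Role> @device-id-<id>] [<timestamp>] [<LEVEL>] <message>` shape shown here. `parse_log_line`, `LogRecord` and `errors` are hypothetical names.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import List, Optional

# Matches lines such as:
# [FedML-Server @device-id-201649] [Mon, 23 Oct 2023 14:38:59 -0700] [ERROR] [...] message
LOG_LINE = re.compile(
    r"^\[FedML-(?P<role>\w+) @device-id-(?P<device_id>\d+)\] "
    r"\[(?P<time>[^\]]+)\] "
    r"\[(?P<level>[A-Z]+)\]\s*(?P<message>.*)$"
)


@dataclass
class LogRecord:
    role: str       # "Server" or "Client"
    device_id: str
    time: datetime  # timezone-aware
    level: str      # e.g. INFO, ERROR
    message: str


def parse_log_line(line: str) -> Optional[LogRecord]:
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None  # summary tables, headings, and other non-log lines
    return LogRecord(
        role=match["role"],
        device_id=match["device_id"],
        # The timestamp is in RFC 2822 date format, which the stdlib parses.
        time=parsedate_to_datetime(match["time"]),
        level=match["level"],
        message=match["message"],
    )


def errors(output: str) -> List[LogRecord]:
    """Return only the ERROR-level records from the full command output."""
    records = (parse_log_line(line) for line in output.splitlines())
    return [r for r in records if r is not None and r.level == "ERROR"]
```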

`fedml run status [OPTIONS]`

Get status of run from the TensorOpera AI Platform.

Options

Option	Description
`--help` or `-h`	Show this message and exit.
`--run_name` or `-r`	Run name on the TensorOpera AI Platform.
`--run_id` or `-rid`	Run ID on the TensorOpera AI Platform.
`--api_key` or `-k`	The user API key.
`--version` or `-v`	Version of the TensorOpera AI Platform: dev, test, or release.
`--platform` or `-pf`	Platform name on the TensorOpera AI Platform. Options: octopus, parrot, spider, beehive, falcon, launch. Defaults to falcon.

Example

fedml run status -r particular_determine

Found the following matched runs.
+----------------------+---------------------+----------+---------------------+------------------+------+
|       Run Name       |        Run ID       |  Status  |       Created       | Spend Time(hour) | Cost |
+----------------------+---------------------+----------+---------------------+------------------+------+
| particular_determine | 1684754107195330560 | FINISHED | 2023-07-28 02:34:15 |      0.0333      | 2.0  |
+----------------------+---------------------+----------+---------------------+------------------+------+
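
A common use of `fedml run status` is to wait until a run ends. The following sketch polls a status source until it returns a terminal status. `wait_for_run` is hypothetical, and `fetch_status` is a caller-supplied function, for example one that runs `fedml run status -rid <id>` and reads the Status column of the printed table. The examples on this page show the statuses STARTING, FINISHED and FAILED. Treating only FINISHED and FAILED as terminal is an assumption, and a stopped run may report a status name not shown here.

```python
import time
from typing import Callable, Iterable


def wait_for_run(
    fetch_status: Callable[[], str],
    terminal: Iterable[str] = ("FINISHED", "FAILED"),
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the run reaches a terminal status; raise TimeoutError otherwise."""
    done = set(terminal)
    waited = 0.0
    while True:
        status = fetch_status()
        if status in done:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"run still {status} after {waited:.0f}s")
        sleep(poll_seconds)
        # Count the time spent sleeping (ignores the time fetch_status takes).
        waited += poll_seconds
```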

`fedml run stop [OPTIONS]`

Stop a run from the TensorOpera AI Platform.

Options

Option	Description
`--help` or `-h`	Show this message and exit.
`--run_id` or `-rid`	Run ID on the TensorOpera AI Platform.
`--api_key` or `-k`	The user API key.
`--version` or `-v`	Version of the TensorOpera AI Platform: dev, test, or release.
`--platform` or `-pf`	Platform name on the TensorOpera AI Platform. Options: octopus, parrot, spider, beehive, falcon, launch. Defaults to falcon.

Example

fedml run stop -rid 1716563514434392064

Run 1716563514434392064 is stopped successfully.
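
In a script, a stop request can be checked against the confirmation line shown in the example above. The sketch below does that. `stop_run` is a hypothetical helper. Treating the exit code plus the exact message text as the success signal is an assumption, since the wording may differ between FedML versions. The `runner` parameter only exists so the subprocess call can be replaced in tests.

```python
import subprocess
from typing import Callable


def stop_run(run_id: str,
             runner: Callable[..., subprocess.CompletedProcess] = subprocess.run) -> bool:
    """Hypothetical helper: call `fedml run stop -rid <run_id>` and report success."""
    result = runner(["fedml", "run", "stop", "-rid", run_id],
                    capture_output=True, text=True)
    # Assumed success signal: exit code 0 and the confirmation line in stdout.
    expected = f"Run {run_id} is stopped successfully."
    return result.returncode == 0 and expected in result.stdout
```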
