EdgeTB is a hybrid testbed for distributed machine learning (DML) at the edge. It combines Docker containers and physical nodes to create hybrid test environments. On the one hand, the physical nodes give EdgeTB computing and network fidelity close to that of a purely physical testbed. On the other hand, the emulators make it easy to generate large-scale test environments with flexible network topologies.
- At least 2 computing devices: one acts as the Controller, and the others act as Workers (either as physical nodes, class `PhysicalNode`, or as emulators, class `Emulator`).
- Software requirements for the computing devices:

| Computing device | Requirement |
| --- | --- |
| Controller | python3, python3-pip, NFS-Server |
| Worker (PhysicalNode) | python3, python3-pip, NFS-Client, iproute (iproute2) |
| Worker (Emulator) | python3, python3-pip, NFS-Client, iproute (iproute2), Docker |
- Copy `controller` into the Controller and install the python packages defined in `controller/ctl_req.txt`.
- Copy `worker` into each Worker and install the python packages defined in `worker/agent_req.txt`.
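Before installing, it can help to check which required commands are already available on a machine. This is a minimal sketch, not part of EdgeTB itself; the per-role command lists are assumptions drawn from the requirements table above (`tc` is provided by iproute2, and the NFS server/client packages are not command-line tools, so they are not checked here):

```python
import shutil

# Assumed per-role command requirements, based on the table above.
REQUIREMENTS = {
    "Controller": ["python3", "pip3"],                        # plus an NFS server
    "Worker (PhysicalNode)": ["python3", "pip3", "tc"],       # plus an NFS client
    "Worker (Emulator)": ["python3", "pip3", "tc", "docker"], # plus an NFS client
}

def missing_commands(role):
    """Return the required commands that are not on PATH for this role."""
    return [cmd for cmd in REQUIREMENTS[role] if shutil.which(cmd) is None]
```

For example, `missing_commands("Worker (Emulator)")` returns a list of anything still left to install on that machine.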
```
controller
├─ dml_app >> Where we prepare roles; static files, shared by NFS
│  ├─ dml_req.txt >> Role's execution environment
│  ├─ Dockerfile >> Role's execution environment
│  ├─ gl_peer.py >> Role's functions, an example
│  ├─ nns >> Neural networks
│  ├─ dml_utils.py
│  └─ worker_utils.py
├─ dml_file >> Dynamically generated files for each node, transmitted over the network
│  ├─ conf >> Generated by dml_tool/*_conf.py before running a test, sent to each node
│  └─ log >> Received from each node
├─ dataset >> Split dataset; static files, shared by NFS
├─ dml_tool
│  ├─ gl_dataset.json >> Dataset definition of all Gossip peer nodes, an example
│  ├─ gl_structure.json >> Structure definition of all Gossip peer nodes, an example
│  ├─ dataset_conf.py >> Used to generate the dataset conf file for each node
│  ├─ gl_structure_conf.py >> Used to generate the structure conf file for each Gossip peer node
│  ├─ conf_utils.py
│  ├─ splitter_utils.py
│  └─ splitter_fashion_mnist.py >> Used to download and/or split the dataset, an example
├─ gl_manager.py >> Runtime manager, an example
├─ gl_run.py >> Test environment definition, an example
└─ links.json >> Network links definition
```
```
worker
├─ agent.py >> Used to communicate with controller/*_run.py
├─ dml_app >> Mount point of controller/dml_app, over NFS
├─ dml_file
│  ├─ conf >> Received from the controller
│  └─ log >> Generated by each node while running a test, sent to the controller
└─ dataset >> Mount point of controller/dataset, over NFS
```
Prepare roles, neural networks, and datasets >> Define the test environment >> Run it >> Collect the results.
- The only thing you need to do on each Worker is to run `worker/agent.py` with python3 with root privileges. Root privileges are required because we need to mount NFS and install python packages via `python3-pip`.
- All the following operations should be completed on the Controller.
- Prepare roles, just like what `controller/dml_app/gl_peer.py` does.
- Prepare the neural network model, just like what `controller/dml_app/nns/nn_fashion_mnist.py` does.
- Prepare the dataset and split it, just like what `controller/dml_tool/splitter_fashion_mnist.py` does.
- Update `controller/dml_app/Dockerfile` and `controller/dml_app/dml_req.txt` to meet your DML.
- Prepare the test environment definition `controller/run.py`, just like what `controller/gl_run.py` does.
- Prepare the network definition `controller/links.json`.
- Prepare the Runtime Manager, just like what `controller/gl_manager.py` does.
- Run `controller/run.py` with python3 with root privileges and keep it running in a terminal (called Term).
- It takes a while to deploy the tc settings, so make your DML wait for a certain message before it starts running, such as a `GET` request for `/start`.
- Wait until Term displays `tc finish`, and then start your DML.
- Clear the test environment when you are done.
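The "wait for `/start`" pattern above can be sketched with Python's standard library. This is only an illustration, not the actual role code in `dml_app` (which may use a different HTTP stack); the port and handler names here are hypothetical:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Shared flag: the training loop waits on this until /start arrives.
start_event = threading.Event()

class RoleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/start':
            start_event.set()              # unblock the training loop
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'ok')
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):     # keep the example quiet
        pass

def serve(port):
    """Start the control server in a background thread and return it."""
    server = HTTPServer(('127.0.0.1', port), RoleHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def wait_and_train():
    start_event.wait()                     # block until /start is received
    # ... the actual DML training loop would run here ...
    return 'training started'
```

With this shape, the node sits idle after deployment and only begins training once the controller (or a manual `curl`) hits `/start`, which avoids racing the tc setup.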
- Same as steps 1-4 above.
- Just use the provided `controller/dml_app/gl_peer.py`, `controller/dml_app/Dockerfile`, and `controller/dml_app/dml_req.txt`.
- Modify `controller/gl_run.py` to define the test environment.
- Modify `controller/links.json` to define the network.
- Modify `controller/dml_tool/gl_dataset.json` to define the data used by each node, and modify `controller/dml_tool/gl_structure.json` to define the DML structure of each node; see `controller/dml_tool/README.md` for more.
- Run `controller/gl_run.py` with python3 with root privileges and keep it running in a terminal (called Term).
- In `controller/dml_tool`, type `python3 dataset_conf.py -d gl_dataset.json` in a terminal to generate the dataset conf files, and type `python3 gl_structure_conf.py -s gl_structure.json` to generate the DML structure conf files.
- Type `curl localhost:3333/conf/dataset` in a terminal to send the dataset conf files to each node. Wait until all nodes have received their dataset conf file. This function is defined in `controller/base/manager.py`.
- Type `curl localhost:3333/conf/structure` to send the DML structure conf files to each node. Wait until all nodes have received their structure conf file. This function is defined in `controller/base/manager.py`.
- Wait until Term displays `tc finish`.
- Type `curl localhost:3333/start` in a terminal to start all nodes. This function is defined in `controller/base/manager.py` and `controller/gl_manager.py`.
- When no node is still gossiping, type `curl localhost:3333/finish` in a terminal to stop all nodes and collect the result files. This function is defined in `controller/base/manager.py` and `controller/gl_manager.py`.
- Commands such as `curl localhost:3333/emulated/reset` and `curl localhost:3333/physical/reset` are used to remove all the emulated and physical nodes. These functions are defined in `controller/base/manager.py` and `worker/agent.py`.
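The manual `curl` steps above can also be scripted from the Controller. This is a hedged sketch using only Python's standard library; `MANAGER` and `send_control` are hypothetical names, assuming the manager listens on `localhost:3333` as in the steps above:

```python
import urllib.request

MANAGER = "http://localhost:3333"   # assumption: manager address from the steps above

def send_control(path, manager=MANAGER, opener=urllib.request.urlopen):
    """GET a manager control endpoint (e.g. '/conf/dataset') and return the body.

    `opener` is injectable so the function can be exercised without a live manager.
    """
    with opener(manager + path) as resp:
        return resp.read()

# Typical order, pausing between steps as the instructions require:
# send_control('/conf/dataset')    # then wait until all nodes have their conf
# send_control('/conf/structure')  # then wait for "tc finish" in Term
# send_control('/start')           # then wait until no node is still gossiping
# send_control('/finish')
```

Scripting the sequence this way makes repeated experiments less error-prone than retyping the `curl` commands, while keeping the same wait points.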
- Same as steps 1-4 above.
- Just use the provided `controller/dml_app/fl_trainer.py`, `controller/dml_app/fl_aggregator.py`, `controller/dml_app/Dockerfile`, and `controller/dml_app/dml_req.txt`.
- Modify `controller/fl_run.py` to define the test environment.
- Modify `controller/links.json` to define the network.
- Modify `controller/dml_tool/fl_dataset.json` to define the data used by each node, and modify `controller/dml_tool/fl_structure.json` to define the DML structure of each node; see `controller/dml_tool/README.md` for more.
- Run `controller/fl_run.py` with python3 with root privileges and keep it running in a terminal (called Term).
- In `controller/dml_tool`, type `python3 dataset_conf.py -d fl_dataset.json` in a terminal to generate the dataset conf files, and type `python3 fl_structure_conf.py -s fl_structure.json` to generate the DML structure conf files.
- Type `curl localhost:3333/conf/dataset` in a terminal to send the dataset conf files to each node. Wait until all nodes have received their dataset conf file. This function is defined in `controller/base/manager.py`.
- Type `curl localhost:3333/conf/structure` to send the DML structure conf files to each node. Wait until all nodes have received their structure conf file. This function is defined in `controller/base/manager.py`.
- Wait until Term displays `tc finish`.
- Type `curl localhost:3333/start?root=n1` in a terminal to start all nodes. This function is defined in `controller/base/manager.py` and `controller/fl_manager.py`. The root should be the first node defined in `controller/dml_tool/fl_structure.json`.
- When the preset number of training rounds is reached, EdgeTB automatically stops all nodes and collects the result files. This function is defined in `controller/base/manager.py` and `controller/fl_manager.py`.
- Commands such as `curl localhost:3333/emulated/reset` and `curl localhost:3333/physical/reset` are used to remove all the emulated and physical nodes. These functions are defined in `controller/base/manager.py` and `worker/agent.py`.
- Same as steps 1-4 above.
- Just use the provided `controller/dml_app/el_peer.py`, `controller/dml_app/Dockerfile`, and `controller/dml_app/dml_req.txt`.
- Modify `controller/el_run.py` to define the test environment.
- Modify `controller/links.json` to define the network.
- Modify `controller/dml_tool/el_dataset.json` to define the data used by each node, and modify `controller/dml_tool/el_structure.json` to define the DML structure of each node; see `controller/dml_tool/README.md` for more.
- Run `controller/el_run.py` with python3 with root privileges and keep it running in a terminal (called Term).
- In `controller/dml_tool`, type `python3 dataset_conf.py -d el_dataset.json` in a terminal to generate the dataset conf files, and type `python3 el_structure_conf.py -s el_structure.json` to generate the DML structure conf files.
- Type `curl localhost:3333/conf/dataset` in a terminal to send the dataset conf files to each node. Wait until all nodes have received their dataset conf file. This function is defined in `controller/base/manager.py`.
- Type `curl localhost:3333/conf/structure` to send the DML structure conf files to each node. Wait until all nodes have received their structure conf file. This function is defined in `controller/base/manager.py`.
- Wait until Term displays `tc finish`.
- Type `curl localhost:3333/start?root=n1` in a terminal to start all nodes. This function is defined in `controller/base/manager.py` and `controller/el_manager.py`. The root should be the first node defined in `controller/dml_tool/el_structure.json`.
- When the preset number of training rounds is reached, EdgeTB automatically stops all nodes and collects the result files. This function is defined in `controller/base/manager.py` and `controller/el_manager.py`.
- Commands such as `curl localhost:3333/emulated/reset` and `curl localhost:3333/physical/reset` are used to remove all the emulated and physical nodes. These functions are defined in `controller/base/manager.py` and `worker/agent.py`.
- Same as steps 1-4 above.
- Just use the provided `controller/dml_app/ra_peer.py`, `controller/dml_app/Dockerfile`, and `controller/dml_app/dml_req.txt`.
- Modify `controller/ra_run.py` to define the test environment.
- Modify `controller/links.json` to define the network.
- Modify `controller/dml_tool/ra_dataset.json` to define the data used by each node, and modify `controller/dml_tool/ra_structure.json` to define the DML structure of each node; see `controller/dml_tool/README.md` for more.
- Run `controller/ra_run.py` with python3 with root privileges and keep it running in a terminal (called Term).
- In `controller/dml_tool`, type `python3 dataset_conf.py -d ra_dataset.json` in a terminal to generate the dataset conf files, and type `python3 ra_structure_conf.py -s ra_structure.json` to generate the DML structure conf files.
- Type `curl localhost:3333/conf/dataset` in a terminal to send the dataset conf files to each node. Wait until all nodes have received their dataset conf file. This function is defined in `controller/base/manager.py`.
- Type `curl localhost:3333/conf/structure` to send the DML structure conf files to each node. Wait until all nodes have received their structure conf file. This function is defined in `controller/base/manager.py`.
- Wait until Term displays `tc finish`.
- Type `curl localhost:3333/start` in a terminal to start all nodes. This function is defined in `controller/base/manager.py` and `controller/ra_manager.py`.
- When the preset number of training rounds is reached, EdgeTB automatically stops all nodes and collects the result files. This function is defined in `controller/base/manager.py` and `controller/ra_manager.py`.
- Commands such as `curl localhost:3333/emulated/reset` and `curl localhost:3333/physical/reset` are used to remove all the emulated and physical nodes. These functions are defined in `controller/base/manager.py` and `worker/agent.py`.
Please cite our paper if you find EdgeTB useful in your research.

Lei Yang, Fulin Wen, Jiannong Cao, Zhenyu Wang. "EdgeTB: a Hybrid Testbed for Distributed Machine Learning at the Edge with High Fidelity." IEEE Transactions on Parallel and Distributed Systems. DOI: 10.1109/TPDS.2022.3144994.

EdgeTB is designed and developed by a joint research team at the School of Software Engineering, South China University of Technology, and the Department of Computing, The Hong Kong Polytechnic University. If you have any questions, please contact us: Fulin Wen [email protected] and Lei Yang [email protected].