5.0 Availability
Status, proposals, and priorities related to availability (5.0).
- Status: Step #1, initial draft
- midonet availability
  - network state kept in an available database (zookeeper / cassandra)
  - network traffic gateways, multiple for redundancy (up to 6), for both availability and load balancing
- API (tomcat) is a single point of failure; keystone is required for authenticated access (v1.9), which may have changed in 5.0
- eucanetd is a single point of failure (CLC colocated); no channel for error conditions
- instance metadata is a single point of failure (nginx, related networking)
- no general networking path from instances to the cloud; DNS would have similar issues to instance metadata
- cluster controller availability / network broadcast
  - changes to broadcast could impact instance gating
  - cluster is a stateless component, so persistence is not an issue
  - more issues with memory use / networking than with availability
  - in managed modes the cluster is still required for routing traffic
  - in edge / vpc modes the cluster is not in the data path
  - enabled/disabled cluster components may need work / testing
  - the synchronous messaging model can cause reliability issues
Something along the lines of Cinder from OpenStack as a common front-end could allow reuse of some management/monitoring between backends. More common code may also make configuration / property lifecycle issues easier to resolve.
- ufs / available backend
  - improve handling of errors from the backend, such as retries on connections (see the retry sketch after this list)
  - persistence availability for S3 metadata
  - database configuration cache / properties, to reduce database use
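A minimal sketch of the kind of connection retry handling mentioned above; the `BackendRetry` helper, attempt counts, and delays are illustrative rather than existing UFS code:

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: retry a backend call with exponential backoff.
public class BackendRetry {
  public static <T> T withRetries(Callable<T> call, int maxAttempts) throws Exception {
    long delayMillis = 100;
    for (int attempt = 1; ; attempt++) {
      try {
        return call.call();
      } catch (IOException e) {                     // retry only connection-style failures
        if (attempt >= maxAttempts) throw e;        // give up after the configured attempts
        TimeUnit.MILLISECONDS.sleep(delayMillis);
        delayMillis = Math.min(delayMillis * 2, 5_000);  // back off, capped at 5 seconds
      }
    }
  }
}
```

A backend call could then be wrapped as `withRetries(() -> backend.headObject(bucket, key), 3)`, where `backend.headObject` stands in for whatever client call is being protected.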
- deploy in a VM behind ELB
  - could be more of a consideration for managed deployments, etc.
  - customer feedback is that a service is less easy to manage in a service VM: harder to configure and debug
- multiple consoles would need centralized caching
- long-running tasks, such as deleting a scaling group, creating an image from an instance, or multi-delete for S3
- single point of failure
- stateful backend services
  - resource reservations
  - background tasks
  - task polling
  - workflow
- database
  - management (postgres)
  - persistence for services
- service state coordination
- network information broadcast
Eucanetd could run on redundant components such as ufs. Tomcat could be paired with eucanetd on each ufs host. We would need to coordinate the work performed by eucanetd, either by explicit control of the active host or by partitioning work between eucanetd hosts; a leader-election sketch for the active-host approach follows. This implies that ufs hosts would need to coordinate over the network view (no in-memory state).
Instance metadata is a single point of failure (nginx, bridges); we could replace nginx with Java listeners (on ufs?).
We could restore support for redundant cluster controllers to avoid a single point of failure.
Switching to polling for the network view would remove some failures on cluster outage.
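One way to pick the active eucanetd host would be leader election over ZooKeeper, which is already present for MidoNet. A minimal sketch using Apache Curator's LeaderLatch recipe; the connect string, latch path, and host id are placeholders:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class EucanetdLeaderExample {
  public static void main(String[] args) throws Exception {
    // Placeholder connect string; in practice this would point at the existing ZooKeeper ensemble.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    // Each ufs host creates a latch; only one host holds leadership at a time.
    LeaderLatch latch = new LeaderLatch(client, "/eucalyptus/eucanetd-leader", "ufs-host-1");
    latch.start();
    latch.await();  // blocks until this host becomes the active eucanetd host

    // ... apply the network view here; if this host's ZooKeeper session is lost,
    // leadership moves to another ufs host automatically.
  }
}
```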
- elasticache - for console shared cache
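For the console, a shared ElastiCache/memcached endpoint would let any console instance serve any session. The console itself is not Java, so this spymemcached sketch only illustrates the access pattern; the endpoint and keys are placeholders:

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class SharedCacheExample {
  public static void main(String[] args) throws Exception {
    // Placeholder ElastiCache / memcached endpoint shared by all console instances.
    MemcachedClient cache = new MemcachedClient(
        new InetSocketAddress("cache.example.internal", 11211));

    cache.set("console:session:abc123", 3600, "serialized-session-state");  // 1 hour TTL
    Object session = cache.get("console:session:abc123");  // visible to every console instance
    System.out.println(session);

    cache.shutdown();
  }
}
```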
- move workflows to SWF (see the polling sketch after this list)
  - console
  - elb (in progress)
  - auto scaling
  - ec2
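A rough sketch of the activity-worker side of an SWF-based workflow, using the plain AWS Java SDK; the domain, task list, and work performed are placeholders, and a real implementation would more likely use the Flow framework than raw polling:

```java
import com.amazonaws.services.simpleworkflow.AmazonSimpleWorkflow;
import com.amazonaws.services.simpleworkflow.AmazonSimpleWorkflowClientBuilder;
import com.amazonaws.services.simpleworkflow.model.*;

public class SwfActivityPollerExample {
  public static void main(String[] args) {
    // In a Eucalyptus deployment this client would be pointed at the cloud's own SWF endpoint.
    AmazonSimpleWorkflow swf = AmazonSimpleWorkflowClientBuilder.defaultClient();

    while (true) {
      // Long-poll for the next activity task; any worker on any host can pick it up,
      // which is what removes the single-host dependency for long-running tasks.
      ActivityTask task = swf.pollForActivityTask(new PollForActivityTaskRequest()
          .withDomain("example-domain")
          .withTaskList(new TaskList().withName("example-task-list")));
      if (task.getTaskToken() == null) continue;  // poll timed out, no work

      String result = doWork(task.getInput());    // placeholder for the actual activity work

      swf.respondActivityTaskCompleted(new RespondActivityTaskCompletedRequest()
          .withTaskToken(task.getTaskToken())
          .withResult(result));
    }
  }

  private static String doWork(String input) { return "done:" + input; }
}
```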
- SQS / queuing (see the worker sketch after this list)
  - service / console (periodic) task distribution
  - distributed events?
  - notification of events between components (or to the console)
  - other service usage
  - ec2 import tasks
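A sketch of periodic task distribution over SQS with the AWS Java SDK: any component can enqueue a task, whichever worker receives it processes it, and the visibility timeout gives retry if a worker fails mid-task. The queue URL and message body are placeholders:

```java
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class SqsTaskWorkerExample {
  public static void main(String[] args) {
    AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    String queueUrl = "https://sqs.example.internal/queues/console-tasks";  // placeholder queue

    // Any component can enqueue work; any worker can consume it.
    sqs.sendMessage(queueUrl, "{\"task\":\"refresh-cloudwatch-data\"}");

    // Workers long-poll; a received message is hidden from other workers while it is
    // being processed, and reappears for retry if the worker dies before deleting it.
    for (Message m : sqs.receiveMessage(new ReceiveMessageRequest(queueUrl)
        .withMaxNumberOfMessages(1).withWaitTimeSeconds(20)).getMessages()) {
      System.out.println("processing " + m.getBody());
      sqs.deleteMessage(queueUrl, m.getReceiptHandle());  // acknowledge only after success
    }
  }
}
```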
- alternative persistence such as Cassandra
  - might not be ideal for services such as SQS (at scale)
  - relation to midonet use of cassandra?
- Postgresql availability improvements
  - does not currently meet availability needs with the base product or enhancements
  - may become simpler as less data is stored in postgres
  - postgres in a vm is another option to reduce the size of the main db
- vm services may be a way to make additional internal services more robust
- scalable support for cloudformation workflows (for example)
- jgroups - hosts / discovery
  - multicast use (local rather than global from 4.3)
  - could replace with zookeeper, atomix, etcd (static config); see the registration sketch after this list
  - register hosts with one or two other "nodes"
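If jgroups multicast were replaced with a registry, each host could publish an ephemeral node and discover peers by listing children, getting liveness from session expiry rather than multicast heartbeats. A minimal Curator sketch; the ensemble, paths, and host names are placeholders:

```java
import java.nio.charset.StandardCharsets;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;

public class HostRegistryExample {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181", new ExponentialBackoffRetry(1000, 3));  // placeholder ensemble
    client.start();

    // Ephemeral registration: the node disappears automatically if this host's session is
    // lost, providing the liveness information jgroups currently gets via multicast.
    client.create().creatingParentsIfNeeded().withMode(CreateMode.EPHEMERAL)
        .forPath("/eucalyptus/hosts/host-10-1-1-5", "10.1.1.5".getBytes(StandardCharsets.UTF_8));

    // Discovery: list currently registered hosts (a watch could be set for change notification).
    System.out.println(client.getChildren().forPath("/eucalyptus/hosts"));
  }
}
```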
- cloudwatch logs / logs into S3 (see the upload sketch after this list)
  - reduces operations concerns when functionality is more distributed or in vms
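Shipping logs into S3 could be as simple as periodically uploading rotated files under a per-host prefix; a sketch with the AWS Java SDK, where the bucket, key, and file path are placeholders:

```java
import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class LogUploadExample {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Upload a rotated log file under a per-host/per-day prefix so logs from
    // distributed or VM-hosted services end up in one searchable place.
    s3.putObject("example-ops-logs",
        "cloud-logs/ufs-1/2016-09-01/cloud-output.log.gz",
        new File("/var/log/eucalyptus/cloud-output.log.gz"));
  }
}
```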
- upgrades
  - complexity increases as we add persistence mechanisms
  - swf upgrades (versioning of workers / activities)
- configuration
  - distribution, particularly for non-Java components
  - transactional updates for consistency of multiple properties (see the sketch below)
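If properties were held in a coordination service such as ZooKeeper, related properties could be updated atomically; a sketch using ZooKeeper's multi operation, assuming (hypothetically) that each property is stored as a znode:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooKeeper;

public class TransactionalConfigExample {
  public static void main(String[] args) throws Exception {
    // Placeholder ensemble; no-op watcher for brevity.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181", 30000, event -> { });

    // Both property updates apply together or not at all, so readers never see
    // an inconsistent mix of old and new values.
    zk.multi(Arrays.asList(
        Op.setData("/eucalyptus/config/dns.enabled",
            "true".getBytes(StandardCharsets.UTF_8), -1),
        Op.setData("/eucalyptus/config/dns.domain",
            "cloud.example.com".getBytes(StandardCharsets.UTF_8), -1)));

    zk.close();
  }
}
```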