Amazon ECS: A Nightmarish Experience
Original author: Bilal Aslam. Translated by Li Zhe. Summary: Today Bilal Aslam, co-founder and Chief Product Officer of Appuri, shares the painful experience he and his team had with Amazon ECS.
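At Appuri, our ETL pipeline, API, and UI are made up of a large number of small, single-purpose services. We started with one large monolithic repository and gradually moved to a microservices model. This was not out of ideological bias but because it fits the way we work. There are pros and cons, but on the whole microservices have worked well for us. Today, though, we are not here to discuss microservices. We want to tell you about our painful experience with the Amazon EC2 Elastic Container Service (ECS), and how we pulled back from the brink by switching to Kubernetes.
Disclaimer: on the whole, we like AWS products a lot. Also, every company's experience with ECS is different. Segment, for example, has had a very happy experience with ECS, with none of the complaints we have.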
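I am very fond of managed services. For example, we do not run our own Postgres server; we use Amazon RDS. We do not run our own hypervisor or bare-metal servers either; we use Amazon EC2. Ideally, you buy a managed service from a provider so that you can focus on creating more differentiated value, and everyone wins. In fact, this has been our experience with many managed service providers.
In June 2015, we started looking into a PaaS to deploy our services. I wanted a managed service built around Docker while keeping a degree of control. As an AWS customer, we considered Amazon Elastic Beanstalk and the brand-new Amazon EC2 ECS.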
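The advantages of Amazon ECS were:
- Docker containers can be launched quickly and easily.
- ECS is aware of multiple Availability Zones.
- Rolling deploys are supported, which makes zero-downtime deployment possible.
- API clients. Every AWS service has API clients for all the languages we use.
- ECS works with clusters of ordinary EC2 instances. We do not have to learn a new PaaS; we only need to install the ECS agent on any EC2 instance running Amazon Linux and have it join an ECS cluster.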
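In the end, these problems made us give up on ECS:
- Lack of attention to GitHub issues. Many feature and customer requests filed as GitHub issues received no attention from the Amazon ECS team.
- Poor architecture. ECS lacks many of the basic elements that modern deployment and operations infrastructure needs.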
And so, it begins, with a love of managed services
I love managed services. For example, we don't run our own Postgres server - we use Amazon RDS. We also don't run our own hypervisor or bare metal servers, we use Amazon EC2. With managed services, you trade control for peace of mind and, in an ideal world, you can focus on building differentiated value add. Everyone wins. In fact, we have had exactly this experience with most managed services.
In June 2015, we started looking into a PaaS where we could deploy our services. I wanted to stay close to Docker, but maintain a degree of control. As an AWS customer, we considered Amazon Elastic Beanstalk and the shiny new Amazon EC2 Elastic Container Service (ECS). Amazon ECS fit the bill because of several promises:
- With ECS, you simply launch Docker containers.
- ECS is aware of multiple availability zones (AZs). As long as EC2 instances are set up in multiple AZs, ECS will try to distribute containers to maintain high availability.
- You can do rolling deploys. Neato, deployments with zero downtime!
- API clients. All AWS services have (sadly auto-generated) API clients for all languages we use.
- ECS works with vanilla EC2 instances. This is a nice plus, as we don't have to learn a new PaaS - just install the ECS agent on any plain old EC2 instance running Amazon Linux and have it join an ECS cluster.
First impression: wow, it's missing a LOT of stuff.
My first impression on seeing an ECS demo was how much it was missing. We use a lot of AWS services and are well aware of how Amazon releases incremental updates. That's all good, we do that, too. However, it was sad to see that these key features were missing:
- No service discovery. In ECS, the recommended way to do service discovery is to use internal load balancers. This is actually a bigger deal because using an internal ELB is the only way you can run a service in ECS that is network-accessible; even with a single instance you HAVE to run an ELB for the service to be discoverable -- for a microservice architecture this adds cost with every service you deploy despite having no additional hardware.
- No central config. ECS doesn't have a way to pass configuration to services (i.e. Docker containers) other than with environment variables. Great, how do I pass the same environment variable to every service? Copy and paste it. We considered setting up Consul, but instead decided to stick with native ECS environment variables to start using the service.
- Mediocre CLI. Compared to competitors like Kubernetes, ECS has a mediocre CLI at best. You can scale from the command line (aws ecs update-service --desired-count N) but the ECS CLI is just not very powerful.
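One way to soften the copy-and-paste problem from the central-config item above is to keep the shared settings in a single place and generate each container definition from them. The following is only a minimal sketch of that idea, not something the original setup is known to have used. The helper names, service names, image tags, and variable values are all hypothetical; the one thing taken from ECS itself is that a container definition carries its environment as a list of name/value pairs.

```python
def ecs_environment(shared, overrides=None):
    """Merge shared and per-service settings into ECS's name/value list format.

    Per-service overrides win over shared values. Keys are sorted so the
    generated definition is stable from one run to the next.
    """
    merged = dict(shared)
    merged.update(overrides or {})
    return [{"name": key, "value": str(value)} for key, value in sorted(merged.items())]


def container_definition(name, image, shared, overrides=None, memory=256):
    """Build one container definition dict (hypothetical helper)."""
    return {
        "name": name,
        "image": image,
        "memory": memory,
        "essential": True,
        "environment": ecs_environment(shared, overrides),
    }


# Hypothetical shared settings, defined once instead of pasted into every service.
SHARED = {"LOG_LEVEL": "info", "STATSD_HOST": "statsd.internal"}

api = container_definition("api", "example/api:1", SHARED, {"PORT": 8080})
etl = container_definition("etl", "example/etl:1", SHARED, {"LOG_LEVEL": "debug"})
```

Each resulting dict could then be passed as an entry of containerDefinitions when registering a task definition, for example through boto3's register_task_definition. This only removes the duplication at definition time. The values still travel as plain environment variables, so it does nothing to protect secrets.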
I have made a huge mistake
Our first "oh crap" moment with ECS in production was when we noticed that it was leaking environment variables to CloudTrail, and on to DataDog and other third party services that consume CloudTrail events and logs. ECS, like a good AWS citizen, logs events to CloudTrail. When you start a new service, it logs the service definition including environment variables to CloudTrail! We opened a forum post and the response from the team wasn't on target. Apparently they don't believe in treating environment variables as sensitive quantities. Now, we could have built yet more infrastructure to encrypt secrets using Amazon Key Management Service (KMS) and decrypt them at service start - in fact, this is exactly what Convox does. But why would we build this infrastructure when there was so much more interesting work in our domain to do?
What killed ECS for us
We ran ECS in production for nearly a year. In that time, we watched every single feature announcement, participated in opening GitHub issues and so on. Finally, we gave up on ECS when two issues remained unaddressed:
- ECS agent disconnects periodically, making it impossible to launch new containers. Recall that ECS works by installing an agent on every EC2 instance that's part of an ECS cluster. This agent interacts with the Amazon API as well as Docker. This agent has a horrible tendency to disconnect, and when this happens your deployments will fail - this kills your services. This problem is tracked in this GitHub issue and despite it being a closed issue, we have seen it happen repeatedly. It happens at least twice a day on our clusters and despite our best efforts, we haven't been able to nail the root cause. To my knowledge, it remains unaddressed by the ECS team.
- Lack of traction on GitHub issues. This issue is an example of how many features and customer requests remain unaddressed. This issue has been the most commented feature request for a year and remains unaddressed. Incidentally, we hit this issue as well.
- Bad architecture. I expect modern deployment and operations infrastructure to support 12 factor apps in a meaningful, robust way. ECS simply lacks the fundamentals.
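The agent disconnects described in the first item above can at least be detected from outside the cluster, because the ECS API reports an agentConnected flag for every container instance. The sketch below assumes a boto3-style ECS client is passed in. The function name and the cluster name in the usage comment are hypothetical, and pagination is ignored for brevity, so it only covers clusters small enough to fit in a single response page.

```python
def disconnected_agents(ecs, cluster):
    """Return the EC2 instance IDs in `cluster` whose ECS agent is not connected.

    `ecs` is expected to behave like boto3.client("ecs"). Only the first page
    of container instances is inspected (an assumption made to keep this short).
    """
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return []
    described = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns
    )
    return [
        instance["ec2InstanceId"]
        for instance in described["containerInstances"]
        # Treat a missing flag as disconnected so nothing is silently skipped.
        if not instance.get("agentConnected", False)
    ]


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   print(disconnected_agents(boto3.client("ecs"), "production"))
```

A check like this only tells you which instances to look at (for example, to restart the agent). It does not address the root cause, which the post says was never found.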
Adios ECS, hello Kubernetes
After much grumbling at ECS, we decided to try out Kubernetes (k8s). Having flipped the switch in production two weeks ago, we are delighted. It seems that the contributors to this open source project really thought through deployments and operations at scale. From the CLI to service discovery and configuration management, it has been a pleasure to use. We ran into an odd issue with kube-proxy not routing traffic correctly, but a restart fixed the issue and it hasn't cropped up since. We haven't looked back!