Amazon ECS: A Nightmarish Experience - T媒体



2016-11-22 13:50


Original author: Bilal Aslam

Translated and edited by: 李哲

Editor's note: Today, Bilal Aslam, co-founder and Chief Product Officer of Appuri, shares the painful experience he and his team had with Amazon ECS.


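At Appuri, our ETL pipeline, API, and UI are made up of a large number of small, single-purpose services. We started with large, monolithic repositories and gradually moved to a microservices pattern, not out of any philosophical bias but because it fit the way we work. It has its pros and cons, but by and large microservices have worked well for us. We are not here to debate microservices, though. We are here to tell you about our painful experience with Amazon EC2 Elastic Container Service (ECS), and how we pulled back from the brink by moving to Kubernetes.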

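Note: in general, we love AWS products. Also, every company's experience with ECS is different. Segment, for example, had a very pleasant experience with ECS and none of our complaints.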

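I am very fond of managed services. For example, we don't run our own Postgres server; we use Amazon RDS. We don't run our own hypervisor or bare-metal servers either; we use Amazon EC2. Ideally, you buy a managed service from a provider so that you can focus on building differentiated value, and everyone wins. In fact, this has been our experience with many managed service providers.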

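In June 2015, we started looking for a PaaS on which to deploy our services. I wanted a managed service built around Docker that still left us a degree of control. As AWS customers, we considered Amazon Elastic Beanstalk and the brand-new Amazon EC2 ECS.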

Amazon ECS appealed to us for these reasons:

Docker containers can be launched quickly and easily.

ECS is aware of multiple Availability Zones.

Rolling deploys are supported, which gives true zero-downtime deployments.

API clients. All AWS services have API clients for every language we use.

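ECS works with clusters of EC2 instances. We don't have to learn a new PaaS; we just install the ECS agent on any EC2 instance running Amazon Linux and have it join an ECS cluster.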

First impression

Our first impression on seeing an ECS demo was that many key features were missing:

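No service discovery. In ECS, the substitute for service discovery is an internal load balancer. This is the only way to run a network-accessible service in ECS: even with a single instance, you have to run an ELB. For a microservice architecture, that adds cost with every service you deploy.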

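No central config. ECS has no way to pass configuration to services (i.e. Docker containers) other than environment variables. So how do I pass the same environment variable to every service? Copy and paste.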

Mediocre CLI. Compared with competitors such as Kubernetes, the ECS CLI is mediocre. You can scale from the command line (aws ecs update-service --desired-count N), but the ECS CLI is not very powerful.

Despite these missing core features, we decided to go ahead with ECS.

The moment we regretted it

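Our moment of regret came when we discovered that environment variables were being leaked to CloudTrail, and on to other third-party services that consume CloudTrail events and logs.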

We posted on the forum, and the ECS team's reply missed the point. Apparently they do not consider environment variables to be sensitive.

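We could have built more infrastructure to encrypt secrets with Amazon Key Management Service (KMS) and decrypt them at service start. In fact, this is exactly what Convox does. But why build that infrastructure when there was so much more interesting work to do in our own domain?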

The moment that broke us

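We used ECS for nearly a year. In that time we followed every feature release, actively opened GitHub issues, and so on. In the end, we gave up on ECS for the following reasons: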

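The ECS agent disconnects frequently, making it impossible to launch new containers. ECS installs an agent on every EC2 instance, and the agent talks to the Amazon API and to Docker. This agent disconnects often, and when it does, deployments fail, which is fatal for our services. The problem is tracked in a GitHub issue that has been closed, yet it keeps happening. On our clusters it occurs at least twice a day, and despite our best efforts we have not found the root cause. As far as I know, the ECS team has still not resolved it.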

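A search of our Slack history shows only a small fraction of the times we hit this problem. It happened so often that we resorted to restarting the agents regularly to work around it.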

When you have to restart a service every hour to work around its bugs, you know you are going crazy.

Lack of attention to GitHub issues. Many feature and customer requests filed as GitHub issues have received no attention from the Amazon ECS team.

Bad architecture. ECS lacks many of the fundamentals that modern deployment and operations infrastructure requires.

Goodbye ECS, hello Kubernetes

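After much grumbling about ECS, we decided to try Kubernetes (k8s). We switched production over two weeks ago, and we are delighted. This open source project is well suited to deployment and operations at scale. From the CLI to service discovery and configuration management, it is a pleasure to use. We did run into an odd issue where kube-proxy was not routing traffic correctly, but a restart fixed it and it has not recurred. So far, we have no regrets about the choice.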

English original:

Here at Appuri, we have a large number of small, single-purpose services that make up our ETL pipeline, API and UI. We started from large, monolithic repos and gradually migrated to this microservices pattern, not because of any philosophical bias but because it fit our work style. By and large, this has worked well with all the known pros and cons of microservices. But I'm not here to debate microservices. I'm here to tell you about our nightmare on Amazon EC2 Elastic Container Service (ECS) and how we saved ourselves by moving to Kubernetes.

NOTE: In general, we love AWS. Also, your mileage with ECS may vary. For example, Segment had a great experience with ECS and apparently none of our complaints.

There's also the wonderful Convox project which contains a lot of great workflows on top of ECS. When we started using ECS, Convox wasn't far enough along to meet our needs.

And so, it begins, with a love of managed services

I love managed services. For example, we don't run our own Postgres server - we use Amazon RDS. We also don't run our own hypervisor or bare metal servers, we use Amazon EC2. With managed services, you trade control for peace of mind and, in an ideal world, you can focus on building differentiated value add. Everyone wins. In fact, we have had exactly this experience with most managed services.

In June 2015, we started looking into a PaaS where we could deploy our services. I wanted to stay close to Docker, but maintain a degree of control. As an AWS customer, we considered Amazon Elastic Beanstalk and the shiny new Amazon EC2 Elastic Container Service (ECS).

Amazon ECS fit the bill because of several promises:

With ECS, you simply launch Docker containers.

ECS is aware of multiple availability zones (AZs). As long as EC2 instances are set up in multiple AZs, ECS will try to distribute containers to maintain high availability.

You can do rolling deploys. Neato, deployments with zero downtime!

API clients. All AWS services have (sadly auto-generated) API clients for all languages we use.

ECS works with vanilla EC2 instances. This is a nice plus, as we don't have to learn a new PaaS - just install the ECS agent on any plain old EC2 instance running Amazon Linux and have it join an ECS cluster.

First impression: wow, it's missing a LOT of stuff.

My first impression on seeing an ECS demo was how much it was missing. We use a lot of AWS services and are well aware of how Amazon releases incremental updates. That's all good, we do that, too. However, it was sad to see that these key features were missing:

No service discovery. In ECS, the recommended way to do service discovery is to use internal load balancers. This is actually a bigger deal because using an internal ELB is the only way you can run a service in ECS that is network-accessible; even with a single instance you HAVE to run an ELB for the service to be discoverable -- for a microservice architecture this adds cost with every service you deploy despite having no additional hardware.

No central config. ECS doesn't have a way to pass configuration to services (i.e. Docker containers) other than with environment variables. Great, how do I pass the same environment variable to every service? Copy and paste it. We considered setting up Consul, but instead decided to stick with native ECS environment variables to start using the service.

Mediocre CLI. Compared to competitors like Kubernetes, ECS has a mediocre CLI at best. You can scale from the command line (aws ecs update-service --desired-count N) but the ECS CLI is just not very powerful.
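The copy-and-paste problem with environment variables described above can be softened with a small script that keeps shared settings in one place and stamps them into every service's definition. The following is only a minimal Python sketch of that idea, not something the original post describes: the function name build_environment, the variable names, and the sample settings are all hypothetical. The one thing taken from ECS itself is the list-of-name/value-pairs shape that a container definition uses for its environment field.

```python
from typing import Dict, List


def build_environment(shared: Dict[str, str],
                      overrides: Dict[str, str]) -> List[Dict[str, str]]:
    """Merge shared settings with per-service overrides.

    Returns the list-of-pairs shape that an ECS container definition
    uses for its "environment" field.
    """
    merged = {**shared, **overrides}  # per-service values win on conflict
    return [{"name": k, "value": v} for k, v in sorted(merged.items())]


# Hypothetical settings, for illustration only.
SHARED = {"LOG_LEVEL": "info", "STATSD_HOST": "statsd.internal"}
SERVICES = {
    "api": {"PORT": "8080"},
    "etl-worker": {"LOG_LEVEL": "debug"},
}

# One environment list per service, each built from the same shared base.
environments = {
    name: build_environment(SHARED, overrides)
    for name, overrides in SERVICES.items()
}
```

This keeps a single source of truth for shared values, but each service still has to be re-registered whenever a shared value changes, which is part of why a real central config store is more attractive.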

Despite these missing features, we decided to move ahead.

I have made a huge mistake

Our first "oh crap" moment with ECS in production was when we noticed that it was leaking environment variables to CloudTrail, and on to DataDog and other third party services that consume CloudTrail events and logs. ECS, like a good AWS citizen, logs events to CloudTrail. When you start a new service, it logs the service definition including environment variables to CloudTrail!
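To make the shape of the leak concrete, here is a minimal Python sketch that masks environment values in a task definition before it is handed to any logging or monitoring tool. It is an illustration under assumptions, not a fix for the behavior described above: it cannot change what ECS itself records in CloudTrail. The function name redact_environment and the sample definition are hypothetical; the only thing assumed about ECS is that each entry in containerDefinitions may carry an environment list of name/value pairs.

```python
import copy
from typing import Any, Dict


def redact_environment(task_definition: Dict[str, Any],
                       mask: str = "***") -> Dict[str, Any]:
    """Return a copy of a task definition with every env value masked.

    Variable names are kept so the output is still useful for debugging;
    the input dictionary is left untouched.
    """
    redacted = copy.deepcopy(task_definition)
    for container in redacted.get("containerDefinitions", []):
        for pair in container.get("environment", []):
            pair["value"] = mask
    return redacted


# Hypothetical task definition, for illustration only.
sample_definition = {
    "family": "api",
    "containerDefinitions": [
        {
            "name": "api",
            "image": "example/api:latest",
            "environment": [
                {"name": "DB_PASSWORD", "value": "hunter2"},
                {"name": "LOG_LEVEL", "value": "info"},
            ],
        }
    ],
}

safe_to_log = redact_environment(sample_definition)
```

Masking everything, rather than guessing which variables are secret, is the conservative choice here: a name-based allowlist can be added later if some values are needed in the logs.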

We opened a forum post and the response from the team wasn't on target. Apparently they don't believe in treating environment variables as sensitive quantities.

Now, we could have built yet more infrastructure to encrypt secrets using Amazon Key Management Service (KMS) and decrypt them at service start - in fact, this is exactly what Convox does. But why would we build this infrastructure when there was so much more interesting work in our domain to do?

What killed ECS for us

We ran ECS in production for nearly a year. In that time, we watched every single feature announcement, participated in opening GitHub issues and so on. Finally, we gave up on ECS when two issues remained unaddressed:

ECS agent disconnects periodically, making it impossible to launch new containers. Recall that ECS works by installing an agent on every EC2 instance that's part of an ECS cluster. This agent interacts with the Amazon API as well as Docker. This agent has a horrible tendency to disconnect, and when this happens your deployments will fail - this kills your services. This problem is tracked in this GitHub issue and despite it being a closed issue, we have seen it happen repeatedly. It happens at least twice a day on our clusters and despite our best efforts, we haven't been able to nail the root cause. To my knowledge, it remains unaddressed by the ECS team.

This is a Slack search results view of just some of the times we've seen this problem happen. This problem became so pervasive that we started restarting agents periodically to get around the failure:

You know you're going crazy when you restart a service every hour to fix its bugs.

Lack of traction on GitHub issues. This issue is an example of how many features and customer requests remain unaddressed. This issue is the most commented feature for a year and remains unaddressed. Incidentally, we hit this issue as well.

Bad architecture. I expect modern deployment and operations infrastructure to support 12 factor apps in a meaningful, robust way. ECS simply lacks the fundamentals.
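For the agent disconnects described above, a less blunt workaround than restarting every agent on a timer would be to restart only the ones that report as disconnected. The sketch below, in Python, shows just the detection step and is an assumption-laden illustration rather than what we actually ran. It works on a dictionary shaped like the response of the ECS DescribeContainerInstances API, where each container instance carries an ec2InstanceId and an agentConnected flag. The function name disconnected_instances and the sample data are hypothetical, and fetching the real response (for example with an AWS SDK) is left out so that the block runs on its own.

```python
from typing import Any, Dict, List


def disconnected_instances(response: Dict[str, Any]) -> List[str]:
    """Return EC2 instance IDs whose ECS agent reports as disconnected.

    An instance with no agentConnected field is treated as disconnected.
    """
    return [
        inst["ec2InstanceId"]
        for inst in response.get("containerInstances", [])
        if not inst.get("agentConnected", False)
    ]


# Hand-written sample in the shape of a DescribeContainerInstances response.
sample_response = {
    "containerInstances": [
        {"ec2InstanceId": "i-0aaa", "agentConnected": True},
        {"ec2InstanceId": "i-0bbb", "agentConnected": False},
    ]
}

needs_restart = disconnected_instances(sample_response)
print(needs_restart)
```

The resulting list could then feed whatever mechanism restarts the agent on those hosts, and doubles as a simple metric for how often the problem occurs.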

Adios ECS, hello Kubernetes

After much grumbling at ECS, we decided to try out Kubernetes (k8s). Having flipped the switch in production two weeks ago, we are delighted. It seems that the contributors to this open source project really thought through deployments and operations at scale. From the CLI to service discovery and configuration management, it has been a pleasure to use. We ran into an odd issue with kube-proxy not routing traffic correctly, but a restart fixed the issue and it hasn't cropped up since. We haven't looked back!
