Hi, my name is Arthur de Haan and I am responsible for Test and System Engineering in Windows Live. To kick things off, I’d like to give you a look behind the scenes at Hotmail, and tell you more about what it takes to build, deploy and run the Windows Live Hotmail service on such a massive global scale.
Hosting your mail and data (and our own data!) on our servers is a big responsibility, and we take quality, performance, and reliability very seriously. We make significant investments in engineering and infrastructure to help keep Hotmail up and running 24 hours a day, day in and day out, year after year. You will rarely hear about these efforts – you will only read about them on the rare occasion that something goes wrong and our service runs into an issue.
Hotmail is a gigantic service in all dimensions.
You can imagine that the Hotmail user interface you see in the browser is only the tip of the iceberg – a lot of innovation happens beneath the surface. In this post I will give a high-level overview of how the system is architected. We will do deeper dives into some specific features in later posts.
Architecture
Hotmail and our other Windows Live services are hosted in multiple datacenters around the world. Our Hotmail service is organized in logical “scale units,” or clusters. Furthermore, Hotmail has infrastructure that is shared between the clusters in each datacenter.
A cluster hosts millions of users (how many depends on the age of the hardware) and is a self-contained set of servers.
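To make the scale-unit idea concrete, here is a minimal sketch of how a directory service might place each account on a home cluster and route every later request there. The names, classes, and placement policy are hypothetical illustrations, not our actual code:

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    name: str
    datacenter: str
    capacity: int                       # varies with the hardware generation
    accounts: set = field(default_factory=set)

    def has_room(self) -> bool:
        return len(self.accounts) < self.capacity

class UserDirectory:
    """Maps each account to exactly one self-contained cluster, so all
    requests for that account are served within a single scale unit."""

    def __init__(self, clusters):
        self.clusters = clusters
        self.home = {}                  # account -> cluster name

    def provision(self, account: str) -> str:
        # Place new accounts on the first cluster with spare capacity.
        for cluster in self.clusters:
            if cluster.has_room():
                cluster.accounts.add(account)
                self.home[account] = cluster.name
                return cluster.name
        raise RuntimeError("all clusters full: time to deploy a new scale unit")

    def locate(self, account: str) -> str:
        return self.home[account]       # every request starts with this lookup
```

Because each cluster is self-contained, growing the service is largely a matter of standing up another scale unit and pointing the directory at it.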
Preventing outages and data loss is our top priority, and we take the utmost care to keep them from happening. We’ve designed our service to handle failure – our assumption is that anything that can fail will do so eventually. We do have hardware failures: with hundreds of thousands of hard drives in use, some are bound to fail. Fortunately, because of the architecture and failure management processes we have in place, customers rarely experience any impact from these failures.
We use a number of techniques to keep these failures contained; one of them is illustrated below.
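As one illustration of this design-for-failure approach, here is a minimal sketch of a read path that fails over across replicas. It assumes, as most large mail systems do, that each mailbox is stored on several servers; the classes and API are invented for illustration:

```python
import random

class StorageServer:
    """Stub for a backend storage server; real ones would do network I/O."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def fetch(self, account):
        if not self.healthy:
            raise ConnectionError(self.name)
        return f"<mailbox data for {account} from {self.name}>"

def read_mailbox(account, replicas):
    """Try each replica holding this account's data until one answers,
    so a single failed disk or server stays invisible to the customer."""
    for server in random.sample(replicas, len(replicas)):   # spread read load
        try:
            return server.fetch(account)
        except (ConnectionError, TimeoutError):
            continue    # flag the bad replica for repair, then try the next copy
    raise RuntimeError(f"all {len(replicas)} replicas failed for {account}")

# Even with one server down, the read succeeds:
replicas = [StorageServer("s1", healthy=False), StorageServer("s2"), StorageServer("s3")]
print(read_mailbox("alice@example.com", replicas))
```

In a design like this, a dead drive means a background repair job rather than a customer-visible outage.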
Engineering process
I’ve talked a little bit about our architecture and the steps we take to ensure uninterrupted service. No service is static, however; in addition to growth in usage, we push out updates on a regular basis. So our engineering processes are just as important as our architecture in providing you with a great service. From patches to minor updates to major releases, we take a lot of precautions during our development and rollout process.
Testing and deployment – For every developer on our staff, we have a test engineer who works hand in hand with him or her to give input on designs and specs, set up test infrastructure, write and automate test cases for new features, and measure quality. When we talk about quality, we mean it in the broadest definition of the word: not just stability and reliability, but also ease of use, performance, security, accessibility (for customers with disabilities), privacy, scalability, and functionality in all the browsers and clients that we support, worldwide. Given our scale, this is no easy feat.
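To give a feel for what an automated test case looks like, here is a hypothetical send-and-receive round trip of the kind a test engineer might automate. The mail client below is an in-memory stand-in invented for this sketch; a real test would drive a lab cluster over the network:

```python
import unittest
import uuid
from types import SimpleNamespace

class FakeMailClient:
    """In-memory stand-in for a real mail API (hypothetical)."""
    def __init__(self, cluster):
        self.cluster = cluster
        self.inbox = []

    def send(self, to, subject, body):
        self.inbox.append(SimpleNamespace(to=to, subject=subject, body=body))

    def wait_for(self, subject, timeout=60):
        for msg in self.inbox:
            if msg.subject == subject:
                return msg
        raise TimeoutError(f"no message with subject {subject!r} after {timeout}s")

class SendReceiveRoundTrip(unittest.TestCase):
    def test_message_arrives_intact(self):
        client = FakeMailClient(cluster="test-lab-01")
        token = uuid.uuid4().hex    # unique marker so parallel runs don't collide
        client.send(to="probe@example.test", subject=token, body="ping")
        self.assertEqual(client.wait_for(subject=token).body, "ping")

if __name__ == "__main__":
    unittest.main()
```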
And because we’re a free service funded largely by advertising, we need to be highly efficient operationally. So deployment, configuration, and maintenance of our systems are highly automated. Automation also reduces the risk of human error.
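A simple way to see why automation cuts down on human error: configuration is declared once and applied identically everywhere, rather than typed by hand on each machine. The settings and the apply step below are made up for illustration:

```python
# Desired state, declared once (setting names are invented for this sketch).
DESIRED_CONFIG = {
    "smtp_max_connections": 5000,
    "tls_required": True,
    "log_level": "warning",
}

def apply_config(server, desired=DESIRED_CONFIG):
    """Idempotent: only touches settings that differ, and reports the drift."""
    drift = {k: v for k, v in desired.items() if server.get(k) != v}
    server.update(drift)
    return drift        # an empty dict means the server was already compliant

# The same routine runs unmodified on every server in the fleet:
fleet = [
    {"smtp_max_connections": 5000},                  # partially configured
    {"tls_required": False, "log_level": "debug"},   # drifted
]
for server in fleet:
    print(apply_config(server))
```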
Code deployment and change management – We have thousands of servers in our test lab where we deploy and test code well before it goes live to our customers. In the datacenter, we have some clusters reserved for testing “dogfood” and beta versions in the final stages of a project. We test every change in our labs, be it a code update, hardware change, or security patch, before deploying it to customers.
After all the engineering teams have signed off on a release (including Test and System Engineering), we start gradually upgrading the clusters in the datacenter to push the changes out to customers worldwide. Typically we do this over a period of a few months – not only because it takes time to perform the upgrades without causing downtime for customers, but also because it allows us to watch and make sure there is no loss of quality or performance.
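In pseudocode terms, a staged rollout of this kind might look like the sketch below. The wave sizes, soak time, and health check are illustrative assumptions, not our actual tooling:

```python
import time

def rollout(clusters, upgrade, healthy, soak_seconds=3600):
    """Upgrade a growing fraction of clusters per wave; halt on any regression."""
    waves = [0.01, 0.05, 0.25, 1.0]     # a small canary first, then wider
    done = 0
    for fraction in waves:
        # Each wave covers at least one new cluster, never more than the fleet.
        target = min(len(clusters), max(done + 1, int(len(clusters) * fraction)))
        for cluster in clusters[done:target]:
            upgrade(cluster)
        done = target
        time.sleep(soak_seconds)        # let monitoring accumulate real traffic
        if not all(healthy(c) for c in clusters[:done]):
            raise RuntimeError(f"regression detected after {done} clusters; halting rollout")
```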
We can also turn individual features on or off. Sometimes we deploy updates but postpone turning them on, and in rare cases we have temporarily turned features off, say for security or performance reasons.
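Conceptually, the on/off switch is a feature flag: the code for a feature ships everywhere, and a central setting decides where it actually runs. A minimal sketch, with invented flag and cluster names:

```python
# Central flag store (hypothetical names; real flags would live in a config service).
FLAGS = {
    "new_calendar_ui": {"default": False, "enabled_clusters": {"dogfood-01", "beta-02"}},
}

def feature_enabled(flag: str, cluster: str) -> bool:
    entry = FLAGS.get(flag)
    if entry is None:
        return False    # unknown flags are safely off
    return cluster in entry["enabled_clusters"] or entry["default"]

print(feature_enabled("new_calendar_ui", "dogfood-01"))  # True
print(feature_enabled("new_calendar_ui", "prod-17"))     # False
```

Turning a feature off fleet-wide then becomes a configuration change rather than a redeployment, which is what makes it fast enough for security or performance emergencies.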
Conclusion
This should begin to give you a sense of the size and scope of the engineering that goes into delivering and maintaining the Hotmail service. We are committed to engineering excellence and to continuous improvement of our services for you. We continue to learn as the service grows, and we take all your feedback seriously, so do leave me a comment with your thoughts and questions. I am passionate about our services, and so are all the members of the Windows Live team – we may be engineers, but we use the services ourselves, along with hundreds of millions of our customers.
Arthur de Haan
Director, Windows Live Test and System Engineering