Episode: 366
Title: HPR0366: The Open Source Data Center
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr0366/hpr0366.mp3
Transcribed: 2025-10-07 19:07:46

---
[music]
The Utah Open Source Foundation is proud to help open source grow in Utah and the Intermountain West. Watch for upcoming announcements about our expanding regional efforts.

The following presentation, The Open Source Data Center, was given on Wednesday, May 13, 2009, by Dan Hanks at the Provo Linux Users Group. Visit their site at plug.org. The bandwidth for this and many other presentations is provided by Center7. Visit their site at center7.com.
A little bit about me, Dan Hanks. I graduated in Computer Science. I've been doing this since '98, mostly Linux and Solaris; I've tried to avoid Windows as best I could. I started out online back in 1997 with a local ISP here, with around 20 servers. That's where I cut my teeth on Linux, back in the summer of '98, when fun things and cool things were happening with Linux. Those were interesting years in school; I think my grades tended to suffer a little bit because of that.

So anyway, after that I took a little internship out in Nevada with an engineering company, doing some Windows programming for three months. That was enough to convince me I never wanted to work with Windows. So I didn't. I came back and got a job with the then NorthSky, which was a little internet startup doing web hosting. That subsequently got bought by About.com, which was subsequently bought by Primedia, which subsequently sold the web hosting division of About.com over to United Online in 2004. And in that shop we had maybe around a couple hundred servers, so we kind of stepped up an order of magnitude in that operation. Back in May 2008 (it's almost been a year now, or has been a year now) United Online closed, well, not the whole division; they closed the office we ran it from and took operations over themselves in California. And so I moved over here to Omniture. That was a move of a couple of buildings over on the campus, to where we have around 15,000 servers that I work with. I don't manage them all myself; we have an admin team that helps to manage all those. But again, another order of magnitude up. I suppose maybe next I'd have to move to Google to get the next order of magnitude, if I wanted that. Actually, on the earnings call Omniture just had, the number was 19,000 machines; if you look across all the divisions of the company, it's probably around that. So: big operations, lots of interesting challenges. At this scale you get very, very interesting problems to solve.

I have one very patient wife and four kids. As I was telling the guys earlier, family life and preparing presentations like this don't seem to mix, and so I was up until 2:30 last night trying to polish this off.

And anyway, before we get started into this, I want to ask: obviously you're here for some reason, and I want to get kind of a feel for what you're hoping to go home with tonight. I'll try to tailor what I say and what I share to what you want. So, by a raise of hands, if you have any questions, I'm open. Yeah?
[Audience] I'd be interested in the automation, how you monitor and manage everything. I mean, you're obviously not doing a walk around to monitor and manage each machine.

Right, okay. Yeah.

[Audience] I'm wondering what kind of tools you use to track and manage configuration on that many servers, because I've got a system myself that's up to 300, and that's starting to be a major pain point.

Right. Okay. Yeah.

[Audience] I have a very small operation, and I want to know what I should be doing now, to get the management side right before it gets big.

Okay.

[Audience] Did you plant that? Did you pay him to ask these questions before you started?

We'll take care of the payments afterwards, outside.

Okay, Jason? [Audience] So you've obviously built a lot of your own stuff. Now that you're here, what have you built that you'd scrap, and how much of it would have been better as something that was already open source?

Okay. So, as opposed to what I would rewrite from scratch, what would I use that's already out there? What would I still write from scratch? Probably as little as possible. I'd look at, you know, what's already open, where I think the tools will just work. Okay, I'll answer that question up front: most of the pieces are out there. That's the interesting thing. Although what you do run into a lot is this: a lot of tools are out there, and they're built to manage maybe a few hundred servers, or in some cases even up to a few thousand. But when you get to that next order of magnitude, the tools sometimes start to break down; they're not built for that kind of scale. And you kind of have to hack them, or cobble them together, federate them, shard them, whatever the case may be. So a lot of it is the glue: gluing all this stuff together is where a lot of the effort goes. You know, a lot of it's PHP, a lot of shell and Python, and I'm trying to get some Perl in. Any other questions?

All right, then we'll move forward. Good questions.
And granted, I've only been at Omniture for a year. A year's exposure to 15,000 machines is not equivalent to ten years' exposure to 15,000 machines. So I'm still learning; there are still a lot of people around there I'm learning from. But what I have, I'll share freely, what I can share.

So again: the open source data center, plugging open source software into the patterns of data center operations.
How did this all start? Okay. Back at United Online, we went through a number of painful and tedious data center migrations. Every now and then we'd have to, you know, pick up the shop and move it somewhere else. And every time we did that, it was kind of a mixed bag, right? It was always a pain in the rear, because you have to move things and try to do it without ever going offline, or at least minimize your outages. But at the same time, it's a good opportunity, because each time you go to the new place, you kind of get a chance to rebuild, and so you can fix all the mistakes that always tend to creep into systems of this size over time. As we went through each of these moves, and each of these build-outs and acquisitions, that kind of stuff, I began to see certain patterns: a basic set of operations, or a basic set of services, that were needed to run, you know, a data center operation. In our case it was a web hosting operation, but it could, you know, be some sort of web application or something. Anyway, there was a bunch of patterns that I began to see, and I kind of started making a mental list of them. That's what this presentation grew out of: that mental list. And then we'll talk about the different open source software that's out there to be able to plug into these patterns.

Let's see. So yeah, too many data center migrations and build-outs. What was interesting to me is that when I went from United Online over to Omniture, I saw a lot of those same patterns. It was really gratifying: okay, this is all making sense, right? Even going from a smaller scale to a larger scale, there's a lot of the same patterns going on. Obviously there are things you have to do differently at this kind of scale, but there's a lot of the same kind of patterns. At United Online the scale was a little bit smaller, at least in our division; United Online itself was actually quite a large operation, several thousand machines there, but similar patterns, just at a larger scale.
So that's what we're focusing on tonight: those patterns of what you do in these kinds of scaled operations. There are roughly three areas we could talk about. There's the physical infrastructure: hardware, switches, racks, cooling, all that kind of stuff. There's your actual application, whatever lives on top of that. And there's this thing in the middle, the operations infrastructure; that's what we're going to focus on tonight. The stuff in the middle that glues all your components together and provides the foundation so that you can run your application on top of it.

In my dreams, it would be this easy to build out your data center. You know, you fire up a little Perl script (make sure you use warnings and use strict, so you're doing well), and you have this nice little data center module, and you tell it: build me a web application, with some infrastructure for that. You instantiate your data center, you say it lives in San Jose, you use such-and-such a vendor, and it just, you know, assumes a bunch of really good defaults. Maybe it could be even easier than that; we wouldn't have to specify anything, because there are a lot of good defaults. If you've seen Damian Conway's talk about sufficiently advanced technologies, it's based around the quote that any sufficiently advanced technology is indistinguishable from magic, right? So if your code is advanced enough to use some really good, sensible defaults, you can simplify things quite a bit. But until we get to that point (and I'll pick this back up at the end of the presentation, because we're getting closer with all this cloud stuff), until we get to that point, we're going to use some patterns. And these are the patterns that we'll talk about tonight.

If anyone could sneak out and get me a glass of water, so I don't sound like a frog; I'm starting to get dry mouth.
Okay, so, on to the patterns. Pattern number one: system imaging and provisioning. Essentially this is: once your bare metal is installed in the rack, powered up, and connected to your network, how do you then lay down the operating system and the basic configuration, for one machine, for 10 machines, for 1,000 machines, for 20,000 machines? How do you do that, right? What tools are out there in the open source world to help us do that? So again: your operating system, some applications (I'm obviously a Postgres bigot). Whether it's a handful of machines or lots of machines, how do you do that?
So in the Red Hat world, you have something like Kickstart. How many of you are familiar with Kickstart? Okay. Kickstart is a system used by Red Hat where you specify a bunch of basic stuff about your machine: these are the partitions I'm going to have, this is the set of RPM packages I want to install. (Thank you. I'll try not to spill that on the keyboard.) So in the Kickstart file you specify what packages you're going to install, what your network configuration should look like, and any other scripting and stuff that you want to run. This file gets fed to the machine as it's booting up, and the machine follows the instructions: lays down the operating system, installs the packages, and does any configuration that you want to do. So that's really helpful in the Red Hat world; the same goes for CentOS, being a derivative, and all the Red Hat derivatives. I've even seen Kickstart for Ubuntu.

[Audience] Doesn't Ubuntu use preseed? -- Debian has preseed; I haven't followed it in a little while, but Ubuntu pretty much has the same functionality as Kickstart, and they call it Kickstart as well. All right, so: Kickstart for Ubuntu.
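To make that concrete, a minimal Kickstart file looks roughly like this. This is an illustrative sketch only; the mirror URL, partition sizes, and package set are invented placeholders, not anything from the talk:

```conf
# ks.cfg (hypothetical example)
install
url --url http://mirror.example.com/centos/5/os/i386
lang en_US.UTF-8
keyboard us
network --bootproto dhcp
rootpw --iscrypted $1$XXXXXXXX          # placeholder hash
clearpart --all --initlabel
part /boot --fstype ext3 --size 100
part swap  --size 2048
part /     --fstype ext3 --size 1 --grow
reboot

%packages
@core
httpd

%post
# arbitrary post-install scripting runs here, e.g. bootstrapping
# your configuration-management agent
chkconfig httpd on
```

The %post section is where many shops hook in the next pattern, bootstrapping whatever configuration management tool will own the machine afterwards.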
There's another one out there called FAI, for Debian. I haven't used that in production environments, so I can't say too much about it, but it's out there, so be aware of it. There's AutoYaST for SUSE, and Jumpstart for Solaris. There's a system out there called Rocks that's tailored for high-performance computing clusters; it pretty much allows you to lay out an entire cluster of machines, all with the same kind of image. There's another one called SystemImager that works more on a golden-image sort of model: you create a golden image (this is what my machine needs to look like), and then it'll burn that onto a number of different machines. One of the challenges there is maintaining your images. So that's pattern number one, and what tools are available there to
get the system out there. Now, operating at scale, there are some prerequisites for that, for the kind of stuff Kickstart does, for example: some infrastructure that needs to be in place before you can use Kickstart on a big scale. Of course, if you have one or two machines, you can go pop in a CD and kickstart that way, but when your data center is thousands of miles away, you have to do things a little bit differently. So you have things like PXE, the pre-execution environment: when a machine comes up, the BIOS makes a DHCP request, the DHCP server hands back an IP address so it can start talking on the network, and it also hands back a little bit of information about where to go next. So the machine fetches a boot image over TFTP and starts the process up, then fetches the Kickstart file and starts following those instructions. So you'll need a DHCP server; basic network services, maybe HTTP, FTP, or NFS, to serve up the distribution contents; and a DNS server so it can do some resolution. That's what you're going to need to be doing the provisioning and the kickstarting.
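The DHCP half of that handshake can be sketched as a dhcpd.conf fragment. All the addresses below are placeholders; the two lines that matter for PXE are next-server and filename:

```conf
# dhcpd.conf fragment for PXE booting (hypothetical addresses):
# the BIOS DHCPs, gets an IP plus "where to go next" (a TFTP server
# and a boot file), then pulls pxelinux and the Kickstart over the net
subnet 10.0.0.0 netmask 255.255.255.0 {
  range 10.0.0.100 10.0.0.200;
  option routers 10.0.0.1;
  option domain-name-servers 10.0.0.2;
  next-server 10.0.0.3;          # TFTP server holding the boot image
  filename "pxelinux.0";         # what the PXE ROM fetches first
}
```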
A new tool that's recently come out, which is really interesting, kind of combines a lot of that. It's called Cobbler. It combines Kickstart, DHCP, PXE, DNS, and also management of your yum repositories. A really nice tool. If I were going to start again (going back to Jason's question at the beginning: what would I keep, what would I use that's already out there), well, back at United Online I built up a server that served out all these things, and today I'd look at Cobbler really closely, because it does a lot of that for you, putting it all together. So, the challenge comes when you're operating at a large scale and you've got multiple environments. How do you serve your DHCP requests over large subnets? Do you segment? Do you have a server in each of your network segments? How do you maintain the configuration images, or your Kickstart files, across all of these different environments? So there's a challenge that you have to think
about once you start getting up to large scale. So, pattern number two: system configuration. Once your machine is up and running, how do you maintain the configuration on the machine? If you've got something like SystemImager, you have to re-image each time, and that's not necessarily a bad thing. In some cases it's faster to re-image and rebuild from scratch than to figure out what's wrong with the machine and fix it. Especially when you're dealing with thousands of machines, you just don't have time to debug every last little thing, and sometimes it's faster to pull it out of rotation, re-kickstart it, re-image it, and get it going again. So, what tools are available here? There's one called CFEngine.
The basic model of CFEngine is that you define a policy on a master server, and that policy is then spread out to a number of other machines; they read the policy and adjust their system configuration according to it. CFEngine allows you to control things like... I'm just trying to think of some of the built-ins; we don't use CFEngine where I am now, so it's been a while. But it will let you run different scripts, and you can tell it what set of packages needs to be installed on a machine. So you can use the first pattern, kickstarting, to lay down the basic OS image, get CFEngine installed (probably as part of that Kickstart process), and then use CFEngine to specify what packages should be in place and what configuration files need to be in place. Maybe I need to make a modification to my /etc/hosts file, maybe I need to throw something into the Apache config, or maybe I just need to distribute some files altogether. You can distribute files through CFEngine a little bit, but
you kind of have to bolt it on yourself; other tools will do that for you. So Puppet is one that's come out more recently. It was written by a guy named Luke Kanies (I think that's how you pronounce his last name; I'm not sure), and it was written largely in response to CFEngine. Luke worked with CFEngine for quite some time and had a lot of gripes about it; the lead developer of CFEngine didn't take him up on some of the changes, so Luke went out on his own and, using Ruby, built a system called Puppet. It's very similar, but in my opinion it feels a lot more polished, a lot easier to use. It's written in Ruby, and it uses a domain-specific language that allows you to declaratively specify what your system should look like. So in the configuration on your master, your puppetmaster, you can define: for the web servers, they need to have these users on them, they need to have these cron entries, they need to have these file permissions on this set of files. Puppet also provides a file server service, so you can create a set of basic files that need to go out to this class of machines, and when the machines run through their Puppet config, they fetch back the files served up through the puppetmaster and apply that config. And as I recall you can apply templating to this as well: you say, use this template, and then on this host, apply these rules to fill in the template according to whatever is needed on that host. So, really quite a nice system. If I were going to start again from scratch, I probably wouldn't go the CFEngine route; I'd probably do Puppet. It's a little more feature-rich, in my view.
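To give a flavor of that declarative style, here is a small hypothetical manifest. The class name, file paths, and cron entry are invented for illustration:

```puppet
# Hypothetical Puppet manifest: declare what a web server should look
# like, and let Puppet converge the machine to match.
class webserver {
  package { 'httpd':
    ensure => installed,
  }

  # Served from the puppetmaster's file server
  file { '/etc/httpd/conf/httpd.conf':
    ensure  => present,
    owner   => 'root',
    mode    => '0644',
    source  => 'puppet:///modules/webserver/httpd.conf',
    require => Package['httpd'],
  }

  service { 'httpd':
    ensure    => running,
    enable    => true,   # start on boot
    subscribe => File['/etc/httpd/conf/httpd.conf'],
  }

  cron { 'log-rotation':
    command => '/usr/sbin/logrotate /etc/logrotate.conf',
    hour    => 0,
    minute  => 15,
  }
}
```

Note that nothing here says how to install a package or restart a service; that is exactly the "what versus how" distinction discussed below.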
Another nice thing about Puppet is that it has a hardware and OS abstraction layer. One of the difficult things in working with CFEngine is that you have to do things a little bit differently in your configuration depending on whether you're on Solaris, on Mac OS, or on Linux. Puppet has this nice OS abstraction layer, so you just talk about generic things, like services: I want Apache to come up on boot; Apache needs to be running at all times. The abstraction layer then does the right thing depending on whether you're on Solaris, on Mac OS, or on Linux. So it's really nice that way; you can work in generalities: these users need to have accounts, set their passwords to this value, that kind of stuff. And there are a lot of different resource types you can configure on your system: you can put cron entries in place, you can create different configuration files, make sure different services are running, make sure they start up
at boot. There are some others that are a little less well known, like Bcfg2, developed at Argonne National Laboratory. I haven't used that, so I can't say much about it; likewise for LCFG. They're similar in design to Puppet and CFEngine, and LCFG is designed for large installations. There's a group at the LISA conference each year (Large Installation System Administration) that talks about system configuration, and the authors of these tools generally attend, that kind of stuff. Another option you have is rsync and a lot of glue: start with some kind of master server and rsync the stuff out, and maybe run some scripts. That becomes tedious after a while and can become problematic, but it is an option, depending on the size of your network and what your needs are. Another one that's just recently come out is called Chef. I don't know too much about it. It's been heavily influenced by Puppet, but for its definition language it uses Ruby itself, as opposed to the Ruby-like language used in Puppet. So, just another one to keep your eye on. Okay, so, what's that? Yeah?
So, Ryan's question, for the people in podcast land, is what people are actually doing right now. So yeah, let's open this up.

[Audience] On the systems that you manage, what recommendations have you gotten from the author, and from the community around Puppet?

I'm curious; that's something I've heard too with Puppet: you get to large scales and you start to run into problems with the puppetmaster keeping up with the load, stuff like that. Yeah: Master of Puppets.

[An audience member, partly inaudible, describes inheriting a homegrown setup, largely rsync and glue written by a contractor who has since left, and basically replacing everything with Puppet, though not all of the tooling he needs is in place yet.]
[Audience, continued] Currently I'm going to use Kickstart, and after Kickstart there's a script that gets run on each machine that does all the configuration magic. And then I'm using Cobbler.

I would love to try that out; Cobbler and Puppet and that kind of stuff. At Omniture (and Ryan can step in and tell me if I'm disclosing too much information), once the machine is up, we use a homegrown solution: we have a script that pushes configurations and files out on a mass scale, and that's worked for us. There are a lot of rough edges and dark corners I'd love to go in and clean out; my role hasn't been in that specific area at Omniture at this point in time, but there are days when I wish we had something like CFEngine or Puppet in there to bring some sanity for us. But yeah, that kind of homegrown stuff can work, and can grow to this kind of size. Still, if I were starting out again from bare metal, I'd be using one of these, probably Puppet.
One interesting point: the FAI installer that I mentioned for Debian can tie into CFEngine, which is kind of interesting; a nice tie-in there. Going back to the questions about automation and about tracking and managing configuration: that's where all of this falls, how you automate all that, how you track all the configuration. At Omniture, when you get to this size, you have to be really careful about how many machine types and how many builds you have, because that can scale out of control really, really fast. We're very, very strict about our hardware configurations, very strict about what new software we'll deploy where, what goes on our images, that kind of stuff. You have to be careful; otherwise it scales out of control and the complexity just becomes huge. So, something to keep in mind.
Let's see. So yeah, a common thread in a lot of these systems is that they're declarative versus procedural, which I like a lot. It's "what" versus "how": what should my system look like, as opposed to run this series of steps to get there. Which is nice. A danger with these kinds of systems, particularly at large scale, is that it's really easy to hose your entire system. If you misconfigure something, if you put in an inadvertent rm or an inadvertent "remove this file" and it's the wrong file, you end up removing a directory, your code directory, something like that. Bad news. So my recommendation there (and we'll talk a little bit more about this) is to have a really good staging and QA environment that mimics your production operations, so that you can run these configuration changes through that environment first and work out the bugs there. And employ good change management practices; we'll talk about that in pattern number twelve.
Pattern number three: software and patch management. Once you have your systems up, how do you keep up to date with all the patches? How do you keep your yum and your apt repositories up to date? How do you know which patches to take in from upstream, and which not? How do you manage all that? Obviously we've got the tools: yum, apt, and friends. An interesting one that I don't think we hear about too much is rPath; that would be an interesting one to look at. They've done kind of an RPM-squared sort of thing, with all sorts of stuff going on there. And rsync: you can use that to push stuff out. Another question: to package or not to package? Do you want to use a packaging system like RPM or yum, or just build everything from source, deploy it all to /usr/local, and have a golden build? That's another question you have to consider. There are pros and cons each way; I'm a packaging person myself.
Let's see. The main challenge here is: how do you verify that every one of your 20,000 machines has the same software set? How do you ensure that all your web servers have the exact same set of software they need to run their code correctly? How do you make sure that all of your mail servers, or whatever servers, have all the right Perl modules on them? Stuff like Puppet and CFEngine has ways to handle that: saying, this is the set of RPMs, or this is the set of packages, that should be installed on this server, and taking corrective action if you don't find them there. That's definitely a challenge.
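One low-tech way to see the shape of the problem is to diff each host's package list against a golden set. A sketch with made-up sample data; in practice the installed list would come from something like rpm -qa with a name-only query format, sorted:

```shell
# Golden set vs. what one host reports (sample data, sorted,
# as comm requires)
printf 'httpd\nopenssl\nperl\n' > /tmp/golden.txt
printf 'httpd\nperl\n'          > /tmp/host.txt

# Lines only in the golden list = packages missing on the host
comm -23 /tmp/golden.txt /tmp/host.txt   # → openssl
```

Run per host and you have a crude audit; the configuration-management tools above effectively do this continuously, plus the corrective action.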
Pattern number four is monitoring. As I was preparing for this talk, I found there's some ambiguity in what's meant by monitoring versus what's meant by data collection. I'm going to define monitoring as essentially availability and performance. So let's talk a little bit about performance first. On performance, I'd recommend looking at the slides Cary Millsap gave at the recent Percona conference, held in conjunction with the MySQL conference. The big thing to take from there is response time: focus on response time. We can measure all the layers of the stack, but in the end, what our users are experiencing is where our bread and butter comes from.
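Response time can be probed even from the command line with curl's -w timing variables. A sketch that points at a local file:// URL so it runs offline; in practice you would point it at your site's real URL and watch the DNS, connect, and first-byte phases separately:

```shell
# Create something to fetch, then report per-phase timings.
# Against a real URL, dns/connect/ttfb become meaningful numbers.
echo hello > /tmp/probe.txt
curl -o /dev/null -s \
  -w 'dns=%{time_namelookup}s connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  file:///tmp/probe.txt
```

This is a one-host, one-vantage-point view; the external services described next do essentially the same measurement from many nodes around the world, per object on the page.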
Let's see. And then of course availability. Performance: you want to measure how fast it is, how well it's performing, how it performs under load. Availability: is it available or not? There are a couple of other facets of monitoring. You have external monitoring, meaning: what does our site look like to the rest of the world? Is it up, is it available, is it performing? And internal monitoring: are my SMTP servers all responding like we expect them to? Is my database responding like we expect it to? Is my queueing system accepting messages into the queue? That kind of thing.
For external monitoring, there are a lot of different services out there, and these are not open source. Some are cheap, some are free, some are very expensive; Keynote and Gomez are really expensive. These allow you to essentially synthesize web transactions. Some of them have little script recorders where you can browse to your website, record that script, and then replay it from different monitoring nodes all over the world in their network, and they give you graphs and reports about how things are performing on your site. mon.itor.us is actually a free one; they do have paid versions, but there is a free tier. I can't remember how many monitoring nodes they've got. Another interesting one that I thought was pretty good when I tested it out is uptime.com. Some of these are just focused on uptime: is my site available or is it not?
Others actually go into response time and give you detailed information: how long did it take to do the DNS lookup on a particular request? How long did it take to get the first byte? Once that first byte came across, how long did it take to actually get the rest of the content? They'll break down your connection by that, and break it down by each object on your page: how long did it take to fetch this graphic, how long did it take to fetch this Flash object, all that, and give you connection statistics on all of it. Those tend to be really expensive too, and the price is based on the number of nodes you're monitoring from, where the nodes are located, how often you run the tests, and what kind of tests they are: are we fetching a single URL, or an entire page of objects, or walking through a transaction across several pages of the site? It's important to do this
because you want to drink internally from saying agios or something like that or nauseos.
|
||
|
|
Raising has quick serve of hypernases and agios. Nauseos? Okay. Just curious.
|
||
|
|
Let's see. Nagios — I'm trying to talk good. Anyway, internal monitoring can only go so far. It's good to see your transactions run through the entire stack coming from someone in Russia — how are they doing? That's an interesting thing: when you've got data centers that are located just in the United States, for example, how are people in Europe, how are people in Russia, how are people in China — how is their performance compared to someone in the United States who's a couple of hops away from your data center? Depending on where your audience is, that can become very important: who you're trying to reach and how well you're trying to reach them, what kind of market penetration you're looking for, that kind of thing.
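As a rough sketch of what such a monitoring node measures for one request — the DNS, connect, first-byte, and content timings those services report — here's a minimal single-probe version with no external service; the URL is whatever you point it at, and the field names are our own:

```python
# Approximate the per-request timing breakdown a synthetic monitor reports.
import socket
import time
from urllib.parse import urlparse

def time_fetch(url):
    """Return rough timings (seconds) for one plain HTTP GET of `url`."""
    parts = urlparse(url)
    host, port = parts.hostname, parts.port or 80
    t0 = time.time()
    addr = socket.gethostbyname(host)           # DNS lookup
    t_dns = time.time()
    s = socket.create_connection((addr, port))  # TCP connect
    t_conn = time.time()
    path = parts.path or "/"
    s.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
    s.recv(1)                                   # wait for the first byte back
    t_first = time.time()
    while s.recv(4096):                         # drain the rest of the content
        pass
    t_done = time.time()
    s.close()
    return {"dns": t_dns - t0, "connect": t_conn - t_dns,
            "first_byte": t_first - t_conn, "content": t_done - t_first}
```

Run from several vantage points on a schedule, the same numbers give you the geography comparison described above.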
Internal monitoring: you want to monitor any system that will impact the quality of your service if it becomes unavailable. So you have to be really thorough and proactive about this. But as outages come, you can use them as opportunities to patch up holes in your monitoring. If this outage happened, why didn't we know about it beforehand, and what measures can we take to make sure we're alerted well before it happens in the future? Well before the disk fills, well before we saturate the network, well before we run out of query capacity in our database. And go beyond just checking the port. A lot of these systems make it really easy to say: oh, port 80 is responding, we're good. Well, no, you've got to go beyond that. Fetch a page, cause a transaction to happen on your web server, deliver a message through your SMTP systems, set a value in your memcached, any of those kinds of things. Make sure that every aspect of your system is functioning as it normally ought to.
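A sketch of that "beyond the port check" idea, in the style of a Nagios plugin (exit 0 for OK, 2 for CRITICAL, one status line on stdout); the URL and the marker string a healthy page would contain are assumptions for whatever your app emits:

```python
# A deep check: fetch a page and verify a transaction actually completed,
# rather than just confirming port 80 answers.
import sys
import urllib.request

OK, CRITICAL = 0, 2

def evaluate(status, body, marker):
    """Decide the plugin result from an HTTP status and response body."""
    if status != 200:
        return CRITICAL, f"CRITICAL: HTTP {status}"
    if marker not in body:
        return CRITICAL, "CRITICAL: page served but transaction marker missing"
    return OK, "OK: transaction completed"

def main(url, marker):
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            code, msg = evaluate(resp.status, resp.read().decode(), marker)
    except OSError as exc:
        code, msg = CRITICAL, f"CRITICAL: {exc}"
    print(msg)
    sys.exit(code)
```

The same shape works for the SMTP or memcached checks mentioned: perform the real operation, then judge the result.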
There are a lot of tools out there that can do this as far as uptime and availability. Nagios obviously is a very popular one. Zabbix is another one that's come out in the last few years; it looks pretty, and it looks like it scales really well. A lot of these are agent-based models, or can use agent-based models, where, as opposed to just hitting something from outside your server, you actually run an agent on your server, and it collects data there and then phones home — sends it back to a centralized point where you can aggregate that data. Hyperic is another one, and OpenNMS. Hyperic gives you auto-discovery, which is really nice for large environments. I guess the way this auto-discovery works is you install an agent in your provisioning process, and when the machine comes up it starts sending out responses — hey, I'm here, I'm here — and it just shows up in your centralized monitoring system. Let's see. OpenNMS is another one, written in Java.
Let's see. All of these, you'll find, have different feature sets. Some of them do just monitoring; some of them include a little bit of inventory and asset management, data collection and visualization. OpenNMS's three focuses are service polling — polling your services for data — data collection, and event and notification management: in other words, if something's down, tell me about it. Mon is another one, a pretty simple one. Monit: its headline feature is being able to monitor and take corrective action. So if something happening on your server trips a certain threshold, it can take corrective action that you've predefined. Reconnoiter is done by the guys at OmniTI — I don't know how you pronounce that. Theo Schlossnagle there wrote a good book called Scalable Internet Architectures. Smart guys, definitely something worth looking at. They focus on ease of administration, efficiency and scale, delegated deployment, and applying policies to large fleets of servers. That's an interesting one I'd like to take a closer look at. Another one, called ERMA, that I hadn't heard of before I got ready for this: the Extremely Reusable Monitoring API, developed at Orbitz, the travel company. How many of you knew that Orbitz is an open source company? Anyway, they've got some really good open source software that they've provided to the community. This one, again, is a monitoring tool; there's a lot of Java going on with ERMA.
The next pattern is system data collection and visualization; there's a lot of crossover between this pattern and the previous one. There's another one that's jumped onto the scene recently, Zenoss. Oh yeah. Yeah. It's pretty interesting. We've been working with it a little bit. Should we think about doing something like that at the open source conference, where we actually have a bunch of different ones on display? That's one of the ones they're providing. If I'm missing any of these, please — this is an open source presentation — so let me know and we'll incorporate it.
I like Nagios. I like the object model that Nagios uses: you've got hosts, you've got services, you've got escalations, you've got schedules, and it all ties in really nicely. I'm kind of a data person — I was a DBA in a past life — so I like its object model, and it wouldn't be too hard to put the Nagios object model into a database, store your configuration there, and generate your Nagios config from it. I like Nagios; what I don't like so much about Nagios is that the interface is built from CGIs written in C, and so it makes it a little bit harder to extend and integrate if you want to. But the core monitoring is pretty nice to plug into — easy to plug in your own checks. Again, if I'm missing things on this list, let me know. I'd love to add these to the presentation.
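A sketch of that database-driven idea: Nagios objects kept as data (a list of dicts standing in for database rows here) and the config file generated from it. The hostnames and check commands are invented examples:

```python
# Generate nagios host/service definitions from structured data instead of
# hand-editing config files.
HOSTS = [
    {"host_name": "web01", "address": "10.0.0.11", "services": ["HTTP", "SSH"]},
    {"host_name": "db01",  "address": "10.0.0.21", "services": ["MySQL"]},
]

def render(hosts):
    """Return nagios object-definition text for a list of host records."""
    out = []
    for h in hosts:
        out.append("define host {\n"
                   f"    host_name {h['host_name']}\n"
                   f"    address   {h['address']}\n"
                   "    use       generic-host\n"
                   "}")
        for svc in h["services"]:
            out.append("define service {\n"
                       f"    host_name           {h['host_name']}\n"
                       f"    service_description {svc}\n"
                       f"    check_command       check_{svc.lower()}\n"
                       "    use                 generic-service\n"
                       "}")
    return "\n\n".join(out) + "\n"

print(render(HOSTS))
```

Swap the literal list for a database query and you have the config-from-a-database workflow described above.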
Let's see — so, system data collection and visualization. There's a lot of overlap between monitoring and this, because essentially pattern number four is another sort of data collection and visualization. There are a ton of these. You've got Cacti, built in PHP; it's a front-end around RRDtool. A lot of people use it. It can do graph templating, so if you have all sorts of different graphs you want to create, you can have templates that produce different kinds of graphs based on a theme. One thing I really like about Cacti is the ability, on a given graph, to draw a little box around a region and it'll zoom into that graph and expand that section out, so you can dive deeper into a particular event if you're interested in what happened there.
Ganglia is probably my favorite on this list. It was built for high-performance computing clusters, and so it scales quite well. It uses multicast underneath: you've got all these agents running on your boxes, called gmond, and they report home to a gmetad, which is an aggregator, and they can use multicast so they're not clogging your network so much. Really a nice tool. It can monitor just about anything you can produce with a script, so if you have a script that can spit out some kind of value, Ganglia can take advantage of that and send it from your servers back to the master. It has a PHP front-end. It can aggregate data by cluster, so if you have a web cluster and you want to see how your web cluster is performing overall, it can show you the average load across that cluster of machines, the average disk usage across that cluster of machines. It's really nice. I really like that aggregation feature — very, very helpful. You can export the data out of it in XML to plug into other systems. We did
that at United Online. It uses RRD to store its information. Munin is another one. It's another agent-based system, where you've got an agent running on your machines, written in Perl. It uses RRD, and it's easy to write plugins. A lot of these have that in common: they have plugins, and if the set of plugins they provide to monitor particular aspects of your system doesn't quite have what you want, you can write your own plugin to monitor whatever it is about your system that you care about — your key performance indicator.
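For instance, a cron-driven script feeding a custom KPI into Ganglia via its gmetric tool can be sketched like this; the metric name and the /proc source are examples, while the long-form flags are gmetric's documented ones:

```python
# Emit one custom metric into ganglia by shelling out to gmetric.
import subprocess

def gmetric_cmd(name, value, mtype="uint32", units=""):
    """Build an argv list for ganglia's gmetric command-line tool."""
    cmd = ["gmetric", "--name", name, "--value", str(value), "--type", mtype]
    if units:
        cmd += ["--units", units]
    return cmd

def report_open_files():
    # example KPI: allocated file handles, first field of /proc/sys/fs/file-nr
    with open("/proc/sys/fs/file-nr") as f:
        allocated = int(f.read().split()[0])
    subprocess.run(gmetric_cmd("open_files", allocated, units="files"),
                   check=True)
```

Anything a script can compute becomes a graphed, cluster-aggregated metric this way.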
Zabbix — we mentioned that one. collectd, the system statistics collection daemon: this one's really cool, as I was reading about it. Again, it runs on your servers and gathers stats. It's written in C and intended to be extremely lightweight. It's pluggable, and it's got bindings so you can write your own plugins in C, in Perl, and in Java. You can communicate with it via a Unix domain socket, or it'll exec binaries or scripts for you. It can do SNMP for you. They've also added simple monitoring, like Nagios, with notifications and thresholds, and you can plug its data into Nagios — they've got a plugin that will do that for you. I like it because it's really simple; it's a really good glue piece in your system. It does one thing, and it does it really quite well. It doesn't generate graphs, but you can use its data to do so — stick it in RRD or whatever kind of graphing system you want. It can do high-resolution statistics — the default is 10 seconds — without putting too much load on your system. It's really good. They're targeting it for embedded systems too, like your OpenWrt router, so you can run it on that kind of thing. It has a data push model using IPv6 and multicast — again, so it doesn't clog your network — or you can use IPv4 and unicast if you want. And with multicast you get this auto-discovery behavior, because as your nodes come up with these agents running on them, they phone home. You don't have to enter configuration on the master listing out all your 1,000 machines; they just show up as they start sending their data back home. So that's a really nice tool. collectl is kind of a
spreading their data back home. So that's a really nice tool there. Collectible is kind of a
|
||
|
|
command line tool. It's kind of like VMStat and IOStat and all those different stack commands
|
||
|
|
kind of rolled into one and you can format what I want to see CPU and compare it with network
|
||
|
|
compared with my memory usage and those all stats and you know scroll amount as they come.
|
||
|
|
Yeah, that one's written in purl. It can really nice about collectible as it can do sub-second
|
||
|
|
resolution. So if you need to collect really fine-grained data, you know less than every second,
|
||
|
|
you can do that. It takes the time higher as module and purl to do that. It can give you
|
||
|
|
command line output. You can run it as a demon. You can send data over UWP if you didn't
|
||
|
|
do ganglia. You can output as aid and plot format if you didn't do a new plot or open office.
|
||
|
|
It has an interactive mode from the command line. You can do a record mode so you're sending the data
|
||
|
|
to a file. You can read data from a file into it and look at it, play it back for you.
|
||
|
|
You can output it to output trade arbitrary socket. So if you want to write your own demon
|
||
|
|
to harvest all the data, it can send whatever it's collecting while it's running instead of
|
||
|
|
some socket. Again, it's another tool. It's just easy to integrate. It's a nice little foundation
|
||
|
|
block in your environment. It's new stuff with it. And it can do stuff that's above and beyond
|
||
|
|
your basic stats — what sar will give you; sar is another one that's not on this list. collectl can do stuff like NFS stats. If you're using the Lustre cluster file system, it will grab stats from that. Interconnect stats, slab data. Again, this is a tool that's really easy to integrate into your environment. Then there's dstat, written by the guy who does the DAG package repository. It's a replacement for vmstat and iostat — a lot like collectl, except it's written in Python. Interesting how for each tool you find a similar one written in a different language. Go figure. It's also easy to extend with plugins, and it's built so that there are no time shifts when the system is stressed.
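As an aside on collectd's exec interface mentioned a moment ago: a minimal exec-plugin script just prints PUTVAL lines to stdout, which collectd reads. PUTVAL is collectd's documented plain-text protocol; the plugin name and the mail-queue KPI below are made-up examples:

```python
# A collectd exec-plugin sketch: emit one gauge value per interval.
import os
import time

def putval(host, plugin, type_instance, value, interval=10):
    """Format one collectd PUTVAL line for a gauge metric."""
    ident = f"{host}/exec-{plugin}/gauge-{type_instance}"
    return f'PUTVAL "{ident}" interval={interval} N:{value}'

def queue_depth(spool="/var/spool/mqueue"):
    # placeholder KPI: count of files waiting in a mail queue directory
    return len(os.listdir(spool)) if os.path.isdir(spool) else 0

def report_forever(interval=10):
    # collectd's exec plugin runs this script and reads stdout line by line
    host = os.uname().nodename
    while True:
        print(putval(host, "mailq", "depth", queue_depth(), interval),
              flush=True)
        time.sleep(interval)
```

The "N" timestamp means "now", so collectd stamps each sample on arrival.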
You can export the data to CSV. RRDtool, of course, is the granddaddy of all this stuff. It uses round-robin databases. What's really nice about RRD is how it stores graphs and data: RRD files have a fixed size you set up front, so you never have to worry about the database growing. You say, I'm going to collect this much data, and as it fills up the database it just starts writing back over the beginning. The assumption is that the older the data is, the less likely you are to want to look at it, so it aggregates older data into less and less granular time slices. It's used just about everywhere. It can produce graphs, and you can put just about anything into it as long as it's time-series data. MRTG is by the same author who wrote RRDtool, used for graphing traffic from routers and any other device with SNMP support.
It's written in Perl. Graphite is another one that came out of Orbitz, the travel company. This one is really, really cool. It's very, very similar to RRDtool, except it fixes a couple of the problems they ran into when trying to scale RRDtool out to a huge number of machines. It's enterprise-scale, real-time graphing: designed to be horizontally scalable, storing data for thousands of devices. You can add machines to increase throughput, and get real-time graphs even under load. A quote from the website: "at the time of this writing, the production system at Orbitz can handle approximately 160,000 distinct metrics per minute running on two Niagara 2 Sun servers on a very fast SAN." So that's pretty good. It's written in Python. And of course SNMP: a lot of devices run SNMP — a lot of routers and network equipment — and you can query those devices, or you can run SNMP on your servers and gather data that way. drraw is a graphing tool that you can use to draw RRD graphs. Supermon is another one; I don't have a lot of data about that one. Let's see — moving on, though.
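The round-robin scheme described above can be sketched by shelling out to the rrdtool command line; the step, heartbeat, and archive sizes here are arbitrary examples of its DS/RRA syntax:

```python
# Build rrdtool argv lists for a fixed-size, self-aggregating time-series file.
import subprocess

def create_args(path):
    # one-minute step; a day of 1-minute averages plus a month of 1-hour
    # averages -- file size is fixed at creation, old data is overwritten
    return ["rrdtool", "create", path,
            "--step", "60",
            "DS:load:GAUGE:120:0:U",       # gauge, 120 s heartbeat, 0..unknown
            "RRA:AVERAGE:0.5:1:1440",      # 1440 x 1-min averages (one day)
            "RRA:AVERAGE:0.5:60:720"]      # 720 x 1-hour averages (30 days)

def update_args(path, value):
    # "N" means "now"; rrdtool timestamps the sample itself
    return ["rrdtool", "update", path, f"N:{value}"]

def run(args):
    subprocess.run(args, check=True)
```

Calling `run(create_args("load.rrd"))` once and `run(update_args("load.rrd", v))` from cron gives the never-grows database behavior described.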
Item number six: ticketing. Once you get a big enough staff, maintaining what has happened in the network through post-its on your monitor doesn't scale very well. So you want some kind of ticketing system whereby people can submit trouble tickets, help requests, that kind of stuff. There are a ton of those out there if you look for them. I like RT; that's about all I'll say about that. One thing I don't like about RT, unfortunately — well, RT is written in Perl, obviously; I like that. One thing I don't like about RT is that rendering the page when you have a huge ticket takes way too long. I wish they did some caching or had something to make it faster. Right, right — but I'm a lazy sysadmin, right? If someone's already built it. But that would be the lazy admin, I agree. Right. What's nice about RT is it does have an online interface, so you can interact with it that way. Good stuff. And it's very extensible, being Perl. You can plug templates into it; you can do all sorts of extension of it. It integrates well into your environment.
Let's see — pattern number seven, centralized user account management. Once you have all these machines, how do you log in to all of them? How do you maintain password changes? A couple of possibilities. LDAP: there are a number of open source LDAP servers out there. You've got OpenLDAP, which is okay — but if you're going to scale it up, make sure you don't use the Berkeley DB backend, because that tends to get corrupted frequently. I see some nodding heads in the room; people have probably sweated over that a couple of times. Another question, if you're using some kind of centralized directory system, is: what do you do when your directory is down? Do you have a set of escape accounts that you can log into to manage machines, or are you toast? Then you've got Kerberos
and others. What we did at United Online is we used CFEngine to distribute password files. We tried LDAP for a little while in the internal environment and ran into all sorts of problems with it, and so we decided we were going to use CFEngine to distribute password files. It worked really nicely. Puppet can do the same thing, although you don't have to distribute flat files with Puppet; you can just use a little snippet of Puppet code and say: define these users, you need this account. You can templatize it
all. What about users' home directories? If people log in, are they going to expect a certain set of files to be there? How do you do that? Do you mount home over NFS and deal with all the issues that come with NFS? What happens when you have 10,000 NFS clients per server — what do you do then? Do you have a regular rsync that pushes the contents, maybe a specific set of contents, out to all these machines on a regular basis? These are questions you have to ask. Again, there are lots of tools you can do that with. I like the idea where you have something like CFEngine or Puppet doing the pushing: a simple server that all the users with a login can log into, where they can say, I want this set of files to be on all the machines, and then something like CFEngine or Puppet pushes that out, so that all the nifty little tools in my bin directory are on each machine. That's what I would do. I wouldn't use NFS, just because once you scale to 10,000 machines you'll cripple your NFS server. Yeah — so, that works well.
Any other suggestions? What have you guys done that works well for you? Yeah — and it could depend on your environment. It could depend on your security policy. There may be some places where you don't want shared tools lying around for would-be attackers to get into. Yeah. Other than the fact that people like different shells and that sort of thing, it probably works for most setups. What's that? If something like that works for you, it probably does the job. Yeah, it's just a thing, right? Yeah, I would say, other than that, it's only in narrow cases — where you want specific configurations — that you run into problems; otherwise, if you have a small team, you can probably get by with that in most areas. That's the thought.
Yeah, great, thanks. Okay, so the next pattern: DNS. Inside this environment you're going to need some kind of DNS service. You're going to want external DNS service, so the world asking for your website can find it — and you want something scalable and stable. But you're also going to need internal DNS service, so you'll have some kind of local resolvers. There are a number of open source DNS servers out there. You've got the venerable BIND: very capable, and it can be very complex if you want it to be. MyDNS is one we used at United Online. It's actually really quite simple. There haven't been a whole lot of updates to the code base in quite a while, but it's still pretty stable. Basically it's a DNS server that's backed by a MySQL database, or a PostgreSQL database if you want. At United Online we had millions of resource records in there, probably 80 to 100 thousand domains that we hosted inside it. And updating it — any DNS update is just a database update, so you can write your own front-end tools for it. djbdns is another good resolver, if you can deal with Dan Bernstein's code and his philosophy. I don't prefer it, but that's all I'll say. There are lots of others out there,
lots of other DNS systems. But the thing to think about is: you want internal service, and you want external service. Now, a little bit of DNS wisdom here. As you're building your environment, as you're writing your code: don't hard-code IP addresses into your code. Because your database — whatever server you're connecting to — is going to change sometime, and it's a royal pain in the rear to have to do a code rollout to change that. So instead, use names. Your code never has to change; you never have to change your code when your infrastructure or your environment changes. Create a CNAME in DNS. When you have to change infrastructure, change the DNS, and you're done with the change. So that's suggesting one way of going about that. It's been nice.
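The advice above — names, not IP literals — in a tiny sketch; `db.example.com` stands in for whatever internal service alias ops controls:

```python
# Connect to a service by DNS name so ops can repoint a CNAME without a
# code rollout.
import socket

SERVICE_HOST = "db.example.com"   # a CNAME ops controls; never an IP literal

def connect_service(host, port, timeout=5):
    # the DNS lookup happens at call time, so repointing the CNAME takes
    # effect on the next (re)connect
    return socket.create_connection((host, port), timeout=timeout)
```

A call like `connect_service(SERVICE_HOST, 3306)` keeps the infrastructure decision entirely inside DNS.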
The next pattern is mass execution. There comes a time when you've got thousands of machines and you want to run the same thing on all of them. Maybe you haven't got your collection system on them and you want to know the value of dirty buffers in /proc/meminfo on all of those machines. Or maybe I've got a file update I need to make, or whatever you have to do across 20,000 machines — you don't want to be SSHing into 20,000 machines to make that change. There are a lot of different options here in the open source world too. You've got cfrun, which is a piece of CFEngine, and it ties in with CFEngine, so that based on whatever classes you've defined in CFEngine, you can say: run this command if this machine is part of this class. Func is one that's really quite cool — written in Python, fairly new. You can either use it from the command line, or it's got a library, so you can incorporate it into your scripts and write scripts that do stuff across all your infrastructure. Really worth looking at. SSH and Expect is another way to do it; at United Online we had this lovely little tool called Massos. It was built on SSH and the Expect Perl module. It worked okay, had its warts, but I don't know if I'd use it again given these others. C3 is a set of tools built for high-performance computing clusters. Capistrano is another one; it comes out of the Rails community, Ruby. Distributed shell, Fabric — I'm not going to go into these too much, but they're out there, available.
Pattern number 10 is time synchronization. Once you get up to a number of machines, you're going to want some way of keeping them all on the same clock, so that odd things don't happen because your DNS server thinks it's midnight when the rest of them are at 6 a.m. — fun things happen when that happens. So you use NTP; that's pretty much the main thing out there. There's a book out by Apress called Expert Network Time Protocol, and in there they talk about having to account for relativity — when we have some interplanetary network, right, you'll have to deal with Einstein effects in your time syncing. So when we get to that scale, just keep that in mind; you'll have to worry about it. I don't think we're there yet. Maybe NASA is — I think NASA has to deal with that kind of stuff. What's that? I'm sorry — I don't know. It would be interesting to find out, but I assume NASA has to deal with these kinds of things, right? Time shifts — as you accelerate your rockets fast enough,
you've got to deal with time dilation. Anyway, pattern number 11 that I noticed is having some sort of internal messaging or IRC system. You want an easy way of being able to talk with people, quicker than you can get with email. At United Online we had an internal chat server; at Omniture, same story. It was really interesting when I came to Omniture and started to see all these same things — like, my first day there I saw all of these similarities; there were even 4,000 emails from Nagios in my inbox the first day. It's like: boy, I'm right at home, this is great. Get those filters set up really quick. Anyway, there are a bazillion IRC daemons and a matching number of clients and bots. Have the bots that you write for your system do interesting and useful things, like little commands that can query the status of certain host groups. If you want to get really fun, tie this into your mass execution stuff and be able to send out commands to your systems from your IRC channel. All sorts of things you can do — because we're open source, we can tie all this stuff together. You've got an endless supply of LEGOs to play with here. So if you want your Nagios alerts to come through your IRC channel, you can do that too. All sorts of stuff you can do with IRC and your bots. Yeah — have it send snarky messages to your boss when it's late and a machine's gone down.
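A toy sketch of such an ops bot — the server, channel, nick, and the `!status` command are all invented for illustration, and a real bot would wire `status` to your monitoring system:

```python
# A minimal IRC ops bot: answer PING and a "!status <group>" command.
import socket

def handle(line, status=lambda group: "unknown"):
    """Turn one raw IRC line into a reply string, or None."""
    if line.startswith("PING "):
        return "PONG " + line.split(" ", 1)[1]
    if "PRIVMSG" in line and "!status" in line:
        group = line.rsplit("!status", 1)[1].strip()
        return f"PRIVMSG #ops :{group}: {status(group)}"
    return None

def run(server="irc.internal", channel="#ops", nick="opsbot"):
    s = socket.create_connection((server, 6667))
    s.sendall(f"NICK {nick}\r\nUSER {nick} 0 * :{nick}\r\n"
              f"JOIN {channel}\r\n".encode())
    buf = b""
    while True:
        buf += s.recv(4096)
        while b"\r\n" in buf:
            line, buf = buf.split(b"\r\n", 1)
            reply = handle(line.decode(errors="replace"))
            if reply:
                s.sendall((reply + "\r\n").encode())
```

Keeping the protocol handling in a pure `handle` function makes the bot's behavior easy to extend and test.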
Make life fun; make your bots fun, do fun things. Pattern number 12: change management and auditing. With recent years and Sarbanes-Oxley and all sorts of stuff like that, depending on your environment, you have to worry about this. One recommendation: read this book, the Visible Ops handbook. Really, really good book, and it defines a good change management system. Eventually you're going to get to the point where you need to know what has changed on a system. If something goes down, you don't want to have to go digging through servers; what you'd rather do is go to your audit log: what change has been made that might have triggered this? Go there first and be able to find that. There's an interesting quote out of this that I'm going to read.
He says — well, essentially, the authors of this book did a bunch of research on corporations and companies and organizations, asking which ones have the most effective IT organizations as far as change management. And they say: high-performing IT organizations "can effectively handle extremely high volumes of change, often responsible for successfully implementing hundreds or even thousands of changes per week. They sustain high change success rates of over 99 percent, as defined by changes that are successfully implemented without causing an outage or an episode of unplanned work." That was the quote that caught my attention in that book. When I started reading it, it sounded kind of like high über-ITIL speak and that kind of stuff. But then I saw this and
it really caught my eye, because we've all been working somewhere where something changes and you spend hours trying to clean up after that change. So this book describes all sorts of patterns that these high-performing organizations follow. And depending on how it's implemented, change management can be either a royal pain in the rear or it can work with you — it comes down to the way you approach it and the way that you do it. But these kinds of statistics really get my attention when I see them. So: good change management can make or break your organization. Next pattern — number 13: there is no pattern 13, as a reminder that there's a lot more magic in these machines than we want to admit. So, pattern 14, project management. There are a number of tools out there. Eventually you're going to want something to organize your projects, whether that's, I don't know, a wiki, or Microsoft Project — maybe not, that's not open source — MrProject, dotProject. There are a number of those out there. Pattern number 15: internal mail handling. Your systems generate mail. When
those out there. Pattern number 15, interim mail handling. Your systems generate mail. When
|
||
|
|
Cron, some happens in a Cron execution, you're going to get mail. There's all sorts of events
|
||
|
|
you're going to generate mail. You're going to want to have something that's going to catch all
|
||
|
|
that and get it into a place where you can look at it. Maybe a bunch of proctomy rules or something
|
||
|
|
to follow that. Weed out all the important stuff out of your interim mail handling. You want it to
|
||
|
|
be nice and redundant. Number of tools, again, postrics, XMQ mail, XMTP is an interesting one. It's
|
||
|
|
a SMTP demon written in Pearl. It's got this pluggable architecture so you can kind of hook all
|
||
|
|
sorts of stages in the SMTP dialogue to do different things. So, yeah, again, catch emails
|
||
|
|
there by Cron. If you have systems that are delivering mail to your customers based on certain
|
||
|
|
events, maybe you've got a science event that triggers you need them off to them. Maybe they've
|
||
|
|
forgot my password kind of thing. You don't want your customers waiting for three hours until they
|
||
|
|
get their, you know, they're, they're forgot my password and they get, oh, wait, what did I do?
|
||
|
|
Anyways, so, yeah, you want a good internal, solid internal mail system that can handle that.
|
||
|
|
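A sketch of the kind of event mail described, using Python's standard library; the relay host and addresses are placeholders for your internal mail setup:

```python
# Build and send an internal alert mail through a local relay.
import smtplib
from email.message import EmailMessage

def build_alert(subject, body, sender="cron@web01.internal",
                rcpt="ops@example.com"):
    """Assemble a plain-text alert message."""
    msg = EmailMessage()
    msg["From"], msg["To"], msg["Subject"] = sender, rcpt, subject
    msg.set_content(body)
    return msg

def send(msg, relay="smtp.internal"):
    # hand the message to the internal relay; procmail-style rules on the
    # receiving side can then sort the flood
    with smtplib.SMTP(relay, timeout=10) as s:
        s.send_message(msg)
```

The filtering then happens centrally, on the receiving mailbox, rather than on every generating host.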
Pattern number 16: internal log harvesting and processing. Depending on what you're doing, you're going to generate copious amounts of logs — be those Apache logs, mail logs, syslogs. One tool here that's useful is syslog-ng. You can have a master syslog server, and on all of your nodes you can configure syslogd.conf to log to a centralized log host. There are also a lot of networking devices that can do the same sort of thing, where you can specify a syslog host to capture all this stuff. You want some good tools to be able to capture all of that, aggregate it, and analyze it in a good way for you. One thing
you might look at is Hadoop, who here's familiar with Hadoop. Hadoop is an open source implementation
|
||
|
|
of the MapReduce system that's in use at Google. It's Yahoo's got a really huge Hadoop close cluster,
|
||
|
|
I think like 10,000 nodes in which they process and crunch logs. If you want to play around with
|
||
|
|
Hadoop and MapReduce, Amazon just announced, I think they call it elastic MapReduce, so there's
|
||
|
|
a cloud service where you can instantiate a MapReduce cluster and play around with it if you want.
|
||
|
|
Facebook just recently started using Hadoop, crunch a bunch of data that they have and they built
|
||
|
|
this layer called Hive, which they've open sourced. Hive produces a SQL-like layer on top of MapReduce,
|
||
|
|
so it takes your SQL, converts it to a MapReduce kind of function, sends that into Hadoop,
|
||
|
|
holds results back. Really good for kind of data warehousing and real-time query, touch that.
|
||
|
|
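The MapReduce data flow mentioned above can be illustrated in miniature without a cluster. This is a toy, single-process sketch — real Hadoop distributes the map and reduce phases across many nodes, but the shape of the computation is the same. The log format here is invented for the example:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map step: emit one (key, 1) pair per record.
    Here the key is the HTTP status code at the end of a fake access-log line."""
    status = line.split()[-1]
    yield (status, 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

def mapreduce(lines):
    # Chain all the mapper outputs together, then reduce.
    return reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
```

Counting status codes across a few lines this way is exactly the kind of job that, at Yahoo or Facebook scale, gets farmed out to a Hadoop cluster.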
Audience member: One thing you might look at is rsyslog, which actually has a database backend — it's the newer version of syslog. It's the default on most of the Red Hat systems now, so all of your syslog goes through it, and it has an option for a database backend — MySQL is supported, and I think several others — so it can log straight into a database, and then you can integrate that into a SQL warehouse.

Okay, cool. Good stuff there.

Audience member: I think that's where rsyslog really helped us. We had a problem where we were running a lot of firewall traffic logging through syslog, and messages were getting dropped on a lot of our machines. We implemented rsyslog, changed our rules, and now we get all of our messages.

I'd really like to see a comparison of the performance of syslog-ng versus rsyslog in those sorts of environments, because there's a lot of value in being able to dig into the code, or dig into the logs themselves, and find out what's interesting about the trends. I was reading a book today called The Art of Capacity Planning, by one of the guys at Flickr, who said that they feed their syslog data into some kind of system so they can watch, for example, the number of messages they see during a given time period. That way, if a flood of messages comes into your syslog, you can be alerted to that fact.
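The message-volume alerting attributed to Flickr above boils down to a sliding-window counter. A minimal sketch, with the window size and threshold purely illustrative:

```python
from collections import deque

class FloodDetector:
    """Alert when more than `threshold` messages arrive within any
    `window` seconds — a sketch of the syslog-volume monitoring
    described above (the numbers are made up, not from the talk)."""

    def __init__(self, window=60, threshold=1000):
        self.window = window
        self.threshold = threshold
        self.times = deque()

    def observe(self, timestamp):
        """Record one message arrival; return True if we're flooding."""
        self.times.append(timestamp)
        # Drop observations that have aged out of the window.
        while self.times and timestamp - self.times[0] > self.window:
            self.times.popleft()
        return len(self.times) > self.threshold
```

Feed every syslog arrival through `observe()` and page someone when it starts returning True.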
Audience member: I don't know if you've played with spread at all?

Yeah. Spread is messaging software that uses multicast, and what it claims is reliable, ordered delivery of messages over multicast. And there's a module that the same guys at OmniTI wrote — Theo Schlossnagle, the guy who did Reconnoiter — called mod_log_spread, for Apache. I don't know if there's been a whole lot of development work on it lately, but essentially what it does is allow your Apache logs to get sent out over this spread communication channel. The presenter I saw at ApacheCon — I think it was Theo, a few years ago — talked about how, by doing this, they were able to watch their traffic in real time, watch their logs coming off the machines in real time, and subscribe to different events to see what was going on. Really kind of cool stuff.

Audience member: There's also a log watcher called swatch. I've pointed a couple of people at it when I was teaching. Where logwatch will tell you what happened, swatch will watch the log like a monitor and tell you what's happening right now. That's the big difference.

There you go.
Next pattern: virtualization. This is important in the modern data center, and there are a lot of different virtualization options. You want to use it for consolidation, where it makes sense. At United Online we had a lot of benefit here: we started virtualizing a lot of things and were able to consolidate quite a bit. In one of our data centers we had something like a 40% reduction in footprint, which equates to a lot of power savings and a lot of cost savings, since there's less rack space you have to maintain. We had a lot of systems built back in 2000 — 2000-era machines — that we were able to consolidate onto these monster quad-cores with lots of disk. So where it makes sense, consolidate — but don't go overboard with it. The risk you run with virtualization is that when your host node goes down, for whatever reason, it affects a lot more systems than just the one it was before you virtualized.

Virtualization is also really useful for replicating your production environment on a much smaller scale — being able to replicate every aspect of production in a controlled setup. You can practice all of your CFEngine changes, all your Puppet changes, all your remote mass-execution changes. You can practice all that and make sure you don't have that inadvertent rm -rf that's going to wipe out all your code, or whatever, and cause mass havoc in your production system. So it's a really, really useful tool for replicating your entire production environment.

It's also a really good exercise, because then you get to know what's in your production environment. Sometimes when you scale out to these massive systems, and your company's been growing for 10 years, you've got all sorts of dark corners where who knows what's going on — some developer wrote that 10 years ago; I don't know what it does, it just does. Right? Going through this exercise of virtualizing your environment makes you explore those dark corners and figure out what on earth each thing is doing. So it's a really good exercise, to have that experience and that knowledge of every aspect of your system — every cron job, every process you have in production, replicated, obviously on a smaller scale, in a virtualized environment, a QA environment or the like.

Again, there are lots of open source tools here: Xen; VirtualBox, by Sun; OpenVZ, by the Virtuozzo guys; KVM; Solaris Zones. VMware — not open source, but it's there.
Pattern number 18 is a well-staffed NOC. Once you get to a certain point, it lets your sysadmins have a life. It requires, however, very well-documented systems and very well-documented procedures. But it's really nice to have frontline staff that handles all of the minutiae and allows you to focus on your core business and your core products — and on getting real work done, in theory.
Pattern number 19 is some kind of internal knowledge base — maybe a wiki, maybe docs in CVS or some other kind of version control. The important thing is to lower the barrier to entry. If it's a pain in the rear to use, nobody's going to use it, and it's going to lose its value. The easier and more intuitive it is to use — the less documentation you have to write to explain how to use it — the more people are going to want to use it. As you're deploying new systems, document them. As you find problems, as you patch holes in your monitoring, document what happened. Keep a daily log of what you're doing, so that if you go back and ask, well, what happened? how did I fix this particular problem? — you can say, yeah, these are the magic incantations I had to use that time to fix it. All in a good knowledge base.
Pattern number 20 is inventory and asset management. Once you get to a large number of machines, it becomes interesting trying to keep track of all that hardware. When you have 20,000 machines, it's really easy for hardware to get lost or fall through the cracks. You want some kind of system that makes it easy, once you unpack a machine and rack it up, to get it into your inventory management. We had a system where we'd slap a barcode on a machine and, with a laptop that had a USB barcode scanner, scan it right in — it goes into the system and into asset management.
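The scan-it-right-in intake flow above is easy to sketch. The field names, barcode format, and in-memory store here are invented for illustration; a real system would persist to a database:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Asset:
    barcode: str
    model: str
    rack: str
    racked_at: str = field(default_factory=lambda: datetime.now().isoformat())

class Inventory:
    def __init__(self):
        self._by_barcode = {}

    def scan_in(self, barcode, model, rack):
        # Called when the USB scanner reads the tag on a freshly racked box.
        if barcode in self._by_barcode:
            raise ValueError(f"duplicate barcode {barcode}")
        asset = Asset(barcode, model, rack)
        self._by_barcode[barcode] = asset
        return asset

    def locate(self, barcode):
        """Answer the 2 a.m. question: where is this box?"""
        return self._by_barcode[barcode].rack
```

The duplicate-barcode check is the kind of small guard that keeps 20,000-machine inventories honest.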
Some of the systems available out there — the data collection systems we talked about under monitoring kind of do double duty as this — include OpenQRM, xCAT, OCS Inventory NG, GLPI, and RackMonkey and RackTables, which give you rack diagrams. Or you can build your own. Back at United Online, I built my own system called the Host Database. Essentially it was asset management, and it tied in a bunch of stuff with Kickstart and all sorts of things. With all these open source tools, you can begin to tie this stuff together into this uber system.
Pattern number 21 — again, staging, beta, and QA, whatever you want to call your stages; maybe it's development, maybe it's alpha, beta, live. You've got to have some kind of system for going from development to production, and you absolutely don't want development doing stuff out on your production servers. I remember FreeServers — one of the hosting products, one of the brands we had at United Online. Back in the day, when they were North Sky, the developers would develop on web server number one, and the poor folks on web server number one had to live with whatever code happened to get pushed out by the developers there. When the work on web server one was done, we pushed it out to the rest of the servers. Probably not the best way to do that. So you want some kind of development environment, roped off carefully so nothing makes its way out to production.
Oh, there's an interesting story there. If you've ever been an Oracle DBA, you've probably been bitten by this at least once. When you run an Oracle database, there's this little program called the listener, and it's what handles incoming TCP connections and forwards them on to the database. The listener has a configuration in which you specify the Oracle SID and the host configuration, and it gets really easy to copy those TNS and listener configurations around. So one day, I remember, I had copied the listener configuration from the production environment into the development environment, stopped the listener on that machine — and then I got alerts that the production database was down. Well, that listener had the capability of reaching out to whatever IP address was in the configuration and shutting down the listener remotely. A nice way to shut down a production database when you didn't intend to. So it's often a good idea to fence off your development environment, so that kind of thing can't happen.
Again, have a beta or staging environment that looks as much like production as possible — and that's not development — so that once you've got something working in development, you roll your code out there, test your changes and your configuration changes, make sure they're working well, and then push to production once you're satisfied that things look good. And all of this goes through your change management system, your change auditing system, so all these changes are tracked and you can easily pinpoint when each change happened.
Okay, so that's most of these patterns. Wow — I'm glad you're still with me here. We talked about virtualized environments; they're a great help for these. Automate the building of those environments: if you can automate the creation of all of that with your magic uber-data-center script, you're getting quite a ways there — it instantiates all the pieces, instantiates all your Kickstart and provisioning, instantiates all your Puppet configs and everything like that. You can really go to town with all of that.
All right, pattern number 22: backups. Make sure you're keeping backups of the stuff that needs to be backed up. Just a couple listed there — there are probably a ton more. Amanda; and Bacula was a really interesting one when I was looking at it.
And finally, the last pattern: one tool to rule them all. This is tying it all together. Back at United Online, I was telling you about this Host Database. It started out from two facts. Fact number one was that PostgreSQL can store, as native data types, MAC addresses and IP addresses. Wow, that's really cool — let's store host information in that thing. And it kind of just blew up from there, right? So I created this whole database schema for storing all this information about our systems.
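PostgreSQL's native inet and macaddr column types reject malformed values at INSERT time, which is much of their appeal for a host database. A sketch of the same guarantee done application-side with Python's standard library — the schema and normalization rules here are assumptions, not the talk's actual Host Database:

```python
import ipaddress
import re

# Colon-separated MAC, e.g. 00:1a:2b:3c:4d:5e (one of several formats
# PostgreSQL's macaddr type accepts; we only handle this one).
MAC_RE = re.compile(r"^([0-9a-f]{2}:){5}[0-9a-f]{2}$", re.I)

def validate_host(hostname, ip, mac):
    """Validate a host record the way inet/macaddr columns would reject
    bad values at INSERT time. Returns a normalized tuple or raises
    ValueError."""
    addr = ipaddress.ip_address(ip)      # raises ValueError if malformed
    if not MAC_RE.match(mac):
        raise ValueError(f"bad MAC address: {mac}")
    return (hostname.lower(), str(addr), mac.lower())
```

Letting the database (or this gatekeeper) refuse garbage means every downstream consumer — DNS generation, Kickstart, monitoring — can trust the data.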
Fact number two was that Kickstart — the PXE system with Red Hat and Kickstart — can read Kickstart files over HTTP. So as a machine came up, it would make an HTTP request, and the config that came back said: hey, go fetch your Kickstart file from here. Well, that HTTP URL it fetched happened to be a CGI, which would dynamically determine, based on who was asking, which Kickstart file to hand out. So it would hand out the Kickstart file, and off the machine would go and install CFEngine, and then CFEngine would take it from there. So you're able to tie all this stuff together. I love putting this stuff together — I love integrating all this stuff.
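The heart of that dynamic-Kickstart CGI is a lookup from the requesting host to a profile. A sketch of the selection logic — the host table, roles, and template names are made up for illustration; the real Host Database looked these up in PostgreSQL:

```python
# Hypothetical host catalog, keyed by the IP the PXE client requests from.
HOSTS = {
    "10.0.1.21": {"role": "web"},
    "10.0.2.7":  {"role": "db"},
}

# Role -> Kickstart template to serve back over HTTP.
TEMPLATES = {
    "web": "ks-web.cfg",
    "db":  "ks-db.cfg",
}

def kickstart_for(client_ip):
    """Return the Kickstart file name to serve to this client,
    falling back to a generic profile for unknown machines."""
    role = HOSTS.get(client_ip, {}).get("role", "generic")
    return TEMPLATES.get(role, "ks-generic.cfg")
```

Wrapped in a CGI handler, this is the whole trick: the install image a box receives is driven by the asset database rather than by hand-edited files.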
So let's talk for a minute about this dream I have of an uber data center automation tool and host management system — the ideal data center management system. It wraps together all the stuff we've talked about tonight, all these tools you can use to fulfill all of these patterns. The architecture would look something like this: you'd have some kind of data storage at the bottom, with, of course, an audit trail and versioning. You'd have an engine that provides modular functionality — so if I wanted monitoring, or graphing, or whichever of these pieces, I could plug it in and it would just fit into the whole system — with a nice little REST API sitting on top. Then I could write a web UI, a Flash UI, a command-line interface, a GTK UI, or a mobile interface, whatever I want, and plug it into that. So you'd have an architecture something like that.
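That modular-engine-under-a-REST-API layering can be sketched as a plugin registry plus a tiny path dispatcher. Everything here — module names, paths, return shapes — is invented to show the structure, not a real API:

```python
# Modules plug themselves into the engine by name.
REGISTRY = {}

def module(name):
    """Decorator: register a function as an engine module under `name`."""
    def register(fn):
        REGISTRY[name] = fn
        return fn
    return register

@module("monitoring")
def monitoring(host):
    return {"host": host, "status": "ok"}

@module("graphing")
def graphing(host):
    return {"host": host, "graphs": ["cpu", "mem"]}

def dispatch(path):
    """Map a REST-ish path like '/monitoring/web1' onto a module call."""
    _, name, host = path.split("/", 2)
    if name not in REGISTRY:
        return {"error": 404}
    return REGISTRY[name](host)
```

Any UI — web, CLI, mobile — only needs to speak paths; new capabilities arrive by registering another module, which is the whole point of the design.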
Among other things, and in no particular order: it would be able to manage all of our hosts — our network-connected devices. We'd be able to specify all sorts of details for our hosts: interfaces — and please don't make me specify just one single IP address per machine; you know what I mean here. A machine can have multiple interfaces, and each interface can have multiple IP addresses, sub-interfaces, that kind of thing. Let me manage disks and partitions, memory, CPU, motherboard — all the details I need to store about a host, all that inventory information. I should be able to create timestamped log entries about a host: I rebooted this guy today, I kickstarted this guy today, whatever — keep a running log of everything that's happened on that system.
I should be able to visualize where a host lives in the data center — so, rack diagrams that get auto-generated from this stuff. I don't know, maybe we'll have proximity sensors someday, so that when we rack a machine it can automatically phone home and say: hey, I live here. Maybe somewhere down the road. Maybe in the cloud I can make my virtual rack diagrams. Anyway, it should be visual — think Visio stencils of what my machines look like, so I can tell the remote-hands guy at 2 a.m. where the power button is and where the plugs are on this thing. It should be able to spit out spreadsheets of those rack layouts, so I can send them up to the suits, up to my bosses, whatever.
I should be able to easily modify DNS information — a really nice, intuitive DNS configurator in there. I should be able to add and configure all that stuff, and when I add a new host, it should automatically add DNS records for me. It should keep me from shooting myself in the foot: adding the trailing dots to hostnames, bumping the serials when I make any kind of change. It should let me roll back to a known working DNS configuration, and log any DNS changes. Logging DNS changes — that would be fantastic.
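Serial bumping is a good example of a foot-gun a zone editor can remove. The common convention encodes the date plus a change counter as YYYYMMDDNN; a sketch of automating it (the convention is standard practice, though the talk doesn't name it):

```python
from datetime import date

def bump_serial(serial, today=None):
    """Bump a YYYYMMDDNN-style SOA serial: start a new NN sequence on
    a new day, or increment NN for another change the same day."""
    today = today or date.today()
    datepart = int(today.strftime("%Y%m%d")) * 100
    if serial < datepart:
        return datepart + 1          # first change today
    return serial + 1                # another change the same day
```

Because secondaries only transfer the zone when the serial increases, doing this automatically on every edit is exactly the kind of invisible correctness the dream tool should provide.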
It should also keep track of all the network subnets, so that when I plug in a machine's interface information, it magically shows up in the subnet view. And I should be able to auto-detect uncataloged hosts on the network and assimilate them into the system.
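A subnet view like that falls out of the host catalog almost for free with the standard library. The subnets and hosts below are invented for illustration:

```python
import ipaddress

# The configured subnets the data center knows about (illustrative).
SUBNETS = [ipaddress.ip_network("10.0.1.0/24"),
           ipaddress.ip_network("10.0.2.0/24")]

def subnet_view(host_ips):
    """Group (hostname, ip) pairs under the configured subnets.
    Hosts that fit no subnet come back separately — candidates for
    the 'assimilate uncataloged hosts' step described above."""
    view = {str(net): [] for net in SUBNETS}
    uncataloged = []
    for host, ip in host_ips:
        addr = ipaddress.ip_address(ip)
        for net in SUBNETS:
            if addr in net:
                view[str(net)].append(host)
                break
        else:
            uncataloged.append(host)
    return view, uncataloged
```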
I should be able to control a large number of aspects of each host, since my system is tied into Puppet or CFEngine or something. From that command center, I should be able to make any of the changes and modifications I could make in Puppet or CFEngine directly. I should be able to install and uninstall packages. I should be able to enable and disable cron entries — all of this, again, is Puppet and CFEngine territory. I should be able to manage user accounts from this system: this admin's acting up, he did something bad, let's disable his accounts. I should be able to manage file and directory permissions; if we've got bad files out there, I should be able to get rid of them, or touch files that need to exist. I should be able to control which processes are running, and which services start up on a machine when it boots. I should be able to manage my kickstarting and provisioning from this system — it should spit out Kickstart files for me.
I should be able to choose a particular box and choose a type of image or a system profile — or have it suggest one to me based on the hardware profile I'm looking at: I've got a Dell 1950 and I need a web server; go throw the right image on it. I should be able to manage machines in multiple data centers around the globe — I've got one in China, I've got one in LA, whatever — and we should have granular access control, if you don't want the level-one sysadmins having full control just yet.
I should be able to manage system monitoring in this system: it should auto-generate my Nagios configuration and push that out to my Nagios servers, or whatever I'm using for monitoring. For a given host, I should be able to see all the graphs associated with that host, via MRTG or Ganglia — both at the host level and at an arbitrary host-grouping level. I should be able to group by machine type or by operating system type, and then make configuration changes, or see graphs aggregated up for that specific group. So if I've got a set of customers that ride on machines A, B, and C, and I just want to see the stats for those machines, it should pull those stats, aggregate them, and show them to me.
I should be able to fire off mass execution of commands from this system — a little thing where I can say: I want to run this on this arbitrary grouping of hosts. And, again on an arbitrary grouping, I should be able to visually see which machines my command executed successfully on. How many times have you used a mass execution tool and it only takes on two-thirds of your systems, and then you've got to go chase down the one-third that didn't take? So I should be able to see it visually — little red and green dots. Func has that kind of capability, being able to check that, yes, this actually happened on that machine. Good stuff. And then, optionally, I should be able to check the command output on a particular machine, to see what happened when I sent that command to it.
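The red/green report above is, at bottom, fan-out plus a partition of the results. A sketch where `run` stands in for whatever actually executes the command on a host (ssh, Func, and so on — the transport is deliberately abstracted away):

```python
from concurrent.futures import ThreadPoolExecutor

def mass_execute(hosts, run):
    """Run a command on every host in parallel; return the green list
    (succeeded), the red list (needs chasing down), and each host's
    output. `run(host)` must return (exit_code, output)."""
    def one(host):
        code, output = run(host)
        return host, code == 0, output

    with ThreadPoolExecutor(max_workers=16) as pool:
        results = list(pool.map(one, hosts))   # preserves host order

    green = [h for h, ok, _ in results if ok]
    red = [h for h, ok, _ in results if not ok]
    outputs = {h: out for h, _, out in results}
    return green, red, outputs
```

The red list is exactly the one-third that "didn't take" — handed back to you instead of discovered by surprise later.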
This system should be integrated with our ticketing system, so for a given machine or host on the network I can see what tickets are associated with it. I should be able to manage virtual machines with it as well. It should be able to control a cloud-based data center as well as a physical data center — so I can point it at EC2 or one of the other cloud services. Tim Bray, I think, is working on — I think it's Sun developing a protocol — something that's compatible with Amazon's EC2 but is a really well-designed protocol for just this sort of thing. I should be able to tag or group machines and apply different actions to each tag or group. It should have a wiki bolted onto it, so I've got my knowledge base built in.
And it ought to have a red glowing orb, to remind me occasionally who is really in control. Again, it goes back to pattern number 13, right? Where is that? Okay — slide malfunction. Sorry, this one was going to be so good, too. There we go. Okay, so there it is — my red glowing orb, right? But of course, since we're working on maturity, it needs to be green. And since we're all about marketing and branding, we've got to make sure it's the right green.
Okay — my system should be integrated with version control, so I can time-shift, roll back or roll forward, to different configuration states. It should integrate with my change management system, so when the Sarbanes-Oxley auditors come in, I can point at it and say: yes, these changes happened at this point in time, by this person. It should integrate with the bug tracking systems the developers are using. I should be able to remotely power-cycle machines from there, and control those aspects. I should be able to open a remote console through it — through the iLO on HP hardware, or through the remote access controller on Dell hardware, that kind of thing. It should be modular and easily extensible, for anything I happened to forget today.
And I should have nice reports I can print out and send up to management. I should be able to export data from the system. I should be able to arbitrarily group hosts together, and the views should adapt themselves: if I'm looking at performance graphs and I want this particular group, it should aggregate the data based on that group. In essence, the system should tie together everything we discussed tonight, fulfilling all those patterns in one nice, easy-to-use control system.
In an ideal world, right? Again, it comes back to sufficiently advanced technology being indistinguishable from magic — but we can make it happen. With all the stuff happening in cloud computing, we're getting closer to this idea of the uber data center becoming a reality. The more we can abstract all the cruft away, the more we can focus on our business and on what we're doing. And when data center operations become a utility, what then becomes possible? My position is that, with all the open source tools available, we're getting very close, to some extent. There's always going to be hardware you have to deal with, and all of that. But here are some other fun things to look at — and again, these slides will be available on my website. 3Tera is an application utility provider: they essentially virtualize a data center and provide kind of cloud-based load balancers and databases and servers — fun stuff. Amazon's cloud services; RightScale kind of ties off of that.
Thank you for coming tonight. My blog's up there; I'll be posting the slides on it. I'm on Twitter at danhanks, and there's my email. Thanks for staying with me here — I hope it's been worth your time, and I hope you've learned something you can take home and use in whatever you happen to be doing. Thank you.
Thank you for listening to Hacker Public Radio. HPR is sponsored by caro.net, so head on over to C-A-R-O dot N-E-T for all of your hosting needs.