Episode: 3758
Title: HPR3758: First sysadmin job - war story
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3758/hpr3758.mp3
Transcribed: 2025-10-25 05:02:59

---
This is Hacker Public Radio Episode 3758 for Wednesday, the 28th of December 2022.
Today's show is entitled "First Sysadmin Job - War Story".
It is hosted by Norrist, and is about 28 minutes long.
It carries a clean flag.
The summary is: how I got my first job as a sysadmin, and a story about NFS.
Okay, so I thought I'd record a quick holiday episode for HPR, and I'll do kind of a combo story about how I got my first job in tech. I haven't always worked in tech; I'm currently a Linux admin. Then I'll combine that with a bit of a war story about my first week. I have for a long time, since 2000, been a Linux user, but I didn't have a Linux job that far back.
But I was working for a place that had a contract with the government, and the contract was going to end. We didn't know the specific date it was going to end, but we knew the job itself that we were there doing was only going to take about 10 years. So we all knew, when you took a job there, that at some point you were going to get laid off, and if you made it until the end, everyone was going to get laid off at the end. So even though it kind of sucks having a job where you can't work there forever, it gives you sort of a unique opportunity to plan for changing careers. Since you can look ahead and know that on or about this year you'll have to do something different, it gives you time to prep for it.
So since I had been sort of a Linux-on-the-desktop hobbyist for a long time, I thought, well, now's my chance to do what I can, and then maybe when I do get laid off, I can find a job as a Linux admin or something. So I started just kind of adding to the things that I would normally do around the house with Linux. Instead of, you know, printers and playing music and configuring X11, I would do things like trying to set up web servers or file servers, or maybe even an LDAP server, and doing virtualization and whatever I could that I thought might be things that a Linux admin would do.
The other thing I started working on was getting some certifications. I started with the Red Hat certifications. At the time it was Red Hat Certified Technician, and they've since changed it to Red Hat Certified System Administrator. I started with that, since that's kind of their entry level, and then a few years later I got the Red Hat Certified Engineer cert.
Eventually, I got laid off just like I knew I would, and I started kind of slowly looking for tech jobs. One of the jobs I applied for, they called me back pretty quick, like the next day. It turns out the company that called me was a small web development shop, and they had three Linux admins; well, they were staffed to have three Linux admins. Earlier in the year, two of them had left, not at the same time, and for different reasons. One had left, and they were kind of dragging their feet a little bit on replacing him. Then another one left, and they started getting serious about replacing them. And then there was a third guy who was kind of a junior admin. He was kind of a mix of an admin and a developer. So he was sort of a one-man show for a little while, and eventually he had decided to leave too. So at this point, they were desperate to get some new people in, because they were staffed for three people, and they were just a few weeks away from having zero.
So they were able to hire a Linux admin from a temp IT agency, but he couldn't work there forever. And they had found another, kind of senior, admin, but he wasn't going to be able to start right away because he had a job and some big projects he wanted to finish. But they needed someone to start immediately. And since I was laid off, even though I could tell they weren't really sure if I could do the job, the fact that I could start immediately really got their attention.
Like I said, it was a small web development shop. They had about 10 developers, a few project managers and designers, and a support desk so people could call in with support stuff. Most of their applications were PHP applications that ran on Linux, and they were all over the place with Linux versions. Whoever was in charge at the time would deploy whatever Linux version happened to be their favorite at the time. So there was SUSE, there was Ubuntu, there was Debian, there was Red Hat, there was Leras; it was a big mix of things. And there was also some Java and a little bit of Windows.
Like I said, they were desperate and I could start right away, so they started interviewing. I got to interview with the guy who was leaving, the last of the three that was there, and it was basically his last week. So I interviewed with him and some of the senior developers who knew a bit about Linux, and they were really careful with me for the role. I came in and did an interview with the person who was going to be my boss's boss and the developers. Then the guy who was going to be my boss, but hadn't started yet, wanted to meet me. So we met for a quick lunch interview, because he wanted to make sure that we could at least get along and that the things I said made sense to him.
Then they wanted to do something a little more technical. They had someone set up a laptop with a Linux VM on it, and they wrote out a list of tasks for me to do. It started with simple stuff like adding users and making sure they could sudo. For some reason, they also wanted me to compile from source a specific version of Apache and PHP, and they had all these kind of crazy things that they wanted me to do. They gave me a big long list and like two hours to do it. I didn't finish. The list was too long. The other thing they wanted to do, after that technical interview, was have me meet with all the managers. So again, it was my boss's boss and his boss and his boss's boss, all just kind of sat down. They asked me a few technical questions in that interview, but I think it was mostly just trying to figure out: am I for real? Is it really possible that someone who's never worked in IT before can do the job?
So obviously, since I'm telling the story, they did hire me. My first week there, it was just me and the guy from the temp agency. The third guy, who had helped with the interviews and stuff, was gone. His last day was the Friday before my first day. There was some minimal turnover, maybe a 20-page document. That, and the two or three weeks that the temp admin had before that, was really the extent of the training and turnover.
So, a little bit about the infrastructure there. All of their servers were in a data center that wasn't too far from the office, so we could go visit the data center when we needed to. It was about three racks' worth of equipment, and it was mostly virtualized. There were a few physical machines for heavy loads; databases, for example, would be on physical servers. There were a lot of ESX hosts, since we virtualized on VMware, and some storage and stuff like that.
The applications were mostly virtual machines. The PHP applications would all share a directory to get their PHP code from. And when I say get their code from, I don't mean they would copy it whenever new code was available. I mean they would just literally mount this NFS share at /var/www or wherever. That way, every application server had the exact same code all the time. And there were a few other things they would have on this NFS server: config files for some of the load balancers would be on there, application logs would be on there. It was just kind of a generic place to put things. Anything that needed to be available to more than one server was probably on this NFS server.
The NFS server was a virtual machine also. You could tell it had kind of grown over time. There are a few strategies for adding disk space to a virtual machine while it's running, and the easiest one is to just add another disk. So this NFS server had like five disks attached to it, because every time they would add a new project or something for it to do and didn't have enough space, they would just add another virtual disk to it.
The storage for the VMware cluster was kind of an oldish SAN. It was branded Sun, but this was after Oracle had bought Sun, so it was all Sun-branded stuff supported by Oracle. To maximize the available space, most of the SAN was RAID 5. So they would take a group of disks, put them together in RAID 5, and then export those RAID 5 disk bundles to VMware, and that's where VMware would store the virtual disks for the machines, including all of the application servers and this NFS server.
So even before I started, within the last year or so, there was a history of really poor performance with the PHP applications, and no one really understood why. All of the troubleshooting the previous admins did just kind of led to dead ends. But one thing we would notice when the applications were running slow was that the load average on the NFS server would climb. It wouldn't get high, like into the hundreds or anything, but where it would normally run at one or one and a half, it would go up to like four. We could look at the load average on the NFS server and, based on that, tell how well or poorly the PHP applications were running.
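As a rough illustration of using load average as a health signal the way the story describes, here is a minimal Python sketch. The thresholds are assumptions taken loosely from the numbers mentioned in the episode (about 1 to 1.5 normally, about 4 when things were slow), and the function names are hypothetical; the episode does not say what tooling was actually used.

```python
import os

# Thresholds loosely based on the numbers mentioned in the episode:
# roughly 1 to 1.5 on a normal day, around 4 when the applications were slow.
# They are illustrative assumptions, not tuned values.
NORMAL_MAX = 1.5
SLOW_MIN = 4.0


def classify_load(one_minute_load: float) -> str:
    """Map a 1-minute load average to a rough status label."""
    if one_minute_load >= SLOW_MIN:
        return "slow"
    if one_minute_load > NORMAL_MAX:
        return "elevated"
    return "normal"


def current_status() -> str:
    """Read this host's load averages (Unix only) and classify the 1-minute value."""
    one_minute, _five_minute, _fifteen_minute = os.getloadavg()
    return classify_load(one_minute)


if __name__ == "__main__":
    print(current_status())
```

Run on the NFS server itself, a check like this only says that something is waiting; as the rest of the story shows, it does not say what the processes are waiting on.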
One of our first indicators that things were going poorly was one of the office staff who processed the payments that people would make. A lot of our applications would take payments, and for the accounting personnel we had a homegrown tool that was also a PHP application and ran on the same infrastructure. They were usually the first to notice that things were going south. We would try to get them to ask, "Can you check the load average on the NFS server?", but they would usually come and scream about the load balancer instead. We did have a load balancer, but that wasn't actually the problem. Everyone was frustrated with how things were performing, and frustrated that despite all of us looking at it, no one could really figure it out. We tried a lot of different things, PHP settings and NFS settings, but nothing helped.
So we had this one application that was basically the company's flagship application, its biggest and most popular. If anyone asked what this company does and someone listed off things, this would always be in the list of things that they had made. The application took payments, and for the system it was taking payments for, there was an annual deadline, and the deadline was the same for everybody. You could pay any time during the year, but people being people, everyone would wait until the very last day to make the payment. So this particular application ran okay most of the time, but once a year, on the big day, things would always get slow. It was known that there was going to be some slowdown and some performance problems, and we would all be geared up and ready for it.
This particular year, I'm about 10 days into the job when the big day arrives, and it's terrible. It's awful. I didn't have a ton of experience there, but in my two weeks I had seen some poor performance, and this was absolutely, positively unusable. You would bring up the website, and if you could log in, as soon as you tried to do anything you would get stall after stall after stall. So it was pretty bad, and we were all desperate to figure out a solution just to get us through the day.
So remember I said that the PHP applications all had an NFS mount where they kept their code, so that they could all have the same code. The developers were pretty insistent that they wanted it that way, so they could ensure that every application server was running exactly the same. Well, we talked our managers into this: for today only, let us build some application servers that are exactly the same, except that instead of mounting the NFS share, we just copy all the files over and let these applications run totally off local disk. In reality, that's a virtual disk on the SAN we mentioned before, but it's not touching the NFS server. That quick fix got us some pretty good results. We went from unusable to actually pretty good.
Now, at the time, we didn't understand why. In our heads, we're thinking: okay, all it's doing is reading the PHP, which isn't that big, off the NFS server, and yet separating the NFS server from the application fixed the problem. We didn't understand it. One of the things we thought might be an issue was the SAN performance, but it was the same SAN. The applications reading their content directly from the SAN versus the applications reading their content from an NFS server that's on the SAN was a night-and-day difference.
So a few days after the big day, after we all had a minute and could collect our thoughts and calm down and breathe a little bit, we started trying to figure out: okay, what is it about this NFS server? Any time that server is in the mix, performance tanks. As we're digging in, we start trying to involve the developers a little bit. And one thing this application is doing that we didn't know about is logging. When I say logging, obviously we would look at the PHP logs and Apache logs. Those are things we were always looking at, trying to figure out why it was slow, and they didn't lead us anywhere. What we didn't know was that the application had another log that would log every SQL query the application ran. So if you did a select, if you just logged in and searched for your name, the query would be written to the logs. If you made a payment, that query would be written to the logs. Every query was written to the logs. And when I say the logs, that's wrong: all of those queries went to the same log file. That's sort of okay. No, that's not; that's really a bad idea.
So NFS doesn't allow multiple clients to write to the same file at the same time. If a client says, hey, I need to write to this log file, the NFS server will lock the file, let the client write to it, and then unlock the file. Because we had multiple application servers trying to write to the exact same file, the NFS server was slowing down the applications so it could queue up the writes. That was the reason we saw such big performance gains when we moved off the NFS server: the application didn't have to wait anymore before it could write to the query log.
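The contention described above came from many application servers appending to one shared file on the NFS share. One common way to sidestep that is to give every server its own log file, so no two hosts ever write to the same path. The Python sketch below illustrates the idea; it is an assumption for illustration, not what this shop did (they ended up dropping the query log entirely), and the function names, directory, and application name are hypothetical.

```python
import socket
import tempfile
from pathlib import Path
from typing import Optional


def per_host_log_path(log_dir: str, app_name: str, hostname: Optional[str] = None) -> Path:
    """Build a log path that is unique per host, e.g. payments.web01.log."""
    host = hostname or socket.gethostname()
    return Path(log_dir) / f"{app_name}.{host}.log"


def append_log_line(path: Path, line: str) -> None:
    """Append one line to this host's own log file, creating it if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(line.rstrip("\n") + "\n")


if __name__ == "__main__":
    # Demo against a temporary directory standing in for the shared mount.
    with tempfile.TemporaryDirectory() as shared_dir:
        log_path = per_host_log_path(shared_dir, "payments")
        append_log_line(log_path, "example log entry")
        print(log_path.name)
```

Writing logs to local disk and shipping them elsewhere afterwards would avoid the network share altogether, which is closer in spirit to the quick fix in the story.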
Now, when we heard about this, well, that's a bad idea for a lot of reasons, writing every query to a log. So eventually we were able to talk the developers out of logging this information, and it was a clear win for us because we were finally able to figure out what it is about this NFS server that makes these applications so bad. This particular application wasn't the only one writing to a common log file, but like I said, it was the biggest one, the one that caused the most problems, and the one that got the most attention.
After that, we were still interested in why the NFS performance was so bad and why it had gotten worse, because the application itself, writing to this common log file, had been like that for years. There was some growth in the application, but not enough growth to explain the performance drop year over year. So even though we had fixed the problem, we knew there had to be something else underlying, because the problem was getting worse and worse and worse.
We had some pretty decent monitoring. Remember I said the load average on the NFS server would go up when performance was bad; you could see it in our monitoring. We looked at graphs of load average, and we could see big spikes on busy days and drop-offs on weekends and stuff like that. We could zoom the graphs all the way out to like a year, and then we could see big days and small days. It was interesting: when you zoom out to a year, you could see a month at a time, and the lines would be pretty steady month to month to month, and then you would see a drop, and then steady again month to month, and then you might see a stair-step rise. A lot of times we would look at those and try to investigate: okay, what happened on this day that caused this stair step? One thing we really noticed was when we finally got rid of that crappy old Sun/Oracle SAN and upgraded to something considerably better. You could definitely see the load average on that NFS server change. Where I said it used to average one and maybe go up to four, now it was down around 0.2 or 0.3 and might go up to 0.8. So that was a huge difference for the application, just from changing the SAN. But there was another place in the annual graph where we could see a drop in load average, a pretty significant one, maybe about 30%, and we couldn't figure out why. A lot of times we could go back, see these stair steps, and say, oh, that was the day we changed this application, or that was a day we could understand. But there was one particular day, a few months after the big day where everything went south, where we saw a pretty steady, month-over-month, week-over-week 30% drop in load average, and we couldn't figure out why.

So I ended up working at this shop for about five years. I was always still kind of the new guy. And just about anywhere you work, if you're working in IT, if you're just a sysadmin, you're always kind of the afterthought; no one really thinks about IT unless something's broken. So I was on a team that no one ever thought about, and I was the junior guy on the team that no one ever thought about. I say that to tell you I had to change offices a lot. It was a cube farm kind of place, where there were cubes and desks and offices, and it was always nice to be able to move from a cube into an office, but then someone else would show up and they'd want my office, so I'd have to move out. Because I was the last person that was ever really considered when thinking about who was going to work in what office, I had to move offices a lot.
One time I was getting ready to move offices again, and I was cleaning out a file cabinet. The folder I was looking through was all random receipts and hardware things and stuff like that. I picked up a receipt and was trying to figure out what it was, and it was a receipt for returning a disk to Sun, or to Oracle. I'm trying to figure out, why did we do that? And then I remembered that the guy who was my boss when I first started, the guy who started on the same day as me and didn't really have any good turnover either, and who was supposed to be my senior, had done an RMA. One day he was in the data center and he saw on this Sun storage system that one of the disks had a yellow light instead of a green light. So he reported it to Sun, they sent him a replacement disk, and he sent the old bad disk back to Sun. I was staring at the piece of paperwork that documented that change, and I thought to myself, I wonder if this has anything to do with that unexpected load average drop, the unexpected performance boost on that NFS server. I looked at the date, and it was within a few days of that drop. So finally I was able to piece together that some of that NFS server's disks were on the portion of the storage system that was built using RAID 5, and the disk that he replaced was part of that array.
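The detective work above, spotting a stair step in a year of load-average data and matching it against the date on a piece of paperwork, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the window and tolerance values are arbitrary, and any data fed to it here would be synthetic rather than anything from the episode.

```python
from datetime import date, timedelta
from statistics import mean


def find_largest_step(samples, window=7):
    """Find the biggest stair step in a series of (date, load) samples.

    For each candidate day, compare the mean load over the `window` samples
    before it with the mean over the `window` samples starting on it.
    Returns (step_date, mean_before, mean_after, change), or None if there
    are too few samples.
    """
    best = None
    for i in range(window, len(samples) - window + 1):
        before = mean(load for _, load in samples[i - window:i])
        after = mean(load for _, load in samples[i:i + window])
        change = after - before
        if best is None or abs(change) > abs(best[3]):
            best = (samples[i][0], before, after, change)
    return best


def near(event_day, step_day, tolerance_days=3):
    """True if a known event (e.g. a disk swap) falls within a few days of the step."""
    return abs((event_day - step_day).days) <= tolerance_days


if __name__ == "__main__":
    # Synthetic daily averages: a steady 1.0 that drops by 30% after day 30.
    start = date(2020, 1, 1)
    series = [(start + timedelta(days=n), 1.0 if n < 30 else 0.7) for n in range(60)]
    print(find_largest_step(series))
```

Keeping a dated change log next to the monitoring graphs would make this kind of match possible without waiting for a receipt to turn up in a file cabinet.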
So the reason that the NFS performance had gotten worse year over year was that at some point during the year, without anyone noticing, a drive failed that was part of a RAID 5 array. If you don't know anything about RAID and RAID 5, what you do need to know is that RAID 5 is fine, but if you lose a single disk out of a RAID 5 array, all of your data will still be there but the performance will be terrible. It no longer has an extra disk to write the parity information to. So because this RAID 5 array was running with a bad disk, the performance was terrible, and when he swapped the disk out, that's when we could see the performance increase on the NFS server. We didn't notice it at the time.
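To make the degraded RAID 5 point a little more concrete: parity in RAID 5 is the XOR of the data blocks in a stripe, so a block on a failed disk can still be recovered, but only by reading every surviving block in that stripe and XOR-ing them together. The Python sketch below is a simplified model with assumptions called out: parity sits in a fixed position (real arrays rotate it across disks), blocks are tiny byte strings, and the function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks plus one parity block (simplified: parity is always last)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def read_block(stripe, index, failed=None):
    """Read one block from a stripe and report how many disk reads it took.

    On a healthy array this is a single read. If the block lives on the failed
    disk, it is rebuilt by XOR-ing every surviving block in the stripe.
    """
    if index != failed:
        return stripe[index], 1
    survivors = [block for i, block in enumerate(stripe) if i != failed]
    return xor_blocks(survivors), len(survivors)


if __name__ == "__main__":
    stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
    print(read_block(stripe, 1))            # healthy: one read
    print(read_block(stripe, 1, failed=1))  # degraded: rebuilt from three reads
```

In this model a degraded read costs as many disk reads as there are surviving disks, which is one reason an array running on a failed disk feels so much slower under load.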
So, a long rambling story. I don't know if you can learn any lessons from that, except maybe this: if you want to change careers, one key to doing that is to plan ahead if you can, but the real key is that you have to find someone who's desperate, desperate enough to hire someone with no experience. Always be careful when you're logging or writing to a network share. And never, ever, ever run RAID 5 in production, period. I'll see you guys next time.
You have been listening to Hacker Public Radio at HackerPublicRadio.org.
Today's show was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is.
Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive, and rsync.net.
Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International License.