Initial commit: HPR Knowledge Base MCP Server
- MCP server with stdio transport for local use
- Search episodes, transcripts, hosts, and series
- 4,511 episodes with metadata and transcripts
- Data loader with in-memory JSON storage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
328
hpr_transcripts/hpr3758.txt
Normal file
@@ -0,0 +1,328 @@
Episode: 3758
Title: HPR3758: First sysadmin job - war story
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr3758/hpr3758.mp3
Transcribed: 2025-10-25 05:02:59

---

This is Hacker Public Radio Episode 3758 for Wednesday, the 28th of December 2022.
Today's show is entitled, First sysadmin job - war story.
It is hosted by Norrist, and is about 28 minutes long.
It carries a clean flag.
The summary is, how I got my first job as a sysadmin, and a story about NFS.
Okay, so I thought I'd record a quick holiday episode for HPR, and I'll do kind of a combo
story about how I got my first job in tech. I haven't always worked in tech; I'm currently
a Linux admin. And then I'll combine that with a bit of a war story about my first week.
I have for a long time, since 2000, been a Linux user, but I didn't have a Linux job
that far back.
But I was working for a place that had a contract with the government, and the contract was
going to end.
We didn't know the specific date it was going to end, but we knew the job itself that
we were there doing was only going to take about 10 years.
So we all knew, when you took a job there, that at some point you were going
to get laid off, and if you made it until the end, everyone was going to get laid off
at the end.
So even though it kind of sucks having a job where you can't work there forever, it gives
you sort of a unique opportunity to plan for changing careers.
Since you can look ahead and know, on or about this year, I'll have to do something
different, it gives you time to prep for it.
So since I had been sort of a Linux-on-the-desktop hobbyist for a long time, I thought,
well, now's my chance to do what I can, and then maybe when I do get laid off, I can
land a job as a Linux admin or something.
So I started just kind of adding to the things that I would normally do around the house with
Linux.
So instead of, you know, printers and playing music and configuring X11, I would
do things like trying to set up web servers or file servers, or maybe even an LDAP server
or stuff like that, and doing virtualization and whatever I could that I thought might be
things that a Linux admin might do.
The other thing I started working on was getting some certifications.
So I started with the Red Hat certifications. I went and got the Red Hat, at the time
it was Red Hat Certified System Administrator, or no, at the time it was Red Hat Certified
Technician, and they've since changed it to Red Hat Certified System Administrator.
But I started with that, that's kind of their entry-level cert, and then a few years later,
I got the Red Hat Certified Engineer cert.
Eventually, I got laid off just like I knew I would, and I started kind of slowly
looking for tech jobs.
So one of the jobs I applied for, they called me back pretty quick, like the next day.
It turns out the company that called me was a small web development shop, and
they had three Linux admins, well, they were staffed to have three Linux admins.
And earlier in the year, two of them had left, not at the same time, and for different
reasons.
One had left and they were kind of dragging their feet a little bit on replacing him.
And another one left and they started getting serious about replacing them.
And then there was a third guy who was kind of a junior admin.
He was kind of a mix of an admin and a developer.
So he was sort of running the show by himself for a little while, and eventually he
had decided to leave.
And so at this point, they were desperate to get some new people in, because
they were staffed for three people, and they were just a few weeks away from
having zero.
So they were able to hire a Linux admin from a temp IT agency, but he couldn't work
there forever.
And they had found another kind of senior admin, but he wasn't going to be able to start
right away because he had a job and he had some big projects and stuff he wanted to finish.
But they needed someone to start immediately.
And since I was laid off, and even though I could tell they weren't really sure if I could
do the job, the fact that I could start immediately really got their attention.
So like I said, it was a small web development shop. They had about 10 developers, a few project
managers and designers, and a support desk so people could call in with support stuff.
Most of their applications were PHP applications that ran on Linux, and they were
all over the place with Linux versions.
Whoever was in charge at the time would deploy whatever Linux version happened
to be their favorite at the time.
So there was SUSE, there was Ubuntu, there was Debian, there was Red Hat, there was
[unclear], it was a big, big mix of things.
And there was also some Java and a little bit of Windows.
Like I said, they were desperate and I could start right away.
So they started interviewing.
I got to interview with the guy who was leaving, the last of the three that was
there.
And it was basically his last week.
So I interviewed with him and some of the senior developers who knew a bit
about Linux, and they were really careful with me, really thorough.
So I came in and I did an interview with the person who was going
to be my boss's boss and the developers, and then the guy who was going to be my boss
but hadn't started yet, he wanted to meet me.
So we met for a quick lunch interview, because he wanted to make sure
we could at least get along and that things I said made sense
to him.
Then they wanted to do something a little more technical.
So they had someone set up a laptop with a Linux VM on it.
They wrote out a list of tasks for me to do.
It was anything from simple stuff like adding users and making sure they could
sudo.
For some reason, they wanted me to compile from source a specific version
of Apache and PHP, and they just had all these kind of crazy things that they wanted me
to do. The list was long; they gave me a big long list and like two hours to do
it.
I didn't finish.
The list was too long. But the other thing they wanted to do, after
that technical interview, was have me meet with all the managers.
So again, it was my boss's boss and his boss and his boss's boss, all just kind of sat
down.
They asked me a few technical questions in that interview, but I think it
was mostly just trying to figure out, am I for real? Is it really possible
that someone who's never worked in IT before can do the job?
So obviously, since I'm telling the story, they did hire me. My first week there,
it was just me and the guy from the temp agency. The third guy, who had helped
with the interviews and stuff, he was gone.
His last day was like the Friday before my first day. There was
some minimal turnover, maybe a 20-page document.
And then the two or three weeks that the temp admin had before that was
really the extent of the training and turnover.
So a little bit about the infrastructure there.
All of their servers were in a data center that wasn't too far from the office, so we could
go visit the data center when we needed to, and it was
like three racks' worth of equipment. It was mostly virtualized.
There were a few physical machines for heavy loads; things like databases
would be physical servers.
There were a lot of ESX hosts, since we virtualized on VMware, and some storage
and stuff like that.
The applications were mostly virtual machines.
The PHP applications would all share a directory to get their PHP
code from.
And when I say get their code from, I don't mean they would copy it whenever new
code was available.
I mean, they would just literally mount this NFS share at /var/www or whatever.
That way, every application server had the exact same code all the time.
And then there were a few other things they would have on this NFS server, including
config files for some of the load balancers, and application logs would be
on there.
It was just kind of a generic place to put things. Anything that needed to be available to
more than one server was probably on this NFS server.
The NFS server was a virtual machine also.
You could tell it had kind of grown over time. There are a few strategies
for adding space to a virtual machine when it's running, and kind of the easiest one
is to just add another disk.
So this NFS server, which was a virtual machine, had like five disks attached to it, because
every time they would add a new project or something for it to do and they
didn't have enough space for it, they would just add another virtual disk to it.
The storage for the VMware cluster was kind of an oldish SAN. It was branded Sun, but this
was after Oracle had bought Sun.
So it was all Sun-branded stuff, it was supported by Oracle, and to maximize the available
space, most of the SAN was RAID 5.
So they would take a group of disks, put it together in RAID 5, and then
export those RAID 5 disk bundles to VMware, and that's where VMware would
store the virtual disks for the machines, including all of the application servers and this
NFS server.
So even before I started, in the last year before I started,
there was a history of really poor performance with the PHP applications, and no one really
understood why. All of the troubleshooting the previous admins did
just kind of led to dead ends.
But one thing we would notice when the applications were running slow was that the load average
on the NFS server would climb. It wouldn't get high, like it wouldn't get into the hundreds
or anything, but where it would normally run at one or one and
a half, it would go up to like four. We could look at the load
average on the NFS server and, based on that, tell how well or poorly the PHP applications
were running.
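The rule of thumb described here, where a load average of about 1 to 1.5 was normal and about 4 meant the PHP applications were struggling, can be sketched as a small check. This is only an illustration: the thresholds are the figures quoted in the episode, and the function name `classify_load` is hypothetical, not something the team is said to have used.

```python
import os

# Thresholds taken from the figures quoted in the episode (assumed, not measured).
NORMAL_MAX = 1.5   # "normally run at one or one and a half"
SLOW_MIN = 4.0     # "it would go up to like four"


def classify_load(load1: float) -> str:
    """Map a 1-minute load average to a rough health label."""
    if load1 <= NORMAL_MAX:
        return "normal"
    if load1 < SLOW_MIN:
        return "elevated"
    return "slow"


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    one_minute, _, _ = os.getloadavg()
    print(f"load {one_minute:.2f}: {classify_load(one_minute)}")
```

Run on the NFS server itself, a check like this would give a quick answer to "are the applications slow right now?" without needing a monitoring dashboard.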
One of our first indicators that things were going poorly was that we had one of
the office staff who processed payments that people would make.
A lot of our applications would take payments, and the
accounting personnel had a kind of homegrown tool that was also a PHP application and
ran on the same infrastructure. They were usually the first to notice that things
were going south. We would try to get them to say, can you check the load average on the NFS server,
but they would usually come and scream about the load balancer instead.
We did have a load balancer, but that wasn't actually the problem. It was clear
to us that everyone was frustrated with how things were performing and frustrated
with the fact that, despite all of us looking at it, no one could really figure it out.
We tried a lot of different things, PHP settings and NFS settings, but nothing
helped.
So we had this one application that was basically the company's flagship
application, its biggest and most popular.
If anyone asked what this company does, someone
would list off things, and this would always be on the list of things that they had made.
The application took payments, and for the system that it was taking payments for,
there was an annual deadline, and the deadline was the same for everybody.
So you could pay any time during the year, but people being people, everyone would
wait until the very last day to make the payment.
So this particular application ran okay most of the time, but once a year,
on the big day, things would always get slow.
And it was sort of known that there was going to be some slowdown and some performance problems,
and we would all be geared up and ready for it.
This particular year, I'm about 10 days into the job when the
big day arrives, and it's terrible.
It's awful.
I didn't have a ton of experience there, but in my two
weeks, I had seen some poor performance.
This was absolutely, positively unusable.
I mean, you would bring up the website,
and if you could log in, as soon as you tried to do anything, it would just stall after stall
after stall.
So it was pretty bad.
We were all kind of desperate to figure out a solution just to get us through
the day.
So remember I said that the PHP application servers all had an NFS mount where
they kept their code, so that they could all have the same code.
And the developers were pretty insistent that it
be that way,
so they could ensure that every application server was running exactly the same code.
Well, we talked our managers into, for today only, letting us build some application
servers that are exactly the same, except that instead of mounting
the NFS share, we just copy all the files over and let these applications run
totally off local disk. In reality, it's a virtual disk on that
SAN we mentioned before, but it's not touching the NFS server.
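The "copy the files over instead of mounting the share" workaround can be sketched roughly as follows. This is a minimal illustration, not the team's actual procedure: the function name and the example paths are hypothetical, and the episode does not say which tool was used for the copy.

```python
import shutil
from pathlib import Path


def deploy_local_copy(shared_code: Path, local_root: Path) -> Path:
    """Copy the shared code tree to local disk and return the new path."""
    target = local_root / shared_code.name
    # dirs_exist_ok lets a re-run refresh an existing copy (Python 3.8+).
    shutil.copytree(shared_code, target, dirs_exist_ok=True)
    return target


if __name__ == "__main__":
    # Hypothetical locations: an NFS-mounted code tree and a local web root.
    copied = deploy_local_copy(Path("/mnt/nfs/app"), Path("/var/www"))
    print(f"application code now served from {copied}")
```

The trade-off is the one the developers worried about: once each server has its own copy, something has to re-run the copy on every server whenever the code changes, and files deleted from the source are not removed by this simple refresh.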
That quick fix got us some pretty good results.
We went from unusable to actually pretty good.
Now, at the time, we didn't understand why. In our heads, we're
thinking, okay, all it's doing is reading the PHP, which isn't that big, off the NFS server,
and separating the NFS server from the application fixed the problem.
We didn't understand it.
One of the things we thought might be an issue was the SAN performance, but it was the same SAN.
The applications reading their content directly from the SAN versus the applications
reading their content from an NFS server that's on the SAN was a night and day difference.
So after we all had a minute, a few days after the big day, and we could collect
our thoughts and calm down and breathe a little bit, we started trying to figure out,
okay, what is it about this NFS server?
Any time that server is in the mix, performance tanks.
So as we're digging in, we start trying to involve the developers
a little bit.
And one thing that this application was doing that we didn't know about was logging. And when
I say logging, I mean, obviously we would look at the PHP logs and Apache logs.
Those are things we were always looking at, trying to figure out why it was slow, and
they didn't lead us anywhere.
What we didn't know was that the application had another log that would log every SQL query that
the application ran.
So if you did a select, I mean, if you just logged in and searched for
your name, the application's query would be written to the logs.
And if you made a payment, that query would be written to the logs.
Every query was written to the logs. And I say "the logs", but that's wrong.
All of those queries went to the same log file.
That's sort of okay.
No, that's not, that's really a bad idea.
So NFS doesn't allow multiple clients to write to the same file at the same time.
If a client says, hey, I need to write to this log file, the NFS server will lock the file,
let the client log to it, and then unlock the file.
So because we had multiple application servers trying to write to the exact same file,
the NFS server was slowing down the applications so it could queue up the writes.
So that was the reason we saw such big performance gains when we moved off the NFS server:
the application didn't have to wait anymore before it could write to the query log.
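One common way to keep several servers from contending for a single file on a shared mount is to give each server its own log file. The sketch below shows that idea only; the directory, the `queries-<host>.log` naming scheme and both function names are assumptions made for illustration. The episode says the developers were eventually talked out of logging queries at all, not that they did this.

```python
import socket
from pathlib import Path
from typing import Optional


def per_host_log_path(log_dir: Path, hostname: Optional[str] = None) -> Path:
    """Return a log path unique to this host, e.g. queries-web01.log."""
    host = hostname or socket.gethostname()
    return log_dir / f"queries-{host}.log"


def log_query(log_dir: Path, query: str, hostname: Optional[str] = None) -> Path:
    """Append one query to this host's own log file."""
    path = per_host_log_path(log_dir, hostname)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(query.rstrip("\n") + "\n")
    return path
```

With one file per host, no two application servers ever append to the same file, so whatever serialization the shared mount imposes on a single file no longer sits in the request path of every server at once.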
Now, when we heard about this, well, that's a bad idea for a lot of reasons,
writing every query to a log.
So eventually we were able to talk the developers out of logging this information, but it was
a clear win for us because we were finally able to figure out what it is about this
NFS server that makes these applications so bad.
And this particular application wasn't the only one that was writing to a common
log file, but like I said, it was the biggest one, it was the one that caused the most
problems, and it was the one that got the most attention.
So after that, we were still interested in why the NFS performance was so bad and
why it had gotten worse. The application itself, where it's writing to this
common log file, had been like that for years. There was some growth in
the application, but not enough growth to explain the performance drop year over year.
So even though we had fixed the problem, we knew there had to be something else
underlying, because the problem was getting worse and worse and worse.
So we had some pretty decent monitoring. Remember I said the load
average on the NFS server would go up when performance was bad. You could see it:
our monitoring had graphs of load average, and we could see
big spikes on busy days and drop-offs on weekends and stuff like that.
And we could zoom all the way out; we could zoom the graphs out to like a year, and
then we could see big days and small days. It was interesting:
when you zoom out to a year, you could
see a month at a time, and the lines would be pretty steady from month to
month to month, and then you would see a drop, and then steady month to month to month,
and you might see a stair-step rise. A lot of times we would look
at those and try to investigate, okay, what happened on this day that caused
this stair step?
One thing we really noticed was when we finally got rid of that
crappy old Sun/Oracle SAN and upgraded to something considerably better. Then you
could definitely see the load average on that NFS server drop. When I said it used
to average one and maybe go up to four, now it was down to like 0.2 or
0.3 and might go up to 0.8. So that was a huge difference for the applications, just changing
the SAN.
But there was another place when we looked at the annual graph where we could
see a drop in load average, a pretty significant one, maybe about 30%, and we couldn't figure
out why. A lot of times we could go back and see these stair steps and say,
oh, that was the day we changed this application, or that was the day we [unclear]. But
there was one particular day, and it happened to be
a few months after this big day where everything went south, when
we saw a pretty steady 30% drop in load
average, month over month, week over week, and we couldn't figure out why.
So I ended up working at this
shop for about five years. I was always still kind of the new guy.
And just about anywhere you work, if you're working in IT, if you're a
sysadmin, you're always kind of the afterthought. No one really thinks about IT unless
something's broken. So I was on a team that no one ever thought about, and I was
the junior guy on the team that no one ever thought about. I say that to
tell you, I had to change offices a lot. It was kind of a cube farm
kind of place where there were cubes and desks and offices, and it was always nice
to be able to move from a cube into an office, but someone else would show
up and they'd want my office, so I'd have to move out. Because
I was the last person that was ever really considered when thinking about who
was going to work in what office, I had to move offices a lot.
One time I was getting ready to move offices again and I was cleaning out a file cabinet,
and in the folder I was looking through, it was all kind of random receipts
and hardware things and stuff like that. I picked up a receipt and I was looking at it,
trying to figure out what it was, and it was a receipt for returning a disk to
Sun, or to Oracle. And I'm trying to figure out, why did we do that?
And then I remembered that the guy who was my boss when I first started, the guy
who started on the same day as me and didn't really have any good turnover, and who was
supposed to be my senior, he had done an RMA [unclear]. One day he was in the
data center and he saw on this Sun storage system that one of the disks had a yellow light
instead of a green light. So he reported it to Sun, they sent him a replacement disk,
and he sent the old bad disk back to Sun. I was staring at the piece of paperwork
that documented that change, and I thought to myself, I wonder if this has anything to do
with that unexpected load average drop, the unexpected performance boost on that NFS
server. I looked at the date, and it was within a few days of that drop. So finally
I was able to piece together that some of the NFS server's disks were on the portion of
the storage system that was built using RAID 5, and the disk that he replaced was part
of that array.
So the reason that the NFS performance had gotten worse year over year was that at some
point during the year, without anyone noticing, a drive failed that was part of a RAID 5
array.
If you don't know anything about RAID and RAID 5, what you do need
to know is that RAID 5 is fine, but if you lose a single disk out of a RAID 5 array,
all of your data will still be there but the performance will be terrible.
It no longer has an extra disk to write the parity information to. So because of this
RAID 5 array running with a bad disk, the performance was terrible, and then when he swapped
the disk out, we didn't notice it at the time, but that's when
we could see the performance increase on the NFS server.
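To illustrate why a degraded array is slow: in RAID 5, the parity block of a stripe is the XOR of its data blocks, so a block that lived on the failed disk has to be recomputed from every surviving block in the stripe each time it is read. The toy sketch below uses byte strings in place of disk blocks; the function names are illustrative only, and a real array also rotates parity across disks, which is left out here.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("blocks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving: List[bytes]) -> bytes:
    """Recover the lost block from the surviving data and parity blocks."""
    return xor_blocks(surviving)


if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC"]   # three data blocks in one stripe
    parity = xor_blocks(data)            # parity block for the stripe
    # Pretend the disk holding data[1] failed: rebuild it from the rest.
    recovered = rebuild_missing([data[0], data[2], parity])
    print(recovered == data[1])
```

A healthy array reads one block from one disk; the degraded array in this sketch has to read three blocks and XOR them to serve the same request, which is the extra work that showed up as load on the NFS server.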
So, a long rambling story. I don't know if you can learn any lessons from that, except
maybe: if you want to change careers, one key to doing that is to plan ahead if you can,
but the real key is that you have to find someone who's desperate, desperate enough
to hire someone with no experience. Always be careful when you're logging or writing
to a network share. And never, ever, ever run RAID 5 in production, period. I'll see you
guys next time.
You have been listening to Hacker Public Radio at HackerPublicRadio.org.
Today's show was contributed by a HPR listener like yourself.
If you ever thought of recording a podcast, then click on our contribute link to find
out how easy it really is.
Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive and
rsync.net.
Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0
International License.