Episode: 1112
Title: HPR1112: LiTS 017: split
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1112/hpr1112.mp3
Transcribed: 2025-10-17 19:09:21
---
Welcome to Linux in the Shell, Episode 17, produced proudly in conjunction with Hacker
Public Radio.
If you want real tech, it's all over there at Hacker Public Radio, and they're always looking
for contributors, people just like you and me, to help out.
Anyway, my name is Dann Washko, and you are listening to Linux in the Shell, Episode
Number 17, and we are going to cover the split command today.
The split command is a handy command that allows you to split a large file, or actually standard
input if you want to, into smaller files.
Like it says, split.
Now generally you're going to want to run this, like I said, on a larger file, and you
can split it into smaller files.
It's really handy if you need to transfer a very large file from one location to another
and you are very limited on the space on the media that you have available to you,
or in the storage capacity of the resources that you have for uploading purposes.
I ran into this problem a little while back at work, where I had three and a half gigs
of data I needed to transfer, and due to the security restrictions in place, I only
had less than a gig of file space to transfer from one system to another.
That was going from one system to a jump station, to another jump station, to a fourth
system, and my maximum capacity was about 800 megs in any one location, so I had to split
that file up into smaller files using the split command, and it made it really easy.
For example, split by default splits on lines of a file: every thousand lines starts a new
piece. You can adjust that with different parameters, different switches.
If you just pass split some file name, it's going to split that file up into thousand-line
increments and push those out into smaller files. The labeling starts with x and then aa,
so xaa, xab, xac, and so on through the alphabet until it gets to xaz. Then it increments
the first letter and goes xba, xbb, xbc, and so on through xyz. By default, in current GNU
versions, if you exhaust all that, it widens the suffix by two more characters, so it
continues with xzaaa, xzaab, xzaac. That's the way
it goes by default, but we're going to talk about the switches now and how you can manipulate
the defaults.
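
The default behavior described above can be tried out with a small sketch like the following. It assumes GNU coreutils split; the scratch directory and the name bigfile.txt are made up for illustration.

```shell
# Work in a scratch directory so the generated pieces are easy to inspect.
cd "$(mktemp -d)"

# Hypothetical input: 2,500 numbered lines.
seq 1 2500 > bigfile.txt

# No options: pieces of 1,000 lines each, named xaa, xab, xac, ...
split bigfile.txt

# xaa and xab hold 1,000 lines each; xac holds the remaining 500.
wc -l xaa xab xac
```
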
The -b option will split on bytes, and like some of the other commands, you can pass the
size as kilobytes, megabytes, or gigabytes using a suffix: capital K for kilobytes, capital M
for megabytes, capital G for gigabytes. So let's say you want to split the file every
10 megabytes: it would be split -b 10M. Just the capital letter, okay: K, M, G, then T, P, E
for terabytes, petabytes, exabytes, and so on. It's all available.
Those are in multiples of 1,024, so a kilobyte is 1,024 bytes, a megabyte is 1,024 kilobytes, and so on.
There's another option, which is -b, some number, and then capital KB, capital MB, or
capital GB. Those are in multiples of a thousand, okay,
so instead of 1,024 bytes it's 1,000 bytes. Just be aware of that: you can split
on powers of 1,024 or powers of 1,000. So again, it's split -b 50M, capital M
for megabytes, and you'll get your file split up into 50-megabyte increments.
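
To make the 1,024 versus 1,000 distinction concrete, here is a small sketch, again assuming GNU split. It uses kilobyte-sized pieces so it runs instantly; data.bin is a made-up input, and the trailing bin_ and dec_ arguments are output prefixes used only to keep the two runs apart.

```shell
cd "$(mktemp -d)"

# Hypothetical input: 10,000 bytes of zeros.
head -c 10000 /dev/zero > data.bin

# K is a multiple of 1,024: each full piece is 1,024 bytes.
split -b 1K data.bin bin_

# KB is a multiple of 1,000: each full piece is 1,000 bytes.
split -b 1KB data.bin dec_

wc -c < bin_aa      # 1024
wc -c < dec_aa      # 1000
ls bin_* | wc -l    # 10 pieces: 9 full ones and a shorter last one
ls dec_* | wc -l    # 10 pieces, all exactly 1,000 bytes
```
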
There is an option, -l, which splits on lines. You can specify any number; the
default is a thousand. If you had a poem that was maybe
a hundred lines long and you did -l 20, it's going to split that poem up into 20-line
files, you know, each file holding 20 lines. Because
it splits on line boundaries, it always keeps a full line of characters together
and does not split in the middle of a line. And there's -C, capital C, or --line-bytes,
which splits on a byte count but keeps lines whole where it can: each file gets as many
complete lines as fit within the size you give. The difference shows up
when a line is longer than that size, because then the line itself gets cut up. So if
you were to specify -C 5 on a file with long lines, it's going to take the first five
characters of that line into one file, the next five into another file, the third five into
another, and go on in increments of five. All right, so
-b is on bytes, -l is on lines, and -C is
on bytes but line-aware. And be aware of something: I've had issues in some older
versions of split, particularly on CentOS, where when I tried to use -b to split
on 500 megabytes, -b 500M or even 500MB, it would not work. I actually
had to convert that into bytes, so 500 times 1,024 times 1,024, to get how many bytes it would
be, and that's the way I had to pass it to split for it to work properly, for some reason.
But I've done this on other, newer versions and have not had a problem with that. Now, the default
output is xaa, xab, xac. The x is called the prefix, and the aa is the suffix,
okay, and you can alter either one of those using switches. If you want to change
the default prefix from x to anything else, you just say, at the end of the split command,
what you want that prefix to be. So you do split -b 500M mydata, space, and
then the prefix you want, and we'll just say mydata again. What you end up having,
instead of x, is mydataaa, mydataab, mydataac, and so on. That's how easy it
is to specify a different prefix. You can change the suffix, which by default is aa, with different
switches: -a, or --suffix-length. Instead of the suffix being two characters long, it's whatever
length you put after it. Let me rephrase that: instead of being aa, two
characters, it will use as many characters as the number you give. So if you
did -a 4, it would be aaaa, and then aaab, aaac. If you did
-a 1, it would be a, b, c, d, and so on. Now, if you'd rather have numeric
suffixes, you just use --numeric-suffixes, or -d. With the long form
you can provide a number that you want to start at. So for instance, if you wanted to start
at the number five for some reason, instead of starting at one, you would do split
--numeric-suffixes=5, and if you left the prefix at x, it would do x05, x06, x07,
and so on, and continue upwards. So it would use numbers instead of alphabetical characters.
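
The line count, prefix, and suffix options can be combined, as in this sketch. It assumes GNU split (the =5 starting value needs the long --numeric-suffixes form), and poem.txt plus all four prefixes are made-up names.

```shell
cd "$(mktemp -d)"
seq 1 100 > poem.txt    # stand-in for a 100-line poem

# 20 lines per piece with a custom prefix: mydata_aa ... mydata_ae
split -l 20 poem.txt mydata_

# One-character suffix: short_a ... short_e
split -l 20 -a 1 poem.txt short_

# Numeric suffixes, which start at 00 unless told otherwise: num_00 ... num_04
split -l 20 -d poem.txt num_

# Long form with a starting value: from5_05 ... from5_09
split -l 20 --numeric-suffixes=5 poem.txt from5_

ls
```
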
All right, well, the default numbering starts at zero, so my mistake there. Let's get this
right. So if you wanted something else besides x, I said you're really kind of stuck with x, but
I'm sorry, you're not stuck with the x, because you can change it by specifying
the prefix. But if you wanted an additional suffix part in there, something that gets appended to
every file name after the generated suffix, you can use --additional-suffix=.
So let's say you were doing an album, and you'd
say, I don't know, split -b 5M --additional-suffix=-WelcomeToMyNightmare,
space, the file name, and then space, AliceCooper. That would give
you AliceCooper as the prefix, then the incrementing part for each
individual piece of that original large file, aa, ab, ac, and so on, and then
-WelcomeToMyNightmare as the additional suffix on the end.
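
A quick way to see where each part of the name lands is the sketch below. It assumes GNU split 8.16 or later, where --additional-suffix is available, and the additional suffix is placed after the generated aa, ab, ac part. The names album.bin, AliceCooper_ and -nightmare are shortened, made-up stand-ins for the album example.

```shell
cd "$(mktemp -d)"
head -c 5000 /dev/zero > album.bin    # hypothetical 5,000-byte input

# Name layout: prefix, then generated suffix, then additional suffix.
split -b 2K --additional-suffix=-nightmare album.bin AliceCooper_

# Lists AliceCooper_aa-nightmare, AliceCooper_ab-nightmare, AliceCooper_ac-nightmare
ls AliceCooper_*
```
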
So you get the idea: --additional-suffix allows you to tack another suffix
on after the incrementing suffix, and that's the way that works. You might say to
yourself, all right, so let's say I do a suffix length of one and then split my file up into
something that exceeds 26 files. Like I have a hundred-line poem, and I say split
-l 1, the hundred-line poem, and I also
pass it -a 1, so that it only uses one character in the suffix. Well, what will
happen is it will go all the way through from a to z, because it's
going to be xa, xb, xc, and so on, and then it's going to dump out an error that says "split: output
file suffixes exhausted". So you can exhaust the suffix if you're not careful; just
be cognizant of that, okay. Now, we have looked so far at the switches to split files
based upon a predefined size, where we're saying split this file up into as many 50-megabyte files
as needed. There's a reverse of that you can do, and that is the -n switch.
Instead of splitting up the file into pieces of a specified size, you can split it up into a specified
number of files. So if you were to do split -n 5 mypoem, it would split my poem up into
five files, and it would try to keep them as equal in size as possible, with any remainder making
one file slightly different in size from the rest. Now, there are
different formats that you can apply to -n, and we're going to cover those right now.
By default, -n and a number splits the file into that number of files, by bytes. Now, there's a format
which is K/N, which instead of writing the file out into smaller files works out the split
and sends only piece K to standard out. That might seem kind of odd,
and you may not know why you'd want to do it, but the format is like split -n
5/10 mypoem. What that would actually do, instead of creating 10 files from my poem,
is work out the same 10 pieces that -n 10 would create, and where we said
-n 5/10, it would take what would be that fifth file and output it straight to
standard out. That's the way that works. So -n with some number, slash, another number
takes the file, splits it up into the second number of pieces, and then, instead of writing them out
to files, takes the piece given by the first number and writes it out to standard out.
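
The two forms can be compared side by side with a sketch like this one, assuming GNU split and a made-up poem.txt. Plain -n works in bytes, so a piece may begin or end in the middle of a line.

```shell
cd "$(mktemp -d)"
seq 1 100 > poem.txt    # stand-in for the poem

# Five pieces of roughly equal size in bytes: xaa ... xae
split -n 5 poem.txt

# Print only the 2nd of 5 pieces to standard output; no files are written.
split -n 2/5 poem.txt > second.out

# The printed piece is the same data the plain -n 5 run put in xab.
cmp second.out xab && echo "2/5 matches xab"
```
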
The other format is -n l/N, l slash some number, and the way that works is that it will
split the file up into that number of files, and it will do it by lines,
and it will preserve the lines. It will not split within a line. So split
-n l/5 mypoem will split my poem up into five files based on lines, and it will not
split in the middle of a line. You will get files of slightly varying
size because it's trying to preserve those lines. Then there's -n l/K/N, which acts just like
l/N does, splitting on lines, and like K/N does, printing the numbered piece instead of writing out files.
So split -n l/5/10 mypoem is going to split my poem up into 10 pieces, not
write them out, preserving lines, and take what would be that fifth file
and spit it out to standard out. That's the way it works. Kind of wacky. Kind of, kind of wacky.
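
The line-preserving forms can be checked the same way. This is a sketch assuming GNU split; poem.txt and the lines_ prefix are made-up names.

```shell
cd "$(mktemp -d)"
seq 1 100 > poem.txt

# Five pieces that never break a line: lines_aa ... lines_ae
split -n l/5 poem.txt lines_

# Only the 2nd of the 5 line-preserving pieces, sent to standard output.
split -n l/2/5 poem.txt > second.out

cmp second.out lines_ab && echo "l/2/5 matches lines_ab"

# Every piece ends on a line boundary, so the line counts still add up to 100.
cat lines_* | wc -l
```
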
There you go. Now the final format is -n r/N, or -n r/K/N.
What r does is act like l in splitting the file on lines without breaking the lines,
but it does it in a round-robin distribution, okay? So the format is -n r, slash, some number,
or -n r, slash, some number, slash, another number. For instance, split -n r/5 mypoem
is going to split my poem into five files, okay?
The way this is going to work is it splits on lines. What you had before
is it would take the first block of lines and put them in a file, the second block of lines and put
them in a second file, the third block of lines and put them in a third file. What you have going
on now is it takes the first line to file one, second line to file two, third line to file three,
fourth line to file four, fifth line to file five, and then comes back around: sixth line to
file one, seventh line to file two, eighth line
to file three, and so on. So it does it in a round robin. Each line gets
written to a different file, then it wraps back around, okay? That's the way it works. So instead of it
being the first block in the first file, the second block in the second file, and the third block
in the third file, it's first line in the first file, second line in the second, third in
the third, then it wraps back around, fourth line in file one, and so on. So it does it in a round robin.
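
With a short numbered file the round-robin pattern is easy to see. A sketch, assuming GNU split; poem.txt and the rr_ prefix are made up.

```shell
cd "$(mktemp -d)"
seq 1 10 > poem.txt    # ten numbered lines

# Deal the lines out across five files in turn.
split -n r/5 poem.txt rr_

# paste -s joins each file's lines onto one row for easy reading.
paste -sd' ' rr_aa    # 1 6
paste -sd' ' rr_ab    # 2 7
paste -sd' ' rr_ae    # 5 10
```
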
So if you did the K/N format with it, like -n r/5/10, it splits the file up into
10 pieces, doesn't write them out, does the round-robin thing, and the
fifth piece gets spit to standard out. So -n r is round robin; just be aware of that,
that's how that works. You can do some wacky things with split where, instead of writing each piece out
to a file, you send it to a command, passing it to the input of that command, using the
--filter option. For instance, you could do split -l 10 mypoem --filter=
and in there, in single quotes, a command, say cat. What that will
do is split that poem into 10-line pieces and pass each piece through the filter to cat
on standard input. Inside the filter command, the variable $FILE
is always set to the name of the file that piece would have been written to, so the command can
use it, for example to redirect its output into a file of that name. With plain cat, it'll just spit the
poem back out, one piece after another. There is a -e option, or --elide-empty-files,
and elide means to omit, so it will suppress the generation of empty,
zero-length files. For instance, let's say you had a file with 10 lines in it and you did
split -n 100 on the 10-line file. It would produce 100 files from the 10-line file, and with so
little data many of them could be empty, because there's not enough to spread across 100 files. Whereas if
you pass the -e option in there, it would only generate files that were greater than zero bytes,
and that's how it works. Now, we have talked at length about splitting up a file. How do you get it
back together? Really easy, very easy, it's dead easy. You split your file up and you want to get
it back together, so you just use the cat command and send its output to the file
that you want to create, to restore it. So let's say you had a tarball
five gigs in size, and you split that tarball up into 10 files. Each would be about, I'm going to
say, 500 megabytes, but that's not exactly true; it's going to be a little different than that. It's very simple
to get that back together. So we have 10 files, and we're going to cat xaa, xab,
all the way through. So you just do cat xa*, with the asterisk, and space, and then the
greater-than symbol, the one pointing to the right, and your file name, .tar.gz, and it restores
it all very simply. That's how easy it is to restore a split file. Just use cat and redirect
the output back into the file that you want it to be. That is split in a nutshell. One final
thing I will say about split is that, just like just about any other command, you do not have to use a
file. It can take the standard out of a program and work that way. Very simple example:
let's say you're trying to monitor Apache logs or some other log file and you've got a lot of data
being spit out. You could just do tail -f /var/log/apache/error_log.txt. And if you do
that, tail is going to continuously show you the last lines of the log as they're being generated.
And if you're watching something that's flying past really fast, well, you could just
pipe it to split and do split -l 50. And if you do that, as you're tailing this,
it's going to pass it to split, and split is going to split it up into files that are 50 lines in length
and write each one out. So you get xaa, xab, xac, and each one of those is going to have 50
lines of what you were seeing. When you're done, hit Control-C, and voila, you have yourself a bunch
of little files that are split up, and maybe you can manage them a little easier to look at them.
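
Both ideas, splitting a stream and putting the pieces back with cat, fit in one small sketch. It assumes GNU split. Here seq stands in for a long-running producer such as tail -f, and log_ and restored.txt are made-up names.

```shell
cd "$(mktemp -d)"

# Any command's output can be piped in; "-" tells split to read standard input.
seq 1 120 | split -l 50 - log_

# log_aa and log_ab hold 50 lines each; log_ac holds the last 20.
wc -l log_aa log_ab log_ac

# The shell expands log_* in sorted order, so cat restores the original stream.
cat log_* > restored.txt
seq 1 120 | cmp - restored.txt && echo "restored copy is identical"
```

For a real transfer, comparing a checksum of the original and the restored file (for example with sha256sum) is a reasonable extra check.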
That's just an example. That's it for split. And I am going to split out of here today.
I hope that you enjoyed Episode 17 of Linux in the Shell. Contribute to Hacker Public Radio.
I thank them for helping sponsor the show and support the show, and I want to support them as
much as possible. You have a great week, and split some files, baby.
You have been listening to Hacker Public Radio at HackerPublicRadio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever considered recording a podcast, then visit our website to find out how easy it
really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon
Computer Club. HPR is funded by the Binary Revolution at binrev.com. All binrev projects are
proudly sponsored by Lunarpages. From shared hosting to custom private clouds, go to
lunarpages.com for all your hosting needs. Unless otherwise stated, today's show is released
under a Creative Commons Attribution-ShareAlike 3.0 license.