Episode: 1112
Title: HPR1112: LiTS 017: split
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr1112/hpr1112.mp3
Transcribed: 2025-10-17 19:09:21

---
Welcome to Linux in the Shell, Episode 17, produced proudly in conjunction with Hacker
Public Radio.
If you want real tech, it's all over there at Hacker Public Radio, and they're always looking
for contributors, people just like you and me, to help out.
Anyway, my name is Dann Washko, and you are listening to Linux in the Shell, Episode
Number 17, and we are going to cover the split command today.
The split command is a handy command that allows you to split a large file, or actually
standard input if you want to, into smaller files.
Like it says: split.
Now, generally you're going to want to run this, like I said, on a larger file, and you
can split it into smaller files.
It's really handy if you need to transfer a very large file from one location to another
and you are very limited on the space on the media that you have available to you,
or in the storage capacity of the resources that you have for uploading purposes.
I ran into this problem a little while back at work, where I had three and a half gigs
of data I needed to transfer and, due to the restrictions placed by security, I only
had less than a gig of file space to transfer from one system to another.
That was going from one system to a jump station, to another jump station, to a fourth
system, and my maximum capacity was about 800 megs in any one location, so I had to split
that file up into smaller files using the split command, and it made it really easy.
For example, split by default splits on lines of a file: every thousand lines it starts
a new file. But you can adjust this with different parameters, different switches.
If you just pass split some file name, it's going to split that file up into thousand-line
increments and push those out into smaller files. The labeling starts with x and then aa:
xaa, xab, xac, and so on through the alphabet until it gets to z. Then, I believe, it
increments the first letter of the suffix, so xb, xc, xd, and so on through z.
By default, if you exhaust all of that, it then goes to a third character, then
a fourth character, so it will be xaaa, xaab, xaac. That's the way
it goes by default, but we're going to talk about the switches now and how you can manipulate
the defaults.
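As a rough sketch of the default behavior described above, assuming GNU split and a made-up sample file (sample.txt and the temporary directory are illustrative, not from the show):

```shell
# Work in a throwaway directory so the pieces are easy to inspect and delete.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample input: 2,500 numbered lines.
seq 1 2500 > sample.txt

# No options: 1,000 lines per piece, default prefix "x", two-letter suffix.
split sample.txt

# xaa and xab should hold 1,000 lines each, and xac the remaining 500.
wc -l xaa xab xac
```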
The -b switch will split on bytes and, like some of the other commands, you can pass a
size in kilobytes or megabytes using a suffix: capital M for megabytes,
capital K for kilobytes, capital G for gigabytes. So let's say you want to split the file every
10 megabytes: it would be split -b 10M. Okay, so just capital
K, M, G, T, and on up through terabytes, petabytes, exabytes, and so on; it's all available.
Those are in 1,024-byte increments: a kilobyte is 1,024 bytes, a megabyte is 1,024 kilobytes, and so on.
There's another option, which is -b, some number, and then capital MB,
capital KB, or capital GB. Those are in thousands,
so instead of 1,024 bytes it's 1,000 bytes. Just be aware that you can split
on powers of 1,024 or powers of 1,000. So again, it's split -b 50M, capital M
for megabytes, and you'll get your file split up into 50-megabyte increments.
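To see the difference between the 1,024-based and 1,000-based size suffixes, here is a small sketch assuming GNU split. The file blob.bin and the prefixes binary_ and decimal_ are made up for the demonstration.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample input: exactly 5,000 bytes of zeros.
head -c 5000 /dev/zero > blob.bin

# 1K means 1,024 bytes per piece.
split -b 1K blob.bin binary_

# 1KB means 1,000 bytes per piece.
split -b 1KB blob.bin decimal_

# Compare the size of the first piece from each run.
wc -c binary_aa decimal_aa
```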
There is an option, -l, which splits on lines, and you can specify any number. The
default is a thousand, but say you had a poem that was maybe
a hundred lines long and you did -l 20: it's going to split that poem up into 20-line
files, each file of 20 lines. It splits as equally as possible,
so it'll split on those lines and do its best to preserve a full line of characters
and not split in the middle of a line. And there's -C, capital C, or
--line-bytes, and that splits on the characters of lines. The difference is that if you
run it on an ASCII file, it's going to split on the individual characters of a line. So if
you were to specify -C 5, it's going to take the first five characters of that
line into one file, the next five into another file, the third five into another, and go in five-character increments.
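A minimal sketch of both switches, assuming GNU split. The poem is simulated with seq, and the prefixes lines_ and chars_ are made up. Note that GNU split documents -C as a byte limit per output file that keeps whole lines together when they fit, so a line is only cut into pieces when it is longer than the limit, as it is here.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical stand-in for the 100-line poem.
seq -f "line %g of the poem" 1 100 > poem.txt

# -l 20: five pieces of 20 whole lines each (lines_aa .. lines_ae).
split -l 20 poem.txt lines_
wc -l lines_*

# -C 5: no piece may exceed 5 bytes, so this longer line is cut into
# 5-byte chunks.
printf 'abcdefghijkl\n' > oneline.txt
split -C 5 oneline.txt chars_
head -c 5 chars_aa; echo
```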
So just like it would on lines, it's going to do it by characters. All right, now that
is -C, capital C. So -b is on bytes, -l is on lines, and -C is
on characters, or line bytes. And be aware of something: I've had issues in some older
versions of split, particularly on CentOS, where when I tried to do a -b, splitting
on 500 megabytes, -b 500M or even 500MB, it would not work. I actually
had to convert that into bytes, so 500 times 1,024 times 1,024 to get how many bytes it would
be, and that's the way I had to pass it to split for it to work properly, for some reason.
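The byte conversion mentioned above can be left to the shell's arithmetic. This is only a sketch: bigfile.tar.gz is a placeholder name, so the split command is printed rather than run.

```shell
# 500 megabytes (1,024-based) expressed as a plain byte count.
bytes=$((500 * 1024 * 1024))
echo "$bytes"

# The file name below is a placeholder, so the command is only printed here.
echo split -b "$bytes" bigfile.tar.gz
```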
But I've done this on newer versions and have not had a problem with that. Now, the default
output is xaa, xab, xac. The x is called the prefix, and the aa is the suffix,
okay? You can alter either one of those using switches. If you want to change
the default prefix from x to anything else, then at the end of the split command you just
say what you want that prefix to be. So you do split -b 500M, the file name, a space, and
then the prefix you want it to be, and we'll just say mydata. What you end up having
is, instead of x, mydataaa, mydataab, mydataac, and so on. That's how easy it
is to specify a different prefix. You can change the suffix, which by default is aa, with the
switch -a or --suffix-length, and instead of it being aa, it's whatever
number that you put after it. Now let me rephrase that: instead of being the two characters
aa, it will use however many characters you put after the switch. So if you
did a -a 4, it would be aaaa, and then aaab, aaac. If you did
-a 1, it would be a, b, c, d, and so on. Now, if you'd rather have numeric
suffixes, you just do --numeric-suffixes, or -d, and then
you provide a number that you want to start at. So, for instance, if you wanted to start
at the number five for some reason, instead of starting at one, you would do a split
-d and, if you left the prefix at x, it would do x5, x6, x7,
and so on, and continue upwards. So it would use numbers instead of alphabetical characters.
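A short sketch of these naming options, assuming a reasonably recent GNU split. The input file and the prefixes mydata_, long_ and num_ are made up. In GNU coreutils the starting number is attached to the long form, --numeric-suffixes=FROM, while plain -d takes no value and starts at zero.

```shell
workdir=$(mktemp -d)
cd "$workdir"
seq 1 60 > mydata.txt   # hypothetical sample input

# A custom prefix goes after the input file name: mydata_aa, mydata_ab, mydata_ac.
split -l 20 mydata.txt mydata_

# -a sets the suffix length: four letters gives long_aaaa, long_aaab, long_aaac.
split -l 20 -a 4 mydata.txt long_

# Numeric suffixes starting at 5: num_05, num_06, num_07.
split -l 20 --numeric-suffixes=5 mydata.txt num_

ls
```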
All right, well, the default numbering starts at zero, so my mistake there; let's get that
right. So if you wanted something else besides x, you're really kind of stuck with x, but
you can add an additional suffix part to that. So whatever your suffix, or your prefix,
I'm sorry; you're not stuck with the x, because you can change it by specifying the prefix.
But if you wanted an additional suffix part in there, one that would be appended to
everything, kind of like the prefix, then the suffix, you can do --additional-suffix=
and a value. So let's say you were doing an album, and you'd
say, I don't know, split -b 5M --additional-suffix=-welcome-to-my-nightmare,
a space, the file name, and then a space and AliceCooper. That would give
you AliceCooper as a prefix, -welcome-to-my-nightmare as an additional suffix, and then each
individual increment of that original large file would be aa, ab, ac, and so on.
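Here is a sketch of that album example on a small made-up file, assuming a GNU split new enough to have --additional-suffix. The GNU documentation describes this option as appending the extra suffix to the output file names, so it lands after the incrementing letters; listing the directory shows the exact order on your system.

```shell
workdir=$(mktemp -d)
cd "$workdir"
seq 1 30 > album.txt   # hypothetical sample input

# Prefix "AliceCooper_" plus an extra fixed suffix on every piece.
split -l 10 --additional-suffix=-welcome-to-my-nightmare album.txt AliceCooper_

# List the resulting names to see where each part ends up.
ls
```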
So you get the idea: --additional-suffix allows you to squeeze another suffix
in there along with the incrementing suffix, and that's the way that works. You might say to
yourself, all right, so let's say I do a suffix length of one, and then split my file up into
something that exceeds 26 files. Like, I have a hundred-line poem, and I say split
-l 1 on the hundred-line poem, and I also
pass it a -a of 1, so that it only uses one character in the suffix. Well, what will
happen is it will go all the way through from a to z,
xa, xb, xc, and so on, and then it's going to dump out an error that says "split: output
file suffixes exhausted". So you can exhaust the suffix if you're not careful.
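The exhausted-suffix case is easy to reproduce. This sketch assumes GNU split, simulates the poem with seq, and records whether split succeeded and how many pieces it managed to write before stopping.

```shell
workdir=$(mktemp -d)
cd "$workdir"
seq 1 100 > poem.txt   # hypothetical stand-in for the 100-line poem

# One line per piece with a one-letter suffix allows only 26 names (xa .. xz).
if split -l 1 -a 1 poem.txt; then
    status="succeeded"
else
    status="failed"
fi

# Report what happened and how many pieces were written before split stopped.
pieces=$(ls x? | wc -l)
echo "split $status after writing $pieces pieces"
```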
Just be cognizant of that, okay? Now, what we have looked at so far are the switches to split files
based upon a predefined size, where we're saying split this file up into as many 50-megabyte files
as possible. There is a reverse of that you can do, and that's the -n switch.
Instead of splitting up the file into pieces of a specified size, you can split it up into a specified
number of files. So if you were to do split -n 5 mypoem, it would split my poem up into
five files, and it would try to keep them as equal as possible until, of course, it got to the last
file, the last split file that it comes up with, which picks up whatever remainder is left. Now, there are
different formats that you can apply to the -n, and we're going to cover those right now.
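The plain -n form with just a number can be sketched like this, assuming a GNU split that supports -n. The file mypoem.txt is a made-up stand-in. The pieces are cut by byte count, so a line may be divided between two files.

```shell
workdir=$(mktemp -d)
cd "$workdir"
seq 1 100 > mypoem.txt   # hypothetical sample input

# -n 5: five pieces of roughly equal byte size (line boundaries are ignored).
split -n 5 mypoem.txt

# Show the byte count of each piece.
wc -c x??
```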
By default, -n and a number splits it into that number of files. Now, there's a format
which is K/N, which still divides the file up into smaller pieces,
but only piece K is output, and it goes to standard out. That might seem kind of odd,
and you might not know why you'd want to do it, but the format is like split -n
3/10 mypoem. What that would actually do is, instead of creating 10 files from my poem,
it would work out the equivalent 10 pieces, and where we said
-n 5/10, it would take what would be that fifth file and output it straight to
standard out. That's the way that works. So -n, some number, slash another number, would
take that file, split it up into the second number of pieces, and then, instead of writing them all out
to files, take the piece given by the first number and write it out to standard out.
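One way to convince yourself of what K/N prints is to compare it against the matching file from a full split. This is a sketch assuming GNU split; mypoem.txt and fifth.txt are made-up names.

```shell
workdir=$(mktemp -d)
cd "$workdir"
seq 1 100 > mypoem.txt   # hypothetical sample input

# Write all ten pieces to files for comparison: xaa .. xaj.
split -n 10 mypoem.txt

# 5/10 prints only the fifth of ten pieces to standard output.
split -n 5/10 mypoem.txt > fifth.txt

# The fifth file from the full split is xae; cmp is silent when they match.
cmp fifth.txt xae && echo "fifth piece matches xae"
```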
The other format is -n l/N, l slash some number, and the way that works is that it will
split the file up into that number of files, and it will do it by lines,
and it will preserve the lines. It will not split within a line. So split
-n l/5 mypoem will split my poem up into five files based on lines, but it will not
split in the middle of a line; it will preserve each line. So you will get files of slightly varying
size because it's trying to preserve those lines. Then there's -n l/K/N, which acts just like
l/N does, splitting on lines, and like K/N does, outputting the numbered piece instead of writing out the files.
So split -n l/5/10 mypoem is going to split my poem up into 10 pieces, not
write them out, preserving lines, and take what would be that fifth file and
spit it out to standard out. That's the way it works. Kind of wacky. Kind of, kind of wacky.
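Both line-preserving forms can be tried on a small made-up file. This sketch assumes GNU split and gives every line the same ending so that a broken line would be easy to spot.

```shell
workdir=$(mktemp -d)
cd "$workdir"
seq -f "line %g of the poem" 1 100 > mypoem.txt   # hypothetical sample input

# l/5: five pieces, cut only at line ends, so sizes vary a little.
split -n l/5 mypoem.txt
wc -l x??

# l/5/10: print only the fifth of ten line-preserving pieces to standard output.
split -n l/5/10 mypoem.txt > fifth.txt

# Count the complete lines in the extracted piece.
grep -c "of the poem$" fifth.txt
```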
There you go. Now the final format is -n r/N, or -n r/K/N.
What r does is act like l in splitting the file on lines without breaking the lines,
but it does it in a round-robin distribution, okay? So the format is -n r/ and some number,
or -n r/, some number, slash another number. For instance, split -n r/5 mypoem
is going to split my poem into five files, okay? And the way that this is going to work
is that it splits on lines. What you had before
is that it would take the first five lines and put them in a file, the second five lines and put them in
a second file, the third five lines and put them in a third file. What you have going on now is it takes the first
line to file one, second line to file two, third line to file three, fourth line to file four, fifth line
to file five, and then comes back around: sixth line to file one, seventh line to file two, eighth line
to file three, and so on. So it does it in a round robin: each line gets
written to a different file, then it wraps back around, okay? That's the way it works. So instead of it
being the first five in the first file, the second five in the second file, the third five
in the third file, it's first line in the first file, second line in the second, third in
the third, then it wraps back around, fourth line in file one, and so on. So it does it in a round robin.
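Round-robin distribution is easiest to see with numbered lines. This sketch assumes GNU split and uses a made-up ten-line file, so each of the five output files ends up with two lines.

```shell
workdir=$(mktemp -d)
cd "$workdir"
seq 1 10 > mypoem.txt   # hypothetical ten-line input: 1, 2, ... 10

# r/5: deal the lines out to five files in turn.
split -n r/5 mypoem.txt

# xaa should hold lines 1 and 6, xab lines 2 and 7, and so on.
paste -s -d ' ' xaa
paste -s -d ' ' xab
```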
So if you did the K/N format, so it would be like -n r/5/10, it splits the file up into
10 pieces, doesn't write them out, does the round-robin thing, and the
fifth piece gets spit to standard out. So -n r is round robin; just be aware of that.
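The r/K/N form can be sketched the same way, assuming GNU split and a made-up thirty-line file. With ten round-robin piles, the fifth pile receives every tenth line starting from line 5.

```shell
workdir=$(mktemp -d)
cd "$workdir"
seq 1 30 > mypoem.txt   # hypothetical thirty-line input

# r/5/10: deal lines round-robin into ten piles and print only the fifth pile.
split -n r/5/10 mypoem.txt > fifth.txt

# Show the extracted lines on one row.
paste -s -d ' ' fifth.txt
```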
That's how that works. You can do some wacky things with split so that, instead of spitting the pieces out
to files, you pass them to a command, using the
--filter switch. For instance, you could do split -l 10 mypoem and --filter=
and in there, in double quotes, cat and $FILE, close double quotes. What that will
do is split that poem into 10-line pieces and pass each one through the filter to cat.
What gets passed to the filter is always the variable $FILE,
and cat is told to take that variable and concatenate it, so it'll just spit the
poem out file by file, each one of those pieces. There is a -e option, or --elide-empty-files;
elide means to omit, so it will suppress the generation of empty,
zero-byte-length files. For instance, let's say you had a file with 10 lines in it and you did
split -n 100 on that 10-line file: it would produce 100 files from the 10-line file, over half of which
would be empty files, because there's not enough data to split across 100 files. Whereas if
you pass the -e option in there, it would only generate files that are greater than zero bytes.
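Both --filter and -e can be tried on a small made-up file. This is a sketch assuming a recent GNU split and gzip being installed. It swaps in a gzip filter instead of the cat example from the show, because in GNU split the filter command receives each piece on its standard input while $FILE holds the output name, and it uses the line-based l/100 form so that empty pieces actually occur.

```shell
workdir=$(mktemp -d)
cd "$workdir"
seq -f "this is line number %g" 1 10 > tenlines.txt   # hypothetical 10-line input

# --filter runs a command per piece. Single quotes stop this shell from
# expanding $FILE before split can set it.
split -l 5 --filter='gzip > $FILE.gz' tenlines.txt gz_
ls gz_*

# Without -e, asking for 100 line-based pieces of a 10-line file leaves
# many empty files.
mkdir all kept
split -n l/100 tenlines.txt all/x

# With -e, the empty ones are never created.
split -e -n l/100 tenlines.txt kept/x

echo "without -e: $(ls all | wc -l) files, with -e: $(ls kept | wc -l) files"
```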
And that's how it works. Now, we have talked at length about splitting up a file. How do you get it
back together? Really easy, very easy, it's dead easy. You split your file up and you want to get
it back together: you just use the cat command and send the output of cat to the file
that you want to create, to restore it. So let's say you had a tarball
five gigs in size, and you split that tarball up into 10 files; each would be about, I'm going to
say, 500 megabytes, but that's not exactly true, it's going to be a little different than that. It's very simple
to get that back together. So we have 10 files, and they're going to be xaa, xab,
all the way through. So you just do cat xa*, a space, then the
greater-than symbol, the one pointing to the right, and your file name, .tar.gz, and it restores
it all very simply. That's how easy it is to restore a split file. Just use cat and redirect
the output back into the file that you want it to be. That is split in a nutshell. One final
thing I will say about split is that, just like just about any other command, you do not have to use a
file; it can take the standard out of a program and work that way. Very simple example:
let's say you're trying to monitor Apache logs or some other log file and you've got a lot of data
being spit out. You could just do tail -f /var/log/apache/error_log.txt. And if you do
that, tail is going to continuously show you the last lines of the log as they're being generated.
And if you're watching something that's really flying past really fast, well, you could just
pipe it to split and do split -l 50. And if you do that, as you're tailing this,
it's going to pass to split, and split is going to split it up into files that are 50 lines in length
and spit them out. So you get xaa, xab, xac, and each one of those is going to have 50
lines of what you were seeing, and when you're done, Control-C and voila, you have yourself a bunch
of little files that are split up, and maybe you can manage them a little easier to look at them.
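Reading from a pipe and putting the pieces back together can be sketched as one round trip. The log is simulated with seq rather than a live tail -f, and the names original.log, restored.log and part_ are made up.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical stand-in for a program's output: 120 log-like lines.
seq -f "log entry %g" 1 120 > original.log

# Pipe the stream into split; "-" means read standard input. 50 lines per piece.
cat original.log | split -l 50 - part_
wc -l part_*

# Put the pieces back together: the shell expands part_* in sorted order.
cat part_* > restored.log

# cmp is silent and returns success when the files are identical.
cmp original.log restored.log && echo "restored file matches the original"
```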
That's just an example. That's it for split. And I am going to split out of here today.
I hope that you enjoyed Episode 17 of Linux in the Shell. Contribute to Hacker Public Radio.
I thank them for helping sponsor the show and support the show, and I want to support them as
much as possible. You have a great week and split some files, baby.
You have been listening to Hacker Public Radio at hackerpublicradio.org.
We are a community podcast network that releases shows every weekday, Monday through Friday.
Today's show, like all our shows, was contributed by an HPR listener like yourself.
If you ever considered recording a podcast, then visit our website to find out how easy it
really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon
Computer Club. HPR is funded by the Binary Revolution at binrev.com. All binrev projects are
proudly sponsored by Lunar Pages. From shared hosting to custom private clouds, go to
lunarpages.com for all your hosting needs. Unless otherwise stated, today's show is released
under a Creative Commons Attribution-ShareAlike 3.0 license.