Episode: 2698
Title: HPR2698: XSV for fast CSV manipulations - Part 1
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2698/hpr2698.mp3
Transcribed: 2025-10-19 07:39:52
---
This is HPR episode 2698, entitled XSV for fast CSV manipulations - Part 1. It is hosted by Be Easy, is about 31 minutes long, and carries a clean flag. The summary is: Written in Rust, xsv is my new favourite tool for manipulating CSV files.
This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.
Hello Hacker Public Radio fans, this is Be Easy once again, back with another episode. This one is focused around one of my favourite topics, which is the manipulation and handling of structured data in text format. What does that mean? Basically, messing with CSVs and stuff. It's a large part of what I do in my job and something I enjoy doing, which is weird, because who enjoys doing stuff like this? Sometimes it can be really tedious.
But it's very important when you have lots of information. Maybe you're doing a data migration, and you're exporting from one database into a flat file, doing some data munging on it, and then importing it into another file. Or maybe you've been given some data that you are supposed to put into a new system, but it came out of Excel and there were no rules around how the data was gathered, so it didn't look very good coming out. So that's what this is really about.
It's about being able to take stuff like that, and in particular I'm going to be talking about a tool called xsv. For a long time I used a different tool, csvkit, and I still do on occasion. It's a Python command-line tool that has some functionality that xsv does not, but what xsv does, it does way faster and just as well as csvkit. In a future episode I'll go over csvkit, and in another future episode I'll go over some of the more advanced features of xsv. Right now I'm going to go over something that I do pretty often, and I'm going to be using the file that is part of the xsv documentation.
xsv can be found at github.com/BurntSushi/xsv, and there is a file used in the documentation called worldcitiespop.csv, which from now on in this episode I will call world cities population. That's what it is: a file of all these different cities, what their populations are, and where they're located in the world.
So say you get a file like this. What's the first thing you want to know about it? Well, how big is it? You can go to your command line and type in something like wc -l and you'll get a line count, which is fine, but xsv has its own tool for this. So let me get to my data, and I can run either wc -l worldcitiespop.csv or xsv count worldcitiespop.csv.
For reference, the computer I'm using right now is an Intel Core i7-8700 at 3.2 gigahertz. So it's a beefy machine, 12 cores, but I've run xsv on a lot lower-spec hardware, including an Asus EeeBook, and it runs excellently across all those different machines.
In this case the file has 3,173,958 records. When you run wc -l on the same file you'll get a number one higher than this, 3,173,959, and that's because it includes the header. So one thing that xsv does is leave the header row out of the count.
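The off-by-one between the two counts can be sketched in Python. This is only an illustration of the idea, not how xsv is implemented; the sample rows and the function names `count_lines` and `count_records` are made up for the example.

```python
import csv
import io

# Tiny stand-in for worldcitiespop.csv (hypothetical sample rows).
SAMPLE = "Country,City,Population\nus,boston,617594\nus,austin,790390\nca,ottawa,812129\n"

def count_lines(text):
    """Count newline-terminated lines, header included, like a plain line count."""
    return text.count("\n")

def count_records(text):
    """Count data records only, skipping the header row."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # discard the header row
    return sum(1 for _ in reader)

line_total = count_lines(SAMPLE)
record_total = count_records(SAMPLE)
print(line_total, record_total)  # the line count is one higher than the record count
```

Parsing with a CSV reader rather than counting newlines also means a quoted field that contains a line break would still count as one record.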
If you just run xsv --help you can see all the different commands and options. There are a bunch of different commands, and we're going to focus on just a few. The first one we talked about was count. For the next one: okay, now I know how many lines are in this file, how long it is, so how wide is it? You can do something like head -1 and the file name, and that will show you what the header looks like, but xsv has its own command that has some advantages. If I run xsv headers and the name of the file, I get output with two columns: the first is the numeric position of the column, and the second is the name of that column. In this case there are seven columns: country, city, accent city, region, population, latitude and longitude.
If I run xsv headers --help (there is really nice help in this tool) you can see that it doesn't just do this. You can use -j, which hides that column index. You can use --intersect, which is really useful when you are looking at different files: you say xsv headers --intersect, file number one, file number two, and it gives you back a list of all the columns that intersect between these two files. So if there is another file that also has the word Country, spelled the same way with a capital C, it will show you that that column is in both. This is useful when you're working with database data and you have two different tables; sometimes you'll be able to see where the foreign keys are by using this type of command, and the output is just really nice.
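As a rough picture of what the numbered header listing and the intersect option report, here is a minimal Python sketch. The two sample tables and the helper names `headers` and `intersect` are assumptions made for illustration, and the matching is an exact, case-sensitive comparison, as the capital-C remark suggests.

```python
import csv
import io

# Hypothetical two-table example; only the header rows matter here.
CITIES = "Country,City,Population\nus,boston,617594\n"
COUNTRIES = "Country,Name\nus,United States\n"

def headers(text):
    """Return (position, name) pairs for the header row, counting from 1."""
    names = next(csv.reader(io.StringIO(text)))
    return list(enumerate(names, start=1))

def intersect(*texts):
    """Return the header names present in every input, in first-input order."""
    name_lists = [[name for _, name in headers(t)] for t in texts]
    shared = set(name_lists[0]).intersection(*name_lists[1:])
    return [name for name in name_lists[0] if name in shared]

city_headers = headers(CITIES)
shared_headers = intersect(CITIES, COUNTRIES)
print(city_headers)
print(shared_headers)
```

A shared name such as Country is only a hint that the column might be a join key; the values still need checking.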
There are some common options that you usually have in xsv. One is -d or --delimiter, which tells xsv what delimiter the file uses. So if it's a tab-delimited file instead of a comma-delimited file, you can specify tab instead of comma, or if it's pipe-delimited you can specify that as well. Some of the other commands also have -o or --output and a file name, so you don't have to redirect the output into a file; you can use --output and the file name, and that is another way to get the result into a file.
All right, so what have we done? We've looked at the length of the file and we've looked at the width of the file. Now I'm looking at the names of these columns. When you're doing data cleaning you always want to look for data quality issues, where there might be two different representations of the same information. Someone might have spelled the country wrong, or the city wrong, or the region wrong, and you might want to see the distinct count of all the countries, for instance.
xsv makes that really easy. It's not the case in this file, but say that United States was spelled out in full and there were two different versions in here, one with a capital U and a capital S and another with a capital U and a lower-case s. When you run this command you would see two different records, one with three million and one with a hundred thousand, and you'd say: oh look, it looks like a hundred thousand of the records should have the same value as the other one that has three million. So that's something you can go back and clean up, then get the output again and resume your data cleaning. I've just got to tell you, I do this way too often. If you're not into that kind of stuff you're probably pulling your hair out when you hear it, but it is really useful and I enjoy doing it.
But let's get back to it. I run xsv frequency, which is the name of the command, then -s, which means select, then country, which is the name of the column I want distinct values for, and then the name of the file, world cities pop, and it gives me a list of all the values of country. In the output there are three columns, each delimited by a comma. Most of the time when we're dealing with xsv, the output of a command is a CSV file, or a CSV object on standard out, and you either put that into a file or pipe it into another command. In this case the three columns are field, value and count. Since I'm only looking at country, in the field column I only see country, in the value column I see all the different values that we have for country, and then there's the count.
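The field, value, count table is easy to picture with a small Python sketch. This is an illustration under stated assumptions rather than xsv's own code: the sample rows (including the deliberate spelling variant) and the `frequency` helper with its `limit` parameter are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical rows, including a deliberate spelling variant to spot.
SAMPLE = (
    "Country,City\n"
    "United States,boston\n"
    "United States,austin\n"
    "United states,denver\n"
    "Canada,ottawa\n"
)

def frequency(text, column, limit=10):
    """Return (field, value, count) rows, most common value first.

    A limit of 0 means no limit.
    """
    counts = Counter(row[column] for row in csv.DictReader(io.StringIO(text)))
    ranked = counts.most_common(limit if limit > 0 else None)
    return [(column, value, count) for value, count in ranked]

freq_rows = frequency(SAMPLE, "Country")
for row in freq_rows:
    print(",".join(str(cell) for cell in row))
```

In the printed table, "United States" and "United states" show up as separate values with their own counts, which is exactly the kind of data-quality problem the frequency count is meant to expose.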
Now you can see that there are 10 in this list. By default it limits the output of frequency to 10 records. To get a different number, you can use --limit after the rest of the command, and you can put 3, so now I'm going to see three. I think you can do --all... no, that's not it. You can do --limit 100 if you want to, and it gives me the top 100 countries based on the frequency count. Let's go to the frequency help. So yeah, there is a limit option, and you can set the limit to zero to disable the limit. So if I want to see everything I do --limit 0 and it brings me all of the countries, and I can see that there are one, two, three, four, five countries that only have one record. If I was doing a data cleanup I'd look at that and worry that there was a data entry problem right there, because why would these only have one record each when there are 3.1 million records? It's probably accurate in this example, but it's something you want to check.
So now I've looked at the frequency and I can do some cleanup. Sometimes you might want to take just some of the data out of this file and not all of it. Let's say I just want the country, or say I don't care about the longitude and latitude in this file. There's a command called xsv select, and it lets you choose columns. If you go to xsv select --help you'll see that there are different ways you can select columns. You can do it by column name, so xsv select name1,name2,name3, and that will match those column names and output them to standard out. You can use column numbers, starting with the number one: if I want the first column and the fourth column, xsv select 1,4. If I want the first four columns I can say xsv select 1-4. You can use column names in a range too, so in this example I could say country-population and that will give me all the columns between and including the country and the population columns.
What I wanted to do in my example is another case, where you use the exclamation point. When you use something like an exclamation point you want to put your selection inside single quotes. The exclamation point means not, so I do xsv select, not longitude,latitude, world cities, and now I get everything except for those last two columns. So that's very useful. You can also say I want just from the third column to the end, so I can do xsv select 3- with nothing after the dash, or I can go region- and just go from region to the end, so there are lots of different options you can use.
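The selection forms described here (names, 1-based numbers, ranges, open-ended ranges, and a leading exclamation point for "everything except") can be sketched as a small Python parser. This is a simplified illustration, not xsv's actual parser: the helper names and the sample row are hypothetical, and column names that themselves contain a dash or comma are not handled.

```python
import csv
import io

# Header as read out in the episode; the data row is made up.
SAMPLE = (
    "Country,City,AccentCity,Region,Population,Latitude,Longitude\n"
    "xx,sampletown,Sampletown,01,1000,10.5,20.5\n"
)

def resolve(token, names):
    """Turn one column reference (a name or a 1-based number) into a 0-based index."""
    if token.isdigit():
        return int(token) - 1
    return names.index(token)

def select_indexes(selection, names):
    """Expand a selection such as '1,4', 'Country-Population', '3-' or '!Latitude,Longitude'."""
    invert = selection.startswith("!")
    if invert:
        selection = selection[1:]
    picked = []
    for part in selection.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            first = resolve(start, names)
            # An empty end ("3-") means "through the last column".
            last = resolve(end, names) if end else len(names) - 1
            picked.extend(range(first, last + 1))
        else:
            picked.append(resolve(part, names))
    if invert:
        return [i for i in range(len(names)) if i not in picked]
    return picked

def select(text, selection):
    """Return all rows (header included) reduced to the selected columns."""
    rows = list(csv.reader(io.StringIO(text)))
    indexes = select_indexes(selection, rows[0])
    return [[row[i] for i in indexes] for row in rows]

for selection in ("1,4", "Country-Population", "3-", "!Latitude,Longitude"):
    print(selection, "->", select(SAMPLE, selection)[0])
```

The single quotes mentioned for the exclamation point are a shell concern (they stop the shell from interpreting the character), so they do not appear in the Python strings.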
It's a really nice interface, so if you are interested in this type of thing I definitely recommend checking out all the things that xsv has in store. So now we've gone over a couple of different use cases. One thing that is important to know is that xsv has the ability to build an index. When you run xsv index it outputs a binary file that stores a bunch of information about all the data in the original file, which makes things like frequency, and another command that we're about to talk about called stats, happen a lot faster. So if you're going to do multiple things on this one file and it's really big (it depends on your hardware, but I would put big at over a million records), you might want to run index first, so that every time you run frequency, for instance, or stats, it goes faster.
So, stats. With stats you can get, not all types, but some statistical information about the worldcitiespop.csv file. Let's just run --help on it. By default it gives you results for all the columns, or you can say -s and indicate individual columns if you don't want every column. It looks at the mean, max, min and standard deviation for all the columns. If you want to look at more, there's a --everything option that gives you a lot more information. These are the items that you can see with --everything: the mode, which is the most common value; the cardinality, which is how many different items there are, kind of like a distinct count; and the median. To get those individual items instead of getting everything, you can use just --mode, --cardinality or --median, and it'll give you just those items.
If you do --everything it'll give you all of those. If you're going to use a big file, you do want to put the index down first. According to the documentation, using mode, cardinality and median will hold the CSV data in memory, so check that part out if that's a limit you might have. There's another option called -j or --jobs. By default it'll use all of your CPUs to run the calculations, but you can specify how many CPUs to use.
But let's just try an example. First let's do xsv index world cities pop, just to get the index file down, and then we are going to run stats --everything, and then you'll see the output of that. Let's run it. There it is. It's kind of a jumbled mess to look at when it first comes out because it's comma-delimited and some of the values are blank, so it's kind of hard to read. There are two formatting things you can do to the output of xsv that help. One of them is called table, and if you have a low number of columns or a really big screen, table is a good idea. It puts the output in a columnar format where there's equal spacing between each item. So let's try that: xsv stats --everything world cities pop, and we're just going to pipe that right into xsv table. The output of that is a nicely formatted list of all the data that we're talking about. You can see there's a field column and a type, and then there's sum, min, max, min length and max length. For the ones that are type text, or as they call it, type Unicode, it'll give you the min length and the max length, and it won't give you a mean or standard deviation for any of those because that doesn't make sense, but it does give you a mode and a cardinality for the text-type values. If you have integers or floats, you can see that you'll have a min, max, mean, standard deviation, median, mode and cardinality.
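To make the per-column statistics concrete, here is a minimal Python sketch that computes the same kinds of figures for one column. It is an illustration only: the sample data and the `column_stats` helper are hypothetical, the type detection is a simple "does every value parse as a number" check, and using the population standard deviation is an assumption rather than something stated in the episode.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical sample with one text column and one numeric column.
SAMPLE = "City,Population\nalpha,100\nbeta,300\ngamma,300\ndelta,500\n"

def column_stats(text, column):
    """Summarise one column: lengths, mode and cardinality always, numeric stats when possible."""
    values = [row[column] for row in csv.DictReader(io.StringIO(text))]
    summary = {
        "field": column,
        "min_length": min(len(v) for v in values),
        "max_length": max(len(v) for v in values),
        "mode": Counter(values).most_common(1)[0][0],  # most common value
        "cardinality": len(set(values)),               # number of distinct values
    }
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        summary["type"] = "text"  # mean and standard deviation make no sense for text
        return summary
    summary.update(
        type="number",
        min=min(numbers),
        max=max(numbers),
        mean=statistics.mean(numbers),
        stddev=statistics.pstdev(numbers),  # population standard deviation (an assumption)
        median=statistics.median(numbers),
    )
    return summary

population_stats = column_stats(SAMPLE, "Population")
city_stats = column_stats(SAMPLE, "City")
print(population_stats)
print(city_stats)
```

Note that this sketch holds the whole column in memory, which is the same trade-off mentioned for mode, cardinality and median on large files.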
So it's really nice. When you have a lot of these items it really is kind of hard to see everything when you're using xsv table, so another option is, instead of piping it into table, we can pipe it into flatten. You can do this on the original file itself, or you could run a head command and then pipe head into xsv flatten, just so you can get a smattering of what the data looks like in a format that is easier to look at in the terminal. What flatten does is, instead of having every record in a row and all the data for every row in columns, it puts every record in a block, separated by a line that just has a pound sign, or hash, on it, and every field is on its own line. So, for instance, when I run flatten on the stats --everything command, every block is going to have the same fields. In this case it's going to say field, type, sum, min, max, min length, max length, mean, median, all that stuff, and then it has a pound sign and repeats it, and a pound sign and repeats it, for every record. In the first block I'm looking at the field country, and it has all the information; in the second block I'm looking at the field city; in the third block I'm looking at the field accent city, and so on. It's a very nice way to output the information.
So now I've looked at some of the statistics around the file, and there might be some other manipulations I want to do with it. One thing we did was select; another thing we might want to do is sort, so let's look at the different ways you can sort. xsv sort is that command, and you can do sort -s and specify the column that you want, but let's look at the help just so we can get a more thorough view. Like I said, you can do -s and then the column name or multiple column names. You can do -N, capital N, to do a numerical sort. For people who are not familiar with this concept, if you sort strings and the strings are numeric-looking, and you have the numbers 1, 2, 3 and 21, it's going to put 21 right next to 2, because the way string sorts work, it's going to keep the twos together and put the threes together. But if you do -N it will do what you expect it to do, which is put 1, 2, 3, 21. The -R option, capital R, will allow you to do a reverse sort.
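The 1, 2, 3, 21 example is easy to reproduce. The Python sketch below sorts the same values as text, as numbers, and in reverse; the sample rows and the `sort_rows` helper are made up for illustration and are not xsv's implementation.

```python
import csv
import io

# Hypothetical rows; the counts are stored as text, as in any CSV file.
SAMPLE = "value,count\naa,1\nbb,2\ncc,3\ndd,21\n"

def sort_rows(text, column, numeric=False, reverse=False):
    """Return the data rows sorted by one column, as text or as numbers."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if numeric:
        key = lambda row: float(row[column])  # compare by numeric value
    else:
        key = lambda row: row[column]         # compare character by character
    return sorted(rows, key=key, reverse=reverse)

as_text = [row["count"] for row in sort_rows(SAMPLE, "count")]
as_numbers = [row["count"] for row in sort_rows(SAMPLE, "count", numeric=True)]
reversed_numbers = [row["count"] for row in sort_rows(SAMPLE, "count", numeric=True, reverse=True)]
print(as_text)           # ['1', '2', '21', '3']
print(as_numbers)        # ['1', '2', '3', '21']
print(reversed_numbers)  # ['21', '3', '2', '1']
```

The text sort compares the first character before anything else, which is why "21" lands between "2" and "3".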
So in this case, let's go back to our frequency command. We have frequency -s country on world cities, and by default it is sorted by the count. Let's do --limit 25... oh, let's make that --limit 0, and then pipe that into xsv sort. We know that the frequency command has field, value and count as the field names, so let's go -s value, and now we are going to sort by the country name instead of sorting by the highest count. When I do that I get CSV-formatted output where the first record is country, ad, 92. If I did it without the sort, it would put the one with the highest frequency count first, which I don't remember. Let me take that off. It's cn, which is China. Okay, that makes sense.
So sort is very useful. Another thing you might want to do instead of sorting is search. Say I want to find all the countries that begin with the letter u. In that case you would use xsv search -s country and then, inside single quotes, you put your regular expression. I know this field is only two characters, so I do u followed by a dot, and the dot means any character, and then the rest of the command. You'll see that I get all the ones that start with u, so there's uy, uz, us.
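Here is a minimal Python sketch of a column-restricted regular-expression search, to show why the two-character pattern behaves like "starts with u" on this field. The sample rows and the `search` helper are hypothetical, and treating the pattern as an unanchored match (like Python's `re.search`) is an assumption about how the tool matches.

```python
import csv
import io
import re

# Hypothetical sample; the country codes are two lower-case characters.
SAMPLE = (
    "Country,City\n"
    "us,boston\n"
    "uy,montevideo\n"
    "uz,tashkent\n"
    "au,sydney\n"
    "ca,ottawa\n"
)

def search(text, column, pattern):
    """Return the rows whose value in `column` matches the regular expression."""
    regex = re.compile(pattern)
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if regex.search(row[column])]

matches = [row["Country"] for row in search(SAMPLE, "Country", "u.")]
anchored = [row["Country"] for row in search(SAMPLE, "Country", "^u")]
print(matches)
print(anchored)
```

With two-character codes, "u." cannot match "au" because nothing follows the u. On longer values an explicit anchor such as "^u" is the safer way to say "begins with u".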
Here's another one you can do. Say I want to look at all of the cities with the word woods in the name. I do xsv search -s city and then, inside single quotes, dot asterisk, which means any character any number of times, then woods, and then dot asterisk at the end, to find any time the word woods is anywhere in there. I run that and I can see that there are a bunch of records in the US, some in Canada, and a couple in Barbados where the word woods is in the city column. So now we've looked at how search works, and like I said, all these things can pipe right into each other. Say you want to first search for all the ones that have the word woods in them and then do a frequency count of just those. Well, now you have your way of doing that: you run your search first, pipe that into xsv frequency, and then you can put that into another file if you want to. So you can see how the Unix philosophy, and the way you use a pipe to redirect things, really works well, and how xsv does really well with those types of operations.
We're already getting close to a half hour here, so I think we're going to call it there. Actually, before we completely call it, there's one other command, so let's look at the slice command. The search we just did was looking for specific words somewhere in the file, but if you want specific lines in the file, you can do that with the slice command. So let's look at slice --help. Slice works in a couple of different ways: you have -s for start, -e for end, -l for length and -i for index. So what are those items? Say I only want a single record; I would use -i, which means index. If I only want the one million eight hundred seventy-second record, I would do -i and then that number. Why would you want that? I don't know; I guess it's useful in some situations. But say you have a file that's already in a good order, or you just did some manipulations to put the file in the right order, and you want a specific section of the information. Using slice is going to be a lot faster than using search, because you're not doing any regex or anything; you're just looking up the index. If you run the index command on the file first, it becomes instantaneous to find any of the slices.
With -s you say where you start. Say I want records starting at number one million: I do -s 1000000, and if I wanted exactly 50 from one million I would use -l 50. So you can use -l instead of -e to specify the number of records. If I wanted from one million to one million five hundred seventy-two, I could put -s 1000000, then -e and that second number. So -e tells you the end of the range of your slice, -s tells you the beginning, and instead of using -e you can specify the number of records that you want with -l. If you leave -e or -l off, it'll go from that record to the end, so that's another option.
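The way start, end, length and index relate to each other can be sketched in a few lines of Python. This is an illustration under stated assumptions, not xsv's code: the `slice_records` helper and sample data are hypothetical, and the zero-based numbering and exclusive end (the same convention as Python slices) are assumptions that the episode does not spell out.

```python
import csv
import io

# Hypothetical file with ten data records, city0 through city9.
SAMPLE = "Country,City\n" + "".join(f"xx,city{n}\n" for n in range(10))

def slice_records(text, start=0, end=None, length=None, index=None):
    """Return the header plus a contiguous range of data records.

    Assumes zero-based record numbers and an exclusive end.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, records = rows[0], rows[1:]
    if index is not None:
        start, length = index, 1   # a single record is a slice of length 1
    if length is not None:
        end = start + length       # a length is just another way to give the end
    return [header] + records[start:end]

print(slice_records(SAMPLE, start=3, length=2))
print(slice_records(SAMPLE, start=2, end=4))
print(slice_records(SAMPLE, index=5))
print(slice_records(SAMPLE, start=8))  # no end or length: run to the last record
```

The header row is kept in every result, so the output of each call is itself a complete little CSV table.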
Say you don't want, for whatever reason, the first million records. We'll do that one really fast. Let's just do xsv slice on world cities pop csv, starting from, let's go three million. How many zeros is that? One, two, three, four, five, six. So if I just want everything after three million, I get all those records after three million. If I want just the first ten after three million, from three million one to three million ten, I can do that and it gives me those ten records. The thing that's really great about xsv is that you don't ever have to worry about where the header is. The header is always there, so it always returns the header in your output.
All right, so like I said, we're at a half hour now, so I think we're going to call it. Leave me comments on Hacker Public Radio's site if you have any questions. I'm going to have some show notes that give some of the basic information, but most of this is pulled directly out of the documentation on GitHub, so definitely check that out. The links for both the file that I'm working with and the repository where you find xsv itself, and where you can find binaries of it, are in the show notes, so please check it out. And as always, Hacker Public Radio fans, keep hacking.
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.