Episode: 2698
Title: HPR2698: XSV for fast CSV manipulations - Part 1
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2698/hpr2698.mp3
Transcribed: 2025-10-19 07:39:52

---

This is HPR Episode 2698, entitled XSV for Fast CSV Manipulations, Part 1. It is hosted by Be Easy, is about 31 minutes long, and carries a clean flag. The summary is: Written in Rust, XSV is my new favourite tool for manipulating CSV files.

This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15. That's HPR15. Better web hosting that's honest and fair at AnHonestHost.com.

Hello Hacker Public Radio fans, this is Be Easy once again, here with another episode. This one is focused around one of my favourite topics, which is the manipulation and handling of structured data in text format. What does that mean? Basically, messing with CSVs and stuff. It's a large part of what I do in my job and something I enjoy doing, which is weird, because who enjoys doing stuff like this? Sometimes it can be really tedious. But it's very important when you have lots of information. Maybe you're doing a data migration, and you're exporting from one database into a flat file, doing some data munging on it, and then importing it into another file. Or maybe you've been given some data that you are supposed to put into a new system, but it came out of Excel and there were no rules around how the data was gathered, so it didn't look very good coming out. So that's what this is really about. It's about being able to take stuff like that, and I'm going to be talking in particular about a tool called XSV.

Now, for a long time I used a different tool, csvkit, and I still do on occasion. It's a Python command-line tool that has some functionality that XSV does not, but what XSV does, it does way faster and just as well as csvkit.
So in a future episode I'll go over csvkit, and in a future episode as well I'll go over some of the more advanced features of XSV. But right now I'm going to go over something that I do pretty often, and I'm going to be using the file that is a part of the XSV documentation. XSV can be found at github.com/BurntSushi/xsv, and there is a file as a part of the documentation called worldcitiespop.csv. From now on in this episode I will be calling it world cities population. So that's what it is: a file of all these different cities, what their populations are, and where they're located in the world.

So say you get a file like this. What's the first thing you want to know about it? Well, how big is it? You can go to your command line and type in something like wc -l and you'll get a line count, which is fine, but XSV has its own tool for this. So let me get to my data, and I can either do wc -l worldcitiespop.csv or I can do xsv count worldcitiespop.csv.

Now, for reference, the computer that I'm using right now is an Intel Core i7 8700 at 3.2 gigahertz. So it's a beefy machine, 12 cores, but I've run XSV on a lot lower-spec hardware, including an Asus EeeBook, and it runs excellently across all those different places.

In this case, this file has 3,173,958 records. Now, when you run wc -l on the same file you'll get a number one higher than this, 3,173,959, and that's including the header. So one thing that XSV does is it takes away the header row. If you just do xsv --help you can see all the different commands and all the different options. You can see that there are a bunch of different commands, and we're going to focus on just a couple.
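To make the off-by-one between the two counts concrete, here is a minimal Python sketch. It does not call xsv; it only imitates the idea that a record count skips the header row while a line count does not. The three-record sample and the function names are made up for illustration.

```python
import csv
import io

# Hypothetical three-record sample standing in for worldcitiespop.csv.
SAMPLE = (
    "Country,City,Population\n"
    "ad,andorra la vella,20430\n"
    "us,new york,8107916\n"
    "cn,shanghai,14608512\n"
)


def count_lines(text):
    """Rough stand-in for wc -l: count newline characters, header included."""
    return text.count("\n")


def count_records(text):
    """Rough stand-in for a record count: parse as CSV and skip the header row."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # discard the header row
    return sum(1 for _ in reader)


lines = count_lines(SAMPLE)
records = count_records(SAMPLE)
print(lines, records)
```

The two numbers differ by exactly one here because the sample has a single header row and no quoted fields that span several lines.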
The first one we talked about was count. The next one I want to talk about: okay, so now I know how many lines are in this file, how long it is. How wide is it? You can do something like head -1 and the file name, and that will give you what it looks like, but XSV has its own thing that has some advantages to it. If I do xsv headers and the name of the file, I get an output with two columns. The first column is the numeric position of the column, and the next column is the name of that column. In this case there are seven columns. They are Country, City, AccentCity, Region, Population, Latitude, and Longitude.

If I do xsv headers --help (there's really nice help in this tool) you can see that it doesn't just do this. You can do a -j and that will hide that column index. You can do a --intersect, which is a really useful thing when you are looking at different files. With --intersect you can say xsv headers --intersect file1 file2 and it will give you a list back of all the columns that intersect between these two files. So if there is another file that also has the word Country, spelled the same way with a capital C, it will show you that that column is duplicated. This is useful sometimes when you are working with database data and you have two different tables: sometimes you'll be able to see where the foreign keys are by using this type of command, and the output is just really nice.

There are some common options that you usually have in XSV. One is -d or --delimiter, which tells it the type of delimiter that you have in that file. So if it's a tab-delimited file instead of a comma-delimited file, you can specify tab instead of comma, or if it's pipe-delimited you can specify that as well.
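The numbered header listing and the common-column check described above can be sketched in a few lines of Python. This is an illustration of the behaviour as the host describes it, not of how xsv is implemented and not a call to xsv itself; the two sample inputs and every function name are assumptions made for the example. The second input is pipe-delimited to show where a delimiter option comes in.

```python
import csv
import io

# Hypothetical inputs; the second one is pipe-delimited.
CITIES = "Country,City,Population\nus,new york,8107916\n"
REGIONS = "Country|Region|Name\nus|NY|New York\n"


def read_headers(text, delimiter=","):
    """Return the first row of a delimited text as a list of column names."""
    return next(csv.reader(io.StringIO(text), delimiter=delimiter))


def numbered(headers):
    """Pair each header with its 1-based position."""
    return list(enumerate(headers, start=1))


def common_headers(first, second):
    """Names present in both lists, compared case-sensitively, in first's order."""
    second_set = set(second)
    return [name for name in first if name in second_set]


cities = read_headers(CITIES)
regions = read_headers(REGIONS, delimiter="|")

for index, name in numbered(cities):
    print(index, name)

shared = common_headers(cities, regions)
print(shared)
```

Because the comparison is case-sensitive, a column called country in one file and Country in the other would not show up as shared, which matches the capital-C remark above.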
Some of the other commands also have a -o or --output and the file name, so you don't have to redirect the output into a file. You can actually use --output and the file name, and that is another way to get it into a file.

All right, so what have we done? We've looked at the length of the file, we've looked at the width of the file, but now I'm looking at the names of these columns. I know that when you're doing data cleaning you always want to look at data quality issues that you might have, where there might be two different representations of the same information. Someone might have spelled the country wrong, or the city wrong, or the region wrong, and you might want to be able to see what the distinct counts of all the countries are, for instance. XSV makes it really easy.

It's not the case in this file, but say that the words United States were all the way spelled out, and there are two different values in here, one with United States with a capital U and a capital S, and another one with a capital U and a lower-case s. When you run this command you would see two different records, one with three million and one with a hundred thousand, and you'd say, oh look, it looks like a hundred thousand of the records should have the same value as that other one that has three million. So that's something you can go back and clean up, and then get the output and resume your data cleaning. It's something I do, I've just got to tell you, way too often. If you're not into that kind of stuff you're probably pulling your hair out when you hear it, but it's something that is really useful and I enjoy doing it.

But let's get back to it. So if I go xsv frequency, the name of the command, and I do -s, which means select, and then Country, which is the name of the column that I want to look for distinct values in, and then the name of the file, worldcitiespop.csv, it'll give me a list of all of
the values of Country. In the output there are three columns, each delimited by a comma. Most of the time when we're dealing with XSV, the output of the command is a CSV file, or CSV object, on standard out, and you either put that into a file or you pipe that into another command. In this case we have three columns. The columns are field, value, and count. Since I'm only looking at Country, in the field column I only see Country, in the value column I see all the different values that we have for Country, and then there's a count.

Now, you can see that there are 10 in this list. By default it limits the output of frequency to 10 records. To get a different amount you can use --limit after the entire command, and you can put 3, so now I'm going to see three. I think you can do --all, am I right? No, that's not it. You can do --limit 100 if you want to, and it gives me the top 100 countries based on the frequency count. Let's go to frequency --help. So yeah, there is a limit, and you can set the limit to zero to disable the limit. So if I want to see everything, I do --limit 0 and it brings me all of the countries, and I can see that there are four, one, two, three, four, five countries that only have one record. If I was doing a data cleanup I'd look at that and worry that there was a data entry problem right there, because why would these only have one record each when there are 31 million, or 3.1 million, records? It's probably accurate in this example, but it's something you want to check.

So now I've looked at the frequency and I can do some cleanup. Sometimes you might want to just take this file and take some of the data out of it and not all of it. Let's say I just want the country, or say I don't care about the longitude and latitude in this file. There's a command called xsv select, and it will let you choose columns. If you go to xsv select
--help, you'll see that there are different ways you can select columns. You can do it by the column name, so xsv select name1,name2,name3, and that will match those column names and output them to standard out. You can use the column numbers, starting with the number one. So if I want the first column and the fourth column, xsv select 1,4. If I want the first four columns I can say xsv select 1-4. Or you can use the column names in a range, so in this example I could say Country-Population, and that will give me all the columns between and including the Country and the Population columns.

What we wanted to do in my example is another case, where you can use the exclamation point. When you use something like an exclamation point, you want to put your selection inside of single quotes. So I want to do xsv select '!Longitude,Latitude' worldcitiespop.csv. The exclamation point is "not", so not Longitude, Latitude, and now I'll get everything except for those last two columns. So that's very useful. You can also go from the third column to the end, so I can do xsv select 3- with nothing after the dash. I can also go Region- and just go from Region to the end. So there are lots of different options you can use this for. It's a really nice interface, so if you are interested in this type of thing I definitely recommend checking out all the things that XSV has in store.

So now we've gone over a couple of different use cases. One thing that is important to know is that it does have the ability to build an index. When you run xsv index it's going to output a binary file that stores a bunch of information about all the data in the original file, so it makes things like frequency, and another command that we're about to talk about next called stats, happen a lot faster. So if this is a
command that you're going to run, if you're going to do multiple things on this one file and it's really big (it depends on your hardware, but I would put big at over a million records), you might want to run index first, so that every time you run frequency, for instance, it goes faster, or stats, which is the one we're going to talk about next.

So, stats. You can get, not all types of, but some statistical information about the worldcitiespop.csv file. By default (let's just go --help on it) it gives you that for all the columns, or you can say -s and then indicate individual columns if you don't want every column. It looks at the mean, max, min, and standard deviation for all the columns. If you want to look at more, there's an --everything option that gives you a lot more information. These are the items that you can see with --everything. You can see the mode, which is the most common value. The cardinality, which is how many different items there are, kind of like a distinct count. And there's a median. To get those individual items instead of getting everything, you can do just --mode, --cardinality, or --median, and it'll give you just those items. If you do --everything it'll give you all of those.

If you're going to use a big file, you do want to put the index down first. According to the documentation, using mode, cardinality, and median will hold the CSV data in memory, so if that's a limit that you might have, check that part out. There's another option called -j or --jobs. By default it'll use all of your CPUs to run the calculations, but you can specify how many CPUs
to use. But let's just try an example. First let's do xsv index worldcitiespop.csv, just to get the index file down, and then we are going to go xsv stats --everything, and then you'll see the output of that. Let's run it. There it is. It's kind of a jumbled mess to look at when it first comes out, because it's comma-delimited and some of the values are blank, so it's kind of hard to look at.

There are two formatting things you can do to the output of XSV that help. One of them is called table, and if you have a low number of columns or a really big screen, table is a good idea. It'll put the output in a columnar format where there's equal spacing between each item. So let's try that: xsv stats --everything worldcitiespop.csv, and we're just going to pipe that right into xsv table. The output of that is a nice, pretty formatted list of all the data that we're talking about. You can see there's a field column, and there's a type, and then there's sum, min, max, min length, max length. For the ones that are type text, or as they call it, type Unicode, it'll give you the min length and the max length, and it won't give you a mean or standard deviation for any of those because it doesn't make sense, but it does give you a mode and a cardinality for the text-type values. If you have integers or floats, you can see that you'll have a min, max, mean, standard deviation, median, mode, and cardinality. So it's really nice.

When you have a lot of these items it really is kind of hard to see everything when you're using xsv table. So another option is, instead of piping it into table, we can pipe it into flatten. What flatten does is it moves all of the items that are in columns onto their own lines, and you can do this on the original file itself, or
you could run a head command and then pipe head into xsv flatten, just so you can get a smattering of what the data looks like in a format that is easier to look at in the terminal. What it does is, instead of having every record in a row and all the data for every row in columns, it puts every record in a block, separated by a line that just has a pound sign, or hash, on it, and every field is on its own line. So for instance, when I run flatten on the stats --everything command, every block is going to have the same fields. In this case it's field, type, sum, min, max, min length, max length, mean, median, all that stuff, and then it has a pound sign and repeats it, and a pound sign and repeats it, a pound sign for every record. So in the first block I'm looking at the field Country, and it has all the information. In the second block I'm looking at the field City, in the third block I'm looking at the field AccentCity, and so on. So it's a very nice way to output the information.

So now I've looked at some of the statistics around the file. There might be some other manipulations I want to do with it. One thing that we did is we selected. Another thing we might want to do is sort, so let's look at the different ways you can sort. xsv sort is that command, and you can do sort -s and specify the column that you want, but let's look at the help just so we can get a more thorough view. Like I said, you can do -s and then the column name or multiple column names. You can do -N, capital N, to do a numerical sort. For people who are not familiar with this concept, if you sort strings and the strings are numeric-looking, and you have the numbers 1, 2, 3, and 21, it's going to put 21 right next to 2, because the
way string sorts work is it's going to keep the twos together and it's going to put the threes together. But if you do -N, it will do what you would expect it to do, which is put 1, 2, 3, 21. The -R option, capital R, will allow you to do a reverse sort.

So in this case we have this file, and let's say we have our frequency, so let's go back to our frequency command. We have xsv frequency -s Country on world cities pop, and by default it's sorted by the count. So let's do --limit 25, oh, let's make that --limit 0, and then pipe that into xsv sort. We know that in the frequency command we have field, value, and count as the field names, so let's go -s value, and now we are going to sort by the country name instead of sorting by the max count. When I do that I get output in CSV format where the first record is Country,ad,92. If I did it without that, it would put the one with the highest frequency count first, which I don't remember what it was. Let me take that off. It's cn, which is China. Okay, it makes sense. So sort is very useful.

Another thing you might want to do instead of sorting is search. Search is for when you might want to say, let me just find all the countries that begin with the letter u. In that case you would use xsv search -s Country, and then inside of single quotes you would put your regular expression. I know it's only two characters for this field, so I do u followed by a dot, and the dot means any character, and then the rest of the command. You'll see that I get all the ones that start with u, so there's uy, uz, us. Here's another one that you can do. Say I want to look at all of the cities with the word woods in the name. I do xsv search -s City and then, inside of single quotes, .* which means any characters, any amount
of times, then woods, and then .* at the end, so as to find any time the word woods is anywhere in there. I run that and I can see that there are a bunch of records in the US, and then some in Canada, and a couple in Barbados, where there's the word woods in the file under the City column.

So now we've looked at how search works, and like I said, all these things can pipe right into each other. Say you want to first search for all the ones that have the word woods in them, and then you want to do a frequency count of just those. Well, now you have your way of doing that. You run your search first, pipe that into xsv frequency, and then you can put that into another file if you want to. So you can see how the Unix philosophy, and the way you use a pipe to redirect things, really works well, and how XSV really does well with that type of operation.

We're already getting close to a half hour here, so I think we're going to call it there. But actually, there's one other command before we completely call it. Let's look at the slice command. One way that we just did a search was by looking for specific words somewhere in the file, but if you want just specific lines in the file, you can do that with the slice command. So let's look at slice --help. Slice works in a couple of different ways. You have -s for start, -e for end, -l for length, and -i for index. So what are those items? Say I only want a single record: I would use -i, which means index. If I only want the 1,000,872nd record, I would do -i and then that number. Why would you want that? I don't know, I guess it's useful for some situation. But say you have a file that's already in a good order, or you just did some manipulations to put this file in the right order, and you want a specific section
of the information. Using slice is going to be a lot faster than using search, because you're not doing any regex or anything, you're just looking up the index. If you run the index command on the file first, it becomes instantaneous to find these records and any of the slices.

With -s, that's where you start. So say I want records starting at number one million: I do -s 1000000, and if I wanted exactly 50 from one million I would use -l 50. So you can use -l instead of -e to specify the number of records. If I wanted from 1,000,000 to 1,000,572, I could put -s 1000000 and then -e, excuse me, and that second number. So -e tells it the end of the range of your slice, -s tells it the beginning, and instead of using -e you can specify the number of records that you want. Also, if you leave -e or -l off, it'll go from that record to the end, so that's another option. So say you don't want, for whatever reason, the first one million records: it'll do that one really fast.

So let's just do xsv slice worldcitiespop.csv from, let's go, three million. That's how many zeros? One, two, three, four, five, six. So if I just want every one after three million, I get all those records after three million. If I want the ones after three million, and then just the first ten, from 3,000,001 to 3,000,010, I can just do that, and it gives me those ten records. The thing that's really great about XSV is that you don't ever have to worry about where the header is. The header is always there, so it always returns the header in your output.

All right, so like I said, we're at a half hour now, so I think we're going to call it. But leave me comments on Hacker Public Radio's site if you have any questions. I want to have some show notes that are going to give some of the basic
information, but most of this is pulled directly out of the documentation on GitHub, so definitely check that out. The links for both the file that I'm working with and the repository where you find XSV itself, where you can also find binaries of it, are in the show notes, so please check it out. And as always, Hacker Public Radio fans, keep hacking.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.