Episode: 2752 Title: HPR2752: XSV for fast CSV manipulations - Part 2 Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2752/hpr2752.mp3 Transcribed: 2025-10-19 16:20:51 --- This is HPR Episode 2752 entitled XSV for fast CSV manipulations, Part 2. It is hosted by b-yeezi, is about 23 minutes long, and carries a clean flag. The summary is: part 2 of my introduction to the XSV tool. This episode of HPR is brought to you by archive.org. Support universal access to all knowledge by heading over to archive.org forward slash donate.

Hello Hacker Public Radio fans. This is B Easy once again, coming at you one more time with an episode about XSV, the extremely fast CSV manipulation tool written in Rust. This time I'm going to go over basically the rest of the commands that are in the documentation, with the exception of one, the reverse command, because although it is in the documentation, it is not yet in the actual binary that's provided at the GitHub link. I contacted the repo maintainer, BurntSushi, and he said that he knows; he releases his binaries at a separate cadence from the docs, so whenever he updates the binaries, it'll be in there. You can build it from source and get the reverse command if you really want it. I find that using the sort command is all I really need. But without any further ado, let's go.

So we're going to go back to that original file we were using, the worldcitiespop.csv file. Let me count it again: xsv count worldcitiespop.csv shows it has 2,699,354 records in it. So it's a big file, and we have a lot of cool commands to go over to show what we can do with it.

The first set of commands I'm going to talk about are different ways you can mess with how the file is formatted. The first one is xsv fixlengths, and that is not what I originally thought it was. I thought it was going to make every column a fixed width, but what it actually does is fix the length of the records so that every row has the right number of columns. If you don't give it any parameters, it reads the entire file, figures out how many columns each line has, and whichever line is the widest, it spits out a file with that many columns on every line. So for instance, if every line except one has 30 columns and one has 29, it will fix that one record to have 30, so that if you're feeding this CSV file to some other program that can't handle ragged rows, it won't throw an error and say, hey, I don't have enough fields in this row. And conversely, if there's one row with one extra field, it will make every other row longer to fit it. But you can also say how many columns you want it to have. If you use dash l and then a number, that says: fix it so that every row always has that many columns. So if I know that there might be one row with 31 columns and the rest have 30, and I say dash l 30, it'll clip off that last extra column in the one that has it. So it's a way you can fix a CSV file. For example, with this file, if I just run xsv headers on it, I can see that there are seven columns, although that's just reading the top of the file; I don't think it reads the entire file to get that number. So say there was a row with more than that: if I do xsv fixlengths -l 7 worldcitiespop.csv, that's going to mandate that every line has exactly seven columns in it. That's a really useful command, and you can see why if you're working with data where you don't know the source and you need to clean it up. It works really well for that, and it's really scalable.
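For reference, those invocations look roughly like this (a sketch, assuming the file is named worldcitiespop.csv as in part 1):

    # count the data records (the header row is not counted)
    xsv count worldcitiespop.csv

    # pad ragged rows out to the width of the widest row
    xsv fixlengths worldcitiespop.csv > fixed.csv

    # or force every row to exactly 7 columns, padding or truncating as needed
    xsv fixlengths -l 7 worldcitiespop.csv > fixed.csv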
Another command is fmt, standing for format. It's a way that you can reformat a file. The biggest use case I find for it is the option for changing the delimiter. Sometimes it's better to use TSV files instead of CSV files, because if the data has commas in it, it can cause some programs to have issues. If I look at xsv fmt --help, I can see that there are a lot of different things you can do. One is the -t option, or --out-delimiter, and that's how you change the field separator. So if you want to change from CSV to TSV, you can do -t and then backslash t, which is the tab character. Or you can do -t and then the pipe character to make a pipe-delimited file, anything you want really. The default, obviously, is the comma.

Another thing you can do is --crlf, which is going to write \r\n style line endings, the Windows style (old Mac OS used a bare \r). If someone wants to write in the comments or make a full episode about CRLF, I'd love to hear it, but I always try to convert everything to the Linux line ending once I get a file.

The next one is --ascii, and that says to use the ASCII field and record separator characters. So if I go to that same file and do xsv fmt --ascii worldcitiespop.csv, it's going to put in the ASCII field and record delimiters. You can't really see them in the terminal output, but if you open the result with a parser that understands those separators, you'll see that it's parsed correctly.

All right, so --quote is a way you can change the quote character, which by default is the double quote. If you've ever used Excel, for instance, you know that if there's a comma inside one of the cells and you convert the sheet to CSV, it puts double quotes around that value so the comma is treated as part of the string in the same column rather than as a column separator. You can change that to a different character if you want: a single quote, a pipe, whatever you've seen used. You can also do --quote-always, and that puts quotes around every value, so you never have to worry about a stray column-separating character inside the file.

The next one is --escape with an argument, and that's a way to choose a different escape character. When not specified, quotes are escaped by doubling them, so if you want a literal quote character inside a quoted value, you write the quote twice. Otherwise, you can set an escape character; you might want to make backslash the escape character, which is the most common one I've seen. So that's what the fmt command does: a lot of little things to change how the file looks.

Another one is input. So the input command, let's look at it again, is used for reading a CSV with an exotic quote character or escape character. Just as fmt lets you write out unusual quote and escape characters, input lets you ingest a file that's prepared like that: you can use xsv input --quote with whatever the custom quote character is, or --escape with whatever the custom escape character is, and the output will be in what a, quote unquote, normal CSV file looks like. So it's another way of rapidly cleaning a file into a standard format.
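Here's a rough sketch of the fmt options from a minute ago (xsv accepts the two-character string \t as a tab delimiter):

    # rewrite the file tab-separated (TSV)
    xsv fmt -t '\t' worldcitiespop.csv > worldcitiespop.tsv

    # pipe-delimited instead
    xsv fmt -t '|' worldcitiespop.csv

    # ASCII field/record separators, or quotes around every value
    xsv fmt --ascii worldcitiespop.csv
    xsv fmt --quote-always worldcitiespop.csv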
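And a sketch of input going the other direction, normalizing an oddly formatted file (oddball.csv is a made-up name here):

    # read single-quoted, backslash-escaped data and emit standard CSV
    xsv input --quote "'" --escape '\' oddball.csv > clean.csv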
So the next couple of commands we're going to talk about are ways to join CSV files together or take them apart. I'll do split first, because it's the easiest one. xsv split is a way that you can, like it says, split a file into multiple files. It's very similar to the split command in the shell, but the difference is that it puts the header on every file, so you don't have to go back into the files and add the header yourself, which is really convenient. If you look at xsv split --help, you can see that the first option is -s or --size, which says how many lines you want in each chunk. By default it's 500, so if you just say xsv split, the output directory, and the file name, it'll cut the file into 500-line chunks. You can do -j for jobs, which is how you say how many CPUs you want to use; by default it uses all of them, but you don't have to. And --filename with a template is how you say what you want the output file names to look like. The example I use: I make a folder called worldcitiespop first, then run xsv split with -s 10000 and --filename set to worldcitiespop, an underscore, an empty pair of curly brackets, and .csv. For every 10,000 lines, it puts a file inside worldcitiespop named worldcitiespop underscore and then a number; I think it starts at zero and counts up chunk by chunk. So it's an easy way to get the files to come out separated in a really simple way, by line number.

Now, what if you want to do something a little bit more useful, like splitting on a specific value? Say, in this file, the first column is the country: what if you want to put every country in its own file? For that you use the partition command. So let me clear out that directory and use xsv partition. Looking at the help for that, you do partition, then the column name you want to partition by, then the output directory, then the input file. So it's a little bit different; it's not quite a fluent API, and I think it's something that could be improved: with split you give the filename template as a flag and the path separately, and here you just put the output directory as a positional argument, so it's a little inconsistent. But if you read the docs and get used to it, it's okay, because this is a really useful command. So I do xsv partition Country, because that's the name of the column, then worldcitiespop as the directory, then worldcitiespop.csv, and that creates one file named after each value in the Country field. I end up getting a file for every country: there's a US.csv, a UK.csv, and so on. Actually, I think it's GB.csv. So that's very useful; you can probably think of lots of ways you could use that.
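Going back to split for a second, the invocation sketched out (if I recall correctly, the curly brackets are replaced with the index of the first record in each chunk, so the names come out like worldcitiespop_0.csv, worldcitiespop_10000.csv, and so on):

    # 10,000-line chunks, each with the header row included
    mkdir -p worldcitiespop
    xsv split -s 10000 --filename 'worldcitiespop_{}.csv' worldcitiespop worldcitiespop.csv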
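And the partition version, with the column name first (in this dataset the header is capitalized as Country):

    # one file per distinct value of the Country column: US.csv, GB.csv, ...
    xsv partition Country worldcitiespop worldcitiespop.csv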
On the other side of that is cat. So the cat command is a way that you can join CSV files together. If I cd into that directory and do xsv cat rows with all the CSV files, it's going to concatenate, by row, all the CSV files in that worldcitiespop directory and put them back together into one file. If all those files have the same columns, it'll do it perfectly. If they don't, let's get it right: if you look at the help, it says that when you cat by columns, the columns will be written in the same order that the inputs were given, and the number of rows in the result is always equivalent to the minimum number of rows across all the CSV data. This can be reversed using the pad flag. The pad flag, kind of like what we saw in the fixlengths command, says: when you concatenate the columns, pad each row so that every record still appears. You can see how that can be useful. Another thing it allows you to do is use xsv fixlengths, or xsv select to get just the columns that you want, so you know the exact number of columns beforehand, and then pipe the output into cat, so you know the right number of columns you're working with. So you can do it by rows, and you can also do it by columns: say both files have 92 lines and similar data, it'll basically do a stack sideways, taking all the columns of one and putting them right next to the columns of the other, making the output wide. And if there are extra records in one, it'll just have blank fields, just commas, until it gets to the spot where the extended data is. So you can cat by row, cat by column, and pad if you're using columns. I'll sketch a couple of these invocations below.

And then the last command to go over is sample, and this is something really useful for a file this big. This file has, like I said, over two and a half million records in it, and it might be too big to do analysis on directly. I could do a head and use that, because sometimes you don't want a full analysis; you might be on a local machine where you just want to look at the data and get a good idea of what it looks like first. But if you do a head, you might not get a good representation of what the data looks like. The sample command will allow you to take a quasi-random sample of the data with a fixed number of records returned. The shape of the command is xsv sample, then the sample size, then the input file. And if you index the file first, remember, in the last episode we talked about indexing: you just do xsv index and the file name, which makes a little binary index file that holds a reference to every line in the file in a short format. If you run that first and then run the sample command, it runs almost instantaneously, even on this big file, because it knows where every record is and can jump through the file really fast. If you don't want to index it first, you can use the slice command that we went over last time and slice the file: you can say, I don't want the entire file, I want the middle third of the file, and then just give me a sample of that.
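The cat invocations I mentioned above, sketched out (a.csv and b.csv are stand-in names for two files you want side by side):

    # glue the per-country files back together by row
    xsv cat rows worldcitiespop/*.csv > rejoined.csv

    # paste two files together by column, padding the shorter one with empty fields
    xsv cat columns --pad a.csv b.csv > wide.csv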
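And a sketch of sample, with and without the index (xsv index drops a small .idx file next to the original):

    # random sample of 10,000 records, scanning the whole file
    time xsv sample 10000 worldcitiespop.csv > sample.csv

    # build the index, then the same sample runs in a fraction of the time
    xsv index worldcitiespop.csv
    time xsv sample 10000 worldcitiespop.csv > sample.csv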
So let me actually run it. I'm going to use the time command: time xsv sample 10000 worldcitiespop.csv, to get a sample of 10,000 records out of worldcitiespop. And oh, it's actually pretty fast. Well, I am on my beefy computer, so it did it in about half a second. If I first go and index that file, xsv index worldcitiespop.csv, and then run that same timed command again, now it does it in 0.09 seconds. If I was on my little Atom computer, it would probably take 20 times that, maybe 100 times that, so you can see how it can be a benefit to index the file first. So that's the final command to go over.

Just to recap: the xsv utility has become really handy for me, and since I found it, I've actually looked for some other utilities written in Rust. I found fd, which is like the find command written in Rust, and rg, which is like the Silver Searcher (ag) or ack written in Rust, and maybe I'll do episodes on those later. Just the idea of being able to use a programming language that's so close to the metal is really great. It's so fast, and it's really changed the way I look at CSV files. Before, I would either use awk or just head, and then if I had to do anything fancier, I would open up a Python interpreter or R and load the file that way, just so I could look at it or do some basic manipulations. Now I don't have to do that for a lot of use cases, because the xsv command is so powerful and so fast. So thank you for listening to another episode of Hacker Public Radio, and as always, keep hacking.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.