Recursive Bulk File Renaming ⁂ starbreaker.org

Introduction

I recently remembered that, while I store metadata for each page in my website as shell variables instead of XML, JSON, or something truly unholy like YAML, I need not keep that metadata inside shell scripts with the .sh extension if those files aren't meant to be executed. I could instead keep it in plain text files (.txt) that are meant to be read by other shell scripts.

This is an application of the rule of least power, incidentally. Data need not be placed in executable files. Environment settings are an exception: if you're using bash, the Bourne Again SHell, you want ~/.bashrc to hold runnable shell code because you're using the export command to make environment settings globally available, and most likely setting aliases for frequently used commands as well. (Bash sources that file rather than executing it, so it doesn't need the execute permission bit.)

The Fix

The complicated part was renaming all of my existing metadata files. Here's how I did it.

find ~/git/starbreaker.org/site -name 'index.sh' \
        | sed -e 'p;s/index.sh/index.txt/' \
        | xargs -P8 -n2 mv

how I converted metadata files for my website from shell scripts to text files

Kids, Don't Try This At Home

An explanation of how this set of commands works is in order; I don't want people assuming that they can just copy this and paste it into a terminal window.

You wouldn't just read an unidentified scroll you had picked up in a dungeon, would you? There's no telling what it might do! It might just be a scroll of "Identify", but it might also be a scroll of "Summon Ancient Dragons".

Actually, a lot of Rogue, Moria, Angband, Nethack, etc. players have done precisely that, especially when caught between a Balrog and a hard place. I've done it myself, and sometimes it's even worked in my favor. Nevertheless, it's not a good idea to run commands you don't understand, especially if they're piped together.

Getting a Set of Files with `find`

The use of the find command should be relatively straightforward. First, we specify the starting directory, which in this case is git/starbreaker.org/site in my home directory. Next, we use the -name parameter to specify the filename we want to find. While this switch takes wildcard characters, I've got a specific filename in mind: index.sh.

Once I press Enter to submit my command, I get output similar to the following. For this demonstration I've piped the output of find into head to show the first 10 lines of output. You can ignore the starbreaker@kether ~/git/starbreaker.org $ part; that's just my shell prompt.

starbreaker@kether ~/git/starbreaker.org $ find ~/git/starbreaker.org/site -name 'index.sh' | head
/home/starbreaker/git/starbreaker.org/site/interests/index.sh
/home/starbreaker/git/starbreaker.org/site/now/index.sh
/home/starbreaker/git/starbreaker.org/site/fiction/novels/index.sh
/home/starbreaker/git/starbreaker.org/site/fiction/novels/without-bloodshed/index.sh
/home/starbreaker/git/starbreaker.org/site/fiction/novels/silent-clarion/index.sh
/home/starbreaker/git/starbreaker.org/site/fiction/novels/starbreaker/index.sh
/home/starbreaker/git/starbreaker.org/site/fiction/stories/index.sh
/home/starbreaker/git/starbreaker.org/site/fiction/stories/holiday-rush/index.sh
/home/starbreaker/git/starbreaker.org/site/fiction/stories/tattoo-vampire/index.sh
/home/starbreaker/git/starbreaker.org/site/fiction/stories/thirteen-cuts/index.sh

unsorted output of a find command piped into head
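Since -name also takes wildcards, here's a minimal sketch of the difference between an exact name and a pattern. It runs against a throwaway directory instead of my real site; the directory layout and the variable names (demo, exact, wild) are made up for illustration, and the -type f test is an extra I've added to limit matches to regular files.

```shell
#!/bin/sh
# Build a scratch tree so nothing real gets touched (hypothetical layout).
demo=$(mktemp -d)
mkdir -p "$demo/site/now" "$demo/site/fiction"
touch "$demo/site/now/index.sh" \
      "$demo/site/fiction/index.sh" \
      "$demo/site/fiction/notes.sh"

# Exact filename: matches only files named index.sh.
exact=$(find "$demo/site" -type f -name 'index.sh' | sort)

# Wildcard: matches every file ending in .sh. The quotes make sure the
# pattern reaches find instead of being expanded by the shell first.
wild=$(find "$demo/site" -type f -name '*.sh' | sort)

echo "$exact"
echo "---"
echo "$wild"
```

The exact search lists the two index.sh files; the wildcard search also picks up notes.sh.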

Processing Filenames with `sed`

Next, I want to process the output of my find command with sed. sed is a "stream editor", a tool that can be used to transform text on the fly. You can pipe the output of another command, like find, into sed. You can also use sed to edit files on disk; it will let you make a series of changes to multiple files as long as they contain text that matches your criteria.

sed works by using regular expressions (sometimes called "regex" or "regexp"). This isn't as arcane a concept as you might expect based on the name. If you've ever used the find/replace function in a word processor, you've used the most rudimentary possible regular expression.

While truly powerful regular expressions can resemble something you might get if you mistakenly open a picture of your significant other in a text editor, the regexp I mean to use is rather more straightforward. For each file listed by the find command I had demonstrated earlier, I want to print it out followed immediately by what the file would look like if renamed from "index.sh" to "index.txt".

To do this, I pipe the output of my find command into sed -e 'p;s/index.sh/index.txt/'. The -e switch for sed indicates that I am submitting an expression to be run against the input. In this case, the input is the line-by-line output of find. You can use multiple -e switches to specify a set of expressions to be run in succession.

However, you can also separate individual commands in a single expression using a semicolon, as I'm doing above. The p command tells sed to print out the current line as-is. The s/index.sh/index.txt/ substitution after the semicolon changes "index.sh" to "index.txt" in the current line, which sed then prints as a second line. Here's what the result looks like; I've piped the output through head for convenience.

starbreaker@kether ~/git/starbreaker.org $ find ~/git/starbreaker.org/site -name 'index.sh' | sed -e 'p;s/index.sh/index.txt/' | head
/home/starbreaker/git/starbreaker.org/site/interests/index.sh
/home/starbreaker/git/starbreaker.org/site/interests/index.txt
/home/starbreaker/git/starbreaker.org/site/now/index.sh
/home/starbreaker/git/starbreaker.org/site/now/index.txt
/home/starbreaker/git/starbreaker.org/site/fiction/novels/index.sh
/home/starbreaker/git/starbreaker.org/site/fiction/novels/index.txt
/home/starbreaker/git/starbreaker.org/site/fiction/novels/without-bloodshed/index.sh
/home/starbreaker/git/starbreaker.org/site/fiction/novels/without-bloodshed/index.txt
/home/starbreaker/git/starbreaker.org/site/fiction/novels/silent-clarion/index.sh
/home/starbreaker/git/starbreaker.org/site/fiction/novels/silent-clarion/index.txt

unsorted output of find piped into sed, whose output is in turn piped into head
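To watch p and s work in isolation, you can feed sed a single line with printf instead of running find. This is only a sketch: the path is made up, and so are the variable names. The second expression is a stricter variant that I did not run against my site. In a regular expression an unescaped dot matches any character, so \. pins it to a literal dot, and $ anchors the match to the end of the line.

```shell
#!/bin/sh
# One made-up path stands in for a line of output from find.
path='/tmp/example/site/now/index.sh'

# p prints the line untouched; s/// then edits it, and sed prints the
# edited line on its own once it has finished with that line.
pair=$(printf '%s\n' "$path" | sed -e 'p;s/index.sh/index.txt/')
echo "$pair"

# Stricter variant (a suggestion, not the command used in this post):
# \. matches only a literal dot, and $ ties the match to the line's end.
strict=$(printf '%s\n' "$path" | sed -e 'p;s/index\.sh$/index.txt/')
echo "$strict"
```

Both expressions print the original path followed by the same path ending in index.txt. They only differ on odd input, such as a directory that happens to be named "index-sh".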

Renaming Files in Bulk with `xargs`

Thus far we haven't made any actual changes. We haven't actually moved any files from "index.sh" to "index.txt". While I have everything I need to rename these files by hand, there's no way in hell I'm going to do that for each file. The whole point of having a computer is to automate this sort of scutwork. Even a single-core machine can do this faster than I could by myself. But since I've got a multi-core machine, I can parallelize the shit out of this scutwork.

As you might have guessed from the subheading, I mean to perform this bit of digital thaumaturgy with the xargs utility. xargs, incidentally, is short for "extended arguments". Given a list of strings, like the filenames listed above, you can have xargs pass each string as an argument to a command like rm (remove file), cp (copy file) or mv (move file). None of these commands read filenames from a pipe, so if you want to copy or move multiple files in bulk, you need a tool like xargs.

a tangent on bulk-deleting files

If I simply wanted to start in a given directory and delete every file that matched a particular filename, the following command would do the job:

find ~/foo -name 'bar.txt' -exec rm {} +

recursive bulk file deletion using find with the -exec argument

However, this syntax is kinda fugly, and it's easy to forget that you need the {} + at the end. I still need to RTFM because I never remember what that bit of syntactic rat poison even means. And you can't parallelize this like you can with xargs.
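For the next time I forget: here's a minimal sketch of what that trailing bit does, using echo in front of rm so that nothing is actually deleted. The scratch directory and variable names are made up for the demonstration. find replaces {} with the paths it found; ending with + packs as many paths as will fit onto one command line, while ending with \; runs the command once per path.

```shell
#!/bin/sh
# Scratch tree with two matching files (hypothetical layout).
demo=$(mktemp -d)
mkdir -p "$demo/a" "$demo/b"
touch "$demo/a/bar.txt" "$demo/b/bar.txt"

# With +, find batches the matches into a single command, so echo runs
# once and prints both paths on one line.
batched=$(find "$demo" -name 'bar.txt' -exec echo rm {} + | wc -l | tr -d ' ')

# With \;, find runs the command once per match, so echo runs twice.
one_each=$(find "$demo" -name 'bar.txt' -exec echo rm {} \; | wc -l | tr -d ' ')

printf 'commands run with +: %s\n' "$batched"
printf 'commands run with \\;: %s\n' "$one_each"
```

Drop the echo once the printed commands look right.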

The command that actually handles the rename is xargs -P8 -n2 mv. It's the most complex of the set because of how xargs works. The last argument I passed to xargs is the command I actually want to run, which is mv to move files. I'm piping in a list of filenames that I've massaged with sed so that every "index.sh" file listed by find has an "index.txt" in the same path immediately afterward. The -n2 switch tells xargs to hand mv two arguments at a time, which means two lines at a time as long as none of the paths contain whitespace, so that each mv gets a source file and a destination. The -P8 switch instructs xargs to run the specified command in parallel, with up to eight processes running at a time since I've got a multi-core CPU.
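If you'd rather see what xargs is about to do before it does it, put echo in front of mv. What follows is a sketch under a few assumptions: it works on a scratch directory created with mktemp, not on my site, and the directory layout and variable names are invented for the demonstration.

```shell
#!/bin/sh
# Scratch tree standing in for the real site (hypothetical layout).
demo=$(mktemp -d)
mkdir -p "$demo/site/now" "$demo/site/fiction"
touch "$demo/site/now/index.sh" "$demo/site/fiction/index.sh"

# Dry run: echo prints each command instead of running it, giving one
# "mv source destination" line per pair of arguments.
preview=$(find "$demo/site" -name 'index.sh' \
        | sed -e 'p;s/index.sh/index.txt/' \
        | xargs -n2 echo mv)
echo "$preview"

# The real rename, still confined to the scratch tree.
find "$demo/site" -name 'index.sh' \
        | sed -e 'p;s/index.sh/index.txt/' \
        | xargs -P8 -n2 mv

# Count what is left under each name.
remaining=$(find "$demo/site" -name 'index.sh' | wc -l | tr -d ' ')
renamed=$(find "$demo/site" -name 'index.txt' | wc -l | tr -d ' ')
printf 'index.sh left: %s, index.txt present: %s\n' "$remaining" "$renamed"
```

This pairing falls apart if a path contains spaces or newlines, because xargs splits its input on whitespace. With GNU tools, find -print0, sed -z, and xargs -0 pass NUL-separated names instead, which sidesteps the problem.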

It bears mentioning that running 8 mv commands in parallel doesn't necessarily mean that the operation will run eight times faster. However, optimizing this task for maximum concurrency requires a greater understanding of Amdahl's Law than I currently possess, and I can't be bothered to look that up and run experiments with different parallelization values. I just want to move some files, so setting xargs -P to the number of cores in my machine's CPU is good enough for my purposes.
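If you don't want to hardcode the number of cores, you can ask the machine. This is a sketch, not something I benchmarked: nproc comes with GNU coreutils, and the getconf fallback is an assumption about what systems without nproc provide. echo stands in for mv so the demonstration is harmless.

```shell
#!/bin/sh
# Ask for the number of available cores; fall back to getconf, then to 1,
# on systems that lack nproc.
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
printf 'running up to %s processes at a time\n' "$cores"

# Four made-up arguments become two commands of two arguments each.
# With more than one process running, the output lines can arrive
# in either order.
printf '%s\n' a b c d | xargs -P"$cores" -n2 echo pair:
```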

You will notice that I have not included sample output for the full set of commands. This is intentional. The full sequence provided no output, which is a sign that everything worked as intended.

On a UNIX system, shell commands generally do not return any output if they succeed. The principle "no news is good news" applies here. Or, since I'm using GNU coreutils, perhaps it ought to be "no GNUs is good GNUs?"
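If silence makes you nervous, the exit status is where a command reports how things went: zero means success. Here's a small sketch using a scratch file; the file and variable names are made up.

```shell
#!/bin/sh
# One scratch file to rename.
demo=$(mktemp -d)
touch "$demo/index.sh"

# Capture anything mv prints (including errors) and its exit status.
output=$(mv "$demo/index.sh" "$demo/index.txt" 2>&1)
status=$?

printf 'output: [%s], exit status: %s\n' "$output" "$status"
```

A successful mv prints nothing, so the brackets come out empty and the exit status is 0.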

Why Not Use Perl's `rename` Utility?

Simple answer: I didn't have it installed. I did have find, sed, xargs, and mv installed, because this is a MEWNIX system and my cat knows this.

When I installed rename so that I could write this part, the first thing I did was read the man page. While the Perl rename utility has interactive and dry-run modes, if you want to use it recursively you've still got to pipe output from find into it and deal with regular expressions.

Even after reading the manual page, I still have no idea whether Perl rename does concurrency, so I'll continue to do it my way.