“I am not a smart man” -Forrest Gump -Michael Scott
I had to reference my own website recently. I had a machine that needed an ext4 filesystem resized, and I hadn’t done it in a while. I noticed while reading that there were some special characters left when I migrated from WordPress to Hugo. Mostly in the code blocks. But the wordpress-to-hugo-exporter was skipping some unicode characters.
It’s not the biggest deal, but readability is degraded, to put it nicely. It looked like doo-doo…
Let’s completely forget about my numerous spelling errors though, shall we? Don’t worry, dear reader. My wife makes sure to let me know.
her: “I read your latest post. God, you can’t spell…”
me: “Oh really? What did I mess up?”
her: “I don’t remember, but there were a couple obvious ones.”
me: “Thanks, that was very helpful…”
I like to use Pulsar as my editor (formerly known as Atom, similar to VS-Code but open source). I got through about 5x old posts of find/replace. In every single post there were at least 5 different characters. Mainly quotations, ellipses, and dashes. But they like to show up multiple ways. Opening quotes are unicode character 8220, closing quotes are 8221. Sometimes I talk about building materials, so inches show up as 8243. Sometimes things showed up as an en dash, or an em dash. When calling out a 2x4 piece of lumber, there’s a special “x” character that’s different than a regular old ASCII x for some godawful reason.
So I search for unicode of any type in a post with “&#” then see what pops up. Finish by typing in the character number and semicolon to close, and “replace all”. Maybe there are 2x of each type in each post. It gets real old, real fast.
See, this is ok. For about 5 minutes.
Let’s be smart about this though
We’re going to use the Linux stream editor utility, a.k.a. sed. Sed is a great utility that I should use more. Feed it a data stream, and it will look for one string that you feed it, then replace it with a second string that you feed it.
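Here's a minimal sketch of that "feed it a stream" idea, using a made-up string instead of a real file. The variable names are just for illustration. The only difference between the two commands is the trailing g, which is explained right after this.

```shell
# Made-up sample text; nothing here touches a real file.
sample='2x4 and 2x6'

# sed reads the stream from the pipe and writes the edited copy to stdout.
# Without the trailing g, only the first match on each line is replaced.
first_only=$(printf '%s\n' "$sample" | sed -e 's/x/ by /')

# With the trailing g, every match on the line is replaced.
every_match=$(printf '%s\n' "$sample" | sed -e 's/x/ by /g')

echo "$first_only"
echo "$every_match"
```

The first echo should leave the second "x" alone, and the second echo should replace both.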
The general format is sed -e "s/original/replacement/g" [input file]. S for substitution. G for global (replace all instances, not just the first one you find). You can feed it multiple expressions, like so:
$ sed -e 's/&#8230;/.../g' -e 's/&#8220;/"/g' -e 's/&#8221;/"/g' -e 's/&#8211;/-/g' -e 's/&#8243;/"/g' somefile.txt
There are a few problems with this.
1 - It’s kind of messy to look at. We can just create a file to do this instead. I’m calling mine “replacements.sed”. It’s a little easier to read in the following format.
s/&#8230;/.../g
s/&#8220;/"/g
s/&#8221;/"/g
s/&#8211;/-/g
s/&#8243;/"/g
2 - It lives in bash history, so you could go back and find it. But who says you’ll grab the right one if you’ve used multiple versions of it? Making a file fixes this as well. That’s what the -f option of sed will do. Just keep a copy somewhere in your /home directory and name it something obvious.
3 - By default, sed only prints the edited text out to the terminal and leaves the file itself alone. We’re adding the -i option to have it edit “in-place”.
4 - If you’re working with quotes, which I am, you’ll need to watch the syntax on the command line. If the expression starts with a double quote, the shell treats the next double quote it sees as the end of the string, so sed gets handed a chopped-up expression (and will probably throw errors for a malformed expression). You can work around it by wrapping the expression in single quotes, escaping the inner quote, etc. But just putting it in a file fixes this as well. The shell never parses what’s inside the file, and since the expressions are all on their own lines, sed just uses the newline character to separate them.
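The four points above can be tried out on a throwaway file before going anywhere near real posts. This is only a sketch: the scratch directory, the sample sentence, and the file names are all made up, and it assumes GNU sed as found on most Linux installs. It uses the codes for the curly quotes (8220, 8221) and the ellipsis (8230).

```shell
# Scratch directory so no real posts are touched (all names here are made up).
workdir=$(mktemp -d)
post="$workdir/sample-post.md"
script="$workdir/replacements.sed"

# A fake post containing three numeric entities.
printf '%s\n' 'She said &#8220;wait&#8230;&#8221;' > "$post"

# Quoted heredoc ('EOF') so the shell passes every character through untouched,
# double quotes included. One sed expression per line.
cat > "$script" <<'EOF'
s/&#8230;/.../g
s/&#8220;/"/g
s/&#8221;/"/g
EOF

# Without -i, sed only prints the edited text; the file on disk is unchanged.
preview=$(sed -f "$script" "$post")
untouched=$(cat "$post")

# With -i, the file itself is rewritten.
sed -i -f "$script" "$post"
edited=$(cat "$post")

echo "$preview"
```

Running it without -i first is a cheap way to preview what will change. On macOS and other BSD-style sed builds, -i expects a backup suffix argument, so the in-place line would need adjusting there.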
Now I just run it. Test this on a single file first if you haven’t done this before. And back that file up if it’s important. I have all of my blog posts in one folder for Hugo to build, and they’re all in markdown format. Modify your input file(s) portion at the end of the command as required.
$ sed -i -f replacements.sed ./*.md
There is no output from running this. So go check to see if it worked.
$ grep -e "&#" ./*.md
./2024-05-09-hdd-passthrough-in-proxmox.md:# find /dev/disk/by-id/ -type l|xargs -I{} ls -l {}|grep -v -E '&#91;0-9]$' |sort -k11|cut -d' ' -f9,10,11,12
./2024-07-01-getting-joplin-server-to-work-behind-nginx.md:# For Example: http://&#91;hostname]:22300. The base URL can include the port.
./2024-07-01-getting-joplin-server-to-work-behind-nginx.md:# listen &#91;::]:443 ssl ipv6only=on;
./2024-07-01-getting-joplin-server-to-work-behind-nginx.md: listen &#91;::]:80;
./2024-11-06-new-house-2-window-film.md:They say to keep them at least 1" larger than needed in each dimension. I found that 3-4" extra is good for something this large. Small deviations in angle will end up as a large runout along the 6&#8242; length I'm dealing with.
./2024-11-18-new-house-3-modeling-and-ideas.md:There are many free options with browser access. But they limit the number of projects you can have, limit the number of floors, require logging in, have in-app purchases, etc. Or they limit the resolution of renders and plans, so the work you've done is essentially useless when you get a 640&#215;480 .pdf output and just have to recreate it all somewhere else.
./2024-11-18-new-house-3-modeling-and-ideas.md:I would like to do 10&#8242; walls here, to have a nice 8&#8242; tall garage door. I can't maintain the same slope as the existing roof, but can go shallower and still have plenty of margin for shingles. And I'm trying to maintain a short strip of siding under the existing roof to avoid trying to get the shingle spacing to line up, or re-roof the entire house that was just done 3 years ago.
Ok. I missed a few. 91 is an opening square bracket. 8242 is a prime mark, which looks like a single quote and is used there as units for feet. And 215 is the x I mentioned earlier when talking about dimensions.
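Rather than eyeballing whole lines of grep output, it may be easier to pull out just the leftover codes and count them. A rough sketch, run against made-up scratch files standing in for the posts folder (the file names and contents are invented for the example):

```shell
# Scratch files standing in for a posts folder (names and contents are made up).
workdir=$(mktemp -d)
printf '%s\n' 'a 6&#8242; length and a 10&#8242; wall' > "$workdir/one.md"
printf '%s\n' 'a 640&#215;480 render' > "$workdir/two.md"

# -o prints only the matched text and -h drops the filename prefix.
# sort | uniq -c tallies each entity, and sort -rn puts the most common first.
leftovers=$(grep -oh '&#[0-9][0-9]*;' "$workdir"/*.md | sort | uniq -c | sort -rn)

echo "$leftovers"
```

Each line of the result is a count followed by one entity, which makes it obvious which codes still need a line in the sed file. Point the glob at ./*.md to use it on a real folder.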
Add those to my file, and run it again. I’m using echo to append them to the tail end of the file. Use the -e flag to let echo interpret backslash escapes, and put a newline character (\n) between the expressions. Or just use your text editor of choice.
$ echo -e "s/&#8242;/'/g\ns/&#215;/x/g\ns/&#91;/[/g" >> ./replacements.sed
$ cat ./replacements.sed
s/&#8230;/.../g
s/&#8220;/"/g
s/&#8221;/"/g
s/&#8211;/-/g
s/&#8243;/"/g
s/&#8242;/'/g
s/&#215;/x/g
s/&#91;/[/g
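One caveat on the echo approach: how echo treats -e and backslashes varies between shells, so if it ever misbehaves, printf is a more predictable way to append lines. A small sketch, writing the three extra expressions for codes 8242, 215, and 91 to a scratch file (the path and the test line are made up) and trying them out:

```shell
# Scratch file standing in for replacements.sed (the path is made up).
script=$(mktemp)

# printf repeats the '%s\n' format once per argument, so each expression
# lands on its own line without relying on echo's -e handling.
printf '%s\n' "s/&#8242;/'/g" 's/&#215;/x/g' 's/&#91;/[/g' >> "$script"

# Try the appended expressions on a made-up line.
result=$(printf '%s\n' '10&#8242; wall, 640&#215;480, &#91;::]:80' | sed -f "$script")

echo "$result"
```

The first expression is wrapped in double quotes only because it contains a single quote; the other two can stay in single quotes.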
Conclusion
I’m not going to run it all again and show the output. It worked. If you’re this far in, you should be able to hold your own by using what I provided.
At some point soon I’m going to tackle e-mail with mailcow or something. So I will have a webmaster e-mail address for people to send their comments/corrections to. That is about the only thing so far that I’ve missed about WordPress - the ability to easily use comments to get feedback from people periodically.