Using GitHub as a (bad) blog platform

I finally started a new blog, thanks to @ratsclub's offer of free access to capivaras.dev. But considering how small this blog platform is supposed to be, I want at least somewhere to keep a backup of my posts. I know Mataroa, the blog platform that capivaras.dev runs, has automatic e-mail backups, but I want something more reliable.

I am writing all my posts as Markdown files (the format that Mataroa supports) inside neovim anyway, so why not store all of them in Git? So this is what I did: I now have an unofficial mirror on GitHub.

While I am here, why not overcomplicate things? Can I make a usable blog platform out of GitHub? And by that I don't mean GitHub Pages; I mean the repository itself. It already renders Markdown files by default, so no need to do anything in that space. To reach feature parity with capivaras.dev, I only need an index and an RSS feed (since comments are not supported anyway). No need for a newsletter either, since GitHub already has a watch feature.

After a couple of hours hacking on a Python script, you can see the result of this monstrosity here. The script, called gen_blog.py, is available in the same repository (here is a permalink). It automatically generates an index in README.md with each blog post, and an rss.xml file at the root of the repository.

Instead of trying to explain the code, I am going to explain the general idea, because if you want to replicate it, I think it is better to rewrite it in a way that you understand. It shouldn't take more than 2 hours in any decent programming language. But if you really want to use it, the script itself is licensed under the WTFPL. The code only uses Python 3's standard library and should work in any relatively recent version (anything newer than 3.9).

So the idea is basically to organise the repository and the Markdown files in a way that makes them trivial to parse deterministically. For example, my repository is organised in the following way:

```
.
├── 2024-07-26
│   ├── 01-writing-nixos-tests-for-fun-and-profit.md
│   └── 02-using-github-as-a-bad-blog-platform.md <- this file
├── gen_blog.py
├── README.md
└── rss.xml
```

Each day that has a new blog post gets its own directory. This is nice because posts may include extra files besides the Markdown itself, e.g. images, and this layout makes it trivial to keep everything organised.

Each post has its own Markdown file. I put a two-digit number before each post name to ensure that, when publishing multiple posts on the same day, they stay in publishing order. But if you don't care about that, you can name the files whatever you want.
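
As a quick illustration of why the two-digit prefix matters: the scraper below sorts file names lexicographically, so the prefix alone keeps the publishing order (the file names here are the ones from the tree above):

```python
names = [
    "01-writing-nixos-tests-for-fun-and-profit.md",
    "02-using-github-as-a-bad-blog-platform.md",
]
# Sorting in reverse puts the most recently published post first
print(sorted(names, reverse=True))
# ['02-using-github-as-a-bad-blog-platform.md',
#  '01-writing-nixos-tests-for-fun-and-profit.md']
```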

Also, I am assuming that each Markdown file starts with a header beginning with `# `, and that header is the title of the blog post.
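
For example, this is roughly what the scraper expects at the top of each file; the regex is the same one used below, and the post content in the string is just an illustration:

```python
import re

text = "# Using GitHub as a (bad) blog platform\n\nI finally started a new blog..."
m = re.match(r"# (?P<title>.*)\r?\n", text)
print(m.group("title"))  # Using GitHub as a (bad) blog platform
```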

Using the above organisation, I have this function that scrapes the repository and collects the necessary information to generate the index and RSS files:

```python
import re
import sys
from collections import defaultdict
from datetime import datetime
from pathlib import Path


def grab_posts(pwd: Path):
    posts = defaultdict(list)

    for dir in sorted(pwd.iterdir(), reverse=True):
        # Ignore non-directories or hidden files
        if not dir.is_dir() or dir.name[0] == ".":
            continue

        # Try to parse a date from the directory name
        try:
            date = datetime.strptime(dir.name, "%Y-%m-%d")
        except ValueError:
            print(f"WARN: ignoring non-date directory: {dir}", file=sys.stderr)
            continue

        # Iterate over the files in the date directory
        for post in sorted(dir.iterdir(), reverse=True):
            # Ignore non-Markdown files or hidden files (drafts)
            if not post.suffix == ".md" or post.name[0] == ".":
                continue

            # Grab the first H1 section to parse as the title
            text = post.read_text()
            mTitle = re.match(r"# (?P<title>.*)\r?\n", text)
            if mTitle and (title := mTitle.groupdict().get("title")):
                posts[date].append({"title": title, "file": str(post)})
            else:
                print(f"WARN: did not find title for file: {post}", file=sys.stderr)

    return posts
```

Some interesting tidbits: if a Markdown file name starts with a `.`, I assume it is a draft post and the scraper ignores it. I added a bunch of WARN prints to make sure that future me doesn't do anything dumb. Also, I sort in reverse, since reverse chronological order is what most people expect from blogs (i.e. the most recent posts at the top).
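
For illustration, given the repository layout shown earlier, the resulting dictionary looks roughly like this (a sketch; the titles and paths are reconstructed from the tree above and assume the script runs at the repository root):

```python
{
    datetime(2024, 7, 26): [
        {"title": "Using GitHub as a (bad) blog platform",
         "file": "2024-07-26/02-using-github-as-a-bad-blog-platform.md"},
        {"title": "Writing NixOS tests for fun and profit",
         "file": "2024-07-26/01-writing-nixos-tests-for-fun-and-profit.md"},
    ],
}
```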

After running the function above, I have a resulting dictionary that I can use to generate both the README.md index and the RSS feed:

```python
import os
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def gen_readme(posts):
    titles = []

    for date, dayPosts in posts.items():
        for post in dayPosts:
            # This creates a relative link to the Markdown file, e.g.:
            # ./2024-07-26/02-using-github-as-a-bad-blog-platform.md
            link = os.path.join(".", post["file"])
            # This formats the title, e.g.:
            # - [Using GitHub as a (bad) blog platform](./2024-07-26/02-using-github-as-a-bad-blog-platform.md) - 2024-07-26
            title = date.strftime(f"- [{post['title']}]({link}) - %Y-%m-%d")
            # This appends to the list to generate the content later
            titles.append(title)

    # README_TEMPLATE is a string with the static part of the README
    print(README_TEMPLATE.format(posts="\n".join(titles)))


def gen_rss(posts):
    # Got most of the specification from here:
    # https://www.w3schools.com/XML/xml_rss.asp
    rss = ET.Element("rss", version="2.0")

    # Here is the RSS metadata for the blog itself
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "kokada's blog"
    ET.SubElement(channel, "link").text = "https://github.com/thiagokokada/blog"
    ET.SubElement(channel, "description").text = "dd if=/dev/urandom of=/dev/brain0"

    # Create one item for each blog post
    for date, dayPosts in posts.items():
        for post in dayPosts:
            item = ET.SubElement(channel, "item")
            # RSS_POST_LINK_PREFIX is the base URL where the posts live
            link = urljoin(RSS_POST_LINK_PREFIX, post["file"])
            ET.SubElement(item, "title").text = post["title"]
            ET.SubElement(item, "guid").text = link
            ET.SubElement(item, "link").text = link
            ET.SubElement(item, "pubDate").text = date.strftime("%a, %d %b %Y %H:%M:%S GMT")

    # Generate the XML, indent it and write it to disk
    tree = ET.ElementTree(rss)
    ET.indent(tree, space="\t", level=0)
    tree.write("rss.xml", xml_declaration=True, encoding="UTF-8")
```
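
To tie it together, the entry point of the script only needs to glue these functions. This is a minimal sketch; the real gen_blog.py may wire things differently:

```python
if __name__ == "__main__":
    # Scrape the posts from the current directory (run at the repo root)
    posts = grab_posts(Path("."))
    gen_readme(posts)  # prints the index to stdout (redirect to README.md)
    gen_rss(posts)  # writes rss.xml in the current directory
```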

To publish a new post, I basically write a Markdown file, run `./gen_blog.py > README.md` at the root of the repository, and see the magic happen.

It works much better than I initially anticipated. The README.md is properly populated with the titles and links. The RSS feed is kind of empty since it has no descriptions, but it seems to work fine (at least in Inoreader, my RSS reader of choice). I can probably fill the post descriptions with more information if I really want, but it is enough for now (update: it is working now; you just need to render the Markdown as HTML and escape the tags; permalink for the updated script). Not sure who is so interested in my writing that they would want to use this RSS feed instead of the one available in capivaras.dev anyway.
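
For the curious, the description generation from that update works along these lines. This is only a sketch: the markdown package is a third-party dependency (one reason nix-shell entered the picture), and the helper name here is hypothetical:

```python
import markdown  # third-party Markdown renderer; an assumption, not stdlib

def gen_description(post_file):
    # Render the post body as HTML; ElementTree then escapes the tags
    # automatically when the string is assigned to the element's text
    return markdown.markdown(Path(post_file).read_text())

# Inside gen_rss's loop, hypothetically:
# ET.SubElement(item, "description").text = gen_description(post["file"])
```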

Also, while I am using GitHub here, the same idea would work on GitLab, Gitea, sr.ht or whatever. As long as your code host renders Markdown files, it should work.

So that is it. I am not saying this is a good idea for your primary blog platform, and I still prefer to publish on a platform that doesn't track users or ship tons of JavaScript. But if you want a backup of your posts and you are already writing Markdown anyway, well, there are worse ways to do it, I think.

Update: I rewrote the script in Go (permalink). The reason is that once I started rendering Markdown (for the descriptions), the Python version got quite slow (not the fault of Python itself, but mostly of using nix-shell to manage dependencies, something the Go version doesn't need). The rewrite took about half an hour, showing how easy it is to do the same.