Use Google Fonts for Machine Learning (Part 2: Filter & Generate)

Quickly generate hundreds of font images.

Machine learning training time-lapse
Machine learning training time-lapse
Machine learning training time-lapse. Image by Author.

Directory Setup

Again, I am treating this post as a Jupyter Notebook, so if you want to follow along, open Jupyter Notebook in your own environment. You will also need to download Google Fonts Github repo and generate the CSV annotations. We went over all of these in the last post, so please check it out first.

.
├── input
│ ├── fonts
│ │ ...
│ │ ├── ofl
│ │ ...
│ ├── fonts-master.json
└── notebook
└── google-fonts-ml
├── google-fonts-ml.ipynb
└── google-fonts-annotaion.csv

Load CSV Annotations

First, we will load the CSV annotation as pandas.DataFrame.

Look at Data Summary

Let’s have a look at what we are dealing with here — how many fonts are within each category, how many fonts have which variants, etc. We first filter the list of fonts using a regular expression to select the columns we want to look at. Find all the rows with the value of 1, sum all these rows, and sort by frequency.

Filtering Fonts

Now, let’s create conditions to create sub-groups of fonts. Because we took the time to create a nice CSV annotation it is easy to filter fonts according to our own criteria. Say, how many fonts that are thin, regular and thai?

Some Caveats

Before we move on, I want to point out some issues that I found. We are relying on the font data from the CSV annotations that we generated from the Google Fonts Developer API, but there are some discrepancies.

  • Some fonts that are not indicated as latin still have latin alphabets.
  • NotoSansTC is present in the annotation but is not included in the zip archive. Probably because some languages such as Chinese or Korean have a lot of characters and font files get much bigger compared to latin fonts. It’s just my guess…
  • A font called Ballet — the font file exists but it’s not in the annotation. There may be a few other cases like this.
  • The name of the folder containing font files is mostly the same as the name of the fonts themselves but there are some exceptions. For example, a font called SignikaNegative is inside a folder called signika.
  • A font called BlackHanSans is obviously a very thick black weight, but it is listed as regular weight.

Load Font File Path

We will use glob to load the fonts path. If you only want to filter fonts through weights such as Regular, Bold, Thin, simple filename checking will be enough as the weights/styles are already part of the file names, but you won't be able to use other annotations we looked at such as language subsets or category.

number of font files in total:  8708

Filter Fonts

It is time to bring different pieces together into a single function that will take a DataFrame object and the path to font files, and then use the filters we define to return a list of filtered font paths.

  • Some variable fonts are not selected (ex. CrimsonPro, HeptaSlab) because they have different naming conventions than others. I just ignored them because they are not that many, and the variable fonts that have static versions are searchable through the filters.

Preview Font Images

It is finally time to see the fonts displayed as images.

It’s working! Images by Author.

Everything Pieced Together

Here is the code block that pieces everything together. This time, all the images will be saved as PNG files.

Conclusion

Now, you can use the generated images as inputs to Convolutional Neural Networks such as Generative Adversarial Networks or Autoencoders. That is what I am going to be playing the images with. I am sure you can find examples of the model architectures pretty easily on Medium and elsewhere.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store