Back in mid-2019, we made our first attempt at building a model for generating Schlager lyrics. For those of you who are not familiar with Schlager, it is a style of European pop music that is especially popular in Germany. You can normally find it at Oktoberfest, in Après-Ski bars, and on the beaches of Majorca. It is fun, catchy, and often has simple lyrics covering themes of love, partying, and mainstream current events. During our first SchlagerAI project, we spent less time on the model itself than on building a training, evaluation, and deployment pipeline. In our most recent discovery session, we wanted to revisit our work on SchlagerAI and integrate current advancements in Natural Language Generation (NLG).
More specifically, we leveraged open-sourced, pre-trained language models published on HuggingFace’s model repository in order to generate Schlager songs in the German language. In the following sections, I will walk you through the process of building your own lyrics generation model.
Step 1: Find a Pre-Trained Model
In recent years, the Transformers library by HuggingFace has gained a lot of popularity for its open-sourced implementations of state-of-the-art Natural Language Processing (NLP) architectures. In addition, HuggingFace provides a hub for sharing model weights and data sets. This has made building NLP applications that apply recent advancements more accessible to those in industry and research alike.
Due to the massive amount of data and compute power required to adequately train a language model, we decided to search the model hub for implementations of text generation models trained on a German corpus. Luckily, the model hub provides filters for tasks (Text Generation) and languages (de), making it quite easy to narrow down our search field.
After exploring our options, we decided to go with the dbmdz/german-gpt-2 model provided by the Digital Library team of the Bavarian State Library’s Munich Digitization Center (DBMDZ). Since we chose to work with the aitextgen library (see Steps 4 and 5), we needed a GPT-2 style model, which is the kind of model aitextgen is designed to work with. The model weights provided by DBMDZ were produced by training on “a recent Wikipedia dump, EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl, and News Crawl.” The data set consists of 16GB of data and has over 2 billion tokens.
This language model acts as the starting point for building our SchlagerAI model. When prompting the dbmdz/german-gpt-2 model for text, the output does not yet look much like a Schlager song.
Clearly, we need to do some work to make the generated text look and sound like a song. In the next steps, I will walk you through the process of tuning the model to generate text which could pass as a song.
… but What Is a Language Model?
In its simplest form, a language model is just a method for producing a probability distribution over a sequence of tokens. A token can be a word, a subword, or even just a character. When used for NLG, a language model will take as input a sequence of tokens and output a probability distribution for the next token(s). You can then leverage this distribution for stochastically generating text (see Step 5).
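To make the idea concrete, here is a minimal sketch of the smallest language model I can think of: a bigram model that only looks at the previous word. It is a toy for illustration and has nothing to do with how GPT-2 works internally; the corpus and the function names are made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_distribution(counts, prev):
    """Return P(next token | previous token) as a dict that sums to 1."""
    followers = counts[prev]
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()}


# Toy corpus (made up for this example), tokenized on whitespace
corpus = "ich liebe dich und ich liebe das leben und ich tanze".split()
model = train_bigram_model(corpus)

# "ich" is followed by "liebe" twice and by "tanze" once
dist = next_token_distribution(model, "ich")
print(dist)
```

A transformer does the same job with a far richer view of the context: it conditions on the whole preceding sequence instead of a single word.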
Current state-of-the-art approaches for creating language models leverage the transformer architecture (examples include the BERT and GPT architectures and their derivative works). For more information about how they work, check out this illustrated blog by Jay Alammar which provides an exceptionally good visual introduction to the topic.
Step 2: Build a Fine-Tuning Data Set
In order to make our output more Schlager-like, we need to gather a large set of lyrics from Schlager songs to help fine-tune our model. However, if you do not already have in mind the complete discography of the top Schlager hits from the past decades, it is going to be difficult to manually generate a list of songs large enough to work as a data set. To handle this, we turn to Spotify and Genius for help.
Spotify provides developers with a Web API that allows us to automatically scrape information about playlists, artists, and songs. Using spotipy (a Python wrapper around the developer API), we can build a scraper that generates a set of Schlager artists, which we then pass to Genius to get the lyrics of their top songs using LyricsGenius. The following steps walk you through generating your own lyrics data set.
A. Register for Spotify Web API
B. In a Python Script, Initialize the Spotify API
C. Find a Playlist on Spotify and Get the Playlist ID
In the Spotify web player, navigate to the playlist that you want to scrape. The playlist ID is the last segment of the URL: open.spotify.com/playlist/<playlist_id>
D. Scrape the Songs and Artists From a Playlist
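Steps B and D could look roughly like the sketch below. It assumes the client credentials from step A are available in the SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET environment variables, and the function names are my own for this example. Treat it as one possible way to do it, not as the exact scraper we used.

```python
def make_spotify_client():
    """Create a spotipy client using the Client Credentials flow.

    Assumption: SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET are set in the
    environment, which is where SpotifyClientCredentials looks by default.
    """
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    return spotipy.Spotify(auth_manager=SpotifyClientCredentials())


def playlist_artists(sp, playlist_id):
    """Return the sorted, de-duplicated artist names found on a playlist."""
    artists = set()
    page = sp.playlist_items(playlist_id)
    while page:
        for item in page["items"]:
            track = item.get("track")
            if not track:  # removed or unavailable tracks can come back empty
                continue
            for artist in track["artists"]:
                artists.add(artist["name"])
        # Results are paginated; "next" is empty on the last page
        page = sp.next(page) if page.get("next") else None
    return sorted(artists)


# Usage (needs network access and valid credentials):
#   sp = make_spotify_client()
#   print(playlist_artists(sp, "<playlist_id>"))
```

Collecting artists into a set means that an artist with ten songs on the playlist is only passed to Genius once.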
E. Register for Genius API
Register to get your access token at https://genius.com/api-clients
G. Scrape Songs for an Artist
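A minimal sketch of this step with LyricsGenius follows. The output keys (artist, title, lyrics), the file naming, and the max_songs value are assumptions made for this example, so the JSON written by the library itself may look different.

```python
import json
import re
from pathlib import Path


def make_genius_client(access_token):
    """Create a LyricsGenius client from the access token of step E."""
    import lyricsgenius

    return lyricsgenius.Genius(access_token)


def slug(name):
    """Turn an artist name into a file-system friendly string."""
    return re.sub(r"\W+", "_", name).strip("_").lower()


def scrape_artist(genius, artist_name, out_dir="data/raw", max_songs=20):
    """Save the most popular songs of one artist as one JSON file per song."""
    artist = genius.search_artist(artist_name, max_songs=max_songs, sort="popularity")
    if artist is None:  # nothing found for this name
        return []
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for i, song in enumerate(artist.songs):
        # Hypothetical record layout, chosen for this example
        record = {"artist": artist_name, "title": song.title, "lyrics": song.lyrics}
        target = out_path / f"{slug(artist_name)}_{i:03d}.json"
        target.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
        written.append(target)
    return written


# Usage (needs network access and a valid token):
#   genius = make_genius_client("<access_token>")
#   for name in ["<artist name>"]:
#       scrape_artist(genius, name)
```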
In the end, we have a data set under data/raw consisting of JSON files in the following format (plus other unnecessary keys):
… but What Makes a Good Data Set?
- Relevance. Lyrics should be from German Schlager songs. Of course, some Schlager does include lines/phrases in other languages, so it is not the worst thing if there are some songs with English/another language. Nevertheless, you do not want to include lyrics from other genres like German metal or Swedish folk music, as they aren’t representative of Schlager lyrics.
- Quantity. In the ML field, there is a rough rule of thumb stating that your model should train on at least an order of magnitude more examples than it has trainable parameters. For our problem, this is challenging, as our dbmdz/german-gpt-2 model has over 125 million trainable parameters. Luckily, since the model is already trained on a data set of over 2 billion tokens, our fine-tuning data set can be orders of magnitude smaller, provided we freeze enough parameters in our model (see Step 4 for an intro to fine-tuning). However, assuming you follow the other data set quality rules, a larger data set is not going to harm the performance of the final model.
- Breadth. Only having songs from one specific artist would be interesting if we were building a model for a single artist. But this project is called SchlagerAI, not HeleneFischerAI. Therefore, we need examples from many different artists in order to imitate the industry as a whole.
Step 3: Clean Our Data Set
Now that we have the lyrics to all the top Schlager songs, we need to do some cleaning of our data set to ensure high-quality results.
A. JSON to Text
Our models cannot take raw JSON as input, so we need to apply some transformations to make the data more usable.
Save all lyrics to a single text file: python clean.py > data/clean/lyrics.txt
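A clean.py along these lines would do the job. This is a sketch under two assumptions that are mine and not taken from our repository: each JSON file stores the song text under a "lyrics" key, and songs are separated by a blank line in the output.

```python
import json
import sys
from pathlib import Path


def iter_lyrics(raw_dir):
    """Yield the lyrics of every JSON file in raw_dir, skipping empty ones."""
    raw = Path(raw_dir)
    if not raw.is_dir():
        return
    for path in sorted(raw.glob("*.json")):
        with path.open(encoding="utf-8") as f:
            song = json.load(f)
        # Assumption: the song text lives under the "lyrics" key
        lyrics = (song.get("lyrics") or "").strip()
        if lyrics:
            yield lyrics


def main(raw_dir="data/raw", out=sys.stdout):
    """Write all lyrics to `out`, separated by a blank line."""
    for lyrics in iter_lyrics(raw_dir):
        out.write(lyrics + "\n\n")


if __name__ == "__main__":
    main()
```

Writing to standard output keeps the script composable: the shell redirect in the command above decides where the file ends up.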
A song will look like the following:
B. Clean Tags
As can be seen in the above song, Genius will sometimes add headers at the beginning of each section of the song, indicating if it is a verse, refrain, bridge, etc. However, Genius is not consistent in the tagging of different sections. For example, some songs use [Strophe] and others use [Verse]. As a result, it is more difficult for the model to learn the connection between different sections. Therefore, we look to standardize the different tags using regex substitutions.
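As an illustration of the idea, the following sketch maps a few tag variants onto one canonical tag each. The patterns are examples I picked for this post and not our full rule set. Note that they also drop anything after the tag name (such as a verse number or a singer's name), which is a simplification you may not want.

```python
import re

# (pattern, canonical tag) pairs; illustrative examples only
TAG_RULES = [
    (re.compile(r"\[(?:Pre-Refrain|Pre-Chorus)\b[^\]]*\]", re.IGNORECASE), "[Pre-Refrain]"),
    (re.compile(r"\[(?:Strophe|Verse)\b[^\]]*\]", re.IGNORECASE), "[Verse]"),
    (re.compile(r"\[(?:Refrain|Chorus)\b[^\]]*\]", re.IGNORECASE), "[Refrain]"),
]


def standardize_tags(text):
    """Replace every known tag variant with its canonical form."""
    for pattern, replacement in TAG_RULES:
        text = pattern.sub(replacement, text)
    return text


raw = "[Strophe 1]\nla la\n[Pre-Chorus]\noh oh\n[Chorus]\nla la la"
print(standardize_tags(raw))
```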
We use many more regular expressions than are shown here. See our source code for the full set of examples.
C. Remove Lyric Headings
Genius will sometimes add Lyrics to "Song title" to the start of a song. We don’t want that in our training data, so we also use regex to remove such instances.
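A sketch of such a substitution is shown below. The pattern assumes the heading sits on its own line and wraps the title in straight or typographic quotes. That is my reading of the format described above, so adjust it to whatever your raw data actually contains.

```python
import re

# Matches a whole line such as: Lyrics to "Song title"
HEADING = re.compile(r'^[ \t]*Lyrics to ["“„][^\n]*?["”“][ \t]*\n?', re.MULTILINE)


def remove_lyric_headings(text):
    """Drop every heading line of the form: Lyrics to "Song title"."""
    return HEADING.sub("", text)


song = 'Lyrics to "Song title"\n[Verse]\nla la la'
print(remove_lyric_headings(song))
```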
We now have a clean data set which can be used in the next step for fine-tuning the model.
… but Why Do We Need to Clean the Data Set?
Raw data that you collect from external sources is not always going to be immediately usable by a model. Normally you must do some preprocessing to get it into the standard format expected by the model. For example, most models only work on real-valued inputs. Furthermore, you can make it easier for the model to learn from your data set if you remove some of the variance/noise in the data before passing it through the system.
For example, by standardizing the song tags in the above cleaning steps, we were able to use our domain expertise to make it easier for the model to learn that [Strophe] and [Verse] or [Refrain] and [Chorus] are referring to similar concepts. Instead of first having to learn the connection between [Strophe] and [Verse] and then connecting the different verses together, the model could focus on learning the structure of different verses.
Step 4: Fine-Tune the Model Using Our Data Set
Now that we have a high-quality data set of Schlager lyrics, we can finally start to fine-tune our model. For this process, we turn to the aitextgen library, which is built upon HuggingFace Transformers, PyTorch, and PyTorch-Lightning.
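A fine-tuning run could be set up roughly as follows. The parameter names follow the aitextgen train() signature as I understand it, so please check them against the version you have installed. The concrete values (number of steps, number of frozen layers, output folder) are placeholders for this example and not the settings we ended up with.

```python
def load_base_model(model_name="dbmdz/german-gpt-2"):
    """Download the pre-trained model from the HuggingFace hub via aitextgen."""
    from aitextgen import aitextgen

    return aitextgen(model=model_name)


def make_logger(log_dir="logs", name="schlager_ai"):
    """Create a TensorBoard logger so that experiments can be compared."""
    from pytorch_lightning.loggers import TensorBoardLogger

    return TensorBoardLogger(log_dir, name=name)


def fine_tune(ai, train_file="data/clean/lyrics.txt", num_steps=2000,
              num_layers_freeze=8, loggers=None):
    """Fine-tune `ai` on the lyrics file while keeping the lower layers frozen."""
    ai.train(
        train_file,
        output_dir="trained_model",  # placeholder folder for the result
        num_steps=num_steps,  # placeholder value, tune for your data set
        freeze_layers=True,
        num_layers_freeze=num_layers_freeze,  # must be below the layer count
        loggers=loggers,
    )


# Usage (downloads the model and needs a GPU to finish in reasonable time):
#   ai = load_base_model()
#   fine_tune(ai, loggers=[make_logger()])
```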
Depending on your fine-tuning data set, you may need to play around with the hyperparameters such as the number of steps, number of layers to freeze, etc. Use the TensorBoardLogger to track the different experiments and keep the model which performs the best.
… but What Is Fine-Tuning?
Training a model from scratch using only Schlager lyrics is challenging, especially since German is a complicated language with many different syntax and grammar rules. You need to expose your model to billions of examples of the German language for it to learn all these rules. However, a data set consisting only of German Schlager will not give you enough examples to work with, and the variety may not be enough for the model to pick up the more subtle nuances of the language. Luckily, there is an approach called transfer learning which can help with this problem.
The idea of transfer learning is that we first train a model on a more general, all-purpose data set from a wide variety of sources. In our case, this helps the language model form a strong base understanding of the language. Using this pre-trained model, we can then fine-tune it by further training on a more task-specific data set. In doing so, we can shape the output to be focused on our domain, without losing the underlying understanding of the language.
Normally, this process is accomplished by freezing most of the lower layers of the network, and only allowing the weights in the top couple of layers to be adjusted during fine-tuning. As a result, we have a much more powerful model than if we were to just train from scratch on our own data set; not to mention we save a lot of time and compute resources.
Step 5: Generate Text
With our fine-tuned model, we can again use the aitextgen library for quick and easy text generation.
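A sketch of the generation call is shown below. Here `ai` is assumed to be an aitextgen instance that already holds the fine-tuned weights and the matching tokenizer. The prompt and the sampling values are placeholders for this example and not tuned settings.

```python
def generate_song(ai, prompt="[Verse]", max_length=256, temperature=0.9, top_p=0.9):
    """Generate one song as a string, starting from `prompt`."""
    return ai.generate_one(
        prompt=prompt,
        max_length=max_length,  # hard limit on the number of tokens
        temperature=temperature,  # below 1.0 gives a peakier distribution
        top_p=top_p,  # nucleus sampling, see the decoding overview below
    )


# Usage (with `ai` loaded from the fine-tuning output):
#   print(generate_song(ai, prompt="[Verse]"))
```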
Using the above code snippet, we can generate results such as the following, which looks a lot more like an actual song:
Note: due to a bug that we found in the dbmdz/german-gpt-2 model, our base model was not properly trained using the EOS token, meaning the model never knows when to stop generating text. This issue has since been fixed by the maintainers. However, because of this issue, our final product at the end of the 4-day Discovery session could only produce either never-ending songs or songs which abruptly stop mid-sentence after hitting a token limit.
… but How Do You Generate Text Using a Language Model?
As mentioned previously, the output of language models is a probability distribution over the different tokens in the vocabulary. To generate text, we follow a basic loop.
In the loop, the prompt is first passed through a tokenizer and the output is then passed to the language model, resulting in a probability distribution. We then decode the distribution (i.e., select the next token) and then append the new token to the sequence. The process repeats until we hit some end criteria.
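The loop can be sketched in a few lines of plain Python. The "model" here is a hand-written lookup table and the tokenizer just splits on whitespace; both are stand-ins invented for this example so that the loop itself stays visible.

```python
import random

EOS = "<eos>"

# Hypothetical stand-in for a trained model: P(next token | last token)
TOY_MODEL = {
    "ich": {"liebe": 0.7, "tanze": 0.3},
    "liebe": {"dich": 0.6, "das": 0.4},
    "das": {"leben": 1.0},
    "dich": {EOS: 1.0},
    "leben": {EOS: 1.0},
    "tanze": {EOS: 1.0},
}


def tokenize(text):
    """Toy tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def next_token_distribution(tokens):
    """Look up the distribution for the next token given the sequence so far."""
    last = tokens[-1] if tokens else None
    return TOY_MODEL.get(last, {EOS: 1.0})


def decode(distribution, rng):
    """Select the next token by sampling from the distribution."""
    candidates = list(distribution)
    weights = list(distribution.values())
    return rng.choices(candidates, weights=weights, k=1)[0]


def generate(prompt, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = tokenize(prompt)
    for _ in range(max_new_tokens):  # end criterion 1: token limit
        dist = next_token_distribution(tokens)
        token = decode(dist, rng)
        if token == EOS:  # end criterion 2: the model says it is done
            break
        tokens.append(token)
    return " ".join(tokens)


print(generate("Ich"))
```

The two end criteria also show why a missing EOS token hurts: without it, only the token limit can stop the loop, and the text is cut off wherever that limit happens to fall.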
There are many ways to decode the probability distribution, ranging from the simple to the quite complex. A few of the most common approaches are as follows:
- Greedy. The next token is chosen by taking the argmax of the probability distribution (select the token with the highest probability). This is the simplest of approaches and results in a deterministic output.
- Beam Search. Keep a fixed-size set of the most probable partial sequences at every step and, at the end, select the one with the highest overall probability. This approach can be improved by enforcing diversity among the different sequences. Like the greedy approach, the results here are also deterministic.
- Sampling. Randomly sample the probability distribution. You can also shape the probability distribution using a temperature parameter to make the distribution flatter (high temperature) or more peaky (low temperature).
- Top-K. Randomly sample from the K most probable tokens in the distribution. This makes the output more predictable; however, it is difficult to select a value of K that works for every distribution.
- Top-P. Randomly sample from the smallest set of most probable tokens whose cumulative probability reaches P. This helps deal with the problems Top-K has with super peaky or super flat distributions.
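Most of the approaches in this list fit into a few lines each. The sketch below works on a plain list of probabilities (one entry per token id) and leaves out beam search, which needs to track several sequences at once. The function names are my own and not taken from any library.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; low temperature means peakier."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs):
    """Return the index of the most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)


def renormalize(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose probabilities sum to >= p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


def sample(probs, rng=random):
    """Draw one token index at random according to `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]
print(greedy(probs))
print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.9))
print(sample(top_p_filter(probs, 0.9), random.Random(0)))
```

Top-K and Top-P are filters that run before sampling, which is why they combine freely with a temperature setting.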
Furthermore, you can create a custom decoding method by adjusting the probability distribution to encode domain specific knowledge. Some applications of this in SchlagerAI can be seen in the next section.
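As a small sketch of what such an adjustment could look like, the function below multiplies the probability of a set of preferred tokens by a constant factor and renormalizes. The example words, the factor, and the function name are made up for illustration, and the input is assumed to be a valid distribution with a positive sum.

```python
def boost_tokens(distribution, preferred, factor=3.0):
    """Scale up the probability of preferred tokens, then renormalize."""
    adjusted = {
        tok: p * (factor if tok in preferred else 1.0)
        for tok, p in distribution.items()
    }
    total = sum(adjusted.values())
    return {tok: p / total for tok, p in adjusted.items()}


# Example: the previous line ended on "Herz", so we favor words that rhyme
dist = {"Herz": 0.2, "Haus": 0.5, "Schmerz": 0.3}
boosted = boost_tokens(dist, preferred={"Herz", "Schmerz"})
print(boosted)
```

The adjusted distribution can then be passed to any of the decoding methods above in place of the original one.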
The Future of SchlagerAI
After our 4 days of working on SchlagerAI, the results were quite promising. However, there is still a long way to go to make the perfect songwriting Artificial Intelligence. Possible improvements include the following:
- Better integrate song structure into the generation method. Thanks to the tags provided by Genius, our model appeared to start to understand the ideas of a [Verse] or a [Refrain]; however, there is still a lot more expert knowledge we can provide it with. For instance, we know that a [Pre-Refrain] should always come before a [Refrain], or that a [Refrain] usually shows up multiple times with similar lyrics.
- Rhyming and meter. In order to fit musically, the different lines in a song normally follow some sort of meter/syllable structure. There are also rhyming schemes to make the song flow together better. We can encode this knowledge into our generation method by adjusting the probabilities of tokens if they fit into the predefined structures of the song. An example of this approach can be found in https://github.com/summerstay/true_poetry.
- Setting themes. As mentioned in the intro, Schlager music normally follows certain themes/topics. We can help shape the generation output in those directions by integrating keyword generation, an example of which can be found in https://github.com/minimaxir/gpt-2-keyword-generation.
- Integrating current events. Schlager songs sometimes reference mainstream current events in the lyrics. Our model is trained on a static, not-so-current data set, meaning it has no knowledge of recent events. Facebook AI recently announced BlenderBot 2.0, which can integrate information from current events/internet searches into the text generation process. It may be possible to take some ideas from that and similar research to allow the generated lyrics to be more topical.
You can find the code in our GitHub repository if you want deeper insights into how we built SchlagerAI.
If you find this in-depth article about SchlagerAI interesting, then go and check out the recorded clip of the live SchlagerAI presentation at our most recent Discovery Conference!