Chiptunes in Tensorflow
Generating C64 music with RNNs
Written by Christopher Hesse, January 11th, 2017
Can an RNN be used to generate Commodore 64 SID music without creating a custom solution? Sort of. If the abstraction is information-dense enough (a low number of bytes per second of audio data), the RNN produces a few noises, but nothing that could really be considered music. For a low-density representation, such as raw audio data, something like WaveNet would likely perform better.
If a higher-level representation were available, like the notes and instruments in MIDI, more musical output would be easier to generate, but extracting something like that from SID files could be a project of its own.
The Commodore 64
The Commodore 64, a popular 8-bit home computer released in 1982, was named for the staggering 64 kilobytes of RAM included with the system. It also included a sound synthesis chip, the MOS 6581 Sound Interface Device (SID), supporting 3 voices with 4 different waveforms per voice. The SID chip was used in a variety of games and, after the C64 era ended, in a handful of synthesizers, and it has achieved a degree of cult-classic status.
"The quality of the sound has to be heard to be believed."
BYTE Magazine, July 1983
Believe it:    Crazy Comets by Rob Hubbard
The C64 is no longer widely available, but a large amount of SID music is available from the High Voltage SID Collection: almost 50,000 SID files from various C64 games, and even from musicians such as the German band Welle: Erdball.
SID files are not particularly similar to MIDI files, which store a linear sequence of commands like NOTE ON or NOTE OFF. They are instead C64 programs, often around 10KB in size, that include a play routine called at 50Hz, once per frame. The play routine will write to the various registers of the SID chip in order to create the desired notes and instruments. Anthony McSweeney has helpfully analyzed the source code of an early SID tune to show how it works. To play a SID file you need to emulate a C64 CPU and a SID chip and have them talk to each other.
Fortunately, a large number of programmers think the C64 was pretty cool and have spent time building tools to work with SID files and to emulate the SID chip. For decoding the SID files into SID register writes, I used cpu.c from siddump. To simulate the SID audio output from the memory writes, I used the resid library.
The RNN was trained on a subset of SID files that were chosen because they did not use any advanced SID features. All the SID files used have the same load address, play address, start page, and speed, and are not RSID files, which require particularly accurate simulation. The RNN was trained on 3 different representations: the original SID program, the memory writes to the SID registers, and the audio samples.
SID Program
The SID file is a header describing the settings required to play the file plus machine code for a C64 processor which plays the song when run.
Encoding
The encoding for SID files was the SID data bytes with the SID header and load address stripped off. The data was then converted to an array of integers representing the bytes, with the value -1 representing the beginning of a SID file in the data. The -1 value is supplied at sampling time, equivalent to the -primetext option in char-rnn.
Here's an example of the beginning of a song encoded in bytes converted to a numpy array:
4C 37 10 4C 85 10 4C 2F 16
Python
numpy.array([-0x1, 0x4c, 0x37, 0x10, 0x4c, 0x85, 0x10, 0x4c, 0x2f, 0x16])
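As a sketch of that conversion (the function name is mine, and it assumes the SID header and load address have already been stripped, which in practice requires parsing the header):
Python
import numpy as np

def encode_sid_program(data_bytes):
    # data_bytes: the raw SID program bytes, header and load address already removed
    values = np.frombuffer(data_bytes, dtype=np.uint8).astype(np.int64)
    # -1 marks the beginning of a file, like the -primetext option in char-rnn
    return np.concatenate(([-1], values))

# reproduces the array shown above (numpy prints the values in decimal)
encoded = encode_sid_program(bytes([0x4c, 0x37, 0x10, 0x4c, 0x85, 0x10, 0x4c, 0x2f, 0x16]))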
Generation
Training was done with a 3 layer 384-node LSTM RNN with ~17k SID files meeting the requirements listed above, constituting ~64MB of data. After 12 hours/15 epochs of training on a ~1.3 TFLOPS Nvidia GTX 750 Ti, the training loss was 1.11 nats/character and the validation loss was 1.15.
Sampling the trained network produced a few SID files that ran, but almost all of them produced silence. A few produced the initial click of the SID chip being turned on. A single file out of the ~1000 sampled produced anything more than that. Here it is, normalized so you can hear it:
  Generated SID sample
The chance of an RNN like this producing not only valid machine code, but valid machine code that manipulates the SID registers in some interesting way, seems low. Presumably, if you generated an infinite number of SID files this way, one of them would be a decent song, but that could take a while.
Memory Writes
Instead of looking at the raw SID data, we can look at the memory writes to the SID chip's registers.
To record the writes we can use two symbols, poke and frame. poke means to set some memory value and has the form:
poke <address> <value>
frame takes no arguments and instructs the decoder to wait 1/50th of a second until the next frame before continuing. The start of a SID song in this format looks like:
poke 0x4 0x0
poke 0xb 0x0
poke 0x12 0x0
poke 0x4 0x0
poke 0xb 0x0
poke 0x12 0x0
poke 0x16 0xa0
poke 0x18 0x1f
frame
poke 0x4 0x0
poke 0xb 0x0
poke 0x12 0x0
frame
The poke 0x18 0x1f is key since it sets the volume (which you can see is 0xD418 on the registers diagram) to the maximum value, and it appears often at the beginning of songs. Since the SID has 25 writeable registers, I only use the last byte of the address.
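To make the replay side of this format concrete, here's a small helper (my own, not part of the original tools) that groups a poke/frame command stream into per-frame lists of register writes; the full chip address is just 0xD400 plus the single-byte register offset:
Python
def group_into_frames(text):
    # text: a poke/frame command stream like the one above, one command per line
    frames, current = [], []
    for line in text.strip().splitlines():
        parts = line.split()
        if parts[0] == "frame":
            frames.append(current)  # apply these writes, then wait 1/50th of a second
            current = []
        else:  # "poke <address> <value>"
            register, value = int(parts[1], 16), int(parts[2], 16)
            current.append((0xD400 + register, value))  # full SID chip address
    return frames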
Encoding
For this encoding, each poke command with its arguments is assigned an integer as:
integer = (address << 8) + value
The frame command is assigned an integer higher than any possible poke integer. The beginnings of files are encoded as -1 and the results are written to a numpy array, as with the previous encoding. Here's an example from the beginning of a song with the numpy-encoded result:
poke 0x4 0x0
poke 0xb 0x0
poke 0x12 0x0
poke 0x4 0x0
poke 0xb 0x0
poke 0x12 0x0
poke 0x16 0xa0
poke 0x18 0x1f
frame
Python
numpy.array([-1, 1024, 2816, 4608, 1024, 2816, 4608, 5792, 6175, 8160])
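A sketch of this encoding and its inverse (the helper names are mine; the frame token 8160 matches the value used in the example above):
Python
import numpy as np

FRAME = 8160  # any integer above the largest possible (address << 8) + value

def encode_commands(frames):
    # frames: list of frames, each a list of (address, value) register writes
    tokens = [-1]  # -1 marks the beginning of a file
    for writes in frames:
        for address, value in writes:
            tokens.append((address << 8) + value)
        tokens.append(FRAME)
    return np.array(tokens)

def decode_token(token):
    # inverse mapping, back to a poke command
    return token >> 8, token & 0xFF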
Generation
Training was done with a 1 layer 256-node LSTM RNN with a dropout of 0.7 and a dataset of ~18M commands generated by a subset of ~300 SID files from the previous collection of ~17K. After 1 hour/11 epochs of training on an Nvidia GTX 750 Ti, the training loss was 2.00 nats/command and the validation loss was 2.24.
Sampling 4M commands from the trained network produced a number of wav files; here are a few excerpts:
  Sample 1
  Sample 2
  Sample 3
Subjectively, some of the sounds seem recognizable, so it's possible that something of the original instrument patterns was captured, but there is no long-range structure to the sounds and it would be hard to mistake something like this for an actual SID song.
Overfitting
If you overfit the network to your training data (meaning that the network has poor performance on the validation set), you can generate things that sound a lot like your training data. Here are some samples from a 1-layer 256-node LSTM RNN (no dropout) with a dataset of ~700k commands generated from SID files that are all covers of the song "Axel F" (train loss 0.01, validation loss 0.56):
  Sample 1
  Sample 2
They seem to sound more or less like the original songs, but with more glitches. If your goal is to generate sounds mostly like your training set, then perhaps overfitting could be a good idea. In general, it would seem that if you wanted to generate more "original" output, then overfitting is not the way to go.
Samples
We can also look at the samples output by a SID emulator. These can be generated from the SID file with a tool such as sidplayfp. I used the same tools mentioned in the first section to produce the output.
The wave file output by sidplayfp is a series of 16-bit integers representing the sound intensity at a rate of 44kHz. This means there are ~65k possible values for a sample, which makes it harder to process the data because you need a larger network to handle that many possible input values. We can reduce that to 256 values without much loss in quality by using a logarithmic scale instead of a linear one, specifically µ-law. For regular music the difference is probably more noticeable, but for noisy SID music, there is not a large difference. We can also cut the data in half by using a 22kHz sampling rate instead of 44kHz. This has an effect on audio quality, but it is not severe.
Here's a 5 second clip from Commando by Rob Hubbard, with different encodings:
  16-bit linear, 44kHz
  8-bit logarithmic, 44kHz
  16-bit linear, 22kHz
  8-bit logarithmic, 22kHz
Encoding
Reading the samples from the audio output into floating point numbers, you end up with values in the range -1.0...1.0, with 22,050 of those per second of audio:
[0.319, 0.319, 0.319, 0.319, 0.319, 0.319, 0.305, 0.305, 0.305]
The samples are written out as integers in the range 0...255 with -1 once again representing the beginning of a file:
Python
numpy.array([ -1, 229, 229, 229, 229, 229, 229, 228, 228, 228])
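A sketch of the µ-law companding and 8-bit quantization described above (the function name is mine); with µ = 255 it reproduces the values in the example, e.g. 0.319 → 229 and 0.305 → 228:
Python
import numpy as np

def mu_law_encode(samples, mu=255):
    # samples: floats in -1.0...1.0
    compressed = np.sign(samples) * np.log1p(mu * np.abs(samples)) / np.log1p(mu)
    # rescale -1.0...1.0 to integers 0...255
    return np.round((compressed + 1) / 2 * mu).astype(np.int64)

samples = np.array([0.319, 0.319, 0.319, 0.305])
encoded = np.concatenate(([-1], mu_law_encode(samples)))  # [-1, 229, 229, 229, 228]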
Generation
The output is indistinguishable from noise. It sounds sort of like someone blowing into a microphone. Getting a neural network to handle raw audio like this looks challenging, but the WaveNet model seems to be able to do this with a different approach.
  Generated sample
Implementation
For these experiments I created a new version of rnn.py based on the ptb_word_lm.py example in the Tensorflow models repository, along with some things from the previous rnn.py based on Andrej Karpathy's char-rnn. It now uses the Tensorflow LSTM library (tf.nn.rnn_cell.BasicLSTMCell) and can handle numpy .npy input files.
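For reference, the core of such a model looks roughly like this; it's a simplified sketch using the Tensorflow 1.x-era API named above, with illustrative hyperparameters and filenames rather than a copy of the actual rnn.py:
Python
import numpy as np
import tensorflow as tf

num_layers, num_units, vocab_size = 3, 384, 257  # 256 byte values plus the -1 start marker

# stacked LSTM cells, as in the ptb_word_lm.py example
cell = tf.nn.rnn_cell.MultiRNNCell(
    [tf.nn.rnn_cell.BasicLSTMCell(num_units) for _ in range(num_layers)])

inputs = tf.placeholder(tf.int32, [None, None])  # [batch, time] integer tokens
one_hot = tf.one_hot(inputs + 1, vocab_size)     # shift so the -1 marker becomes class 0
outputs, state = tf.nn.dynamic_rnn(cell, one_hot, dtype=tf.float32)

# per-step logits over the next token, ptb_word_lm.py-style
softmax_w = tf.get_variable("softmax_w", [num_units, vocab_size])
softmax_b = tf.get_variable("softmax_b", [vocab_size])
logits = tf.matmul(tf.reshape(outputs, [-1, num_units]), softmax_w) + softmax_b

# training data comes from the .npy files produced by the preprocessing tools
data = np.load("sid_programs.npy")  # hypothetical filename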
In addition, there is a Go program to process the High Voltage SID Collection and create .npy files of the various representations, as well as a second Go program to convert from .npy files into .wav files and other representations such as memory writes.
all code samples on this site are in the public domain unless otherwise stated