Four-Armed Marimba Robot Uses Deep Learning to Compose Its Own Music

Georgia Tech's Shimon has analyzed thousands of songs and millions of music clips and can now compose completely original music


Evan Ackerman is IEEE Spectrum’s robotics editor.

Image: Georgia Tech's Shimon musical robot. Photo: Georgia Tech

The Georgia Tech Center for Music Technology, led by Gil Weinberg, has a reputation for doing incredible musical things with robots, combining creativity with technical expertise in robotics and AI. We’ve seen projects like a cybernetic second arm for a drummer, a cybernetic third arm (!) for a drummer, and a bunch of interesting research on ways that robots can dynamically collaborate with humans in the context of improvisational music. That research usually features Shimon, a four-armed expressive robotic marimba player that can analyze music in real time and improvise along with human performers.

It’s an impressive thing to watch, but Shimon’s talents were mostly restricted to riffing on what human musicians were doing. Now, Shimon has leveraged deep learning to create structured, coherent, and entirely original compositions of its very own.

This is Shimon’s very first original piece of music, a sort of classical-jazz fusion-y thing:

Shimon’s teacher (of sorts) is Georgia Tech Ph.D. student Mason Bretan. The melody and harmonic structure that you’re hearing is the output of a four-measure-long seed melody running through a neural network that’s been trained on nearly 5,000 complete songs (including music by Beethoven, The Beatles, Lady Gaga, Miles Davis, and John Coltrane), along with 2 million motifs, riffs, licks, and other foundational musical elements.

In the second piece that Shimon came up with, Bretan used a faster seed melody, and Shimon came up with something completely different but noticeably more brisk:

It’s important to understand that Shimon isn’t just mushing together different bits of music that it’s been programmed with, or using some kind of random-music generator. The special thing about what Shimon is doing here is that its deep neural network has, in effect, listened to those thousands of songs, and its compositions represent everything it’s learned from analyzing them. It’s able to generate harmonies and chords, and it focuses (like humans do) on the overall structure of the composition rather than simply what note should come next in an existing sequence.

Bretan calls this “higher-level musical semantics.” Shimon’s music isn’t something that we can necessarily identify with at this point, because we’re hearing the creative output of a deep-learning system. Weinberg calls Shimon’s music “beautiful, inspiring, and strange,” and we’d have to agree: This is something with coherence and structure, but it’s also completely unique.

For more details, we spoke with Bretan and Professor Weinberg over email:

IEEE Spectrum: Are the compositions in the videos you shared representative of what Shimon comes up with? Or did you select some that came out particularly well?

Gil Weinberg: These are the first two compositions Shimon composed using deep learning. No selection on our part. They represent the data set Shimon learned from and the seed motif he was fed. One can imagine that if we extend the data set to include other music, and if we provide different kinds of seed melodies, the music Shimon generates would be quite different.

If you trained the robot on only one type of music (say, classical music, or even classical music by a particular composer or school of composers), to what extent would the music it composes be identifiable as being related to the training set?

Weinberg: Shimon’s music is very much related to the training set, so if the data set had only one composer, the music would probably be quite identifiable with that composer (or genre). There is another important parameter at play, which is the seed music, which can lead to significant variations of the outcome.

Image: Georgia Tech’s Shimon, a four-armed expressive robotic marimba player.

Why do you feed Shimon motifs, riffs, and licks, and other musical fragments as well as complete songs? How does it integrate those two things?

Mason Bretan: We want the network to learn important structural concepts. If we draw an analogy to language, in order for someone to write a story, he or she would need to understand the concept of words, sentences, paragraphs, and so on. In music, things such as licks, motifs, passages, and so on are somewhat analogous components. To encourage learning these musical concepts, we don’t explicitly say, “Here is a motif, here is a full song, here is a lick.” Instead, we train the network dynamically, by varying the sequence length so that sometimes the network has to predict the next measure given just the previous measure, or sometimes given the previous 2 measures, or sometimes given the previous 8 measures, all the way up to 16 measures.
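The varying-context idea Bretan describes can be sketched in a few lines of Python. This is only an illustration, not the team's training code: the function name `make_training_pairs`, the string stand-ins for measures, and the choice to draw the context length uniformly at random between 1 and 16 are all assumptions made for the example.

```python
import random

def make_training_pairs(measures, max_context=16, seed=0):
    """Build (context, target) pairs whose context length varies.

    Assumption: the context length is drawn uniformly at random; the
    interview only says it varies from 1 up to 16 measures.
    """
    rng = random.Random(seed)
    pairs = []
    for i in range(1, len(measures)):
        # The context cannot be longer than the music preceding measure i.
        longest = min(max_context, i)
        length = rng.randint(1, longest)
        pairs.append((measures[i - length:i], measures[i]))
    return pairs

# Hypothetical song: each string stands in for one measure of music.
song = [f"measure_{k}" for k in range(20)]
for context, target in make_training_pairs(song)[:3]:
    print(len(context), "measure(s) of context ->", target)
```

Because the same target measure can show up with a short or a long history, a model trained on pairs like these has to cope with both local and longer-range structure, which is the point Bretan makes above.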

Can you give us a more detailed description of the process that Shimon uses to compose original music?

Bretan: The first (and arguably most important) step is learning an effective numerical representation of a small snippet of music, like a single beat or a few beats of music. This is called “neural embedding.” In language modeling, you may have heard of “word to vector” or “word2vec,” which is a method for a network to learn word concepts, so that words such as “good,” “great,” “pleasant,” and “wonderful,” which are all semantically similar, end up with similar representations. A similar process is done in this work for music so that a network learns how to effectively represent small musical snippets such that similar snippets are grouped closer together.
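To make “grouped closer together” concrete, here is a minimal Python sketch that compares snippet vectors with cosine similarity. The three vectors and their names are invented by hand for illustration; in the real system the embedding would be learned by a network, and the article does not say which similarity measure is used.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-written embeddings; a trained network would produce these.
embeddings = {
    "rising_arpeggio_c": [0.9, 0.1, 0.3],
    "rising_arpeggio_g": [0.8, 0.2, 0.35],
    "descending_scale": [-0.7, 0.6, 0.1],
}

def nearest(name, table):
    """Return the name of the snippet most similar to `name`."""
    query = table[name]
    scored = [
        (cosine_similarity(query, vector), other)
        for other, vector in table.items()
        if other != name
    ]
    return max(scored)[1]

print(nearest("rising_arpeggio_c", embeddings))
```

With these made-up numbers, the two arpeggios sit close together and the descending scale sits far away, which is the kind of grouping a useful embedding is meant to produce.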

The second part is the sequence modeling and prediction of these musical snippet vectors. A recurrent neural network is trained to make predictions given the previous measures of music. It’s not exactly the type of reinforcement learning commonly used in robotics, in which a robot learns a sequence of discrete actions to solve some problem. Instead, Shimon is predicting a sequence of numbers in a continuous space. Let’s say given the sequence “1, 2, 1, 2, 1, 2, 1” the network is trained to predict the number “2.” That means during training, the farther away it is from the number “2,” the more substantially it will update the parameters. So once the network is trained, a seed is given to the network to provide some context, and then it continuously makes predictions, which make up Shimon’s composition.
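Bretan's “1, 2, 1, 2” example can be turned into a toy Python program. To keep it short, a two-parameter linear predictor stands in for the recurrent network, and squared error is assumed as the training loss; neither choice comes from the interview. What the sketch does preserve is the behavior he describes: the parameter update grows with the distance between prediction and target, and generation works by feeding each prediction back in as context.

```python
def train(pairs, steps=5000, learning_rate=0.1):
    """Fit next = a * previous + b by gradient descent on squared error."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for previous, target in pairs:
            # The farther the prediction is from the target,
            # the larger the error and therefore the update.
            error = (a * previous + b) - target
            grad_a += 2 * error * previous / len(pairs)
            grad_b += 2 * error / len(pairs)
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b

def generate(a, b, seed, count):
    """Extend the seed by repeatedly predicting from the latest value."""
    sequence = list(seed)
    for _ in range(count):
        sequence.append(a * sequence[-1] + b)
    return sequence[len(seed):]

history = [1, 2, 1, 2, 1, 2, 1, 2]
pairs = list(zip(history, history[1:]))  # (previous value, next value)
a, b = train(pairs)
continuation = generate(a, b, seed=[1, 2, 1], count=4)
print([round(value) for value in continuation])
```

The predictions are continuous numbers rather than discrete choices, so they are rounded only for display. In Shimon's case the predicted quantities would be snippet vectors rather than single numbers.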

Does Shimon have a particular style as a composer? Can you elaborate on how Shimon’s compositions are different from the music that humans create?

Weinberg: The underlying rationale behind all of our robotic musicians is to combine music that we humans love and appreciate (using machine listening and machine learning) with new ways to play and think about music (using algorithms that humans can’t or don’t use). Here, the deep-learning architectures aim at capturing musical concepts and patterns that are used by humans. As part of the generation phase, we can play with the algorithms to add mathematical permutations that are machine based and may lead to novel music, which may be beautiful, inspiring, and strange.

Are there practical applications for this learning and improvisation technique beyond musical composition?

Weinberg: We are using LSTM (long short-term memory) networks and unit selection; both approaches can be (and have been) used in language modeling and generation, which can be equivalent to “improvisation.”
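In general terms, unit selection means assembling output from pre-existing units in a library rather than generating it note by note. The Python sketch below shows only the simplest version of that idea, picking the library unit whose vector is closest to a predicted vector. The library contents, the two-dimensional vectors, and the use of squared Euclidean distance are assumptions for illustration; the article does not describe how Shimon ranks its candidate units.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical unit library: MIDI note numbers paired with made-up vectors.
unit_library = [
    {"notes": [60, 64, 67, 72], "vector": [0.9, 0.1]},
    {"notes": [72, 71, 69, 67], "vector": [-0.6, 0.7]},
    {"notes": [60, 60, 67, 67], "vector": [0.2, -0.8]},
]

def select_unit(predicted_vector, library):
    """Pick the stored unit whose vector is closest to the prediction."""
    return min(
        library,
        key=lambda unit: squared_distance(unit["vector"], predicted_vector),
    )

# Stand-in for a vector that a sequence model might predict.
chosen = select_unit([0.7, 0.2], unit_library)
print(chosen["notes"])
```

The same pattern applies to text if the units are words or phrases instead of musical snippets, which is one way to read Weinberg's comparison with language generation.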

What are you working on next?

Weinberg: We started to look at using deep learning to learn not only from symbolic notation but also from human performance of the music in the data set. This could allow the robot to learn not only what notes to play but also how to play them so the music sounds rich and expressive (controlling parameters such as microtiming, articulation, and intonation).

Bretan: The next big questions I have are about interaction and how developing a deeper understanding of embodiment influences the compositional and perceptual processes of music. Shimon has four arms: How does that influence its interpretation of music compared to a human with two arms and 10 fingers?

Many thanks to Mason Bretan and Gil Weinberg for speaking with us. And if Shimon ever wants to learn from a bagpiper, just let me know.
[ Georgia Tech Center for Music Technology ] via [ Georgia Tech News ]
