Music generation should be creative
Why would you generate music through text prompts?
Discrete diffusion meets gesture control
Unlike image diffusion models, Melody Diffuser operates directly on symbolic music tokens. Gesture inputs condition the denoising process in real time.
Gesture Input
Eight gesture types are captured from hand movement: up, down, hold, accent, and more.
Cross-Attention
Gestures are embedded and attended to at each diffusion step, controlling the melody.
Melody Output
Discrete tokens are decoded to MIDI: pitch, duration, and velocity.
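The token scheme isn't spelled out above, but as a rough sketch, assuming each generated note arrives as a (pitch, duration, velocity) triple, decoding to a MIDI file could look like this. The pretty_midi library, the helper name, and the timing grid are illustrative choices, not part of Melody Diffuser:

    # A minimal sketch, assuming notes come out as (pitch, duration, velocity)
    # triples on a 16th-note grid. The real tokenization may differ.
    import pretty_midi

    def tokens_to_midi(triples, path="melody.mid", tempo_scale=0.25):
        """triples: list of (pitch 0-127, duration in 16th notes, velocity 0-127)."""
        pm = pretty_midi.PrettyMIDI()
        piano = pretty_midi.Instrument(program=0)  # acoustic grand piano
        time = 0.0
        for pitch, duration, velocity in triples:
            seconds = duration * tempo_scale  # 16th-note grid to seconds
            piano.notes.append(pretty_midi.Note(
                velocity=velocity, pitch=pitch, start=time, end=time + seconds))
            time += seconds
        pm.instruments.append(piano)
        pm.write(path)

    # Example: a short C-major figure.
    tokens_to_midi([(60, 2, 90), (64, 2, 90), (67, 4, 100)])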
Architecture
import torch.nn as nn

class MelodyDiffusor(nn.Module):
    def __init__(self, dim, heads, depth, num_gestures):
        super().__init__()
        # Transformer backbone
        self.blocks = nn.ModuleList([
            TransformerBlock(dim, heads)
            for _ in range(depth)
        ])
        # Gesture conditioning
        self.cond_embed = nn.Embedding(num_gestures, dim)
        self.cross_attn = nn.MultiheadAttention(
            dim, heads, batch_first=True  # (batch, seq, dim) tensors
        )

    def forward(self, x, t, gesture):
        # Embed gesture sequence
        c = self.cond_embed(gesture)
        # Denoise with gesture conditioning at every block
        for block in self.blocks:
            x = block(x, t)
            x = x + self.cross_attn(x, c, c)[0]
        return x
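For a concrete sense of the shapes involved, here is a minimal forward-pass sketch. The TransformerBlock below is a hypothetical stand-in (the real block isn't shown above), and the dimensions are illustrative:

    # Forward-pass smoke test. TransformerBlock is an assumed stand-in:
    # pre-norm self-attention + MLP with a learned timestep embedding.
    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        def __init__(self, dim, heads, max_steps=1000):
            super().__init__()
            self.t_embed = nn.Embedding(max_steps, dim)  # diffusion timestep embedding
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
            )

        def forward(self, x, t):
            x = x + self.t_embed(t)[:, None, :]  # inject the diffusion step
            h = self.norm1(x)
            x = x + self.attn(h, h, h)[0]        # self-attention over melody positions
            return x + self.mlp(self.norm2(x))

    model = MelodyDiffusor(dim=256, heads=4, depth=6, num_gestures=8)
    x = torch.randn(2, 64, 256)              # noisy melody embeddings: (batch, seq, dim)
    t = torch.randint(0, 1000, (2,))         # one diffusion timestep per example
    gesture = torch.randint(0, 8, (2, 16))   # 16 gesture tokens per example
    out = model(x, t, gesture)
    print(out.shape)                         # torch.Size([2, 64, 256])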
The key insight
Gesture as musical intent
Traditional music AI generates from text prompts or reference examples. Melody Diffuser takes a different approach: creative control comes from gesture, with your movements shaping the music as it is generated.
Each gesture is embedded into a learned vector space. During denoising, the model attends to these embeddings through cross-attention, allowing your movement to directly influence pitch direction, rhythmic density, and melodic contour.
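This section doesn't spell out the sampler itself, so here is a hedged sketch of what gesture-conditioned denoising over discrete tokens could look like. It assumes a mask-and-remask (MaskGIT-style) scheme, a reserved mask token id of 0, and a hypothetical model(tokens, t, gesture) -> logits interface; Melody Diffuser's actual noise schedule and sampler may differ:

    # Sketch of gesture-conditioned sampling for discrete melody tokens.
    import torch

    @torch.no_grad()
    def sample_melody(model, gesture, seq_len=64, mask_id=0,
                      num_steps=16, device="cpu"):
        # Start from a fully masked melody and denoise it step by step.
        tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
        for step in reversed(range(num_steps)):
            t = torch.tensor([step], device=device)
            logits = model(tokens, t, gesture)               # (1, seq_len, vocab)
            confidence, prediction = logits.softmax(-1).max(-1)
            # Commit the most confident positions; re-mask the rest.
            keep = max(1, int(seq_len * (num_steps - step) / num_steps))
            threshold = confidence.topk(keep, dim=-1).values[..., -1:]
            remask = confidence < threshold
            tokens = torch.where(remask, torch.full_like(prediction, mask_id), prediction)
        return tokens  # discrete melody tokens, ready to decode to MIDI

At the final step every position clears the threshold, so no mask tokens remain and the result can be handed straight to a MIDI decoder like the one sketched earlier.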
Trained on 1 million+ melodies paired with gesture annotations.