Music generation should be creative
Why would you generate music through text prompts?
Discrete diffusion meets gesture control
Unlike image diffusion models, Melody Diffuser operates directly on symbolic music tokens. Gesture inputs condition the denoising process in real time.
Gesture Input
Eight gesture types are captured from hand movement: up, down, hold, accent, and more.
Cross-Attention
Gestures are embedded and attended to at each diffusion step, controlling the melody.
Melody Output
Discrete tokens are decoded to MIDI: pitch, duration, and velocity.
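The token scheme isn't spelled out above, but as a rough sketch, assuming each generated note arrives as a (pitch, duration, velocity) triple, decoding to a MIDI file could look like this. The pretty_midi library, the helper name, and the timing grid are illustrative choices, not part of Melody Diffuser:

    # A minimal sketch, assuming notes come out as (pitch, duration, velocity)
    # triples on a 16th-note grid. The real tokenization may differ.
    import pretty_midi

    def tokens_to_midi(triples, path="melody.mid", tempo_scale=0.25):
        """triples: list of (pitch 0-127, duration in 16th notes, velocity 0-127)."""
        pm = pretty_midi.PrettyMIDI()
        piano = pretty_midi.Instrument(program=0)  # acoustic grand piano
        time = 0.0
        for pitch, duration, velocity in triples:
            seconds = duration * tempo_scale  # 16th-note grid to seconds
            piano.notes.append(pretty_midi.Note(
                velocity=velocity, pitch=pitch, start=time, end=time + seconds))
            time += seconds
        pm.instruments.append(piano)
        pm.write(path)

    # Example: a short C-major figure.
    tokens_to_midi([(60, 2, 90), (64, 2, 90), (67, 4, 100)])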
Architecture
import torch.nn as nn

class MelodyDiffusor(nn.Module):
    def __init__(self, dim, heads, depth, num_gestures):
        super().__init__()
        # Transformer backbone
        self.blocks = nn.ModuleList([
            TransformerBlock(dim, heads)
            for _ in range(depth)
        ])
        # Gesture conditioning
        self.cond_embed = nn.Embedding(num_gestures, dim)
        self.cross_attn = nn.MultiheadAttention(
            dim, heads, batch_first=True  # (batch, seq, dim) tensors
        )

    def forward(self, x, t, gesture):
        # Embed gesture sequence
        c = self.cond_embed(gesture)
        # Denoise with gesture conditioning at every block
        for block in self.blocks:
            x = block(x, t)
            x = x + self.cross_attn(x, c, c)[0]
        return x
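For a concrete sense of the shapes involved, here is a minimal forward-pass sketch. The TransformerBlock below is a hypothetical stand-in (the real block isn't shown above), and the dimensions are illustrative:

    # Forward-pass smoke test. TransformerBlock is an assumed stand-in:
    # pre-norm self-attention + MLP with a learned timestep embedding.
    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        def __init__(self, dim, heads, max_steps=1000):
            super().__init__()
            self.t_embed = nn.Embedding(max_steps, dim)  # diffusion timestep embedding
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
            )

        def forward(self, x, t):
            x = x + self.t_embed(t)[:, None, :]  # inject the diffusion step
            h = self.norm1(x)
            x = x + self.attn(h, h, h)[0]        # self-attention over melody positions
            return x + self.mlp(self.norm2(x))

    model = MelodyDiffusor(dim=256, heads=4, depth=6, num_gestures=8)
    x = torch.randn(2, 64, 256)              # noisy melody embeddings: (batch, seq, dim)
    t = torch.randint(0, 1000, (2,))         # one diffusion timestep per example
    gesture = torch.randint(0, 8, (2, 16))   # 16 gesture tokens per example
    out = model(x, t, gesture)
    print(out.shape)                         # torch.Size([2, 64, 256])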
The key insight
Gesture as musical intent
Traditional music AI generates from text prompts or reference examples. Melody Diffuser takes a different approach: creative control comes from gesture, with your movements shaping the music as it is generated.
Each gesture is embedded into a learned vector space. During denoising, the model attends to these embeddings through cross-attention, allowing your movement to directly influence pitch direction, rhythmic density, and melodic contour.
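This section doesn't spell out the sampler itself, so here is a hedged sketch of what gesture-conditioned denoising over discrete tokens could look like. It assumes a mask-and-remask (MaskGIT-style) scheme, a reserved mask token id of 0, and a hypothetical model(tokens, t, gesture) -> logits interface; Melody Diffuser's actual noise schedule and sampler may differ:

    # Sketch of gesture-conditioned sampling for discrete melody tokens.
    import torch

    @torch.no_grad()
    def sample_melody(model, gesture, seq_len=64, mask_id=0,
                      num_steps=16, device="cpu"):
        # Start from a fully masked melody and denoise it step by step.
        tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
        for step in reversed(range(num_steps)):
            t = torch.tensor([step], device=device)
            logits = model(tokens, t, gesture)               # (1, seq_len, vocab)
            confidence, prediction = logits.softmax(-1).max(-1)
            # Commit the most confident positions; re-mask the rest.
            keep = max(1, int(seq_len * (num_steps - step) / num_steps))
            threshold = confidence.topk(keep, dim=-1).values[..., -1:]
            remask = confidence < threshold
            tokens = torch.where(remask, torch.full_like(prediction, mask_id), prediction)
        return tokens  # discrete melody tokens, ready to decode to MIDI

At the final step every position clears the threshold, so no mask tokens remain and the result can be handed straight to a MIDI decoder like the one sketched earlier.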
Trained on 1 million+ melodies paired with gesture annotations.