pleroma.debian.social

ok *whew* I finally did it! I implemented convolution reverb as a vulkan compute shader, and the results seem to be correct. I have it convolving the audio up front at the moment, but it seems to be reasonably fast and the results seem more or less correct. I'm using SDL3 to verify the output. It doesn't look like it'll be too crazy to rework it such that the stream is generated live.

it turns out the main difficulty working with vulkan is accidentally breaking your laptop in half

@aeva yes... accidentally...

@aeva Their logo suggests that it's working as intended!

White-type Vulkan logo on a red field. A large swoosh shape, as of a hammer or sword, travels left to right and seems to slice the entire word Vulkan in twain!

@aeva that's a severe difficulty though

@aeva that poor laptop

@aeva you what. that's really cool

@artemis I got it running live now, as in the processing happens concurrently with playback (and the processing is faster than playback). The input stream is still a wave file though so it remains to be seen if this can be made low enough latency with a live audio stream to be incorporated into a modular synth and still be useful for live play

I reworked it so the convolution shader processes the audio in tandem with playback, so I'm *very* close to getting this working with live audio streams.

But more importantly, I used this to convolve my song "strange birds" with a choir-ish fanfare sound effect from a game I used to play as a kid and the result is like the grand cosmos opened up before me and I'm awash in the radiant light of the universe. Absolutely incredible.

I want to power through and get this into a state where I can use it with live instruments, but I am completely exhausted 😴

@aeva how's the latency of the process?

@oblomov the compute shader running synchronous with the program is currently set up for a stable cadence of 15 milliseconds with room for significant optimization. Average is 8 milliseconds, but I have it targeting 15 to leave some wiggle room for timing variance caused by Linux's scheduler. I do not yet know how much additional latency is caused by the audio system or SDL3 or my use thereof, but it shouldn't be too hard to quantify, and I have ideas for workarounds if it's problematic.

@aeva nice

I reworked some things and now my audio convolving compute shader can convolve ~11 milliseconds worth of audio samples with an average processing time of ~7 milliseconds. That's with one channel at a sample rate of 22050 Hz. At 44100 Hz, the average processing time is a paltry ~8 milliseconds.

also sometime in the last week I made it so it can operate entirely on a live input stream from SDL3 rather than a wave file, so in theory I can incorporate this into a modular setup now, but the results are higher latency than I'd like, and SDL3 doesn't give you much control over audio latency.

Apparently my best frame time can get as low as 3 ms. I think vulkan should let me VK_QUEUE_GLOBAL_PRIORITY_REALTIME this program, but sadly vulkan is being a coward about it.

@aeva do people usually use the GPU for audio? This seems like a really good idea for applications like live convolution!

ok the problem I'm having with latency now is that the audio latency in the system grows over time and I'm not sure why. like it starts snappy and after running for a short while it gets super laggy :/

I'm guessing it's because SDL3 can and will resize buffers as it wants to, whereas I'd rather it just go crazy if it underruns.

@benetherington I'm under the impression that it is somewhat unusual, but not unheard of

@aeva Very cool, glad I ran across this thread. GPUs have magic smoke like all electronics but I’m sure it’s dark magic (smoke) after trying to work with shaders in any capacity. Smushing audio streams in there is delightfully forbidden.

@aeva as someone whose previous job was on realtime audio stuff this would drive me absolutely bonkers

@benetherington there's a lot of fiddly bits to it for sure

@JoshJers I'm hoping there's some way to tell it to not do that

@aeva I mean you're using a single lane on a single thread of your boofing GPU, right? So it's basically 68000 perf, but annoying.

@aeva @JoshJers I remember when we grew the buffers on our D3D8 drivers and the throughput increased by about 10% which was nice but also the latency grew to around 3 seconds which is sub-optimal when playing Quake but totally fine for the Quake benchmark.

@aeva this is a surprise to me

@mcc I've noticed before that it will happily let me buffer minutes of audio in advance for output

@mcc and the docs sorta say it can

@mcc this also might be pipewire's fault though. I've seen something similar happen using pipewire to route stuff instead of just using alsa directly

@aeva I was curious about this buffer-resize as I've done some audio code in the past, but not with SDL3(yet). that would be frustrating to write well-timed audio code if it dynamically resized things on the fly rather than doing audio dropouts in buffer exhaustion.

Looks like there are hints you can set to force a buffer size in frames https://wiki.libsdl.org/SDL3/SDL_HINT_AUDIO_DEVICE_SAMPLE_FRAMES

Not sure if this could help? It doesn't seem to promise this will work, so may still need to fetch the actual buffer size gotten back.

@eggboycolor my program sets that hint. best I can tell it does nothing

What I want to do is have a fixed size buffer for input and output, enough that I can have the output double or triple buffered to smooth over hitches caused by linux. If my program can't keep up I don't want it to quietly allocate more data, I want it to scream at me LOUDLY and HORRIBLY, so that I'll rejigger my program until it is perfect.

What actually happens is it (sdl? poopwire?) just infinitybuffers so it never hitches and I get a second of latency after a little bit

@aeva wouldn't it hitch the first time the game fails to fill the buffer, causing it to make the buffer larger?

@cancel what game

@aeva I'm using game in place of application. whatever the application is. the user of SDL.

@cancel it's not hitching, but latency is growing. like, I start the system, and press a key on my instrument and it's ok. I noodle around for a bit and it's fine but the delay gets longer and longer until I press a button and there's this noticeable pause before anything plays. It is really weird.

@aeva darn. hope you're able to figure out something there :< maybe audio backend specific, if you don't mind diving through sdl's code/statically linking stuff with symbols to do more debugging, but yeah these sorts of latency things are indeed messy if they (or the backend they wrap) don't give easy direct control over buffers.

either way curious to know what comes of it, since it's been really cool following your thread especially re:applying Vulkan for real-time audio!

@aeva 🤔 my impression is that, if the layers outside of your code are dynamically growing in order to deal with some latency deadline not being met, you'd hear at least one glitch in the audio each time it has to grow. otherwise, how would it know the application/usage layer failed to fill the output audio buffer in time?

maybe something else is happening? which audio API are you using? it looks like SDL has multiple.

@eggboycolor I'll probably have to do something like that, though I might just rip out the SDL stuff and hammer ALSA directly

@aeva alternatively, it's possible that it does something extremely paranoid like hidden double buffering, where it wants your usage code to keep *two* buffers full, and if it drops down to only one, it starts growing the chain or buffer size. but this seems crazy and i doubt it.

@cancel I believe SDL is using pipewire by default. If I set it to ALSA it prints an error but otherwise behaves the same, suggesting it falls back to pipewire. If I set it to jack it freezes.

FWIW I have also seen this behavior with pipewire.

@cancel The scenario you are imagining only works if everything is running perfectly synchronously except the application. For example it could be that the recording device is feeding frames just a tad too fast. Or maybe SDL3 is gradually accumulating frames that do not exist via imperfect resampling. Or perhaps pipewire is complete dog shit. who knows

@cancel it is almost certainly doing that because the docs say it does

@aeva that's insane and i don't have any suggestion except to use something else for audio. if that's true they have made the problem way more complicated than it actually is.

(i am a professional programmer who does audio, including having done commercial embedded projects using linux with custom audio stuff)

@aeva the last audio thing i worked on in linux i shipped to the client with 5ms of audio latency from user input to dac output. i used alsa and it was like 200 or 300 lines of code to set up.

@aeva toot chains like this one remind me that programming is mostly about fighting bureaucracy, but as robot wars.

@cancel I will likely end up doing that. I don't suppose you know any good tutorial resources for working with alsa?

@lritter personally I object to the common characterization of programmers as bureaucrats. other programmers might write "business logic" but not me! I do something cool for a living

@cancel ok this is wild. I decided to start calling SDL_ClearAudioStream right before putting data into the output stream. the result is distorted af, but the latency still accumulates. the problem might not be sdl3 or the problem might be on the input stream

@aeva is the SDL audio system a push-mode design?

@cancel I don't know

@aeva do you fill the audio buffer from a callback in an audio thread, or does your application code just kinda chuck audio buffers at a sink when there's time to do so?

@aeva everything I saw was trash, and I just read the official docs.

@aeva that's right! we da kool bureaucratz

I like that pipewire has an option to not be terrible ("pro audio" mode) and it doesn't work

@aeva once they hit "not a total pulseaudio shitshow" it was mission-accomplished.

99% of audio problems on linux these days are just programmers refusing to just fucking use alsa. I'm part of the problem, because I'm using SDL3 instead because the API is simple. SDL3 is part of the problem because when I tell it to just fucking use alsa it uses pipewire instead! and pipewire is part of the problem because it's just completely terrible. like, wayland terrible.

want to have low latency audio on linux? we have a tool for it, it's called STOP PILING LAYERS OF BOILERPLATE ON TOP OF ALSA YOU IDIOTS YOU ABSOLUTE FOOLS

@aeva *snores in pulseaudio*

@aeva Was surprised and horrified to learn the solution to latency in Resolve through pipewire is... ...to install pulse.

Like ffs, linux audio, disenfuck yourself.

I'm like 30% sure SDL3 is not the problem or at least not the only problem because I tried resetting the streams every frame with SDL_ClearAudioStream and it still accumulates latency (in addition to also now sounding atrocious due to missing samples).

I've also seen this happen with pipewire before in other situations, and it was resolved by bypassing pipewire.

@lritter how do you make pulse audio laugh on monday

@aeva Did you try miniaudio yet? I don't quite remember what it uses as a default, but it has a whole bunch of compile time options to exclude stuff you don't want/need.

https://miniaud.io/docs/manual/index.html#Building

@lritter you tell it a joke on tuesday 😏

@joshuaelliott wait what really?

@code_disaster this is the first I've ever heard of it

@aeva Sounds like what you need is just setting right quantum values?

@aeva@mastodon.gamedev.place I can't tell if I like pipewire because my experience with it is good or if it's just good compared to fucking pulseaudio

goddddd pulseaudio

@aud let's be real, pulse audio sets the bar real low

@dos sounds like what I need is to throw it in the garbage and just use ALSA

@aeva
You can check the pipewire buffer size and latency with pw-top (disclaimer: I wrote a UI version of it called pipecontrol)

@aeva Annoyingly yeah. (Thankfully having both seems not to cause issues elsewhere, but... it just feels like a dirty install.)

@aeva there should be a cmake option to compile out pipewire 🤔

@aeva I'm used to audio systems using EITHER a buffer/watermark mode OR a realtime mode and you never want the former for low latency.

@aeva

It's been over 20 years since audio made me switch from Linux to FreeBSD.

The new version of OSS is proprietary, what shall we do?

FreeBSD: Well, the old version is still BSDL, I guess we'll just fork it and add low-latency in-kernel sound mixing and extend it with the features OSS 4 added.

Linux: Rip that stuff out of the kernel and replace it with ALSA, which doesn't do software mixing in the kernel at all!

KDE: Wait, now two apps can't go 'ping' on Linux. Let's write a sound daemon.

GNOME: Wait, now two apps can't go 'ping' on Linux. Let's write a sound daemon.

KDE and GNOME: Oh, now KDE and GNOME apps can't go 'ping' at the same time. I guess we should agree on some standards.

PulseAudio: Hi everyone! I have come to save you from the perils of usable sound! But now you can have sound move from your speakers to USB headphones when you plug them in! Maybe! If you get the config right.

Everyone: Nooo, someone let Lennart Poettering write some code! We're doomed!

Hans Petter Selasky: Wait, that thing with moving audio sounds useful. Rewriting all of your software to do it? Less so. *Writes virtual_oss to provide a layer that lets you send audio to USB devices with userspace drivers or to different in-kernel devices*.

PipeWire: Okay everyone, we can all agree PulseAudio was a bad idea, but we've rewritten all of the code and have a migration path. I guess we're good now?

FreeBSD: Curses, hps just died. I guess he won't be fixing all the things anymore. We'll need to start maintaining virtual_oss and integrate it with the base system. Should probably also fix a bunch of issues in the kernel drivers and make sure low-latency sound mixing is reliable and robust with new hardware. By the way, software that you wrote 20+ years ago still works fine with the kernel and userspace drivers and has low-latency mixing.

@david_chisnall the amazing thing here is ALSA has had software mixing for ages. there really is no reason for any of the shit people pile on top of it anymore

@mcc I wish this had a realtime mode. The most I've found is a snide remark in https://wiki.libsdl.org/SDL3/SDL_HINT_AUDIO_DEVICE_SAMPLE_FRAMES about people who want low latency.

@aeva I think you mean the tool is called OSSv4

@aeva I’d love your opinion on JACK. Signed: embarrassed, who has yet to get his synth to be audible from his Linux machine

@jmtd best I can tell JACK is mostly there to "add skill"

@portaloffreedom ah, interesting, it seems the reason why forcing SDL3 to use ALSA behaves identically is because it's picking pipewire's man-in-the-middle virtual device 😭 this has been illuminating, thanks for the recommendation to check out pw-top

@aeva maybe i was right to stick with portaudio so long…

@mcc maybe I should give that a try

@aeva I've been leaning heavy on a Rust-specific solution of late so I'm not up to date on best practices. PortAudio feels like it's for people who Care About audio. I don't offhand remember what latency guarantees it makes.

@aeva
Found the buffer size commands here: https://gitlab.freedesktop.org/pipewire/pipewire/-/wikis/Migrate-JACK

`pw-metadata -n settings 0 clock.force-quantum <bufsize>`

works only for the current session, it's not saved in a configuration file.

@aeva
As for using alsa directly, pipewire-alsa replaces the libraries, so that won't work.
To make it work, you would need to call the kernel api directly. Maybe statically linking the old alsa api directly? Might even be enough to force loading the dynamic one.
(Pulseaudio was doing the same).
But you do encounter all of the old Linux audio problems, like being in the audio group, only one audio program at the same time, etc...

@portaloffreedom maybe I should uninstall pipewire then

@mcc @aeva I care about audio and have fond memories of PortAudio from back when I wrote audio processing in C to get great performance on modest hardware. One of these days I want to do it again, with Zig instead of C.

I just realised I have no clue what the actual latency numbers were, but they seemed on par with professional DAWs.

Caveat: This was on Mac, not Linux.

@portaloffreedom @aeva You can also set the buffer for your program only, if you use Jack:

pw-jack -p 64 ./your-thing

This is what I've been using.

@nen @portaloffreedom this starts pretty snappy but the latency still renders the whole thing unplayable after a minute or so

@aeva There was something silly with software mixing in the early days of ALSA that I’m trying to remember. I think their OSS emulation did pass through, so you could only use the software mixing if your apps all used ALSA APIs; if one did OSS then the others could only mix if your hardware supported multiple channels, which cheap AC97 CODECs didn’t. My SoundBlaster Live! did, but the driver was staggeringly bad and caused kernel panics at least once a day (it also wasn’t SMP safe and would cause kernel panics in under a minute on SMP systems).

@david_chisnall this was very enlightening to read. Thank you!

I’m here for your rant on oom-kill whenever you’re ready.

@xconde You mean the kernel task that identifies the app with the most unsaved data and kills it? It got better once you could exclude the X server…

I could also rant about the balloon drivers in Linux that don’t interlock in any way with things that demand memory and so you see the sequence of:

  1. Process page faults.
  2. Balloon driver starts getting more memory.
  3. Page fault handler decides that there are no free pages.
  4. Process receives SIGSEGV and dies.
  5. Balloon driver provides memory to the system.
  6. User tries to figure out why the process died randomly when rerunning it works fine.

@david_chisnall @aeva What are you using on FreeBSD?

@jimmysjolund @aeva

OSS, anything else is a layer on top and rarely brings anything useful. Things like VLC and musicpd happily talk OSS directly, a lot of other things use a lightweight abstraction layer like libao.

Surround sound from VLC just works on my machine, no configuration needed. Other people have buggy firmware setting that mislabel pins on the audio devices so need a bit of tweaking.

@david_chisnall @aeva Ah I misread as "audio production". I have not fiddled with BSD but perhaps Ardour etc is available?

@jimmysjolund @aeva

Not something I’m familiar with, but you might be interested in this article from folks in that space.

Ardour is available in packages.

@david_chisnall @aeva Thanks! I have some reading to do.

@david_chisnall that was in the aughts, which was 30 years ago

@aeva I’m pretty sure 2025-2002 is not 30.

It was around then that I gave up on Linux and moved to FreeBSD. Nothing in the Linux world subsequently has made me regret that choice.

@david_chisnall linux is kinda dog shit if we're being fair. does bsd graphics yet?

@aeva FreeBSD has been able to use the same DRI/DRM graphics as Linux for many years, as well as proprietary NVIDIA drivers. I had working 3D on FreeBSD with an ATi Radeon 8500 and a Matrox G550, and pretty much everything I've owned since.

@aeva next step is to determine who is really growing the buffers: alsa, pipewire, or SDL.
I'm even tempted to ask for the source code and test it myself to help, not sure if I'll get the time to do it though...

@portaloffreedom be my guest https://github.com/aeva/convolver

you need an intel xe integrated gpu.

build it with `dotnet build` like any other normal c++ program, and an executable called convolver will appear. (you can also pick the build commands out of the csproj file i guess)

you'll also need to add a short wave file called revolver.wav to the same directory as the executable, preferably a gunshot sound but other things can work. 16-bit signed PCM mono at 22050 Hz is recommended for that file

*spaces out* so anyways, this is usually the point where I'd try to cut this down to a simple loopback with as few layers as possible and gradually build back towards my program until I either find where the fault is or have something working properly. That would mean targeting ALSA directly, except that appears to not be possible without uninstalling pipewire-alsa, which I can't do without uninstalling Steam :/

@aeva okay so this may be severely out of date advice (like … a decade or more) but if the problem is that pipewire is hogging all your physical mixing channels on your sound device, could you write an alsa.conf targeting them both at a synthetic dmix device instead of directly at the hardware?

so abnormally, this means starting with a pipewire loopback instead and seeing if all you brave defenders of the status quo are flickering my lights or not.

this makes me unhappy, but the single silver lining here is pipewire's API docs seem to be a little more newbie friendly than ALSA's

@glyph what I want to do is completely shut off pipewire temporarily so I can have one program (mine) talking to alsa directly without any virtualization and experience my back and shoulders unclenching, so that I have a ground truth reference point for what I can reasonably expect my computer to do. once I had that in hand, then I would restore pipewire and try to see if it still works with pipewire-alsa and go from there until I isolated the real problem and banished it forever

@glyph However, I did not install pipewire-alsa. As it happens, the reason why I have pipewire-alsa is because Steam wants it, and steam will uninstall itself if it is not available. I do not want to uninstall Steam. I am unhappy about this turn of events. I have been personally betrayed by Cave Johnson.

@aeva
I have an Intel arc770, would that work?

@aeva ah. I think you can also tell pipewire to talk to a null device, so you can functionally disable it without uninstalling anything

@portaloffreedom I'm guessing not because it's a discrete GPU, and the program is set up to stage audio buffers as memory that's shared between the GPU and CPU, but if you're willing to try anyway paste me `vulkaninfo --summary`

@aeva My experience is the opposite. Most of the API docs are completely nonexistent, and a large part of the API is an unholy abomination of header-only C inline defs that can't be used from other langs.

@portaloffreedom I pushed a change just now to fix how device selection handles discrete GPUs (it wouldn't have listed that it rejected it), but I also tested it with LLVM pipe just now and it does work, you just need to change the GroupSize constant in the csproj file from 32 to 8 and then it'll work.

@portaloffreedom I'll have to find a way to make that automatic later

@shinmera this is what I found https://docs.pipewire.org/page_tutorial.html

maybe newbie friendly isn't the right term, but it's probably sufficient for my needs

@aeva Yes. The examples are... OK, though they're also very lacking in explanations. The actual API doxygen is 99.9% without docstrings or explanations on anything though. I had a horrendous time trying to get output working without it either eating all my cpu or hitching horribly. Ultimately had to look at SDL2's implementation to get it working right 🫠

@aeva And I still needed a fucking C shim library to handle those horrendous inline towers. That library, despite only having 2 (two) calls in it is 32KB big (!!!)

@shinmera good to know

@aeva in any case I wish you luck and hope you conclusively figure out what is going on here

@aeva this kind of bullshit is why everything lives in containers now (and is consequently always way larger than it needs to be). ugh.

@aeva I opened up some documentation for present-day pipewire and alsa to see if I could say anything more specifically useful because I recognize this pain even if it’s been a while but then I noticed I was taking like 3 SAN damage per turn and needed to go back to my existing obligations

@mcc @aeva there is an extreme irony here in that this specific issue can only be made worse by containers (does alsa even participate in cgroups, why can I not even google an answer to that question?!?)

Love the UNIX® design philosophy where everything* is a file

*: reader’s note: pronounced as “a few things, sometimes”

@mcc @aeva
I’ve never thought much about rust before...
Is there a primer?

@aeva uploaded, ready for the weirdness of my system :D
(encrypted pastebin that expires in 1 week)
https://privatebin.app/?ee6be9121d68fbd0#-GtkwH5v2cDDxq9BbdkUxRAv2A85JNtV3A1ZShfD1jWKe

@aeva ok, I've managed to run something with llvmpipe (made a makefile if you want). I figured out something:
`pw-metadata -n settings 0 clock.force-quantum <bufsize>` will set the audio card buffer size
`pw-jack -p 64 ./your-thing` will set up the application buffer size.

I've managed to get a pretty small buffer size and keep it constant if I use them both, but I can't tell if it's an actual improvement.

In the image I've set the audio card to 32 and the convolver to 64
(QUANT is the buffer size)

@ipd @aeva …to … writing Rust?

@portaloffreedom how responsive is it after running for a few minutes?

@aeva
I get a tick every second since the beginning. I'm probably using too heavy a wav file.

ok I did it. I've got a program that writes a pipewire stream of F64 audio samples where each sample is the total elapsed time since the first frame, expressed in minutes.

I've got a second program that reads that pipewire stream, and checks the offset against its own elapsed time since the first sample processed. This program prints out the calculated drift every second.

The results are interesting.

In the first version of this, both programs just measured the time using std::chrono::steady_clock::time_point. This resulted in an oscillating drift that was well under a millisecond at its peak and nothing to be concerned about.

This is good! That means there's no place what so ever within pipewire on my computer for this specific audio setup where any intermediary buffers might be growing and adding more latency as the programs run.

This is not the interesting case.

In the second version, I changed the first program to instead calculate elapsed time as the frame number * the sampling interval, and left the second program alone.

In this version, the calculated drift is essentially the difference between the progress through the stream vs the amount of time that actually passed from the perspective of the observer. In this version, the amount of drift rises gradually. It seems the stream is advancing just a touch slower than it should.

The samples in the stream are reporting that more time has elapsed in the "recording" than actually has transpired according to the clock. The amount of drift accumulated seems to be a millisecond every few minutes.

I'm honestly not sure what to make of that.

anyways, for the curious, I put the source code for the experiment up here https://github.com/Aeva/slowtime

@aeva
Feels like a rounding or off by one error. Are you sure FramesProcessed = 1, and then (FramesProcessed + frame) is correct? What if FramesProcessed = 0?

@jannem the source is right there

@aeva yep. And no, that is correct.

@aeva please report it to pipewire if it's unexpected. They're extremely nice to work with in my experience. (Like, surprisingly good and helpful) They care about timing/sampling/latency.

also interesting is the drift is faster if I have the second program's monitor pin hooked up to my sound card, but there's still drift either way.

@aeva so, I'm looking at the code. I'm assuming FRAME_TIME_MODE is true. It looks like it uses the number of video frames rendered, the number of video frames per second, the number of audio buffers filled, the number of samples per second that are output from the audio device, and the size of that audio buffers are, and then sees that the time starts becoming misaligned over time. Meaning the number of audio samples rendered is running either too fast or too slow. Is that right?

@aeva so basically, over time,

(samples per second / samples rendered) != (frames per second / frames rendered)

Right?

@viraptor it is unexpected, but I am unsure whether it is problematic enough to bother someone about it

@cancel there's no video frames?

@cancel it's using the frame number (monotonically increasing sample index) multiplied by the sampling interval to calculate the progress through the stream in terms of play time, and then this is compared against the clock time in the other process. Since these are processed in batches, it is expected that the second program won't line up perfectly, but the fact that the delta increases over time means it is running too slow or too fast. I believe it is running too fast.

@aeva I saw

double(FramesProcessed + Frame) * Interval / 60.0

and assumed the 60 was video frames per second

@cancel I think this means that pipewire is scheduling these callbacks a touch earlier than it should on average. The error is about a millisecond every few minutes of real time, which is probably small enough that it wouldn't result in any significant pitch shifting, but I suspect that if any step in the process handles the passage of time more precisely than pipewire does then cool things might happen

@cancel oh, no, I was just putting it into minutes to make it easier to see in audacity before I wrote the second program

@aeva OK. I see.

So the comparison that's not working, then, is,

(samples rendered / samples per second) != (std::chrono::now_time() - std::chrono::start_time())

Is that right?

@cancel generally what you will see is this:

(samples rendered / samples per second) > (std::chrono::now_time() - std::chrono::start_time())

after running for a few minutes

@cancel that could potentially explain the problem of the increasing latency I was seeing. If SDL3 is handling time more precisely than pipewire and buffering the surplus it receives it might be functioning like a dam

@aeva the thing you are seeing is the clock drift between the quartz crystal clock in your audio interface and the one in your cpu

@cancel this happens even if I don't have this hooked up to any output devices

@cancel it happens *faster* if i do have it hooked up to an output. so you are correct, but there is also drift inherent in the system

@aeva not surprising, i suppose. maybe they accumulate the time delta slightly differently than your code, or they put some drift in there on purpose for simulation/headless testing purposes.

@aeva this is not really a problem that can be fixed. the clocks for the audio interface and the cpu are different, so the design of the software has to always deal with it. usually, for synchronized audio and video, the audio interface gets chosen as the authoritative clock, because audio glitches are less tolerable than an occasional frame of pulldown in the video.

@aeva what is a "monitor pin" in the context of this program?

@aeva this is kicking me repeatedly in the special interest, so: https://onlinelibrary.wiley.com/doi/10.1155/2008/583162

"Values of 1 ppm or 10−6 are typical for computer grade crystals (1 microsecond every second or 0.6 second per week)"

sound cards have independent sample clocks

1µs/s of drift between CPU clock and sound clock would be about 1ms of drift every 8-16 minutes

so it sounds like you are rattling the bars of the cage of the constraints of physically embodied computer hardware

@aeva (although nowadays I am doing very different stuff with it, the origin story of the code that eventually became <https://fritter.readthedocs.io/> was in a softphone & SIP framework called Shtoom by Anthony Baxter, whose initial version did exactly the naive thing that causes all kinds of hilarious drift problems over the course of a phone call lasting more than 3-4 minutes)

@cancel I don't need you to explain this to me

@glyph "so it sounds like you are rattling the bars of the cage of the constraints of physically embodied computer hardware" that happens a lot XD

@aeva yeah I am hesitant to even chime in here because this super does *not* seem particularly novel for you :)

@aeva for _most_ programmers encountering a problem like this I would probably have opened up here with some pablum about how there is no such thing as "real" time, clocks are all I/O devices, maybe some vague gesturing towards vector clocks, because those things are often surprises, but you are if anything much further down this road than I usually am. if you already knew more than this about average timing crystal purity metrics, or something more recent than 2008, I'm keen to know :)

@glyph this paper sounds like an exciting read XD I'll have to take a look later

@aeva re: this paper specifically, I do wonder if pointing a hair dryer or a jet of A/C air at various components on your computer affects the measurements you're taking ;)

@glyph @aeva and here is the point where *I* cannot resist sharing a paper because I think it's just the funniest way to demonstrate that at even an epistemological level we have at best a slippery grasp on time even existing

https://arxiv.org/abs/0811.3772

@SnoopJ @aeva hahaha _incredible_

@SnoopJ @aeva I read "Hilbert space" and I feel like I need to take a drink; I have low confidence that I would understand this if I had time to read it

@glyph tbh I don't know much about the crystal timers, but one fun thing about ghz processors is that by design they can't run at a perfect cadence, so as to prevent radio interference

I think my conclusions from this are

1. the latency drift I observed with my experiments with pipewire today is probably inconsequential.

2. there is probably nothing sinister about pipewire.

3. if you have a chain of nodes that are a mix of push or pull driven and have different buffering strategies, you are in the Cool Zone

4. my program is probably going to have to handle "leap samples" in some situations. I admit I wasn't expecting that, but it feels obvious in retrospect.

@aeva @glyph i'm curious about what happens if you disable spread spectrum clocking. according to the rk3588 trm there are registers you can use to fully disable ssc on the pll for the cpu.

(well, i know you don't get fcc certified)

5. the unplayable latency accumulation in my convolution experiment is a different unrelated problem, that is probably going to be solved by stripping out all the SDL3 audio stuff and replacing it with using pipewire directly. this is thankfully a minor inconvenience.

@aeva 4) yes, that is inherently going to be the case due to clock drift any time you have data traveling between multiple sources.

You have to deal with this in e.g., Ethernet land too, where the timebase at opposite sides of a link are required to be +/- 25 ppm but that level of drift still adds up over time. This is one of the things the inter-frame gap does, provides room to add/remove idle words to keep both ends in sync

@artemist @glyph my guess is mostly nothing for a while and then you get in a lot of trouble, but idk

@aeva the obfuscating factor that I would guess is most likely is that somewhere along the way the operating system has lied about some aspect of timing resources and how time is scheduled - or there was a lie before, and it's actually more honest now! From past experience I expect user processes to consistently drift behind wallclock time and to exhibit "rubberband" timing oscillations while trying to catch up.

I did a relatively deep dive into doing graphics frame timing based on "estimated frames from start based on wallclock" back in 2019, using SDL2, and I had to allow it to gradually drop frames, because the rubberband artifact was just too large and noticeable to tolerate - it's much more jarring to drop one frame and then skip two than to just drop one and pretend it's fine. I put a filter over my method of dropping frames so that the apparent deltas were perceptually more smooth, and my journey ended there.

So from my view, not having much insight into the kernel or pipewire internals, there could be a bug anywhere in that stack, or an assumption that you're still going to be behind wallclock time and you have to deal with it. The test case could be a "works for me" with a slightly different kernel. Or there could be a best practice that presumes you know you need to resample to account for this phenomenon, and maybe it was hidden before and this particular environment surfaced it to you, possibly unbeknownst to SDL.

I do remember old late 2000's versions of Ubuntu that played audio too fast - like, it was at the wrong pitch, or it oscillated. That kind of thing could reappear if the goal of pipewire is to be more transparent and low-latency.

@Triplefox "somewhere along the way the operating system has lied about some aspect of timing resources and how time is scheduled - or there was a lie before, and it's actually more honest now"

I think it would be more accurate to say that it is impossible to be perfectly precise about time on commodity computing hardware, so it's really more of a "if you can't be good be careful" kind of situation

@azonenberg I just think it's wonderful that that kind of complexity exists within the computer, and can just be a consequence of two pieces of software conceptualizing and quantifying time differently causing error to appear in different places. it is an ordinary thing, but still a delight.

@aeva I think it's actually hardware not software? Like, the sound card is running at 44.1 or 48 or whatever kHz, but its timebase may not be an exact integer multiple of whatever the CPU measures time at.

@azonenberg there's no sound card in this setup

@azonenberg like time drift happens faster if I connect the processes up to a real sound device, but it happens even if it's just two processes floating on the CPU completely disconnected from any output device

@aeva oh now that is interesting, I wonder what's causing it.

I'm used to this kind of drift but usually it's when you are dealing with two different pieces of hardware that each have their own timebase, which are a few ppm out of sync from one another

@azonenberg strictly speaking all I've done was estimate the relative drift between pipewire's throughput and std::chrono::steady_clock on my machine. given that std::chrono provides three different clocks and all of them have caveats, I don't think there's enough useful information to come to a firm conclusion

@azonenberg I think a better experiment would be to run a long audio stream and compare it with a higher quality time source (and run it long enough to overcome the wobble on reading that time source ofc)

@aeva
You're not wrong as such, but JACK does do a very nice job of routing/multiplexing.

I've stopped listening to people saying "OK this time Pipewire is production-ready for pro audio, no seriously I mean it" and started waiting for longtime JACK-enjoyers to say "yeah, they finally cracked it."

@aeva I'm a little late, but if you want to test things with ALSA or JACK without dealing with them directly, we've been pretty happy with PortAudio at work. https://www.portaudio.com/

We use it to sling 32 channels of latency-sensitive audio on Windows. (In our case with WDMKS, which ≈ ALSA on Windows.)

As far as abstractions go it's pretty thin and focused on low-level.

(There's no backend for PipeWire directly, but my understanding is PipeWire provides a JACK-compatible API that works well.)

@aeva Also I don't believe you need to uninstall PipeWire to use ALSA directly. I think you can just stop the PipeWire service.

Also I stumbled upon this when checking if PortAudio supported PipeWire: https://gitlab.freedesktop.org/pipewire/pipewire/-/wikis/Config-PipeWire#setting-buffer-size-quantum
You've maybe already seen it, but that PIPEWIRE_LATENCY environment variable might be helpful.

The linked FAQ also mentions a "Pro Audio Profile" that sounds very useful for what you're doing.

@aeva Also also: I don't think it is, but if this is at all related to your C# project and you end up liking PortAudio, here are the C# bindings I made for that work project: https://github.com/horizongir/PortAudio.NET
(Barely tested on Linux, but should work. If you can confirm it works for you I can find some time to add Linux to our CI and publish NuGet packages.)

@PathogenDavid the project is currently a C++ project but I might check it out in the future

@david_chisnall @aeva OpenBSD: Hi there! We have sndio.

@pertho @david_chisnall which bsd is the good one :3

@mcc @aeva

Or reading Rust.
At the moment I'm just watching it.

nice, pipewire has some special case stuff for filters

@aeva are there any microphone equalizers for Linux? I want to emulate the Ghub experience in windows. Without Ghub. Or Windows.

@Surlytom 🤷‍♀️

holy smokes I got it working :O!! i got my audio convolver working using the pipewire API directly!! and the latency seems to be very adequate for real time play :D

my revised opinion on pipewire is that I like that the API is wizards only. I'm a wizard, so that makes me feel special.

that or I'm just good at creating wizard problems for myself. either way I'm in a good mood.

https://github.com/Aeva/convolver/blob/c5d1ca8ec8a4aafd640def16d68e1c84bbc6b240/src/convolver.cpp#L509

@aeva I managed to make it work, but I wish I knew how to actually use it

@efi *dances around the room leaving behind a radiant after image*

@aeva *chases* owo

@efi eee kitty *pets*

@aeva *purr* <3

@aeva I think being good at both creating and solving wizard problems is the defining trait of a wizard

@aeva constantly setting up hurdles for Future You to jump over, so to speak

@aeva also probably the reason why 20th level Wizards aren't running the world

With like say a Monk or Barbarian it's pretty obvious. If you just maintain a large enough distance you're pretty safe.

But Wizards? They got Power Word Kill and Wish and Time Stop and yet somehow those poor border towns are overrun by *Giant Rats* and Goblins? Why do they let this happen?

Answer: on it on it on it, as soon as they figure out why the enchanted quill refuses to draw black runes without magenta ink

@rygorous relatable

@aeva nice work, very clean code 👌

@aeva sweet, way to go...going to give this a try just because I like to hear strange sounds come out of my guitar.

@skryking awesome! if you're on linux, clone https://github.com/aeva/convolver, build it with `dotnet build`, put a short 48000 hz mono sample wav file in the same directory as the output executable, name the sound file revolver.wav (or change the hard coded name in the source files), and run ./convolver to fire it up. you'll need to use helvum or something to hook up the inputs and outputs, and it only supports mono audio right now

@skryking if you don't have an appropriate integrated gpu, it should fall back to llvmpipe, which may or may not be sufficient to run it

@aeva thanks for the howto, was going to dig through code to try to figure out how to run it.

@skryking the csproj file also contains the build commands if you don't want to install dotnet

@aeva will it play nice with my rtx 4070 mobile?

@skryking no idea! it needs a unified memory architecture to work, as that allows the CPU and GPU to share allocations without having to do a run around with a copy queue

@aeva ok ill just give it a try.

@aeva So I got it running but I'm getting an error and just a clicking sound out of it when I feed in my guitar straight to it... should I amplify it first? is a 20 second clip too long?

@skryking that means it can't run fast enough to complete enough convolutions on time. try using a shorter wav file

@skryking I had been using this sample for the revolver https://sound-effects.bbcrewind.co.uk/search?q=07019168 trimmed down to roughly this, though if 1 second is too long you might want to try something shorter

starts at the first peak about a second in, goes for a second, and you can add a fade out to crop the fall off a little short

@skryking you could also turn the sampling rate down for the entire system. that might make latency a little worse but it'll give convolver more headroom. there's a global at the top of the file that controls the sampling rate. I found that SDL3 was doing a poor job at resampling the impulse response file (revolver.wav), so you'll probably want to resave it at the same rate. 22050 might be workable

@skryking were you able to get it working?

god damn this thing is so fucking cool. I've got it hooked up to my drum machine right now and the fm drum in particular is pretty good at isolating parts of the impulse response sample. I'm using a short sample from the Nier Automata song "Alien Manifestation" to convolve the drum machine and it sounds *amazing*. It's a shame I can't modulate the drum parameters on this machine, or I'd be doing some really wild stuff with this right now.

some small problems with this system:

1. I've had to turn down the sampling rate so I can convolve longer samples. 22050 hz works out ok though for what I've been messing with so far, so maybe it's not that big a deal. longer samples kinda make things muddy anyway

2. now I want to do multiple convolutions at once and layer things and that's probably not happening on this hardware XD

I'll probably have to switch to an fft based system for non-realtime convolution to make this practical for designing dynamic sound tracks for games that can run on a variety of hardware, otherwise I'll probably have to opt for actually recording my songs and stitching it together by some more conventional means

this thing is also really good at warming up my laptop XD

idk if I'm done playing around with this prototype yet, but I'd like to explore granular synthesis a bit soon. I think there's probably a lot of cool ways it can be combined with convolution, like having the kernel morph over time.

@aeva my math is rusty... how do you combine two impulse responses, anyway? Do you have to convolve one with the other?

@sixohsix oh i mean i want to have A × B + C × D, where × is a convolution and + is just mixing

probably first is reworking this program so i can change out the convolution kernel without restarting it or at least make it so i don't have to recompile it each time

anyways i highly recommend building your own bespoke audio synthesis pipeline from scratch, it's a lot of fun

@aeva I think the airwindows guy is using fft

@ohmrun idk who that is but i gather fft is the normal method for this

@aeva we've been meaning to, tbh

@ireneista it's very satisfying to make sounds

@aeva I built my own audio system and hate every time I have to work on it, so I guess different strokes and all that.

(fwiw:
https://shirakumo.github.io/libmixed/
https://shirakumo.github.io/cl-libmixed/
https://shirakumo.github.io/harmony/ )

@shinmera mine rewards me with magnificent sounds every time i play with it 😌

@aeva Yeah, I had much the same thought myself and I'm working towards it (slowly 😅)

Currently I have FFT based morphing for grains (grab two grain chunks, FFT into blocks, then over X frames linearly blend between the two), and FFT based convolution for filtering, so it's only a short hop to mash 'em together 🤔

@bobvodka oh cool, how's it sound?

@aeva Agreed! My DSP project is the most coding fun I've had in years, with bonus fun sounds too 🥳

The Graphics Programmer to Audio Programmer pipeline is real 😂

@aeva Mine frequently rewards me with ear-destroying noise and incomprehensible bugs

@shinmera puzzles :D

@aeva Sounds good to me.
It's nice how, over a long sample and a lot of "frames", you can hear the target sample slowly come in at various frequencies.

Should work well on smaller grains too; need to implement it in my 'grain swarm' code at some point.

@bobvodka awesome :D

Frankly @aeva ? I'd love to if I understood where to start and the involved math, it'd be a pleasure to suck at it as I do with rendering!

@aeva I have notes on shader reflection if you need them :3

@aeva something about scrolling up through this thread and the length of it make me somewhat doubt that statement…

@aeva
Someone with a great deal of experience in fft that open sources their code. What am I, chopped pork?

@ohmrun can i offer you an imaginary number in these trying times *hands you "mañanity", which is the number of days i will leave you waiting*

It occurred to me just now that I might be able to make this faster by rewriting it as a pixel shader. Each pixel in the output is an audio sample. Each PS thread reads a sample from the impulse response and the audio stream, multiplies them together, and writes out the result. To perform the dot product, the draw is instanced, and the add blend op is used to combine the results. I've also got a few ideas for variations that might be worthwhile.

Like, having the vertex shader or a mesh shader read the sample from the audio stream, have the PS read the impulse response, and stagger the draw rect. Main snag there is the render target might have to be 512x1 or something like that, or I'll have to do counter swizzling or something.

Also FP32 RGBA render targets would probably just batch 4 samples together for the sake of keeping the dimensions lower I guess.

I think this is likely to be a lot faster, because I've made a 2D convolution kernel a lot slower by rewriting it to be compute in the past 😎 but if any of y'all happen to have inside knowledge on whether IHVs are letting raster ops wither and die because AAA graphics programmers think rasterization is passé now or something absurd like that, do let me know.

@aeva The actual reason for that was almost certainly memory access patterns. Thread invocations in PS waves are generally launched and packed to have nice memory access patterns (as much as possible), compute waves and invocations launch more or less in order and micro-managing memory access is _your_ problem.

This really matters for 2D because there's lots of land mines there wrt ordering, but for 1D, not so much.

@aeva To give a concrete example: suppose you're doing some simple compute shader where all you're doing is

cur_pixel = img.load(x, y)
processed = f(cur_pixel, x, y)
img.store(x, y, processed)

and you're dispatching 16x16 thread groups, (x,y) = DispatchThreadID, yada yada, all totally vanilla, right?

@aeva well, suppose we're working in 32-thread waves internally (totally hypothetical number)

now those 32 invocations get (in the very first thread group) x=0,...,15 for y=0 and then y=1.

Say the image is R8G8B8A8 pixels and the internal image layout stores aligned groups of 4 texels next to each other and then goes to the next y, and the next 4-wide strip of texels is actually stored something like 256 bytes away or whatever.

@aeva so, x=0,..,3 y=0 are all good, these are all adjacent, straight shot, read 16 consecutive bytes, great.

x=0,...,3 y=1 in threads 16..19 are also good, these are the next 16 bytes in memory.

But if we have 256-byte cache lines (another Totally Hypothetical Number), well, those 32 bytes are all we get.

x=4,..,7 for y=0 and 1 are in the cache line at offset 256, x=8,...,11 for y=0,1 at offset 512, x=12,...,15 at offset 768.

@aeva And caches are usually built to have multiple "banks" that each handle a fraction of a cache line. Let's say our hypothetical cache has 16 16-byte banks to cover each 256B cache line.

Well, all the requests we get from that nice sequential load go into the first 2 banks and the rest gets nothing.

So that's lopsided and causes problems, and will often mean you lose a lot of your potential cache bandwidth because you only actually get that if your requests are nicely distributed over mem.

@aeva long story short, this whole thing with your thread groups being a row-major array of 16x16 pixels can kind of screw you over, if the underlying image layout is Not Like That.

This happens all the time.

Ordering and packing of PS invocations into waves is specifically set up by the GPU vendor to play nice with whatever memory pipeline, caches, and texture/surface layouts it has.

In CS, all of that is Your Job, generally given no information about the real memory layout.

Good luck!

@aeva If you do know what the real memory layout is, you can make sure consecutive invocations have nice memory access patterns, but outside consoles (where you often get those docs), eh, good luck with that.

The good news is that with 1D, this problem doesn't exist, because 1D data is sequential everywhere.

So as long as you're making sure adjacent invocations grab adjacent indices, your memory access patterns are generally fine.

(Once you do strided, you're back in the danger zone.)

@aeva also I want to emphasize that this Purely Hypothetical Example with row-major invocation layout in CS vs. a column-heavy layout in the HW is of course entirely hypothetical and in no way inspired by real events such as https://developer.nvidia.com/blog/optimizing-compute-shaders-for-l2-locality-using-thread-group-id-swizzling/

@rygorous @aeva
On the CPU it's generally best to organize structured data as separate contiguous arrays for each element. But with the graphics pedigree of GPUs does that still hold, or does it handle interleaved data better?

@jannem @rygorous It depends on the IHV whether SoA or AoS is better and in what situations. Usually there will be a document outlining recommendations somewhere.

@shinmera @aeva [i know nothing about audio processing so i'm like 99.9% sure that there's a good reason why the following doesn't make sense; asking the following out of curiosity]

can the ear-destruction be avoided by like... doing some kind of analysis/checks on the final sample before sending it to the audio device...? (e.g. checking & asserting that its amplitude is less than some upper bound?)

[but if it were that easy, it probably would have been the fist thing anyone would try, so]

@JamesWidman @shinmera I just try to remember to turn the volume down before testing new changes

@rygorous that sounds likely. I don't think I accounted for memory layout of the texture. I assume this is also why Epic seems to be so fond of putting everything in scan line order these days?

@rygorous so, my program as written is two linear memory reads, some basic arithmetic, and some wave ops. I think it should be pretty cache efficient, or at least I don't have any obvious ideas for making it moreso. I would think all the extra raster pipeline stuff would not be worth it, but the opportunity to move one of the loads into an earlier shader stage to effectively make it invariant across the wave and make use of the ROP to implement most of the dot product seems maybe worthwhile?

@rygorous the ROP is, like, free math, right?

@aeva @shinmera if i ever do audio programming, i will try to remember to make my program start with a giant ASCII-art splash screen that asks if the volume is set correctly before proceeding and makes me type "yes", because i would definitely forget sometimes (:

@aeva for 1D there's not much way to go wrong honestly, it's mainly a 2D (and up) problem

@aeva Not really. The "math" may be free but the queue spots are not, and you'll likely end up waiting longer in the shader to get to emit your output than you would've spent just doing the math directly

@aeva Looking at the shader you posted yesterday (?) at https://github.com/Aeva/convolver/blob/excelsior/src/convolver.cs.glsl, you're in the Danger Zone(tm)

@aeva the issue is SliceStart is derived from LaneIndex (Subgroup invocation ID) which is then multiplied by Slice

@aeva I don't know what value Slice has with the sizes you pass in, but it would be really bad if Slice works out to be some medium to large power of 2.

The issue is that the "i" loop goes in sequential samples but invocation to invocation (which is the dimension that matters), the loads inside are strided to be "Slice" elements apart.

You really want that to be the other way round. Ideally sequential loads between invocations.

@aeva so basically, try making the loop be "for (i = LaneIndex; i < SizeB; i += GROUP_SIZE)" and poof, suddenly those loads are mostly-sequential invocation to invocation instead of always hitting a few dozen cache lines

@rygorous I gave that a try earlier today, but it ended up being a wash. I think the slice sizes should be big enough for it to matter, so I'm guessing the savings are smaller than some other bottleneck elsewhere, probably on the CPU side

@aeva separately, don't want that % SizeA in there, that doesn't have to be bad but it can be, I don't know how good shader compilers are about optimizing induction variables like that

might want to keep that as an actual counter and just do (in the modified loop)

j += GROUP_SIZE;
j -= (j >= SizeA) ? SizeA : 0;

(you also need SizeA >= GROUP_SIZE now, but I don't think that changes anything in your case)

@aeva even on a GPU, if you do enough MADs per sample eventually you're going to be compute bound with this approach, but I'd be shocked if you were anywhere close to that right now.

First-order it's going to be all futzing with memory access.

@aeva I mean, you can literally do the math!

If you're on a GPU, then even on a mobile GPU from several years ago, you're in the TFLOP/s range by now for actual math.

So, ballpark 1e12 MADs per second.

48kHz stereo is ballpark 1e5 samples per second.

Math-wise, that means you can in theory do 1e7 MADs per sample, enough for brute-force direct convolution with a >3 minute IR. You're probably not doing that.

@aeva You can always do better convolution algs, but even for brute-force, the math is just not the problem for IR sizes you're likely using.

But as written in your code, you also have two loads for every MAD, and there's nowhere near that level of load bandwidth available, not even if it's all L1 hits.

Making it sequential across invocations should help noticeably. But beyond that, you'll need to save loads.

@rygorous huh. so is the ideal pattern something like out[0...N] += IR[0] * in[0...N], where the IR[0] is loaded once, and you basically just peel the first MAD for each sample being processed at once, and then do it all again for IR[1] etc. And the += would have to be an atomic add 🤔

@aeva I don't know about ideal but there definitely is some mileage to be had in loading one of the two into registers/shared memory in blocks, double-buffering the next fetch and having only one load via L1/tex in the inner loop.

That said the better FFT-based conv kinda nukes that.

Good news: FFT-based conv kinda automatically exploits all the sharing for you!

Bad news: that means you're now down to loading and using each IR FFT coeff exactly once.

@aeva It is work-efficient and gets both your load count and your mul-add count way down, but it also means what's left is kinda BW bound by construction and there's not much you can do about it.

(I mean you can make the block sizes small enough that you're still summing tons of terms again, but that's defeating the purpose.)

@rygorous ok weird thing happened just now, I gave your suggestion about changing the iterations another try and did notice the worst case runs were cheaper while the average was about the same (this is not the weird part), but then I dropped GROUP_SIZE (sets both the work group size and required subgroup size) from 32 to 16 and the average time went from 7.7 ms to 6.175 ms and the best recorded time went from 6.8 ms to 1.8 ms.

@rygorous this is on Intel. I tried setting the group size to 8 and the numbers got really nice but it stopped making sound lol

@rygorous oh no lol if I add a print statement into the hot loop on the CPU it starts making sounds again with GROUP_SIZE at 8. It looks like that might improve throughput enough to show that my synchronization with pipewire's thread is broken :3

@rygorous idk why dropping the group size did that, it didn't do that before. not sure if it's related to your change

@rygorous well, either way, it's down to about 8 ms of latency now so ty for the wisdom :)

@aeva this is a Side Question, do you have any idea how much of that is just fixed costs that don’t depend on the amount of compute at all? I‘m wondering because last time we looked at „should we run our audio processing on Not CPU“ the conclusion was a clear nope, latency to dispatch any work in 10ms batches already kills us, but likely a lot / most (?) of that was tflite not being set up for this type thing rather than the underlying systems, and we never had time to dig deeper than that

@halcy I'm seeing round trip times as low as 1ms with an average of about 6. I'm using a UMA GPU though, which lets me leave memory mapped to both the GPU and CPU. Most of my perf work is down to juicing the compute shader and bringing down the timing variance caused by synchronization and scheduler noise. Right now I have to leave 2 ms of headroom or the audio becomes unstable, so my latency is roughly 8 ms.

@halcy the rules are different if you're targeting discrete GPUs, but I'm not sure how much anymore since bus speeds are fast and we got stuff like resizable BAR now. Conventional wisdom says readback is a no-no, but I think enough has changed in the last few years to merit more investigation.

@aeva I will happily take credit if it works but must disclaim all responsibility if it does not!!!11

@halcy as more useful proof than measured numbers that can be wrong, I've been able to play using this with a live instrument.

@aeva @halcy main problem with readback is usually the way it synchronizes your CPU and GPU, not really that the readback itself is slow (ofc if you're reading something large that can be slow too). Since GPUs are big async chonkers if you just naively queue up some work and then wait for it on the CPU you're likely going to end up waiting for a lot more than you intended. But you can do high priority queues and whatnot these days if you want to juice the latency.

@rygorous wisdom was imparted, action was taken, and results were accomplished. nothing suspicious here 😎

@aeva thanks! That does seem lower than what I remember getting…. though probably still means that for the amount of computation we usually have to do in a frame, it‘s CPU all the way

unless maybe in The Future we make the model a lot bigger

@halcy I don't think audio processing on the GPU is worthwhile unless you're doing an absurdly expensive transform like what I'm doing or you're able to dispatch a large enough batch to saturate the GPU. there's a sweet spot where it is faster to do the round trip to the GPU, because time is convoluted

@aeva yeah, I think that‘s still the conclusion

which is unfortunate because I would really like to be paid to do that! but it is difficult to argue for when like, even if you implement everything very well the roundtrip already has highs that would (in presence of the rest of the audio stack) cause issues. like, right now our model runs, on a weakish desktop, with 2ms averages and 3 to 4ms highs, for a 10ms frame, and that’s already kind of as high as we dare going

@halcy well, if the end goal is to simply be paid to put audio on the gpu, then you can simply find a problem that fits the solution

@halcy haha time is convolved

@aeva (and there’s also power efficiency because Phones Exist, and there it becomes even harder still)

@halcy what sort of audio processing are you doing?

@aeva I like this gig. See, at work I am usually eventually required to put up or shut up.

@dotstdy @halcy so you do have limited bandwidth for transfers with discrete GPUs, so in theory that can also be a problem depending on how much data you want to read back. I think a lot of the superstition people have about doing any readback at all is a product of bad engine design. like, you can have everything pipelined perfectly and have multiple readbacks in the middle of the frame. you just need to check a fence to see what data is ready to process, and let it be asynchronous.

@dotstdy @halcy in the case of my audio convolver, the CPU thread that manages all of the vulkan stuff is basically just alternating between waiting for audio to buffer, running a compute dispatch, waiting for the results of the dispatch, bumping an atomic to tell the audio thread what is safe to read now, and then repeating the loop. I can probably shave the latency down a little more there by keeping the GPU hot, but the loop is tight enough that I'm not sure.

@aeva noise / echo / reverb suppression and/or speech enhancement in various different configurations, for voice calling. so essentially „run smallish neural network on audio fast“

and yeah sure we can make the task arbitrarily more complex by making the model larger but then we need to do annoying things like justify the added compute by showing a major quality improvement. maybe requirements will do it for us eventually if someone decides we must have fullband stereo audio or something

@halcy oh just make it worse and slap the "AI" sticker on it

@aeva @halcy right exactly, in an engine readback often means like "wait for downsampled depth buffer from the previous frame before starting to submit work for this frame", which can be a huge issue. most of the time your cpu and gpu executions are out of sync by a frame or so and throwing a random sync() in there is disaster. but nowadays you can even have a high prio compute pipeline that preempts whatever work is currently running (at least, if you can actually access that hw feature...)

@aeva pretty sure the ad copy is extremely AI'd up already. and unfortunately, if we make it worse, people will increasingly click the little „call audio was bad" button in the little window you sometimes get at the end of the call, making a number go down that then causes me (well, okay, our PM) to panic and stop rollout

@halcy wait that button actually does something ?!

@dotstdy @halcy yup. it would be a pain in the ass to retrofit a sufficiently crufty old engine for that, but there's some really cool stuff you could do designing for it.

Also it's worth pointing out that deep pipelines with full utilization are great if your goal is building a makeshift space heater out of the tears of AAA graphics programmers, but if your goal is juicing latency, frame time, and/or battery life, running the CPU and GPU synchronously can be advantageous.

@aeva doesn't immediately file a bug, but when you press the uhhh, idk what it looks like now, they've been messing with it, but: thumbs down or below 4 stars or sad smiley face, whatever it is, button, ideally also with a Problem Token (the little what-was-actually-bad questionnaire), then at least for media (so calling a/v, can't speak to how frontend or chat or whatever do it) in Teams/Skype it goes to a big database along with technical telemetry that is usually correlated with call quality (stuff like „percent of aborted calls"), which then feeds into a quite thorough A/B system, and if we spot regressions in what are considered Tier 1 metrics, rollout stops (no statsig positive change is okay generally if you can justify why the change fixes a rare bug, adds a useful feature or whatever. Though of course if you can move a problem token or T1 metric in the right direction, that's even better).

Mostly we catch issues before changes make it to public, though

@aeva anyways what I'm saying is you should definitely always vote 5 because that makes my KPI numbers go up

@halcy i'm reminded of this xkcd for some strange reason https://xkcd.com/958/

@halcy any time i get the urge to press one of those buttons, i'll gladly stab that 5 for you ♡

@aeva in no jokes land, please do press the buttons that most reflect the call experience, which will make my life easier by contributing to a realistic picture of where we're at and what we most need to work on and what is and is not working

@halcy the little reaction emoji thing has been broken for months, there's no overlay for it anymore so i don't see when people press it and i'm not sure if they see when i do. could you pass that along to whoever's problem that is?

also there doesn't seem to be an option to make it not use the GPU anymore which is somewhat problematic for me since i have to completely shut down teams when doing perf stuff

@aeva i can try, but these sort of reports tend to not go anywhere unless I can repro

@halcy i can try and provide more info tomorrow if you want