Making configuration global was a poor design choice, considering that a session
can contain different types of media streams (e.g. Opus and HEVC)
which have very different needs. For example, setting the
receiver's UDP buffer size to 40 MB makes no sense for Opus.
Now each connection can be configured individually, which is also
a feature needed for SRTP
This change reverts the earlier changes made to the global API
The security layer is injected between reading a datagram from the OS and
RTP/RTCP payload processing, so the obvious place for that layer is the socket.
Make all recv/send calls go through the socket API so the security
layer calls don't have to be duplicated everywhere
To prevent excess relocations while still minimizing the number of system
calls, OFR reads 15 datagrams from the OS with one system call if
more than 2% but less than 98% of the frame has been read.
These values are the result of experimentation; they lowered CPU
utilization the most. Compared to the simple receiver, OFR with
dynamic datagram read size uses 14% less CPU.
These numbers could be improved even further with media-specific
optimizations, such as tracking the intra or VPS
period to adjust the maximum datagram read size or the legal interval for
the maximum read.
These optimizations, however, are unlikely to yield much benefit
compared to the current state of OFR and are thus not
implemented. As it is, OFR is already able to receive HEVC at
580 MB/s while using 14% less CPU than the simple receiver, so for
high-quality video conferencing situations with multiple participants
it is a good choice.
The version that does not use sendmmsg(2) didn't return proper status
codes from __push_hevc_frame() after it had sent the smaller NAL units.
This caused it to send far less data than it should have
The NTP millisecond diff calculation seems to be incorrect (it gives
very strange results), but miraculously it still produced a playable
video stream.
I'll need to figure out what's wrong with the calculation at some point,
but for now switch to using HRC.
Linux seems to have an undocumented "feature" where it accepts only
1024 messages per sendmmsg(2) call.
So basically, if you gave it a buffer containing e.g. 1100 messages,
it would only send the first 1024 **without returning an error**.
This caused large intra frames not to be received fully, creating a
broken stream
When MSG_WAITFORONE is used, the system call returns 1..N packets,
but the code initially assumed all N packets were read, so the offset
pointer may need adjustment after the fragments have been processed.
The sequence number counter is only 16 bits long, meaning it overflows
quite fast, which can give the S fragment a larger sequence number
than the E fragment.
The previous calculation didn't take this into account, causing every
fragment after the first overflow to be discarded
Several classes (such as the reader and dispatcher) share common active_
and runner_ variables and stop/start/active routines.
Create one common class for these to make the interface cleaner
This is the third way of dealing with memory ownership/deallocation.
Application "lends" the memory for kvzRTP's use and when SCD has finished
processing the transaction associated with this memory, it will call the
deallocation hook provided by the application to release the memory.
This makes it possible, for example, to use custom allocators where
wrapping the memory in a unique_ptr is not suitable and creating copies
is also not acceptable.
This is a separate thread running in the background responsible for
executing system calls (mainly sending UDP packets).
This commit divides sending into a frontend and a backend:
- Frontend packetizes the media into "transactions" which are then
pushed to backend's task queue
- Backend executes these transactions FIFO-style and pushes executed
transactions back to frame queue for reuse
Frontend is the part of sending that executes in application's context
and backend (system call) happens in a background thread.
Ideally frontend and backend would be run on separate physical cores.
This change made sending significantly faster (from 650 MB/s to 720 MB/s)
and cut the delay experienced by the application from 315 us to 45 us
for large HEVC chunks (177 kB)
Using the system call dispatcher complicates the frame queue design because
we can no longer store e.g. NAL and FU headers on the caller's stack (as
SCD doesn't have access to that stack).
We must create a transaction object that contains all the necessary
information related to one media frame (all vectored-I/O buffers,
RTP headers, and the outgoing address).
This model works both with and without SCD and is much cleaner than the
previous implementation. It also makes a clearer distinction between
the frontend and backend of the sending operation by creating
a clear producer/consumer model.
One problem that has arisen is memory deallocation and ownership in
general: when SCD is used, it owns the memory given to kvzRTP by
push_frame() BUT it doesn't know what kind of memory it is, so it doesn't
know how to deallocate it. Some kind of deallocation scheme must be
implemented because right now the library leaks a lot of memory.
The payload format must be known when creating the Connection object
because the Connection owns the frame queue, and the frame queue needs the
format in order to allocate the correct media-specific headers for
transactions