PDA

View Full Version : [VU+ Solo4K] Write error that is not write error



B-N
12-11-16, 16:20
Hi support!

I get intermittent write errors on recordings a couple of times each week. I use a Synology DS413 as a network HDD for recordings. I have checked and rechecked the network and cables and can find no problem. No problem with diskspace or the server as far as I can determine.

So I started logging and am no longer so sure that error really is a write error. It may even be a serious error in OpenVIX code.

First indication of the error seems to come from the method
int eDVBRecordFileThread::AsyncIO::wait()
in the file https://github.com/OpenViX/enigma2/blob/master/lib/dvb/demux.cpp
I hope that is the right place and I’m not looking at old code :-)
I include a copy of that method down a bit.

I get two different errors: EINTR and Success (0, no error!)

This suggest to me that something might be wrong with the code so I started trying to understand how the code works and why it may be possible to get those errors. Lots of code so I have probably misunderstood some of how it is supposed to work :-)

If it really was a write error it would have been presented as a EDQUOT , EFBIG, EIO or ENOSPC or something like that.

EINTR means that the operation (in this case aio_write) was interrupted before it had any chance to begin writing data. If it would have been interrupted later it would have been a “short write”. In both cases the write need to be restarted.

The code just aborts on EINTR. Is that really the intended behaviour? The return value of aio_return is only used for error checking and don't check how many bytes was actually written successfully, to check for a short write.

I have no idea what signal causes the interrupt, maybe internal use in some library or elsewhere in Enigma2, but the possibility of an interrupt should always be taken into account in all code. Even if the interrupt is handled correctly so that user code continues without problems, system calls may be interrupted and has to be restarted after an interrupt.

My first thought was that maybe AIO-calls don’t need to check for interrupts or short writes. Searching the Internet indicates that some people believe so. But reading the documentation for AIO a bit I think it says that the AIO-calls behave exactly the same as the corresponding normal IO-calls with mostly the same errors and return values.

And I obviously get an interrupt during aio_write so in my mind that means it must be handled correctly! And from that follows that short writes must also be handled.

Even worse is that none of the potential short writes are logged or reported to the user. That may mean there are a lot of recordings out there that are missing small chunks here and there without anyone knowing. When I notice freezes and jumps in a recording I have previously said “it was probably bad weather when it was recorded”. Maybe not?

So I think there are two problems with the code: EINTR and short writes are not handled correctly.

Disclaimer: I have not been programming low level Unix code for the last 15 years so my memory may be a bit fuzzy and out dated about some details :-) I hope no one feels insulted by my opinions how code should be written in the best of worlds. And I have a tendency to talk to much :-)

I have yet to figure out why I got one “Success”-error, that is a bit strange.

Another thought: why does the code use AIO instead of normal IO?

The method I think generates the first error:


int eDVBRecordFileThread::AsyncIO::wait()
{
if (aio.aio_buf != NULL) // Only if we had a request outstanding
{
while (aio_error(&aio) == EINPROGRESS)
{
eDebug("[eDVBRecordFileThread] Waiting for I/O to complete");
struct aiocb* paio = &aio;
int r = aio_suspend(&paio, 1, NULL);
if (r < 0)
{
eDebug("[eDVBRecordFileThread] aio_suspend failed: %m");
return -1;
}
}
int r = aio_return(&aio);
aio.aio_buf = NULL;
if (r < 0)
{
eDebug("[eDVBRecordFileThread] wait: aio_return returned failure: %m");
return -1;
}
}
return 0;
}

Selected parts of the loggs:


...
< 52876.093> [eDVBRecordFileThread] wait: aio_return returned failure: Interrupted system call
< 52876.094> [eFilePushThreadRecorder] WRITE ERROR, aborting thread: Interrupted system call
< 52876.094> [eDVBRecordFileThread] waiting for aio to complete
< 52876.094> [eDVBRecordFileThread] Waiting for I/O to complete
< 52876.094> [eDVBServiceRecord] record write error
< 52876.094> [eDVBServiceRecord] stop recording!
< 52876.095> [eFilePushThreadRecorder] stopping thread.
< 52876.095> [eDVBRecordFileThread] Waiting for I/O to complete
< 52876.096> [eDVBRecordFileThread] buffer usage histogram (40 buffers of 188 kB)
< 52876.096> [eDVBRecordFileThread] 0: 18
< 52876.096> [eDVBRecordFileThread] 1: 5559
< 52876.096> [eDVBRecordFileThread] 2: 5507
< 52876.248> [eFilePushThreadRecorder] THREAD STOP
< 52876.295> [eDVBTSTools] setSource loading streaminfo for /media/hdd/movie/20161108 0625 - SVT1 HD - Gomorron Sverige.ts
< 52876.304> [eDVBServiceRecord] fixed up 16b837f25 to 45241 (offset 0)
< 52876.309> [RecordTimer] WRITE ERROR on recording, disk full?
< 52876.310> [Notifications] RemovePopup, id = DiskFullMessage
< 52876.310> [Notifications] AddPopup, id = DiskFullMessage

< 19043.820> [eDVBRecordFileThread] wait: aio_return returned failure: Interrupted system call
< 19043.820> [eFilePushThreadRecorder] WRITE ERROR, aborting thread: Interrupted system call
< 19043.820> [eDVBRecordFileThread] waiting for aio to complete
< 19043.820> [eDVBRecordFileThread] Waiting for I/O to complete
< 19043.821> [eDVBServiceRecord] record write error
< 19043.821> [eDVBServiceRecord] stop recording!
< 19043.822> [eFilePushThreadRecorder] stopping thread.
< 19043.824> [eDVBRecordFileThread] buffer usage histogram (40 buffers of 188 kB)
< 19043.824> [eDVBRecordFileThread] 0: 67
< 19043.824> [eDVBRecordFileThread] 1: 4879
< 19043.824> [eDVBRecordFileThread] 2: 4818
< 19043.995> [eFilePushThreadRecorder] THREAD STOP
< 19044.002> [eDVBTSTools] setSource loading streaminfo for /media/hdd/movie/20161111 1500 - H2 HD - Engineering Disasters.ts
< 19044.006> [eDVBServiceRecord] fixed up 9b646bb8 to a4fdb (offset 0)
< 19044.009> [RecordTimer] WRITE ERROR on recording, disk full?
< 19044.009> [Notifications] RemovePopup, id = DiskFullMessage
< 19044.009> [Notifications] AddPopup, id = DiskFullMessage
...
< 20762.202> [eDVBRecordFileThread] wait: aio_return returned failure: Success
< 20762.203> [eFilePushThreadRecorder] WRITE ERROR, aborting thread: Success
< 20762.203> [eDVBRecordFileThread] waiting for aio to complete
< 20762.203> [eDVBRecordFileThread] Waiting for I/O to complete
< 20762.203> [eDVBServiceRecord] record write error
< 20762.203> [eDVBServiceRecord] stop recording!
< 20762.204> [eFilePushThreadRecorder] stopping thread.
< 20762.205> [eDVBRecordFileThread] buffer usage histogram (40 buffers of 188 kB)
< 20762.206> [eDVBRecordFileThread] 0: 49
< 20762.206> [eDVBRecordFileThread] 1: 11634
< 20762.206> [eDVBRecordFileThread] 2: 11565
< 20762.268> [eFilePushThreadRecorder] THREAD STOP
< 20762.277> [eDVBTSTools] setSource loading streaminfo for /media/hdd/movie/Helmy/20161107 2100 - Animal Planet HD - My cat from hell.ts
< 20762.294> [eDVBServiceRecord] fixed up 14dd02ade to e4af6 (offset 0)
< 20762.294> [eDVBServiceRecord] fixed up 15f79eb9e to 11b80bb6 (offset 0)
< 20762.304> [RecordTimer] WRITE ERROR on recording, disk full?
< 20762.304> [Notifications] RemovePopup, id = DiskFullMessage
< 20762.304> [Notifications] AddPopup, id = DiskFullMessage
…..

spanner123
12-11-16, 17:50
My brain hurts!

Trial
12-11-16, 19:00
Hi,
asynchronous writes are used so that the load of the box is minimized. The effect is normally very drastic. It does not make a lot of sense to use synchronous writes as each write access will block the box for a while.

ciao

birdman
12-11-16, 19:54
So I started logging and am no longer so sure that error really is a write error. It may even be a serious error in OpenVIX code.Sounds similar to a thread for ~6 months ago (to which there was no resolution).

EDIT:

The log reports bits in that are in this area of the thread:
http://www.world-of-satellite.com/showthread.php?49261-quot-WRITE-ERROR-on-recording-disk-full-quot&p=387374&viewfull=1#post387374

Although there might have been some resolution - it appeared that there was, at least in one case, a network issue resulting in slow network file-system writing. (See last two posts).

birdman
12-11-16, 21:48
Hmmm, it does look as though there is a bug in the piece of code you posted in #1.

aio_error() can return EINTR, so that should also be catered for.

Then the odd error=Success can be explained by this from the aio_return() man page:

If the asynchronous I/O operation has not yet completed, the return
value and effect of aio_return() are undefined.

and if the return code is EINTR then the request hasn't yet completed.


Although the question that still obtains is, "What is sending a signal to cause the interrupt?".
And, perhaps, why the signal handler isn't set to restart system calls....

birdman
12-11-16, 22:04
Although the question that still obtains is, "What is sending a signal to cause the interrupt?".
And, perhaps, why the signal handler isn't set to restart system calls....
Turns out that lib/dvb/filepush.cpp is doing this deliberately.


static void signal_handler(int x)
{
}

static void ignore_but_report_signals()
{
/* we set the signal to not restart syscalls, so we can detect our signal. */
struct sigaction act;
act.sa_handler = signal_handler; // no, SIG_IGN doesn't do it. we want to receive the -EINTR
act.sa_flags = 0;
sigaction(SIGUSR1, &act, 0);
}

So it creates a handler that does nothing at all (so nothing shows up in the log) just so that EINTR can be received.
This appears to be so that async I/O can be terminated (set m_stop non-zero and send a USR1 signal).
Perhaps some possible situations haven't catered for this.

birdman
12-11-16, 22:07
I don't think it will be easy to set-up a situation to provoke this at will for testing....

birdman
12-11-16, 22:52
My first inclination is that EINTR should be ignore in the same way that EINPROGRESS is.
My second is that the whole attempt should be abandoned (since the read/write has been interrupted to abandon it...).

That's base on reading this:

You call aio_error(3) to see if the operation is in progress or is completed (or was canceled). If it returns anything other than 0, then either there was an error, it's in progress, or it was canceled. So, no writes actually happened and you cannot call aio_return(3). If it's in progress, either do something else and try again later, or use aio_suspend(3) to wait for it to complete.

from here:
http://stackoverflow.com/questions/31291136/errors-of-aio-write(which I've put here so that I know where to look for it in the future).
It's a matter of what "try again later" means. If you are aborting this read/write then perhaps the second and first above get swapped around?

B-N
12-11-16, 22:58
Thank you birdman for all explaining and thinking! I thought I was going a bit crazy trying to figure out what was going on :-)

birdman
13-11-16, 02:02
Thank you birdman for all explaining and thinking!However, don't assume that it's necessarily a correct analysis.

birdman
13-11-16, 04:14
My first inclination is that EINTR should be ignore in the same way that EINPROGRESS is.There's even a macro in unistd.h to do this:


/* Evaluate EXPRESSION, and repeat as long as it returns -1 with `errno'
set to EINTR. */

# define TEMP_FAILURE_RETRY(expression) \
(__extension__ \
({ long int __result; \
do __result = (long int) (expression); \
while (__result == -1L && errno == EINTR); \
__result; }))
#endif

birdman
13-11-16, 15:33
More thinking about this results in:


EINTR means that the operation (in this case aio_write) was interrupted before it had any chance to begin writing data. If it would have been interrupted later it would have been a “short write”. In both cases the write need to be restarted.

The code just aborts on EINTR. Is that really the intended behaviour?Yes, this is the intended behaviour: at least the abort of the write is - the reported error messages aren't.

The only signal that will cause this is SIGUSR1, and that is only sent when the async I/O needs to be aborted; because something else has already gone wrong. That's why the signal is sent in a way that forces any system call waiting on I/O to return with EINTR.

So this code does need to be tidied up to abort on EINTR (perhaps with a useful message, but certainly setting the return code to -1; and the same applies to the poll())

The real issue you have, though, has already happened by the time you get here, and your log in #1 doesn't record that.

What do the preceding 5s or so of log show?

B-N
14-11-16, 17:29
Thanks!
I include much more of one of the logs, starting before recording starts. But the event one step before the error is 271 seconds earlier...
There is one set of rows that looks a bit suspicions:
< NNNNNN.NNN> [eMainloop::processOneEvent] unhandled POLLERR/HUP/NVAL for fd NN(16)


< 51163.382> [SoftcamManager] oscam-latest already running
< 51163.383> [SoftcamManager] Checking if oscam-latest is frozen
< 51163.383> [eConsoleAppContainer] Starting /bin/sh
< 51163.496> [eMainloop::processOneEvent] unhandled POLLERR/HUP/NVAL for fd 42(16)
< 51165.385> [SoftcamManager] oscam-latest is responding like it should
< 51165.387> [Task] job Components.Task.Job name=SoftcamKontroll #tasks=1 completed with [] in None
< 51240.601> [RecordTimer] activating state 1
< 51240.604> [RecordTimer] Found enough free space to record
< 51240.605> [RecordTimer] Filename calculated as: '/media/hdd/movie/20161108 0625 - SVT1 HD - Gomorron Sverige'
< 51240.605> [Navigation] recording service: <enigma.eServiceReference; proxy of <Swig Object of type 'eServiceReference *' at 0xb1b61a40> >
< 51240.606> [Config] getResolvedKey config.usage.remote_fallback empty variable.
< 51240.606> [eDVBResourceManager] allocate channel.. 0045:0046
< 51240.606> [eDVBFrontend] opening frontend 0
< 51240.609> [eDVBFrontend] (0)tune
< 51240.610> Unicable (EN50494)
< 51240.610> [eDVBSatelliteEquipmentControl] polarisation: V band: L position: 0 satcr: 0 tunerfreq: 1209MHz vco: 2380MHz tuningword 0x00f5
< 51240.610> [eDVBSatelliteEquipmentControl] polarisation: V band: L position: 0 satcr: 0 tunerfreq: 1208MHz vco: 2344MHz tuningword 0x00ec
< 51240.610> [eDVBSatelliteEquipmentControl] polarisation: V band: L position: 0 satcr: 0 tunerfreq: 1211MHz vco: 2364MHz tuningword 0x00f1
< 51240.610> tune timeout 5197ms
< 51240.610> [eDVBSatelliteEquipmentControl] RotorCmd ffffffff, lastRotorCmd ffffffff
< 51240.610> [eDVBFrontend] prepare_sat System 1 Freq 10903000 Pol 1 SR 25000000 INV 2 FEC 3 orbpos 3592 system 1 modulation 2 pilot 2, rolloff 1
< 51240.610> [eDVBFrontend] tuning to 1211 mhz
< 51240.610> [eDVBChannel] OURSTATE: tuner 0 tuning
< 51240.610> [eDVBServicePMTHandler] allocate Channel: res 0
< 51240.610> [eDVBCIInterfaces] addPMTHandler 1:0:19:ED9:45:46:E080000:0:0:0:
< 51240.610> [eDVBChannel] getDemux cap=00
< 51240.610> [eDVBResourceManager] allocate demux cap=00
< 51240.610> [eDVBResourceManager] allocating demux adapter=0, demux=0, source=0 fesource=0
< 51240.610> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51240.612> [eEPGCache] saveEventToFile epg event id 6de8
< 51240.614> [RecordTimer] prepare ok, waiting for begin
< 51240.616> [Trashcan] probing folders
< 51240.618> [Trashcan] found following trashcan's: ['/media/hdd/movie/.Trash']
< 51240.619> [Trashcan] looking in trashcan /media/hdd/movie/.Trash
< 51240.620> [eDVBFrontend] set static current limiting
< 51240.622> [eDVBFrontend] invalidate current switch params
< 51240.622> [eDVBFrontend] tuner 0 setVoltage 1
< 51240.624> [eDVBFrontend] tuner 0 sleep 200ms
< 51240.651> [Trashcan] /media/hdd/movie/.Trash: Size: 92839005135
< 51240.663> [Trashcan] /media/hdd/movie/.Trash: Size now: 92839005135
< 51240.665> [Task] job Components.Task.Job name=Rensar skräp #tasks=1 completed with [] in None
< 51240.824> [eDVBFrontend] tuner 0 setVoltage 2
< 51240.824> [eDVBFrontend] tuner 0 setTone 0
< 51240.825> [eDVBFrontend] tuner 0 sleep 20ms
< 51240.846> [eDVBFrontend] tuner unlocked .. goto 12
< 51240.846> [eDVBFrontend] set sequence pos 12
[eDVBFrontend] tuner 0 sendDiseqc: e0105a00f1
[eDVBFrontend] diseqc ioctl duration: 113 ms< 51240.960> [eDVBFrontend] tuner 0 sleep 50ms
< 51241.010> [eDVBFrontend] tuner 0 setVoltage 1
< 51241.011> [eDVBFrontend] update current switch params
< 51241.011> [eDVBFrontend] tuner 0 startTuneTimeout 5197
< 51241.011> [eDVBFrontend] tuner 0 setFrontend: events enabled
< 51241.011> [eDVBFrontend] setting frontend 0 events: on
< 51241.014> [eDVBFrontend] (0)fe event: status 0, inversion off, m_tuning 1
< 51241.014> [eDVBFrontend] tuner 0 sleep 500ms
< 51241.060> [eDVBFrontend] (0)fe event: status 7, inversion off, m_tuning 2
< 51241.212> [eDVBFrontend] (0)fe event: status 1f, inversion on, m_tuning 3
< 51241.212> [eDVBChannel] OURSTATE: tuner 0 ok
< 51241.212> [eDVBLocalTimerHandler] channel 0x19331e0 running
< 51241.212> [eDVBChannel] getDemux cap=00
< 51241.212> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.212> [eDVBSectionReader] DMX_SET_FILTER pid=20
< 51241.212> [eEPGCache] channel 0x19331e0 running
< 51241.212> [eDVBChannel] getDemux cap=00
< 51241.212> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.212> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.212> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.213> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.213> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.213> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.213> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.213> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.213> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.213> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.213> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.213> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.213> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.217> [eEPGCache] next update in 2 sec
< 51241.217> [eDVBResourceManager] stop release channel timer
< 51241.217> [eDVBChannel] getDemux cap=00
< 51241.217> [eDVBServicePMTHandler] ok ... now we start!!
< 51241.217> [eDVBServiceRecord] RECORD service event 5
< 51241.217> [eDVBCAService] new service 1:0:19:ED9:45:46:E080000:0:0:0:
< 51241.217> [eDVBCAService] add demux 0 to slot 0 service 1:0:19:ED9:45:46:E080000:0:0:0:
< 51241.219> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.219> [eDVBSectionReader] DMX_SET_FILTER pid=0
< 51241.219> [eDVBServiceRecord] RECORD service event 6
< 51241.219> [eDVBServiceRecord] tuned..
< 51241.219> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.219> [eDVBSectionReader] DMX_SET_FILTER pid=18
< 51241.220> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.220> [eDVBSectionReader] DMX_SET_FILTER pid=0
< 51241.221> [eDVBLocalTimerHandler] diff is 0
< 51241.221> [eDVBLocalTimerHandler] diff < 120 .. use Transponder Time
< 51241.221> [eDVBLocalTimerHandler] not changed
< 51241.221> [eDVBChannel] getDemux cap=00
< 51241.398> [eDVBServicePMTHandler] PATready
< 51241.398> [eDVBServicePMTHandler] use pmtpid 0100 for service_id 0ed9
< 51241.398> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.398> [eDVBSectionReader] DMX_SET_FILTER pid=256
< 51241.399> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.399> [eDVBSectionReader] DMX_SET_FILTER pid=0
< 51241.400> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.400> [eDVBSectionReader] DMX_SET_FILTER pid=420
< 51241.400> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.400> [eDVBSectionReader] DMX_SET_FILTER pid=17
< 51241.406> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.406> [eDVBSectionReader] DMX_SET_FILTER pid=256
< 51241.442> [eDVBServiceRecord] now running: Gomorron Sverige sammandrag (1200 seconds)
< 51241.442> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.443> [eDVBSectionReader] DMX_SET_FILTER pid=18
< 51241.463> [eDVBServicePMTHandler] sdt update done!
< 51241.508> [eDVBServiceRecord] RECORD service event 5
< 51241.509> [eDVBCIInterfaces] gotPMT
< 51241.509> [eDVBCAService] don't build/send the same CA PMT twice
< 51241.509> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51241.509> [eDVBSectionReader] DMX_SET_FILTER pid=256
< 51241.514> [eDVBFrontend] set dynamic current limiting
< 51243.223> [eEPGCache] start caching events(1478582682)
< 51243.225> [eDVBSectionReader] DMX_SET_FILTER pid=211
< 51243.225> [eDVBSectionReader] DMX_SET_FILTER pid=561
< 51243.226> [eDVBSectionReader] DMX_SET_FILTER pid=18
< 51243.226> [eDVBSectionReader] DMX_SET_FILTER pid=18
< 51243.227> [eDVBSectionReader] DMX_SET_FILTER pid=18
< 51250.228> [eEPGCache] abort non avail schedule reading
< 51250.230> [eEPGCache] abort non avail schedule other reading
< 51250.231> [eEPGCache] abort non avail mhw reading
< 51250.247> [eEPGCache] nownext finished(1478582689)
< 51250.248> [eEPGCache] stop caching events(1478582689)
< 51250.248> [eEPGCache] next update in 60 min
< 51260.618> [RecordTimer] activating state 2
< 51260.618> [RecordTimer] start recording
< 51260.620> [Config] getResolvedKey config.usage.blinking_rec_symbol_during_recording empty variable.
< 51260.620> [Config] getResolvedKey config.usage.blinking_rec_symbol_during_recording empty variable.
< 51260.620> [eDVBServiceRecord] Recording to /media/hdd/movie/20161108 0625 - SVT1 HD - Gomorron Sverige.ts...
< 51260.623> [eDVBServiceRecord] start recording...
< 51260.623> [eDVBServiceRecord] RECORD: have 1 video stream(s) (040a), and 2 audio stream(s) (0bd0, 1030), and the pcr pid is 040a, and the text pid is 183c
< 51260.623> [eDVBServiceRecord] ADD PID: 0000
< 51260.623> [eDVBServiceRecord] ADD PID: 0100
< 51260.623> [eDVBServiceRecord] ADD PID: 040a
< 51260.623> [eDVBServiceRecord] ADD PID: 0bd0
< 51260.623> [eDVBServiceRecord] ADD PID: 1030
< 51260.623> [eDVBServiceRecord] ADD PID: 183c
< 51260.625> [setIoPrio] realtime level 7 ok
< 51260.625> [eFilePushThreadRecorder] THREAD START
< 51264.430> [eDVBServiceRecord] pcr of eit change: 16b837f25
< 51264.430> [eDVBServiceRecord] now running: Gomorron Sverige (12900 seconds)
< 51264.431> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 51264.431> [eDVBSectionReader] DMX_SET_FILTER pid=18
< 51523.391> [SoftcamManager] oscam-latest already running
< 51523.392> [SoftcamManager] Checking if oscam-latest is frozen
< 51523.392> [eConsoleAppContainer] Starting /bin/sh
< 51523.516> [eMainloop::processOneEvent] unhandled POLLERR/HUP/NVAL for fd 67(16)
< 51525.403> [SoftcamManager] oscam-latest is responding like it should
< 51525.405> [Task] job Components.Task.Job name=SoftcamKontroll #tasks=1 completed with [] in None
< 51883.402> [SoftcamManager] oscam-latest already running
< 51883.404> [SoftcamManager] Checking if oscam-latest is frozen
< 51883.404> [eConsoleAppContainer] Starting /bin/sh
< 51885.456> [SoftcamManager] oscam-latest is responding like it should
< 51885.458> [Task] job Components.Task.Job name=SoftcamKontroll #tasks=1 completed with [] in None
< 52141.221> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 52141.222> [eDVBSectionReader] DMX_SET_FILTER pid=20
< 52141.414> [eDVBLocalTimerHandler] diff is 0
< 52141.414> [eDVBLocalTimerHandler] diff < 120 .. use Transponder Time
< 52141.414> [eDVBLocalTimerHandler] not changed
< 52141.415> [eDVBChannel] getDemux cap=00
< 52236.430> [NetworkTime] setting E2 time: 1478583675.83
< 52243.413> [SoftcamManager] oscam-latest already running
< 52243.413> [SoftcamManager] Checking if oscam-latest is frozen
< 52243.414> [eConsoleAppContainer] Starting /bin/sh
< 52244.851> [eMainloop::processOneEvent] unhandled POLLERR/HUP/NVAL for fd 67(16)
< 52246.719> [SoftcamManager] oscam-latest is responding like it should
< 52246.721> [Task] job Components.Task.Job name=SoftcamKontroll #tasks=1 completed with [] in None
< 52414.238> [AutoTimer] Auto Poll
< 52414.239> [AutoTimer] Auto Poll Started
< 52414.239> [AutoTimer] No changes in configuration, won't parse
< 52414.239> [eEPGCache] event 22cb not found in epgcache
....and a number of the same kind of lines...
< 52418.118> [eEPGCache] lookup events with 'Marvels Agents of S.H.I.E.L.D.' as title (case sensitive)
....and all the other auto timers I have....
< 52420.833> [Task] job Components.Task.Job name=AutoTimer #tasks=13 completed with [] in None
< 52603.421> [SoftcamManager] oscam-latest already running
< 52603.422> [SoftcamManager] Checking if oscam-latest is frozen
< 52603.422> [eConsoleAppContainer] Starting /bin/sh
< 52603.583> [eMainloop::processOneEvent] unhandled POLLERR/HUP/NVAL for fd 67(16)
< 52605.450> [SoftcamManager] oscam-latest is responding like it should
< 52605.452> [Task] job Components.Task.Job name=SoftcamKontroll #tasks=1 completed with [] in None
< 52876.093> [eDVBRecordFileThread] wait: aio_return returned failure: Interrupted system call
< 52876.094> [eFilePushThreadRecorder] WRITE ERROR, aborting thread: Interrupted system call
< 52876.094> [eDVBRecordFileThread] waiting for aio to complete
< 52876.094> [eDVBRecordFileThread] Waiting for I/O to complete
< 52876.094> [eDVBServiceRecord] record write error
< 52876.094> [eDVBServiceRecord] stop recording!
< 52876.095> [eFilePushThreadRecorder] stopping thread.
< 52876.095> [eDVBRecordFileThread] Waiting for I/O to complete
< 52876.096> [eDVBRecordFileThread] buffer usage histogram (40 buffers of 188 kB)
< 52876.096> [eDVBRecordFileThread] 0: 18
< 52876.096> [eDVBRecordFileThread] 1: 5559
< 52876.096> [eDVBRecordFileThread] 2: 5507
< 52876.248> [eFilePushThreadRecorder] THREAD STOP
< 52876.295> [eDVBTSTools] setSource loading streaminfo for /media/hdd/movie/20161108 0625 - SVT1 HD - Gomorron Sverige.ts
< 52876.304> [eDVBServiceRecord] fixed up 16b837f25 to 45241 (offset 0)
< 52876.309> [RecordTimer] WRITE ERROR on recording, disk full?
< 52876.310> [Notifications] RemovePopup, id = DiskFullMessage
< 52876.310> [Notifications] AddPopup, id = DiskFullMessage
< 52963.431> [SoftcamManager] oscam-latest already running
< 52963.432> [SoftcamManager] Checking if oscam-latest is frozen
< 52963.432> [eConsoleAppContainer] Starting /bin/sh
< 52963.555> [eMainloop::processOneEvent] unhandled POLLERR/HUP/NVAL for fd 64(16)
< 52965.435> [SoftcamManager] oscam-latest is responding like it should
< 52965.436> [Task] job Components.Task.Job name=SoftcamKontroll #tasks=1 completed with [] in None
< 53041.415> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
< 53041.415> [eDVBSectionReader] DMX_SET_FILTER pid=20
< 53041.421> [eDVBLocalTimerHandler] diff is 0
< 53041.421> [eDVBLocalTimerHandler] diff < 120 .. use Transponder Time
< 53041.421> [eDVBLocalTimerHandler] not changed
< 53041.421> [eDVBChannel] getDemux cap=00

birdman
19-11-16, 03:06
There is one set of rows that looks a bit suspicions:
< NNNNNN.NNN> [eMainloop::processOneEvent] unhandled POLLERR/HUP/NVAL for fd NN(16)I get several (lots?) of those reported, so I doubt that's the problem.

You say this happens intermittently, but fairly frequently for you. It's just that no-one else seems to have this issue (except, possibly, one other) so it could be something specific to your set-up.
At the moment there isn't much info to go on, and the standard enigma2 code isn't going to give any more.
I take it you are running a solo4k. I could build an enigma2 for that with some additional debug statements, just to find out where the interrupting signal is coming from, which might help.

B-N
19-11-16, 18:59
Yes, its a VU+ Solo 4K, with a minimum of plug ins I think. OSCAM, Autotimers, CrossEPG, EPGUpdate and maybe some more. When it was necessary a while ago to do a fresh install, without restoring settings, I think I only imported timers from old backup.

It has been quiet a while but this morning I got one write error again.

I think it may give us a clue if we knew which signal it is. Is it much work for you to add some signal debug statements?

I have started thinking a bit about where random signals may come from, like:
Is the handling of for example arithmetic exceptions different on a VU+ Solo 4K, ARM, than on other hardware?
What signals might be generated inside Python? Do they signal arithmetic exceptions via a SIGFPE and not setting SA_RESTART and is Python even running scripts inside the same process that get the EINTR?

B-N
19-11-16, 20:23
It would also be interesting to check the return value of aio_return (how much that really was written) compared to the aio_nbytes in the aio_write call. To check for partial/short write. Maybe not so easy as it depends on the len parameter in a couple of call levels?

birdman
19-11-16, 20:32
Is it much work for you to add some signal debug statements?Not once I've got the first one done, which has thrown up a problem, but that's just something else to fix/workaround...


Is the handling of for example arithmetic exceptions different on a VU+ Solo 4K, ARM, than on other hardware?No, and if these signals were being thrown you'd get a crash as there's no handler for them. The fact that there is no crash (just an error) indicates that it is SIGUSR1 being thrown by the program itself, as that's the only signal with a handler that does nothing.

birdman
30-11-16, 02:46
Sorry for the delay. I was re-doing my development environment to have more space for it. It's now tidy...

Here's an enigma2 executable (all that is needed) for a vusolo4k that contains some printed info when a SIGUSR1 signal is raised and handled.

This is for 4.2.018.

51515

To install it you'll need to


init 4
cd /usr/bin/
cp -p enigma2 enigma2.orig
cp {new-file} enigma2
chmod 755 enigma2
init 3
Then, if you get the error again we can see where the signal is being sent from.

If you're running a version other than 4.2.018, let me know.

B-N
30-11-16, 13:26
Thank you!

I had updated to 4.2.019 this morning, before reading this, but I installed it anyway and the receiver seems to be running fine.

Now I just hope there are new write errors, its been quiet for a a couple of days.

birdman
30-11-16, 13:37
I had updated to 4.2.019 this morning, before reading this, but I installed it anyway and the receiver seems to be running fine.Should be fine. I can't think of any changes that would cause a problem.


Now I just hope there are new write errors, its been quiet for a a couple of days.Odd to be hoping for something to break.

I haven't added any code to try to fix the errors - I thought I'd see where they were occurring first.

B-N
03-12-16, 16:59
Odd to be hoping for something to break.
That is true :-) But as an old programmer I believe that a bug that shows itself is better than one that is hiding.

And my wish came trough this morning :-/ Your SIGUSR1 handler seem to be in place but does not get activated. Implies some other signal? Is it easy to expand you handler to show more signals? I guess catching/tracing all possible signals may cause havoc.


<241374.855> [RecordTimer] activating state 2
<241374.855> [RecordTimer] start recording
<241374.857> [Config] getResolvedKey config.usage.blinking_rec_symbol_during_recording empty variable.
<241374.857> [Config] getResolvedKey config.usage.blinking_rec_symbol_during_recording empty variable.
<241374.857> [eDVBServiceRecord] Recording to /media/autofs/RATATOSK/movie/20161203 0755 - TV4 HD - Nyhetsmorgon.ts...
<241374.868> [eDVBServiceRecord] start recording...
<241374.868> [eDVBServiceRecord] RECORD: have 1 video stream(s) (0417), and 1 audio stream(s) (0be7), and the pcr pid is 0417, and the text pid is 179a
<241374.868> [eDVBServiceRecord] ADD PID: 0000
<241374.868> [eDVBServiceRecord] ADD PID: 004f
<241374.868> [eDVBServiceRecord] ADD PID: 0417
<241374.868> [eDVBServiceRecord] ADD PID: 0be7
<241374.868> [eDVBServiceRecord] ADD PID: 179a
<241374.870> [setIoPrio] realtime level 7 ok
<241374.870> [eFilePushThreadRecorder] THREAD START
<241374.870> [eFilePushThread] eFilePushThreadRecorder::thread setting SIGUSR1 handler
<241377.993> [eDVBServiceRecord] pcr of eit change: 40a51a3e
<241377.993> [eDVBServiceRecord] now running: Nyhetsmorgon (12900 seconds)
<241377.993> [eDVBDemux] open demux /dev/dvb/adapter0/demux0
<241377.993> [eDVBSectionReader] DMX_SET_FILTER pid=18
<241468.781> [SoftcamManager] oscam-latest already running
<241468.782> [SoftcamManager] Checking if oscam-latest is frozen
<241468.783> [eConsoleAppContainer] Starting /bin/sh
<241468.907> [eMainloop::processOneEvent] unhandled POLLERR/HUP/NVAL for fd 67(16)
<241470.796> [SoftcamManager] oscam-latest is responding like it should
<241470.797> [Task] job Components.Task.Job name=SoftcamKontroll #tasks=1 completed with [] in None
<241611.041> [eDVBRecordFileThread] wait: aio_return returned failure: Interrupted system call
<241611.041> [eFilePushThreadRecorder] WRITE ERROR, aborting thread: Interrupted system call
<241611.042> [eDVBRecordFileThread] waiting for aio to complete
<241611.042> [eDVBRecordFileThread] Waiting for I/O to complete
<241611.042> [eDVBServiceRecord] record write error
<241611.042> [eDVBServiceRecord] stop recording!
<241611.043> [eFilePushThreadRecorder] stopping thread.
<241611.043> [eFilePushThread] eFilePushThreadRecorder::stop raising SIGUSR1
<241611.044> [eFilePushThread] signal_handler entered for signal 10
<241611.047> [eDVBRecordFileThread] buffer usage histogram (40 buffers of 188 kB)
<241611.048> [eDVBRecordFileThread] 0: 5
<241611.048> [eDVBRecordFileThread] 1: 721
<241611.048> [eDVBRecordFileThread] 2: 714
<241611.153> [eFilePushThreadRecorder] THREAD STOP
<241611.159> [eDVBTSTools] setSource loading streaminfo for /media/autofs/RATATOSK/movie/20161203 0755 - TV4 HD - Nyhetsmorgon.ts
<241611.161> [eDVBServiceRecord] fixed up 40a51a3e to 4187b (offset 0)
<241611.163> [RecordTimer] WRITE ERROR on recording, disk full?
<241611.163> [Notifications] RemovePopup, id = DiskFullMessage
<241611.163> [Notifications] AddPopup, id = DiskFullMessage
<241674.100> [CrossEPG_Auto] onTimer occured at lör 3 dec 2016 07.59.59
<241674.101> [CrossEPG_Auto] poll delaying as recording.
<241674.101> [CrossEPG_Auto] Enough Retries, delaying till next schedule. lör 3 dec 2016 07.59.59
<241674.105> [Skin] processing screen MessageBox:
<241674.113> [Skin] processing screen MessageBox_summary:
<241674.117> [CrossEPG_Auto] Time set to sön 4 dec 2016 08.00.00 (now=lör 3 dec 2016 07.59.59)
<241674.119> [CrossEPG_Auto] Time set to sön 4 dec 2016 08.00.00 (now=lör 3 dec 2016 07.59.59)

birdman
03-12-16, 23:00
And my wish came trough this morning :-/ Your SIGUSR1 handler seem to be in place but does not get activated.
Yes it does. That's these entries:

<241611.043> [eFilePushThread] eFilePushThreadRecorder::stop raising SIGUSR1
<241611.044> [eFilePushThread] signal_handler entered for signal 10
As far as I can see, in this case there has been a problem writing to the network file share, so the recording was stopped. That's this bit:


<241611.163> [RecordTimer] WRITE ERROR on recording, disk full?The "disk full" is just a guess.

So in this case it seems all has gone as intended once there is a write failure for the recording.
Was the remote server full?
Was there a network issue (router down?).
Was there anything in the server system logs about errors?

birdman
03-12-16, 23:10
I does not make a lot of sense to use synchronous writes as each write access will block the box for a while.Unless you specify that you really want a synchronous write (unusual) all writes are effectively asynchronous, as they just get written to the buffer cache, which is flushed to disk/remote server asynchronously.

The place where async I/O will make a difference is in reading, as you don't have to wait for the result so can post future reads and then get on with something else.

birdman
03-12-16, 23:20
So in this case it seems all has gone as intended once there is a write failure for the recording.Although it may still be reporting the "Success" as a failure - that may be after the end of the part of the log you reported.

birdman
04-12-16, 00:07
As far as I can see, in this case there has been a problem writing to the network file share, so the recording was stopped.I now see what you mean.
It's reporting Interrupted System Call before the code sends USR1 (which it does when the thread is shutdown).
So there must be another signal raised. But any other signal should result in enigma2 stopping....

...except SIGCHLD. Perhaps that's the next thing to look at...but if you re getting that and not handling it you should (I think) be seeing zombie process (<defunct> in ps -ef output).

B-N
04-12-16, 13:52
There IS actually a process that ends 141 seconds before the interrupt:


<241468.781> [SoftcamManager] oscam-latest already running
<241468.782> [SoftcamManager] Checking if oscam-latest is frozen
<241468.783> [eConsoleAppContainer] Starting /bin/sh
<241468.907> [eMainloop::processOneEvent] unhandled POLLERR/HUP/NVAL for fd 67(16)
<241470.796> [SoftcamManager] oscam-latest is responding like it should
<241470.797> [Task] job Components.Task.Job name=SoftcamKontroll #tasks=1 completed with [] in None
<241611.041> [eDVBRecordFileThread] wait: aio_return returned failure: Interrupted system call

I have checked some of the older logs, and there is the same situation. One or more checking if OSCAM is running before the interrupt. Different delays until the interrupt: 91, 278, 271, 198 seconds. Nothing strange as the check seem to be run around every 10 minutes. Probably has nothing to do with this problem, but just to make sure that the checking is not the problem I will test and disable softcam running check and see if that makes any difference.

By the way: server has 1.5 TB of free space and no errors in its logs. Server DO run some indexing of media files ones in a while. IF the indexing does something stupid, like locking the files it is indexing that would probably create some problems. But that should not be reported as an EINTR. Unless there is something wrong in the aio-code in libc, nfs or kernel. They should be reported as EAGAIN or EWOULDBLOCK, EIO, ENOSPC or something other than EINTR... But you never know for sure as most programs will probably handle EINTR (and partial writes) by just restarting the call so such an error might be plausible. I could disable the indexing.

birdman
04-12-16, 14:09
There IS actually a process that ends 141 seconds before the interrupt:I noticed that, but any signal should be virtually instantaneous.


Server DO run some indexing of media files ones in a while. IF the indexing does something stupid, like locking the files it is indexing that would probably create some problems. But that should not be reported as an EINTR. More importantly to get EINTR there must have been a signal generated, but there is no sign of one, and no particular reason for one.

I'm rebuilding enigma2 with SIGPIPE and SIGCHLD handled in the same way as SIGUSR1 - to see if that is what it is.

It will take an hour or so (as I improved my build set-up, so it's had to start from scratch again for the whole image.

birdman
04-12-16, 15:34
I'm rebuilding enigma2 with SIGPIPE and SIGCHLD handled in the same way as SIGUSR1 - to see if that is what it is.

It will take an hour or so (as I improved my build set-up, so it's had to start from scratch again for the whole image.OK. Here is is. Perhaps this one will track down why EINTR gets returned.

51585

B-N
04-12-16, 16:30
Sorry, something is wrong with the attachment, I get this error:


vBulletin Message
Invalid Attachment specified. If you followed a valid link, please notify the administrator

B-N
04-12-16, 16:48
A thought: What about SIGALRM? Is there any place it might not get handled and SA_RESTART not set on sa_flags?

birdman
04-12-16, 17:49
Sorry, something is wrong with the attachment, I get this error:Odd. So do I now, but it worked OK when I first posted it (I checked as it was name as "Attachment.." rather than the name of the file which I uploaded.)


A thought: What about SIGALRM? Is there any place it might not get handled and SA_RESTART not set on sa_flags?Well, the relevant thing to ask is what would happen if it didn't get handled - and I think that the default handler would crash the program. Also, it's not whether it gets handled: it's the raising of a signal that is causing the EINTR.

Here's that attachment again:

51588

I could build one to check for SIGALRM too, but in order to test for it I have to replace any current handler for it (not that I can find any) and hence it wodul be ignored.
(A pity the PL/1 condition mechanism doesn't exists here...)

B-N
07-12-16, 11:34
I just got one of those "Success" errors, but nothing interesting just before the error? Or does "<592506.062> [eFilePushThread] signal_handler entered for signal 17" say anything?

<592505.902> [SoftcamManager] oscam-latest already running
<592505.903> [SoftcamManager] Checking if oscam-latest is frozen
<592505.903> [eConsoleAppContainer] Starting /bin/sh
<592506.062> [eFilePushThread] signal_handler entered for signal 17
<592506.062> [eFilePushThread] signal_handler - reaped 18062, state:00000000
<592506.062> [eMainloop::processOneEvent] unhandled POLLERR/HUP/NVAL for fd 67(16)
<592506.063> [SoftcamManager] oscam-latest is responding like it should
<592506.071> [Task] job Components.Task.Job name=SoftcamKontroll #tasks=1 completed with [] in None
<592657.917> [eDVBRecordFileThread] wait: aio_return returned failure: Success
<592657.917> [eFilePushThreadRecorder] WRITE ERROR, aborting thread: Success
<592657.917> [eDVBRecordFileThread] waiting for aio to complete
<592657.917> [eDVBRecordFileThread] Waiting for I/O to complete
<592657.917> [eDVBServiceRecord] record write error
<592657.917> [eDVBServiceRecord] stop recording!
<592657.918> [eDVBRecordFileThread] Waiting for I/O to complete
<592657.919> [eDVBRecordFileThread] buffer usage histogram (40 buffers of 188 kB)
<592657.920> [eDVBRecordFileThread] 0: 420
<592657.920> [eDVBRecordFileThread] 1: 39534
<592657.920> [eDVBRecordFileThread] 2: 38674
<592657.920> [eFilePushThreadRecorder] stopping thread.
<592657.920> [eFilePushThread] eFilePushThreadRecorder::stop raising SIGUSR1
<592658.036> [eFilePushThread] signal_handler entered for signal 10
<592658.037> [eFilePushThreadRecorder] THREAD STOP
<592658.056> [eDVBTSTools] setSource loading streaminfo for /media/autofs/RATATOSK/movie/20161207 0625 - SVT1 HD - Gomorron Sverige.ts

birdman
07-12-16, 11:44
I just got one of those "Success" errors, but nothing interesting just before the error? Or does "<592506.062> [eFilePushThread] signal_handler entered for signal 17" say anything?
Yes - it does. Although what,exactly, I can't say as I'm not sure what signal 17 is on your box.
It's either a SIGPIPE or a SIGCHLD.
Can you login to the box and run:

kill -l
so we can figure out which it is? (I suspect it's SIGCHLD).
You shouldn't be getting either (as far as I can see) so this is the cause of the problem.
I'm also interested that an unhandled POLLERR/HUP/NVAL is reported at the same time.

B-N
07-12-16, 13:05
Yes! 17 is SIGCHLD.

root@karakal:~# kill -l
1) HUP
2) INT
3) QUIT
4) ILL
5) TRAP
6) ABRT
7) BUS
8) FPE
9) KILL
10) USR1
11) SEGV
12) USR2
13) PIPE
14) ALRM
15) TERM
16) STKFLT
17) CHLD
18) CONT
19) STOP
20) TSTP
21) TTIN
22) TTOU
23) URG
24) XCPU
25) XFSZ
26) VTALRM
27) PROF
28) WINCH
29) POLL
30) PWR
31) SYS
32) RTMIN
64) RTMAX

birdman
07-12-16, 19:32
Yes! 17 is SIGCHLD.Good.
It seems to be a mips thing which makes 12 SYS, so 17 becomes USR2 and CHLD moves to 18.
The arm ones look like the x86 ones.

So, that signal would cause your problem.

Now we just have to find out what sent it and how it can be sent when the running code doesn't expect to see any signals (except the USR1 ones it sends itself to abort async I/O).

All of this is asynchronous and can be the result of anything anywhere in the program. Indeed SIGCHLD is the result of some other process...

B-N
16-02-17, 19:14
Thought I should report a bit what happened next with my recording problems.

Figured it was important to decide if it was a software problem or a real hardware problem. To test this I started thinking it might be easiest to just test different images, but trying images that does not have to close heritage to ViX. The part of the code that does recordings seem to be the same in ViX, OpenPLI, VTI and some others. I thought that maybe Blackhole was different enough. So I have now tested Blackhole for more than a month and I have no recording problems!

I am still not sure if the rouge signals comes from inside ViX or are provoked by some behaviour of my local network. It is quite possible that deep down in libc io routines signals are generated and truncated writes return in some special cases. So I still believe ViX recording code is much to sensitive to disturbances and need to be rewritten to check and use all return values and restart write calls.

The downside of Blackhole is that I miss some of the nice features of ViX, but working recordings are much more important to me :-)