You appear to be performing tests based on some observations I made in my last post. I questioned whether what I was seeing prior to the similarity testing was what was expected, such as duplication of the short description, or a missing short description alongside a populated extended description. Is there another bug to be found before attempting to compensate elsewhere?
You may be making assumptions about the format of the EPG data based only on UK over-the-air collection, and these may not be valid if the EPG data is collected in a different way, for example from the Internet. Other sources of EPG may have more data.
Perhaps it’s better to first establish whether missing description data is a bug that needs to be fixed, and/or whether duplication is expected in all cases when data is obtained in a specific way[1].
The danger of using one example of a problem EPG description, as you say, is that you can miss the bigger picture. In the UK, identical programs/repeats may have slightly different EPG descriptions. For instance, the broadcaster may include a marker for subtitles or signing for the deaf on one showing but not the next. Other non-UK broadcasters may add other such data on certain showings.
You have already indicated that different platforms (for instance Freeview and Freesat) may have different EPGs for the same programs. Many people have both terrestrial and satellite tuners, and may have recorded part of a series on one service and want to record further broadcasts on another, with the checking taking already recorded episodes into account. I personally haven’t seen much difference in the bulk wording between the UK services, but I have seen it in the Series/Episode part. Does [S1, Ep02] = S1, Ep2, or does [S1, Ep02] = S1, Ep 2/8, or does [HD] = Also in HD?
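To illustrate the kind of tag handling I mean, here is a rough sketch in Python. It is for illustration only and is not the project's code: the regular expressions and the names `episode_key` and `strip_markers` are my own assumptions, and real EPG data will certainly contain variants they do not cover.

```python
import re

# Hypothetical pattern covering "S1, Ep02", "[S1, Ep 2/8]", "S01 Ep2" and similar.
_EPISODE_TAG = re.compile(
    r"\[?\s*S(?:eries)?\s*(\d+)\s*,?\s*Ep(?:isode)?\.?\s*(\d+)(?:\s*/\s*\d+)?\s*\]?",
    re.IGNORECASE,
)

# Hypothetical pattern for the two HD markers mentioned above.
_HD_MARKER = re.compile(r"\[HD\]|\bAlso in HD\b\.?", re.IGNORECASE)


def episode_key(description):
    """Return (series, episode) as integers, or None if no tag is found."""
    match = _EPISODE_TAG.search(description)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def strip_markers(description):
    """Remove HD markers and collapse the whitespace left behind."""
    return " ".join(_HD_MARKER.sub(" ", description).split())
```

With something like this, `[S1, Ep02]`, `S1, Ep2` and `S1, Ep 2/8` all reduce to the same (series, episode) pair, so the tag can be compared separately from the bulk wording.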
Giving any preference to equality checking rather than similarity checking may result in many more unwanted repeats than with the current code.
I’m not sure how easy it would be to check that your new code is creating better results than the existing code. First you would need a large database of EPG data (perhaps excluding the example of the problem EPG description described in this thread) and run it through the existing similarity checking. Then perform the same test with the same data through your new code. If the results are very similar, you may conclude that you have not broken anything (for UK-based EPG data that is obtained over the air). If there are differences, you need to establish why, because in general the existing code does work in 99+% of cases (at least for me, where my settings may differ from other users).
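The comparison could be as simple as the following Python sketch. It assumes the existing and revised checks can each be wrapped as a function that takes two descriptions and returns true or false; `compare_matchers`, `old_is_repeat` and `new_is_repeat` are names I have made up for the purpose.

```python
from itertools import combinations


def compare_matchers(descriptions, old_is_repeat, new_is_repeat):
    """Return every pair of descriptions on which the two checks disagree.

    Each entry is (first, second, old_result, new_result).
    """
    disagreements = []
    for first, second in combinations(descriptions, 2):
        old_result = old_is_repeat(first, second)
        new_result = new_is_repeat(first, second)
        if old_result != new_result:
            disagreements.append((first, second, old_result, new_result))
    return disagreements
```

An empty result over a large set of saved descriptions would suggest nothing has changed. Anything that is returned is exactly the list of cases that need explaining. Note that this compares every pair, so the work grows with the square of the number of descriptions.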
I don’t know the answers, so some questions...
Is equality checking in the revised code case dependent? (Is new the same as NEW?)
You indicate some of the testing is UK dependent (the test for New), but what if the EPG string also contains “foreign” characters, such as those with umlauts or similar characters in other languages? Would your new code break under these circumstances?
If you replace punctuation, spaces, etc. with, say, an underscore, and two identically worded descriptions with the same series/episode end up differing in length by one character because one has an additional space, would your equality testing fail?
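For what it is worth, here is one way all three questions could be made moot, again as a Python sketch rather than a statement of what your code does. The name `comparison_key` is my own invention, and treating every run of punctuation and whitespace as a single underscore is an assumption about what you intend.

```python
import re
import unicodedata

# One or more characters that are not letters or digits (underscore included).
_SEPARATORS = re.compile(r"[\W_]+")


def comparison_key(description):
    """Reduce a description to a form that ignores case, spacing and punctuation."""
    # Compose accented characters first, so "u" + combining umlaut becomes "ü".
    text = unicodedata.normalize("NFC", description)
    # casefold() is a more aggressive lower() intended for caseless comparison.
    text = text.casefold()
    # Collapse each run of separators to a single underscore and trim the ends.
    return _SEPARATORS.sub("_", text).strip("_")
```

Because a whole run of separators collapses to one underscore, an extra space or a trailing full stop no longer changes the key, and accented letters survive as letters instead of being treated as punctuation.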
[1]
In the grid EPG view, occasionally when scrolling through the programs a certain program will have no description whereas those adjacent to it will have one. However, on leaving the EPG view and going immediately back in, the program with the missing description now has it, possibly because on entering the EPG the first time it wasn’t read correctly and the second time it was a refreshed read. [Wild Speculation] Maybe the first time around the information for that program was being updated over the air, and so was “busy” and not accessible for both reading and writing. Could the cause of the missing data in the EPG view also be the cause of the missing description data seen prior to the similarity checking?