This follows the issue #3180 .
I suggest a new way of handling timesteps produced by the CTC decoder. There is no strange heuristic, and I think the logic is clear : when fusing two different paths leading to the same prefix, not only fuse the probabilities (the probabilities are added), but also fuse the timestep sequences (for the last letter in the sequence, choose the timestep from the most probable path).
The place where two different paths leading to the same prefix are fused are the places where log_sum_exp
is called, because this function fuses the probabilities. So, timesteps would now be fused at the same places.
The other change is that each PathTrie
node would now store the full sequence of timesteps. This is because one prefix can be an ancestor of another and their timesteps on a given node can differ. Having the full sequence of timesteps in each node, we have no need to duplicate a node with different timesteps, and it is much simpler like that.
Moreover, it makes sense to store the full sequence of timesteps, because the combined probabilities are also stored there. The total probability is not the sum of the probability of each output token, and, in the same way, the correct sequence of timesteps is not the concatenation of the timestep of each output token.
Since I need to compare the probability of different paths (to keep the timesteps of the most probable one), it is important to compare paths of the same length (eg. paths from the beginning up to the current time). So, exactly the same way as it is done for the probabilities, I need to know the timesteps of the previous time, and store the timesteps of the current time separately.
In the end, timesteps are handled in a way very similar to the way probabilities are handled.
Results on an example
To evaluate the resulting timesteps, I first take the argmax of my logits. In my example, it gives :
tou_________________________________________________ss les _a__mouurreuux de se_p_ort__ diiivverrr_ ss'enn__ _r_é___j_uuii__rr__on_t____ aa_v_eecc_ l''aap__p_rroo___chhee_____ de_ ll'hhi__v_e_rr____ et la rre__p_rri_ssee dee la c_ouppee ddu mon_deee ss__kk_i___ less ii_mm_a__ggees_ de _g_ll_i_ss__ssee____________________ ree__t_rrou_vveennt uunee __pllaa___cee____ de _cchhooixx__ d_ans_ lles _pp_a____ggeess ssspp_o_r_t_ii_vees_ de v_o_s_ _jourrnnaauxx ttéé_l_é___v_ii___ss_é__s__ ddeeu_x_ __é___pprreeuu_vvees___________________ _auu_jjoouurrdd''hhuuii___ _o____nno___rr_o_____d__a___mm________ ___s___a_n______t__a______ _q__a___tt__e___rr_i____n__a_____ __p_r____mmie_r__ s__a___l__o___m___ _g_é____ant__ de lla _c_ou_ppee ddu m_on___deee____________________________________________________
As the logits are the only input of the decoder, I base my evaluation on them instead of comparing with the audio file directly. It is known that the CTC loss does not guarantee alignment between the audio file and the logits, so the best thing the decoder can do is to fit the logits as best as it can. This is reasonable because, in practice, the logits are aligned quite well with the audio file.
Then, for each word, I take the part of the logits corresponding to the output timesteps, take the argmax (as said above), and print the corresponding decoded text.
Finally, I assume that good timesteps should lead to a good match between the word and its corresponding text decoded from the argmax of the logits.
Before this PR, the result in my example is (text between slashes is from the logits argmax, spaces are trimmed) :
[WordScoreRange(word=tous /tou/, score=None, ranges=((0, 4),)),
WordScoreRange(word=les /les/, score=None, ranges=((55, 59),)),
WordScoreRange(word=amoureux /amoureu/, score=None, ranges=((60, 74),)),
WordScoreRange(word=de /d/, score=None, ranges=((75, 78),)),
WordScoreRange(word=sport /seport/, score=None, ranges=((80, 90),)),
WordScoreRange(word=divers /diver/, score=None, ranges=((91, 102),)),
WordScoreRange(word=s'en /s'en/, score=None, ranges=((103, 111),)),
WordScoreRange(word=réjouiront /réjuiron/, score=None, ranges=((112, 136),)),
WordScoreRange(word=avec /avec/, score=None, ranges=((141, 153),)),
WordScoreRange(word=l'approche /l'approche/, score=None, ranges=((155, 179),)),
WordScoreRange(word=de /de/, score=None, ranges=((185, 191),)),
WordScoreRange(word=l'hiver /l'hive/, score=None, ranges=((192, 206),)),
WordScoreRange(word=et /e/, score=None, ranges=((208, 215),)),
WordScoreRange(word=la /la/, score=None, ranges=((217, 221),)),
WordScoreRange(word=reprise /reprise/, score=None, ranges=((222, 238),)),
WordScoreRange(word=de /de/, score=None, ranges=((240, 243),)),
WordScoreRange(word=la /la/, score=None, ranges=((244, 248),)),
WordScoreRange(word=coupe /coupe/, score=None, ranges=((249, 257),)),
WordScoreRange(word=du /d/, score=None, ranges=((258, 261),)),
WordScoreRange(word=monde /monde/, score=None, ranges=((263, 270),)),
WordScoreRange(word=de /e/, score=None, ranges=((271, 275),)),
WordScoreRange(word=ski /ski/, score=None, ranges=((276, 286),)),
WordScoreRange(word=les /les/, score=None, ranges=((290, 298),)),
WordScoreRange(word=images /image/, score=None, ranges=((300, 316),)),
WordScoreRange(word=de /d/, score=None, ranges=((318, 322),)),
WordScoreRange(word=glisse /glisse/, score=None, ranges=((324, 341),)),
WordScoreRange(word=retrouvent /retrouvent/, score=None, ranges=((363, 394),)),
WordScoreRange(word=une /une/, score=None, ranges=((395, 402),)),
WordScoreRange(word=place /place/, score=None, ranges=((404, 419),)),
WordScoreRange(word=de /de/, score=None, ranges=((425, 432),)),
WordScoreRange(word=choix /choix/, score=None, ranges=((433, 445),)),
WordScoreRange(word=dans /dans/, score=None, ranges=((448, 455),)),
WordScoreRange(word=les /les/, score=None, ranges=((457, 462),)),
WordScoreRange(word=pages /pages/, score=None, ranges=((463, 478),)),
WordScoreRange(word=sportives /sportives/, score=None, ranges=((479, 506),)),
WordScoreRange(word=de /de/, score=None, ranges=((508, 511),)),
WordScoreRange(word=vos /vos/, score=None, ranges=((512, 518),)),
WordScoreRange(word=journaux /journaux/, score=None, ranges=((520, 532),)),
WordScoreRange(word=télévisés /télévisé/, score=None, ranges=((533, 559),)),
WordScoreRange(word=deux /deux/, score=None, ranges=((563, 575),)),
WordScoreRange(word=épreuves /épreuve/, score=None, ranges=((577, 598),)),
WordScoreRange(word=aujourd'hui /aujourd'hu/, score=None, ranges=((618, 649),)),
WordScoreRange(word=on /o/, score=None, ranges=((654, 665),)),
WordScoreRange(word=a /am/, score=None, ranges=((685, 699),)),
WordScoreRange(word=santa /santa/, score=None, ranges=((703, 726),)),
WordScoreRange(word=caterina /qaterina/, score=None, ranges=((729, 756),)),
WordScoreRange(word=premier /prmier/, score=None, ranges=((762, 783),)),
WordScoreRange(word=salon /salom/, score=None, ranges=((785, 803),)),
WordScoreRange(word=géant /géant/, score=None, ranges=((805, 818),)),
WordScoreRange(word=de /d/, score=None, ranges=((819, 823),)),
WordScoreRange(word=la /l/, score=None, ranges=((825, 829),)),
WordScoreRange(word=coupe /coupe/, score=None, ranges=((831, 841),)),
WordScoreRange(word=du /d/, score=None, ranges=((842, 846),)),
WordScoreRange(word=monde /mon/, score=None, ranges=((848, 857),))]
After this PR, the result in my example is :
[WordScoreRange(word=tous /tous/, score=None, ranges=((0, 54),)),
WordScoreRange(word=les /les/, score=None, ranges=((56, 59),)),
WordScoreRange(word=amoureux /amoureux/, score=None, ranges=((62, 75),)),
WordScoreRange(word=de /de/, score=None, ranges=((77, 79),)),
WordScoreRange(word=sport /seport/, score=None, ranges=((81, 91),)),
WordScoreRange(word=divers /diver/, score=None, ranges=((92, 103),)),
WordScoreRange(word=s'en /s'en/, score=None, ranges=((105, 113),)),
WordScoreRange(word=réjouiront /réjuiront/, score=None, ranges=((114, 140),)),
WordScoreRange(word=avec /avec/, score=None, ranges=((145, 155),)),
WordScoreRange(word=l'approche /l'approche/, score=None, ranges=((158, 185),)),
WordScoreRange(word=de /de/, score=None, ranges=((189, 192),)),
WordScoreRange(word=l'hiver /l'hiver/, score=None, ranges=((194, 212),)),
WordScoreRange(word=et /et/, score=None, ranges=((214, 216),)),
WordScoreRange(word=la /la/, score=None, ranges=((219, 221),)),
WordScoreRange(word=reprise /reprise/, score=None, ranges=((224, 239),)),
WordScoreRange(word=de /de/, score=None, ranges=((241, 244),)),
WordScoreRange(word=la /la/, score=None, ranges=((246, 248),)),
WordScoreRange(word=coupe /coupe/, score=None, ranges=((250, 258),)),
WordScoreRange(word=du /du/, score=None, ranges=((259, 262),)),
WordScoreRange(word=monde /monde/, score=None, ranges=((264, 272),)),
WordScoreRange(word=de //, score=None, ranges=((273, 275),)),
WordScoreRange(word=ski /ski/, score=None, ranges=((278, 289),)),
WordScoreRange(word=les /les/, score=None, ranges=((295, 299),)),
WordScoreRange(word=images /images/, score=None, ranges=((303, 318),)),
WordScoreRange(word=de /de/, score=None, ranges=((321, 323),)),
WordScoreRange(word=glisse /glisse/, score=None, ranges=((327, 362),)),
WordScoreRange(word=retrouvent /retrouvent/, score=None, ranges=((375, 394),)),
WordScoreRange(word=une /une/, score=None, ranges=((398, 403),)),
WordScoreRange(word=place /place/, score=None, ranges=((409, 424),)),
WordScoreRange(word=de /de/, score=None, ranges=((430, 432),)),
WordScoreRange(word=choix /choix/, score=None, ranges=((437, 448),)),
WordScoreRange(word=dans /dans/, score=None, ranges=((450, 456),)),
WordScoreRange(word=les /les/, score=None, ranges=((458, 462),)),
WordScoreRange(word=pages /pages/, score=None, ranges=((465, 479),)),
WordScoreRange(word=sportives /sportives/, score=None, ranges=((487, 507),)),
WordScoreRange(word=de /de/, score=None, ranges=((509, 511),)),
WordScoreRange(word=vos /vos/, score=None, ranges=((513, 519),)),
WordScoreRange(word=journaux /journaux/, score=None, ranges=((521, 533),)),
WordScoreRange(word=télévisés /télévisés/, score=None, ranges=((535, 562),)),
WordScoreRange(word=deux /deux/, score=None, ranges=((568, 576),)),
WordScoreRange(word=épreuves /épreuves/, score=None, ranges=((581, 618),)),
WordScoreRange(word=aujourd'hui /aujourd'hui/, score=None, ranges=((629, 654),)),
WordScoreRange(word=on /onoro/, score=None, ranges=((662, 680),)),
WordScoreRange(word=a /am/, score=None, ranges=((685, 699),)),
WordScoreRange(word=santa /santa/, score=None, ranges=((703, 726),)),
WordScoreRange(word=caterina /qaterina/, score=None, ranges=((729, 761),)),
WordScoreRange(word=premier /prmier/, score=None, ranges=((768, 783),)),
WordScoreRange(word=salon /salom/, score=None, ranges=((785, 803),)),
WordScoreRange(word=géant /géant/, score=None, ranges=((808, 820),)),
WordScoreRange(word=de /de/, score=None, ranges=((822, 824),)),
WordScoreRange(word=la /l/, score=None, ranges=((826, 829),)),
WordScoreRange(word=coupe /coupe/, score=None, ranges=((833, 842),)),
WordScoreRange(word=du /d/, score=None, ranges=((843, 846),)),
WordScoreRange(word=monde /mond/, score=None, ranges=((850, 858),))]
We can see that before this PR, there are 17 words where timesteps are too early (about one letter shift, it is visible at the end but not at the begining of words because I have trimed spaces).
After this PR, the fit is almost prefect. For some reason, there are still 3 remaining errors, all in the 4 last words.