David Gerard@awful.systems (M) to TechTakes@awful.systems · English · 2 years ago
come see all the popular super-duper-autocomplete systems failing hard at really simple reasoning questions and babbling nonsense from latent space! (arxiv.org)
12 comments
sinedpick@awful.systems · 2 years ago
This all but confirms that all those benchmark evals are in the training set right?
David Gerard@awful.systems (OP, M) · 2 years ago
Some forms are - but many are not! The fun stuff is in Appendix 2, the responses.