# Chinese -> English finetuning datasets

`train.en` and `train.zh` are from [here](https://www.dropbox.com/scl/fo/dtrf3pe1vfbo5nse16648/ANLqlv3ascANpkdnYF_w4Jk/V1/TRAIN?dl=0&rlkey=486vbn17qra1ez91btj0n4xu2&subfolder_nav_tracking=1).

The [actual dataset and .sqlite file](https://mega.nz/folder/byoFHRST#Mcn6-mU5spHxPg0nMlRS3w) are also available. The upload is missing the epubs dir I used for paragraph rebuilding... I accidentally deleted the dir, sorry :c

What I did was Google a sentence from chapter 1 of a novel, scrape 50-60 chapters from either Webnovel or some aggregator, then unzip it into epub, with the directory name set to `book_id`.

GuoFeng dataset chapter spread:

```sql
select book_id, count(*) as chapter_count
from chapters
group by book_id
order by chapter_count desc;
```

| book_id          | chapter_count | en name                                                      |     |
| ---------------- | ------------- | ------------------------------------------------------------ | --- |
| 45-jsys          | 2262          | Unrivaled Medicine God                                       | o   |
| 93-yzsslfmmd     | 1733          | Beauty and the Beast: Wolf Hubby XOXO                        | o   |
| 2-xzltq          | 1718          | Cultivation Chat Group                                       | o   |
| 19-ysmmjwn       | 1546          | The Rest Of My Life Is For You                               | o   |
| 52-mfwz          | 1254          | End of the Magic Era                                         | o   |
| 86-wzxajddyx     | 1188          | Let Me Game in Peace                                         | o   |
| 34-xwdrcsh       | 1172          | The Daily Life Of The Immortal King                          | o   |
| 25-dgfsngm       | 942           | When A Mage Revolts                                          | o   |
| 53-gmzz          | 798           | Lord of Mysteries                                            | o   |
| 6-yh1frhjqjysy   | 763           | A Husband and Wife                                           | x   |
| 141-fyyysndy     | 745           | Mages Are Too Op                                             | o   |
| 37-scrj          | 539           | A World Worth Protecting                                     | o   |
| 95-cjjyyhy       | 516           | Super Gene Optimization Fluid                                | o   |
| 99-jjl           | 220           | Jun Jiuling                                                  | o   |
| 100-jdxx         | 100           | Absolute Choice                                              | p   |
| 149-ajnszwj9csan | 100           | Living With a Temperamental Adonis: 99 Proclamations of Love | o   |
| 151-gfsy         | 100           | Invincible Kungfu Healer                                     | o   |
| 152-dwyx         | 100           | Low Dimensional Game                                         | o   |
| 153-ldyb98k      | 100           | Kar98K Upon Touchdown!                                       | o   |
| 154-nsxhn        | 100           | Back Then, I Adored You                                      | o   |

There are 21 more books with 60 chapters, and the rest have 50 or fewer.

However, I didn't import many epubs. There are 153 books in the dataset in total, and the most important part of the [GuoFeng-Webnovel](https://github.com/longyuewangdcu/GuoFeng-Webnovel) dataset is the Chinese raws and the more or less _decent_ mapping between paragraphs (there are some mistakes, which sucks). I used 19 epubs, and not many of the paragraphs actually matched.
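
If you want to check the chapter spread yourself (how many books have more than 60 chapters, exactly 60, or 50 or fewer), something like this should work. It's a rough sketch that assumes only what the query above shows: a `chapters` table with a `book_id` column. The function names, the bucket labels, the `chapter_no` column, and the in-memory demo rows are all made up for illustration; swap `demo_connection()` for `sqlite3.connect(...)` pointed at the real .sqlite file.

```python
import sqlite3

# Wraps the per-book chapter count from the README in an outer query
# that groups books into size buckets.
SPREAD_SQL = """
select
    case
        when chapter_count > 60 then 'more than 60'
        when chapter_count = 60 then 'exactly 60'
        when chapter_count > 50 then '51-59'
        else '50 or fewer'
    end as bucket,
    count(*) as book_count
from (
    select book_id, count(*) as chapter_count
    from chapters
    group by book_id
)
group by bucket
"""


def chapter_spread(conn):
    """Return {bucket label: number of books} for the chapters table."""
    return dict(conn.execute(SPREAD_SQL).fetchall())


def demo_connection():
    """Build a tiny in-memory database with fake rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    # chapter_no is a hypothetical column; only book_id is known from the README.
    conn.execute("create table chapters (book_id text, chapter_no integer)")
    fake_books = {"demo-a": 100, "demo-b": 60, "demo-c": 60, "demo-d": 50, "demo-e": 12}
    rows = [(book, i) for book, n in fake_books.items() for i in range(n)]
    conn.executemany("insert into chapters values (?, ?)", rows)
    return conn


if __name__ == "__main__":
    print(chapter_spread(demo_connection()))
```

Buckets with no books simply don't show up in the result, so use `.get(label, 0)` if you need every label.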