
Chinese -> English finetuning datasets

train.en and train.zh are from here.
The actual dataset and .sqlite file are missing the epubs dir I used for paragraph rebuilding; I accidentally deleted the dir, sorry :c
What I did was Google a sentence from chapter 1 of a novel, scrape 50-60 chapters from either Webnovel or some aggregator, then unzip the result into an epub directory named after the book_id.
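The unzip-into-a-book_id-directory step can be sketched like this (a minimal sketch; the paths and the assumption that book_id is the epub filename stem are mine, but an .epub is just a zip archive, so stdlib zipfile suffices):

```python
import zipfile
from pathlib import Path

def unpack_epub(epub_path: str, out_root: str = "epubs") -> Path:
    """Extract an .epub (a zip archive) into <out_root>/<book_id>/.

    book_id is assumed to be the epub filename stem,
    e.g. '45-jsys.epub' -> epubs/45-jsys/.
    """
    book_id = Path(epub_path).stem
    dest = Path(out_root) / book_id
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(epub_path) as zf:
        zf.extractall(dest)
    return dest
```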

GuoFeng dataset chapter spread:

select book_id, count(*) as chapter_count
from chapters
group by book_id
order by chapter_count desc;
book_id chapter_count
45-jsys 2262
93-yzsslfmmd 1733
2-xzltq 1718
19-ysmmjwn 1546
52-mfwz 1254
86-wzxajddyx 1188
34-xwdrcsh 1172
25-dgfsngm 942
53-gmzz 798
6-yh1frhjqjysy 763
141-fyyysndy 745
37-scrj 539
95-cjjyyhy 516
99-jjl 220
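The spread above can be reproduced against the .sqlite file from Python; the database filename here is an assumption, but the chapters/book_id schema comes straight from the query above:

```python
import sqlite3

def chapter_spread(db_path: str = "dataset.sqlite"):
    """Return (book_id, chapter_count) pairs, most chapters first.

    Same aggregation as the SQL query above; db_path is a guess at
    the .sqlite filename shipped with the dataset.
    """
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            """
            SELECT book_id, COUNT(*) AS chapter_count
            FROM chapters
            GROUP BY book_id
            ORDER BY chapter_count DESC
            """
        ).fetchall()
    finally:
        con.close()
```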

There are 21 more books with 60 chapters each, and the rest have 50 chapters or fewer.

However, I didn't import many epubs: there are 153 books in the dataset in total, and I only used 19 epubs, of which not many paragraphs actually matched. The most important part of the GuoFeng-Webnovel dataset is the Chinese raws and the more or less decent mapping between paragraphs (there are some mistakes, which sucks).
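A crude way to check how well epub paragraphs line up with the dataset's paragraphs is a greedy similarity match; this is only a sketch with stdlib difflib, and the function name and 0.9 threshold are assumptions of mine, not anything from the dataset tooling:

```python
from difflib import SequenceMatcher

def match_paragraphs(epub_paras, dataset_paras, threshold=0.9):
    """Greedily pair each epub paragraph with its best dataset match.

    Returns (epub_index, dataset_index) pairs whose similarity ratio
    clears the threshold; paragraphs with no good match are dropped,
    which mirrors how imperfect the real paragraph mapping is.
    """
    matches = []
    for i, ep in enumerate(epub_paras):
        best_j, best_ratio = None, threshold
        for j, dp in enumerate(dataset_paras):
            ratio = SequenceMatcher(None, ep, dp).ratio()
            if ratio > best_ratio:
                best_j, best_ratio = j, ratio
        if best_j is not None:
            matches.append((i, best_j))
    return matches
```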
