Chinese -> English finetuning datasets
dataset_v3.0_alpaca_noinstr.json
- 487M
- Dataset size: 37243 samples
- Maximum sequence length: 13760
- Average sequence length: 3123.26
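Figures like the ones above can be recomputed with a short script. This is a minimal sketch, not the script that produced them: it assumes the file is a JSON list of Alpaca-style records with `instruction`, `input` and `output` string fields (missing fields count as empty), and it measures length in characters by default because the notes do not say whether the lengths are characters or tokens. `sequence_stats`, `load_samples` and `length_fn` are hypothetical names.

```python
import json
from statistics import mean

# Assumed Alpaca-style field names; a "noinstr" file may simply omit "instruction".
FIELDS = ("instruction", "input", "output")


def load_samples(path):
    """Load a JSON file that holds a list of sample dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def sequence_stats(samples, length_fn=len):
    """Return (sample_count, max_length, average_length).

    length_fn maps the concatenated text of one sample to its length.
    The default counts characters; pass a tokenizer-based function
    to count tokens instead.
    """
    lengths = [
        length_fn("".join(str(sample.get(field, "")) for field in FIELDS))
        for sample in samples
    ]
    if not lengths:
        return 0, 0, 0.0
    return len(lengths), max(lengths), round(mean(lengths), 2)


# Tiny in-memory demo so the sketch runs without the real file.
demo = [
    {"input": "你好", "output": "Hello"},
    {"input": "谢谢你", "output": "Thank you"},
]
count, longest, average = sequence_stats(demo)
print(count, longest, average)
```

For the real file, the call would be `sequence_stats(load_samples("dataset_v3.0_alpaca_noinstr.json"))`, with `length_fn` swapped for a token counter if the reported lengths are in tokens.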
train.en and train.zh are from here
the actual dataset and .sqlite file
GuoFeng dataset chapter spread:
select book_id, count(*) as chapter_count
from chapters
group by book_id
order by chapter_count desc;
| book_id | chapter_count | English name | |
|---|---|---|---|
| 45-jsys | 2262 | Unrivaled Medicine God | o |
| 93-yzsslfmmd | 1733 | Beauty and the Beast: Wolf Hubby XOXO | o |
| 2-xzltq | 1718 | Cultivation Chat Group | o |
| 19-ysmmjwn | 1546 | The Rest Of My Life Is For You | o |
| 52-mfwz | 1254 | End of the Magic Era | o |
| 86-wzxajddyx | 1188 | Let Me Game in Peace | o |
| 34-xwdrcsh | 1172 | The Daily Life Of The Immortal King | o |
| 25-dgfsngm | 942 | When A Mage Revolts | o |
| 53-gmzz | 798 | Lord of Mysteries | o |
| 6-yh1frhjqjysy | 763 | A Husband and Wife | x |
| 141-fyyysndy | 745 | Mages Are Too Op | o |
| 37-scrj | 539 | A World Worth Protecting | o |
| 95-cjjyyhy | 516 | Super Gene Optimization Fluid | o |
| 99-jjl | 220 | Jun Jiuling | o |
| 100-jdxx | 100 | Absolute Choice | p |
| 149-ajnszwj9csan | 100 | Living With a Temperamental Adonis: 99 Proclamations of Love | o |
| 151-gfsy | 100 | Invincible Kungfu Healer | o |
| 152-dwyx | 100 | Low Dimensional Game | o |
| 153-ldyb98k | 100 | Kar98K Upon Touchdown! | o |
| 154-nsxhn | 100 | Back Then, I Adored You | o |
There are 21 more books with 60 chapters each, and the rest have 50 or fewer.
However, I didn't import many epubs. There are 153 books in the dataset in total, and the most important part of the GuoFeng-Webnovel dataset is the Chinese raws and the more or less decent mapping between paragraphs (there are some mistakes, which sucks). I used 19 epubs, and not many of their paragraphs actually matched.
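The chapter-spread query above can also be run from Python with the standard `sqlite3` module. The sketch below is hedged: only the `chapters` table and its `book_id` column come from the query itself, while the `chapter_no` column, the demo rows, and the `chapter_spread` helper are made up for illustration. It builds an in-memory database so it runs without the real .sqlite file.

```python
import sqlite3
from collections import Counter

# Same query as in the notes above.
QUERY = """
select book_id, count(*) as chapter_count
from chapters
group by book_id
order by chapter_count desc;
"""


def chapter_spread(conn):
    """Return [(book_id, chapter_count), ...], largest books first."""
    return conn.execute(QUERY).fetchall()


# In-memory stand-in for the real database; the schema here is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("create table chapters (book_id text, chapter_no integer)")
conn.executemany(
    "insert into chapters values (?, ?)",
    [("2-xzltq", 1), ("2-xzltq", 2), ("2-xzltq", 3), ("99-jjl", 1)],
)

spread = chapter_spread(conn)

# How many books share each chapter count (useful for spotting groups
# such as "N books with exactly 60 chapters").
books_per_count = Counter(chapter_count for _, chapter_count in spread)
print(spread, dict(books_per_count))
```

Against the real file, `sqlite3.connect(":memory:")` would be replaced with the path to the .sqlite database and the `create table` and `insert` lines dropped.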
