Chinese -> English fine-tuning datasets

dataset_v3.0_alpaca_noinstr.json

sequence distribution

  • 487M
  • Dataset size: 37243 samples
  • Maximum sequence length: 13760
  • Average sequence length: 3123.26
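The figures above could be recomputed with something like the sketch below. It assumes the file is a JSON list of Alpaca-style objects with optional `instruction`, `input`, and `output` fields, and it measures length in characters by default. The notes do not say whether the reported lengths are characters or tokens, so `length_fn` is a hypothetical hook for plugging in a tokenizer-based count instead.

```python
import json
from statistics import mean

# Assumed Alpaca-style field names; missing fields count as empty strings.
FIELDS = ("instruction", "input", "output")


def sequence_stats(samples, length_fn=len):
    """Return sample count, max length, and mean length over the records."""
    lengths = [
        length_fn("".join(str(sample.get(key, "")) for key in FIELDS))
        for sample in samples
    ]
    return {
        "samples": len(lengths),
        "max_length": max(lengths, default=0),
        "avg_length": round(mean(lengths), 2) if lengths else 0.0,
    }


def load_samples(path):
    """Load a JSON file that holds a list of sample objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Usage, with the file name from the notes above:
# print(sequence_stats(load_samples("dataset_v3.0_alpaca_noinstr.json")))
```

To count tokens rather than characters, pass something like `length_fn=lambda text: len(tokenizer.encode(text))`, where `tokenizer` is whatever tokenizer the fine-tuning run uses.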

train.en and train.zh are from here
the actual dataset and .sqlite file

GuoFeng dataset chapter spread:

select book_id, count(*) as chapter_count
from chapters
group by book_id
order by chapter_count desc;
| book_id | chapter_count | en name | |
| --- | --- | --- | --- |
| 45-jsys | 2262 | Unrivaled Medicine God | o |
| 93-yzsslfmmd | 1733 | Beauty and the Beast: Wolf Hubby XOXO | o |
| 2-xzltq | 1718 | Cultivation Chat Group | o |
| 19-ysmmjwn | 1546 | The Rest Of My Life Is For You | o |
| 52-mfwz | 1254 | End of the Magic Era | o |
| 86-wzxajddyx | 1188 | Let Me Game in Peace | o |
| 34-xwdrcsh | 1172 | The Daily Life Of The Immortal King | o |
| 25-dgfsngm | 942 | When A Mage Revolts | o |
| 53-gmzz | 798 | Lord of Mysteries | o |
| 6-yh1frhjqjysy | 763 | A Husband and Wife | x |
| 141-fyyysndy | 745 | Mages Are Too Op | o |
| 37-scrj | 539 | A World Worth Protecting | o |
| 95-cjjyyhy | 516 | Super Gene Optimization Fluid | o |
| 99-jjl | 220 | Jun Jiuling | o |
| 100-jdxx | 100 | Absolute Choice | p |
| 149-ajnszwj9csan | 100 | Living With a Temperamental Adonis: 99 Proclamations of Love | o |
| 151-gfsy | 100 | Invincible Kungfu Healer | o |
| 152-dwyx | 100 | Low Dimensional Game | o |
| 153-ldyb98k | 100 | Kar98K Upon Touchdown! | o |
| 154-nsxhn | 100 | Back Then, I Adored You | o |

Another 21 books have 60 chapters each, and the rest have 50 or fewer.
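The chapter spread can also be pulled from Python with the standard `sqlite3` module. This is a minimal sketch: it relies only on the `chapters` table and its `book_id` column, which appear in the query above. The database path in the usage comment is a placeholder, and `spread_histogram` is a hypothetical helper for checking counts such as "how many books have exactly 60 chapters".

```python
import sqlite3
from collections import Counter

# Same query as in the notes above.
QUERY = """
select book_id, count(*) as chapter_count
from chapters
group by book_id
order by chapter_count desc;
"""


def chapter_spread(conn):
    """Return (book_id, chapter_count) rows, largest books first."""
    return conn.execute(QUERY).fetchall()


def spread_histogram(rows):
    """Map chapter_count -> number of books with that many chapters."""
    return Counter(count for _, count in rows)


# Usage (placeholder path, substitute the real .sqlite file):
# conn = sqlite3.connect("path/to/dataset.sqlite")
# rows = chapter_spread(conn)
# print(spread_histogram(rows)[60])  # books with exactly 60 chapters
```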

However, I didn't import many epubs. There are 153 books in the dataset in total, and the most important part of the GuoFeng-Webnovel dataset is the Chinese raws plus a more or less decent mapping between paragraphs (there are some mistakes, which is annoying). I used 19 epubs, and not many of the paragraphs actually matched.
