chore: readme a bit
README.md
# chinese -> english finetuning datasets
`train.en` and `train.zh` are from [here](https://www.dropbox.com/scl/fo/dtrf3pe1vfbo5nse16648/ANLqlv3ascANpkdnYF_w4Jk/V1/TRAIN?dl=0&rlkey=486vbn17qra1ez91btj0n4xu2&subfolder_nav_tracking=1)
TODO: mirror
the [actual dataset and .sqlite file](https://mega.nz/folder/byoFHRST#Mcn6-mU5spHxPg0nMlRS3w)
It's missing the epubs dir I used for paragraph rebuilding... I accidentally deleted the dir, sorry :c
What I did was Google a sentence from chapter 1 of a novel, scrape 50-60 chapters from either Webnovel or some aggregator, then unzip it into an epub with the directory name set to `book_id`.
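Since an epub is just a zip archive, the unpack step can be sketched with the standard library. This is an illustrative sketch, not the actual script I used; the function name and output layout are assumptions:

```python
import zipfile
from pathlib import Path


def unpack_epub(epub_path: str, out_root: str) -> Path:
    """Extract an .epub (a plain zip archive) into out_root/<book_id>.

    Assumes the epub file is named after its book_id,
    e.g. "45-jsys.epub" -> out_root/45-jsys/.
    """
    epub = Path(epub_path)
    dest = Path(out_root) / epub.stem  # directory named after the book_id
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(epub) as zf:
        zf.extractall(dest)
    return dest
```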
GuoFeng dataset chapter spread:
```sql
select book_id, count(*) as chapter_count
from chapters
group by book_id
order by chapter_count desc;
```
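Assuming the `.sqlite` file has a `chapters` table with a `book_id` column, as the query above implies, the same spread can be pulled from Python with the built-in `sqlite3` module. A minimal sketch; the function name and database path are placeholders:

```python
import sqlite3


def chapter_spread(db_path: str) -> list[tuple[str, int]]:
    """Return (book_id, chapter_count) pairs, most chapters first.

    Assumes a `chapters` table with a `book_id` column, as in the
    SQL snippet above.
    """
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            """
            select book_id, count(*) as chapter_count
            from chapters
            group by book_id
            order by chapter_count desc
            """
        ).fetchall()
```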
| book_id | chapter_count |
| -------------- | ------------- |
| 45-jsys | 2262 |
| 93-yzsslfmmd | 1733 |
| 2-xzltq | 1718 |
| 19-ysmmjwn | 1546 |
| 52-mfwz | 1254 |
| 86-wzxajddyx | 1188 |
| 34-xwdrcsh | 1172 |
| 25-dgfsngm | 942 |
| 53-gmzz | 798 |
| 6-yh1frhjqjysy | 763 |
| 141-fyyysndy | 745 |
| 37-scrj | 539 |
| 95-cjjyyhy | 516 |
| 99-jjl | 220 |
There are 21 more books with 60 chapters, and the rest have 50 or fewer.
However, I didn't import many epubs. There are 153 books in the dataset in total, and the most important part of the [GuoFeng-Webnovel](https://github.com/longyuewangdcu/GuoFeng-Webnovel) dataset is the Chinese raws and a more or less _decent_ mapping between paragraphs (there are some mistakes, which sucks). I used 19 epubs, and not many of the paragraphs actually matched.
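For deciding whether a paragraph rebuilt from an epub lines up with a dataset paragraph, one simple approach is a fuzzy similarity check. This sketch uses `difflib` from the standard library; the 0.9 threshold is an arbitrary illustrative value, not the criterion actually used for the dataset:

```python
from difflib import SequenceMatcher


def paragraphs_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Rough test for whether two paragraphs are the same text.

    Returns True when the character-level similarity ratio meets the
    threshold. The 0.9 default is illustrative, not the dataset's value.
    """
    ratio = SequenceMatcher(None, a.strip(), b.strip()).ratio()
    return ratio >= threshold
```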