chore: readme
15
README.md
@@ -1,9 +1,20 @@
# chinese -> english finetuning datasets
## dataset_v3.0_alpaca_noinstr.json
- File size: 487M
- Dataset size: 37243 samples
- Maximum sequence length: 13760
- Average sequence length: 3123.26
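
The figures above could be recomputed with something like the sketch below. It is not the script that produced them: it assumes the file is a JSON list of Alpaca-style records, that "sequence length" means the character count of the concatenated text fields (the README does not say whether it is characters or tokens), and the field names are guesses. Pass a tokenizer-based `length_fn` if the numbers are meant to be token counts.

```python
import json


def load_samples(path):
    """Load a JSON file that is assumed to hold a list of sample dicts."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def sequence_stats(samples, fields=("instruction", "input", "output"), length_fn=len):
    """Return (sample_count, max_length, average_length).

    The field names are assumptions about the Alpaca layout; missing
    fields are treated as empty strings. length_fn defaults to a plain
    character count.
    """
    lengths = [
        length_fn("".join(str(sample.get(field, "")) for field in fields))
        for sample in samples
    ]
    if not lengths:
        return 0, 0, 0.0
    return len(lengths), max(lengths), round(sum(lengths) / len(lengths), 2)


# Usage (hypothetical local path):
# count, longest, average = sequence_stats(load_samples("dataset_v3.0_alpaca_noinstr.json"))
```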
##
`train.en` and `train.zh` are from [here](https://www.dropbox.com/scl/fo/dtrf3pe1vfbo5nse16648/ANLqlv3ascANpkdnYF_w4Jk/V1/TRAIN?dl=0&rlkey=486vbn17qra1ez91btj0n4xu2&subfolder_nav_tracking=1).

The [actual dataset and .sqlite file](https://mega.nz/folder/byoFHRST#Mcn6-mU5spHxPg0nMlRS3w).

It's missing the epubs dir I used for paragraph rebuilding. I accidentally deleted the dir, sorry :c

What I did was Google a sentence from chapter 1 of a novel, scrape 50-60 chapters from either Webnovel or some aggregator, then unzip the result into the epub dir, with the directory name set to `book_id`.
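
For anyone rebuilding the deleted directory, the unzip step might look roughly like this. It is a sketch, not the original tooling: the `epubs` root name, the `extract_epub` helper, and the example paths are all assumptions. It relies only on the fact that an EPUB file is a ZIP archive.

```python
import zipfile
from pathlib import Path


def extract_epub(epub_path, book_id, root="epubs"):
    """Unpack one EPUB (a ZIP archive) into <root>/<book_id>/.

    `root` and this helper's name are hypothetical; only the
    "directory named after book_id" layout comes from the README.
    """
    target = Path(root) / str(book_id)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(epub_path) as archive:
        archive.extractall(target)
    return target


# Usage (hypothetical file name and id):
# extract_epub("downloads/some_novel.epub", book_id="1234")
```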
GuoFeng dataset chapter spread: