chore: readme a bit
README.md
# chinese -> english finetuning datasets
`train.en` and `train.zh` are from [here](https://www.dropbox.com/scl/fo/dtrf3pe1vfbo5nse16648/ANLqlv3ascANpkdnYF_w4Jk/V1/TRAIN?dl=0&rlkey=486vbn17qra1ez91btj0n4xu2&subfolder_nav_tracking=1)
TODO: mirror
the [actual dataset and .sqlite file](https://mega.nz/folder/byoFHRST#Mcn6-mU5spHxPg0nMlRS3w)
It's missing the epubs dir I used for paragraph rebuilding... I accidentally deleted the dir, sorry :c
What I did was Google a sentence from chapter 1 of a novel, scrape 50-60 chapters from either Webnovel or some aggregator, then unzip it into an epub with the directory name set to `book_id`.
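Since an epub is just a zip archive, the unpack step can be sketched with the standard library. This is an illustrative sketch, not the actual script I used; the function name and output layout are assumptions:

```python
import zipfile
from pathlib import Path


def unpack_epub(epub_path: str, out_root: str) -> Path:
    """Extract an .epub (a plain zip archive) into out_root/<book_id>.

    Assumes the epub file is named after its book_id,
    e.g. "45-jsys.epub" -> out_root/45-jsys/.
    """
    epub = Path(epub_path)
    dest = Path(out_root) / epub.stem  # directory named after the book_id
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(epub) as zf:
        zf.extractall(dest)
    return dest
```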
GuoFeng dataset chapter spread:
```sql
select book_id, count(*) as chapter_count
from chapters
group by book_id
order by chapter_count desc;
```
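Assuming the `.sqlite` file has a `chapters` table with a `book_id` column, as the query above implies, the same spread can be pulled from Python with the built-in `sqlite3` module. A minimal sketch; the function name and database path are placeholders:

```python
import sqlite3


def chapter_spread(db_path: str) -> list[tuple[str, int]]:
    """Return (book_id, chapter_count) pairs, most chapters first.

    Assumes a `chapters` table with a `book_id` column, as in the
    SQL snippet above.
    """
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            """
            select book_id, count(*) as chapter_count
            from chapters
            group by book_id
            order by chapter_count desc
            """
        ).fetchall()
```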
| book_id | chapter_count |
| -------------- | ------------- |
| 45-jsys | 2262 |
| 93-yzsslfmmd | 1733 |
| 2-xzltq | 1718 |
| 19-ysmmjwn | 1546 |
| 52-mfwz | 1254 |
| 86-wzxajddyx | 1188 |
| 34-xwdrcsh | 1172 |
| 25-dgfsngm | 942 |
| 53-gmzz | 798 |
| 6-yh1frhjqjysy | 763 |
| 141-fyyysndy | 745 |
| 37-scrj | 539 |
| 95-cjjyyhy | 516 |
| 99-jjl | 220 |
There are 21 more books with 60 chapters, and the rest have 50 or fewer.
However, I didn't import many epubs. There are 153 books in the dataset in total, and the most important part of the [GuoFeng-Webnovel](https://github.com/longyuewangdcu/GuoFeng-Webnovel) dataset is the Chinese raws and a more or less _decent_ mapping between paragraphs (there are some mistakes, which sucks). I used 19 epubs, and not many of the paragraphs actually matched.
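For deciding whether a paragraph rebuilt from an epub lines up with a dataset paragraph, one simple approach is a fuzzy similarity check. This sketch uses `difflib` from the standard library; the 0.9 threshold is an arbitrary illustrative value, not the criterion actually used for the dataset:

```python
from difflib import SequenceMatcher


def paragraphs_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """Rough test for whether two paragraphs are the same text.

    Returns True when the character-level similarity ratio meets the
    threshold. The 0.9 default is illustrative, not the dataset's value.
    """
    ratio = SequenceMatcher(None, a.strip(), b.strip()).ratio()
    return ratio >= threshold
```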