Digitization of 臺灣植物名彙
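Why
ChhoeTaigi's existing digitization works well for input methods, but it only records the Taiwanese POJ and Han characters, and leaves out the preface, the index, the list of scientific names, and so on. That makes sense for something that is part of ChhoeTaigi, but I want to try digitizing the entire contents of the book.
Differences from the ChhoeTaigi version: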
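- Includes the scientific names, Japanese, Hakka, and indigenous-language names. Caveats:
  - Quite a few of the scientific names are outdated. Remember to check them elsewhere before using them.
  - The Japanese uses the old kana orthography (for example, sou is spelled 「サウ」 and niitaka is spelled 「ニヒタカ」).
  - There are very few Hakka entries.
  - The spellings for the indigenous languages differ a lot from modern spellings.
- The italic POJ in the main text should be Hakka. The ChhoeTaigi version skips these, but sometimes it does not (for example Hiong-si-siū on page 223).
- TODO: once this is finished, the content can be compared against the ChhoeTaigi version automatically.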
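The comparison in the TODO could look roughly like the sketch below. It assumes both versions have already been loaded into lists of (POJ, Han) pairs; how the ChhoeTaigi data is actually stored is not covered here, and `normalize` and `compare_entries` are hypothetical helper names. POJ diacritics can be encoded either precomposed or with combining marks, so the sketch normalizes to NFC before comparing.

```python
import unicodedata


def normalize(text: str) -> str:
    """Normalize to NFC and trim whitespace so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text).strip()


def compare_entries(ours, theirs):
    """Return (only_ours, only_theirs) as sorted lists of (poj, han) pairs.

    Both arguments are iterables of (poj, han) string pairs.
    """
    a = {(normalize(poj), normalize(han)) for poj, han in ours}
    b = {(normalize(poj), normalize(han)) for poj, han in theirs}
    return sorted(a - b), sorted(b - a)


if __name__ == "__main__":
    # Hypothetical sample data, only to show the call shape.
    ours = [("Chhiⁿ-poàn-hā", "生半夏"), ("Chhiⁿ-poàn-hā", "青半夏")]
    theirs = [("Chhiⁿ-poàn-hā", "生半夏")]
    only_ours, only_theirs = compare_entries(ours, theirs)
    print("only in this version:", only_ours)
    print("only in the other version:", only_theirs)
```

NFC is used rather than NFKC because NFKC would rewrite the superscript ⁿ into a plain n, which changes the POJ spelling.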
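Acknowledgements
- 台語文記憶, for the original scans and for publishing them online
- ChhoeTaigi, for its digitized version
- 教育部異體字字典 (the Ministry of Education's Dictionary of Chinese Character Variants), for finding Han characters
- English Wiktionary, for finding Han characters
- 意傳台文輸入法 (Taiwanese input method)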
Editing principles
- English hyphenation is removed.
- Uses ' as the apostrophe.
- Text will be in an Asciidoc file first, then an HTML file later.
- The dictionary part will be well-structured YAML.
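- Where the original Han character exists in Unicode, it is used as-is. Where it cannot be found in Unicode, a variant that is in Unicode is used instead. Uncertain characters are replaced with ? for now.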
- Syntactic misspellings (like the n.n. thing being misspelled once as n..n on page 18) are corrected without extra markings.
- Minor typos are corrected and marked with a comment.
- Errata corrections are applied and marked with a comment.
- Misspelled words are preserved with another <…>-orig key.
- Some entries map a single POJ to multiple Han character words. There are two possible ways to deal with this:
  - write them as two entries but use a -orig key like a misspelling
  - write them as two entries but use a comment marker

  I think I'll go with the second option. Use this marker: # Was listed as […]
plants.yaml schema
    - title: "scientific name"
      by: "person"
      names?:
        - romaji: "Romaji"
          kana: "ロマジ"
          n.n.?: true # Only present if the "n.n." thing exists
        - poj: "Pe̍h-ōe-jī"
          han: "白話字"
          han-orig?: "白話字" # Only present if there is an original typo
        - poj: ""
          han: ""
          hakka: true # for hakka words
        - native: "..."
          group: "..."
        - note: "..." # See page 247
      where?: "全島"
      page: 10
      family: "Polypodiaceae"
      indigenous: true # if false, the plant is cultivated or introduced
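A rough way to check entries against this schema is sketched below. It works on plain Python dicts (as a YAML parser would produce) and reads the trailing ? as "optional key", which is my interpretation of the notation rather than something the schema states. `check_entry`, `check_name`, and the tables they use are hypothetical names for illustration.

```python
# Keys every entry is assumed to need, with their expected types.
REQUIRED = {"title": str, "by": str, "page": int, "family": str, "indigenous": bool}
# Keys written with a trailing "?" in the schema, read here as optional.
OPTIONAL = {"names": list, "where": str}

# Each name mapping is assumed to match one shape: (required keys, optional keys).
NAME_SHAPES = [
    ({"romaji", "kana"}, {"n.n."}),
    ({"poj", "han"}, {"han-orig", "hakka"}),
    ({"native", "group"}, set()),
    ({"note"}, set()),
]


def check_name(name):
    """Return a list of problems for one mapping under names."""
    if not isinstance(name, dict):
        return [f"name is not a mapping: {name!r}"]
    keys = set(name)
    for required, optional in NAME_SHAPES:
        if required <= keys <= required | optional:
            return []
    return [f"unrecognized name mapping: {sorted(keys)}"]


def check_entry(entry):
    """Return a list of problems for one plant entry (empty if it looks fine)."""
    problems = []
    for key, expected in REQUIRED.items():
        if key not in entry:
            problems.append(f"missing key: {key}")
        elif type(entry[key]) is not expected:
            problems.append(f"wrong type for {key}")
    for key, expected in OPTIONAL.items():
        if key in entry and type(entry[key]) is not expected:
            problems.append(f"wrong type for {key}")
    for key in entry:
        if key not in REQUIRED and key not in OPTIONAL:
            problems.append(f"unknown key: {key}")
    names = entry.get("names", [])
    if isinstance(names, list):
        for name in names:
            problems.extend(check_name(name))
    return problems
```

Calling `check_entry` on each item of the parsed list and printing any non-empty result, together with the entry's page, should be enough to spot structural slips while typing.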
Markers
Typos from the original text are marked with original typo, with the original being preserved in an -orig key.
Some entries have multiple Han character versions of one single POJ, for instance Chhiⁿ-poàn-hā 生半夏、青半夏. This becomes two mappings in plants.yaml, accompanied by this marker:
# Was listed as […]
Unclear entries are marked with UNCLEAR.
Missing characters are marked as Missing Han character: […].
Han characters that are in Unicode but are rare enough to appear as tofu in my Emacs are marked with their parts using IDEOGRAPHIC DESCRIPTION characters, with a GlyphWiki link afterwards.
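To see how much is still flagged, the markers can be tallied with a plain text scan, sketched below. It matches the marker phrases anywhere on a line, so it does not depend on exactly where each marker sits in plants.yaml (which I am not assuming here); `MARKERS` and `count_markers` are hypothetical names.

```python
from collections import Counter

# Marker phrases as described in this section.
MARKERS = ("original typo", "Was listed as", "UNCLEAR", "Missing Han character:")


def count_markers(text: str) -> Counter:
    """Count how many lines of the given text contain each marker phrase."""
    counts = Counter()
    for line in text.splitlines():
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


if __name__ == "__main__":
    # Reads the real file if it is present in the current directory.
    try:
        with open("plants.yaml", encoding="utf-8") as f:
            for marker, n in count_markers(f.read()).items():
                print(f"{marker}: {n}")
    except FileNotFoundError:
        print("plants.yaml not found")
```

A substring scan over-counts if a marker phrase ever appears inside an actual value, so treat the numbers as a rough progress indicator.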
Some thoughts
- The English introduction is so easy to read since it's basically the same as modern English.
- I was not expecting to learn about Taiwanese tones in POJ here, of all places.
- The way I type the old Japanese kana usage and the old kanji is quite horrendous, even if it's reasonably fast: I type the Han characters using Bopomofo as if they were Mandarin (Traditional Chinese), and the kana with the Emacs japanese-katakana input method.
- I'm not really using any OCR because I don't believe any of them can handle a not-perfectly-clear scan of a mix of 1920s Japanese, POJ, and scientific names.
Other notes
The book uses POJ for Taiwanese (and even includes an introduction to POJ and Taiwanese tones).
Each entry is:
- scientific name (the word(s) after the comma are the name of the person who published the scientific name; this convention is still in use today)
- Indigenous (full-face) or introduced / cultivated (italics)
- Japanese name (in Romaji)
- Taiwanese name (in POJ)
- “Kanton dialect” (actually Hakka, in POJ) (italics)
- Aboriginal names (including which people)
- Where it's found
- [category and such]