Digitization of 臺灣植物名彙

Why

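ChhoeTaigi's existing digitization works well for input methods, but it only records the Taiwanese POJ and Han characters; it leaves out the foreword, the indexes, the list of scientific names, and so on. That makes sense for something that is part of ChhoeTaigi, but I want to try digitizing the entire contents of the book.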

Differences from the ChhoeTaigi version:

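  • Includes scientific names, Japanese, Hakka, and indigenous languages. Caveats:

    • Many of the scientific names are outdated. Check them against another source before using them.
    • The Japanese uses historical kana orthography (for example, sou is spelled 「サウ」 and niitaka is spelled 「ニヒタカ」).
    • There are very few Hakka entries.
    • The spellings for the indigenous languages differ considerably from modern orthographies.
  • The italicized POJ in the body text is presumably Hakka. The ChhoeTaigi version usually skips these, but not always (for example, Hiong-si-siū on page 223).
  • TODO: once finished, the content can be compared automatically against the ChhoeTaigi version.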
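The comparison in the TODO above could be sketched roughly as follows. This is only an illustration under assumptions: it takes entries shaped like the plants.yaml schema described further down (already loaded into Python dicts, with the optional key written as "names"), and it assumes the ChhoeTaigi data can be reduced to (POJ, Han) pairs. The function names poj_han_pairs and compare are made up for this sketch.

```python
def poj_han_pairs(plants, include_hakka=False):
    """Collect (poj, han) pairs from entries shaped like the plants.yaml schema."""
    pairs = set()
    for entry in plants:
        for name in entry.get("names", []):
            # Japanese, indigenous-language, and note items have no "poj" key.
            if "poj" not in name:
                continue
            # Hakka items are skipped by default, since ChhoeTaigi mostly skips them.
            if name.get("hakka", False) and not include_hakka:
                continue
            pairs.add((name["poj"], name.get("han", "")))
    return pairs


def compare(plants, chhoetaigi_pairs):
    """Return (pairs only in this project, pairs only in ChhoeTaigi), both sorted."""
    ours = poj_han_pairs(plants)
    theirs = set(chhoetaigi_pairs)
    return sorted(ours - theirs), sorted(theirs - ours)
```

Passing include_hakka=True would surface the cases where the ChhoeTaigi version kept a Hakka word.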

Acknowledgements

Editing principles

  • English hyphenation is removed.
  • Uses ' as the apostrophe.
  • Text will be in an Asciidoc file first, then an HTML file later.
  • The dictionary part will be well-structured YAML.
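  • If an original Han character exists in Unicode, it is used as-is; if it cannot be found in Unicode, a variant that does exist in Unicode is used instead. Uncertain characters are replaced with ? for now.
  • Syntactic misspellings (for example, n.n. is misspelled once as n..n on page 18) are corrected without extra markings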
  • Minor typos are corrected and marked with a comment
  • Errata corrections are applied and marked with a comment
  • Misspelled words are preserved with another <…>-orig key.
  • Some entries map a single POJ to multiple Han character words. There are two possible ways to deal with this:

    • write them as two entries but use a -orig key like a misspell
    • write them as two entries but use a comment marker

    I think I'll go with the second option, using the "# Was listed as […]" marker (see Markers).

plants.yaml schema

- title: "scientific name"
  by: "person"
  names?:
    - romaji: "Romaji"
      kana: "ロマジ"
      n.n.?: true # Only present if the "n.n." thing exists
    - poj: "Pe̍h-ōe-jī"
      han: "白話字"
      han-orig?: "白話字" # Only present if there is an original typo
    - poj: ""
      han: ""
      hakka: true # for hakka words
    - native: "..."
      group: "..."
    - note: "..." # See page 247
  where?: "全島"
  page: 10
  family: "Polypodiaceae"
  indigenous: true # if false, the plant is cultivated or introduced
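
As a rough illustration of how entries could be checked against the schema above, here is a minimal sketch. It assumes that a trailing ? in the schema marks an optional key (so the actual keys are "names" and "where") and that the other top-level keys are required; the function name validate_entry is made up for this example, and it works on plain dicts as they would look after loading the YAML.

```python
# Assumption: keys without a trailing "?" in the schema are required.
REQUIRED = {"title": str, "by": str, "page": int, "family": str, "indigenous": bool}
# Assumption: "names?" and "where?" mean the keys "names" and "where" may be absent.
OPTIONAL = {"names": list, "where": str}


def validate_entry(entry):
    """Return a list of problems in one entry; an empty list means none were found."""
    problems = []
    for key, expected in REQUIRED.items():
        if key not in entry:
            problems.append(f"missing key: {key}")
        elif type(entry[key]) is not expected:
            problems.append(f"{key} should be {expected.__name__}")
    for key, expected in OPTIONAL.items():
        if key in entry and type(entry[key]) is not expected:
            problems.append(f"{key} should be {expected.__name__}")
    for key in entry:
        if key not in REQUIRED and key not in OPTIONAL:
            problems.append(f"unexpected key: {key}")
    return problems
```

This only covers the top-level keys; the items under names would need their own checks.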

Markers

Typos from the original text are marked with an "original typo" comment, with the original spelling preserved in an -orig key.

Some entries have multiple Han character versions of one single POJ, for instance Chhiⁿ-poàn-hā 生半夏、青半夏. This becomes two mappings in plants.yaml, accompanied by this marker:

# Was listed as […]
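The splitting described above could be sketched like this. It is only an illustration: the function name split_han_variants is made up, the assumption that the marker quotes the original listing is a guess at what goes inside […], and the emitted lines carry no base indentation.

```python
def split_han_variants(poj, han_listing):
    """Turn one POJ with several Han spellings into one mapping per spelling."""
    # The book separates alternative spellings with the ideographic comma.
    variants = [han for han in han_listing.split("、") if han]
    lines = []
    if len(variants) > 1:
        # Assumption: the marker records the listing as it appeared in the book.
        lines.append(f"# Was listed as {poj} {han_listing}")
    for han in variants:
        lines.append(f'- poj: "{poj}"')
        lines.append(f'  han: "{han}"')
    return lines
```

Calling split_han_variants("Chhiⁿ-poàn-hā", "生半夏、青半夏") gives the marker line followed by two poj/han mappings; an entry with a single Han spelling gets no marker.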

Unclear entries are marked with UNCLEAR.

Missing characters are marked as Missing Han character: […].

Han characters that are in Unicode but are rare enough to show up as tofu in my Emacs are marked with their components using Ideographic Description Characters, followed by a GlyphWiki link.
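For readers unfamiliar with these descriptions: a common character such as 松 can be written as ⿰木公, where ⿰ means "left to right". The sketch below shows one way such sequences might be spotted in comments. The function names are made up, and the check is deliberately loose (it does not validate that the sequence is well-formed).

```python
# The Ideographic Description Characters block in Unicode.
IDC_FIRST, IDC_LAST = 0x2FF0, 0x2FFF


def ids_operators(sequence):
    """Return the description characters (such as ⿰ or ⿱) used in a sequence."""
    return [ch for ch in sequence if IDC_FIRST <= ord(ch) <= IDC_LAST]


def looks_like_ids(sequence):
    """Loose check: starts with a description character and has at least two parts."""
    return len(sequence) >= 3 and IDC_FIRST <= ord(sequence[0]) <= IDC_LAST
```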

Some thoughts

  • The English introduction is so easy to read since it's basically the same as modern English.
  • I was not expecting to learn about Taiwanese tones in POJ here, of all places.
  • The way I type the old Japanese Kana usage and the old Kanji is quite horrendous, even if it's reasonably fast: I type the Han characters using Bopomofo as if it's Mandarin (Traditional Chinese), and kana with the Emacs japanese-katakana input method.
  • I'm not really using any OCR because I don't believe any of them can handle a not-perfectly-clear scan of a mix of 1920s Japanese, POJ, and scientific names.

Other notes

The book uses POJ for Taiwanese (and even includes an introduction to POJ and Taiwanese tones).

Each entry is:

  • Scientific name (the word or words after the comma name the person who published the scientific name; this convention is still in use today)
  • Indigenous (full-face) or introduced / cultivated (italics)
  • Japanese name (in Romaji)
  • Taiwanese name (in POJ)
  • “Kanton dialect” (actually Hakka, in POJ) (italics)
  • Indigenous-language name (including which people)
  • Where it's found
  • [category and such]
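
The first item above maps onto the title and by keys in plants.yaml. A minimal sketch of that split, assuming the heading has already been typed in as a single string and that the first comma is the separator (the function name and the placeholder heading are made up for illustration):

```python
def split_scientific_name(heading):
    """Split "name, author" at the first comma into (title, by)."""
    title, _, by = heading.partition(",")
    # If there is no comma, "by" comes back as an empty string.
    return title.strip(), by.strip()
```

For a placeholder heading such as "Genus species, Author", this returns ("Genus species", "Author").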