Look up Cangjie encoding in Emacs with cangjie.el

2019/01/21

In August 2018, I wanted to start learning the Cangjie input method, but I didn't know of an easy way to look up the encoding for a particular character. Up to that point I had been using Chinese Wiktionary, which lists the encoding on the page for each individual character.

curl Wiktionary | grep "仓颉"

Getting that data from Wiktionary is simple: curl "https://zh.wiktionary.org/wiki/<character>" | grep "仓颉". I initially had this saved as a shell script, but as I use Emacs a lot, I wanted to have the function available in Emacs. The cangjie function receives the character and passes it to the pipeline, written in Emacs Lisp with dash.el’s threading macro.
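
As a rough illustration of that first version, here is a minimal sketch using only built-in functions rather than dash.el. The names my/cangjie-wiktionary-command and my/cangjie-wiktionary-lookup are hypothetical and are not taken from cangjie.el; the sketch also assumes curl and grep are available on PATH.

```elisp
;; Hypothetical names throughout; this is not the code from cangjie.el.
(defun my/cangjie-wiktionary-command (han)
  "Return a shell command that greps the Wiktionary page of HAN for 仓颉."
  (concat "curl --silent "
          ;; Quote the URL so the shell does not interpret any of it.
          (shell-quote-argument
           (concat "https://zh.wiktionary.org/wiki/" han))
          " | grep 仓颉"))

(defun my/cangjie-wiktionary-lookup (han)
  "Return the raw Wiktionary lines that mention 仓颉 for HAN.
Needs network access, plus curl and grep on PATH."
  (shell-command-to-string (my/cangjie-wiktionary-command han)))
```

Calling (my/cangjie-wiktionary-lookup "字") would return whatever lines of the page mention 仓颉, still wrapped in HTML, so the result needs further trimming before it is usable.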

This approach has a problem. Even though most Wiktionary pages do contain the information I need, that’s not guaranteed in any way. Plus, contacting a server every time I want to look up a character is slow. So I looked for a Cangjie dictionary, and the best option that came to mind was the one from RIME.

Using RIME’s Cangjie dictionary

RIME is an IME for Chinese languages. It has built-in support for Cangjie (all Han characters), Pinyin, Zhuyin (Mandarin), Jyutping (Cantonese), among other input methods. For my use case, it has a Cangjie dictionary available on GitHub that uses a much more stable format than Wiktionary entries.

The first version I committed to the cangjie.el repository already had both of these approaches; in this version, which approach to use is controlled by whether a valid RIME dictionary exists or not.

RIME dictionaries (official documentation) are tab-separated values with a YAML header in front, using # as the comment character. To get the Cangjie encoding for a character from the dictionary, we grab the lines matching the character, remove any matches starting with a #, then select the second field delimited by \t. This encoding is written in the alphabetical representation of the Cangjie code (like jnd), so we convert that to the Han character representation (like 十弓木), implemented with cangjie--abc-to-han.

(defun cangjie (han)
  "Retrieve Cangjie code for the HAN character."
  (cond ((cangjie--valid-rime-dict? cangjie-source)
         ;; take cangjie encoding from RIME dictionary
         (->> (cangjie--grep-line cangjie-source han)
              ;; drop commented-out dictionary lines
              (--filter (not (s-prefix? "#" it)))
              (s-join "")
              ;; fields are tab-separated; the code is the second one
              (s-split "\t")
              cadr
              cangjie--abc-to-han))
        ((eq cangjie-source 'wiktionary)
         ;; Try to extract encoding from grep'd wiktionary text
         (->> (shell-command-to-string
               (concat "curl --silent https://zh.wiktionary.org/wiki/" han
                       " | grep 仓颉"))
              (s-replace-regexp "^.*:" "")
              s-trim
              (s-replace-regexp "<.*>$" "")
              cangjie--abc-to-han))
        (t
         ;; Fallback
         (shell-command-to-string
          (concat "curl --silent https://zh.wiktionary.org/wiki/" han
                  " | grep 仓颉")))))

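The two RIME-specific steps, picking the code out of a dictionary line and converting letters to radicals, can be sketched with built-ins alone. This is an assumption-laden illustration, not the implementation in cangjie.el: my/cangjie-radicals, my/cangjie-abc-to-han and my/cangjie-code-from-line are hypothetical names, and the table covers only the letters a to y of the standard Cangjie key layout.

```elisp
;; Hypothetical helpers; names and structure are not taken from cangjie.el.
(defconst my/cangjie-radicals
  '((?a . "日") (?b . "月") (?c . "金") (?d . "木") (?e . "水")
    (?f . "火") (?g . "土") (?h . "竹") (?i . "戈") (?j . "十")
    (?k . "大") (?l . "中") (?m . "一") (?n . "弓") (?o . "人")
    (?p . "心") (?q . "手") (?r . "口") (?s . "尸") (?t . "廿")
    (?u . "山") (?v . "女") (?w . "田") (?x . "難") (?y . "卜"))
  "Alist mapping Cangjie key letters to their radical names.")

(defun my/cangjie-abc-to-han (code)
  "Convert CODE, such as \"jnd\", to radicals, such as \"十弓木\".
Characters without an entry in `my/cangjie-radicals' are kept as-is."
  (mapconcat (lambda (ch)
               (or (cdr (assq ch my/cangjie-radicals))
                   (char-to-string ch)))
             code ""))

(defun my/cangjie-code-from-line (line)
  "Return the second tab-separated field of LINE.
Return nil for comment lines, which start with #."
  (unless (string-prefix-p "#" line)
    (nth 1 (split-string line "\t"))))
```

With these, (my/cangjie-abc-to-han (my/cangjie-code-from-line "字\tjnd")) evaluates to "十弓木", matching the example above.
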
After this, I added code to automatically download the RIME dictionary, added a customize option to control whether to use Wiktionary or RIME, and rewrote some code to my taste. This is when I considered the package essentially complete.
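
A customize option of that kind could look roughly like the following. This is only a sketch under assumptions: the variable name my/cangjie-source, its default, the three choices and the group are hypothetical and may differ from what cangjie.el actually defines.

```elisp
;; Hypothetical option; the real variable in cangjie.el may differ.
(defcustom my/cangjie-source 'rime
  "Where to look up Cangjie codes.
Either a symbol naming a source, or the path to a RIME dictionary file."
  :group 'convenience
  :type '(choice (const :tag "RIME dictionary" rime)
                 (const :tag "Wiktionary" wiktionary)
                 (file :tag "Existing RIME dictionary file")))
```

Declaring the option with defcustom rather than defvar makes it show up in M-x customize-variable, where the choice type is rendered as a menu.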

Submitting to MELPA

In October, I came across Take Your Emacs to the Next Level by Writing Custom Packages, where the author writes about their experience writing their first package. In particular, there’s a section about how they submitted their package to MELPA, and it made me consider submitting to MELPA as well.

I had set up Flycheck a long time ago, and have generally agreed with the warnings it gives me about docstrings and code style. After going through MELPA’s official guide on submission, I opened a pull request and it was accepted. It was a pleasant experience.