Note: this description has been updated to reflect the changes requested in the comments.
The goal is to rework the `script` module to allow more flexibility and clearly separate concerns.
First, about the module name: `script`. It has been decided to change it to `wikidict`.
Overview
I would like to see the module split into 4 parts (each part will be independent from the others and can be replayed & extended easily).
This will also help us leverage multithreading to speed up the whole process.
- [x] Download the data (#466)
- [x] Parse and store raw data (#469)
- [x] Render templates and store results (#469)
- [ ] Output to the proper eBook reader format
I have in mind a SQLite database where raw data will be stored and updated when needed.
Then, the parts will only use the data from the database. It should speed up regenerating a whole dictionary when we update a template.
Then, each and every part will have its own CLI:
$ python -m wikidict --download ...
$ python -m wikidict --parse ...
$ python -m wikidict --render ...
$ python -m wikidict --output ...
And the all-in-one operation would be:
$ python -m wikidict --run ...
Side note: we could use an entry point so that we only have to type `wikidict` instead of `python -m wikidict`.
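To illustrate, a minimal sketch of such an entry point, assuming a setuptools-based `setup.py` and a hypothetical `main()` living in `wikidict/__main__.py`:

```python
# setup.py -- sketch only; assumes setuptools and a main() in wikidict/__main__.py
from setuptools import find_packages, setup

setup(
    name="wikidict",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            # Exposes a `wikidict` command calling wikidict.__main__:main
            "wikidict=wikidict.__main__:main",
        ],
    },
)
```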
Splitting get.py
Here we are talking about parts 1 and 2.
Part 1 is already almost fine as-is; we just need to move the code into its own submodule.
We could improve the CLI by allowing the Wiktionary dump date to be passed as an argument, instead of relying on an environment variable.
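For illustration only (the final flags and defaults are not decided here), the dump date could become an optional argument falling back to the environment variable:

```python
# Sketch of a CLI accepting the dump date; argument names are assumptions
import argparse
import os

parser = argparse.ArgumentParser(prog="wikidict")
parser.add_argument("--download", action="store_true", help="download the Wiktionary dump")
parser.add_argument(
    "--date",
    default=os.environ.get("WIKIDUMP_DATE", ""),
    help="Wiktionary dump date (YYYYMMDD); falls back to $WIKIDUMP_DATE, else the latest dump",
)
args = parser.parse_args()
```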
Part 2 is only a matter of parsing the big XML file and storing raw data into a SQLite database. I am thinking of using this schema:
table: Word
fields:
    - word: varchar(256)
    - code: text
index on: word

table: Render
fields:
    - word_id: int
    - nature: varchar(16)
    - text: text
foreign key: word_id (Word._rowid_)
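Translated into SQL via the stdlib sqlite3 module, this could look like the sketch below. Exact column types and constraints are assumptions; I also gave `Word` an explicit `id INTEGER PRIMARY KEY` (an alias for the rowid), since SQLite foreign keys cannot target the implicit rowid directly.

```python
# Sketch: creating the proposed schema with sqlite3 (types/constraints are assumptions)
import sqlite3
from pathlib import Path

db_path = Path("data") / "fr" / "20201101.db"  # example of data/$LOCALE/$WIKIDUMP_DATE.db
db_path.parent.mkdir(parents=True, exist_ok=True)
with sqlite3.connect(db_path) as conn:
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS Word (
            id   INTEGER PRIMARY KEY,  -- alias for the rowid, so Render can reference it
            word VARCHAR(256) NOT NULL,
            code TEXT NOT NULL
        );
        CREATE INDEX IF NOT EXISTS idx_word_word ON Word (word);

        CREATE TABLE IF NOT EXISTS Render (
            word_id INTEGER NOT NULL REFERENCES Word (id),
            nature  VARCHAR(16),
            text    TEXT
        );
        """
    )
```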
- The `Word` table will contain the raw data from the Wiktionary dump.
- The `Render` table will be used to store the transformed text for a given word (after it has been cleaned up and its templates processed). It allows having multiple texts for a given word (noun 1, noun 2, verb, adjective, ...).
We will have one database per locale, located at `data/$LOCALE/$WIKIDUMP_DATE.db`.
At the download step, if no database exists locally, it will be retrieved from the GitHub releases, where databases will be saved alongside the dictionaries.
This is a cool thing IMO: everyone will have a good, up-to-date local database.
Of course, we will have options to skip that step if the local file already exists, or to force the download.
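A minimal sketch of that skip/force behaviour (the helper name and the release URL layout are placeholders, not the real API):

```python
# Sketch of the download step with skip/force options
from pathlib import Path

import requests

RELEASE_BASE_URL = "https://github.com/OWNER/REPO/releases/download"  # placeholder


def download_database(locale: str, dump_date: str, force: bool = False) -> Path:
    target = Path("data") / locale / f"{dump_date}.db"
    if target.is_file() and not force:
        return target  # skip: the local database already exists
    target.parent.mkdir(parents=True, exist_ok=True)
    url = f"{RELEASE_BASE_URL}/{locale}/{dump_date}.db"
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with target.open("wb") as fh:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                fh.write(chunk)
    return target
```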
At the parse step, we will have to find a way to prevent parsing again when the command is run twice on the same Wiktionary dump.
I was thinking of using the `PRAGMA user_version`, which would contain the Wiktionary dump date as an integer.
It would be set only after the full parsing has completed successfully.
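A sketch of that check with the stdlib sqlite3 module:

```python
# Sketch: skip re-parsing when PRAGMA user_version already holds the dump date
import sqlite3


def needs_parsing(conn: sqlite3.Connection, dump_date: int) -> bool:
    # user_version is 0 on a fresh database
    (current,) = conn.execute("PRAGMA user_version").fetchone()
    return current != dump_date


def mark_parsed(conn: sqlite3.Connection, dump_date: int) -> None:
    # Called only after the full parsing succeeded
    conn.execute(f"PRAGMA user_version = {dump_date:d}")
    conn.commit()
```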
Splitting convert.py
Here we are talking about parts 3 and 4.
Part 3 will call `clean()` and `process_templates()` on the wikicode, and store the result into the `rendered` field. This is the most time- and CPU-consuming part. It will be parallelized.
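A rough sketch of that parallelization (the database helpers are hypothetical, and the signatures of `clean()` and `process_templates()` are simplified):

```python
# Sketch of the render step fanned out over a process pool
# fetch_raw_words() and store_render() are hypothetical database helpers
import multiprocessing


def render_word(item):
    word, code = item
    # clean() and process_templates() are the existing helpers; signatures simplified here
    return word, process_templates(clean(code))


def render_all(locale: str) -> None:
    raw = fetch_raw_words(locale)  # [(word, code), ...] read from the Word table
    with multiprocessing.Pool() as pool:
        for word, text in pool.imap_unordered(render_word, raw):
            store_render(word, text)  # write into the Render table
```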
Part 4 will rethink how we handle the dictionary output so that new formats can be added easily.
I was thinking of using a class with those methods (I have not really thought it through, I am just proposing the idea):
from pathlib import Path


class BaseFormat:
    __slots__ = ("locale", "output_dir")

    def __init__(self, locale: str, output_dir: Path) -> None:
        self.locale = locale
        self.output_dir = output_dir

    def process(self, words) -> None:
        raise NotImplementedError()

    def save(self, *args) -> None:
        raise NotImplementedError()


class KoboFormat(BaseFormat):
    def process(self, words) -> None:
        groups = self.make_groups(words)
        variants = self.make_variants(words)
        wordlist = []
        for word in words:
            wordlist.append(self.process_word(word))
        self.save(wordlist, groups, variants)

    def save(self, wordlist, groups, variants) -> None:
        ...
That part is far from being finished, but once we have a fully working format, we will use that kind of code to generate the dict files:
# Get all registered formats
formaters = get_formaters()

# Get all words from the database
words = get_words()

# And distribute the workload
from multiprocessing import Pool

def run(cls):
    formater = cls(locale, output_dir)
    formater.process(words)

with Pool(len(formaters)) as pool:
    pool.map(run, formaters)
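As a companion sketch, `get_formaters()` (named after the call above, and hypothetical at this point) could simply rely on `BaseFormat` subclassing, so adding a new output format is just a matter of defining a new subclass:

```python
def get_formaters():
    # Every BaseFormat subclass that has been imported counts as a registered format
    return BaseFormat.__subclasses__()
```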