Extract XML from the OS X dictionaries.

Joshua Olson

Last update: Dec 11, 2022

Related tags

General Utilities apple-peeler

Overview

Before You Start

Apple-peeler was written using python 3.9 (but it should be trivial to support earlier versions of python 3.5+).

Installation

pip install apple-peeler

Dependencies

BeautifulSoup 4, lxml, and click

Usage

Apple likes to move around the dictionaries location from macOS version to macOS version. So if the dictionaries are no longer at the path below you can tell apple-peeler where to look by exporting DICT_BASE in your environment or using the --base option below.

export DICT_BASE="/System/Library/AssetsV2/com_apple_MobileAsset_DictionaryServices_dictionaryOSX/"

After that, useage is straightforward.

Usage: apple-peeler [OPTIONS]

Extract XML from Apple Dictionary files.

Options:
--base DIRECTORY                The root directory of the OS X dictionaries.
                                (Default: /System/Library/AssetsV2/com_apple
                                _MobileAsset_DictionaryServices_dictionaryOS
                                X/) [Env var DICT_BASE]
--out DIRECTORY                 The path to place extracted XML files.
-d, --dictionary [
    all|Arabic - English|Danish|Duden Dictionary Data Set I|Dutch|
    Dutch - English|French|French - English|French - German|German - English|
    Hebrew|Hindi|Hindi - English|Indonesian - English|Italian|
    Italian - English|Korean|Korean - English|New Oxford American Dictionary|
    Norwegian|Oxford American Writer's Thesaurus|
    Oxford Dictionary of English|Oxford Thesaurus of English|
    Polish - English|Portuguese|Portuguese - English|Russian|
    Russian - English|Sanseido Super Daijirin|
    Sanseido The WISDOM English-Japanese Japanese-English Dictionary|
    Simplified Chinese - English|Simplified Chinese - Japanese|Spanish|
    Spanish - English|Swedish|Thai|Thai - English|
    The Standard Dictionary of Contemporary Chinese|Traditional Chinese|
    Traditional Chinese - English|Turkish|Vietnamese - English]
                                The dictionary to extract or 'all'.
                                (Default: all) [Accepts multiple]
--format-xml / --no-format-xml  Format the XML files using BeautifulSoup.
                                (Default: False)
--debug                         Output debug information to STDERR.
                                (Default: False)
--help                          Show this message and exit.

Introduction

I need a ton of dictionary data for prototyping my learning a language tool, Parsnip, and licensing 40 dictionaries seems too expensive for a bootstrapper prototyping / working on an MVP (I look forward to the day this is no longer true). [Note: I am not planning to redistribute or otherwise use the data in an unlicensed manner.]

Parsnip uses Natural Language Processing and Dictionaries to decouple the word <-> sentence tug-of-war that's existed as long as flashcards have been used for language learning. I.e., should I make a word (concept) or a sentence (example) flashcard?

I care about what words I know for tracking purposes, but I want those words in context when I'm practicing. So the learning system breaks down sentences into lemmas (or dictionary form of a word) and a database of example sentences that the words appear in. This resolves the conceptual tug-of-war for flashcards.

But by removing reference data from the flashcards themselves, I need to integrate reference material directly into Parsnip's UI. JMDict is a great open source project for this, but that only covers a single language. So, I've been keeping my eyes open for people working on extracting the data from Apple's bundled dictionaries.

This has been a community effort that's spanned several years. My contribution is to collect the results, clear up some details about the file format, and package it into a general command-line tool.

References

This is inspired by Reverse-Engineering Apple Dictionary. And the discussion on Hacker News Hacker News: Reverse-Engineering Apple Dictionary (2020). Special thanks to tim-- and enragedcacti who introduced me to binwalk. And dunham who mentioned the random bytes looking like ints of payload sizes.

Additionally, I've found these posts informative:

extract gene TSS/TES site form gencode/ensembl/gencode database GTF file and export bed format file.

GetTsite python Package extract gene TSS/TES site form gencode/ensembl/gencode database GTF file and export bed format file. Install $ pip install Get

7 Nov 21, 2022

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

Parsel Parsel is a BSD-licensed Python library to extract and remove data from HTML and XML using XPath and CSS selectors, optionally combined with re

859 Dec 29, 2022

CLI tool and python library that converts the output of popular command-line tools and file-types to JSON or Dictionaries. This allows piping of output to tools like jq and simplifying automation scripts.

jc JSONifies the output of many CLI tools and file-types for easier parsing in scripts

5.8k Jan 3, 2023

Finally decent dictionaries based on Wiktionary for your beloved eBook reader.

eBook Reader Dictionaries Finally, decent dictionaries based on Wiktionary for your beloved eBook reader. Dictionaries Catalan 🚧 Ελληνικά (help welco

163 Dec 31, 2022

Generates password lists/dictionaries based on keywords written in python3.

dicbyru Introduction Generates password lists/dictionaries based on keywords. It uses the keywords and adds capital letters, numbers and special chara

2 Oct 31, 2022

This utility synchronises spelling dictionaries from various tools with each other.

This utility synchronises spelling dictionaries from various tools with each other. This way the words that have been trained on MS Office are also correctly checked in vim or Firefox. And vice versa of course.

2 Feb 11, 2022

Safely add untrusted strings to HTML/XML markup.

MarkupSafe MarkupSafe implements a text object that escapes characters so it is safe to use in HTML and XML. Characters that have special meanings are

514 Dec 31, 2022

Converts XML to Python objects

untangle Documentation Converts XML to a Python object. Siblings with similar names are grouped into a list. Children can be accessed with parent.chil

567 Nov 30, 2022

Python module that makes working with XML feel like you are working with JSON

xmltodict xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec": print(json.dumps(xmltod

5k Jan 4, 2023

Create Open XML PowerPoint documents in Python

python-pptx is a Python library for creating and updating PowerPoint (.pptx) files. A typical use would be generating a customized PowerPoint presenta

1.7k Jan 5, 2023

The lxml XML toolkit for Python

What is lxml? lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language. It's also very fast and memory

2.3k Jan 2, 2023

PubMed Mapper: A Python library that map PubMed XML to Python object

pubmed-mapper: A Python Library that map PubMed XML to Python object 中文文档 1. Philosophy view UML Programmatically access PubMed article is a common ta

33 Dec 8, 2022

Simple app for visual editing of Page XML files

Name nw-page-editor - Simple app for visual editing of Page XML files. Version: 2021.02.22 Description nw-page-editor is an application for viewing/ed

27 Jun 20, 2022

PAGE XML format collection for document image page content and more

PAGE-XML PAGE XML format collection for document image page content and more For an introduction, please see the following publication: http://www.pri

46 Nov 14, 2022

A Web Scraper built with beautiful soup, that fetches udemy course information. Get udemy course information and convert it to json, csv or xml file

Udemy Scraper A Web Scraper built with beautiful soup, that fetches udemy course information. Installation Virtual Environment Firstly, it is recommen

15 May 17, 2022

A repository that shares tuning results of trained models generated by TensorFlow / Keras. Post-training quantization (Weight Quantization, Integer Quantization, Full Integer Quantization, Float16 Quantization), Quantization-aware training. TensorFlow Lite. OpenVINO. CoreML. TensorFlow.js. TF-TRT. MediaPipe. ONNX. [.tflite,.h5,.pb,saved_model,tfjs,tftrt,mlmodel,.xml/.bin, .onnx]

PINTO_model_zoo Please read the contents of the LICENSE file located directly under each folder before using the model. My model conversion scripts ar

2.4k Jan 5, 2023

Extract XML from the OS X dictionaries.

Related tags

Overview

Before You Start

Installation

Dependencies

Usage

Introduction

References

You might also like...

extract gene TSS/TES site form gencode/ensembl/gencode database GTF file and export bed format file.

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

CLI tool and python library that converts the output of popular command-line tools and file-types to JSON or Dictionaries. This allows piping of output to tools like jq and simplifying automation scripts.

Finally decent dictionaries based on Wiktionary for your beloved eBook reader.

Generates password lists/dictionaries based on keywords written in python3.

This utility synchronises spelling dictionaries from various tools with each other.

Safely add untrusted strings to HTML/XML markup.

Converts XML to Python objects

Python module that makes working with XML feel like you are working with JSON

Create Open XML PowerPoint documents in Python

The lxml XML toolkit for Python

PubMed Mapper: A Python library that map PubMed XML to Python object

Simple app for visual editing of Page XML files

PAGE XML format collection for document image page content and more

A Web Scraper built with beautiful soup, that fetches udemy course information. Get udemy course information and convert it to json, csv or xml file

A python script to convert an ucompressed Gnucash XML file to a text file for Ledger and hledger.

Json2Xml tool will help you convert from json COCO format to VOC xml format in Object Detection Problem.

Txt2Xml tool will help you convert from txt COCO format to VOC xml format in Object Detection Problem.

Releases(v0.1.1)

v0.1.1(Sep 24, 2021)

Owner

Joshua Olson

Script to generate a massive volume of data in sql, csv, json or xml format

Extract the download URL from OneDrive or SharePoint share link and push it to aria2

Utility to extract Fantasy Grounds Unity Line-of-sight and lighting files from a Univeral VTT file exported from Dungeondraft

Handy Tool to check the availability of onion site and to extract the title of submitted onion links.

About Library for extract infomation from thai personal identity card.

Python based tool to extract forensic info from EventTranscript.db (Windows Diagnostic Data)

A python module for extract domains

A simple tool to extract python code from a Jupyter notebook, and then run pylint on it for static analysis.

A hashtag from string extract python module

Program to extract signatures from documents.