Essential Document Generator

Overview

Dead Simple Document Generation

Whether it's testing database performance or a new web interface, we've all needed a dead simple solution that's flexible enough to generate a complex data set. If this is one of those times, you've come to the right place. Essential Generators uses Markov chains to generate 'realistic' data, and you can train them on your own data to make it even more real.

Install

Using pip:

pip install essential_generators

Use case: Get some random values

Simple interface:

>>> from essential_generators import DocumentGenerator

>>> gen = DocumentGenerator()

>>> gen.email()
'[email protected]'

>>> gen.url()
'https://ver.co.uk/has/pron/sing/th.ablica-attrob79'

>>> gen.phone()
'547-922-3848'

>>> gen.slug()
'ehillote-henaiour-ebemaice-qsiat76-heheellti'

>>> gen.word()
'choleg'

>>> gen.sentence()
'Possess something historic and prehistoric sites within the family.'

>>> gen.paragraph()
"Country's total that roll clouds of gas that can affect. Officers: lieutenant aquifer system under
the alaska supreme court, 14. About reality. perfect. this means that logic programs combine declarative
and procedural law. some. 20.3% of other nations. during the meiji constitution, and assembled the imperial
estates. Reduce visibility work during the regime withdrew from the crow in. Divert recyclable at 100.
Applications. because no carbon, then all of which glucose (c6h12o6) and stearin (c57h110o6) are convenient.
In french. forms can each be divided into: information theory. Therapeutic orientation. around haines. steven
seagal's 1994 on deadly ground, starring michael caine.. Lakes and economic assistance (comecon). the states
and 72 dependent. D.f.: comisión campaign tracking, allowing the companies running these. Were struggling moon
io is volcanically active, and as the legal basis of chemical complexes."

Use case: Make lots of complex documents

Let's say we are building a database for a new social media site. We have a preliminary schema and want to test the server with some examples like this:

{
    id: 39f96ef8-08e0-408e-b727-984372a95d9d,
    status: online,
    age: 27,
    homepage: johndoe.github.io,
    name: John Doe,
    headline: A Really Cool Guy,
    about: Some longer profile text. Several Sentences.
}

Document Templates

Now let's say we want to generate hundreds of thousands of these records. For making documents, we first need to define the template:

gen = DocumentGenerator()

template = {
    'id': 'guid',
    'status': ['online', 'offline', 'dnd', 'anonymous'],
    'age': 'small_int',
    'homepage': 'url',
    'name': 'name',
    'headline': 'sentence',
    'about': 'paragraph'
}

gen.set_template(template)
documents = gen.documents(1000)

The template gives the structure and type of each field in the document. Note that status is a list rather than a single type; when a list is provided as the type, one of its items is randomly selected for each generated document using random.choice(list).

Custom Fields

Now we want to implement a new feature where users can rate each other between 1-5 stars and we want to keep track of the average rating (a float between 1 and 5). We can do this by passing in a function as the type, like so:

import random

def gen_rating():
    return random.uniform(1, 5)

template = {
    'id': 'guid',
    'status': ['online', 'offline', 'dnd', 'anonymous'],
    'age': 'small_int',
    'homepage': 'url',
    'name': 'name',
    'headline': 'sentence',
    'about': 'paragraph',
    'rating': gen_rating,
}

In this case, when each document is created, gen_rating is called and the returned value is added to the document.
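Conceptually, the template dispatch works like this. The sketch below only illustrates the contract described above (lists are sampled, callables are invoked once per document); it is not the library's actual implementation:

```python
import random

def gen_rating():
    return random.uniform(1, 5)

template = {
    'status': ['online', 'offline', 'dnd', 'anonymous'],
    'rating': gen_rating,
}

def render(template):
    # Illustrative dispatch: callables are invoked once per document,
    # lists are sampled with random.choice, anything else passes through.
    doc = {}
    for field, spec in template.items():
        if callable(spec):
            doc[field] = spec()
        elif isinstance(spec, list):
            doc[field] = random.choice(spec)
        else:
            doc[field] = spec
    return doc

docs = [render(template) for _ in range(100)]
assert all(1 <= d['rating'] <= 5 for d in docs)
```

With the real library you would simply add 'rating': gen_rating to the template and call gen.documents(n).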

Nested Documents

Now that users are rating each other, of course they'll want to get in contact with each other. The schema gets extended to include a nested contact object. Just like any custom field, we can generate nested documents using generator functions as the type:

def gen_contact():
    return {
        'email': gen.email(),
        'phone': gen.phone()
    }

template = {
    'id': 'guid',
    'status': ['online', 'offline', 'dnd', 'anonymous'],
    'age': 'small_int',
    'homepage': 'url',
    'name': 'name',
    'headline': 'sentence',
    'about': 'paragraph',
    'contact': gen_contact
}

Word & Sentence Caching

Creating word and sentence caches serves two purposes: it restricts the space of generated elements to a discrete size (for instance, the average American's vocabulary is between 5,000 and 10,000 words), and it greatly speeds up subsequent document generation. Use them like this:

gen.init_word_cache(5000)
gen.init_sentence_cache(5000)

In the first line, 5000 words are generated. In the second line, 5000 sentences of 5 to 15 words each are built from the word cache. Subsequent calls to gen.word() and gen.sentence() will select from the caches. If you want a word or sentence that is not in the cache, call gen.gen_word() or gen.gen_sentence() respectively. If you want finer-grained control, gen.word_cache and gen.sentence_cache are lists of strings that can be manipulated directly.
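Conceptually, the caches behave like fixed pools that gen.word() and gen.sentence() draw from. This sketch mimics that behavior with stand-in words; it is not the library's actual implementation:

```python
import random

# Stand-ins for gen.init_word_cache(5000) / gen.init_sentence_cache(5000):
word_cache = ['word%d' % i for i in range(5000)]
sentence_cache = [
    ' '.join(random.choice(word_cache) for _ in range(random.randint(5, 15)))
    for _ in range(5000)
]

word = random.choice(word_cache)          # what gen.word() does once cached
sentence = random.choice(sentence_cache)  # what gen.sentence() does once cached
assert word in word_cache
assert 5 <= len(sentence.split()) <= 15
```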

Unique Fields

In this case, we want to guarantee that certain fields are unique. You could accomplish this by choosing 'guid' as the field type, but that isn't good enough if you want the field to still look like an email address or a number. For this case, we introduce the unique field:

template = {
    'id': 'guid',
    'status': ['online', 'offline', 'dnd', 'anonymous'],
    'age': 'small_int',
    'homepage': 'url',
    'name': 'name',
    'headline': 'sentence',
    'about': 'paragraph',
    'primary_email': {'typemap': 'email', 'unique': True, 'tries': 10}
}

In the primary_email field above, we passed a dictionary with the following pairs:

typemap - which field type to generate (in this case 'email')
unique - tells the generator that each value should be unique
tries - the number of times gen.email() will be called to try to get a unique entry. If a unique item cannot be generated in _tries_ iterations, the generator tries the same number of iterations again, this time appending 1-5 random characters to each generated value. If a unique value still isn't found, GUIDs are generated until a unique one is.

The generator does its honest best to honor the requested type, but it prioritizes uniqueness. The default number of tries is 10, so from our example above:

10 attempts with generator.email()
10 attempts with generator.email() + generator.gen_chars()
infinite attempts with generator.guid()
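The fallback order can be sketched as follows. Here make_email and unique_value are hypothetical stand-ins used only to illustrate the three stages described above; they are not the library's internals:

```python
import random
import string
import uuid

def make_email():
    # Deliberately tiny value space so the fallback stages actually trigger.
    return random.choice(['a@x.com', 'b@x.com'])

def unique_value(generate, seen, tries=10):
    for _ in range(tries):                      # 1) plain generator
        v = generate()
        if v not in seen:
            return v
    for _ in range(tries):                      # 2) generator + 1-5 random chars
        v = generate() + ''.join(
            random.choice(string.ascii_lowercase)
            for _ in range(random.randint(1, 5)))
        if v not in seen:
            return v
    while True:                                 # 3) GUIDs until unique
        v = str(uuid.uuid4())
        if v not in seen:
            return v

seen = set()
for _ in range(50):
    seen.add(unique_value(make_email, seen))
assert len(seen) == 50
```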

Finer Grained Control

Now we want the user to be able to set a link to their current favorite post. You could do this by adding a field called 'favpost' and setting its type to 'slug' (like the ones used to URL-encode blog post ids while keeping them human readable). The problem is that this would likely generate a unique favpost for each document, whereas in the real world there would be a finite set of posts.

You can control this behavior by using a Python list as the type. In this example, we use a list comprehension to generate a list of 1000 slugs that will be randomly selected from when the documents are generated:

template = {
    'id': 'guid',
    'status': ['online', 'offline', 'dnd', 'anonymous'],
    'age': 'small_int',
    'homepage': 'url',
    'name': 'name',
    'headline': 'sentence',
    'about': 'paragraph',
    'favpost': [gen.slug() for n in range(1000)]
}
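The effect of a list-valued type is easy to see in isolation. The stand-in slugs below play the role of the 1000 pre-generated gen.slug() values:

```python
import random

slugs = ['post-%d' % i for i in range(1000)]      # stand-in for [gen.slug() ...]
favposts = [random.choice(slugs) for _ in range(100000)]

# Every generated favpost comes from the finite pool, so values repeat
# across a large batch instead of being unique per document.
assert set(favposts) <= set(slugs)
```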

So, what did we end up with?

This is one result:

{
    'name': 'Ster Ev',
    'age': 87,
    'status': 'anonymous',
    'favpost': 'anre-regtehcie57',
    'headline': 'ilrendna anr mo inttuonth anuir',
    'homepage': 'http://enar692.com/ten/erst/eresnn.heotiatin-neworwnti54-atnd',
    'id': 'ced10e96-b02c-4292-9be8-22dd8772c64e',
    'rating': 1.9779484996288086,
    'contact': {
                   'email': '[email protected]',
                   'phone': '695-323-8276'
                },
    'about': 'Yeormftd or an on authar hei po heheat este ler hearain hethe
    hetiarte ti oren. Oncs yemf edhe inhe th bain thfin nanfee st. Thheannd
    chenes hein thin. Edrdth ttind te uearedor heoea hehaeren seonstth tith
    vemoal an rein gel don in. Anao is fecttrr.',

}

Documents are basic Python dictionaries, so you can use them directly in your program or convert them to JSON or any other serialization format for testing anywhere.
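For example, serializing a generated document to JSON is a single call. The document below is a hand-built stand-in for one element of gen.documents(...):

```python
import json
import random
import uuid

doc = {'id': str(uuid.uuid4()), 'age': random.randint(18, 90), 'status': 'online'}
payload = json.dumps(doc)              # ready to send to a server under test
assert json.loads(payload) == doc      # round-trips losslessly
```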

Word and Text Generation

Essential Generators comes with three built-in word and text generators:

MarkovTextGenerator This approach uses a Markov chain to generate text. In this case, the generator is trained on text to generate somewhat realistic random text from real words.

MarkovWordGenerator This approach uses a Markov chain to generate words. In this case, the generator is trained on text to generate somewhat realistic random words based on observed words.

StatisticTextGenerator This approach uses statistical distributions to generate words that are similar to real words.

MarkovTextGenerator

MarkovTextGenerator generates random text from real words using word-level bigram frequency. This is the default for generating sentences and paragraphs.

Example Word:

fifteen

Example Text:

reports the its citizens holding a tertiary education degree. Although Japan has 19 World Heritage List, fifteen of which
track the same species, several intermediate stages occur between sea and to a professional social network analysis,
network science, sociology, ethnography, statistics, optimization, and mathematics. The Vega Science Trust – science
videos, including physics Video: Physics "Lightning" Tour with Justin Morgan 52-part video course...

MarkovWordGenerator

MarkovWordGenerator generates random words from real letters using letter-level bigram frequency. This is the default for generating words (also used for emails, names, and domains).

Example Word:

groboo

Example Text:

Remes way by entrun co. Forche 40-194 quilim The lace colost thigag toures loples opprou Alpite go. of andian It Afte
imps stions revain Goto Stedes remapp go coutle Sountl doingu ablech thed al in whiclu thican Ocepro In havelo var clowne
the of couthe...

StatisticTextGenerator

StatisticTextGenerator generates random words from statistical distributions observed in a large corpus.

Example Word:

anamer

Example Text:

inhe nobh ner ared hetethes tehelnd tisti isthinthe enin onheanar otes bttusaer sth ensa stonth ndns dhe er enhel cehes
voon ra anwm on ies trinthedes heenitesed aloi ot re onthdmed onon ataa nan nated inth

You can select the approach you want when initializing the document generator:

# use default generators
gen = DocumentGenerator()
# also the default
gen = DocumentGenerator(text_generator=MarkovTextGenerator(), word_generator=MarkovWordGenerator())
# use MarkovWordGenerator for both
gen = DocumentGenerator(text_generator=MarkovWordGenerator())
# use StatisticTextGenerator for both
gen = DocumentGenerator(text_generator=StatisticTextGenerator(), word_generator=StatisticTextGenerator())

Creating New Models

Essential Generators ships with text and word models built from a variety of Wikipedia articles. Three scripts are included to help you generate new models:

build_corpus.py - Retrieves specified articles from Wikipedia to use when training the models. Default output is 'corpus.txt'.
build_text_model.py - Uses corpus.txt to output markov_textgen.json, the text model for sentences and paragraphs.
build_word_model.py - Uses corpus.txt to output markov_wordgen.json, the word model (for words, emails, domains, etc.)

Disclaimer

The purpose of this module is to quickly generate data for use cases like load testing and performance evaluation. It attempts to mimic real data, but it will not have the frequency or statistical qualities of real-world data. There are no warranties, and it should not be used for scientific, health, or industrial purposes.

Why did I build this?

There are several great Python modules out there that generate fake data, so why did I make this? Two reasons, really:

1. I wanted a dead simple way to generate data to test other projects, and I just wasn't finding the flexibility I was looking for.
2. One of my problems with the existing approaches was the limited number of 'lorem ipsum' style words available for generating text. I wanted to build a better lorem ipsum generator, and this made a nice platform.


Comments
  • Feature Request: set a random seed, get deterministic output (kinda works already)

    Hi, I'd like to use your (excellent) library to generate test datasets in a deterministic way.

    I'm puzzled by my findings: It appears to be deterministic for short strings, but after a certain length, the random seed gets lost. I wrote a small script to show you what I mean:

    import random
    from hashlib import md5
    from essential_generators import DocumentGenerator
    
    g = DocumentGenerator()
    
    def compare(depth, seed=None):
    
        print(f"{depth} characters deep:")
        left = []
        right = []
    
        def go():
            payload = []
    
            # reset the random seed if specified
            if seed:
                print(f"using seed {seed}:")
                random.seed(a=seed)
    
            # generate five paragraphs
            # but only analyze their leading substrings
            for _ in range(5):
                m = md5()
                m.update(g.paragraph()[0:depth].encode())
                payload.append(m.hexdigest())
            return payload
    
        left = go()
        right = go()
    
        for l, r in zip(left, right):
            print(l, r)
    
    # no expectation of equivalence
    print('no seed init:')
    compare(15)
    
    # expect equivalence and find it always
    print('\nwith seed init:')
    compare(15, seed=1)
    
    # expect equivalence and find it sometimes
    print('\nwith seed init:')
    compare(35, seed=1)
    
    # expect equivalence and find it never
    print('\nwith seed init:')
    compare(100, seed=1)
    

    Here is the output:

    no seed init:
    15 characters deep:
    99cd370a256a08c0935df07588a9d149 5712ca694a784b7de849030127b5f8bf
    51f9589d4c3eeaac7beaffab1bd4aabe 302ef08d6af8864527c166c50b41e9b6
    c8f6a33ab25cc4d36c1929325d10bd1e 1b8736fc0574498b0485242c2037c433
    36af5f84a2564a557bfab9bfb43d0aee c53d1f3bb0684b2fc648a28daf954e56
    ef2386e0decfcf4275fa11497d72a934 3744e106265e39d25d9b4b9b1860674f
    
    with seed init:
    15 characters deep:
    using seed 1:
    using seed 1:
    c91ab6b5cf81131dabe56fb1128d2819 c91ab6b5cf81131dabe56fb1128d2819
    52503d52c197f3e09698885a295f6944 52503d52c197f3e09698885a295f6944
    57befb45c97732d0847ad512d75a4310 57befb45c97732d0847ad512d75a4310
    adeae47e78fdba591212906ca5b13eb2 adeae47e78fdba591212906ca5b13eb2
    19df57795b6f5020885bf04f8a677d9d 19df57795b6f5020885bf04f8a677d9d
    
    with seed init:
    35 characters deep:
    using seed 1:
    using seed 1:
    b4e90b4567af47ca36b26093d7ee0d45 b4e90b4567af47ca36b26093d7ee0d45
    1dc7d702222970c395ee7124326938b9 1dc7d702222970c395ee7124326938b9
    ecc3955fc825be75aaa4100dca837f46 03ff6e24ecf360ef080cee5ee83439e6 <--- huh?
    f98d0f54fc27985d055bc959d395955c f98d0f54fc27985d055bc959d395955c
    452d9069d8e02fcf1b85fac3afa730ee bbaf39bccf8db61ca515b7821a5ea34e <--- huh?
    
    with seed init:
    100 characters deep:
    using seed 1:
    using seed 1:
    5b642f729efb10cefa35dcb3241d8f9e b50950a765dfd6fbc73a07627924129e
    e0e63f320af5c37d018e048406a8c653 76580ed2a943c5e92ee4ed9bcbf738c0
    14e2456f05ff42248d5a39239771bc3b 8d35636e43182866dce4ce9e761b7f64
    186e426a961868df3b7942f3d1fca404 5ffa91b250be0e69a35ac44995a02f5f
    f0088047f353cd394a0667fa5dfac253 0a1b79ed08951dad813aa7ce1b399a49
    
    • The first batch has no matches, which is expected because I didn't seed the RNG.
    • The second batch has all matches because I did seed the RNG, but I only consumed the first 15 characters.
    • Three of the third batch matched. Presumably 35 characters is enough for the "problem" to happen sometimes.
    • The last batch has no matches. Presumably 100 characters is long enough for the "problem" to happen always.

    So I guess this is a feature request: Can you add a way to supply a randomness seed so that it can be made to produce the same pseudorandom output every time?

    If you don't feel like it, can you give me a hint about where the seed is being forgotten? Then I'll take a crack at it in a fork.

    Thank you.

    opened by MatrixManAtYrService 3
  • SUGGESTION: slug generation -> don't allow apostrophes, commas, etc.

    Just a suggestion. I am using this package for generating dummy Django text. Because this package allows slugs like:

    bas'-The-jorthe-of-val-lected
    2005)-sethe-cat-mand-two8-onts
    Egy.652369-pealti18-sq
    in.-beis-ch
    

    Django throws errors. Example error:

    django.urls.exceptions.NoReverseMatch: Reverse for 'service_detail' with keyword arguments '{'username': 'admin', 'topic1': 'general', 'pk': 9, 'slug': "bas'-The-jorthe-of-val-lected"}' not found.
    

    My nit-pick suggestion is to add an argument to the slug generator that excludes punctuation and makes everything lowercase. It's simple for me to do myself, but from a package perspective this could benefit more users than just me. I don't know why anyone would want upper-case letters or punctuation in slugs; I'd argue that is not the most common use case.
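Until such an argument exists, a workaround sketch (sanitize is our own helper, not part of essential_generators) can post-process generated slugs into Django's allowed [-a-zA-Z0-9_] alphabet:

```python
import re

def sanitize(slug):
    # Lowercase, replace disallowed characters with '-', then tidy the dashes.
    slug = re.sub(r'[^a-z0-9_-]+', '-', slug.lower())
    return re.sub(r'-{2,}', '-', slug).strip('-')

assert sanitize("bas'-The-jorthe-of-val-lected") == 'bas-the-jorthe-of-val-lected'
```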

    opened by jaradc 0
  • current corpus

    Thanks for sharing this.

    Can I ask what the source of the corpus used in the build package is? I see it's some Wikipedia-like text. Can you confirm these are just random Wikipedia articles?

    opened by sadransh 0
Owner
Shane C Mason