Essential Document Generator

Overview

Dead Simple Document Generation

Whether it's testing database performance or a new web interface, we've all needed a dead simple solution that's flexible enough to generate a complex data set. If this is one of those times, you've come to the right place. Essential Generators uses Markov chains to generate 'realistic' data, and you can train them on your own data to make it even more real.

Install

Using pip:

pip install essential_generators

Use case: Get some random values

Simple interface:

>>> from essential_generators import DocumentGenerator

>>> gen = DocumentGenerator()

>>> gen.email()
'[email protected]'

>>> gen.url()
'https://ver.co.uk/has/pron/sing/th.ablica-attrob79'

>>> gen.phone()
'547-922-3848'

>>> gen.slug()
'ehillote-henaiour-ebemaice-qsiat76-heheellti'

>>> gen.word()
'choleg'

>>> gen.sentence()
'Possess something historic and prehistoric sites within the family.'

>>> gen.paragraph()
"Country's total that roll clouds of gas that can affect. Officers: lieutenant aquifer system under
the alaska supreme court, 14. About reality. perfect. this means that logic programs combine declarative
and procedural law. some. 20.3% of other nations. during the meiji constitution, and assembled the imperial
estates. Reduce visibility work during the regime withdrew from the crow in. Divert recyclable at 100.
Applications. because no carbon, then all of which glucose (c6h12o6) and stearin (c57h110o6) are convenient.
In french. forms can each be divided into: information theory. Therapeutic orientation. around haines. steven
seagal's 1994 on deadly ground, starring michael caine.. Lakes and economic assistance (comecon). the states
and 72 dependent. D.f.: comisión campaign tracking, allowing the companies running these. Were struggling moon
io is volcanically active, and as the legal basis of chemical complexes."

Use case: Make lots of complex documents

Let's say we are building a database for a new social media site. We have a preliminary schema and want to test the server with some examples like this:

{
    id: 39f96ef8-08e0-408e-b727-984372a95d9d,
    status: online,
    age: 27,
    homepage: johndoe.github.io,
    name: John Doe,
    headline: A Really Cool Guy,
    about: Some longer profile text. Several Sentences.
}

Document Templates

Now let's say we want to generate hundreds of thousands of these records. For making documents, we first need to define the template:

gen = DocumentGenerator()

template = {
    'id': 'guid',
    'status': ['online', 'offline', 'dnd', 'anonymous'],
    'age': 'small_int',
    'homepage': 'url',
    'name': 'name',
    'headline': 'sentence',
    'about': 'paragraph'
}

gen.set_template(template)
documents = gen.documents(1000)

The template gives the structure and type of each field in the document. Note that status is a list rather than a single type; when a list is provided as the type, one of its items is randomly selected for each generated document using random.choice(list).

Custom Fields

Now we want to implement a new feature where users can rate each other between 1-5 stars and we want to keep track of the average rating (a float between 1 and 5). We can do this by passing in a function as the type, like so:

import random

def gen_rating():
    return random.uniform(1, 5)

template = {
    'id': 'guid',
    'status': ['online', 'offline', 'dnd', 'anonymous'],
    'age': 'small_int',
    'homepage': 'url',
    'name': 'name',
    'headline': 'sentence',
    'about': 'paragraph',
    'rating': gen_rating,
}

In this case, when each document is created, gen_rating is called and the returned value is added to the document.
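Conceptually, the template dispatch works like this. The sketch below only illustrates the contract described above (lists are sampled, callables are invoked once per document); it is not the library's actual implementation:

```python
import random

def gen_rating():
    return random.uniform(1, 5)

template = {
    'status': ['online', 'offline', 'dnd', 'anonymous'],
    'rating': gen_rating,
}

def render(template):
    # Illustrative dispatch: callables are invoked once per document,
    # lists are sampled with random.choice, anything else passes through.
    doc = {}
    for field, spec in template.items():
        if callable(spec):
            doc[field] = spec()
        elif isinstance(spec, list):
            doc[field] = random.choice(spec)
        else:
            doc[field] = spec
    return doc

docs = [render(template) for _ in range(100)]
assert all(1 <= d['rating'] <= 5 for d in docs)
```

With the real library you would simply add 'rating': gen_rating to the template and call gen.documents(n).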

Nested Documents

Now that users are rating each other, of course they'll want to get in contact with each other. The schema gets extended to include a nested contact object. Just like any custom field, we can generate nested documents using generator functions as the type:

def gen_contact():
    return {
        'email': gen.email(),
        'phone': gen.phone()
    }

template = {
    'id': 'guid',
    'status': ['online', 'offline', 'dnd', 'anonymous'],
    'age': 'small_int',
    'homepage': 'url',
    'name': 'name',
    'headline': 'sentence',
    'about': 'paragraph',
    'contact': gen_contact
}

Word & Sentence Caching

Creating word and sentence caches serves two purposes: it restricts the space of generated elements to a discrete size (for instance, the average American's vocabulary is between 5,000 and 10,000 words), and it greatly speeds up subsequent document generation. Use them like this:

gen.init_word_cache(5000)
gen.init_sentence_cache(5000)

In the first line, 5000 words are generated. In the second line, 5000 sentences of 5 to 15 words each are built from the word cache. Subsequent calls to gen.word() and gen.sentence() will select from the caches. If you want a word or sentence that is not in the cache, call gen.gen_word() or gen.gen_sentence() respectively. If you want finer-grained control, gen.word_cache and gen.sentence_cache are lists of strings that can be manipulated directly.
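Conceptually, the caches behave like fixed pools that gen.word() and gen.sentence() draw from. This sketch mimics that behavior with stand-in words; it is not the library's actual implementation:

```python
import random

# Stand-ins for gen.init_word_cache(5000) / gen.init_sentence_cache(5000):
word_cache = ['word%d' % i for i in range(5000)]
sentence_cache = [
    ' '.join(random.choice(word_cache) for _ in range(random.randint(5, 15)))
    for _ in range(5000)
]

word = random.choice(word_cache)          # what gen.word() does once cached
sentence = random.choice(sentence_cache)  # what gen.sentence() does once cached
assert word in word_cache
assert 5 <= len(sentence.split()) <= 15
```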

Unique Fields

In this case, we want to guarantee that certain fields are unique. You could accomplish this by choosing 'guid' as the field type, but that isn't good enough if you want the field to still look like an email address or a number. For this case, we introduce the unique field:

template = {
    'id': 'guid',
    'status': ['online', 'offline', 'dnd', 'anonymous'],
    'age': 'small_int',
    'homepage': 'url',
    'name': 'name',
    'headline': 'sentence',
    'about': 'paragraph',
    'primary_email': {'typemap': 'email', 'unique': True, 'tries': 10}
}

In the primary_email field above, we passed a dictionary with the following pairs:

typemap - which field type to generate (in this case 'email')
unique - tells the generator that each value should be unique
tries - the number of times gen.email() will be called to try to get a unique entry. If a unique item cannot be generated in _tries_ iterations, the generator tries the same number of iterations again, this time appending 1-5 random characters to each generated value. If a unique value still isn't found, GUIDs are generated until a unique one is.

The generator does its honest best to honor the requested type, but it prioritizes uniqueness. The default number of tries is 10, so from our example above:

10 attempts with generator.email()
10 attempts with generator.email() + generator.gen_chars()
infinite attempts with generator.guid()
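The fallback order can be sketched as follows. Here make_email and unique_value are hypothetical stand-ins used only to illustrate the three stages described above; they are not the library's internals:

```python
import random
import string
import uuid

def make_email():
    # Deliberately tiny value space so the fallback stages actually trigger.
    return random.choice(['a@x.com', 'b@x.com'])

def unique_value(generate, seen, tries=10):
    for _ in range(tries):                      # 1) plain generator
        v = generate()
        if v not in seen:
            return v
    for _ in range(tries):                      # 2) generator + 1-5 random chars
        v = generate() + ''.join(
            random.choice(string.ascii_lowercase)
            for _ in range(random.randint(1, 5)))
        if v not in seen:
            return v
    while True:                                 # 3) GUIDs until unique
        v = str(uuid.uuid4())
        if v not in seen:
            return v

seen = set()
for _ in range(50):
    seen.add(unique_value(make_email, seen))
assert len(seen) == 50
```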

Finer Grained Control

Now we want the user to be able to set a link to their current favorite post. You could do this by adding a field called 'favpost' and setting its type to 'slug' (like the ones used to URL-encode blog post ids while keeping them human readable). The problem is that this would likely generate a unique favpost for each document, whereas in the real world there would be a finite set of posts.

You can control this behavior by using a Python list as the type. In this example, we use a list comprehension to generate a list of 1000 slugs that will be randomly selected from when the documents are generated:

template = {
    'id': 'guid',
    'status': ['online', 'offline', 'dnd', 'anonymous'],
    'age': 'small_int',
    'homepage': 'url',
    'name': 'name',
    'headline': 'sentence',
    'about': 'paragraph',
    'favpost': [gen.slug() for n in range(1000)]
}
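The effect of a list-valued type is easy to see in isolation. The stand-in slugs below play the role of the 1000 pre-generated gen.slug() values:

```python
import random

slugs = ['post-%d' % i for i in range(1000)]      # stand-in for [gen.slug() ...]
favposts = [random.choice(slugs) for _ in range(100000)]

# Every generated favpost comes from the finite pool, so values repeat
# across a large batch instead of being unique per document.
assert set(favposts) <= set(slugs)
```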

So, what did we end up with?

This is one result:

{
    'name': 'Ster Ev',
    'age': 87,
    'status': 'anonymous',
    'favpost': 'anre-regtehcie57',
    'headline': 'ilrendna anr mo inttuonth anuir',
    'homepage': 'http://enar692.com/ten/erst/eresnn.heotiatin-neworwnti54-atnd',
    'id': 'ced10e96-b02c-4292-9be8-22dd8772c64e',
    'rating': 1.9779484996288086,
    'contact': {
                   'email': '[email protected]',
                   'phone': '695-323-8276'
                },
    'about': 'Yeormftd or an on authar hei po heheat este ler hearain hethe
    hetiarte ti oren. Oncs yemf edhe inhe th bain thfin nanfee st. Thheannd
    chenes hein thin. Edrdth ttind te uearedor heoea hehaeren seonstth tith
    vemoal an rein gel don in. Anao is fecttrr.',

}

Documents are basic Python dictionaries, so you can use them directly in your program or convert them to JSON or any other serialization format for testing anywhere.
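For example, serializing a generated document to JSON is a single call. The document below is a hand-built stand-in for one element of gen.documents(...):

```python
import json
import random
import uuid

doc = {'id': str(uuid.uuid4()), 'age': random.randint(18, 90), 'status': 'online'}
payload = json.dumps(doc)              # ready to send to a server under test
assert json.loads(payload) == doc      # round-trips losslessly
```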

Word and Text Generation

Essential Generators comes with three built-in word and text generators:

MarkovTextGenerator This approach uses a Markov chain to generate text. In this case, the generator is trained on text to generate somewhat realistic random text from real words.

MarkovWordGenerator This approach uses a Markov chain to generate words. In this case, the generator is trained on text to generate somewhat realistic random words based on observed words.

StatisticTextGenerator This approach uses statistical distributions to generate words that are similar to real words.

MarkovTextGenerator

MarkovTextGenerator generates random text from real words using word-level bigram frequency. This is the default for generating sentences and paragraphs.

Example Word:

fifteen

Example Text:

reports the its citizens holding a tertiary education degree. Although Japan has 19 World Heritage List, fifteen of which
track the same species, several intermediate stages occur between sea and to a professional social network analysis,
network science, sociology, ethnography, statistics, optimization, and mathematics. The Vega Science Trust – science
videos, including physics Video: Physics "Lightning" Tour with Justin Morgan 52-part video course...

MarkovWordGenerator

MarkovWordGenerator generates random words from real letters using letter-level bigram frequency. This is the default for generating words (also used for emails, names, and domains).

Example Word:

groboo

Example Text:

Remes way by entrun co. Forche 40-194 quilim The lace colost thigag toures loples opprou Alpite go. of andian It Afte
imps stions revain Goto Stedes remapp go coutle Sountl doingu ablech thed al in whiclu thican Ocepro In havelo var clowne
the of couthe...

StatisticTextGenerator

StatisticTextGenerator generates random words from statistical distributions observed in a large corpus.

Example Word:

anamer

Example Text:

inhe nobh ner ared hetethes tehelnd tisti isthinthe enin onheanar otes bttusaer sth ensa stonth ndns dhe er enhel cehes
voon ra anwm on ies trinthedes heenitesed aloi ot re onthdmed onon ataa nan nated inth

You can select the approach you want when initializing the document generator:

# use default generators
gen = DocumentGenerator()
# also the default
gen = DocumentGenerator(text_generator=MarkovTextGenerator(), word_generator=MarkovWordGenerator())
# use MarkovWordGenerator for both
gen = DocumentGenerator(text_generator=MarkovWordGenerator())
# use StatisticTextGenerator for both
gen = DocumentGenerator(text_generator=StatisticTextGenerator(), word_generator=StatisticTextGenerator())

Creating New Models

Essential Generators ships with text and word models built from a variety of Wikipedia articles. Three scripts are included to help you generate new models:

build_corpus.py - Retrieves specified articles from Wikipedia to use when training the models. Default output is 'corpus.txt'.
build_text_model.py - Uses corpus.txt to output markov_textgen.json, the text model for sentences and paragraphs.
build_word_model.py - Uses corpus.txt to output markov_wordgen.json, the word model (for words, emails, domains, etc.)

Disclaimer

The purpose of this module is to quickly generate data for use cases like load testing and performance evaluation. It attempts to mimic real data, but it will not have the frequency or statistical qualities of real-world data. There are no warranties, and it should not be used for scientific, health, or industrial purposes.

Why did I build this?

There are several great Python modules out there that generate fake data, so why did I make this? Two reasons, really:

1. I wanted a dead simple way to generate data to test other projects, and I just wasn't finding the flexibility I was looking for.
2. One of my problems with the existing approaches was the limited number of 'lorem ipsum' style words available for generating text. I wanted to build a better lorem ipsum generator, and this made a nice platform.


Comments
  • Feature Request: set a random seed, get deterministic output (kinda works already)

    Hi, I'd like to use your (excellent) library to generate test datasets in a deterministic way.

    I'm puzzled by my findings: It appears to be deterministic for short strings, but after a certain length, the random seed gets lost. I wrote a small script to show you what I mean:

    import random
    from hashlib import md5
    from essential_generators import DocumentGenerator
    
    g = DocumentGenerator()
    
    def compare(depth, seed=None):
    
        print(f"{depth} characters deep:")
        left = []
        right = []
    
        def go():
            payload = []
    
            # reset the random seed if specified
            if seed:
                print(f"using seed {seed}:")
                random.seed(a=seed)
    
            # generate five paragraphs
            # but only analyze their leading substrings
            for _ in range(5):
                m = md5()
                m.update(g.paragraph()[0:depth].encode())
                payload.append(m.hexdigest())
            return payload
    
        left = go()
        right = go()
    
        for l, r in zip(left, right):
            print(l, r)
    
    # no expectation of equivalence
    print('no seed init:')
    compare(15)
    
    # expect equivalence and find it always
    print('\nwith seed init:')
    compare(15, seed=1)
    
    # expect equivalence and find it sometimes
    print('\nwith seed init:')
    compare(35, seed=1)
    
    # expect equivalence and find it never
    print('\nwith seed init:')
    compare(100, seed=1)
    

    Here is the output:

    no seed init:
    15 characters deep:
    99cd370a256a08c0935df07588a9d149 5712ca694a784b7de849030127b5f8bf
    51f9589d4c3eeaac7beaffab1bd4aabe 302ef08d6af8864527c166c50b41e9b6
    c8f6a33ab25cc4d36c1929325d10bd1e 1b8736fc0574498b0485242c2037c433
    36af5f84a2564a557bfab9bfb43d0aee c53d1f3bb0684b2fc648a28daf954e56
    ef2386e0decfcf4275fa11497d72a934 3744e106265e39d25d9b4b9b1860674f
    
    with seed init:
    15 characters deep:
    using seed 1:
    using seed 1:
    c91ab6b5cf81131dabe56fb1128d2819 c91ab6b5cf81131dabe56fb1128d2819
    52503d52c197f3e09698885a295f6944 52503d52c197f3e09698885a295f6944
    57befb45c97732d0847ad512d75a4310 57befb45c97732d0847ad512d75a4310
    adeae47e78fdba591212906ca5b13eb2 adeae47e78fdba591212906ca5b13eb2
    19df57795b6f5020885bf04f8a677d9d 19df57795b6f5020885bf04f8a677d9d
    
    with seed init:
    35 characters deep:
    using seed 1:
    using seed 1:
    b4e90b4567af47ca36b26093d7ee0d45 b4e90b4567af47ca36b26093d7ee0d45
    1dc7d702222970c395ee7124326938b9 1dc7d702222970c395ee7124326938b9
    ecc3955fc825be75aaa4100dca837f46 03ff6e24ecf360ef080cee5ee83439e6 <--- huh?
    f98d0f54fc27985d055bc959d395955c f98d0f54fc27985d055bc959d395955c
    452d9069d8e02fcf1b85fac3afa730ee bbaf39bccf8db61ca515b7821a5ea34e <--- huh?
    
    with seed init:
    100 characters deep:
    using seed 1:
    using seed 1:
    5b642f729efb10cefa35dcb3241d8f9e b50950a765dfd6fbc73a07627924129e
    e0e63f320af5c37d018e048406a8c653 76580ed2a943c5e92ee4ed9bcbf738c0
    14e2456f05ff42248d5a39239771bc3b 8d35636e43182866dce4ce9e761b7f64
    186e426a961868df3b7942f3d1fca404 5ffa91b250be0e69a35ac44995a02f5f
    f0088047f353cd394a0667fa5dfac253 0a1b79ed08951dad813aa7ce1b399a49
    
    • The first batch has no matches, which is expected because I didn't seed the RNG.
    • The second batch has all matches because I did seed the RNG, but I only consumed the first 15 characters.
    • Three of the third batch matched. Presumably 35 characters is enough for the "problem" to happen sometimes.
    • The last batch has no matches. Presumably 100 characters is long enough for the "problem" to happen always.

    So I guess this is a feature request: Can you add a way to supply a randomness seed so that it can be made to produce the same pseudorandom output every time?

    If you don't feel like it, can you give me a hint about where the seed is being forgotten? Then I'll take a crack at it in a fork.

    Thank you.

    opened by MatrixManAtYrService 3
  • SUGGESTION: slug generation -> don't allow apostrophes, commas, etc.

    Just a suggestion. I am using this package for generating dummy Django text. Because this package allows slugs like:

    bas'-The-jorthe-of-val-lected
    2005)-sethe-cat-mand-two8-onts
    Egy.652369-pealti18-sq
    in.-beis-ch
    

    Django throws errors. Example error:

    django.urls.exceptions.NoReverseMatch: Reverse for 'service_detail' with keyword arguments '{'username': 'admin', 'topic1': 'general', 'pk': 9, 'slug': "bas'-The-jorthe-of-val-lected"}' not found.
    

    My nit-pick suggestion is to add an argument to the slug generator that excludes punctuation and makes everything lowercase. It's simple for me to do myself, but from a package perspective this could benefit more users than just me. I don't know why anyone would want upper-case letters or punctuation in slugs; I'd argue that is not the most common use case.
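Until such an argument exists, a workaround sketch (sanitize is our own helper, not part of essential_generators) can post-process generated slugs into Django's allowed [-a-zA-Z0-9_] alphabet:

```python
import re

def sanitize(slug):
    # Lowercase, replace disallowed characters with '-', then tidy the dashes.
    slug = re.sub(r'[^a-z0-9_-]+', '-', slug.lower())
    return re.sub(r'-{2,}', '-', slug).strip('-')

assert sanitize("bas'-The-jorthe-of-val-lected") == 'bas-the-jorthe-of-val-lected'
```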

    opened by jaradc 0
  • current corpus

    Thanks for sharing this.

    Can I ask what the source of the corpus used in the build package is? I see it's some Wikipedia-like text. Can you confirm these are just random Wikipedia articles?

    opened by sadransh 0
Owner
Shane C Mason