Quickly download, clean up, and install public datasets into a database management system

Overview

Finding data is one thing. Getting it ready for analysis is another. Acquiring, cleaning, standardizing, and importing publicly available data is time-consuming because many datasets lack machine-readable metadata and do not conform to established data structures and formats. The Data Retriever automates the first steps in the data analysis pipeline by downloading, cleaning, and standardizing datasets, and importing them into relational databases, flat files, or programming languages. Automating this process reduces the time needed to get most large datasets up and running by hours, and in some cases days.
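
The same pipeline is exposed through the package's Python interface. A minimal sketch (function names follow the documented retriever API; check help(retriever) in your installed version for exact signatures):

import retriever as rt

rt.check_for_updates()       # download the latest dataset scripts
print(rt.dataset_names())    # list the datasets that can be installed

rt.install_csv('iris')       # download, clean, and write the iris tables to CSV
data = rt.fetch('iris')      # or load the tables directly into Python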

Installing the Current Release

If you have Python installed, you can install the current release using either pip:

pip install retriever

or conda after adding the conda-forge channel (conda config --add channels conda-forge):

conda install retriever

Depending on your system configuration this may require sudo for pip:

sudo pip install retriever

Precompiled binary installers are also available for Windows, OS X, and Ubuntu/Debian on the releases page. These do not require a Python installation.

List of Available Datasets

Installing From Source

To install the Data Retriever from source, you'll need Python 3.6.8+ with the following packages installed:

  • xlrd

The following packages are optionally needed to interact with associated database management systems:

  • PyMySQL (for MySQL)
  • sqlite3 (for SQLite; included in the Python standard library)
  • psycopg2-binary (for PostgreSQL), which replaces the earlier psycopg2 dependency
  • pyodbc (for MS Access; this option is only available on Windows)
  • Microsoft Access Driver (ODBC, for Windows)

To install from source

Either use pip to install directly from GitHub:

pip install git+https://[email protected]/weecology/retriever.git

or:

  1. Clone the repository
  2. From the directory containing setup.py, run the following command: pip install . (you may need to prefix the command with sudo depending on your system, i.e., sudo pip install .)

More extensive documentation for those interested in development can be found here

Using the Command Line

After installing, run retriever update to download all of the available dataset scripts. To see the full list of command line options and datasets run retriever --help. The output will look like this:

usage: retriever [-h] [-v] [-q]
                 {download,install,defaults,update,new,new_json,edit_json,delete_json,ls,citation,reset,help}
                 ...

positional arguments:
  {download,install,defaults,update,new,new_json,edit_json,delete_json,ls,citation,reset,help}
                        sub-command help
    download            download raw data files for a dataset
    install             download and install dataset
    defaults            displays default options
    update              download updated versions of scripts
    new                 create a new sample retriever script
    new_json            CLI to create retriever datapackage.json script
    edit_json           CLI to edit retriever datapackage.json script
    delete_json         CLI to remove retriever datapackage.json script
    ls                  display a list of all available dataset scripts
    citation            view citation
    reset               reset retriever: removes configuration settings,
                        scripts, and cached data
    help

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -q, --quiet           suppress command-line output

To install datasets, use retriever install:

usage: retriever install [-h] [--compile] [--debug]
                         {mysql,postgres,sqlite,msaccess,csv,json,xml} ...

positional arguments:
  {mysql,postgres,sqlite,msaccess,csv,json,xml}
                        engine-specific help
    mysql               MySQL
    postgres            PostgreSQL
    sqlite              SQLite
    msaccess            Microsoft Access
    csv                 CSV
    json                JSON
    xml                 XML

optional arguments:
  -h, --help            show this help message and exit
  --compile             force re-compile of script before downloading
  --debug               run in debug mode

Examples

These examples use the Iris flower dataset. More examples can be found in the Data Retriever documentation.

Using Install

retriever install -h   (gives install options)

To use a specific database engine: retriever install {engine}

retriever install mysql -h     (gives install mysql options)
retriever install mysql --user myuser --password ******** --host localhost --port 8888 --database_name testdbase iris

To install data into a SQLite database named iris.db, you would use:

retriever install sqlite iris -f iris.db
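
The equivalent call through the Python interface (a sketch; the file keyword is taken from the documented install_sqlite signature, so verify it against your installed version):

import retriever as rt

# Equivalent of: retriever install sqlite iris -f iris.db
rt.install_sqlite('iris', file='iris.db')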

Using download

retriever download -h    (gives you help options)
retriever download iris
retriever download iris --path C:\Users\Documents
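
The same download through the Python interface (a sketch; the path keyword is assumed to mirror the CLI's --path option):

import retriever as rt

# Equivalent of: retriever download iris --path <directory>
rt.download('iris', path='.')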

Using citation

retriever citation   (citation of the retriever engine)
retriever citation iris  (citation for the iris data)

Spatial Dataset Installation

Set up Spatial support

To set up spatial support for Postgres using PostGIS, please refer to the spatial set-up docs.

retriever install postgres harvard-forest # Vector data
retriever install postgres bioclim # Raster data
# Install only the data of USGS elevation in the given extent
retriever install postgres usgs-elevation -b -94.98704597353938 39.027001800158615 -94.3599408119917 40.69577051867074
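
The Python interface exposes the same engine (a hedged sketch; install_postgres is part of the documented API, but connection keywords and spatial options vary by version):

import retriever as rt

# Sketch: install spatial datasets through the Python interface,
# assuming a PostGIS-enabled Postgres is already configured.
rt.install_postgres('harvard-forest')   # vector data
rt.install_postgres('bioclim')          # raster data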

Website

For more information see the Data Retriever website.

Acknowledgments

Development of this software was funded by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4563 to Ethan White and the National Science Foundation as part of a CAREER award to Ethan White.

Comments
  • [WIP] Allow consuming JSON data

    [WIP] Allow consuming JSON data

    NOTE: I closed this PR by mistake. I'll re-open it. This pull request addresses issue #1334 (Allow consuming JSON data). Currently, we support 2 kinds of JSON datasets:

    1. Where the dataset's rows are present under a certain key of the JSON. For example, refer to this example; here the key is data.
    2. Where the dataset's rows are present under a certain key in different parts of the JSON. For example, refer to this example; here the key is laureates.
    3. [WIP] Where the JSON is in the form of a list (example). The current implementation for this is commented out, since it gets stuck in a recursion loop.
    opened by DumbMachine 37
  • Updated internal variable names to match that of datapackage #860

    Updated internal variable names to match that of datapackage #860

    Updated internal variable names to match the datapackage spec (#765). The following variable names were changed:

    tags -> keywords
    nulls -> missingValues
    name -> title
    shortname -> name
    
    The changes were done in the following files -
    
    retriever/lib/compile.py
    retriever/lib/datapackage.py
    retriever/lib/engine.py
    retriever/lib/parse_script_to_json.py
    retriever/lib/templates.py
    retriever/lib/tools.py
    scripts/bioclim.py
    scripts/biomass_allometry_db.py
    scripts/breed_bird_survey.py
    scripts/breed_bird_survey_50stop.py
    scripts/forest_inventory_analysis.py
    scripts/gentry_forest_transects.py
    scripts/npn.py
    scripts/plant_life_hist_eu.py
    scripts/prism_climate.py
    scripts/vertnet.py
    scripts/wood_density.py
    scripts/*.json (almost all datapackages): transition missingValues -> missing_values
    test/test_retriever.py
    retriever/__main__.py
    

    @henrykironde I have made the changes after updating it with the master branch

    Under Review and Tests 
    opened by jainamritanshu 37
  • Working on : Expansion of Spatial Data Support to the Data Retriever

    Working on : Expansion of Spatial Data Support to the Data Retriever

    I have started looking into, and working on, the project: Expansion of Spatial Data Support to the Data Retriever.

    @ethanwhite @henrykironde Please let me know of any aspects in particular I should be prioritising over others.

    Would it be okay if I carried on the discussion through this Issue?

    opened by ss-is-master-chief 31
  • Test OSX .app file

    Test OSX .app file

    I'm in the process of trying to get the EcoData Retriever fully functional on OSX (since so many of the awesome ecological informaticsy people I know use Macs). As I've mentioned elsewhere it looks like building from source now works, at least when using homebrew (http://ecodataretriever.org/getting_started.html).

    What I'm working on now is getting the .app working so that you don't need to be comfortable in the shell (and have XCode installed) to use the Retriever. I have a version that is working on the machine that I built it on (our only Mac) and was wondering if some kind Mac folks like @karthik, @sckott, @emhart, @sarahsupp and @dfalster might have a few minutes to give it a trial run.

    The file is here: https://www.dropbox.com/s/26b1pj91mqucc0l/retriever.zip

    Basically I'm just looking for folks to unzip it, double click on it, and see if:

    1. It opens at all.
    2. You can install things successfully when setting the database management system to CSV and sqlite (these don't have any external dependencies).
    3. If you have either MySQL or PostgreSQL installed if it works with them. (MySQL is a bit fragile at the moment. It is currently working for most datasets, but not all, so just try a few if you get errors).
    4. Report back.

    Thanks in advance. And, yes, I wrote this issue... on a Mac.

    opened by ethanwhite 31
  • Allow consuming JSON data

    Allow consuming JSON data

    Currently we only support ingesting delimited tabular data. It is increasingly common for tabular-style data to be distributed in JSON files, and it would be nice to also be able to consume this. We would probably just convert it to CSV as a starting point and then process it using our standard pipeline.

    There are a few not particularly active packages for doing this, but the code to do it is simple enough that, since none of the packages seem to be widely adopted, we might be better off just writing and maintaining it ourselves.

    (no rush on this, just a thought while looking at a cool dataset that's only available in JSON: http://data.unpaywall.org/products/snapshot)
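
    A starting point for that conversion might look like this (a sketch, not project code; the rows_key argument is hypothetical and dataset-specific):

    import csv
    import json
    import urllib.request

    def json_to_csv(url, rows_key, out_path):
        """Pull tabular rows out of a JSON payload and write them to CSV."""
        with urllib.request.urlopen(url) as response:
            payload = json.load(response)
        rows = payload[rows_key]  # assumes a list of flat dicts under one key
        fieldnames = sorted({key for row in rows for key in row})
        with open(out_path, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)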

    Feature Request 
    opened by ethanwhite 30
  • Update internals in reference to issue #765

    Update internals in reference to issue #765

    @ethanwhite @henrykironde I have made the following changes

    • name -> title
    • shortname -> name
    • tags -> keywords
    • nulls -> missingValues

    I wanted to ask about changing missingValues to missing_values, since the original seems to be in camel case rather than the PEP 8 naming convention; if you allow, I will change it. I am still finding other such internal names that could be updated. I think the code is otherwise clean. Kindly go through it once, and if there are any suggestions I will start working on them.

    Changes Requested 
    opened by jainamritanshu 30
  • Removed default encoding in reference to #716

    Removed default encoding in reference to #716

    Sir, I have removed the hard-coded assignments for encoding and added an encoding field. I haven't edited the existing scripts for the existing data; should I do them manually? Kindly review the code and let me know of any changes needed. @ethanwhite @henrykironde

    opened by jainamritanshu 29
  • an eBird Basic Dataset workflow

    an eBird Basic Dataset workflow

    Hey all,

    I've mostly gotten the eBird data into a PostgreSQL/PostGIS database, and I thought I'd share my code with you in case you wanted to integrate it into something more robust with EcoDataRetriever. If you know how to optimize it better, I'd love to hear what you come up with.

    If you do decide to include it, please acknowledge Matt Jones and Jim Regetz, since they helped me through this.

    Let me know if you have any questions!

    Dave

    PS the "world" data set unzips to be 50 gigabytes, so you'll probably want to work with something smaller...

    -- Data file available via http://ebird.org/ebird/data/download
    
    -- commands to extract the text file from the tarball:
       -- tar xvf ebd_relMay-2013.tar
       -- gunzip ebd_relMay-2013.txt.gz
    -- WARNING: The resulting file is almost 50 gigabytes!
    
    -- In retrospect, there's probably some premature optimization for some of these columns: if the data set changes,
    -- it might be safer to use longer varchar arguments.
    CREATE TABLE eBird (
      GLOBAL_UNIQUE_IDENTIFIER     char(50),      -- always 45-47 characters needed (so far)
      TAXONOMIC_ORDER              numeric,       -- Probably not needed
      CATEGORY                     varchar(20),   -- Probably 10 would be safe
      COMMON_NAME                  varchar(70),   -- Some hybrids have really long names
      SCIENTIFIC_NAME              varchar(70),   --  ''
      SUBSPECIES_COMMON_NAME       varchar(70),   --  ''
      SUBSPECIES_SCIENTIFIC_NAME   varchar(70),   --  ''
      OBSERVATION_COUNT            varchar(8),    -- Someone saw 1.3 million Auklets.
                                                  -- Unfortunately, it can't be an integer 
                                                  -- because some are just presence/absence
      BREEDING_BIRD_ATLAS_CODE     char(2),       -- need to confirm that these are always length 2
      AGE_SEX                      text,          -- Potentially long, but almost always blank
      COUNTRY                      varchar(50),   -- long enough for "Saint Helena, Ascension and Tristan da Cunha"
      COUNTRY_CODE                 char(2),       -- alpha-2 codes
      STATE_PROVINCE               varchar(50),   -- no idea if this is long enough? U.S. Virgin Islands may be almost 30
      SUBNATIONAL1_CODE            char(10),      -- looks standardized at 5 characters?
      COUNTY                       varchar(50),   -- who knows how long it could be
      SUBNATIONAL2_CODE            char(12),      -- looks standardized at 9 characters?
      IBA_CODE                     char(16),
      LOCALITY                     text,          -- unstructured/potentially long
      LOCALITY_ID                  char(10),      -- maximum observed so far is 8
      LOCALITY_TYPE                char(2),       -- short codes
      LATITUDE                     real,          -- Is this the appropriate level of precision?
      LONGITUDE                    real,          --    ''
      OBSERVATION_DATE             date,          -- Do I need to specify YMD somehow?
      TIME_OBSERVATIONS_STARTED    time,          -- How do I make this a time?
      TRIP_COMMENTS                text,          -- Comments are long, unstructured, 
      SPECIES_COMMENTS             text,          --    and inconsistent, but sometimes interesting
      OBSERVER_ID                  char(12),      -- max of 9 in the data I've seen so far
      FIRST_NAME                   text,          -- Already have observer IDs
      LAST_NAME                    text,          -- ''
      SAMPLING_EVENT_IDENTIFIER    char(12),      -- Probably want to index on this.
      PROTOCOL_TYPE                varchar(50),   -- Needs to be at least 30 for sure.
      PROJECT_CODE                 varchar(20),   -- Needs to be at least 10 for sure.
      DURATION_MINUTES             int,           -- bigint?
      EFFORT_DISTANCE_KM           real,          -- precision?
      EFFORT_AREA_HA               real,          -- precision?
      NUMBER_OBSERVERS             int,           -- just a small int
      ALL_SPECIES_REPORTED         int,           -- Seems to always be 1 or 0.  Maybe I could make this Boolean?
      GROUP_IDENTIFIER             varchar(10),   -- Appears to be max of 7 or 8
      APPROVED                     int,           -- Can be Boolean?
      REVIEWED                     int,           -- Can be Boolean?
      REASON                       char(17),      -- May need to be longer if data set includes unvetted data
      X                            text           -- Blank
    );
    
    
    COPY eBird
      FROM '/home/dharris/eBird/ebd_relMay-2013.txt'
      WITH (FORMAT csv,
            HEADER true,
            QUOTE E'\5',      -- The file has unbalanced quotes. Using an obscure character as a quote mark instead.
            DELIMITER E'\t');
    
    
    -- Note: it's probably slightly faster to load postgis and add a geographic column first (see below).
    -- I'm keeping the original ordering in this document for accuracy's sake.
    CREATE INDEX ON eBird (sampling_event_identifier);
    
    -- Example query: SELECT SCIENTIFIC_NAME FROM eBird WHERE SAMPLING_EVENT_IDENTIFIER = 'S9605852';
    -- Example query: SELECT count(SCIENTIFIC_NAME) FROM eBird WHERE SAMPLING_EVENT_IDENTIFIER = 'S9605852';
    
    
    CREATE EXTENSION postgis;
    ALTER TABLE eBird ADD COLUMN geog geography(POINT,4326); -- I hope 4326 is correct...
    UPDATE eBird SET geog = ST_GeogFromText('POINT(' || longitude || ' ' ||  latitude || ')');
    CREATE INDEX geog_index ON eBird USING GIST (geog); 
    
    -- Example query: find all the species within 1000 m of my dorm:
    -- SELECT SCIENTIFIC_NAME FROM eBird WHERE ST_DWithin(geog, ST_GeographyFromText('SRID=4326;POINT(-119.6972 34.4208)'), 1000);
    
    -- Slightly fancier version:
    -- SELECT DISTINCT SCIENTIFIC_NAME, COMMON_NAME FROM eBird 
    --   WHERE ST_DWithin(geog, ST_GeographyFromText('SRID=4326;POINT(-119.855385 34.417239)'), 1000) 
    --   ORDER BY SCIENTIFIC_NAME;
    

    (Edited to add some amazing PostGIS queries and some better comments, etc.)

    PS: After poking around a bit more, it looks like I should have used doubles rather than reals to store lat/lon. I had misread the documentation about how much precision was used for reals.

    opened by davharris 28
  • Gracefully handle failed downloads

    Gracefully handle failed downloads

    It is not uncommon for a data source to go down (e.g., #902) or for a download to fail for some reason (e.g., #863). We should catch these failures, avoid caching the data that comes down (which is sometimes a corrupt file and sometimes a 404 HTML page), and report to the user that the source appears to be down, that they should try again, and that if it still fails later they should let us know.
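
    A sketch of that behavior (illustrative only, not the retriever's actual download code):

    import os
    import requests

    def safe_download(url, dest_path):
        """Download url to dest_path, caching the file only on success."""
        try:
            response = requests.get(url, stream=True, timeout=30)
            response.raise_for_status()  # turn 404/500 responses into exceptions
        except requests.RequestException as exc:
            raise RuntimeError(
                f"{url} appears to be down ({exc}); "
                "try again, and if it still fails later let us know."
            ) from exc
        tmp_path = dest_path + '.part'  # write to a temp file; never cache partials
        with open(tmp_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=1 << 20):
                f.write(chunk)
        os.replace(tmp_path, dest_path)  # atomic rename: cache only complete files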

    opened by ethanwhite 26
  • Updated internal variable names to match that of datapackage

    Updated internal variable names to match that of datapackage

    Updated internal variable names to match the datapackage spec (#765). The following variable names were changed:

    tags -> keywords
    nulls -> missingValues
    name -> title
    shortname -> name
    
    The changes were done in the following files -
    
    retriever/lib/compile.py
    retriever/lib/datapackage.py
    retriever/lib/engine.py
    retriever/lib/parse_script_to_json.py
    retriever/lib/templates.py
    retriever/lib/tools.py
    scripts/bioclim.py
    scripts/biomass_allometry_db.py
    scripts/breed_bird_survey.py
    scripts/breed_bird_survey_50stop.py
    scripts/forest_inventory_analysis.py
    scripts/gentry_forest_transects.py
    scripts/npn.py
    scripts/plant_life_hist_eu.py
    scripts/prism_climate.py
    scripts/vertnet.py
    scripts/wood_density.py
    scripts/*.json (almost all datapackages): transition missingValues -> missing_values
    test/test_retriever.py
    retriever/__main__.py
    
    Changes Requested 
    opened by henrykironde 25
  • Add fetch to python Interface

    Add fetch to python Interface

    Hi @henrykironde. Sorry, I was off schedule these last few days, so I couldn't work on this issue as I told you. This should solve #1019, but is this the right place for the method?

    Changes Requested 
    opened by adhaamehab 23
  • hacktoberfest guide

    hacktoberfest guide

    For contributors who want to take part in Hacktoberfest, please check the issue lists from the various projects:

    Retriever: https://github.com/weecology/retriever/issues
    Retriever-recipes: https://github.com/weecology/retriever-recipes/issues
    Rdataretriever: https://github.com/ropensci/rdataretriever/issues
    Retriever.jl: https://github.com/weecology/Retriever.jl/issues

    opened by henrykironde 0
  • Downloading fails for files with no Content-Disposition

    Downloading fails for files with no Content-Disposition

    Example packages:
    1: Package file: https://github.com/weecology/retriever-recipes/blob/main/scripts/usda_agriculture_plants_database.py Sample url: https://plants.sc.egov.usda.gov/csvdownload?plantLst=plantCompleteList

    2: Package file: https://github.com/weecology/retriever-recipes/blob/main/scripts/aquatic_animal_excretion.py Sample url: https://esajournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fecy.1792&file=ecy1792-sup-0001-DataS1.zip

    opened by henrykironde 1
  •  display_all_rdatasets_names in rdatasets takes a list of package_name

    display_all_rdatasets_names in rdatasets takes a list of package_name

    display_all_rdatasets_names takes list of package_name insted of taking a string of package_name as a parameter

    >>> display_all_rdataset_names("aer")
    List of all available Rdatasets in packages: aer
    No package named 'a' found in Rdatasets
    No package named 'e' found in Rdatasets
    No package named 'r' found in Rdatasets
    
    >>> display_all_rdataset_names(["aer"])
    List of all available Rdatasets in packages: ['aer']
    Package: aer              Dataset: affairs                   Script Name: rdataset-aer-affairs
    Package: aer              Dataset: argentinacpi              Script Name: rdataset-aer-argentinacpi
    Package: aer              Dataset: bankwages                 Script Name: rdataset-aer-bankwages
    Package: aer              Dataset: benderlyzwick             Script Name: rdataset-aer-benderlyzwick
    Package: aer              Dataset: bondyield                 Script Name: rdataset-aer-bondyield
    Package: aer              Dataset: cartelstability           Script Name: rdataset-aer-cartelstability
    Package: aer              Dataset: caschools                 Script Name: rdataset-aer-caschools
    Package: aer              Dataset: chinaincome               Script Name: rdataset-aer-chinaincome
    Package: aer              Dataset: cigarettesb               Script Name: rdataset-aer-cigarettesb
    Package: aer              Dataset: cigarettessw              Script Name: rdataset-aer-cigarettessw
    Package: aer              Dataset: collegedistance           Script Name: rdataset-aer-collegedistance
    Package: aer              Dataset: consumergood              Script Name: rdataset-aer-consumergood
    Package: aer              Dataset: cps1985                   Script Name: rdataset-aer-cps1985
    Package: aer              Dataset: cps1988                   Script Name: rdataset-aer-cps1988
    ....
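
    A common fix for this kind of bug is to normalize the argument before iterating (a hedged sketch, not the project's actual code):

    def display_all_rdataset_names(package_name=None):
        # Wrapping a bare string in a list prevents iterating over its characters.
        if isinstance(package_name, str):
            package_name = [package_name]
        for package in package_name or []:
            print(f"Listing Rdatasets in package: {package}")
            # ... look up and print the datasets for this package ...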
    opened by Nageshbansal 1
  • not able to use gdal==3.3.2 while working with ".shp" files

    not able to use gdal==3.3.2 while working with ".shp" files

    NOTES

    Expected behavior and actual behavior.

    While I have GDAL 3.3.2 installed, if I try to import ogr in a script dealing with ".shp" files, the import fails; but if I downgrade GDAL to 3.0.2, I am able to import ogr and the script runs successfully.

    (Screenshots: the ogr import failing under GDAL 3.3.2 and working under GDAL 3.0.2.)

    Operating system

    Ubuntu 20.04

    GDAL version and provenance

    GDAL 3.3.2 version from ubuntugis-unstable PPA

    opened by Nageshbansal 0
  • Make sure that the R API datasets are run on the retrieverdash

    Make sure that the R API datasets are run on the retrieverdash

    We have added some APIs to the retriever. Some of them, like Tidycensus, can be run and tested on the retriever dashboard.

    You can clone the retrieverdash project and test locally using the developer docs for the dashboard: https://retrieverdash.readthedocs.io/developer.html#setting-up-locally.

    When working locally, you will first need to have the APIs working well in the retriever. Use the DEV LIST in the retriever dashboard to test only the required scripts.

    opened by henrykironde 0
Releases (v3.1.0)
  • v3.1.0 (Apr 26, 2022)

    v3.1.0

    Major changes

    • Remove Travis and use GitHub Actions
    • Improve autocreate script template creation tool
    • Update server setup docs
    • Change default branch from master to main
    • Update Kaggle API function
    • Add Anaconda badges
    • Update BBS (Breeding Bird Survey)
    • Add HDF5 to CSV files conversion test
    • Add HDF5 engine
    • XML to CSV conversion test
    • JSON to CSV function with tests
    • SQLite to CSV files conversion test
    • GeoJSON to CSV conversion test script
    • Added tidycensus dataset
    • Improve Dockerfile and automate Docker push to the registry
    • Add support for clipping images
    • Add Socrata API
    • Added RDatasets API
    • Add auto publish to TestPyPI and PyPI

    Source code(tar.gz)
    Source code(zip)
  • v3.0.0 (Jul 16, 2020)

    v3.0.0

    Major changes

    • Add provenance support to the Data Retriever
    • Use utf-8 as default
    • Move scripts from retriever to the retriever-recipes repository
    • Adopt Google code style, add linters, and use yapf
    • Test linters
    • Extend CSV field size limit
    • Improve output when a connection is not made
    • Add version to the interface
    • Prompt the user if a newer version of a script is available
    • Add all the recipes datasets
    • Add test for installation of committed datasets
    • Add function to commit a dataset

    Minor changes

    Improve "argcomplete-command" Add NUMFOCUS logo in README

    Source code(tar.gz)
    Source code(zip)
  • v2.4.0 (Jun 10, 2019)

  • v2.3.0 (May 1, 2019)

    Retriever v2.3.0

    Major changes

    • Change psycopg2 to psycopg2-binary
    • Add spatial data testing on Docker
    • Add option for pretty JSON
    • Keep order of fetched tables and order of processing resources
    • Add reset for a specific dataset and script function
    • Use tqdm 4.30.0
    • Install data into a custom directory using the data_dir option
    • Download data into a custom directory using sub_dir

    Minor changes

    • Add tests for reset script
    • Add smaller samples of GIS data for testing
    • Reactivate MySQL tests on Travis
    • Allow custom arguments for psql
    • Add docs and examples for PostGIS support
    • Change testdb name to testdb_retriever
    • Improve PyPI retriever description
    • Update documentation for passwordless setup of Postgres on Windows
    • Set up infrastructure for automating script creation

    New datasets

    • USA ecoregions, ecoregions-us
    • LTREB Prairie-forest ecotone of eastern Kansas / Foster Lab dataset
    • Sonoran Desert, sonoran-desert
    • Acton Lake dataset, acton-lake

    Dataset changes

    • MammalSuperTree.py to mammal_super_tree.py
    • lakecats_finaltables.json to lakecats_final_tables
    • harvard_forests.json to harvard_forest.json
    • macroalgal_communities to macroalgal-communities

    Source code(tar.gz)
    Source code(zip)
    mac.zip(77.86 MB)
    python3-retriever_2.3.0-1_all.deb(43.29 KB)
    RetrieverSetup.exe(22.84 MB)
  • v.2.2.0 (Nov 6, 2018)

    Major changes

    • Use the requests package to fetch data
    • Add PostGIS spatial support for Postgres
    • Update ls to include more details about the scripts
    • Update license lookup for datasets
    • Update keywords lookup for datasets
    • Use tqdm for all progress tracking
    • Changed all "-" in JSON files to "_"

    Minor changes

    • Documentation refinement
    • Connect to MySQL using preferred encoding
    • License search and keyword search added
    • conda-forge docs
    • Add Zenodo badge to link to archive
    • Add test for extracting data

    New datasets

    • Add NOAA Fisheries trade, noaa-fisheries-trade
    • Add Fishery Statistical Collections data, fao-global-capture-product
    • Add bupa liver disorders dataset, bupa-liver-disorders
    • Add GLOBI interactions data, globi-interaction
    • Add the National Aquatic Resource Surveys (NARS), nla
    • Add partners in flight dataset, partners-in-flight
    • Add the ND-GAIN Country Index, nd-gain
    • Add world GDP in current US Dollars, dgp
    • Add airports dataset, airports
    • Repair aquatic animal excretion
    • Add Biotime dataset
    • Add lakecats final tables dataset, lakecats-final-tables
    • Add harvard forests data, harvard forests
    • Add USGS elevation data, usgs-elevation

    Source code(tar.gz)
    Source code(zip)
    python-retriever_2.2.0-1_all.deb(38.28 KB)
    retriever-2.2.0.tar.gz(55.76 KB)
    retriever.app.zip(65.22 MB)
    RetrieverSetup.exe(28.16 MB)
  • v2.1.0 (Oct 27, 2017)

    v2.1.0

    Major changes

    • Add Python interface
    • Add Retriever to conda
    • Auto complete of Retriever commands on Unix systems

    Minor changes

    • Add license to datasets
    • Change the structure of raw data from string to list
    • Add testing on any modified dataset
    • Improve memory usage in cross-tab processing
    • Add capability for datasets to use custom encoding
    • Use new Python interface for regression testing
    • Use Frictionless Data specification terminology for internals

    New datasets

    • Add ant dataset and weather data to the portal dataset
    • NYC TreesCount
    • PREDICTS
    • aquatic_animal_excretion
    • biodiversity_response
    • bird_migration_data
    • chytr_disease_distr
    • croche_vegetation_data
    • dicerandra_frutescens
    • flensburg_food_web
    • great_basin_mammal_abundance
    • macroalgal_communities
    • macrocystis_variation
    • marine_recruitment_data
    • mediter_basin_plant_traits
    • nematode_traits
    • ngreatplains-flowering-dates
    • portal-dev
    • portal
    • predator_prey_body_ratio
    • predicts
    • socean_diet_data
    • species_exctinction_rates
    • streamflow_conditions
    • tree_canopy_geometries
    • turtle_offspring_nesting
    • Add vertnet individual datasets vertnet_amphibians vertnet_birds vertnet_fishes vertnet_mammals vertnet_reptiles
    Source code(tar.gz)
    Source code(zip)
    retriever.app.zip(10.16 MB)
    RetrieverSetup.exe(11.64 MB)
    retriever_2.1.0.deb(33.99 KB)
  • v2.0.0 (Feb 24, 2017)

    v2.0.0

    Major changes

    • Add Python 3 support, python 2/3 compatibility
    • Add json and xml as output formats
    • Switch to using the Frictionless Data datapackage JSON standard. This is a backwards-incompatible change, as the form of the dataset description files the retriever uses to describe the location and processing of simple datasets has changed.
    • Add CLI for creating, editing, deleting datapackage.json scripts
    • Broaden scope to include non-ecological data and rename to Data Retriever
    • Major expansion of documentation and move documentation to Read the Docs
    • Add developer documentation
    • Remove the GUI
    • Use csv module for reading of raw data to improve handling of newlines in fields
    • Major expansion of integration testing
    • Refactor regression testing to produce a single hash for a dataset regardless of output format
    • Add continuous integration testing for Windows

    Minor changes

    • Use pyinstaller for creating the exe for Windows and the app for Mac, and remove py2app
    • Use 3 level semantic versioning for both scripts and core code
    • Rename datasets with more descriptive names
    • Add a retriever minimum version for each dataset
    • Rename dataset description files to follow python modules conventions
    • Switch to py.test from nose
    • Expand unit testing
    • Add version requirements for sqlite and postgresql
    • Default to latin encoding
    • Improve UI for updating user on downloading and processing progress

    New datasets

    • Added machine learning datasets from the UC Irvine Machine Learning Repository
    Source code(tar.gz)
    Source code(zip)
    python3-retriever_2.0.0-1_all.deb(33.13 KB)
    retriever-OSX.zip(10.41 MB)
    RetrieverSetup.exe(11.16 MB)
  • v1.8.3 (Feb 12, 2016)

    v1.8.3

    • Fixed regression in GUI

    v1.8.2

    • Improved cleaning of column names
    • Fixed thread bug causing Gentry dataset to hang when installed via GUI
    • Removed support for 32-bit only Macs in binaries
    • Removed unused code

    v1.8.0

    • Added scripts for 21 new datasets: leaf herbivory, biomass allocation, community dynamics of shortgrass steppe plants, mammal and bird foraging attributes, tree demography in India, small mammal community dynamics in Chile, community dynamics of Sonoran Desert perennials, biovolumes of freshwater phytoplankton, plant dynamics in Montana, Antarctic Site Inventory breeding bird survey, community abundance data compiled from the literature, spatio-temporal population data for butterflies, fish parasite host ecological characteristics, eBird, Global Wood Density Database, multiscale community data on vascular plants in North Carolina, vertebrate home range sizes, PRISM climate data, Amniote life history database, woody plant Biomass And Allometry Database, and Vertnet data on amphibians, birds, fishes, mammals, and reptiles
    • Added reset command to allow resetting database configuration settings, scripts, and cached raw data
    • Added Dockerfile for building docker containers of each version of the software for reproducibility
    • Added support for wxPython 3.0
    • Added support for tar and gz archives
    • Added support for archive files whose contents don't fit in memory
    • Added checks for and use of system proxies
    • Added ability to download archives from web services
    • Added tests for regressions in download engine
    • Added citation command to provide information on citing datasets
    • Improved column name cleanup
    • Improved whitespace consistency
    • Improved handling of Excel files
    • Improved function documentation
    • Improved unit testing and added coverage analysis
    • Improved the sample script by adding a url field
    • Improved script loading behavior by only loading a script the first time it is discovered
    • Improved operating system identification
    • Improved download engine by adding the ability to maintain archive and subdirectory structure (particularly relevant for spatial data)
    • Improved cross-platform directory and line ending handling
    • Improved testing across platforms
    • Improved checking for updated scripts so that scripts are only downloaded if the current version isn't available
    • Improved metadata in setup.py
    • Fixed type issues in Portal dataset
    • Fixed GUI always downloading scripts instead of checking if it needed to
    • Fixed bug that sometimes resulted in .retriever directories not belonging to the user who did the installation
    • Fixed issues with downloading files to specific paths
    • Fixed BBS50 script to match newer structure of the data
    • Fixed bug where csv files were not being closed after installation
    • Fixed errors when closing the GUI
    • Fixed issue where enclosing quotes in csv files were not being respected during cross-tab restructuring
    • Fixed bug causing v1.6 to break when newer scripts were added to version.txt
    • Fixed Bioclim script to include hdr files
    • Fixed missing icon images on Windows
    • Removed unused code
    Source code(tar.gz)
    Source code(zip)
    python-retriever_1.8.3-1_all.deb(96.11 KB)
    retriever.zip(29.07 MB)
    RetrieverSetup.exe(8.32 MB)
  • v1.8.2 (Feb 12, 2016)

    This is the v1.8.2 release of the EcoData Retriever.

    v1.8.2

    • Improved cleaning of column names
    • Fixed thread bug causing Gentry dataset to hang when installed via GUI
    • Removed support for 32-bit only Macs in binaries
    • Removed unused code

    v1.8.0

    • Added scripts for 21 new datasets: leaf herbivory, biomass allocation, community dynamics of shortgrass steppe plants, mammal and bird foraging attributes, tree demography in India, small mammal community dynamics in Chile, community dynamics of Sonoran Desert perennials, biovolumes of freshwater phytoplankton, plant dynamics in Montana, Antarctic Site Inventory breeding bird survey, community abundance data compiled from the literature, spatio-temporal population data for butterflies, fish parasite host ecological characteristics, eBird, Global Wood Density Database, multiscale community data on vascular plants in North Carolina, vertebrate home range sizes, PRISM climate data, Amniote life history database, woody plant Biomass And Allometry Database, and Vertnet data on amphibians, birds, fishes, mammals, and reptiles
    • Added reset command to allow resetting database configuration settings, scripts, and cached raw data
    • Added Dockerfile for building docker containers of each version of the software for reproducibility
    • Added support for wxPython 3.0
    • Added support for tar and gz archives
    • Added support for archive files whose contents don't fit in memory
    • Added checks for and use of system proxies
    • Added ability to download archives from web services
    • Added tests for regressions in download engine
    • Added citation command to provide information on citing datasets
    • Improved column name cleanup
    • Improved whitespace consistency
    • Improved handling of Excel files
    • Improved function documentation
    • Improved unit testing and added coverage analysis
    • Improved the sample script by adding a url field
    • Improved script loading behavior by only loading a script the first time it is discovered
    • Improved operating system identification
    • Improved download engine by adding the ability to maintain archive and subdirectory structure (particularly relevant for spatial data)
    • Improved cross-platform directory and line ending handling
    • Improved testing across platforms
    • Improved checking for updated scripts so that scripts are only downloaded if the current version isn't available
    • Improved metadata in setup.py
    • Fixed type issues in Portal dataset
    • Fixed GUI always downloading scripts instead of checking if it needed to
    • Fixed bug that sometimes resulted in .retriever directories not belonging to the user who did the installation
    • Fixed issues with downloading files to specific paths
    • Fixed BBS50 script to match newer structure of the data
    • Fixed bug where csv files were not being closed after installation
    • Fixed errors when closing the GUI
    • Fixed issue where enclosing quotes in csv files were not being respected during cross-tab restructuring
    • Fixed bug causing v1.6 to break when newer scripts were added to version.txt
    • Fixed Bioclim script to include hdr files
    • Fixed missing icon images on Windows
    • Removed unused code
    Source code(tar.gz)
    Source code(zip)
    python-retriever_1.8.2-1_all.deb(96.08 KB)
    retriever.zip(29.07 MB)
    RetrieverSetup.exe(8.32 MB)
  • v1.7.0 (Oct 5, 2014)

    This is the v1.7.0 release of the EcoData Retriever.

    • Added ability to download files directly for non-tabular data
    • Added scripts to download Bioclim and Mammal Supertree data
    • Added a script for the MammalDIET database
    • Fixed bug where some nationally standardized FIA surveys were not included
    • Added check for wxpython on installation to allow non-gui installs
    • Fixed several minor issues with the Gentry script, including a missing site and a misnamed column in one file
    • Windows install now adds the retriever to the path to facilitate command line use
    • Fixed a bug preventing installation from PyPI
    • Added icons to installers
    • Fixed the retriever failing when given a script it couldn't handle
    Source code(tar.gz)
    Source code(zip)
    python-retriever_1.7.0-1_all.deb(96.21 KB)
    retriever-app.zip(17.61 MB)
    RetrieverSetup.exe(6.73 MB)
  • v1.6.0 (Feb 11, 2014)
