Quickly download, clean up, and install public datasets into a database management system

Weecology

Last update: Jan 4, 2023

Related tags

Documentation python data-science data dataset datasets data-retrieval

Overview

Finding data is one thing. Getting it ready for analysis is another. Acquiring, cleaning, standardizing, and importing publicly available data is time-consuming because many datasets lack machine-readable metadata and do not conform to established data structures and formats. The Data Retriever automates the first steps in the data analysis pipeline by downloading, cleaning, and standardizing datasets, and importing them into relational databases, flat files, or programming languages. Automating this process cuts the time a user needs to get most large datasets up and running by hours, and in some cases days.

Installing the Current Release

If you have Python installed you can install the current release using either pip:

pip install retriever

or conda after adding the conda-forge channel (conda config --add channels conda-forge):

conda install retriever

Depending on your system configuration this may require sudo for pip:

sudo pip install retriever

Precompiled binary installers are also available for Windows, OS X, and Ubuntu/Debian on the releases page. These do not require a Python installation.

List of Available Datasets

Installing From Source

To install the Data Retriever from source, you'll need Python 3.6.8+ with the following packages installed:

xlrd

The following packages are optionally needed to interact with associated database management systems:

PyMySQL (for MySQL)
sqlite3 (for SQLite)
psycopg2-binary (for PostgreSQL), previously psycopg2.
pyodbc (for MS Access - this option is only available on Windows)
Microsoft Access Driver (ODBC for Windows)

To install from source

Either use pip to install directly from GitHub:

pip install git+https://git@github.com/weecology/retriever.git

or:

Clone the repository
From the directory containing setup.py, run pip install . (with a trailing dot). Depending on your system, you may need to prefix the command with sudo: sudo pip install .

More extensive documentation for those interested in developing can be found in the developer documentation.

Using the Command Line

After installing, run retriever update to download all of the available dataset scripts. To see the full list of command line options and datasets run retriever --help. The output will look like this:

usage: retriever [-h] [-v] [-q]
                 {download,install,defaults,update,new,new_json,edit_json,delete_json,ls,citation,reset,help}
                 ...

positional arguments:
  {download,install,defaults,update,new,new_json,edit_json,delete_json,ls,citation,reset,help}
                        sub-command help
    download            download raw data files for a dataset
    install             download and install dataset
    defaults            displays default options
    update              download updated versions of scripts
    new                 create a new sample retriever script
    new_json            CLI to create retriever datapackage.json script
    edit_json           CLI to edit retriever datapackage.json script
    delete_json         CLI to remove retriever datapackage.json script
    ls                  display a list all available dataset scripts
    citation            view citation
    reset               reset retriever: removes configuration settings,
                        scripts, and cached data
    help

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -q, --quiet           suppress command-line output

To install datasets, use retriever install:

usage: retriever install [-h] [--compile] [--debug]
                         {mysql,postgres,sqlite,msaccess,csv,json,xml} ...

positional arguments:
  {mysql,postgres,sqlite,msaccess,csv,json,xml}
                        engine-specific help
    mysql               MySQL
    postgres            PostgreSQL
    sqlite              SQLite
    msaccess            Microsoft Access
    csv                 CSV
    json                JSON
    xml                 XML

optional arguments:
  -h, --help            show this help message and exit
  --compile             force re-compile of script before downloading
  --debug               run in debug mode

Examples

These examples are using the Iris flower dataset. More examples can be found in the Data Retriever documentation.

Using Install

retriever install -h   (gives install options)

Using a specific database engine, retriever install {Engine}

retriever install mysql -h     (gives install mysql options)
retriever install mysql --user myuser --password ******** --host localhost --port 8888 --database_name testdbase iris

To install data into an SQLite database named iris.db, use:

retriever install sqlite iris -f iris.db
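
Once the install finishes, the resulting file can be inspected with Python's standard sqlite3 module. The sketch below only lists the tables found in a database file; it makes no assumption about how the Data Retriever names the iris table, and list_tables is a hypothetical helper, not part of the retriever package.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at db_path."""
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        # Always release the file handle, even if the query fails.
        connection.close()
    return [name for (name,) in rows]
```

Calling list_tables("iris.db") from the directory where the command above was run should show which tables were created.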

Using download

retriever download -h    (gives you help options)
retriever download iris
retriever download iris --path C:\Users\Documents

Using citation

retriever citation   (citation of the retriever engine)
retriever citation iris  (citation for the iris data)

Spatial Dataset Installation

Set up Spatial support

To set up spatial support for PostgreSQL using PostGIS, please refer to the spatial set-up docs.

retriever install postgres harvard-forest # Vector data
retriever install postgres bioclim # Raster data
# Install only the data of USGS elevation in the given extent
retriever install postgres usgs-elevation -b -94.98704597353938 39.027001800158615 -94.3599408119917 40.69577051867074

Website

For more information see the Data Retriever website.

Acknowledgments

Development of this software was funded by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4563 to Ethan White and the National Science Foundation as part of a CAREER award to Ethan White.

Comments

[WIP] Allow consuming JSON data
NOTE: I closed this PR by mistake. I'll re-open this. This pull request addresses this [issue](Allow consuming JSON data #1334). Currently, we support 2 kinds of JSON datasets:

Where the dataset's rows are present under a certain key of the JSON. For example, refer to this example; here the certain_key is data.

Where the dataset's rows are present under a certain key in different parts of the JSON. For example, refer to this example; here the certain key is laureates.

[WIP] Where the JSON is in the form of a list (example). The current implementation for this is commented out, since it's getting stuck in a recursion loop.
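
The first two cases can be sketched with one small recursive helper. This is not the code from the pull request: collect_rows is a hypothetical name, and the payloads in the usage note are made up to mirror the data and laureates keys mentioned above. Recursion always descends into a strictly smaller part of the parsed JSON, so it terminates.

```python
def collect_rows(node, key):
    """Gather every record stored in a list under `key`, at any nesting depth."""
    rows = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key and isinstance(value, list):
                # Found the rows: take them as-is and do not descend further.
                rows.extend(value)
            else:
                rows.extend(collect_rows(value, key))
    elif isinstance(node, list):
        for item in node:
            rows.extend(collect_rows(item, key))
    return rows
```

For a payload such as {"data": [{"id": 1}, {"id": 2}]}, collect_rows(payload, "data") returns the two records. For a payload where each entry of a top-level list carries its own laureates list, collect_rows(payload, "laureates") concatenates those lists in document order.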
opened by DumbMachine 37

Updated internal variable names to match that of datapackage #860

Updated internal variable names to match those of the datapackage spec (#765). The following variable names were changed:

tags -> keywords
nulls -> missingValues
name -> title
shortname -> name
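
The four renames above interact: name becomes title while shortname becomes name, so renaming keys one at a time in place could overwrite a value. A minimal sketch of a safe rename, assuming the metadata is held in a plain dict (RENAMES and rename_keys are hypothetical names, not retriever internals):

```python
# Old field name -> new field name, as listed above.
RENAMES = {
    "tags": "keywords",
    "nulls": "missingValues",
    "name": "title",
    "shortname": "name",
}


def rename_keys(script_metadata):
    """Return a new dict with old field names replaced by the new ones."""
    # Building a fresh dict from the original avoids "shortname -> name"
    # clobbering the value that "name -> title" still needs to read.
    return {RENAMES.get(key, key): value for key, value in script_metadata.items()}
```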

The changes were done in the following files -

retriever/lib/compile.py
retriever/lib/datapackage.py
retriever/lib/engine.py
retriever/lib/parse_script_to_json.py
retriever/lib/templates.py
retriever/lib/tools.py
scripts/bioclim.py
scripts/biomass_allometry_db.py
scripts/breed_bird_survey.py
scripts/breed_bird_survey_50stop.py
scripts/forest_inventory_analysis.py
scripts/gentry_forest_transects.py
scripts/npn.py
scripts/plant_life_hist_eu.py
scripts/prism_climate.py
scripts/vertnet.py
scripts/wood_density.py
scripts/*.json (almost all datapackages): transition missingValues -> missing_values
test/test_retriever.py
retriever/__main__.py

@henrykironde I have made the changes after updating it with the master branch

Under Review and Tests

opened by jainamritanshu 37

Working on : Expansion of Spatial Data Support to the Data Retriever

I have started looking into and working on the project: Expansion of Spatial Data Support to the Data Retriever.

@ethanwhite @henrykironde Please let me know of any aspects in particular I should be prioritising over others.

Would it be okay if I carried on the discussion through this Issue?

opened by ss-is-master-chief 31
Test OSX .app file
I'm in the process of trying to get the EcoData Retriever fully functional on OSX (since so many of the awesome ecological informaticsy people I know use Macs). As I've mentioned elsewhere it looks like building from source now works, at least when using homebrew (http://ecodataretriever.org/getting_started.html).

What I'm working on now is getting the .app working so that you don't need to be comfortable in the shell (and have XCode installed) to use the Retriever. I have a version that is working on the machine that I built it on (our only Mac) and was wondering if some kind Mac folks like @karthik, @sckott, @emhart, @sarahsupp and @dfalster might have a few minutes to give it a trial run.

The file is here: https://www.dropbox.com/s/26b1pj91mqucc0l/retriever.zip

Basically I'm just looking for folks to unzip it, double click on it, and see if:

It opens at all.

You can install things successfully when setting the database management system to CSV and sqlite (these don't have any external dependencies).

If you have either MySQL or PostgreSQL installed, whether it works with them. (MySQL is a bit fragile at the moment. It is currently working for most datasets, but not all, so just try a few if you get errors).

Report back.

Thanks in advance. And, yes, I wrote this issue... on a Mac.
opened by ethanwhite 31
Allow consuming JSON data

Currently we only support ingesting delimited tabular data. It is increasingly common for tabular style data to be distributed in JSON files and it would be nice to also be able to consume this. We would probably just convert it to CSV as a starting point and then process it using our standard pipeline.

There are a few not particularly active packages for doing this, but the code to do it is simple enough that, since none of the packages seem to be widely adopted, we might be better off just writing and maintaining it ourselves.
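
As a rough illustration of how little code the simplest case needs, the sketch below converts a JSON array of flat objects into CSV text using only the standard library. It is an assumption-laden example, not the Retriever's implementation: json_records_to_csv is a hypothetical name, and nested values are not handled.

```python
import csv
import io
import json


def json_records_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)
    # Use the union of keys, in first-seen order, as the header row.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    # Keys missing from a record are written as empty fields.
    writer.writerows(records)
    return buffer.getvalue()
```

The resulting CSV text could then be handed to the standard delimited-data pipeline described above.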

(no rush on this, just a thought while looking at a cool dataset that's only available in JSON: http://data.unpaywall.org/products/snapshot)
Feature Request

opened by ethanwhite 30
Update internals in reference to issue #765
@ethanwhite @henrykironde I have made the following changes

name -> title

shortname -> name

tags -> keywords

nulls -> missingValues

I wanted to ask about changing missingValues to missing_values, as the original is in camel case and does not follow the PEP 8 naming convention; if you allow, I will change it. I am still finding internal names like this that could be updated. I think the code is clean. Kindly go through it once, and if there are any suggestions I will start working on them.
Changes Requested
opened by jainamritanshu 30
Removed default encoding in reference to #716

Sir I have removed the hard coded assignments for encoding and added a field of encoding. I haven't edited the existing scripts for the existing data. Should I do them manually? Kindly review the code and tell me any changes needed for the code. @ethanwhite @henrykironde

opened by jainamritanshu 29

an eBird Basic Dataset workflow

Hey all,

I've mostly gotten the eBird data into a PostgreSQL/PostGIS database, and I thought I'd share my code with you in case you wanted to integrate it into something more robust with EcoDataRetriever. If you know how to optimize it better, I'd love to hear what you come up with.

If you do decide to include it, please acknowledge Matt Jones and Jim Regetz, since they helped me through this.

Let me know if you have any questions!

Dave

PS the "world" data set unzips to be 50 gigabytes, so you'll probably want to work with something smaller...

-- Data file available via http://ebird.org/ebird/data/download

-- commands to extract the text file from the tarball:
   -- tar xvf ebd_relMay-2013.tar
   -- gunzip ebd_relMay-2013.txt.gz
-- WARNING: The resulting file is almost 50 gigabytes!

-- In retrospect, there's probably some premature optimization for some of these columns: if the data set changes,
-- it might be safer to use longer varchar arguments.
CREATE TABLE eBird (
  GLOBAL_UNIQUE_IDENTIFIER     char(50),      -- always 45-47 characters needed (so far)
  TAXONOMIC_ORDER              numeric,       -- Probably not needed
  CATEGORY                     varchar(20),   -- Probably 10 would be safe
  COMMON_NAME                  varchar(70),   -- Some hybrids have really long names
  SCIENTIFIC_NAME              varchar(70),   --  ''
  SUBSPECIES_COMMON_NAME       varchar(70),   --  ''
  SUBSPECIES_SCIENTIFIC_NAME   varchar(70),   --  ''
  OBSERVATION_COUNT            varchar(8),    -- Someone saw 1.3 million Auklets.
                                              -- Unfortunately, it can't be an integer 
                                              -- because some are just presence/absence
  BREEDING_BIRD_ATLAS_CODE     char(2),       -- need to confirm that these are always length 2
  AGE_SEX                      text,          -- Potentially long, but almost always blank
  COUNTRY                      varchar(50),   -- long enough for "Saint Helena, Ascension and Tristan da Cunha"
  COUNTRY_CODE                 char(2),       -- alpha-2 codes
  STATE_PROVINCE               varchar(50),   -- no idea if this is long enough? U.S. Virgin Islands may be almost 30
  SUBNATIONAL1_CODE            char(10),      -- looks standardized at 5 characters?
  COUNTY                       varchar(50),   -- who knows how long it could be
  SUBNATIONAL2_CODE            char(12),      -- looks standardized at 9 characters?
  IBA_CODE                     char(16),
  LOCALITY                     text,          -- unstructured/potentially long
  LOCALITY_ID                  char(10),      -- maximum observed so far is 8
  LOCALITY_TYPE                char(2),       -- short codes
  LATITUDE                     real,          -- Is this the appropriate level of precision?
  LONGITUDE                    real,          --    ''
  OBSERVATION_DATE             date,          -- Do I need to specify YMD somehow?
  TIME_OBSERVATIONS_STARTED    time,          -- How do I make this a time?
  TRIP_COMMENTS                text,          -- Comments are long, unstructured, 
  SPECIES_COMMENTS             text,          --    and inconsistent, but sometimes interesting
  OBSERVER_ID                  char(12),      -- max of 9 in the data I've seen so far
  FIRST_NAME                   text,          -- Already have observer IDs
  LAST_NAME                    text,          -- ''
  SAMPLING_EVENT_IDENTIFIER    char(12),      -- Probably want to index on this.
  PROTOCOL_TYPE                varchar(50),   -- Needs to be at least 30 for sure.
  PROJECT_CODE                 varchar(20),   -- Needs to be at least 10 for sure.
  DURATION_MINUTES             int,           -- bigint?
  EFFORT_DISTANCE_KM           real,          -- precision?
  EFFORT_AREA_HA               real,          -- precision?
  NUMBER_OBSERVERS             int,           -- just a small int
  ALL_SPECIES_REPORTED         int,           -- Seems to always be 1 or 0.  Maybe I could make this Boolean?
  GROUP_IDENTIFIER             varchar(10),   -- Appears to be max of 7 or 8
  APPROVED                     int,           -- Can be Boolean?
  REVIEWED                     int,           -- Can be Boolean?
  REASON                       char(17),      -- May need to be longer if data set includes unvetted data
  X                            text           -- Blank
);


COPY eBird
  FROM '/home/dharris/eBird/ebd_relMay-2013.txt' 
  HEADER
  CSV
  QUOTE E'\5'       -- The file has unbalanced quotes. Using an obscure character as a quote mark instead.
  DELIMITER E'\t';


-- Note: it's probably slightly faster to load postgis and add a geographic column first (see below).
-- I'm keeping the original ordering in this document for accuracy's sake.
CREATE INDEX ON eBird (sampling_event_identifier);

-- Example query: SELECT SCIENTIFIC_NAME FROM eBird WHERE SAMPLING_EVENT_IDENTIFIER = 'S9605852';
-- Example query: SELECT count(SCIENTIFIC_NAME) FROM eBird WHERE SAMPLING_EVENT_IDENTIFIER = 'S9605852';


CREATE EXTENSION postgis;
ALTER TABLE eBird ADD COLUMN geog geography(POINT,4326); -- I hope 4326 is correct...
UPDATE eBird SET geog = ST_GeogFromText('POINT(' || longitude || ' ' ||  latitude || ')');
CREATE INDEX geog_index ON eBird USING GIST (geog); 

-- Example query: find all the species within 1000 meters of my dorm:
-- SELECT SCIENTIFIC_NAME FROM eBird WHERE ST_DWithin(geog, ST_GeographyFromText('SRID=4326;POINT(-119.6972 34.4208)'), 1000);

-- Slightly fancier version:
-- SELECT DISTINCT SCIENTIFIC_NAME, COMMON_NAME FROM eBird 
--   WHERE ST_DWithin(geog, ST_GeographyFromText('SRID=4326;POINT(-119.855385 34.417239)'), 1000) 
--   ORDER BY SCIENTIFIC_NAME;

(Edited to add some amazing PostGIS queries and some better comments, etc.)

PS: After poking around a bit more, it looks like I should have used doubles rather than reals to store lat/lon. I had misread the documentation about how much precision was used for reals.
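
The precision point can be checked without a database. PostgreSQL's real is a 4-byte float and double precision is an 8-byte float, the same width as a Python float, so the sketch below round-trips the longitude from the example query through 4-byte storage and measures what is lost. The 111 km per degree figure is a rough equatorial upper bound used only for scale.

```python
import struct


def as_float32(value):
    """Round-trip a Python float through 4-byte IEEE 754 storage."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


longitude = -119.855385  # longitude from the example query above
stored = as_float32(longitude)
error_degrees = abs(stored - longitude)
# One degree of longitude spans at most about 111 km (at the equator).
error_meters_upper_bound = error_degrees * 111_000
print(f"stored as real: {stored!r}")
print(f"error: {error_degrees:.2e} degrees (up to ~{error_meters_upper_bound:.2f} m)")
```

For longitudes of this magnitude, 4-byte storage cannot resolve steps much finer than about 1e-5 degrees, which is why doubles are the safer choice for coordinates.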

opened by davharris 28

Gracefully handle failed downloads

It is not uncommon for a data source to go down (e.g., #902) or for a download to fail for some reason (e.g., #863). We should catch these failures, avoid caching whatever came down (sometimes a corrupt file, sometimes a 404 HTML page), and tell the user that the source appears to be down, that they should try again later, and that they should let us know if it still fails.
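
One way to decide whether a response is safe to cache is a small check on the status code, content type, and body. This is a heuristic sketch under stated assumptions, not the Retriever's download engine: looks_like_failed_download and its parameters are hypothetical, and a real check would probably also verify archive integrity.

```python
def looks_like_failed_download(status_code, content_type, body, expected_html=False):
    """Heuristically decide whether a response should NOT be cached."""
    if status_code != 200:
        return True
    if not body:
        return True
    # A data source that is down often answers with an HTML error page.
    is_html = "text/html" in (content_type or "").lower()
    if is_html and not expected_html:
        return True
    return False
```

A caller would skip writing the file to the cache when this returns True and instead print a message asking the user to retry later.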

opened by ethanwhite 26

Updated internal variable names to match that of datapackage

Updated internal variable names to match those of the datapackage spec (#765). The following variable names were changed:

tags -> keywords
nulls -> missingValues
name -> title
shortname -> name

The changes were done in the following files -

retriever/lib/compile.py
retriever/lib/datapackage.py
retriever/lib/engine.py
retriever/lib/parse_script_to_json.py
retriever/lib/templates.py
retriever/lib/tools.py
scripts/bioclim.py
scripts/biomass_allometry_db.py
scripts/breed_bird_survey.py
scripts/breed_bird_survey_50stop.py
scripts/forest_inventory_analysis.py
scripts/gentry_forest_transects.py
scripts/npn.py
scripts/plant_life_hist_eu.py
scripts/prism_climate.py
scripts/vertnet.py
scripts/wood_density.py
scripts/*.json (almost all datapackages): transition missingValues -> missing_values
test/test_retriever.py
retriever/__main__.py

Changes Requested

opened by henrykironde 25

Add fetch to python Interface

Hi @henrykironde. Sorry, I was off schedule for the last few days, so I couldn't work on this issue as I told you. This should solve #1019, but is this the right place for the method?
Changes Requested

opened by adhaamehab 23
hacktoberfest guide

For contributors who want to take part in Hacktoberfest, please check the issue lists of the various projects:

Retriever: https://github.com/weecology/retriever/issues
Retriever-recipes: https://github.com/weecology/retriever-recipes/issues
Rdataretriever: https://github.com/ropensci/rdataretriever/issues
Retriever.jl: https://github.com/weecology/Retriever.jl/issues

opened by henrykironde 0
Downloading fails for files with no Content-Disposition

Example packages:

1: Package file: https://github.com/weecology/retriever-recipes/blob/main/scripts/usda_agriculture_plants_database.py
Sample url: https://plants.sc.egov.usda.gov/csvdownload?plantLst=plantCompleteList

2: Package file: https://github.com/weecology/retriever-recipes/blob/main/scripts/aquatic_animal_excretion.py
Sample url: https://esajournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fecy.1792&file=ecy1792-sup-0001-DataS1.zip
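
A possible fallback when the Content-Disposition header is absent is to derive a file name from the URL itself. The sketch below is only an illustration: guess_filename is a hypothetical helper, and treating a file query parameter as the file name is an assumption based on the second sample URL, not a documented convention.

```python
import os
from email.message import Message
from urllib.parse import parse_qs, unquote, urlparse


def guess_filename(url, content_disposition=None, default="download"):
    """Pick a local file name from the header if present, else from the URL."""
    if content_disposition:
        header = Message()
        header["Content-Disposition"] = content_disposition
        filename = header.get_filename()
        if filename:
            return os.path.basename(filename)
    parsed = urlparse(url)
    # Some services pass the real file name as a query parameter.
    file_values = parse_qs(parsed.query).get("file")
    if file_values:
        return os.path.basename(file_values[0])
    # Otherwise use the last path segment, or a generic default.
    name = os.path.basename(unquote(parsed.path))
    return name or default
```

With the two sample URLs above and no header, this yields csvdownload for the first and ecy1792-sup-0001-DataS1.zip for the second.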

opened by henrykironde 1

display_all_rdatasets_names in rdatasets takes a list of package_name

display_all_rdatasets_names takes a list of package names as its parameter instead of a single package_name string, so passing a string makes it iterate over the individual characters:

>>> display_all_rdataset_names("aer")
List of all available Rdatasets in packages: aer
No package named 'a' found in Rdatasets
No package named 'e' found in Rdatasets
No package named 'r' found in Rdatasets

>>> display_all_rdataset_names(["aer"])
List of all available Rdatasets in packages: ['aer']
Package: aer              Dataset: affairs                   Script Name: rdataset-aer-affairs
Package: aer              Dataset: argentinacpi              Script Name: rdataset-aer-argentinacpi
Package: aer              Dataset: bankwages                 Script Name: rdataset-aer-bankwages
Package: aer              Dataset: benderlyzwick             Script Name: rdataset-aer-benderlyzwick
Package: aer              Dataset: bondyield                 Script Name: rdataset-aer-bondyield
Package: aer              Dataset: cartelstability           Script Name: rdataset-aer-cartelstability
Package: aer              Dataset: caschools                 Script Name: rdataset-aer-caschools
Package: aer              Dataset: chinaincome               Script Name: rdataset-aer-chinaincome
Package: aer              Dataset: cigarettesb               Script Name: rdataset-aer-cigarettesb
Package: aer              Dataset: cigarettessw              Script Name: rdataset-aer-cigarettessw
Package: aer              Dataset: collegedistance           Script Name: rdataset-aer-collegedistance
Package: aer              Dataset: consumergood              Script Name: rdataset-aer-consumergood
Package: aer              Dataset: cps1985                   Script Name: rdataset-aer-cps1985
Package: aer              Dataset: cps1988                   Script Name: rdataset-aer-cps1988
....
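
The behaviour above follows from iterating over a string, which yields its characters one at a time. A small normalization step at the top of the function would accept both forms; normalize_package_names is a hypothetical helper shown only to illustrate the idea, not code from the retriever package.

```python
def normalize_package_names(package_name):
    """Accept either one package name or an iterable of names."""
    if isinstance(package_name, str):
        # Wrap a lone string so it is not split into characters.
        return [package_name]
    return list(package_name)
```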

opened by Nageshbansal 1

not able to use gdal==3.3.2 while working with ".shp" files

NOTES

Expected behavior and actual behavior.

With GDAL 3.3.2 installed, importing ogr in a script that deals with ".shp" files fails. After downgrading GDAL to 3.0.2, ogr imports and the script runs successfully.
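
One possible cause, not confirmed in this report, is that the script uses the top-level import ogr form, while GDAL's Python bindings are normally imported from the osgeo package. A defensive import that tries both forms might look like the sketch below; import_ogr is a hypothetical helper.

```python
import importlib


def import_ogr():
    """Return the ogr module, or None if GDAL's Python bindings are unavailable."""
    # Try the osgeo namespace first, then the legacy top-level module.
    for module_name in ("osgeo.ogr", "ogr"):
        try:
            return importlib.import_module(module_name)
        except ImportError:
            continue
    return None
```

A script could call import_ogr() once at start-up and report a clear error if it returns None.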

Operating system

Ubuntu 20.04 bit

GDAL version and provenance

GDAL 3.3.2 version from ubuntugis-unstable PPA

opened by Nageshbansal 0
Make sure that the R API datasets are run on the retrieverdash

We have added some APIs to the retriever. Some of the APIs, like Tidycensus, can be run and tested on the retriever dashboard.

You can clone the retrieverdash project and test locally using the developer docs for the dashboard https://retrieverdash.readthedocs.io/developer.html#setting-up-locally.

When working locally, first you will need to have the APIs working well on the retriever. Use the DEV LIST in the retriever dashboard to test only the required scripts.

opened by henrykironde 0

Releases(v3.1.0)

v3.1.0(Apr 26, 2022)

v3.1.0

Major changes

Remove Travis and use GitHub actions
Improve autocreate script template creation tool
Update Server setup docs
Change default branch from master to main
Update Kaggle API function
Add Anaconda badges
Update BBS breed bird survey
ADD hdf5 to CSV files conversion test
ADD HDF5 engine
XML to CSV conversion test
JSON to CSV function with tests
SQLite to CSV files conversion test
Geojson to CSV conversion test script
Added tidycensus dataset
improve Dockerfile and automate Docker push to the registry
Add support for clipping images
Add Socrata API
Added RDatasets API
Add auto publish to testPyPi and PyPi
Source code(tar.gz)
Source code(zip)
v3.0.0(Jul 16, 2020)

v3.0.0

Major changes

Add provenance support to the Data Retriever
Use utf-8 as default
Move scripts from Retriever to retriever-recipes repository
Adapt google code style and add linters, use yapf. Test linters
Extend CSV field size limit
Improve output when connection is not made
Add version to the interface
Prompt user if a newer version of script is available
Add all the recipes datasets
Add test for installation of committed dataset
Add function to commit dataset

Minor changes

Improve "argcomplete-command"
Add NUMFOCUS logo in README
Source code(tar.gz)
Source code(zip)
v2.4.0(Jun 10, 2019)

Retriever v2.4.0

Minor changes

Update long description
Remove Python 2 utilities

New datasets

Catalogos-dados-brasil
Transparencia-dados-abertos-brasil
biotimesql
Source code(tar.gz)
Source code(zip)
mac.zip(77.76 MB)
python3-retriever_2.4.0-1_all.deb(43.29 KB)
RetrieverSetup.exe(22.84 MB)
v2.3.1(May 1, 2019)

Retriever v2.3.1

Minor changes

Update PyPi description
Source code(tar.gz)
Source code(zip)
mac.zip(77.86 MB)
python3-retriever_2.3.1-1_all.deb(43.32 KB)
RetrieverSetup.exe(22.84 MB)
v2.3.0(May 1, 2019)

Retriever v2.3.0

Major changes

Change Psycopg2 to psycopg2-binary
Add Spatial data testing on Docker
Add option for pretty json
keep order of fetched tables and order of processing resources
Add reset to specific dataset and script function
Use tqdm 4.30.0
Install data into custom directory using data_dir option
Download data into custom directory using sub_dir

Minor changes

Add tests for reset script
Add smaller samples of GIS data for testing
Reactivate MySQL tests on Travis
Allow custom arguments for psql
Add docs and examples for Postgis support
Change testdb name to testdb_retriever
Improve Pypi retriever description
Update documentation for passwordless setup of Postgres on Windows
Setting up infrastructure for automating script creation

New datasets

USA ecoregions, ecoregions-us
LTREB Prairie-forest ecotone of eastern Kansas/Foster Lab dataset
Sonoran Desert, sonoran-desert
Adding Acton Lake dataset acton-lake

Dataset changes

MammalSuperTree.py to mammal_super_tree.py
lakecats_finaltables.json to lakecats_final_tables
harvard_forests.json to harvard_forest.json
macroalgal_communities to macroalgal-communities
Source code(tar.gz)
Source code(zip)
mac.zip(77.86 MB)
python3-retriever_2.3.0-1_all.deb(43.29 KB)
RetrieverSetup.exe(22.84 MB)
v.2.2.0(Nov 6, 2018)

Major changes

Using requests package to fetch data.
Add postgis, a Spatial support for postgres.
Update ls, include more details about the scripts.
update license lookup for datasets
Update keywords lookup for datasets
Use tqdm for all progress tracking.
Changed all "-" in JSON files to "_"

Minor changes

Documentation refinement.
Connect to MySQL using preferred encoding.
License search and keyword search added.
Conda_Forge docs
Add Zenodo badge to link to archive
Add test for extracting data

New datasets

Add Noaa Fisheries trade, noaa-fisheries-trade.
Add Fishery Statistical Collections data, fao-global-capture-product.
Add bupa liver disorders dataset, bupa-liver-disorders.
Add GLOBI interactions data. globi-interaction.
Addition of the National Aquatic Resource Surveys (NARS), nla.
Addition of partners in flight dataset, partners-in-flight.
Add the ND-GAIN Country Index. nd-gain.
Add world GDP in current US Dollars. dgp.
Add airports dataset, airports.
Repair aquatic animal excretion.
Add Biotime dataset.
Add lakecats final tables dataset, lakecats-final-tables.
Add harvard forests data, harvard forests.
Add USGS elevation data, usgs-elevation.
Source code(tar.gz)
Source code(zip)
python-retriever_2.2.0-1_all.deb(38.28 KB)
retriever-2.2.0.tar.gz(55.76 KB)
retriever.app.zip(65.22 MB)
RetrieverSetup.exe(28.16 MB)
v2.1.0(Oct 27, 2017)
v2.1.0

Major changes

Add Python interface

Add Retriever to conda

Auto complete of Retriever commands on Unix systems

Minor changes

Add license to datasets

Change the structure of raw data from string to list

Add testing on any modified dataset

Improve memory usage in cross-tab processing

Add capability for datasets to use custom Encoding

Use new Python interface for regression testing

Use Frictionless Data specification terminology for internals

New datasets

Add ant dataset and weather data to the portal dataset

NYC TreesCount

PREDICTS

aquatic_animal_excretion

biodiversity_response

bird_migration_data

chytr_disease_distr

croche_vegetation_data

dicerandra_frutescens

flensburg_food_web

great_basin_mammal_abundance

macroalgal_communities

macrocystis_variation

marine_recruitment_data

mediter_basin_plant_traits

nematode_traits

ngreatplains-flowering-dates

portal-dev

portal

predator_prey_body_ratio

predicts

socean_diet_data

species_exctinction_rates

streamflow_conditions

tree_canopy_geometries

turtle_offspring_nesting

Add vertnet individual datasets vertnet_amphibians vertnet_birds vertnet_fishes vertnet_mammals vertnet_reptiles

Source code(tar.gz)
Source code(zip)
retriever.app.zip(10.16 MB)
RetrieverSetup.exe(11.64 MB)
retriever_2.1.0.deb(33.99 KB)
v2.0.0(Feb 24, 2017)
v2.0.0

Major changes

Add Python 3 support, python 2/3 compatibility

Add json and xml as output formats

Switch to using the frictionless data datapackage json standard. This is a backwards-incompatible change, as the form of the dataset description files the retriever uses to describe the location and processing of simple datasets has changed.

Add CLI for creating, editing, deleting datapackage.json scripts

Broaden scope to include non-ecological data and rename to Data Retriever

Major expansion of documentation and move documentation to Read the Docs

Add developer documentation

Remove the GUI

Use csv module for reading of raw data to improve handling of newlines in fields

Major expansion of integration testing

Refactor regression testing to produce a single hash for a dataset regardless of output format

Add continuous integration testing for Windows

Minor changes

Use pyinstaller for creating exe for windows and app for mac and remove py2app

Use 3 level semantic versioning for both scripts and core code

Rename datasets with more descriptive names

Add a retriever minimum version for each dataset

Rename dataset description files to follow python modules conventions

Switch to py.test from nose

Expand unit testing

Add version requirements for sqlite and postgresql

Default to latin encoding

Improve UI for updating user on downloading and processing progress

New datasets

Added machine Learning datasets from UC Irvine's machine learning data sets

Source code(tar.gz)
Source code(zip)
python3-retriever_2.0.0-1_all.deb(33.13 KB)
retriever-OSX.zip(10.41 MB)
RetrieverSetup.exe(11.16 MB)
v1.8.3(Feb 12, 2016)
v1.8.3

Fixed regression in GUI

v1.8.2

Improved cleaning of column names

Fixed thread bug causing Gentry dataset to hang when installed via GUI

Removed support for 32-bit only Macs in binaries

Removed unused code

v1.8.0

Added scripts for 21 new datasets: leaf herbivory, biomass allocation, community dynamics of shortgrass steppe plants, mammal and bird foraging attributes, tree demography in Indian, small mammal community dynamics in Chile, community dynamics of Sonoran Desert perennials, biovolumes of freshwater phytoplankton, plant dynamics in Montana, Antarctic Site Inventory breeding bird survey, community abundance data compiled from the literature, spatio-temporal population data for butterflies, fish parasite host ecological characteristics, eBird, Global Wood Density Database, multiscale community data on vascular plants in a North Carolina, vertebrate home range sizes, PRISM climate data, Amniote life history database, woody plant Biomass And Allometry Database, Vertnet data on amphibians, birds, fishes, mammals, reptiles

Added reset command to allow resetting database configuration settings, scripts, and cached raw data

Added Dockerfile for building docker containers of each version of the software for reproducibility

Added support for wxPython 3.0

Added support for tar and gz archives

Added support for archive files whose contents don't fit in memory

Added checks for and use of system proxies

Added ability to download archives from web services

Added tests for regressions in download engine

Added citation command to provide information on citing datasets

Improved column name cleanup

Improved whitespace consistency

Improved handling of Excel files

Improved function documentation

Improved unit testing and added coverage analysis

Improved the sample script by adding a url field

Improved script loading behavior by only loading a script the first time it is discovered

Improved operating system identification

Improved download engine by allowing it to maintain archive and subdirectory structure (particularly relevant for spatial data)

Improved cross-platform directory and line ending handling

Improved testing across platforms

Improved checking for updated scripts so that scripts are only downloaded if the current version isn't available

Improved metadata in setup.py

Fixed type issues in Portal dataset

Fixed GUI always downloading scripts instead of checking if it needed to

Fixed bug that sometimes resulted in .retriever directories not belonging to the user who did the installation

Fixed issues with downloading files to specific paths

Fixed BBS50 script to match newer structure of the data

Fixed bug where csv files were not being closed after installation

Fixed errors when closing the GUI

Fixed issue where enclosing quotes in csv files were not being respected during cross-tab restructuring

Fixed bug causing v1.6 to break when newer scripts were added to version.txt

Fixed Bioclim script to include hdr files

Fixed missing icon images on Windows

Removed unused code
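
Two of the items above mention support for tar and gz archives and for archives whose contents do not fit in memory. The sketch below shows one common way to handle both with Python's standard library: copy a member out of the archive in fixed-size chunks. It is an illustration under stated assumptions, not the Retriever's download engine, and the function name `extract_member_streaming` and its parameters are hypothetical.

```python
import os
import shutil
import tarfile


def extract_member_streaming(archive_path, member_name, dest_dir, chunk_size=1024 * 1024):
    """Copy one file out of a .tar or .tar.gz archive in fixed-size chunks."""
    os.makedirs(dest_dir, exist_ok=True)
    # Keep only the base name so a crafted member path cannot escape dest_dir.
    dest_path = os.path.join(dest_dir, os.path.basename(member_name))
    # Mode "r:*" lets tarfile detect the compression (none, gzip, bz2, or xz).
    with tarfile.open(archive_path, "r:*") as archive:
        source = archive.extractfile(member_name)
        if source is None:
            raise ValueError(f"{member_name!r} is not a regular file in the archive")
        # copyfileobj reads and writes chunk_size bytes at a time, so the
        # whole member is never held in memory at once.
        with source, open(dest_path, "wb") as target:
            shutil.copyfileobj(source, target, chunk_size)
    return dest_path
```

This sketch flattens the member into `dest_dir`. Preserving the archive's subdirectory structure, as one of the items above describes, would also require checking that each member path stays inside the destination directory.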

Source code(tar.gz)
Source code(zip)
python-retriever_1.8.3-1_all.deb(96.11 KB)
retriever.zip(29.07 MB)
RetrieverSetup.exe(8.32 MB)
v1.8.2(Feb 12, 2016)
This is the 1.8 release of the EcoData Retriever.

v1.8.2

Improved cleaning of column names

Fixed thread bug causing Gentry dataset to hang when installed via GUI

Removed support for 32-bit only Macs in binaries

Removed unused code

v1.8.0

Added scripts for 21 new datasets: leaf herbivory, biomass allocation, community dynamics of shortgrass steppe plants, mammal and bird foraging attributes, tree demography in India, small mammal community dynamics in Chile, community dynamics of Sonoran Desert perennials, biovolumes of freshwater phytoplankton, plant dynamics in Montana, Antarctic Site Inventory breeding bird survey, community abundance data compiled from the literature, spatio-temporal population data for butterflies, fish parasite host ecological characteristics, eBird, Global Wood Density Database, multiscale community data on vascular plants in North Carolina, vertebrate home range sizes, PRISM climate data, Amniote life history database, woody plant Biomass And Allometry Database, VertNet data on amphibians, birds, fishes, mammals, reptiles

Added reset command to allow resetting database configuration settings, scripts, and cached raw data

Added Dockerfile for building docker containers of each version of the software for reproducibility

Added support for wxPython 3.0

Added support for tar and gz archives

Added support for archive files whose contents don't fit in memory

Added checks for and use of system proxies

Added ability to download archives from web services

Added tests for regressions in download engine

Added citation command to provide information on citing datasets

Improved column name cleanup

Improved whitespace consistency

Improved handling of Excel files

Improved function documentation

Improved unit testing and added coverage analysis

Improved the sample script by adding a url field

Improved script loading behavior by only loading a script the first time it is discovered

Improved operating system identification

Improved download engine by allowing it to maintain archive and subdirectory structure (particularly relevant for spatial data)

Improved cross-platform directory and line ending handling

Improved testing across platforms

Improved checking for updated scripts so that scripts are only downloaded if the current version isn't available

Improved metadata in setup.py

Fixed type issues in Portal dataset

Fixed GUI always downloading scripts instead of checking if it needed to

Fixed bug that sometimes resulted in .retriever directories not belonging to the user who did the installation

Fixed issues with downloading files to specific paths

Fixed BBS50 script to match newer structure of the data

Fixed bug where csv files were not being closed after installation

Fixed errors when closing the GUI

Fixed issue where enclosing quotes in csv files were not being respected during cross-tab restructuring

Fixed bug causing v1.6 to break when newer scripts were added to version.txt

Fixed Bioclim script to include hdr files

Fixed missing icon images on Windows

Removed unused code

Source code(tar.gz)
Source code(zip)
python-retriever_1.8.2-1_all.deb(96.08 KB)
retriever.zip(29.07 MB)
RetrieverSetup.exe(8.32 MB)
v1.7.0(Oct 5, 2014)
This is the v1.7.0 release of the EcoData Retriever.

Added ability to download files directly for non-tabular data

Added scripts to download Bioclim and Mammal Supertree data

Added a script for the MammalDIET database

Fixed bug where some nationally standardized FIA surveys were not included

Added check for wxPython on installation to allow non-GUI installs

Fixed several minor issues with Gentry script including a missing site and a column in one file that was misnamed

Windows install now adds the retriever to the path to facilitate command line use

Fixed a bug preventing installation from PyPI

Added icons to installers

Fixed the retriever failing when given a script it couldn't handle
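
The wxPython check mentioned above can be pictured as an optional-dependency test: look for the GUI toolkit and fall back to the command line when it is missing. The following is a minimal sketch of that pattern, not the Retriever's installer code. The helper names `module_available` and `choose_interface` are hypothetical; `wx` is the import name used by wxPython.

```python
import importlib.util


def module_available(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


def choose_interface(gui_module="wx"):
    """Pick "gui" when the GUI toolkit is importable, otherwise "cli"."""
    return "gui" if module_available(gui_module) else "cli"
```

Checking with `find_spec` avoids importing the toolkit just to see whether it exists, so the check has no side effects on a machine without a display.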

Source code(tar.gz)
Source code(zip)
python-retriever_1.7.0-1_all.deb(96.21 KB)
retriever-app.zip(17.61 MB)
RetrieverSetup.exe(6.73 MB)
v1.6.0(Feb 11, 2014)

This release adds full OS X support to the Retriever, adds a proper Windows installer, and fixes a number of bugs.
Source code(tar.gz)
Source code(zip)
python-retriever_1.6-1_all.deb(103.32 KB)
retriever-1.6-source.tar.gz(103.52 KB)
retriever-app.zip(29.04 MB)
RetrieverSetupWindows.exe(6.72 MB)

Owner

Weecology

GitHub http://data-retriever.org
