gget

gget is a free and open-source command-line tool and Python package that enables efficient querying of genomic databases. gget consists of a collection of separate but interoperable modules, each designed to facilitate one type of database querying in a single line of code.

Please cite the following paper:
Luebbert, L. & Pachter, L. (2022). Efficient querying of genomic databases for single-cell RNA-seq with gget. bioRxiv 2022.05.17.492392; doi: https://doi.org/10.1101/2022.05.17.492392

gget currently consists of the following nine modules:

  • gget ref
    Fetch File Transfer Protocol (FTP) links and metadata for reference genomes and annotations from Ensembl by species.
  • gget search
    Fetch genes and transcripts from Ensembl using free-form search terms.
  • gget info
    Fetch extensive gene and transcript metadata from Ensembl, UniProt, and NCBI using Ensembl IDs.
  • gget seq
    Fetch nucleotide or amino acid sequences of genes or transcripts from Ensembl or UniProt, respectively.
  • gget blast
    BLAST a nucleotide or amino acid sequence to any BLAST database.
  • gget blat
    Find the genomic location of a nucleotide or amino acid sequence using BLAT.
  • gget muscle
    Align multiple nucleotide or amino acid sequences to each other using Muscle5.
  • gget enrichr
    Perform an enrichment analysis on a list of genes using Enrichr.
  • gget archs4
    Find the most correlated genes to a gene of interest or find the gene's tissue expression atlas using ARCHS4.

Installation

pip install gget

For use in Jupyter Lab / Google Colab:

import gget

Quick start guide

# Fetch all Homo sapiens reference and annotation FTPs from the latest Ensembl release
$ gget ref -s homo_sapiens

# Search human genes with "ace2" AND "angiotensin" in their name/description
$ gget search -sw ace2,angiotensin -s homo_sapiens -ao and 

# Look up gene ENSG00000130234 (ACE2) with expanded info (returns all transcript isoforms for genes)
$ gget info -id ENSG00000130234 -e

# Fetch the amino acid sequence of the canonical transcript of gene ENSG00000130234
$ gget seq -id ENSG00000130234 --seqtype transcript

# Quickly find the genomic location of (the start of) that amino acid sequence
$ gget blat -seq MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS

# Blast (the start of) that amino acid sequence
$ gget blast -seq MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS

# Align nucleotide or amino acid sequences stored in a FASTA file
$ gget muscle -fa path/to/file.fa

# Use Enrichr to find the ontology of a list of genes
$ gget enrichr -g ACE2 AGT AGTR1 ACE AGTRAP AGTR2 ACE3P -db ontology

# Get the human tissue expression atlas of gene ACE2
$ gget archs4 -g ACE2 -w tissue

Jupyter Lab / Google Colab:

gget.ref("homo_sapiens")
gget.search(["ace2", "angiotensin"], "homo_sapiens", andor="and")
gget.info("ENSG00000130234", expand=True)
gget.seq("ENSG00000130234", seqtype="transcript")
gget.blat("MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS")
gget.blast("MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS")
gget.muscle("path/to/file.fa")
gget.enrichr(["ACE2", "AGT", "AGTR1", "ACE", "AGTRAP", "AGTR2", "ACE3P"], database="ontology", plot=True)
gget.archs4("ACE2", which="tissue")

Manual

Jupyter Lab / Google Colab arguments are equivalent to the long-option arguments (--arg).
The manual for any gget tool can be called from the terminal using the -h / --help flag.

gget ref

Fetch FTPs and their respective metadata (or use flag ftp to only return the links) for reference genomes and annotations from Ensembl by species.
Return format: dictionary/json.

Required arguments
-s --species
Species for which the FTPs will be fetched in the format genus_species, e.g. homo_sapiens.
Note: Not required when calling flag [--list_species].
Supported shortcuts: 'human', 'mouse'
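
The species format and shortcuts described above can be sketched as a small helper. This is only an illustration, not gget's own code: the function name normalize_species and the SPECIES_SHORTCUTS table are hypothetical, and the expansion of 'mouse' to mus_musculus is assumed from the Ensembl species name.

```python
# Hypothetical helper (not part of gget): expands the documented shortcuts
# and converts "Genus species" into the genus_species format.
SPECIES_SHORTCUTS = {
    "human": "homo_sapiens",
    "mouse": "mus_musculus",  # assumed Ensembl name for the 'mouse' shortcut
}


def normalize_species(species: str) -> str:
    """Return a species name in lowercase genus_species form."""
    key = species.strip().lower()
    key = SPECIES_SHORTCUTS.get(key, key)
    return key.replace(" ", "_")


print(normalize_species("human"))
print(normalize_species("Homo sapiens"))
```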

Optional arguments
-w --which
Defines which results to return. Default: 'all' -> Returns all available results.
Possible entries are one or a combination of the following:
'gtf' - Returns the annotation (GTF).
'cdna' - Returns the transcriptome (cDNA).
'dna' - Returns the genome (DNA).
'cds' - Returns the coding sequences corresponding to Ensembl genes. (Does not contain UTR or intronic sequence.)
'ncrna' - Returns transcript sequences corresponding to non-coding RNA genes (ncRNA).
'pep' - Returns the protein translations of Ensembl genes.

-r --release
Defines the Ensembl release number from which the files are fetched, e.g. 104. Default: latest Ensembl release.

-o --out
Path to the json file the results will be saved in, e.g. path/to/directory/results.json. Default: Standard out.
Jupyter Lab / Google Colab: save=True will save the output in the current working directory.

Flags
-l --list_species
Lists all available species. (Jupyter Lab / Google Colab: combine with species=None.)

-ftp --ftp
Returns only the requested FTP links.

-d --download
Downloads the requested FTPs to the current directory (requires curl to be installed).

Examples

Use gget ref in combination with kallisto | bustools to build a reference index:

kb ref -i INDEX -g T2G -f1 FASTA $(gget ref --ftp -w dna,gtf -s homo_sapiens)

→ kb ref builds a reference index using the latest DNA and GTF files of species Homo sapiens passed to it by gget ref.

Get all available genomes:

gget ref --list_species -r 103
# Jupyter Lab / Google Colab:
gget.ref(species=None, list_species=True, release=103)

→ Returns a list with all available genomes (checks if GTF and FASTAs are available) from Ensembl release 103.
(If no release is specified, gget ref will always return information from the latest Ensembl release.)

Get the genome reference for a specific species:

gget ref -s homo_sapiens -w gtf dna
# Jupyter Lab / Google Colab:
gget.ref("homo_sapiens", which=["gtf", "dna"])

→ Returns a json with the latest human GTF and FASTA FTPs, and their respective metadata, in the format:

{
    "homo_sapiens": {
        "annotation_gtf": {
            "ftp": "http://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz",
            "ensembl_release": 106,
            "release_date": "28-Feb-2022",
            "release_time": "23:27",
            "bytes": "51379459"
        },
        "genome_dna": {
            "ftp": "http://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz",
            "ensembl_release": 106,
            "release_date": "21-Feb-2022",
            "release_time": "09:35",
            "bytes": "881211416"
        }
    }
}
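
Because the result is plain JSON, it can be post-processed with the standard library. The sketch below parses the example output shown above and collects the FTP links and the total download size. The helper name ftp_links is hypothetical, and the key layout is assumed to match the example exactly.

```python
import json

# Example output copied from the JSON shown above.
ref_json = """
{
    "homo_sapiens": {
        "annotation_gtf": {
            "ftp": "http://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz",
            "ensembl_release": 106,
            "release_date": "28-Feb-2022",
            "release_time": "23:27",
            "bytes": "51379459"
        },
        "genome_dna": {
            "ftp": "http://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz",
            "ensembl_release": 106,
            "release_date": "21-Feb-2022",
            "release_time": "09:35",
            "bytes": "881211416"
        }
    }
}
"""


def ftp_links(results: dict) -> list:
    """Collect every 'ftp' value from a species -> file type -> metadata mapping."""
    return [meta["ftp"] for files in results.values() for meta in files.values()]


results = json.loads(ref_json)
links = ftp_links(results)

# "bytes" is stored as a string in the example, so convert before summing.
total_bytes = sum(
    int(meta["bytes"]) for files in results.values() for meta in files.values()
)

for link in links:
    print(link)
print(f"Total size: {total_bytes / 1e6:.1f} MB")
```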


gget search

Fetch genes and transcripts from Ensembl using free-form search terms.
Return format: data frame.

Required arguments
-sw --searchwords
One or more free-form search words, e.g. gaba, nmda. (Note: Search is not case-sensitive.)

-s --species
Species or database to be searched.
A species can be passed in the format 'genus_species', e.g. 'homo_sapiens'.
To pass a specific database, pass the name of the CORE database, e.g. 'mus_musculus_dba2j_core_105_1'.
All available databases can be found here.
Supported shortcuts: 'human', 'mouse'.

Optional arguments
-st --seqtype
'gene' (default) or 'transcript'
Returns genes or transcripts, respectively.

-ao --andor
'or' (default) or 'and'
'or': Returns all genes that INCLUDE AT LEAST ONE of the searchwords in their name/description.
'and': Returns only genes that INCLUDE ALL of the searchwords in their name/description.

-l --limit
Limits the number of search results, e.g. 10. Default: None.

-o --out
Path to the csv the results will be saved in, e.g. path/to/directory/results.csv. Default: Standard out.
Jupyter Lab / Google Colab: save=True will save the output in the current working directory.
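
The 'or' / 'and' behaviour of --andor described above can be illustrated with a few lines of plain Python. This is a toy sketch of the matching semantics only; it is not how gget queries Ensembl, and the function names and the three in-memory records are made up for the illustration.

```python
def matches(searchwords, text, andor="or"):
    """Case-insensitive substring match with 'or' / 'and' semantics."""
    text = text.lower()
    hits = [word.lower() in text for word in searchwords]
    if andor == "and":
        return all(hits)  # every search word must be present
    if andor == "or":
        return any(hits)  # at least one search word must be present
    raise ValueError("andor must be 'or' or 'and'")


# Toy records (gene name -> description), used only for this illustration.
genes = {
    "ACE2": "angiotensin converting enzyme 2",
    "AGT": "angiotensinogen",
    "GABARAPL2": "GABA type A receptor associated protein like 2",
}


def toy_search(searchwords, andor="or"):
    """Return the names of toy records whose name/description match."""
    return [
        name
        for name, description in genes.items()
        if matches(searchwords, f"{name} {description}", andor)
    ]


print(toy_search(["ace2", "angiotensin"], andor="or"))
print(toy_search(["ace2", "angiotensin"], andor="and"))
```

With 'or', both ACE2 and AGT are returned because each contains at least one search word; with 'and', only ACE2 remains.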

Flags
wrap_text
Jupyter Lab / Google Colab only. wrap_text=True displays data frame with wrapped text for easy reading (default: False).

Example

gget search -sw gaba gamma-aminobutyric -s homo_sapiens
# Jupyter Lab / Google Colab:
gget.search(["gaba", "gamma-aminobutyric"], "homo_sapiens")

→ Returns all genes that contain at least one of the search words in their name or Ensembl/external reference description:

| ensembl_id | gene_name | ensembl_description | ext_ref_description | biotype | url |
|---|---|---|---|---|---|
| ENSG00000034713 | GABARAPL2 | GABA type A receptor associated protein like 2 [Source:HGNC Symbol;Acc:HGNC:13291] | GABA type A receptor associated protein like 2 | protein_coding | https://uswest.ensembl.org/homo_sapiens/Gene/Summary?g=ENSG00000034713 |
| . . . | . . . | . . . | . . . | . . . | . . . |


gget info

Fetch extensive gene and transcript metadata from Ensembl, UniProt, and NCBI using Ensembl IDs.
Return format: data frame.

Required arguments
-id --ens_ids
One or more Ensembl IDs.

Optional arguments
-o --out
Path to the csv the results will be saved in, e.g. path/to/directory/results.csv. Default: Standard out.
Jupyter Lab / Google Colab: save=True will save the output in the current working directory.

Flags
-e --expand
Expands returned information (only for gene and transcript IDs).
For genes, adds information on all known transcripts.
For transcripts, adds information on all known translations and exons.

wrap_text
Jupyter Lab / Google Colab only. wrap_text=True displays data frame with wrapped text for easy reading (default: False).

Example

gget info -id ENSG00000034713 ENSG00000104853 ENSG00000170296 -e 
# Jupyter Lab / Google Colab:
gget.info(["ENSG00000034713", "ENSG00000104853", "ENSG00000170296"], expand=True)

→ Returns extensive information about each requested Ensembl ID in data frame format:

| | uniprot_id | ncbi_gene_id | primary_gene_name | synonyms | protein_names | ensembl_description | uniprot_description | ncbi_description | biotype | canonical_transcript | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000034713 | P60520 | 11345 | GABARAPL2 | [ATG8, ATG8C, FLC3A, GABARAPL2, GATE-16, GATE16, GEF-2, GEF2] | Gamma-aminobutyric acid receptor-associated protein like 2 (GABA(A) receptor-associated protein-like 2)... | GABA type A receptor associated protein like 2 [Source:HGNC Symbol;Acc:HGNC:13291] | FUNCTION: Ubiquitin-like modifier involved in intra-Golgi traffic (By similarity). Modulates intra-Golgi transport through coupling between NSF activity and ... | Enables ubiquitin protein ligase binding activity. Involved in negative regulation of proteasomal protein catabolic process and protein... | protein_coding | ENST00000037243.7 | ... |
| . . . | . . . | . . . | . . . | . . . | . . . | . . . | . . . | . . . | . . . | . . . | ... |


gget seq

Fetch nucleotide or amino acid sequence of a gene (and all its isoforms) or a transcript by Ensembl ID.
Return format: FASTA.

Required arguments
-id --ens_ids
One or more Ensembl IDs.

Optional arguments
-st --seqtype
'gene' (default) or 'transcript'.
Defines whether nucleotide or amino acid sequences are returned.
'gene' returns nucleotide sequences, which are fetched from Ensembl.
'transcript' returns amino acid sequences, which are fetched from UniProt.

-o --out
Path to the file the results will be saved in, e.g. path/to/directory/results.fa. Default: Standard out.
Jupyter Lab / Google Colab: save=True will save the output in the current working directory.

Flags
-iso --isoforms
Returns the sequences of all known transcripts.
(Only for gene IDs in combination with seqtype=transcript.)

Examples

gget seq -id ENSG00000034713 ENSG00000104853 ENSG00000170296
# Jupyter Lab / Google Colab:
gget.seq(["ENSG00000034713", "ENSG00000104853", "ENSG00000170296"])

→ Returns the nucleotide sequences of ENSG00000034713, ENSG00000104853, and ENSG00000170296 in FASTA format.

gget seq -id ENSG00000034713 -st transcript -iso
# Jupyter Lab / Google Colab:
gget.seq("ENSG00000034713", seqtype="transcript", isoforms=True)

→ Returns the amino acid sequences of all known transcripts of ENSG00000034713 in FASTA format.
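
Since the output is FASTA, it can be read back with a few lines of standard-library Python. The parser below is a minimal sketch; the two records and their header text are made up for the illustration (the first sequence is the peptide used in the quick start guide), and the real headers written by gget seq may look different.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequences may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Illustrative input; the header text is hypothetical.
example = """>seq_a hypothetical header
MSSSSWLLLSLVAVTAAQSTIEEQ
AKTFLDKFNHEAEDLFYQSSLAS
>seq_b hypothetical header
ACGT
"""

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence))
```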


gget blast

BLAST a nucleotide or amino acid sequence to any BLAST database.
Return format: data frame.

Required arguments
-seq --sequence
Nucleotide or amino acid sequence, or path to FASTA or .txt file.

Optional arguments
-p --program
'blastn', 'blastp', 'blastx', 'tblastn', or 'tblastx'.
Default: 'blastn' for nucleotide sequences; 'blastp' for amino acid sequences.

-db --database
'nt', 'nr', 'refseq_rna', 'refseq_protein', 'swissprot', 'pdbaa', or 'pdbnt'.
Default: 'nt' for nucleotide sequences; 'nr' for amino acid sequences.
More info on BLAST databases

-l --limit
Limits number of hits to return. Default: 50.

-e --expect
Defines the expect value cutoff. Default: 10.0.

-o --out
Path to the csv the results will be saved in, e.g. path/to/directory/results.csv. Default: Standard out.
Jupyter Lab / Google Colab: save=True will save the output in the current working directory.

Flags
-lcf --low_comp_filt
Turns on low complexity filter.

-mbo --megablast_off
Turns off MegaBLAST algorithm. Default: MegaBLAST on (blastn only).

-q --quiet
Prevents progress information from being displayed.

wrap_text
Jupyter Lab / Google Colab only. wrap_text=True displays data frame with wrapped text for easy reading (default: False).

Example

gget blast -seq MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR
# Jupyter Lab / Google Colab:
gget.blast("MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR")

→ Returns the BLAST result of the sequence of interest in data frame format. gget blast automatically detects this sequence as an amino acid sequence and therefore sets the BLAST program to blastp with database nr.

| Description | Scientific Name | Common Name | Taxid | Max Score | Total Score | Query Cover | ... |
|---|---|---|---|---|---|---|---|
| PREDICTED: gamma-aminobutyric acid receptor-as... | Colobus angolensis palliatus | NaN | 336983 | 180 | 180 | 100% | ... |
| . . . | . . . | . . . | . . . | . . . | . . . | . . . | ... |
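
The automatic choice of program and database described above could work along the following lines. This is a rough sketch under an assumed heuristic (a sequence made up only of the letters A, C, G, T, U and N is treated as nucleotide); the actual detection logic in gget may differ, and both function names are hypothetical. The default values come from the argument descriptions above.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")  # assumed alphabet for the heuristic


def guess_seqtype(sequence: str) -> str:
    """Guess whether a sequence is 'nucleotide' or 'amino acid'."""
    letters = {ch for ch in sequence.upper() if ch.isalpha()}
    if not letters:
        raise ValueError("sequence contains no letters")
    # Caveat: a peptide built only from A, C, G, T and N would be misclassified.
    return "nucleotide" if letters <= NUCLEOTIDE_LETTERS else "amino acid"


def default_blast_settings(sequence: str) -> dict:
    """Pick the documented default program/database for the guessed type."""
    if guess_seqtype(sequence) == "nucleotide":
        return {"program": "blastn", "database": "nt"}
    return {"program": "blastp", "database": "nr"}


protein = "MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR"
print(default_blast_settings(protein))
print(default_blast_settings("ACGTTGCA"))
```

When the guess would be wrong for a given input, the program and database can be set explicitly with -p and -db.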

BLAST from .fa or .txt file:

gget blast -seq fasta.fa
# Jupyter Lab / Google Colab:
gget.blast("fasta.fa")

→ Returns the BLAST results of the first sequence contained in the fasta.fa file.


gget blat

Find the genomic location of a nucleotide or amino acid sequence using BLAT.
Return format: data frame.

Required arguments
-seq --sequence
Nucleotide or amino acid sequence, or path to FASTA or .txt file.

Optional arguments
-st --seqtype
'DNA', 'protein', 'translated%20RNA', or 'translated%20DNA'.
Default: 'DNA' for nucleotide sequences; 'protein' for amino acid sequences.

-a --assembly
'human' (hg38) (default), 'mouse' (mm39), 'zebrafinch' (taeGut2),
or any of the species assemblies available here (use short assembly name).

-o --out
Path to the csv the results will be saved in, e.g. path/to/directory/results.csv. Default: Standard out.
Jupyter Lab / Google Colab: save=True will save the output in the current working directory.

Example

gget blat -seq MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR -a taeGut2
# Jupyter Lab / Google Colab:
gget.blat("MKWMFKEDHSLEHRCVESAKIRAKYPDRVPVIVEKVSGSQIVDIDKRKYLVPSDITVAQFMWIIRKRIQLPSEKAIFLFVDKTVPQSR", assembly="taeGut2")

→ Returns BLAT results for assembly taeGut2 (zebra finch) in data frame format. In the above example, gget blat automatically detects this sequence as an amino acid sequence and therefore sets the BLAT seqtype to protein.

| genome | query_size | aligned_start | aligned_end | matches | mismatches | %_aligned | ... |
|---|---|---|---|---|---|---|---|
| taeGut2 | 88 | 12 | 88 | 77 | 0 | 87.5 | ... |
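
The numbers in the example row are consistent with %_aligned being the aligned length divided by the query size. The sketch below reproduces that arithmetic; the formula is inferred from this single row rather than taken from the gget source, so treat it as an assumption, and the function name percent_aligned is hypothetical.

```python
def percent_aligned(aligned_start: int, aligned_end: int, query_size: int) -> float:
    """Percentage of the query covered by the alignment (1-based, inclusive)."""
    aligned_length = aligned_end - aligned_start + 1
    return 100 * aligned_length / query_size


# Values copied from the example row above.
row = {"query_size": 88, "aligned_start": 12, "aligned_end": 88, "matches": 77}

value = percent_aligned(row["aligned_start"], row["aligned_end"], row["query_size"])
print(value)
```

Here positions 12 to 88 cover 77 residues of an 88-residue query, which matches both the matches column and the reported 87.5.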


gget muscle

Align multiple nucleotide or amino acid sequences to each other using Muscle5.
Return format: ClustalW formatted standard out or aligned FASTA.

Required arguments
-fa --fasta
Path to FASTA or .txt file containing the nucleotide or amino acid sequences to be aligned.

Optional arguments
-o --out
Path to the aligned FASTA file the results will be saved in, e.g. path/to/directory/results.afa. Default: Standard out.
Jupyter Lab / Google Colab: save=True will save the output in the current working directory.

Flags
-s5 --super5
Aligns input using the Super5 algorithm instead of the Parallel Perturbed Probcons (PPP) algorithm to decrease time and memory.
Use for large inputs (a few hundred sequences).

wrap_text
Jupyter Lab / Google Colab only. wrap_text=True displays data frame with wrapped text for easy reading (default: False).

Example

gget muscle -fa fasta.fa
# Jupyter Lab / Google Colab:
gget.muscle("fasta.fa")

→ Returns an overview of the aligned sequences with ClustalW coloring. (To return an aligned FASTA (.afa) file, use --out argument (or save=True in Jupyter Lab/Google Colab).) In the above example, the 'fasta.fa' includes several sequences to be aligned (e.g. isoforms returned from gget seq).


gget enrichr

Perform an enrichment analysis on a list of genes using Enrichr.
Return format: data frame.

Required arguments
-g --genes
Short names (gene symbols) of genes to perform enrichment analysis on, e.g. 'PHF14 RBM3 MSL1 PHF21A'.

-db --database
Database to use as reference for the enrichment analysis.
Supports any database listed here under 'Gene-set Library' or one of the following shortcuts:
'pathway' (KEGG_2021_Human)
'transcription' (ChEA_2016)
'ontology' (GO_Biological_Process_2021)
'diseases_drugs' (GWAS_Catalog_2019)
'celltypes' (PanglaoDB_Augmented_2021)
'kinase_interactions' (KEA_2015)
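
The shortcut list above amounts to a simple lookup table. The sketch below shows one way to resolve a shortcut to its gene-set library name while passing full library names through unchanged. The dictionary contents are taken from the list above; the names ENRICHR_SHORTCUTS and resolve_database are hypothetical and not part of gget.

```python
# Shortcut -> Enrichr gene-set library, as listed above.
ENRICHR_SHORTCUTS = {
    "pathway": "KEGG_2021_Human",
    "transcription": "ChEA_2016",
    "ontology": "GO_Biological_Process_2021",
    "diseases_drugs": "GWAS_Catalog_2019",
    "celltypes": "PanglaoDB_Augmented_2021",
    "kinase_interactions": "KEA_2015",
}


def resolve_database(database: str) -> str:
    """Expand a shortcut; any other value is assumed to be a library name."""
    return ENRICHR_SHORTCUTS.get(database, database)


print(resolve_database("ontology"))
print(resolve_database("KEGG_2021_Human"))
```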

Optional arguments
-o --out
Path to the csv the results will be saved in, e.g. path/to/directory/results.csv. Default: Standard out.
Jupyter Lab / Google Colab: save=True will save the output in the current working directory.

Flags
plot
Jupyter Lab / Google Colab only. plot=True provides a graphical overview of the first 15 results (default: False).

Example

gget enrichr -g ACE2 AGT AGTR1 -db ontology
# Jupyter Lab / Google Colab:
gget.enrichr(["ACE2", "AGT", "AGTR1"], database="ontology", plot=True)

→ Returns pathways/functions involving genes ACE2, AGT, and AGTR1 from the GO Biological Process 2021 database in data frame format. In Jupyter Lab / Google Colab, plot=True additionally returns a graphical overview of the results.


gget archs4

Find the most correlated genes to a gene of interest or find the gene's tissue expression atlas using ARCHS4.
Return format: data frame.

Required arguments
-g --gene
Short name (gene symbol) of gene of interest, e.g. 'STAT4'.

Optional arguments
-w --which
'correlation' (default) or 'tissue'.
'correlation' returns a gene correlation table that contains the 100 most correlated genes to the gene of interest. The Pearson correlation is calculated over all samples and tissues in ARCHS4.
'tissue' returns a tissue expression atlas calculated from human or mouse samples (as defined by 'species') in ARCHS4.

-s --species
'human' (default) or 'mouse'.
Defines whether to use human or mouse samples from ARCHS4.
(Only for tissue expression atlas.)

-o --out
Path to the csv the results will be saved in, e.g. path/to/directory/results.csv. Default: Standard out.
Jupyter Lab / Google Colab: save=True will save the output in the current working directory.
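
For readers unfamiliar with the Pearson correlation used by the 'correlation' option, the sketch below computes it for short expression vectors using only the standard library. The sample values are invented for the illustration and are not ARCHS4 data; ARCHS4 computes the correlation over all of its samples and tissues.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / sqrt(var_x * var_y)


# Invented expression values across four samples (illustration only).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # rises with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]  # falls as gene_a rises

print(pearson(gene_a, gene_b))
print(pearson(gene_a, gene_c))
```

Values close to 1 indicate genes whose expression rises and falls together, which is what the correlation table ranks.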

Examples

gget archs4 -g ACE2
# Jupyter Lab / Google Colab:
gget.archs4("ACE2")

→ Returns the 100 most correlated genes to ACE2 in a data frame:

| gene_symbol | pearson_correlation |
|---|---|
| SLC5A1 | 0.579634 |
| CYP2C18 | 0.576577 |
| . . . | . . . |

gget archs4 -g ACE2 -w tissue
# Jupyter Lab / Google Colab:
gget.archs4("ACE2", which="tissue")

→ Returns the tissue expression of ACE2 in a data frame (by default, human data is used):

| id | min | q1 | median | q3 | max |
|---|---|---|---|---|---|
| System.Urogenital/Reproductive System.Kidney.RENAL CORTEX | 0.113644 | 8.274060 | 9.695840 | 10.51670 | 11.21970 |
| System.Digestive System.Intestine.INTESTINAL EPITHELIAL CELL | 0.113644 | 5.905560 | 9.570450 | 13.26470 | 13.83590 |
| . . . | . . . | . . . | . . . | . . . | . . . |

