Run MapReduce jobs on Hadoop or Amazon Web Services

Overview

mrjob: the Python MapReduce library

https://github.com/Yelp/mrjob/raw/master/docs/logos/logo_medium.png

mrjob is a Python 2.7/3.4+ package that helps you write and run Hadoop Streaming jobs.

Stable version (v0.7.4) documentation

Development version documentation

https://travis-ci.org/Yelp/mrjob.png

mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows you to buy time on a Hadoop cluster on an hourly basis. mrjob has basic support for Google Cloud Dataproc (Dataproc) which allows you to buy time on a Hadoop cluster on a minute-by-minute basis. It also works with your own Hadoop cluster.

Some important features:

  • Run jobs on EMR, Google Cloud Dataproc, your own Hadoop cluster, or locally (for testing).
  • Write multi-step jobs (one map-reduce step feeds into the next; see the multi-step sketch below, after the word-count example)
  • Easily launch Spark jobs on EMR or your own Hadoop cluster
  • Duplicate your production environment inside Hadoop
    • Upload your source tree and put it in your job's $PYTHONPATH
    • Run make and other setup scripts
    • Set environment variables (e.g. $TZ)
    • Easily install python packages from tarballs (EMR only)
    • Setup handled transparently by mrjob.conf config file
  • Automatically interpret error logs
  • SSH tunnel to hadoop job tracker (EMR only)
  • Minimal setup
    • To run on EMR, set $AWS_ACCESS_KEY_ID and $AWS_SECRET_ACCESS_KEY
    • To run on Dataproc, set $GOOGLE_APPLICATION_CREDENTIALS
    • No setup needed to use mrjob on your own Hadoop cluster

Installation

pip install mrjob

As of v0.7.0, Amazon Web Services and Google Cloud Services are optional dependencies. To use these, install with the aws and google targets, respectively. For example:

pip install mrjob[aws]
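
And similarly for the Google Cloud dependencies, using the google extra named above:

pip install mrjob[google]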

A Simple Map Reduce Job

Code for this example and more live in mrjob/examples.

"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))


if __name__ == '__main__':
    MRWordFreqCount.run()

Try It Out!

# locally
python mrjob/examples/mr_word_freq_count.py README.rst > counts
# on EMR
python mrjob/examples/mr_word_freq_count.py README.rst -r emr > counts
# on Dataproc
python mrjob/examples/mr_word_freq_count.py README.rst -r dataproc > counts
# on your Hadoop cluster
python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop > counts
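
Multi-step jobs (mentioned in the feature list above) work by overriding steps() and chaining MRStep objects. Here is a minimal sketch following the pattern from mrjob's documentation; it extends the word count to find the most-used word, and the class name MRMostUsedWord is just for illustration:

from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\w']+")


class MRMostUsedWord(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   combiner=self.combiner_count_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word)
        ]

    def mapper_get_words(self, _, line):
        # yield each word in the line
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner_count_words(self, word, counts):
        # optimization: sum the words we've seen so far
        yield (word, sum(counts))

    def reducer_count_words(self, word, counts):
        # send all (num_occurrences, word) pairs to the same reducer
        yield None, (sum(counts), word)

    def reducer_find_max_word(self, _, word_count_pairs):
        # each item of word_count_pairs is (count, word); the max is the most-used word
        yield max(word_count_pairs)


if __name__ == '__main__':
    MRMostUsedWord.run()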

Setting up EMR on Amazon

Setting up Dataproc on Google

Advanced Configuration

To run in other AWS regions, upload your source tree, run make, and use other advanced mrjob features, you'll need to set up mrjob.conf. mrjob looks for its conf file in:

  • The contents of $MRJOB_CONF
  • ~/.mrjob.conf
  • /etc/mrjob.conf

See the mrjob.conf documentation for more information.
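
As a rough illustration (option names vary between mrjob versions, so verify everything below against the mrjob.conf documentation before copying), a conf file for EMR might look something like this:

runners:
  emr:
    # illustrative option names; check your mrjob version's docs
    region: us-west-2
    cmdenv:
      TZ: America/Los_Angeles
    setup:
    - 'export PYTHONPATH=$PYTHONPATH:your-src-code.tar.gz#/'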

Project Links

Reference

More Information

Thanks to Greg Killion (ROMEO ECHO_DELTA) for the logo.

Comments
  • simplify setup/bootstrap cmds/scripts/actions

    mrjob has a lot of options that are designed to upload a file, run a command, or add something to the environment:

    • bootstrap_actions
    • bootstrap_cmds
    • bootstrap_files
    • bootstrap_mrjob
    • bootstrap_python_packages
    • bootstrap_scripts
    • file_upload_args (not available from the command line)
    • python_archives
    • setup_cmds
    • setup_scripts
    • upload_archives
    • upload_files

    There are two main problems with the way things are now:

    • It's confusing
    • Options don't always run in the order you want. For example, if you upgrade Python in bootstrap_cmds, bootstrap_python_packages becomes useless because it runs first (so it'll install packages for the old version of Python).

    I don't have a complete solution, but I imagine something where you simply specify the commands you want to run, possibly referencing local paths or S3/HDFS URIs, and mrjob just does the right thing. We just need a clean way of disambiguating local and remote files.

    We should aim to make this solution the canonical way of doing things in mrjob v0.4.
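
    For a sense of what that could look like, here is a hypothetical mrjob.conf fragment. The unified setup option and the archive#/path interpolation are roughly what later mrjob releases adopted, but the names and syntax here are illustrative:

    runners:
        hadoop:
            setup:
            - export PYTHONPATH=$PYTHONPATH:your-src-tree.tar.gz#/
            - export TZ=UTC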

    Cleanup 
    opened by coyotemarin 44
  • Python 3.3 compatibility (except EMR)

    The goal of this ticket is a full port of mrjob that works for everything except EMR. This includes tests.

    Roadmap:

    • v0.4.2: Last release to support 2.5
    • v0.4.4: Experimental support for Python 3, except EMR
    • v0.5.0: Full support for Python 3.

    This does not include EMR support (#989) or updating docs and examples (#994).

    Feature 
    opened by irskep 27
  • bug/non.streaming.jar -- input output specification

    This implements the initial support for specifying the input and output for the jar. It uses INPUT_MARKER and OUTPUT_MARKER. It is still probably suboptimal and needs tests. Let me know what you think.

    opened by timtadh 24
  • add support for starting an emr job flow with spot instances (requires master branch from hblanks/boto)

    This commit adds support for starting an emr job flow with spot instances. A sample mrjob.conf follows.

    Please note: this commit depends on the master branch from:

    https://github.com/hblanks/boto
    

    A pull request for this branch is out to the main boto repo, at https://github.com/boto/boto/pull/322.

    This commit makes fairly small changes but does demand a certain amount of YAML'ing in order to specify instance groups. If there's anything I can be doing better, please let me know.

    Sample mrjob.conf:

    runners:
        emr:
            emr_instance_groups:
                -
                    count: 1
                    role: MASTER
                    instance_type: m1.small
                    market: SPOT
                    name: "[email protected]"
                    bid_price: "0.20"
    
                -
                    count: 3
                    role: CORE
                    instance_type: c1.medium
                    market: SPOT
                    name: "[email protected]"
                    bid_price: "0.20"
    
    opened by hblanks 24
  • --libjar option

    Currently, using -libjars requires something like this:

    --bootstrap-file myjar.jar --bootstrap-cmd 'cp myjar.jar /home/hadoop/myjar.jar' --hadoop-arg -libjars --hadoop-arg /home/hadoop/myjar.jar
    

    It would be nice if we could do this instead:

    --libjar myjar.jar
    

    mrjob would:

    • Upload the file like the other bootstrap files
    • Copy it to a unique location like /tmp/myjar-235324.jar
    • Add the appropriate -libjars argument to StreamingStep.step_args
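
    For reference, a sketch of how such an option might look in mrjob.conf (later mrjob releases do ship a libjars option, but treat the name and placement here as illustrative and check your version's docs):

    runners:
        emr:
            libjars:
            - myjar.jar
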
    Feature 
    opened by irskep 22
  • Add "migrating from dumbo" section to docs

    Migrating from dumbo (the other MapReduce Python module) should be pretty easy because its mappers and reducers have the same function signature.

    Would be great to have some input from someone who actually uses dumbo so we're not just making stuff up. :)

    Docs 
    opened by coyotemarin 22
  • Fix bug where mkdir_on_hdfs does not work correctly on Hadoop 2.2 / 0.23.

    For 2.x & 0.23, mkdir needs -p to create parent directories:

    http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/FileSystemShell.html#mkdir
    http://hadoop.apache.org/docs/r0.23.10/hadoop-project-dist/hadoop-common/FileSystemShell.html#mkdir

    For 1.x & 0.1x & 0.20, mkdir doesn't need any additional parameters:

    http://hadoop.apache.org/docs/r1.2.1/file_system_shell.html#mkdir

    opened by ShusenLiu 21
  • binary input with manifest file

    This would help with #753 and other attempts to support binary data.

    Say you want to take the US Census TIGER Shapefiles as the input to the first step of your job. You want each mapper to receive exactly one file, and you need to know each file's name so you can match up the .shp and .dbf files in your reducer. And you don't want to write any Java, or include any custom jars.

    I think mrjob could handle it this way:

    • make a manifest file, with the URI of one input file per line.
    • use --jobconf mapred.max.split.size=1 to force Hadoop to pass one line to each mapper.
    • in the wrapper script that we already use for setup commands:
      • read the URI from stdin
      • export jobconf environment variables (e.g. map_input_file) so that the job thinks it's reading directly from the file.
      • use hadoop fs to pipe the file into the wrapped command: set -o pipefail; hadoop fs -cat <uri> | "$@"
    • The job reads directly from self.stdin in its mapper_init() step (see #753).

    All local and inline modes have to do to emulate this is to not split input files.
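
    To make the idea concrete, here is a hypothetical sketch of a job whose mapper receives one manifest line (a URI) and streams the file through hadoop fs -cat; the class name and the byte-counting logic are made up purely for illustration, and it assumes the hadoop CLI is on the PATH:

    import subprocess

    from mrjob.job import MRJob


    class MRFileSizes(MRJob):

        def mapper(self, _, uri):
            # each input line is one URI from the manifest file
            proc = subprocess.Popen(
                ['hadoop', 'fs', '-cat', uri], stdout=subprocess.PIPE)
            num_bytes = sum(len(chunk) for chunk in proc.stdout)
            proc.wait()
            yield uri, num_bytes


    if __name__ == '__main__':
        MRFileSizes.run()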

    Feature 
    opened by coyotemarin 21
  • Integrate typedbytes into mrjob

    Here is a patch to add support for typedbytes to mrjob.

    To enable a class to use typedbytes instead of the textual interface, you just need to specify:

    class MyMRJob(MRJob):
        STREAMING_INTERFACE = MRJob.STREAMING_INTERFACE_TYPED_BYTES

    And then the interface to hadoop streaming will use typedbytes instead of line based formats.

    This is compatible with dumbo, and adds typedbytes as a requirement to MRJob.

    This fixes #430.

    opened by dgleich 21
  • Pig Support for Mrjob

    Issue for pig support opened in https://github.com/Yelp/mrjob/issues/377

    I also have a few other patches in this pull request: https://github.com/Yelp/mrjob/issues/317 and the ability to force-clear the output folder (--force-clear-output-dir). The latter is used only for debugging jobs.

    I need to add integration tests for these. I delayed this for some time due to my fear of testify, but I can add them soon.

    This is an early version, especially the pig support. Feel free to send me suggestions on the code or coding style. I am relatively new to Python but have been using it frequently of late.

    Thanks, Shiv

    opened by sshivaji 20
  • Support all URIs

    s3n:// is the technically correct way for EMR to refer to files on S3, but nowadays, s3:// does the exact same thing.

    (s3:// used to refer to some legacy block format that only HDFS could use; now it's deprecated, and you have to use s3bfs:// instead.)

    We should probably allow users to use s3:// or s3n:// as they prefer, and translate either to s3:// for consistency.

    Feature 
    opened by coyotemarin 20
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.
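
    For context, the kind of check being described looks roughly like this (a generic sketch, not the exact patch submitted in this pull request):

    import os
    import tarfile

    def safe_extractall(tar, path='.'):
        # refuse to extract members that would land outside the target directory
        base = os.path.realpath(path)
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(path, member.name))
            if os.path.commonpath([base, target]) != base:
                raise ValueError('attempted path traversal: %s' % member.name)
        tar.extractall(path)

    # usage:
    # with tarfile.open('archive.tar.gz') as tar:
    #     safe_extractall(tar, 'dest/')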

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
  • docs: Fix a few typos

    There are small typos in:

    • docs/whats-new.rst
    • mrjob/examples/mr_text_classifier.py
    • mrjob/sim.py

    Fixes:

    • Should read refers rather than referse.
    • Should read consistently rather than consistenly.
    • Should read because rather than becaue.

    Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

    opened by timgates42 0
  • Failure to run mrjob on dataproc

    I set the GOOGLE_APPLICATION_CREDENTIALS env variable properly, and am running a simple mrjob with the -r dataproc option. However, it says

    google.api_core.exceptions.Unknown: None Stream removed
    

    While calling

    self.cluster_client.get_cluster()
    

    I see that the v1beta2 API has been deprecated in favor of v1. Does the dataproc plugin need a re-write to adopt it? Anything I can do to help?

    opened by BradHolmes 0
  • Make unpacking archives optional

    We are using mrjob to process WARC files, in a similar manner to the example given in the Writing Jobs guide.

    For our use case, it is crucial that the .gz compressed file is not automatically decompressed before use.

    This PR proposes a new setting that would allow this to be controlled via an unpack_archives option passed to the mrjob runner. This new option defaults to True to maintain the expected default behaviour, while allowing us to set it to False when needed. We have tested this locally and it seems to work just fine.
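
    As a rough sketch of how the proposed option might be used (unpack_archives is the option proposed in this PR, not part of a released mrjob, and its placement under a runner section here is an assumption):

    runners:
        hadoop:
            unpack_archives: false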

    I've attempted to document this new option, as per the contributing guidelines, but I'm not sure I've covered everything. Is there any other documentation I should add?

    opened by anjackson 0