Python HDFS client



Because the world needs yet another way to talk to HDFS from Python.


This library provides a Python client for WebHDFS. NameNode HA is supported by passing in both NameNodes. Responses are returned as nice Python classes, and any failed operation will raise some subclass of HdfsException matching the Java exception.

Example usage:

>>> fs = pyhdfs.HdfsClient(hosts=',', user_name='someone')
>>> fs.list_status('/')
[FileStatus(pathSuffix='benchmarks', permission='777', type='DIRECTORY', ...), FileStatus(...), ...]
>>> fs.listdir('/')
['benchmarks', 'hbase', 'solr', 'tmp', 'user', 'var']
>>> fs.mkdirs('/fruit/x/y')
True
>>> fs.create('/fruit/apple', 'delicious')
>>> fs.append('/fruit/apple', ' food')
>>> with contextlib.closing(fs.open('/fruit/apple')) as f:
...     f.read()
...
b'delicious food'
>>> fs.get_file_status('/fruit/apple')
FileStatus(length=14, owner='someone', type='FILE', ...)
>>> fs.get_file_status('/fruit/apple').owner
'someone'
>>> fs.get_content_summary('/fruit')
ContentSummary(directoryCount=3, fileCount=1, length=14, quota=-1, spaceConsumed=14, spaceQuota=-1)
>>> list(fs.walk('/fruit'))
[('/fruit', ['x'], ['apple']), ('/fruit/x', ['y'], []), ('/fruit/x/y', [], [])]
>>> fs.exists('/fruit/apple')
True
>>> fs.delete('/fruit')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../", line 525, in delete
pyhdfs.HdfsPathIsNotEmptyDirectoryException: `/fruit is non empty': Directory is not empty
>>> fs.delete('/fruit', recursive=True)
True
>>> fs.exists('/fruit/apple')
False
>>> issubclass(pyhdfs.HdfsFileNotFoundException, pyhdfs.HdfsIOException)
True

The methods and return values generally map directly to WebHDFS endpoints. The client also provides convenience methods that mimic Python os methods and HDFS CLI commands (e.g. walk and copy_to_local).
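As a rough illustration of how the os-style convenience methods compose, the sketch below flattens the `(dirpath, dirnames, filenames)` tuples that `walk` yields into full file paths. `iter_file_paths` is a hypothetical helper, not part of pyhdfs; it only assumes a client object with a `walk(top)` method shaped like the example output above.

```python
import posixpath


def iter_file_paths(fs, top):
    """Yield the full HDFS path of every file found under `top`.

    `fs` is assumed to expose walk(top) yielding
    (dirpath, dirnames, filenames) tuples, like os.walk.
    """
    for dirpath, _dirnames, filenames in fs.walk(top):
        for name in filenames:
            # HDFS paths always use forward slashes, so join with posixpath
            # rather than os.path (which would use backslashes on Windows).
            yield posixpath.join(dirpath, name)
```

With a connected client this could be used as `list(iter_file_paths(fs, '/fruit'))`.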

pyhdfs logs all HDFS actions at the INFO level, so turning on INFO level logging will give you a debug record for your application.
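One way to turn this on is through the standard `logging` module. The snippet below is a minimal sketch; the logger name `"pyhdfs"` is an assumption based on the usual convention of naming a logger after its module, and the format string is only an example.

```python
import logging

# Send log records to stderr with a timestamp and the logger name.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Assumption: the library's logger is named after its module, "pyhdfs".
# Setting the level here lets it emit INFO records even if the root
# logger was already configured with a stricter level.
hdfs_logger = logging.getLogger("pyhdfs")
hdfs_logger.setLevel(logging.INFO)
```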

For more information, see the full API docs.


pip install pyhdfs

Python 3 is required.

Development testing

First run x.y.z, which will download, extract, and run the HDFS NN/DN processes in the current directory. (Replace x.y.z with a real version.) Then run the following commands. Note they will create and delete hdfs://localhost/tmp/pyhdfs_test.


python3 -m venv env
source env/bin/activate
pip install -e .
pip install -r dev_requirements.txt
  • Client should return some info when a file is successfully created

    For example, the HDFS server may return a response with headers like this:

    HTTP/1.1 201 Created
    Location: webhdfs://<HOST>:<PORT>/<PATH>
    Content-Length: 0

    I want to get the Location from the response headers; however, client.create does not return anything.

    opened by cosven 7
  • Write error

    Hello. Mkdir and listdir work fine, but create doesn't.

    fs.create('/fruit/apple', 'delicious')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/root/miniconda2/lib/python2.7/site-packages/", line 426, in create
        metadata_response.headers['location'], data=data, **self._requests_kwargs)
      File "/root/miniconda2/lib/python2.7/site-packages/requests/", line 126, in put
        return request('put', url, data=data, **kwargs)
      File "/root/miniconda2/lib/python2.7/site-packages/requests/", line 58, in request
        return session.request(method=method, url=url, **kwargs)
      File "/root/miniconda2/lib/python2.7/site-packages/requests/", line 512, in request
        resp = self.send(prep, **send_kwargs)
      File "/root/miniconda2/lib/python2.7/site-packages/requests/", line 622, in send
        r = adapter.send(request, **kwargs)
      File "/root/miniconda2/lib/python2.7/site-packages/requests/", line 513, in send
        raise ConnectionError(e, request=request)
    requests.exceptions.ConnectionError: HTTPConnectionPool(host='1566bb80c4dc', port=50075): Max retries exceeded with url: /webhdfs/v1/fruit/apple?op=CREATE& (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f644f364510>: Failed to establish a new connection: [Errno -2] Name or service not known',))
    opened by albertoRamon 4
  • requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None))

    Traceback (most recent call last):
      File "D:\Anaconda3\lib\site-packages\urllib3\", line 601, in urlopen
        chunked=chunked)
      File "D:\Anaconda3\lib\site-packages\urllib3\", line 357, in _make_request
        conn.request(method, url, **httplib_request_kw)
      File "D:\Anaconda3\lib\http\", line 1239, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "D:\Anaconda3\lib\http\", line 1285, in _send_request
        self.endheaders(body, encode_chunked=encode_chunked)
      File "D:\Anaconda3\lib\http\", line 1234, in endheaders
        self._send_output(message_body, encode_chunked=encode_chunked)
      File "D:\Anaconda3\lib\http\", line 1065, in _send_output
        self.send(chunk)
      File "D:\Anaconda3\lib\http\", line 986, in send
        self.sock.sendall(data)
    ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host.

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "D:\Anaconda3\lib\site-packages\requests\", line 440, in send
        timeout=timeout
      File "D:\Anaconda3\lib\site-packages\urllib3\", line 639, in urlopen
        _stacktrace=sys.exc_info()[2])
      File "D:\Anaconda3\lib\site-packages\urllib3\util\", line 357, in increment
        raise six.reraise(type(error), error, _stacktrace)
      File "D:\Anaconda3\lib\site-packages\urllib3\packages\", line 685, in reraise
        raise value.with_traceback(tb)
      File "D:\Anaconda3\lib\site-packages\urllib3\", line 601, in urlopen
        chunked=chunked)
      File "D:\Anaconda3\lib\site-packages\urllib3\", line 357, in _make_request
        conn.request(method, url, **httplib_request_kw)
      File "D:\Anaconda3\lib\http\", line 1239, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "D:\Anaconda3\lib\http\", line 1285, in _send_request
        self.endheaders(body, encode_chunked=encode_chunked)
      File "D:\Anaconda3\lib\http\", line 1234, in endheaders
        self._send_output(message_body, encode_chunked=encode_chunked)
      File "D:\Anaconda3\lib\http\", line 1065, in _send_output
        self.send(chunk)
      File "D:\Anaconda3\lib\http\", line 986, in send
        self.sock.sendall(data)
    urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None))

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "D:\workspace\phdfs\", line 144, in <module>
        fs.copy_from_local(parname,"/test/fcst/china/10d_arwpost_sta/near/" + wrflisttime.format("YYYYMMDD") + "/" + parname,overwrite = True)
      File "D:\Anaconda3\lib\site-packages\", line 753, in copy_from_local
        self.create(dest, f, **kwargs)
      File "D:\Anaconda3\lib\site-packages\", line 426, in create
        metadata_response.headers['location'], data=data, **self._requests_kwargs)
      File "D:\Anaconda3\lib\site-packages\requests\", line 126, in put
        return request('put', url, data=data, **kwargs)
      File "D:\Anaconda3\lib\site-packages\requests\", line 58, in request
        return session.request(method=method, url=url, **kwargs)
      File "D:\Anaconda3\lib\site-packages\requests\", line 508, in request
        resp = self.send(prep, **send_kwargs)
      File "D:\Anaconda3\lib\site-packages\requests\", line 618, in send
        r = adapter.send(request, **kwargs)
      File "D:\Anaconda3\lib\site-packages\requests\", line 490, in send
        raise ConnectionError(err, request=request)
    requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host.', None, 10054, None))

    opened by Georege 4
  • BUG: Chinese characters can't be copied to HDFS

    UnicodeEncodeError: 'latin-1' codec can't encode characters in position 2-3: Body ('张三') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

    opened by yiershanxll 3
  • Help me, please. The second run of the function in the script produces an abnormal result

    I am a rookie~~!!

    The following code:

    list_info = [{"tenant": "coco", "hive_path": "/user/open_001_dev", "ftp_path": "/files/prov/001"},
                     {"tenant": "lili", "hive_path": "/user/open_002_dev", "ftp_path": "/files/prov/002"}]
    result = 0
    def hive_content_size():
        global result
        for item in range(2):
            if "hive_path" in list_info[item]:

    The result of the first loop is output normally, but the output of the second loop is abnormal.

    The error report is below:

    ContentSummary(directoryCount=1258, fileCount=3773, length=141829751002, quota=4000000, spaceConsumed=425489253006, spaceQuota=659706976665600)
    Failed to reach to (attempt 3/3)
    Traceback (most recent call last):
      File "/usr/local/python/lib/python3.9/site-packages/urllib3-1.26.4-py3.9.egg/urllib3/", line 445, in _make_request
        six.raise_from(e, None)
      File "<string>", line 3, in raise_from
      File "/usr/local/python/lib/python3.9/site-packages/urllib3-1.26.4-py3.9.egg/urllib3/", line 440, in _make_request
        httplib_response = conn.getresponse()
      File "/usr/local/python/lib/python3.9/http/", line 1347, in getresponse
      File "/usr/local/python/lib/python3.9/http/", line 307, in begin
        version, status, reason = self._read_status()
      File "/usr/local/python/lib/python3.9/http/", line 268, in _read_status
        line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
      File "/usr/local/python/lib/python3.9/", line 704, in readinto
        return self._sock.recv_into(b)
    socket.timeout: timed out
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "/usr/local/python/lib/python3.9/site-packages/requests-2.25.1-py3.9.egg/requests/", line 439, in send
        resp = conn.urlopen(
      File "/usr/local/python/lib/python3.9/site-packages/urllib3-1.26.4-py3.9.egg/urllib3/", line 755, in urlopen
        retries = retries.increment(
      File "/usr/local/python/lib/python3.9/site-packages/urllib3-1.26.4-py3.9.egg/urllib3/util/", line 532, in increment
        raise six.reraise(type(error), error, _stacktrace)
      File "/usr/local/python/lib/python3.9/site-packages/urllib3-1.26.4-py3.9.egg/urllib3/packages/", line 735, in reraise
        raise value
      File "/usr/local/python/lib/python3.9/site-packages/urllib3-1.26.4-py3.9.egg/urllib3/", line 699, in urlopen
        httplib_response = self._make_request(
      File "/usr/local/python/lib/python3.9/site-packages/urllib3-1.26.4-py3.9.egg/urllib3/", line 447, in _make_request
        self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
      File "/usr/local/python/lib/python3.9/site-packages/urllib3-1.26.4-py3.9.egg/urllib3/", line 336, in _raise_timeout
        raise ReadTimeoutError(
    urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='', port=9000): Read timed out. (read timeout=10)
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "/usr/local/python/lib/python3.9/site-packages/PyHDFS-0.3.1-py3.9.egg/pyhdfs/", line 418, in _request
        response = self._requests_session.request(
      File "/usr/local/python/lib/python3.9/site-packages/requests-2.25.1-py3.9.egg/requests/", line 542, in request
        resp = self.send(prep, **send_kwargs)
      File "/usr/local/python/lib/python3.9/site-packages/requests-2.25.1-py3.9.egg/requests/", line 655, in send
        r = adapter.send(request, **kwargs)
      File "/usr/local/python/lib/python3.9/site-packages/requests-2.25.1-py3.9.egg/requests/", line 529, in send
        raise ReadTimeout(e, request=request)
    requests.exceptions.ReadTimeout: HTTPConnectionPool(host='', port=19888): Read timed out. (read timeout=10)
    Traceback (most recent call last):
      File "/home/hadoop/shay/monthly_report/", line 24, in <module>
      File "/home/hadoop/shay/monthly_report/", line 22, in hive_content_size
      File "/usr/local/python/lib/python3.9/site-packages/PyHDFS-0.3.1-py3.9.egg/pyhdfs/", line 633, in get_content_summary
      File "/usr/local/python/lib/python3.9/site-packages/PyHDFS-0.3.1-py3.9.egg/pyhdfs/", line 450, in _get
      File "/usr/local/python/lib/python3.9/site-packages/PyHDFS-0.3.1-py3.9.egg/pyhdfs/", line 442, in _request
    pyhdfs.HdfsNoServerException: Could not use any of the given hosts

    ask for help~~!!!

    opened by qwe55982 2
  • HdfsFileAlreadyExistsException is not implemented?

    Hi! Thanks for your great work. I have noticed that some Exceptions are not implemented right now?

    For example, if I try to upload a file to a path that already exists, Python raises ConnectionError instead of HdfsFileAlreadyExistsException.

    The error message is as follows:

    Traceback (most recent call last):
      File "", line 12, in <module>
        fs.create('/xxx/xxx/images/test.png', data=file)
      File "/home/chiuhongyu/workplace/xxx/venv/lib/python3.6/site-packages/pyhdfs/", line 504, in create
        metadata_response.headers['location'], data=data, **self._requests_kwargs)
      File "/home/chiuhongyu/workplace/xxx/venv/lib/python3.6/site-packages/requests/", line 132, in put
        return request('put', url, data=data, **kwargs)
      File "/home/chiuhongyu/workplace/xxx/venv/lib/python3.6/site-packages/requests/", line 61, in request
        return session.request(method=method, url=url, **kwargs)
      File "/home/chiuhongyu/workplace/xxx/venv/lib/python3.6/site-packages/requests/", line 542, in request
        resp = self.send(prep, **send_kwargs)
      File "/home/chiuhongyu/workplace/xxx/venv/lib/python3.6/site-packages/requests/", line 655, in send
        r = adapter.send(request, **kwargs)
      File "/home/chiuhongyu/workplace/xxx/venv/lib/python3.6/site-packages/requests/", line 498, in send
        raise ConnectionError(err, request=request)
    requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
    opened by james77777778 1
  • Support customized WEBHDFS_PATH

    In the latest version of pyhdfs, the WebHDFS path is set as a constant, '/webhdfs/v1'. This works well in most scenarios, but users may need a custom HTTP URL. For example, users may run their own WebHDFS service using Pylon and access their RESTful server with a custom URL pattern like http://<HOST>:<HTTP_PORT>/webhdfs/api/v2/<PATH>?op=...

    opened by SparkSnail 1
  • TypeError: __new__() got an unexpected keyword argument 'storagePolicy'

    I am using Hadoop 2.6 (with Docker: sudo docker run -i -t sequenceiq/hadoop-docker:2.6.0 /etc/ -bash).

    When I use PyHDFS to call client.list_status, I get this error:

    Traceback (most recent call last):
      File "", line 3, in <module>
      File "...testenv/lib/python3.4/site-packages/", line 428, in list_status
        _json(self._get(path, 'LISTSTATUS', **kwargs))['FileStatuses']['FileStatus']
      File "...testenv/lib/python3.4/site-packages/", line 427, in <listcomp>
        FileStatus(**item) for item in
    TypeError: __new__() got an unexpected keyword argument 'storagePolicy'

    The code:

    from pyhdfs import HdfsClient
    client = HdfsClient(hosts='')

    This issue occurs because the JSON from the server has an extra property, storagePolicy; adding it fixes the error. But I want to know whether this property is a standard property of HDFS/WebHDFS.

    opened by robberphex 1
  • why response assert not empty

    In, line 424

    assert not metadata_response.content

    In my client, I get a non-empty response body when uploading files.

    b'<html>\r\n<head><title>307 Temporary Redirect</title></head>\r\n<body bgcolor="white">\r\n<center><h1>307 Temporary Redirect</h1></center>\r\n<hr><center>nginx/1.13.8</center>\r\n</body>\r\n</html>\r\n'

    This response does not mean the upload failed, and I can successfully upload my files when I delete this line. Why was this line added? Could you please help me figure out this problem?

    opened by SparkSnail 0
  • Support setting webhdfs_path

    In the latest version of pyhdfs, the WebHDFS path is set as a constant, '/webhdfs/v1'. This works well in most scenarios, but users may need a custom HTTP URL. For example, users may run their own WebHDFS service using Pylon and access their RESTful server with a custom URL pattern like http://<HOST>:<HTTP_PORT>/webhdfs/api/v2/<PATH>?op=...

    opened by SparkSnail 0
  • Let pyhdfs access HDFS in a Kerberos environment

    When HDFS requires Kerberos authentication, pyhdfs cannot access HDFS, so it may need a way to supply authentication information. Since it calls the requests module when accessing HDFS, the authentication information could be added there.

    opened by LuckyNemo 0
  • Got a TypeError while appending to a file

      File "/usr/local/lib/python3.6/site-packages/pyhdfs/", line 520, in append
        path, 'APPEND', expected_status=HTTPStatus.TEMPORARY_REDIRECT, **kwargs)
      File "/usr/local/lib/python3.6/site-packages/pyhdfs/", line 466, in _post
        return self._request('post', path, op, expected_status, **kwargs)
      File "/usr/local/lib/python3.6/site-packages/pyhdfs/", line 431, in _request
        _check_response(response, expected_status)
      File "/usr/local/lib/python3.6/site-packages/pyhdfs/", line 933, in _check_response
        remote_exception['message'] = exception_name + ' - ' + remote_exception['message']
    TypeError: must be str, not NoneType

    opened by BingoZ 0
  • can't parse JSON with unprintable characters

    If a weird non-utf file name is created in HDFS, then the client fails when it can't interpret the response as a valid JSON string.

    e.g. it's possible to put a ctrl-r in the file name

    opened by jingw 0
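
One of the reports above hits a UnicodeEncodeError because a str body containing characters outside Latin-1 is passed straight through to the HTTP layer, and the error message itself suggests encoding the body first. The sketch below shows that workaround under stated assumptions: `create_text_file` is a hypothetical helper, not part of pyhdfs, and it only assumes a client whose `create(path, data)` accepts bytes.

```python
def create_text_file(fs, path, text, encoding="utf-8"):
    """Encode `text` to bytes before handing it to fs.create.

    Passing bytes avoids the implicit Latin-1 encoding that the HTTP
    layer applies to str bodies. Returns the number of bytes sent.
    """
    data = text.encode(encoding)
    fs.create(path, data)
    return len(data)
```

With a connected client this might be called as `create_text_file(fs, '/tmp/names.txt', '张三')`; the path here is only an example.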