sqldf for pandas

Overview

pandasql

pandasql allows you to query pandas DataFrames using SQL syntax. It works similarly to sqldf in R. pandasql seeks to provide a more familiar way of manipulating and cleaning data for people new to Python or pandas.

Installation

$ pip install -U pandasql

Basics

The main function used in pandasql is sqldf. sqldf accepts two parameters:

  • a SQL query string
  • a set of session/environment variables (locals() or globals())

Specifying locals() or globals() can get tedious. You can define a short helper function to fix this.

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
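
A minimal end-to-end sketch (the df DataFrame and its column names here are made up for illustration):

import pandas as pd
from pandasql import sqldf

# a hypothetical DataFrame, just for illustration
df = pd.DataFrame({"name": ["Joe", "Bill", "Bob"], "weight": [70, 85, 75]})

# pass the environment explicitly on each call ...
print(sqldf("SELECT name, weight FROM df WHERE weight > 72;", locals()))

# ... or use the helper defined above
pysqldf = lambda q: sqldf(q, globals())
print(pysqldf("SELECT name FROM df;"))

Note that sqldf only sees DataFrames present in the dictionary you pass, so a helper built around globals() will not find a DataFrame created inside a function; in that case pass that function's locals() (or an explicit dict such as {'df': df}) directly at the call site.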

Querying

pandasql uses SQLite syntax. Any pandas DataFrames will be automatically detected by pandasql. You can query them as you would any regular SQL table.

$ python
>>> from pandasql import sqldf, load_meat, load_births
>>> pysqldf = lambda q: sqldf(q, globals())
>>> meat = load_meat()
>>> births = load_births()
>>> print pysqldf("SELECT * FROM meat LIMIT 10;").head()
                  date  beef  veal  pork  lamb_and_mutton broilers other_chicken turkey
0  1944-01-01 00:00:00   751    85  1280               89     None          None   None
1  1944-02-01 00:00:00   713    77  1169               72     None          None   None
2  1944-03-01 00:00:00   741    90  1128               75     None          None   None
3  1944-04-01 00:00:00   650    89   978               66     None          None   None
4  1944-05-01 00:00:00   681   106  1029               78     None          None   None

Joins and aggregations are also supported:

>>> q = """SELECT
        m.date, m.beef, b.births
     FROM
        meats m
     INNER JOIN
        births b
           ON m.date = b.date;"""
>>> joined = pyqldf(q)
>>> print joined.head()
                    date    beef  births
403  2012-07-01 00:00:00  2200.8  368450
404  2012-08-01 00:00:00  2367.5  359554
405  2012-09-01 00:00:00  2016.0  361922
406  2012-10-01 00:00:00  2343.7  347625
407  2012-11-01 00:00:00  2206.6  320195
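
For comparison, roughly the same join written directly in pandas (a sketch; it assumes the meat and births frames loaded above):

# inner-join the two frames on their shared date column
joined_pd = meat.merge(births, on="date", how="inner")
print(joined_pd[["date", "beef", "births"]].head())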

>>> q = "select
           strftime('%Y', date) as year
           , SUM(beef) as beef_total
           FROM
              meat
           GROUP BY
              year;"
>>> print pysqldf(q).head()
   year  beef_total
0  1944        8801
1  1945        9936
2  1946        9010
3  1947       10096
4  1948        8766
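
The same aggregation in plain pandas would look roughly like this (a sketch; it assumes load_meat() parsed the date column as datetimes):

beef_by_year = (
    meat.assign(year=meat["date"].dt.year)      # pull the year out of each date
        .groupby("year")["beef"].sum()
        .reset_index(name="beef_total")
)
print(beef_by_year.head())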

More information and code samples are available in the examples folder or on our blog.

Comments
  • window Function - Row_Number not working

    Hi, I am trying to use window functions but it seems they are not supported? On checking, I found out that SQLite recently enabled them. Do you think you can include them as well?

    opened by saketdabi1385 17
  • PandaSQL won't run inside main()

    Hi there,

    I noticed that when I attempt to call pandasql inside main(), the script throws an error, no matter if I use globals() or locals(). I also tried to define qdf() outside main(), to no avail.

    Code

    # encoding=utf-8
    
    # pandasql_test.py
    
    # description:
    # test for pandasql functionality
    # inside a main() function
    
    # imports
    import pandas as pd
    import pandasql as psql
    
    # main function
    def main():
    
        # abbreviate pandasql functionality
        qdf = lambda q: psql.sqldf(q, locals())
    
        # create data
        df = pd.DataFrame({
            'name': ['Joe', 'Bill', 'Bob'],
            'weight': [70, 85, 75]
        })
    
        q = 'select * from df'
        print qdf(q)
    
    if __name__ == '__main__':
        main()
    

    Error pandasql.sqldf.PandaSQLException: (sqlite3.OperationalError) no such table: df [SQL: 'select * from df']

    Thanks in advance for any advice!

    opened by j0ma 13
  • pandasql does not work in a function

    Hi, I made a function like the one below to query a pandas DataFrame, but it tells me 'Exception: UNECE_csv not found'. I cannot understand where the problem is, since if I run the same script outside the function it works perfectly.

    def import_csv_and_query():
        import pandas as pd
        from pandasql import sqldf

        filepath = 'C:\\DATA\\Work\\.....csv'
        UNECE_csv = pd.read_csv(filepath, dtype=object, na_filter=False)

        q = """
        MY LONG QUERY
        """

        UNECE_clean = sqldf(q, globals())

        UNECE_columns = UNECE_clean.columns.tolist()
        UNECE_array = UNECE_clean.values.tolist()
        UNECE = [UNECE_columns] + UNECE_array
        return UNECE
    
    opened by cardosan 9
  • Nested SELECT doesn't work?

    Nested SELECTs don't appear to be recognised - I assume the bracketed term is not being passed as a literal query, rather the pandasql parser is trying to look it up as a variable?

    import itertools
    import numpy as np
    import pandas as pd
    from pandasql import sqldf

    directorates=['Community Wellbeing & Social Care',
                  'Childrens Services',
                  'Economy & Environment',
                  'Resources',
                  'Corporate']
    expenseTypes=['Accommodation Costs',
                  'Payment to Private Contractors',
                  'Operational Equipment',
                  'Professional Services']
    a=list(itertools.product(directorates,expenseTypes))
    unzipa=[t for t in zip(*a)]
    df_long=pd.DataFrame({'directorates':unzipa[0],
                          'expense_types':unzipa[1],
                          'total':np.random.randint(0,20000,len(directorates)*len(expenseTypes))
                          })

    pysqldf = lambda q: sqldf(q, globals())
    
    q='''
    SELECT directorates FROM (SELECT * FROM df_long LIMIT 3) tbl;
    '''
    pysqldf(q)
    
    ---------------------------------------------------------------------------
    Exception                                 Traceback (most recent call last)
    <ipython-input-15-7c9285cfa654> in <module>()
          2 SELECT directorates FROM (SELECT * FROM df_long LIMIT 3) tbl;
          3 '''
    ----> 4 pysqldf(q)
    
    <ipython-input-5-6a58ed7de60b> in <lambda>(q)
          1 import pandas as pd
          2 from pandasql import sqldf
    ----> 3 pysqldf = lambda q: sqldf(q, globals())
    
    /usr/local/lib/python3.4/dist-packages/pandasql/sqldf.py in sqldf(q, env, inmemory)
        116             if not inmemory :
        117                 os.remove(dbname)
    --> 118             raise Exception("%s not found" % table)
        119         df = env[table]
        120         df = _ensure_data_frame(df, table)
    
    Exception: (SELECT * FROM df_long LIMIT 3) not found
    
    opened by psychemedia 8
  • Table 'meat' already exists

    Hello,

    I am quite interested in this library, since I use pandas in my workflow every day. However, I come across the following error when running the basic tutorial code on the front page:

    PS Y:\> ipython
    Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)]
    Type "copyright", "credits" or "license" for more information.
    
    IPython 0.13.1 -- An enhanced Interactive Python.
    ?         -> Introduction and overview of IPython's features.
    %quickref -> Quick reference.
    help      -> Python's own help system.
    object?   -> Details about 'object', use 'object??' for extra details.
    
    In [1]: from pandasql import sqldf, load_meat, load_births
    
    In [2]: pysqldf = lambda q: sqldf(q, globals())
    
    In [3]: meat = load_meat()
    
    In [4]: births = load_births()
    
    In [5]: print pysqldf("SELECT * FROM meat LIMIT 10;").head()
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-5-0c7b33a2459f> in <module>()
    ----> 1 print pysqldf("SELECT * FROM meat LIMIT 10;").head()
    
    <ipython-input-2-6245773871e5> in <lambda>(q)
    ----> 1 pysqldf = lambda q: sqldf(q, globals())
    
    C:\Python27\lib\site-packages\pandasql\sqldf.pyc in sqldf(q, env)
         60                 raise Exception("%s not found" % table)
         61             df = env[table]
    ---> 62             _write_table(table, df, conn)
         63         try:
         64             result = frame_query(q, conn)
    
    C:\Python27\lib\site-packages\pandasql\sqldf.pyc in _write_table(tablename, df, conn)
         32             raise Exception(msg)
         33
    ---> 34     write_frame(df, name=tablename, con=conn, flavor='sqlite')
         35
         36
    
    C:\Python27\lib\site-packages\pandas\io\sql.pyc in write_frame(frame, name, con, flavor, if_exists, **kwargs)
        185     exists = table_exists(name, con, flavor)
        186     if if_exists == 'fail' and exists:
    --> 187         raise ValueError, "Table '%s' already exists." % name
        188
        189     #create or drop-recreate if necessary
    
    ValueError: Table 'meat' already exists.
    
    In [6]:
    

    I am running Python 2.7 64 bit on Windows 7. My pandas version is 0.10.1 and all the dependencies are up to date:

    PS Y:\> pip install -U pandas
    Requirement already up-to-date: pandas in c:\python27\lib\site-packages
    Requirement already up-to-date: python-dateutil in c:\python27\lib\site-packages (from pandas)
    Requirement already up-to-date: pytz in c:\python27\lib\site-packages (from pandas)
    Requirement already up-to-date: numpy>=1.6.1 in c:\python27\lib\site-packages (from pandas)
    Requirement already up-to-date: six in c:\python27\lib\site-packages (from python-dateutil->pandas)
    Cleaning up...
    

    Any ideas of what I can try doing differently?

    opened by colindickson 6
  • Multiple backends, better interface and others

    I don't see how to split the pull request into multiple ones, so all the changes are here in one place:

    • use SQLAlchemy and support Postgres (other engines can easily be added)
    • specifying variable environment is optional, by default it assumes locals() and globals() together
    • new class-based interface: allows not to specify database each time
    • cleaner interface: docstring, custom exception, refactoring
    • fix datetime handling
    • don't drop column named 'index'
    • drop implicit automatic conversion to DataFrames: it is easy for the user to call the pd.DataFrame constructor, so why the magic?
    • testing uses pytest and tests for all supported backends
    • add asserts for actual content in tests, not length only
    • demo fixed to work with modern pandas and python versions
    • add dependencies to setup.py

    Hope you will find some time to look at them and integrate some into the base version.

    opened by aplavin 3
  • exception treatment changed

    When I try to import pandasql the system prints this error:

    from pandasql import sqldf
      File "/opt/python34/lib/python3.4/site-packages/pandasql/__init__.py", line 1, in <module>
        from .sqldf import sqldf
      File "/opt/python34/lib/python3.4/site-packages/pandasql/sqldf.py", line 117
        except Exception, e:
                        ^
    SyntaxError: invalid syntax
    

    If I remove the ', e', the library works well.

    Version INSTALLED: 0.6.1 (latest)

    opened by xmnlab 3
  • except Exception, e:

    When I try to import pandasql the system prints this error:

    from pandasql import sqldf
      File "/opt/python34/lib/python3.4/site-packages/pandasql/__init__.py", line 1, in <module>
        from .sqldf import sqldf
      File "/opt/python34/lib/python3.4/site-packages/pandasql/sqldf.py", line 117
        except Exception, e:
                        ^
    SyntaxError: invalid syntax
    

    If I remove the ', e', the library works well.

    Version INSTALLED: 0.6.1 (latest)

    opened by xmnlab 3
  • sqldf returning None if SELECT clause is not *

    Hi Greg,

    I cloned the most recent versions of pandas and pandasql today from GitHub and am noticing some fishy behavior in the output of my sqldf select statements: When I use anything other than '*' as my SELECT clause, sqldf is returning None.

    # sqldf_select_issue.py
    import pandas as pd
    from pandasql import sqldf as _sqldf
    
    def sqldf(query):
        return _sqldf(query,globals())
    
    msg_data = pd.read_csv('sample_message_text.tsv',index_col=None)[['response','init_gender','thread_count']]
    
    tests = []
    
    tests.append("""
    select *
    from msg_data
    limit 10""")
    
    tests.append("""
    select init_gender
    from msg_data
    limit 10""")
    
    tests.append("""
    select *
    from msg_data
    where init_gender = 'male'
    limit 10""")
    
    tests.append("""
    select sum(thread_count) as total_threads
    from msg_data""")
    
    for query in tests:
        print("~"*30)
        output = sqldf(query)
        print(output)
        print(type(output))    
    

    Here is the output I get when running the above code:

    Ravdek@RAVDEK-WIN7 ~/python_files/message_text_stats
    $ python sqldf_select_issue.py
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       response init_gender  thread_count
    0         0      female         74022
    1         0      female            24
    2         0      female             6
    3         0      female             1
    4         0      female             2
    5         0      female             2
    6         0      female             1
    7         0      female             2
    8         0      female            18
    9         0        male        774894
    <class 'pandas.core.frame.DataFrame'>
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    None
    <type 'NoneType'>
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       response init_gender  thread_count
    0         0        male        774894
    1         0        male           237
    2         0        male             1
    3         0        male           192
    4         0        male            10
    5         0        male             4
    6         0        male            59
    7         0        male           198
    8         0        male           435
    9         0        male            47
    <class 'pandas.core.frame.DataFrame'>
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    None
    <type 'NoneType'>
    

    If it's at all relevant, I'm working with Cygwin on Windows.

    opened by RAvdek 3
  • Working with list version 02 rebased

    Hello,

    Please cancel my pull request from yesterday; I had totally screwed up my git tree.

    Here is a new, cleaner one, with the "one column" scenario mostly solved. On my PC, a DataFrame of one column of integers is not working properly:

    x = pd.DataFrame([1, 3], columns=['toto'])
    sqldf("select * from x", locals())

    opened by stonebig 3
  • giving the possibility to use sqlite ":memory:" as pandasql back-end

    Hello,

    Pandasql is a really nice piece of code. If I understood correctly, it uses a temporary SQLite database "on disk".

    Would it be possible to add the option to use SQLite ":memory:" rather than "on disk"? Is there a hidden problem with ":memory:" that prevented you from defaulting to it?

    opened by stonebig 3
  • Error in Reading External .SQL File

    Hello there,
    I wanted to do a data transformation in SQL using pandasql and store the query result to df_final. This is the code I used:

    sqlkey = lambda q : sqldf(q,globals())
    query = open('query.sql', 'r')
    df_final = sqlkey(query)
    

    And query.sql is an external .SQL file located in the same directory as my notebook. I thought the code above would run the external SQL file and store the result in df_final, but instead I got an error that said: TypeError: expected string or bytes-like object
    How should I solve this issue? Thanks in advance!

    opened by matthewfarant 0
  • doesn't work in Flask

    @app.route("/query", methods= ["GET", 'POST']) def query(): pysqldf = lambda q: sqldf(q, locals()) csv_data = pd.read_csv('./for_download/for_query.csv', index_col=[0]) query = "select * from csv_data" trows = pysqldf(query)

    I can't get the table in this way, and face this error: pandasql.sqldf.PandaSQLException: (sqlite3.OperationalError) no such table: csv_data

    opened by mpourvali 1
  • Bool columns are returned as integer

    sqldf returns boolean columns as integer

    # Import libraries
    import pandas as pd
    from pandasql import sqldf
    
    # Create a dummy pd.DataFrame
    df = pd.DataFrame({'col1': ['A', 'B','B', 'A', 'A'], 'col':[True, False, True, True, True]})
    
    query = """
    SELECT *
    FROM df
    WHERE col1 == 'A';
    """
    
    df_view = sqldf(query, locals())
    df_view
    
    opened by julioevi 0
  • More descriptive exceptions?

    I want my program to error only on SQL syntax errors; however, it also errors on empty queries, non-existent columns, rows, ids, etc. The best solution to this would be to use try/except, but the exceptions aren't descriptive enough for that to work.

    I feel like better exceptions might solve other, unforeseen problems as well, so maybe it's something that should be improved?

    opened by jmtekin 0
  • Vertical output for result.

    Hi! Is it possible to have some results using vertical output? It seems SQLite supports it. https://database.guide/display-sqlite-query-results-using-vertical-output/

    opened by olegans1972 0
  • Right Join not available like merge command has

    I need to extract those records that are not matched based on the primary key. Also, we don't have any indication like the pandas merge command's indicator=True, with which we are able to extract those rows that are not matched.

    opened by muhammadumerjaved44 0
Owner
yhat