sqldf for pandas

yhat

Last update: Jan 9, 2023

Related tags

Data Containers pandasql

Overview

pandasql

pandasql allows you to query pandas DataFrames using SQL syntax. It works similarly to sqldf in R. pandasql seeks to provide a more familiar way of manipulating and cleaning data for people new to Python or pandas.

Installation

$ pip install -U pandasql

Basics

The main function used in pandasql is sqldf. sqldf accepts 2 parametrs

a sql query string
a set of session/environment variables (locals() or globals())

Specifying locals() or globals() can get tedious. You can define a short helper function to fix this.

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

Querying

pandasql uses SQLite syntax. Any pandas dataframes will be automatically detected by pandasql. You can query them as you would any regular SQL table.

$ python
>>> from pandasql import sqldf, load_meat, load_births
>>> pysqldf = lambda q: sqldf(q, globals())
>>> meat = load_meat()
>>> births = load_births()
>>> print pysqldf("SELECT * FROM meat LIMIT 10;").head()
                  date  beef  veal  pork  lamb_and_mutton broilers other_chicken turkey
0  1944-01-01 00:00:00   751    85  1280               89     None          None   None
1  1944-02-01 00:00:00   713    77  1169               72     None          None   None
2  1944-03-01 00:00:00   741    90  1128               75     None          None   None
3  1944-04-01 00:00:00   650    89   978               66     None          None   None
4  1944-05-01 00:00:00   681   106  1029               78     None          None   None

joins and aggregations are also supported

>>> q = """SELECT
        m.date, m.beef, b.births
     FROM
        meats m
     INNER JOIN
        births b
           ON m.date = b.date;"""
>>> joined = pyqldf(q)
>>> print joined.head()
                    date    beef  births
403  2012-07-01 00:00:00  2200.8  368450
404  2012-08-01 00:00:00  2367.5  359554
405  2012-09-01 00:00:00  2016.0  361922
406  2012-10-01 00:00:00  2343.7  347625
407  2012-11-01 00:00:00  2206.6  320195

>>> q = "select
           strftime('%Y', date) as year
           , SUM(beef) as beef_total
           FROM
              meat
           GROUP BY
              year;"
>>> print pysqldf(q).head()
   year  beef_total
0  1944        8801
1  1945        9936
2  1946        9010
3  1947       10096
4  1948        8766

More information and code samples available in the examples folder or on our blog.

Comments

window Function - Row_Number not working

Hi I am truing to use window functions but it seems they are not supported ? on checking i found out that sqllite recently enabled them. Do you think you can include them as well.

opened by saketdabi1385 17

PandaSQL won't run inside main()

Hi there,

I noticed that when I attempt to call pandasql inside main(), the script throws an error, no matter if I use globals() or locals(). I also tried to define qdf() outside main(), to no avail.

Code

# encoding=utf-8

# pandasql_test.py

# description:
# test for pandasql functionality
# inside a main() function

# imports
import pandas as pd
import pandasql as psql

# main function
def main():

    # abbreviate pandasql functionality
    qdf = lambda q: psql.sqldf(q, locals())

    # create data
    df = pd.DataFrame({
        'name': ['Joe', 'Bill', 'Bob'],
        'weight': [70, 85, 75]
    })

    q = 'select * from df'
    print qdf(q)

if __name__ == '__main__':
    main()

Error pandasql.sqldf.PandaSQLException: (sqlite3.OperationalError) no such table: df [SQL: 'select * from df']

Thanks in advance for any advice!

opened by j0ma 13

pandasql does not work in a function

Hi, I made a function like the one below to query a panda dataframe but it tells me 'Exception: UNECE_csv not found'. Cannot understand where is the prob since if I run the same scripts outside the function it works perfectly.

def import_csv_and_query():

filepath='C:\\DATA\\Work\\.....csv
import pandas as pd
from pandasql import sqldf
UNECE_csv=pd.read_csv(filepath,dtype=object,na_filter=False)
q="""

MY LONG QUERY
"""

UNECE_clean=sqldf(q, globals())

UNECE_colums=UNECE_clean.columns.tolist()
UNECE_array=UNECE_clean.values.tolist()
UNECE=[UNECE_colums]+UNECE_array   
return UNECE

opened by cardosan 9

Nested SELECT doesn't work?

Nested SELECTs don't appear to be recognised - I assume the bracketed term is not being passed as a literal query, rather the pandasql parser is trying to look it up as a variable?

directorates=['Community Wellbeing & Social Care',
              'Childrens Services',
              'Economy & Environment',
              'Resources',
              'Corporate']
expenseTypes=['Accommodation Costs',
              'Payment to Private Contractors',
              'Operational Equipment',
              'Professional Services']
import itertools
a=list(itertools.product(directorates,expenseTypes))
unzipa=[t for t in zip(*a)]
df_long=pd.DataFrame({'directorates':unzipa[0],
                      'expense_types':unzipa[1],
                      'total':np.random.randint(0,20000,len(directorates)*len(expenseTypes))
                      })

import pandas as pd
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

q='''
SELECT directorates FROM (SELECT * FROM df_long LIMIT 3) tbl;
'''
pysqldf(q)

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-15-7c9285cfa654> in <module>()
      2 SELECT directorates FROM (SELECT * FROM df_long LIMIT 3) tbl;
      3 '''
----> 4 pysqldf(q)

<ipython-input-5-6a58ed7de60b> in <lambda>(q)
      1 import pandas as pd
      2 from pandasql import sqldf
----> 3 pysqldf = lambda q: sqldf(q, globals())

/usr/local/lib/python3.4/dist-packages/pandasql/sqldf.py in sqldf(q, env, inmemory)
    116             if not inmemory :
    117                 os.remove(dbname)
--> 118             raise Exception("%s not found" % table)
    119         df = env[table]
    120         df = _ensure_data_frame(df, table)

Exception: (SELECT * FROM df_long LIMIT 3) not found

opened by psychemedia 8

Table 'meat' already exists

Hello,

I am quite interested in this library, since I use pandas in my workflow every day. However, I come across the following error when running the basic tutorial code on the front page:

PS Y:\> ipython
Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 0.13.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from pandasql import sqldf, load_meat, load_births

In [2]: pysqldf = lambda q: sqldf(q, globals())

In [3]: meat = load_meat()

In [4]: births = load_births()

In [5]: print pysqldf("SELECT * FROM meat LIMIT 10;").head()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-0c7b33a2459f> in <module>()
----> 1 print pysqldf("SELECT * FROM meat LIMIT 10;").head()

<ipython-input-2-6245773871e5> in <lambda>(q)
----> 1 pysqldf = lambda q: sqldf(q, globals())

C:\Python27\lib\site-packages\pandasql\sqldf.pyc in sqldf(q, env)
     60                 raise Exception("%s not found" % table)
     61             df = env[table]
---> 62             _write_table(table, df, conn)
     63         try:
     64             result = frame_query(q, conn)

C:\Python27\lib\site-packages\pandasql\sqldf.pyc in _write_table(tablename, df, conn)
     32             raise Exception(msg)
     33
---> 34     write_frame(df, name=tablename, con=conn, flavor='sqlite')
     35
     36

C:\Python27\lib\site-packages\pandas\io\sql.pyc in write_frame(frame, name, con, flavor, if_exists, **kwargs)
    185     exists = table_exists(name, con, flavor)
    186     if if_exists == 'fail' and exists:
--> 187         raise ValueError, "Table '%s' already exists." % name
    188
    189     #create or drop-recreate if necessary

ValueError: Table 'meat' already exists.

In [6]:

I am running Python 2.7 64 bit on Windows 7. My pandas version is 0.10.1 and all the dependencies are up to date:

PS Y:\> pip install -U pandas
Requirement already up-to-date: pandas in c:\python27\lib\site-packages
Requirement already up-to-date: python-dateutil in c:\python27\lib\site-packages (from pandas)
Requirement already up-to-date: pytz in c:\python27\lib\site-packages (from pandas)
Requirement already up-to-date: numpy>=1.6.1 in c:\python27\lib\site-packages (from pandas)
Requirement already up-to-date: six in c:\python27\lib\site-packages (from python-dateutil->pandas)
Cleaning up...

Any ideas of what I can try doing differently?

opened by colindickson 6

Multiple backends, better interface and others
I don't see how to split the pull request into multiple ones, so all the changes are here in one place:

use SQLAlchemy and support Postgres (other engines can easily be added)

specifying variable environment is optional, by default it assumes locals() and globals() together

new class-based interface: allows not to specify database each time

cleaner interface: docstring, custom exception, refactoring

fix datetime handling

don't drop column named 'index'

drop implicit automatic conversion to dataframes: it is easy for user to call pd.DataFrame constructor, so what for the magic?

testing uses pytest and tests for all supported backends

add asserts for actual content in tests, not length only

demo fixed to work with modern pandas and python versions

add dependencies to setup.py

Hope you will find some time to look at them and integrate some into the base version.
opened by aplavin 3

exception treatment changed

When I try to import pandasql the system prints this error:

from pandasql import sqldf
  File "/opt/python34/lib/python3.4/site-packages/pandasql/__init__.py", line 1, in <module>
    from .sqldf import sqldf
  File "/opt/python34/lib/python3.4/site-packages/pandasql/sqldf.py", line 117
    except Exception, e:
                    ^
SyntaxError: invalid syntax

If I remove the ', e', the library works well.

Version INSTALLED: 0.6.1 (latest)

opened by xmnlab 3

except Exception, e:

When I try to import pandasql the system prints this error:

from pandasql import sqldf
  File "/opt/python34/lib/python3.4/site-packages/pandasql/__init__.py", line 1, in <module>
    from .sqldf import sqldf
  File "/opt/python34/lib/python3.4/site-packages/pandasql/sqldf.py", line 117
    except Exception, e:
                    ^
SyntaxError: invalid syntax

If I remove the ', e', the library works well.

Version INSTALLED: 0.6.1 (latest)

opened by xmnlab 3

sqldf returning None if SELECT clause is not *

Hi Greg,

I cloned the most recent versions of pandas and pandasql today from GitHub and am noticing some fishy behavior in the output of my sqldf select statements: When I use anything other than '*' as my SELECT clause, sqldf is returning None.

# sqldf_select_issue.py
import pandas as pd
from pandasql import sqldf as _sqldf

def sqldf(query):
    return _sqldf(query,globals())

msg_data = pd.read_csv('sample_message_text.tsv',index_col=None)[['response','init_gender','thread_count']]

tests = []

tests.append("""
select *
from msg_data
limit 10""")

tests.append("""
select init_gender
from msg_data
limit 10""")

tests.append("""
select *
from msg_data
where init_gender = 'male'
limit 10""")

tests.append("""
select sum(thread_count) as total_threads
from msg_data""")

for query in tests:
    print("~"*30)
    output = sqldf(query)
    print(output)
    print(type(output))

Here is the output I get then running the above code:

Ravdek@RAVDEK-WIN7 ~/python_files/message_text_stats
$ python sqldf_select_issue.py
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   response init_gender  thread_count
0         0      female         74022
1         0      female            24
2         0      female             6
3         0      female             1
4         0      female             2
5         0      female             2
6         0      female             1
7         0      female             2
8         0      female            18
9         0        male        774894
<class 'pandas.core.frame.DataFrame'>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
None
<type 'NoneType'>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   response init_gender  thread_count
0         0        male        774894
1         0        male           237
2         0        male             1
3         0        male           192
4         0        male            10
5         0        male             4
6         0        male            59
7         0        male           198
8         0        male           435
9         0        male            47
<class 'pandas.core.frame.DataFrame'>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
None
<type 'NoneType'>

If it's at all relevant, I'm working with Cygwin on Windows.

opened by RAvdek 3

Working with list version 02 rebased

Hello,

Please cancel my yesterday pull request, I had totally screwed my git tree.

Here is a new cleaner one, with the "one column" scenario mostly solved. On my pc, a dataframe of 1 column of integers is not working properly :

x=pd.DataFrame([1, 3 ], columns=['toto']) sqldf("select * from x",locals())

opened by stonebig 3
giving the possibility to use sqlite ":memory:" as pandasql back-end

Hello,

Pandasql is a really nice piece of code. If I understood well, it uses a temporary sqlite database "on disk".

Would it be possible to add the option to use sqlite in ":memory:" rather than "on disk " ? Is there a hidden problem with ":memory" that prevented you to default to it ?

opened by stonebig 3
Error in Reading External .SQL File
Hello there,
I wanted to do a data transformation in SQL using pandasql and store the query result to df_final. This is the code I used:

sqlkey = lambda q : sqldf(q,globals()) query = open('query.sql', 'r') df_final = sqlkey(query)

And query.sql is an external .SQL file located in the same directory as my notebook. I thought the code above will run the external SQL file and store the result in df_final, but instead I got an error that said: TypeError: expected string or bytes-like object
How should I solve this issue? Thanks in advance!
opened by matthewfarant 0
doesn't work in Flask

@app.route("/query", methods= ["GET", 'POST']) def query(): pysqldf = lambda q: sqldf(q, locals()) csv_data = pd.read_csv('./for_download/for_query.csv', index_col=[0]) query = "select * from csv_data" trows = pysqldf(query)

I can't get the table in this way, and face this error: pandasql.sqldf.PandaSQLException: (sqlite3.OperationalError) no such table: csv_data

opened by mpourvali 1

Bool columns are returned as integer

sqldf returns boolean columns as integer

# Import libraries
import pandas as pd
from pandasql import sqldf

# Create a dummy pd.Dataframe
df = pd.DataFrame({'col1': ['A', 'B','B', 'A', 'A'], 'col':[True, False, True, True, True]})

query = """
SELECT *
FROM df
WHERE col1 == 'A';
"""

df_view = sqldf(query, locals())
df_view

opened by julioevi 0

More descriptive exceptions?

I want my program to only error on sql syntax errors however it also errors on empty queries, non existent columns, rows, ids, etc. The best solution to this would be to use try except, however the exceptions aren't descriptive enough for that to work.

I feel like better exceptions might solve other, unforeseen problems as well so maybe its something that should be improved?

opened by jmtekin 0
Vertical output for result.

Hi! Is it possible to have some results using Vertical Output? It seems SQLlite supports it. https://database.guide/display-sqlite-query-results-using-vertical-output/

opened by olegans1972 0
Right Join not available like merge command has

I need to extract those records that are not matched based on the primary - key. also, we have don't have any indication like pandas merge command has indicator = True on which we are able to extract those rows which are not matched

opened by muhammadumerjaved44 0

Owner

yhat

GitHub

Create HTML profiling reports from pandas DataFrame objects

Pandas Profiling Documentation | Slack | Stack Overflow Generates profile reports from a pandas DataFrame. The pandas df.describe() function is great

10k Jan 1, 2023

NumPy and Pandas interface to Big Data

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar inte

3.1k Jan 1, 2023

Pandas Google BigQuery

pandas-gbq pandas-gbq is a package providing an interface to the Google BigQuery API from pandas Installation Install latest release version via conda

348 Jan 3, 2023

Koalas: pandas API on Apache Spark

pandas API on Apache Spark Explore Koalas docs » Live notebook · Issues · Mailing list Help Thirsty Koalas Devastated by Recent Fires The Koalas proje

3.2k Jan 4, 2023

Modin: Speed up your Pandas workflows by changing a single line of code

Scale your pandas workflows by changing one line of code To use Modin, replace the pandas import: # import pandas as pd import modin.pandas as pd Inst

8.2k Jan 1, 2023

A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner

swifter A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner. Blog posts Release 1.0.0 Fir

2.2k Jan 4, 2023

The easy way to write your own flavor of Pandas

Pandas Flavor The easy way to write your own flavor of Pandas Pandas 0.23 added a (simple) API for registering accessors with Pandas objects. Pandas-f

260 Jan 1, 2023

The goal of pandas-log is to provide feedback about basic pandas operations. It provides simple wrapper functions for the most common functions that add additional logs

pandas-log The goal of pandas-log is to provide feedback about basic pandas operations. It provides simple wrapper functions for the most common funct

206 Dec 13, 2022

PdpCLI is a pandas DataFrame processing CLI tool which enables you to build a pandas pipeline from a configuration file.

PdpCLI Quick Links Introduction Installation Tutorial Basic Usage Data Reader / Writer Plugins Introduction PdpCLI is a pandas DataFrame processing CL

15 Jan 7, 2022

Pandas-method-chaining is a plugin for flake8 that provides method chaining linting for pandas code

pandas-method-chaining pandas-method-chaining is a plugin for flake8 that provides method chaining linting for pandas code. It is a fork from pandas-v

5 May 14, 2022

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

AWS Data Wrangler Pandas on AWS Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretMana

3.3k Jan 4, 2023

NumPy and Pandas interface to Big Data

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar inte

3.1k Jan 5, 2023

NumPy and Pandas interface to Big Data

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar inte

3.1k Jan 1, 2023

Sparkling Pandas

SparklingPandas SparklingPandas aims to make it easy to use the distributed computing power of PySpark to scale your data analysis with Pandas. Sparkl

366 Oct 27, 2022

An end-to-end machine learning web app to predict rugby scores (Pandas, SQLite, Keras, Flask, Docker)

Rugby score prediction An end-to-end machine learning web app to predict rugby scores Overview An demo project to provide a high-level overview of the

34 May 24, 2022

A compendium of useful, interesting, inspirational usage of pandas functions, each example will be an ipynb file

Pandas_by_examples A compendium of useful/interesting/inspirational usage of pandas functions, each example will be an ipynb file What is this reposit

32 Nov 20, 2022

A pandas-like deferred expression system, with first-class SQL support

Ibis: Python data analysis framework for Hadoop and SQL engines Service Status Documentation Conda packages PyPI Azure Coverage Ibis is a toolbox to b

2.3k Jan 6, 2023

Pandas Google BigQuery

pandas-gbq pandas-gbq is a package providing an interface to the Google BigQuery API from pandas Installation Install latest release version via conda

345 Dec 28, 2022

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

AWS Data Wrangler Pandas on AWS Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretMana

3.3k Dec 31, 2022

Mypy stubs, i.e., type information, for numpy, pandas and matplotlib

Mypy type stubs for NumPy, pandas, and Matplotlib This is a PEP-561-compliant stub-only package which provides type information for matplotlib, numpy

194 Dec 19, 2022