Commit 701d6f5

Merge pull request #23 from zbrookle/sql_guide
DOC: Add sql syntax guide
2 parents c50be0f + 7069671 commit 701d6f5

3 files changed

Lines changed: 81 additions & 4 deletions

README.md

Lines changed: 76 additions & 0 deletions
@@ -2,6 +2,7 @@
 
 ![CI](https://github.com/zbrookle/dataframe_sql/workflows/CI/badge.svg)
 
+
 ## Installation
 
 ```bash
@@ -41,6 +42,81 @@ FALSE_SERIES = Series(data=[False for _ in range(0, dataframe_size)]))
 NONE_SERIES = Series(data=[None for _ in range(0, dataframe_size)]))
 ```
 
+### SQL Syntax
+The sql syntax for dataframe_sql is as follows:
+
+Select statement:
+
+```SQL
+SELECT [{ ALL | DISTINCT }]
+{ [ <expression> ] | <expression> [ [ AS ] <alias> ] } [, ...]
+[ FROM <from_item> [, ...] ]
+[ WHERE <bool_expression> ]
+[ GROUP BY { <expression> [, ...] } ]
+[ HAVING <bool_expression> ]
+```
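To make the clause order in the select statement concrete, here is a small sketch that runs a query of that shape. It uses Python's built-in `sqlite3` module as a stand-in engine, not dataframe_sql itself, so it only illustrates the `SELECT ... FROM ... WHERE ... GROUP BY ... HAVING` structure. The table `avocado_sales` and its columns are hypothetical.

```python
import sqlite3

# Stand-in engine: sqlite3 is used only to show the clause order,
# it is NOT how dataframe_sql executes queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE avocado_sales (region TEXT, price REAL)")
conn.executemany(
    "INSERT INTO avocado_sales VALUES (?, ?)",
    [("West", 1.0), ("West", 3.0), ("East", 2.0), ("East", 0.5), ("South", 0.25)],
)

rows = conn.execute(
    """
    SELECT region, SUM(price) AS total_price
    FROM avocado_sales
    WHERE price > 0.4            -- row filter, applied before grouping
    GROUP BY region
    HAVING SUM(price) > 3.0      -- group filter, applied after aggregation
    """
).fetchall()
conn.close()

# Sort so the result does not depend on the engine's group ordering.
result = sorted(rows)
print(result)
```

`WHERE` discards the `South` row before any grouping happens, while `HAVING` discards the `East` group only after its total has been computed.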
+
+Set operations:
+
+```SQL
+<select_statement1>
+{UNION [DISTINCT] | UNION ALL | INTERSECT [DISTINCT] | EXCEPT [DISTINCT] | EXCEPT ALL}
+<select_statement2>
+```
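The difference between the `DISTINCT` and `ALL` variants is easiest to see on rows that contain duplicates. The sketch below models each operator on plain Python lists, assuming dataframe_sql follows standard SQL set semantics (this commit does not confirm that). The helper names are hypothetical, and results are sorted only to make them easy to compare.

```python
from collections import Counter


def union_distinct(left, right):
    """UNION [DISTINCT]: rows from either input, duplicates removed."""
    return sorted(set(left) | set(right))


def union_all(left, right):
    """UNION ALL: rows from both inputs, duplicates kept."""
    return sorted(list(left) + list(right))


def intersect_distinct(left, right):
    """INTERSECT [DISTINCT]: rows present in both inputs, duplicates removed."""
    return sorted(set(left) & set(right))


def except_distinct(left, right):
    """EXCEPT [DISTINCT]: distinct left rows that never appear on the right."""
    return sorted(set(left) - set(right))


def except_all(left, right):
    """EXCEPT ALL: subtract row counts, so surplus left duplicates survive."""
    remaining = Counter(left) - Counter(right)
    return sorted(remaining.elements())


left = [1, 1, 2, 3]
right = [1, 3, 4]

print(union_distinct(left, right))      # [1, 2, 3, 4]
print(union_all(left, right))           # [1, 1, 1, 2, 3, 3, 4]
print(intersect_distinct(left, right))  # [1, 3]
print(except_distinct(left, right))     # [2]
print(except_all(left, right))          # [1, 2]
```

Note how `EXCEPT ALL` keeps one copy of `1` because the left side has two and the right side has only one, whereas `EXCEPT [DISTINCT]` drops `1` entirely.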
+
+Joins:
+
+```SQL
+INNER, CROSS, FULL OUTER, LEFT OUTER, RIGHT OUTER, FULL, LEFT, RIGHT
+```
+
+Order by and limit:
+
+```SQL
+<set>
+[ORDER BY <expression>]
+[LIMIT <number>]
+```
+
+Supported expressions and functions:
+```SQL
++, -, *, /
+```
+```SQL
+CASE WHEN <condition> THEN <result> [WHEN ...] ELSE <result> END
+```
+```SQL
+SUM, AVG, MIN, MAX
+```
+```SQL
+{RANK | DENSE_RANK} OVER([PARTITION BY (<expression> [, <expression>...])])
+```
+```SQL
+CAST (<expression> AS <data_type>)
+```
+*Anything in <> is meant to be some string <br>
+*Anything in [] is optional <br>
+*Anything in {} is grouped together
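`RANK` and `DENSE_RANK` differ only in how they number rows after a tie. The following sketch shows that difference in plain Python, ranking within each partition. It assumes the standard SQL meaning of both functions and an ascending order on the ranked value; how dataframe_sql chooses the ordering is not shown in this commit, and `ranks_by_partition` is a hypothetical helper.

```python
from collections import defaultdict


def ranks_by_partition(rows, dense=False):
    """Rank each (partition_key, value) row within its own partition.

    dense=False mimics RANK: ties share a rank and the next rank skips ahead.
    dense=True mimics DENSE_RANK: ties share a rank and no rank is skipped.
    """
    partitions = defaultdict(list)
    for key, value in rows:
        partitions[key].append(value)

    result = []
    for key, value in rows:
        values = partitions[key]
        # RANK counts every smaller row; DENSE_RANK counts smaller distinct values.
        basis = sorted(set(values)) if dense else sorted(values)
        result.append(basis.index(value) + 1)
    return result


rows = [("West", 10), ("West", 20), ("West", 10),
        ("East", 5), ("East", 5), ("East", 7)]

print(ranks_by_partition(rows))              # [1, 3, 1, 1, 1, 3]
print(ranks_by_partition(rows, dense=True))  # [1, 2, 1, 1, 1, 2]
```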
+
+### Supported Data Types for cast expressions include:
+* VARCHAR, STRING
+* INT16, SMALLINT
+* INT32, INT
+* INT64, BIGINT
+* FLOAT16
+* FLOAT32
+* FLOAT, FLOAT64
+* BOOL
+* DATETIME64, TIMESTAMP
+* CATEGORY
+* OBJECT
+
+*dataframe_sql accepts several names for certain data types because
+popular SQL data types are not implemented under their common names in pandas and other
+dataframe frameworks
+<br>
+**To make this less confusing, all data types that are the same size on the
+backend are grouped together in this list
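One way to read the list above is as a set of alias groups, where every name on a line refers to the same backend type. The sketch below encodes exactly those groups and checks whether two names are interchangeable. The names `CAST_TYPE_GROUPS`, `ALIAS_TO_GROUP`, and `same_backend_type` are hypothetical and are not part of dataframe_sql; the case-insensitive lookup mirrors the grammar's `"string"i` style literals.

```python
# Each tuple is one line of the README list: names that share a backend type.
CAST_TYPE_GROUPS = [
    ("VARCHAR", "STRING"),
    ("INT16", "SMALLINT"),
    ("INT32", "INT"),
    ("INT64", "BIGINT"),
    ("FLOAT16",),
    ("FLOAT32",),
    ("FLOAT", "FLOAT64"),
    ("BOOL",),
    ("DATETIME64", "TIMESTAMP"),
    ("CATEGORY",),
    ("OBJECT",),
]

# Map every accepted name to the group it belongs to.
ALIAS_TO_GROUP = {name: group for group in CAST_TYPE_GROUPS for name in group}


def same_backend_type(first, second):
    """Return True if both type names fall on the same line of the list.

    Lookup is case-insensitive; an unknown name raises KeyError.
    """
    return ALIAS_TO_GROUP[first.upper()] == ALIAS_TO_GROUP[second.upper()]


print(same_backend_type("varchar", "string"))  # True
print(same_backend_type("int", "bigint"))      # False
```

This matches the test change in this commit, where `cast(region as varchar)` and `cast(region as string)` are both expected to produce the same pandas `"string"` dtype.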
 
 ## Issues that come from Pandas
 
dataframe_sql/grammar/sql.grammar

Lines changed: 1 addition & 2 deletions
@@ -25,8 +25,6 @@ groupby_expr: expression -> group_by
 
 window_expr: [window_expr ","] _window_name "AS"i ( window_definition )
 
-SET_OP: "UNION"i [ ("ALL"i | "DISTINCT"i) ] | "INTERSECT"i "DISTINCT"i | "EXCEPT"i "DISTINCT"i
-
 from_item: NAME [ [ "AS"i ] alias ] -> table
 | join -> join
 | ( "(" query_expr ")" ) [ [ "AS"i ] alias ] -> subquery
@@ -85,6 +83,7 @@ TYPENAME: "object"i
 | "datetime64"i
 | "timestamp"i
 | "category"i
+| "string"i
 AGGREGATION.8: "sum"i | "avg"i | "min"i | "max"i
 alias: NAME -> alias_string
 _window_name: NAME

dataframe_sql/tests/pandas_sql_functionality_test.py

Lines changed: 4 additions & 2 deletions
@@ -1367,7 +1367,8 @@ def test_sql_data_types():
 cast(avocado_id as category) as avocado_id_category,
 cast(date as datetime64) as date,
 cast(date as timestamp) as time,
-cast(region as varchar) as region_varchar
+cast(region as varchar) as region_varchar,
+cast(region as string) as region_string
 from avocado
 """
 )
@@ -1389,6 +1390,7 @@ def test_sql_data_types():
     pandas_frame["date"] = pandas_frame["Date"].astype("datetime64")
     pandas_frame["time"] = pandas_frame["Date"].astype("datetime64")
     pandas_frame["region_varchar"] = pandas_frame["region"].astype("string")
+    pandas_frame["region_string"] = pandas_frame["region"].astype("string")
     pandas_frame = pandas_frame.drop(columns=["avocado_id", "Date", "region"])
 
     tm.assert_frame_equal(pandas_frame, my_frame)
@@ -1453,6 +1455,6 @@ def test_boolean_order_of_operations_with_parens():
 if __name__ == "__main__":
     register_env_tables()
 
-    test_boolean_order_of_operations_with_parens()
+    test_sql_data_types()
 
     remove_env_tables()
