simple-sql-parser/Language/SQL/SimpleSQL/Parse.hs at ab82249523cd6a68ca9481b2de8436f89eca56db · JakeWheat/simple-sql-parser · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000

{-
= TOC:

notes
Public api
Names - parsing identifiers
Typenames
Scalar expressions
  simple literals
  star, param
  parens expression, row constructor and scalar subquery
  case, cast, exists, unique, array/ multiset constructor
  typed literal, app, special function, aggregate, window function
  suffixes: in, between, quantified comparison, match predicate, array
    subscript, escape, collate
  operators
  scalar expression top level
  helpers
query expressions
  select lists
  from clause
  other table expression clauses:
    where, group by, having, order by, offset and fetch
  common table expressions
  query expression
  set operations
lexers
utilities

= Notes about the code

The lexers appear at the bottom of the file. There tries to be a clear
separation between the lexers and the other parser which only use the
lexers, this isn't 100% complete at the moment and needs fixing.

== Left factoring

The parsing code is aggressively left factored, and try is avoided as
much as possible. Try is avoided because:

 * when it is overused it makes the code hard to follow
 * when it is overused it makes the parsing code harder to debug
 * it makes the parser error messages much worse

The code could be made a bit simpler with a few extra 'trys', but this
isn't done because of the impact on the parser error
messages. Apparently it can also help the speed but this hasn't been
looked into.

== Parser error messages

A lot of care has been given to generating good parser error messages
for invalid syntax. There are a few utils below which partially help
in this area.

There is a set of crafted bad expressions in ErrorMessages.hs, these
are used to guage the quality of the error messages and monitor
regressions by hand. The use of <?> is limited as much as possible:
each instance should justify itself by improving an actual error
message.

There is also a plan to write a really simple expression parser which
doesn't do precedence and associativity, and the fix these with a pass
over the ast. I don't think there is any other way to sanely handle
the common prefixes between many infix and postfix multiple keyword
operators, and some other ambiguities also. This should help a lot in
generating good error messages also.

Both the left factoring and error message work are greatly complicated
by the large number of shared prefixes of the various elements in SQL
syntax.

== Main left factoring issues

There are three big areas which are tricky to left factor:

 * typenames
 * scalar expressions which can start with an identifier
 * infix and suffix operators

=== typenames

There are a number of variations of typename syntax. The standard
deals with this by switching on the name of the type which is parsed
first. This code doesn't do this currently, but might in the
future. Taking the approach in the standard grammar will limit the
extensibility of the parser and might affect the ease of adapting to
support other sql dialects.

=== identifier scalar expressions

There are a lot of scalar expression nodes which start with
identifiers, and can't be distinguished the tokens after the initial
identifier are parsed. Using try to implement these variations is very
simple but makes the code much harder to debug and makes the parser
error messages really bad.

Here is a list of these nodes:

 * identifiers
 * function application
 * aggregate application
 * window application
 * typed literal: typename 'literal string'
 * interval literal which is like the typed literal with some extras

There is further ambiguity e.g. with typed literals with precision,
functions, aggregates, etc. - these are an identifier, followed by
parens comma separated scalar expressions or something similar, and it
is only later that we can find a token which tells us which flavour it
is.

There is also a set of nodes which start with an identifier/keyword
but can commit since no other syntax can start the same way:

 * case
 * cast
 * exists, unique subquery
 * array constructor
 * multiset constructor
 * all the special syntax functions: extract, position, substring,
  convert, translate, overlay, trim, etc.

The interval literal mentioned above is treated in this group at the
moment: if we see 'interval' we parse it either as a full interval
literal or a typed literal only.

Some items in this list might have to be fixed in the future, e.g. to
support standard 'substring(a from 3 for 5)' as well as regular
function substring syntax 'substring(a,3,5) at the same time.

The work in left factoring all this is mostly done, but there is still
a substantial bit to complete and this is by far the most difficult
bit. At the moment, the work around is to use try, the downsides of
which is the poor parsing error messages.

=== infix and suffix operators

== permissiveness

The parser is very permissive in many ways. This departs from the
standard which is able to eliminate a number of possibilities just in
the grammar, which this parser allows. This is done for a number of
reasons:

 * it makes the parser simple - less variations
 * it should allow for dialects and extensibility more easily in the
  future (e.g. new infix binary operators with custom precedence)
 * many things which are effectively checked in the grammar in the
  standard, can be checked using a typechecker or other simple static
  analysis

To use this code as a front end for a sql engine, or as a sql validity
checker, you will need to do a lot of checks on the ast. A
typechecker/static checker plus annotation to support being a compiler
front end is planned but not likely to happen too soon.

Some of the areas this affects:

typenames: the variation of the type name should switch on the actual
name given according to the standard, but this code only does this for
the special case of interval type names. E.g. you can write 'int
collate C' or 'int(15,2)' and this will parse as a character type name
or a precision scale type name instead of being rejected.

scalar expressions: every variation on scalar expressions uses the same
parser/syntax. This means we don't try to stop non boolean valued
expressions in boolean valued contexts in the parser. Another area
this affects is that we allow general scalar expressions in group by,
whereas the standard only allows column names with optional collation.

These are all areas which are specified (roughly speaking) in the
syntax rather than the semantics in the standard, and we are not
fixing them in the syntax but leaving them till the semantic checking
(which doesn't exist in this code at this time).
-}

{-# LANGUAGE TupleSections #-}
{-# LANGUAGE OverloadedStrings #-}
-- | This is the module with the parser functions.
module Language.SQL.SimpleSQL.Parse
    (parseQueryExpr
    ,parseScalarExpr
    ,parseStatement
    ,parseStatements
    ,ParseError(..)
    ,prettyError
    ,ansi2011
    ) where

import Text.Megaparsec
    (ParsecT
    ,runParserT

    ,ParseErrorBundle(..)
    ,errorBundlePretty
    ,hidden
    ,failure
    ,ErrorItem(..)

    ,(<|>)
    ,token
    ,choice
    ,eof
    ,try
    ,sepBy
    ,sepBy1
    ,optional
    ,option
    ,some
    ,many
    ,between
    ,lookAhead
    )
import qualified Control.Monad.Combinators.Expr as E
import qualified Control.Monad.Permutations as P
import qualified Text.Megaparsec as M

import Control.Monad.Reader
    (Reader
    ,runReader
    ,ask
    ,asks
    )

import qualified Data.Set           as Set
import qualified Data.List.NonEmpty as NE
import Data.Void (Void)

import Control.Monad (guard, void)
import Control.Applicative ((<**>))
import Data.Char (isDigit)
import Data.List (sort,groupBy)
import Data.Function (on)
import Data.Maybe (catMaybes, isJust, mapMaybe, fromMaybe)
import Data.Text (Text)
import qualified Data.Text as T

import Language.SQL.SimpleSQL.Syntax
import Language.SQL.SimpleSQL.Dialect
import qualified Language.SQL.SimpleSQL.Lex as L
--import Text.Megaparsec.Debug (dbg)
import Text.Read (readMaybe)

------------------------------------------------------------------------------

-- = Public API

-- | Parses a query expr, trailing semicolon optional.
parseQueryExpr
    :: Dialect
    -- ^ dialect of SQL to use
    -> Text
    -- ^ filename to use in error messages
    -> Maybe (Int,Int)
    -- ^ line number and column number of the first character
    -- in the source to use in error messages
    -> Text
    -- ^ the SQL source to parse
    -> Either ParseError QueryExpr
parseQueryExpr = wrapParse topLevelQueryExpr

-- | Parses a statement, trailing semicolon optional.
parseStatement
    :: Dialect
    -- ^ dialect of SQL to use
    -> Text
    -- ^ filename to use in error messages
    -> Maybe (Int,Int)
    -- ^ line number and column number of the first character
    -- in the source to use in error messages
    -> Text
    -- ^ the SQL source to parse
    -> Either ParseError Statement
parseStatement = wrapParse topLevelStatement

-- | Parses a list of statements, with semi colons between
-- them. The final semicolon is optional.
parseStatements
    :: Dialect
    -- ^ dialect of SQL to use
    -> Text
    -- ^ filename to use in error messages
    -> Maybe (Int,Int)
    -- ^ line number and column number of the first character
    -- in the source to use in error messages
    -> Text
    -- ^ the SQL source to parse
    -> Either ParseError [Statement]
parseStatements = wrapParse statements

-- | Parses a scalar expression.
parseScalarExpr
    :: Dialect
    -- ^ dialect of SQL to use
    -> Text
    -- ^ filename to use in error messages
    -> Maybe (Int,Int)
    -- ^ line number and column number of the first character
    -- in the source to use in error messages
    -> Text
    -- ^ the SQL source to parse
    -> Either ParseError ScalarExpr
parseScalarExpr = wrapParse scalarExpr

-- Megaparsec is too clever, so have to create a new type to represent
-- either a lex error or a parse error

data ParseError
    = LexError L.ParseError
    | ParseError (ParseErrorBundle L.SQLStream Void)

prettyError :: ParseError -> Text
prettyError (LexError e) = T.pack $ errorBundlePretty e
prettyError (ParseError e) = T.pack $ errorBundlePretty e

{-
This helper function takes the parser given and:

sets the position when parsing
automatically skips leading whitespace
checks the parser parses all the input using eof
converts the error return to the nice wrapper
-}

wrapParse :: Parser a
          -> Dialect
          -> Text
          -> Maybe (Int,Int)
          -> Text
          -> Either ParseError a
wrapParse parser d f p src = do
    lx <- either (Left . LexError) Right $ L.lexSQLWithPositions d True f p src
    either (Left . ParseError) Right $
        runReader (runParserT (parser <* (hidden eof)) (T.unpack f)
                   $ L.SQLStream (T.unpack src) $ filter notSpace lx) d
  where
    notSpace = notSpace' . L.tokenVal
    notSpace' (L.Whitespace {}) = False
    notSpace' (L.LineComment {}) = False
    notSpace' (L.BlockComment {}) = False
    notSpace' _ = True

------------------------------------------------------------------------------

-- parsing code

type Parser = ParsecT Void L.SQLStream (Reader Dialect)

{-
------------------------------------------------

= Names

Names represent identifiers and a few other things. The parser here
handles regular identifiers, dotten chain identifiers, quoted
identifiers and unicode quoted identifiers.

Dots: dots in identifier chains are parsed here and represented in the
Iden constructor usually. If parts of the chains are non identifier
scalar expressions, then this is represented by a BinOp "."
instead. Dotten chain identifiers which appear in other contexts (such
as function names, table names, are represented as [Name] only.

Identifier grammar:

unquoted:
underscore <|> letter : many (underscore <|> alphanum

example
_example123

quoted:

double quote, many (non quote character or two double quotes
together), double quote

"example quoted"
"example with "" quote"

unicode quoted is the same as quoted in this parser, except it starts
with U& or u&

u&"example quoted"
-}

name :: Text -> Parser Name
name lbl = label lbl $ do
    bl <- askDialect diKeywords
    uncurry Name <$> identifierTok bl

-- todo: replace (:[]) with a named function all over

names :: Text -> Parser [Name]
names lbl =
    label lbl (reverse <$> (((:[]) <$> name lbl) `chainrSuffix` anotherName))
  -- can't use a simple chain here since we
  -- want to wrap the . + name in a try
  -- this will change when this is left factored
  where
    anotherName :: Parser ([Name] -> [Name])
    anotherName = try ((:) <$> (hidden (symbol "." *> name lbl)))

{-
= Type Names

Typenames are used in casts, and also in the typed literal syntax,
which is a typename followed by a string literal.

Here are the grammar notes:

== simple type name

just an identifier chain or a multi word identifier (this is a fixed
list of possibilities, e.g. as 'character varying', see below in the
parser code for the exact list).

<simple-type-name> ::= <identifier-chain>
     | multiword-type-identifier

== Precision type name

<precision-type-name> ::= <simple-type-name> <left paren> <unsigned-int> <right paren>

e.g. char(5)

note: above and below every where a simple type name can appear, this
means a single identifier/quoted or a dotted chain, or a multi word
identifier

== Precision scale type name

<precision-type-name> ::= <simple-type-name> <left paren> <unsigned-int> <comma> <unsigned-int> <right paren>

e.g. decimal(15,2)

== Lob type name

this is a variation on the precision type name with some extra info on
the units:

<lob-type-name> ::=
   <simple-type-name> <left paren> <unsigned integer> [ <multiplier> ] [ <char length units> ] <right paren>

<multiplier>    ::=   K | M | G
<char length units>    ::=   CHARACTERS | CODE_UNITS | OCTETS

(if both multiplier and char length units are missing, then this will
parse as a precision type name)

e.g.
clob(5M octets)

== char type name

this is a simple type with optional precision which allows the
character set or the collation to appear as a suffix:

<char type name> ::=
    <simple type name>
    [ <left paren> <unsigned-int> <right paren> ]
    [ CHARACTER SET <identifier chain> ]
    [ COLLATE <identifier chain> ]

e.g.

char(5) character set my_charset collate my_collation

= Time typename

this is typename with optional precision and either 'with time zone'
or 'without time zone' suffix, e.g.:

<datetime type> ::=
    [ <left paren> <unsigned-int> <right paren> ]
    <with or without time zone>
<with or without time zone> ::= WITH TIME ZONE | WITHOUT TIME ZONE
    WITH TIME ZONE | WITHOUT TIME ZONE

= row type name

<row type> ::=
    ROW <left paren> <field definition> [ { <comma> <field definition> }... ] <right paren>

<field definition> ::= <identifier> <type name>

e.g.
row(a int, b char(5))

= interval type name

<interval type> ::= INTERVAL <interval datetime field> [TO <interval datetime field>]

<interval datetime field> ::=
  <datetime field> [ <left paren> <unsigned int> [ <comma> <unsigned int> ] <right paren> ]

= array type name

<array type> ::= <data type> ARRAY [ <left bracket> <unsigned integer> <right bracket> ]

= multiset type name

<multiset type>    ::=   <data type> MULTISET

A type name will parse into the 'smallest' constructor it will fit in
syntactically, e.g. a clob(5) will parse to a precision type name, not
a lob type name.

Unfortunately, to improve the error messages, there is a lot of (left)
factoring in this function, and it is a little dense.

the hideArg is used when the typename is used as part of a typed
literal expression, to hide what comes after the paren in
'typename('. This is so 'arbitrary_fn(' gives an 'expecting expression',
instead of 'expecting expression or number', which is odd.

-}

typeName :: Parser TypeName
typeName = typeName' False

typeName' :: Bool -> Parser TypeName
typeName' hideArg =
    label "typename" (
        (rowTypeName <|> intervalTypeName <|> otherTypeName)
        `chainrSuffix` tnSuffix)
  where
    rowTypeName =
        RowTypeName <$> (hidden (keyword_ "row") *> parens (commaSep1 rowField))
    rowField = (,) <$> name "type name" <*> typeName
    ----------------------------
    intervalTypeName =
        hidden (keyword_ "interval") *>
        (uncurry IntervalTypeName <$> intervalQualifier)
    ----------------------------
    otherTypeName =
        nameOfType <**>
            (typeNameWithParens
             <|> pure Nothing <**> (hidden timeTypeName <|> hidden charTypeName)
             <|> pure TypeName)
    nameOfType = reservedTypeNames <|> names "type name"
    charTypeName = charSet <**> (option [] tcollate <**> pure (flip4 CharTypeName))
                   <|> pure [] <**> (tcollate <**> pure (flip4 CharTypeName))
    typeNameWithParens =
        (hidden openParen *> (if hideArg then hidden unsignedInteger else unsignedInteger))
        <**> (closeParen *> hidden precMaybeSuffix
              <|> hidden (precScaleTypeName <|> precLengthTypeName) <* closeParen)
    precMaybeSuffix = (. Just) <$> (timeTypeName <|> charTypeName)
                      <|> pure (flip PrecTypeName)
    precScaleTypeName =
        (hidden comma *> (if hideArg then hidden unsignedInteger else unsignedInteger))
        <**> pure (flip3 PrecScaleTypeName)
    precLengthTypeName =
        Just <$> lobPrecSuffix
        <**> (optional lobUnits <**> pure (flip4 PrecLengthTypeName))
        <|> pure Nothing <**> ((Just <$> lobUnits) <**> pure (flip4 PrecLengthTypeName))
    timeTypeName = tz <**> pure (flip3 TimeTypeName)
    ----------------------------
    lobPrecSuffix = PrecK <$ keyword_ "k"
                    <|> PrecM <$ keyword_ "m"
                    <|> PrecG <$ keyword_ "g"
                    <|> PrecT <$ keyword_ "t"
                    <|> PrecP <$ keyword_ "p"
    lobUnits = PrecCharacters <$ keyword_ "characters"
               -- char and byte are the oracle spelling
               -- todo: move these to oracle dialect
               <|> PrecCharacters <$ keyword_ "char"
               <|> PrecOctets <$ keyword_ "octets"
               <|> PrecOctets <$ keyword_ "byte"
    tz = True <$ keywords_ ["with", "time","zone"]
         <|> False <$ keywords_ ["without", "time","zone"]
    charSet = keywords_ ["character", "set"] *> names "character set name"
    tcollate = keyword_ "collate" *> names "collation name"
    ----------------------------
    tnSuffix = multiset <|> array
    multiset = MultisetTypeName <$ keyword_ "multiset"
    array = keyword_ "array" *>
        (optional (brackets unsignedInteger) <**> pure (flip ArrayTypeName))
    ----------------------------
    -- this parser handles the fixed set of multi word
    -- type names, plus all the type names which are
    -- reserved words
    reservedTypeNames = do
        stn <- askDialect diSpecialTypeNames
        (:[]) . Name Nothing . T.unwords <$> makeKeywordTree stn


{-
= Scalar expressions

== simple literals

See the stringToken lexer below for notes on string literal syntax.
-}

stringLit :: Parser ScalarExpr
stringLit = (\(s,e,t) -> StringLit s e t) <$> stringTokExtend

numberLit :: Parser ScalarExpr
numberLit = NumLit <$> sqlNumberTok False

simpleLiteral :: Parser ScalarExpr
simpleLiteral = numberLit <|> stringLit

{-
== star, param, host param

=== star

used in select *, select x.*, and agg(*) variations, and some other
places as well. The parser makes an attempt to not parse star in
most contexts, to provide better experience when the user makes a mistake
in an expression containing * meaning multiple. It will parse a *
at the top level of a select item, or in arg in a app argument list.
-}

star :: Parser ScalarExpr
star =
    hidden $ choice
    [Star <$ symbol "*"
    -- much easier to use try here than to left factor where
    -- this is allowed and not allowed
    ,try (QStar <$> (names "qualified star" <* symbol "." <* symbol "*"))]

{-
== parameter

unnamed parameter or named parameter
use in e.g. select * from t where a = ?
select x from t where x > :param
-}

parameter :: Parser ScalarExpr
parameter = choice
    [Parameter <$ questionMark
    ,HostParameter
     <$> hostParamTok
     <*> hoptional (keyword "indicator" *> hostParamTok)]

-- == positional arg

positionalArg :: Parser ScalarExpr
positionalArg = PositionalArg <$> positionalArgTok

{-
== parens

scalar expression parens, row ctor and scalar subquery
-}

parensExpr :: Parser ScalarExpr
parensExpr = parens $ choice
    -- no parens here used for nested parens expressions
    -- this could be fixed to be general with some refactoring, but at
    -- the moment, you can't use additional redundant parens in a
    -- subqueryexpr
    [SubQueryExpr SqSq <$> queryExprNoParens
    ,ctor <$> commaSep1 scalarExpr]
  where
    ctor [a] = Parens a
    ctor as = SpecialOp [Name Nothing "rowctor"] as

{-
== case, cast, exists, unique, array/multiset constructor, interval

All of these start with a fixed keyword which is reserved, so no other
syntax can start with the same keyword.

=== case expression
-}

caseExpr :: Parser ScalarExpr
caseExpr =
    Case <$> (keyword_ "case" *> optional scalarExpr)
         <*> some whenClause
         <*> optional elseClause
         <* keyword_ "end"
  where
   whenClause = (,) <$> (keyword_ "when" *> commaSep1 scalarExpr)
                    <*> (keyword_ "then" *> scalarExpr)
   elseClause = keyword_ "else" *> scalarExpr

{-
=== cast

cast: cast(expr as type)
-}

cast :: Parser ScalarExpr
cast = keyword_ "cast" *>
       parens (Cast <$> scalarExpr
                    <*> (keyword_ "as" *> typeName))

{-
=== convert

convertSqlServer: SqlServer dialect CONVERT(data_type(length), expression, style)
-}

convertSqlServer :: Parser ScalarExpr
convertSqlServer = guardDialect diConvertFunction
                   *> keyword_ "convert" *>
                   parens (Convert <$> typeName <*> (comma *> scalarExpr)
                      <*> optional (comma *> unsignedInteger))

{-
=== exists, unique

subquery expression:
[exists|unique] (queryexpr)
-}

subquery :: Parser ScalarExpr
subquery = SubQueryExpr <$> sqkw <*> parens queryExpr
  where
    sqkw = SqExists <$ keyword_ "exists" <|> SqUnique <$ keyword_ "unique"

-- === array/multiset constructor

arrayCtor :: Parser ScalarExpr
arrayCtor = keyword_ "array" >>
    choice
    [ArrayCtor <$> parens queryExpr
    ,Array (Iden [Name Nothing "array"]) <$> brackets (commaSep scalarExpr)]

{-
As far as I can tell, table(query expr) is just syntax sugar for
multiset(query expr). It must be there for compatibility or something.
-}

multisetCtor :: Parser ScalarExpr
multisetCtor =
    choice
    [keyword_ "multiset" >>
     choice
     [MultisetQueryCtor <$> parens queryExpr
     ,MultisetCtor <$> brackets (commaSep scalarExpr)]
    ,keyword_ "table" >>
     MultisetQueryCtor <$> parens queryExpr]

nextValueFor :: Parser ScalarExpr
nextValueFor = keywords_ ["next","value","for"] >>
    NextValueFor <$> names "sequence generator name"

{-
=== interval

interval literals are a special case and we follow the grammar less
permissively here

parse SQL interval literals, something like
interval '5' day (3)
or
interval '5' month

if the literal looks like this:
interval 'something'

then it is parsed as a regular typed literal. It must have a
interval-datetime-field suffix to parse as an intervallit

It uses try because of a conflict with interval type names: todo, fix
this. also fix the monad -> applicative
-}

intervalLit :: Parser ScalarExpr
intervalLit =
    label "interval literal" $ try (keyword_ "interval" >> do
    s <- hoptional $ choice [Plus <$ symbol_ "+"
                           ,Minus <$ symbol_ "-"]
    lit <- singleQuotesOnlyStringTok
    q <- hoptional intervalQualifier
    mkIt s lit q)
  where
    mkIt Nothing val Nothing = pure $ TypedLit (TypeName [Name Nothing "interval"]) val
    mkIt s val (Just (a,b)) = pure $ IntervalLit s val a b
    mkIt (Just {}) _val Nothing = fail "cannot use sign without interval qualifier"

{-
== typed literal, app, special, aggregate, window, iden

All of these start with identifiers (some of the special functions
start with reserved keywords).

they are all variations on suffixes on the basic identifier parser

The windows is a suffix on the app parser

=== iden prefix term

all the scalar expressions which start with an identifier

(todo: really put all of them here instead of just some of them)
-}

idenExpr :: Parser ScalarExpr
idenExpr =
    -- todo: try reversing these
    -- then if it parses as a typename as part of a typed literal
    -- and not a regularapplike, then you'll get a better error message
    try typedLiteral <|> regularAppLike
  where
    -- parse regular iden or app
    -- if it could potentially be a typed literal typename 'literaltext'
    -- optionally try to parse that
    regularAppLike = do
        e <- (keywordFunctionOrIden
              <|> (names "identifier" <**> (hidden app <|> pure Iden)))
        let getInt s = readMaybe (T.unpack s)
        case e of
            Iden nm -> tryTypedLiteral (TypeName nm) <|> pure e
            App nm [NumLit prec]
                | Just prec' <- getInt prec ->
                tryTypedLiteral (PrecTypeName nm prec') <|> pure e
            App nm [NumLit prec,NumLit scale]
                | Just prec' <- getInt prec
                , Just scale' <- getInt scale ->
                tryTypedLiteral (PrecScaleTypeName nm prec' scale') <|> pure e
            _ -> pure e
    tryTypedLiteral tn =
        TypedLit tn <$> hidden singleQuotesOnlyStringTok
    typedLiteral =
        TypedLit <$> hidden (typeName' True) <*> singleQuotesOnlyStringTok
    keywordFunctionOrIden = do
        d <- askDialect id
        x <- hidden (keywordTok (diIdentifierKeywords d ++ diAppKeywords d))
        let i = T.toLower x `elem` diIdentifierKeywords d
            a = T.toLower x `elem` diAppKeywords d
        case () of
            _ | i && a -> pure [Name Nothing x] <**> (hidden app <|> pure Iden)
              | i -> pure (Iden [Name Nothing x])
              | a -> pure [Name Nothing x] <**> app
              | otherwise -> -- shouldn't get here
                    fail $ "unexpected keyword: " <> T.unpack x

{-
=== special

These are keyword operators which don't look like normal prefix,
postfix or infix binary operators. They mostly look like function
application but with keywords in the argument list instead of commas
to separate the arguments.

the special op keywords
parse an operator which is
operatorname(firstArg keyword0 arg0 keyword1 arg1 etc.)
-}

data SpecialOpKFirstArg = SOKNone
                        | SOKOptional
                        | SOKMandatory

specialOpK :: Text -- name of the operator
           -> SpecialOpKFirstArg -- has a first arg without a keyword
           -> [(Text,Bool)] -- the other args with their keywords
                              -- and whether they are optional
           -> Parser ScalarExpr
specialOpK opName firstArg kws =
    keyword_ opName >> do
    void openParen
    let pfa = do
              e <- scalarExpr
              -- check we haven't parsed the first
              -- keyword as an identifier
              case (e,kws) of
                  (Iden [Name Nothing i], (k,_):_)
                      | T.toLower i == k ->
                          fail $ "unexpected " ++ T.unpack i
                  _ -> pure ()
              pure e
    fa <- case firstArg of
         SOKNone -> pure Nothing
         SOKOptional -> optional (try pfa)
         SOKMandatory -> Just <$> pfa
    as <- mapM parseArg kws
    void closeParen
    pure $ SpecialOpK [Name Nothing opName] fa $ catMaybes as
  where
    parseArg (nm,mand) =
        let p = keyword_ nm >> scalarExpr
        in fmap (nm,) <$> if mand
                          then Just <$> p
                          else optional (try p)

{-
The actual operators:

EXTRACT( date_part FROM expression )

POSITION( string1 IN string2 )

SUBSTRING(extraction_string FROM starting_position [FOR length]
[COLLATE collation_name])

CONVERT(char_value USING conversion_char_name)

TRANSLATE(char_value USING translation_name)

OVERLAY(string PLACING embedded_string FROM start
[FOR length])

TRIM( [ [{LEADING | TRAILING | BOTH}] [removal_char] FROM ]
target_string
[COLLATE collation_name] )
-}

specialOpKs :: Parser ScalarExpr
specialOpKs = choice $ map try
    [extract, position, substring, convert, translate, overlay, trim]

extract :: Parser ScalarExpr
extract = specialOpK "extract" SOKMandatory [("from", True)]

position :: Parser ScalarExpr
position = specialOpK "position" SOKMandatory [("in", True)]

{-
strictly speaking, the substring must have at least one of from and
for, but the parser doens't enforce this
-}

substring :: Parser ScalarExpr
substring = specialOpK "substring" SOKMandatory
                [("from", False),("for", False)]

convert :: Parser ScalarExpr
convert = specialOpK "convert" SOKMandatory [("using", True)]


translate :: Parser ScalarExpr
translate = specialOpK "translate" SOKMandatory [("using", True)]

overlay :: Parser ScalarExpr
overlay = specialOpK "overlay" SOKMandatory
                [("placing", True),("from", True),("for", False)]

{-
trim is too different because of the optional char, so a custom parser
the both ' ' is filled in as the default if either parts are missing
in the source
-}

trim :: Parser ScalarExpr
trim =
    keyword "trim" >>
    parens (mkTrim
            <$> option "both" sides
            <*> option " " singleQuotesOnlyStringTok
            <*> (keyword_ "from" *> scalarExpr))
  where
    sides = choice ["leading" <$ keyword_ "leading"
                   ,"trailing" <$ keyword_ "trailing"
                   ,"both" <$ keyword_ "both"]
    mkTrim fa ch fr =
      SpecialOpK [Name Nothing "trim"] Nothing
          $ catMaybes [Just (fa,StringLit "'" "'" ch)
                      ,Just ("from", fr)]

{-
=== app, aggregate, window

This parses all these variations:
normal function application with just a csv of scalar exprs
aggregate variations (distinct, order by in parens, filter and where
  suffixes)
window apps (fn/agg followed by over)

This code is also a little dense like the typename code because of
left factoring, later they will even have to be partially combined
together.
-}

app :: Parser ([Name] -> ScalarExpr)
app =
    hidden openParen *> choice
    [hidden duplicates
     <**> (commaSep1 scalarExprOrStar
           <**> ((hoption [] orderBy <* closeParen)
                 <**> (hoptional afilter <**> pure (flip5 AggregateApp))))
     -- separate cases with no all or distinct which must have at
     -- least one scalar expr
    ,commaSep1 scalarExprOrStar
     <**> choice
          [closeParen *> hidden (choice
                         [window
                         ,withinGroup
                         ,(Just <$> afilter) <**> pure (flip3 aggAppWithoutDupeOrd)
                         ,pure (flip App)])
          ,hidden orderBy <* closeParen
           <**> (hoptional afilter <**> pure (flip4 aggAppWithoutDupe))]
     -- no scalarExprs: duplicates and order by not allowed
    ,([] <$ closeParen) <**> choice
                             [window
                             ,withinGroup
                             ,pure $ flip App]
    ]
  where
    aggAppWithoutDupeOrd n es f = AggregateApp n SQDefault es [] f
    aggAppWithoutDupe n = AggregateApp n SQDefault