Parsers for extensible grammars?

Started by ●October 22, 2014

Hi,

[I probably should direct this at George...  :> ]

I'm writing a command parser.  Some set of commands are "common".  But, other instances of the parser are augmented with additional commands/syntax.

[These additions are known at compile time, not "dynamic"]

Ideally, I want a solution that allows folks developing those "other" commands to just "bolt onto" what I have done.  E.g., creating a single "grammar definition" (lex/yacc) is probably not a good way to go (IME, most small parsers tend to be ad hoc).

[Note: I can't even guarantee that the extensions will be consistent or "harmonious" with the grammar that I implement]

A naive approach (?) might be for my code to take a crack at a particular "statement"/command and, in case of FAIL, invoke some bolt-on parser to give it a chance to make sense out of the input.  If *it* FAILs, then the input is invalid, etc.

This sounds like an incredible kluge.  Any better suggestions?
Reply by ●October 22, 2014
On 10/22/2014 12:14 PM, Don Y wrote:
> Ideally, I want a solution that allows folks developing those
> "other" commands to just "bolt onto" what I have done.  E.g.,
> creating a single "grammar definition" (lex/yacc) is probably
> not a good way to go (IME, most small parsers tend to be ad hoc).

By that, I mean: having folks augment a lex/yacc grammar definition that I've developed as the basis of these "common" commands is probably ill-advised.  *I* can choose such an implementation for the common commands but should probably not impose that solution on the extensions (and risk folks breaking the common stuff!)
Reply by ●October 22, 2014
On Wed, 22 Oct 2014 12:14:06 -0700, Don Y wrote:
> I'm writing a command parser.  Some set of commands are "common".
> But, other instances of the parser are augmented with additional
> commands/syntax.
> [...]
> This sounds like an incredible kluge.  Any better suggestions?

If it's always something simple like

    <command> <argument> <argument> <argument>

and the only thing that's changing is the command and (possibly) the argument types (text, numerical, etc.) that are bound to it, then a parser should be pretty easy -- I wouldn't even bother with Lex and Yacc.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
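What Tim describes is essentially a keyword-dispatch table, a page of code at most.  A minimal sketch in C++ -- the command set, handler names, and arity limits here are invented for illustration, not taken from anyone's actual code:

    #include <cstdio>
    #include <string>
    #include <sstream>
    #include <vector>

    using Args = std::vector<std::string>;

    struct Command {
        const char *name;
        int min_args, max_args;
        void (*handler)(const Args &);
    };

    static void do_message(const Args &a) { std::printf("%s\n", a[0].c_str()); }
    static void do_pause(const Args &a)   { std::printf("pause %s\n", a[0].c_str()); }

    static const Command table[] = {
        { "MESSAGE", 1, 1, do_message },
        { "PAUSE",   1, 1, do_pause   },
    };

    // Returns true if the line matched a known command and was dispatched.
    bool dispatch(const std::string &line)
    {
        std::istringstream in(line);
        std::string verb;
        if (!(in >> verb))
            return false;                         // blank line
        Args args;
        for (std::string tok; in >> tok; )
            args.push_back(tok);
        for (const Command &c : table) {
            if (verb == c.name) {
                if ((int)args.size() < c.min_args || (int)args.size() > c.max_args)
                    return false;                 // wrong argument count
                c.handler(args);
                return true;
            }
        }
        return false;                             // unknown keyword
    }

Adding a command is one table row plus one handler, which is as close to "bolt-on" as a fixed syntax gets.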
Reply by ●October 22, 2014
On 23/10/14 06:14, Don Y wrote:
> I'm writing a command parser.  Some set of commands are "common".
> But, other instances of the parser are augmented with additional
> commands/syntax.
> [...]
> [Note: I can't even guarantee that the extensions will be
> consistent or "harmonious" with the grammar that I implement]

You want composable grammars.  This is right in a core area of my expertise.

Lex/Yacc (and all other forms of bottom-up LR parsing) are the wrong way to go, because the tools must do deep analysis of the entire grammar before producing a parser, and the errors that occur require significant expertise to understand.

Top-down parsers (also known as LL or recursive descent) are much easier to write and debug.  They're typically implemented using limited look-ahead (often one token, LL(1)) and are slightly less powerful than LR parsers with the same look-ahead.

When an LL parser has unlimited lookahead, it is very powerful but can have horrible performance due to the extensive backtracking required.

However, so-called packrat parsers trade memory for performance, while retaining almost all the power of unlimited backtracking.  For a grammar of N rules, the memory usage for an input of size M is O(N*M), that is, linear in the size of the input.  Because the CPU performance is so good, there's no need for a separate lexer -- the parse performance is the same as the DFAs used for lexing, so there's no advantage to be had.

Packrat parsers are generally created from PEGs (Parsing Expression Grammars).  There are plenty of PEG parser generators; I maintain the most widely used one for Ruby, but they exist for dozens of languages.

The really nice thing about LL parsers is they're composable.  You can take two grammars with different rules, and simply layer one on top of the other.  Often it's helpful to allow a rule in one grammar to call the same rule (by name) in the other grammar -- this works very like O-O inheritance.

Note that none of this will help you if you want "minimum abbreviation" support for your keywords.  That is best implemented using a trie.  It's all compatible with the use of a PEG parser though; the keyword matcher won't be written using a PEG, is all.

Clifford Heath.
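To make "composable" concrete: the core of a PEG is ordered choice with backtracking, and layering one grammar over another amounts to prepending alternatives.  A toy sketch -- the Rule representation and the two-level command grammar are invented purely for illustration:

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    struct Input { const std::string &text; std::size_t pos; };

    // A Rule either matches (advancing pos) or fails.
    using Rule = std::function<bool(Input &)>;

    // Match a literal keyword.
    Rule lit(std::string kw)
    {
        return [kw](Input &in) {
            if (in.text.compare(in.pos, kw.size(), kw) == 0) {
                in.pos += kw.size();
                return true;
            }
            return false;
        };
    }

    // PEG ordered choice (e1 / e2 / ...): try each alternative in turn,
    // backtracking to the saved position after every failure.
    Rule choice(std::vector<Rule> alts)
    {
        return [alts](Input &in) {
            for (const Rule &r : alts) {
                std::size_t mark = in.pos;
                if (r(in))
                    return true;
                in.pos = mark;      // backtrack, try the next alternative
            }
            return false;
        };
    }

    // "Layering": an extended grammar is the base grammar with alternatives
    // prepended -- the base rules are reused untouched.
    Rule base_command     = choice({ lit("MESSAGE"), lit("PAUSE") });
    Rule extended_command = choice({ lit("BITRATE"), lit("PARITY"), base_command });

    // Usage: Input in{line, 0}; bool ok = extended_command(in);

A packrat parser adds memoization of (rule, position) results on top of exactly this structure; the composability comes for free either way.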
Reply by ●October 23, 2014
On 10/22/2014 1:13 PM, Tim Wescott wrote:
> If it's always something simple like
>
>     <command> <argument> <argument> <argument>
>
> and the only thing that's changing is the command and (possibly) the
> argument types (text, numerical, etc.) that are bound to it, then a
> parser should be pretty easy -- I wouldn't even bother with Lex and Yacc.

The problem I'm stuck with is not knowing what is likely to be added.  While I suspect most will be similarly trivial, I can also imagine some esoteric additions!

The "common" commands are trivial enough that, if I knew ALL would be of similar form, I would almost adopt an approach of sequentially attempting to parse a command with individual templates -- until one "fit".  Extensions would just be tacked onto the end of a (short) series of conditionals:

    if (parse(string, template1))
        command1();
    else if (parse(string, template2))
        command2();
    else if ...

Performance isn't an issue (though footprint may be).

This would allow extensions to *simply* focus on what they needed to do for *those* individual commands (without fear of extensions mucking up the common commands!)  I wouldn't run the risk of someone buggering a "shared grammar" definition (and failing to add appropriate regression tests to detect that).

But, if the extensions get more esoteric (than this trivial approach), this starts to look toy-ish; there's very little to *leverage*, here...  :<
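One cheap way to get a parse(string, template) primitive for a chain like that is sscanf().  A sketch -- the two-command "common" set and the formats are invented; the post leaves the actual template mechanism open:

    #include <cstdio>

    // Try each template in turn; true means the line was a common command.
    bool try_common(const char *line)
    {
        unsigned delay;
        char dev[32];

        if (std::sscanf(line, " PAUSE %u", &delay) == 1) {
            // pause(delay);
            return true;
        }
        if (std::sscanf(line, " STDIN %31s", dev) == 1) {
            // redirect_stdin(dev);
            return true;
        }
        return false;   // no template fit; offer the line to a bolt-on parser
    }

The leading space in each format skips whitespace, and the return-value check doubles as the "did it fit?" test, so each new command really is one more else-if.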
Reply by ●October 23, 2014
On 10/22/2014 2:22 PM, Clifford Heath wrote:
> You want composable grammars.  This is right in a core area of my expertise.
> [...]
> The really nice thing about LL parsers is they're composable.  You can
> take two grammars with different rules, and simply layer one on top of
> the other.

This sounds like an *ideal* approach!  But...  I'm concerned that too many folks would find a "formal" grammar tool to be intimidating.

I'm thinking about the *hundreds* of different "configuration files" I've encountered, "argument handling", etc., and how *few* of these were implemented with any sort of formal tools.  Rather, almost as a RULE, they all seemed to be jumbles of spaghetti code (though they "got the job done", more or less).

Cases where "space" was accepted as whitespace -- but "tab", not.  Or, vice versa.  Cases where comments HAD to begin in column one and others where leading whitespace was ignored.  Or, where comments could not coexist with preceding "content" on the same line.  Or, where groups of statements had to be co-located on the same line (with some formal delimiter) instead of being allowed to span lines (couple this with a maximum line length constraint and you can see one potential issue!)

And, in each case, no compelling REASON for the restrictions (as if "it was too much work" to provide a cleaner/more consistent implementation).

I'm left with the impression that these folks either underestimated the complexity of the tasks they set out for themselves; or, were intimidated by more "structured" approaches to SUCH A COMMON PROBLEM!  [Or, the structured approaches were harder to comprehend/debug than the ad hoc spaghetti-code implementations!]

So, I figure decoupling my portion of the implementation (common commands) from these potential extensions might be A Wise Choice.
Reply by ●October 23, 2014
On Thu, 23 Oct 2014 08:22:24 +1100, Clifford Heath <no.spam@please.net> wrote:
> You want composable grammars.

Or just parallel parsers.

Don said "bolt onto" and "harmonious", but it isn't clear to me whether what he wants is to provide some kind of macro facility in which new commands are fully expressible in *his* language, or whether he wants to allow development of a parallel language that remains interoperable with his but also is separate from it.  The first is [relatively] easy, the second not so much.

In either case, the additional parser should get the first crack at input with Don's own parser as a fallback.

> Top-down parsers (also known as LL or recursive descent) are much
> easier to write and debug.  They're typically implemented using limited
> look-ahead (often one token, LL(1)) and are slightly less powerful than
                                              ^^^^^^^^
> LR parsers with the same look-ahead.

You mean _much_ less powerful.  AFAIK, LR has not been shown to be a proper superset of LL, but it is known to be a vastly larger space.  The LR(1) subset provably contains LL(1) and also many LL(>1) languages.

That said, a large number of useful languages can be expressed in LL with computationally reasonable lookahead.  And it is true in general that LL parsers are easier to understand and debug.

> When an LL parser has unlimited lookahead, it is very powerful but can
> have horrible performance due to the extensive backtracking required.

Not really.  "Ungetting" tokens -- even a large number of them -- is not very costly as long as 1) trying a parse produces no side effects [i.e. the predication is purely syntactic], or 2) side effects are implemented "functionally" and can be simply thrown away.

Syntactic predication and quite a lot of useful semantic predication can be implemented using stack or region allocation strategies, but there are some hard spaghetti cases where it's nice to have GC available in your development language.

> However, so-called packrat parsers trade memory for performance, while
> retaining almost all the power of unlimited backtracking.
> [...]
> Packrat parsers are generally created from PEGs (Parsing Expression
> Grammars).  There are plenty of PEG parser generators; I maintain the
> most widely used one for Ruby, but they exist for dozens of languages.

PEG (and GLR also) replicates the parser state to deal with ambiguous input by following the different paths independently.  But the fact is that most language constructs are unambiguous and thus do not cause forking, and *unavoidable* ambiguity in the language tends to be resolved quickly when the ambiguous constructs actually are used.

The problem with PEG (and GLR also) grammars is that there still are no really useful tools for finding/debugging *unintended* ambiguity.  The generator tool won't give reduce/reduce errors ...  warnings maybe, but not errors ...  and will create a parser that just forks on ambiguous input, wanders off into the hills and gets lost.

Most useful program/control languages do not need PEG (or even GLR).  If your language really is that complicated, you probably should rethink it.

> The really nice thing about LL parsers is they're composable.  You can
> take two grammars with different rules, and simply layer one on top of
> the other.  Often it's helpful to allow a rule in one grammar to call
> the same rule (by name) in the other grammar -- this works very like
> O-O inheritance.

Yes.

> Note that none of this will help you if you want "minimum abbreviation"
> support for your keywords.  That is best implemented using a trie.

The trie would be part of the tokenizer, so it would be compatible with any parse strategy.

George
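George's point about "ungetting" being cheap is easiest to see with a buffered token stream: speculation is a saved index, and backtracking is restoring it.  A sketch with an invented interface, assuming (as he stipulates) that the speculative parse has no side effects:

    #include <cstddef>
    #include <string>
    #include <vector>

    struct Token { std::string text; };

    class TokenStream {
        std::vector<Token> buf;      // every token scanned so far
        std::size_t cursor = 0;
    public:
        explicit TokenStream(std::vector<Token> t) : buf(std::move(t)) {}
        const Token *next()
            { return cursor < buf.size() ? &buf[cursor++] : nullptr; }
        std::size_t mark() const { return cursor; }   // remember a position
        void reset(std::size_t m) { cursor = m; }     // "unget" any number of tokens
    };

    // Speculatively try one parse; on failure the stream is exactly as it
    // was.  Cheap precisely because the attempt produced no side effects.
    template <typename ParseFn>
    bool attempt(TokenStream &ts, ParseFn fn)
    {
        std::size_t m = ts.mark();
        if (fn(ts))
            return true;
        ts.reset(m);
        return false;
    }

Memoizing results by (rule, mark) on top of this is what turns plain backtracking into a packrat parser.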
Reply by ●October 23, 2014
Hi George,

On 10/23/2014 1:45 AM, George Neuner wrote:
> Or just parallel parsers.
>
> Don said "bolt onto" and "harmonious", but it isn't clear to me
> whether what he wants is to provide some kind of macro facility in
> which new commands are fully expressible in *his* language, or whether
> he wants to allow development of a parallel language that remains
> interoperable with his but also is separate from it.

I hadn't considered the idea of expressing "new" commands in terms of "old"/common commands.

What I am looking for is the ability to define a set of commands that will "always" be used/useful.  Then, allow others to add to that set of commands to address specific needs that are necessary/unique to their environment (i.e., the environment in which these common commands are deployed).

Contrived example:

    MESSAGE <text>
    PAUSE <delay>
    STDIN <device>
    STDOUT <device>

Deploying this in a context where stdin/stdout might want to also support a serial port ("<device> = TTY") would probably necessitate adding:

    BITRATE <speed>
    CHARACTER <width>
    PARITY <odd/even/off/ignore>

In the presence of multiple TTY's, it might then require:

    PORT <address>
    IRQ <number>

Deploy it where stdin/stdout are a VDT and, perhaps:

    BACKGROUND <color>
    FOREGROUND <color>
    CURSOR <blink/solid/off>

Deploy it where stdin/stdout are a telnet connection, and perhaps:

    IP <address>
    MASK <dotted_quad>
    GATEWAY <address>

I.e., the "arguments" can't be easily foreseen from the perspective of the first/common commands.

Note that I've presented these in a somewhat "harmonious"/consistent manner.  Another implementer could have adopted a syntax like:

    FORMAT <bitrate> <parity> <width>

Or:

    display WHITE on field of BLUE

Or:

    SET IP X.X.X.X/24 VIA Y.Y.Y.Y

I.e., I can't assume he'll consider the <setting> <value> form to be appropriate to his/her future needs.

> The first is [relatively] easy, the second not so much.
>
> In either case, the additional parser should get the first crack at
> input with Don's own parser as a fallback.

Hmmm...  I had considered the opposite approach: let me verify the input is/isn't a "common command" before exposing the input to the "other" parser.  The thought being: common commands are then implicitly invariant (you can't change the syntax of them) AND they "always work" (even if the additional parser has bugs in its implementation).
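The ordering Don prefers is just a two-link chain with the common parser first.  A sketch -- the function names and the stub matcher are invented; the point is only the ordering:

    #include <cstdio>
    #include <cstring>

    // Stub recognizer for the "common" commands (illustrative only).
    static bool parse_common(const char *line)
    {
        return std::strncmp(line, "MESSAGE ", 8) == 0 ||
               std::strncmp(line, "PAUSE ", 6) == 0;
    }

    // The bolt-on hook; a build without extensions leaves it null.
    static bool (*parse_extension)(const char *line) = nullptr;

    bool handle_line(const char *line)
    {
        if (parse_common(line))
            return true;                        // common syntax can't be shadowed
        if (parse_extension && parse_extension(line))
            return true;                        // extension gets the leftovers
        std::fprintf(stderr, "invalid command: %s\n", line);
        return false;
    }

Common-first means extensions can never redefine (or break) the common syntax; extension-first, as George suggests, means extensions can override it.  The chain itself is the same either way.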
Reply by ●October 23, 2014
Don Y wrote:
> On 10/22/2014 1:13 PM, Tim Wescott wrote:
>> If it's always something simple like
>>
>>     <command> <argument> <argument> <argument>
>>
>> and the only thing that's changing is the command and (possibly) the
>> argument types (text, numerical, etc.) that are bound to it, then a
>> parser should be pretty easy -- I wouldn't even bother with Lex and Yacc.
>
> The problem I'm stuck with is not knowing what is likely to be added.
> While I suspect most will be similarly trivial, I can also imagine
> some esoteric additions!

You could try to bring every command into the <command> <argument>...  form with some clever syntax.  I'm tempted to say "Lisp", but that probably doesn't enjoy universal love :-)

However, you can easily bring

    if (foo) {
        bar();
        baz();
    } else {
        fred();
    }

into "<command> <argument>..." form by making "{}" string delimiters.  This makes this a command 'if' with four parameters: an expression '(foo)', a string for the 'if' branch, an unquoted word 'else', and a string for the 'else' branch.

I recently built a command shell like that, and I believe Tcl does it the same way.  So all you would need to give your users would be a possibility to register a command verb, and a way to evaluate strings as command sequences.

Stefan
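The braces-as-string-delimiters trick shows up entirely in the word splitter: a braced group (nesting included) comes back as one unevaluated word.  A minimal sketch, ignoring escapes and unbalanced input (my simplifications, not Tcl's actual rules):

    #include <cctype>
    #include <cstddef>
    #include <string>
    #include <vector>

    std::vector<std::string> split_words(const std::string &line)
    {
        std::vector<std::string> words;
        std::size_t i = 0, n = line.size();
        while (i < n) {
            while (i < n && std::isspace((unsigned char)line[i]))
                i++;                             // skip separators
            if (i >= n)
                break;
            if (line[i] == '{') {                // braced word: scan to the match
                int depth = 0;
                std::size_t start = ++i;
                for (; i < n; i++) {
                    if (line[i] == '{')
                        depth++;
                    else if (line[i] == '}') {
                        if (depth == 0) break;
                        depth--;
                    }
                }
                words.push_back(line.substr(start, i - start));
                if (i < n) i++;                  // step past the closing '}'
            } else {                             // bare word: scan to whitespace
                std::size_t start = i;
                while (i < n && !std::isspace((unsigned char)line[i]))
                    i++;
                words.push_back(line.substr(start, i - start));
            }
        }
        return words;
    }

    // split_words("if {x > 0} {fire} else {hold}") yields
    //   "if", "x > 0", "fire", "else", "hold"
    // i.e. the command 'if' plus four arguments, as Stefan describes.

Evaluating a braced word as a command sequence is then just feeding it back through the same splitter, which is where the Tcl-like uniformity comes from.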
Reply by ●October 23, 2014
On Wed, 22 Oct 2014 22:54:13 -0700, Don Y wrote:
> On 10/22/2014 1:13 PM, Tim Wescott wrote:
>> If it's always something simple like
>>
>>     <command> <argument> <argument> <argument>
>>
>> and the only thing that's changing is the command and (possibly) the
>> argument types (text, numerical, etc.) that are bound to it, then a
>> parser should be pretty easy -- I wouldn't even bother with Lex and Yacc.
>
> The problem I'm stuck with is not knowing what is likely to be added.
> While I suspect most will be similarly trivial, I can also imagine some
> esoteric additions!

Who's going to be adding stuff, why, and why won't you have control over the process?

> The "common" commands are trivial enough that, if I knew ALL would be of
> similar form, I would almost adopt an approach of sequentially
> attempting to parse a command with individual templates -- until one
> "fit".

I have an in-house parser that I use for debugging and service on a number of projects.  It's OO in C++, but it basically supports a syntax of

    <keyword> <anything else>

You create an object and give it the keyword, which automagically registers it with the first-layer parser.  When someone types something in, the first-layer parser looks for the keyword, and if it's found, hands it to the object for further parsing.

It's crude but effective.  I keep telling myself that I'll make more library-ish functionality for parsing the <anything else> bit, but so far that's all ad-hoc.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
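The "automagically registers" part of a design like Tim's is typically a constructor that links the object into a global list that the first-layer parser walks.  A guess at the shape of it -- all names invented, not Tim's actual code:

    #include <cstdio>
    #include <cstring>

    // Self-registering command objects: constructing one links it into the
    // registry, so "bolting on" a command is just defining one more object.
    class Command {
        const char *keyword;
        Command    *next;                  // intrusive registry link
        static Command *head;              // the first-layer parser's list
    public:
        explicit Command(const char *kw) : keyword(kw), next(head) { head = this; }
        virtual ~Command() = default;
        virtual void parse_rest(const char *rest) = 0;  // per-command, ad hoc

        static bool dispatch(const char *line) {
            std::size_t len = std::strcspn(line, " \t"); // first word = keyword
            for (Command *c = head; c; c = c->next)
                if (std::strlen(c->keyword) == len &&
                    std::strncmp(line, c->keyword, len) == 0) {
                    c->parse_rest(line + len);           // "<anything else>"
                    return true;
                }
            return false;                                // unknown keyword
        }
    };
    Command *Command::head = nullptr;

    // A bolt-on command is one static object:
    struct PauseCmd : Command {
        PauseCmd() : Command("PAUSE") {}
        void parse_rest(const char *rest) override { std::printf("pause%s\n", rest); }
    } pause_cmd;

Because registration happens at static-initialization time, extension code never touches the core parser: linking in a file that defines a Command object is the whole integration step.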