Rework PHP parser [#300031]

Comment	File	Size	Author
#78	300031_rework_PHP_parser_4.patch	13.01 KB	solotandem
#77	api.patch	5.62 KB	drumm
#74	300031_rework_PHP_parser_3.patch	9.98 KB	solotandem
#73	api.patch	5.75 KB	drumm
#70	300031_rework_PHP_parser_2.patch	15.07 KB	solotandem
#64	cache-install.phc_.txt	27.52 KB	CorniI
#64	cache-install.rphp_.txt	4.55 KB	CorniI
#38	300031_rework_PHP_parser.patch	23.06 KB	solotandem
#35	database.inc_.txt	5.62 KB	solotandem
#35	database.edited.inc_.txt	17.27 KB	solotandem
#35	structure.html_.txt	1.57 KB	solotandem
#26	parse.txt	5.92 KB	chx
#22	foo.txt	2.95 KB	chx
#19	tokenizer.patch	32.36 KB	Damien Tournoud

Comment #1

drumm

he/him

NY, US

CreditAttribution: drumm commented 26 August 2008 at 09:14

Component:

Code

» Parser

Log in or register to post comments

Comment #2

sdboyer CreditAttribution: sdboyer commented 29 August 2008 at 14:00

AWESOME! This is really great. I'd had the Reflection API in mind myself, but aside from the php4/5 question, I haven't really investigated the pros/cons of one vs. the other.

Log in or register to post comments

Comment #3

drumm

he/him

NY, US

CreditAttribution: drumm commented 29 August 2008 at 15:47

Reflection API and PHP 5 are great, however, it requires including the code, which would not be an option for documenting all of contrib.

Log in or register to post comments

Comment #4

sun

German

Karlsruhe

CreditAttribution: sun commented 7 October 2008 at 02:05

The registry contains some code in registry.inc that could work as a starting point. It already scans files for objects and functions. Add the other required tokens of interest, i.e. T_COMMENT, T_DOC_COMMENT.

The coder_format script shipped with Coder module might also be worth to look at. It is capable of reformatting PHP files so they match Drupal's coding standards (style).

Log in or register to post comments

Comment #5

sdboyer CreditAttribution: sdboyer commented 15 October 2008 at 20:48

I floated a thought to drumm in irc today; reposting it here as requested. Caveat: I don't understand how a token-based parser does/would/could work; my knowledge base is doing documentation using Reflection. drumm and I have talked about this before, and we basically ended up agreeing that both approaches would probably be useful. The big problem with a Reflection-based approach is the potential security hole: since the files have to be require'd in before their contents can be reflected upon, it means that anything in global scope in those files will be executed. That's fine if you trust the code, but if we're doing, say, cron-based runs of everything in Drupal CVS...problems.

But that's old news. The new stuff is a way of potentially getting around that security problem: instead of just requiring in the files to be read for documentation purposes, we manipulate the file first, THEN parse it in and go. Specifically, we read in the file, suck out everything that's in global scope and drop it into a dead-space function. We then write the file back out (unless we can just read it from memory without things getting REALLY hacky) with the dead-space function in it, THEN parse in that temporary file, and begin our Reflecting.

I'll also email either php-general or php-internals about this and see what I get back, aside from flames.

Log in or register to post comments

Comment #6

Crell CreditAttribution: Crell commented 18 December 2008 at 21:18

The problem with the approach in #5 is that if you suck out anything non-global, you also lose anything inside a class, vis, methods. We have a growing number of classes that need documenting.

Keeping the string in memory should be easy in PHP 5 thanks to the memory:// stream wrapper.

Has anyone taken the time to look at the phpDocumentor project and how they handle it? They're syntax is close enough anyway that we could probably leverage much or all of that codebase, and get much better standardization with the rest of the PHP world in the process.

Log in or register to post comments

Comment #7

sdboyer CreditAttribution: sdboyer commented 18 December 2008 at 23:40

The problem with the approach in #5 is that if you suck out anything non-global, you also lose anything inside a class, vis, methods.

I think you misread what I said? The suggestion was to suck out globals, not suck out non-globals. I think I may see the problem you're alluding to, but...well, could you clarify?

phpDocumentor is token-based, not Reflection-based. A lot of what they do could be re-used for a token-based approach, but is basically irrelevant to a Reflection-based approach. Arbitrary code execution is only an issue for Reflection-based systems (at least so far as we've thought it through), so there's not much to offer in that vein. But, most importantly!

Keeping the string in memory should be easy in PHP 5 thanks to the memory:// stream wrapper.

This sparked a thought I had previously had, but that I only just started investigating now, and yep: include() can take stream arguments other than conventional files. I need to investigate this more, but it seems like there are some very interesting potential options there...

Log in or register to post comments

Comment #8

Crell CreditAttribution: Crell commented 19 December 2008 at 00:17

I'm curious why we want to use reflection so badly? Remember we also want to keep global-level comment blocks. And that still doesn't solve the problem of having to parse all that code into memory, from which it cannot be removed. You're going to have to either break it up into a separate PHP process for each file or buy an awful lot of RAM. :-)

Log in or register to post comments

Comment #9

drumm

he/him

NY, US

CreditAttribution: drumm commented 19 December 2008 at 00:29

Reflection would be a convenient way of saving a lot of work, since it comes with getDocComment(). However, it simply won' pick up globals or constants.

The tokenizer would basically be a state machine that iterates over the token array, keeping track of nesting level and looking for where things start & stop. I recently committed a small use of the PHP tokenizer in api_page_function(). It draws http://delocalizedham.com/node/10.

Log in or register to post comments

Comment #10

sdboyer CreditAttribution: sdboyer commented 19 December 2008 at 21:13

@drumm:

Sorry, I don't quite follow what the idea is in that blog post you linked to.

It's becoming increasingly clear that some amount of manipulation of the files will be necessary if Reflection is to be used at all; said manipulation will almost certainly mean use of the tokenizer via any number of possible approaches. Which leads me to think that the best approach will probably end up being a mix of the two.

@Crell:

As drumm said: Reflection, because it can cut out a _lot_ of work. Also, Reflection, because extending it for new custom @tags is trivial. And maybe finally, Reflection, because it's really worth not reinventing the wheel and using the php's own internal parser when and where possible.

I don't see a reason why we need to force the API module to run in a mass loop over everything. There are some potential complexities and dependency issues, sure; but assuming we have a good underlying data storage schema for it, we can either a) batch it into parts then split it over cronjobs if needed, and/or b) save hash values of parsed files and only re-run when hashes differ.

Log in or register to post comments

Comment #11

drumm

he/him

NY, US

CreditAttribution: drumm commented 26 December 2008 at 23:26

Things which this issue will be fixed. There should be unit tests for all.

* Heredoc with '}' in first column ending a function.
* globals recognized in strings #319978: $user is display twice in globals page (api.drupal.org)
* No objects or things inside of objects.

Log in or register to post comments

Comment #12

bobthecow CreditAttribution: bobthecow commented 27 January 2009 at 12:03

I have a (mostly) working PHP tokenizer based parser for the API module. It is currently running the Zoop Framework API site (when the site is up... That server is a bit sick right now).

I wrote it based on the 5.x API module, but am about to port it to the 6.x branch. It currently handles classes, class methods and member variables. It does run into problems with duplicate function names in the global namespace, but it is pretty solid otherwise.

When I port it to 6.x I'd like to tackle #101308: Add commenting on API objects too, because I've been itching for that (and views support) for a while.

Log in or register to post comments

Comment #13

drumm

he/him

NY, US

CreditAttribution: drumm commented 27 January 2009 at 14:50

Version:	master	» 6.x-1.x-dev
Priority:	Normal	» Critical

This is now top priority now that there is a stable 6.x release.

The way I envision this working is that the tokenizer pass splits code into an array of source code (constants, globals, functions, objects) and associated documentation comments.

Log in or register to post comments

Comment #14

sdboyer CreditAttribution: sdboyer commented 27 January 2009 at 17:42

So...I feel the need to apologize here.

I've been up and down through the various Reflection-based tools that are available, and I'm increasingly getting to be of the opinion that Reflection itself doesn't have that much to offer us. It does seem clear to me that there's a LOT from the phpdoc code base we could be borrowing (including a boatload of regex, iirc from my scan through it a month ago), but Reflection itself isn't really going to provide that much value-added. If we want to make it easy to add new @tag definitions, Reflection can help a bit, but it really doesn't seem to be the bee's knees in the way that I've been suggesting.

So, my apologies, but good luck with the reworking :)

Log in or register to post comments

Comment #15

Shiny CreditAttribution: Shiny commented 28 January 2009 at 03:00

i've forked this project, to add classes+methods (more p regex) - but now i'm trying the tokenizer method.
Is anyone else working on this?

Log in or register to post comments

Comment #16

bobthecow CreditAttribution: bobthecow commented 28 January 2009 at 15:01

yep. I've got a working classes+methods tokenizer based API module for 5.x, i'm in the middle of porting it to 6.x...

Log in or register to post comments

Comment #17

Shiny CreditAttribution: Shiny commented 29 January 2009 at 00:16

@bobthecow can you share a gitrepo of your D5 version? I'm keen to use this week and help in anyway - i'm setting up this api module for another opensource project.

Log in or register to post comments

Comment #18

Shiny CreditAttribution: Shiny commented 2 February 2009 at 21:24

@bobthecow can you share your D5 version? I can port it to 6 for you.

Log in or register to post comments

Comment #19

Damien Tournoud CreditAttribution: Damien Tournoud commented 5 February 2009 at 00:39

File	Size
tokenizer.patch	32.36 KB

Here is a first attempt at making a clean token()-based parser.

The parser itself works quite well, but several key features are still missing (but shouldn't be that hard to implement):
- code highlight (should be easy, we already have a decomposed form)
- group management
- promote classes to a first-class citizens in api.module
- globals, constants and member variables (should be fairly easy, too)

Log in or register to post comments

Comment #20

drumm

he/him

NY, US

CreditAttribution: drumm commented 5 February 2009 at 00:57

Whoa.. thats a bit more object-oriented than I was expecting. I was just hoping for a simple loop over token_get_all() that returns array(array('documentation comment' => ..., 'code' => '...', 'start line' => ...), ...). I am cautiously optimistic about this much engineering.

I might not be able to review this in-detail for a few days, especially with this much code. More sample code can be added to tests/sample/ and tests can check that the parsing functions are working.

Log in or register to post comments

Comment #21

Damien Tournoud CreditAttribution: Damien Tournoud commented 5 February 2009 at 01:05

Status:

Active

» Needs work

I was just hoping for a simple loop over token_get_all() that returns array(array('documentation comment' => ..., 'code' => '...', 'start line' => ...), ...).

I was too, at first. But the FSM that is required to implement this properly quickly overflowed my cognitive power. Seriously. So this is a divide and conquer approach.

More sample code can be added to tests/sample/ and tests can check that the parsing functions are working.

This is still a work in progress, but the whole D7 code based is now parsed without error, which is encouraging.

Log in or register to post comments

Comment #22

chx CreditAttribution: chx commented 5 February 2009 at 03:37

File	Size
foo.txt	2.95 KB

I also have some token based code from some earlier advantures of mine. It's nowhere near complete not at all but with copying it over to parser.inc of which quite some needs to be salvaged because the tokenizer won't tokenize the docblocks further for us I believe it can be a) finalized b) much closer to what Drumm wanted. However, for functions, this is a bit nicer. Feel free to disregard this piece or to usher me to finish it :)

Log in or register to post comments

Comment #23

drumm

he/him

NY, US

CreditAttribution: drumm commented 5 February 2009 at 03:49

Yes, we have to improve/rebuild/find another doc block parser. That should wait for phase 2.

Log in or register to post comments

Comment #24

bobthecow CreditAttribution: bobthecow commented 11 February 2009 at 14:39

Sounds like there is a lot of interest in getting this module updated. I've set up a github repo here, where I'll be working on my changes. It would be great if you all could join me :)

The initial commit is straight out of CVS HEAD.

Log in or register to post comments

Comment #25

Shiny CreditAttribution: Shiny commented 4 March 2009 at 01:42

@bobthecow - can't track that github because it only has one commit
try forking from http://github.com/Br3nda/drupal-module-api/tree/master

i'll try applying you code to that.

Log in or register to post comments

Comment #26

chx CreditAttribution: chx commented 13 April 2009 at 16:22

File	Size
parse.txt	5.92 KB

Log in or register to post comments

Comment #27

solotandem CreditAttribution: solotandem commented 11 June 2009 at 18:37

The new grammar parser (possibly with a plug-in to to parse the Doxygen comment block items) will suffice for parsing and providing the information needed for this project.

I am willing to integrate the new parser into this project and could start on this any time. Does it make sense to add me as a co-maintainer?

Log in or register to post comments

Comment #28

Damien Tournoud CreditAttribution: Damien Tournoud commented 11 June 2009 at 18:55

@solotandem: #19 is an already advanced research for a PHP grammar parser. It supports blocks, functions and classes. It could be easily extended to support expressions and function calls to. Take a look at it, it could help you kick-start your own project.

Log in or register to post comments

Comment #29

solotandem CreditAttribution: solotandem commented 13 June 2009 at 23:28

@Damien: Thanks for the suggestion. However, the grammar parser currently supports all but eight of the PHP tokens. Let me know if it could be put to good use on this project. (Note: the grammar parser code has not been committed yet.)

Log in or register to post comments

Comment #30

chx CreditAttribution: chx commented 14 June 2009 at 07:34

Damien, though #26 is not integrated with api.module yet, it is enough for what api.module needs and drumm asked for it because your code is a biit too much.

solotandem, you need to talk to drumm. If it supports classes and interfaces (mine does not) then it's great! We need that. I wanted to continue with those but never happened.

We have a doxygen parser in api module replacing that was never the target of this issue.

Log in or register to post comments

Comment #31

solotandem CreditAttribution: solotandem commented 15 June 2009 at 13:29

chx: Yes it supports classes and interfaces.

Log in or register to post comments

Comment #32

boombatower CreditAttribution: boombatower commented 17 June 2009 at 01:08

subscribe

I saw this issue and thought that solotandem's parse was a perfect fit.

Sounds like the integration and replacement of old parse will take a bit of work. Does it make sense for solotandem to become a co-maintainer instead of creating a giant patch of lots of little ones?

Log in or register to post comments

Comment #33

chx CreditAttribution: chx commented 19 June 2009 at 14:16

Where is this parser to look at?

Log in or register to post comments

Comment #34

drumm

he/him

NY, US

CreditAttribution: drumm commented 19 June 2009 at 17:46

The doc comment parser needs a lot of work and should be tackled in a separate issue; it will probably end up with a full rewrite. There are a number of small problems in other issues, such as @see basically not working. Lets keep this issue to the first pass, finding blocks of code and related doc comments. (The third pass is adding links at render.)

Chx's parser in #26 is currently preferred since it does what is needed, without extra complexity. Where is the plan that shows how PGP will move forward?

PGP looks like it is designed around saving outputted files. I am trying to move in the opposite direction, not relying on the file system and even removing the need to have a checkout. It looks like the Version Control API family of modules is getting or has the ability to read from CVS, so we no longer need to manage CVS checkout outside of Drupal. The target for all the intermediate parsing is the database so it can be queried.

Log in or register to post comments

Comment #35

solotandem CreditAttribution: solotandem commented 23 June 2009 at 17:30

File	Size
structure.html_.txt	1.57 KB
database.edited.inc_.txt	17.27 KB
database.inc_.txt	5.62 KB

chx: the PGP reader and writer classes are committed to CVS (no release yet).

drumm: the PGP API is not file based. The module GUI supports reading from and writing to files, but the API does not know anything about the file system. The reader parses the source code string passed to it. This string could be anything from an expression to an entire file. The writer produces a source code string from the grammar statements (in an array) passed to it. Other code (apart from the PGP API classes) has to get the source code snippet to parse and do something with the source code string returned. The grammar array and source code string can be saved in a database record, written to file, etc.

PGP will have three main classes:
- PGPReader to parse the tokens into grammar statements
- PGPEditor to edit the statements
- PGPWriter to produce a source code string from the edited statements.

I have attached the following files (with a .txt extension added):
- The source file "database.inc" is an excerpt from a recent version of includes/database.inc.
- The grammar file "database.edited.inc" is a modified grammar array structure produced by the PGP parser.
- The file "structure.html" with notes on the grammar array structure in the edited file.

The "edited" file was produced by:
- parsing the source file using PGPReader.
- editing the grammar statements using (a pre-commit version of) PGPEditor to remove the function body "statements" and add a "text" element with the original function text (minus its doc comment).
- writing the text element using PGPWriter to reproduce the text of the function from the grammar statements.

I have left in the array structure the PHP open tag, Drupal ID tag, and whitespace tags, but these are easily ignored.

Future plans for PGP include convenience functions to produce:
- lists of classes, functions, variables, constants, etc.
- the information included in the "edited" file above

The "edited" file includes information I would expect the API documentation module to need. What other information is needed?

Does the PGP parser seem like a fit for this module? Please post any comments you may have about the parser as an issue on the PGP project.

Log in or register to post comments

Comment #36

Frando CreditAttribution: Frando commented 5 July 2009 at 23:31

Wow, database.edited.inc_.txt looks really really impressive. For me this seems to be a great base for api module, it supports everything needed and much more than api's current parser does. We really need to handle this until D7 goes beta or something, the sooner the better. So for me using this pgp parser looks like the way forward. I don't think the rewriting part is needed for api.module though. And subscribe.

Log in or register to post comments

Comment #37

drumm

he/him

NY, US

CreditAttribution: drumm commented 7 July 2009 at 05:08

Priority:

Critical

» Normal

Next release is soon, a change this big can't make it.

Log in or register to post comments

Comment #38

solotandem CreditAttribution: solotandem commented 14 July 2009 at 02:57

File	Size
300031_rework_PHP_parser.patch	23.06 KB

The attached patch integrates the API module with the PGP module to parse source code files and produce documentation items.

To test the integration:

I created documentation for the following files using the existing code and the patch code:
- includes/common.inc
- includes/locale.inc
- api
I saved the api tables created from each set of code.
I compared the columns in the two sets of tables.

All columns have the same text with two exceptions:

The existing code did not create an item for the function api_format_documentation_lists.
The patch code included items for 5 uncommented globals. (The existing code does not locate them as they are undocumented. I am not sure whether they should be included or not.)
The patch code included "@ingroup themeable" in the documentation column for the theme_locale_languages_overview_form item.

To verify these results, you will need to download and install the latest PGP module committed after this comment and apply a simple patch to make it compatible with 6.x (as only a 7.x-dev release exists). The updated module and patch will be posted shortly.

In the process of integrating PGP, I found a few instances where the regular expressions used in API are opportunistic as far as what they expect to encounter in the text (and do not work in certain edge cases). I have not changed these but would expect to deal with them through the standard issue queue workflow.

PGP will automatically provide documentation items for undocumented code elements (functions, constants and globals). Also, PGP is able to provide the needed documentation items for classes and interfaces (I left commented out lines with the tokens related to same in the patch). To incorporate these items, what database schema changes are planned (for elements like scope modifiers)?

Log in or register to post comments

Comment #39

drumm

he/him

NY, US

CreditAttribution: drumm commented 15 July 2009 at 06:13

About the exceptions:
* api_format_documentation_lists() should not have documentation. Its doc comment did not start with /** until I committed a fix a couple minutes ago.
* Globals should only be picked up when they are on their own, not within functions or classes. Working doc comments are at http://cvs.drupal.org/viewvc.py/drupal/contributions/docs/developer/glob....
* This parser doesn't need to worry about @ingroup or other tags, that is the next stage's job.

More notes:
* Freeing up memory is not necessary unless there are serious memory usage issues. I always recommend running cron via PHP CLI with Drush or drupal.sh, so it can have a sufficiently high memory allocation.
* Freeing memory would be be less of a concern if global variables were not used. Globals should be avoided.
* I'm guessing there may be future plans that make more sense, but PGPEditor looks like it just implements some singleton functions.
* To reiterate, I would like to see this PHP parser be separate from the doc comment parser. The "This needs to be reworked into a robust Doxygen-style comment parser." comment in pgp.editor.inc suggests these two stages of parsing might be intermingled. Any bugs in the doc comment parser should be filed in separate issues.

Log in or register to post comments

Comment #40

solotandem CreditAttribution: solotandem commented 15 July 2009 at 21:22

Thanks for the review.

About your remarks:
* The PGP parser is not doing anything with @ingroup or other documentation tags (these are being handled in api/parser.inc).
* A robust Doxygen-style comment parser is needed somewhere (as the api module relies on opportunistic parsing that fails in places as evidenced by the display errors on api.drupal.org) with (as you write) its own issues.
* I freed up memory because I was testing it in real time from PHP CLI (but outside of cron, Drush or drupal.sh) and my PHP memory usage limit (128M) was being hit (granted I could have upped the setting).
* Regarding globals: $editor could easily be passed to the two functions using it; $nested_groups could be as well, but $docblocks would require a bit more care.

Log in or register to post comments

Comment #41

drumm

he/him

NY, US

CreditAttribution: drumm commented 16 July 2009 at 06:07

Okay, I guess we will have to pay attention to memory usage, that is a lot.

I fully agree that our doxygen parser sucks, but one thing at a time. It does decently for now.

Keep up the good work. PGP will need a good Drupal 6 release since this will be running in production on Drupal 6.

Log in or register to post comments

Comment #42

sdboyer CreditAttribution: sdboyer commented 28 July 2009 at 13:08

For the parsing of @tags, if my recollections from crawling through phpdoc's guts are accurate, it should be pretty easy to borrow from/build on their general approaches.

Given my interest in the @tags...and the fact that I pretty much completely neglected to do any of the help on this project I was hoping to last year...I'll offer to do the relevant research & coding to improve our @tag handling.

Log in or register to post comments

Comment #43

drumm

he/him

NY, US

CreditAttribution: drumm commented 1 August 2009 at 04:34

sdboyer - please open a separate issue for the doc comment parser itself; it certainly needs some help. This issue is only for PHP code parsing.

Log in or register to post comments

Comment #44

drumm

he/him

NY, US

CreditAttribution: drumm commented 25 August 2009 at 20:31

Priority:

Normal

» Critical

Any progress on this?

Log in or register to post comments

Comment #45

sdboyer CreditAttribution: sdboyer commented 28 August 2009 at 06:04

Sorry, I've done basically no coding on my own time in the last month. No real good reason, just code block. I'm finally pushing through. I've got some other stuff that needs to be written over the next few days, but unless I'm a complete douche, I'll file that new issue early next week.

Log in or register to post comments

Comment #46

Carl Johan CreditAttribution: Carl Johan commented 4 September 2009 at 11:57

+1

Log in or register to post comments

Comment #47

drumm

he/him

NY, US

CreditAttribution: drumm commented 4 September 2009 at 13:25

Parsing only form.inc from Drupal 7 uses just under 326k with chx's parse script. This will of course go up a little in an actual implementation.

Log in or register to post comments

Comment #48

drumm

he/him

NY, US

CreditAttribution: drumm commented 8 September 2009 at 02:15

I have been cleaning up parser.inc in preparation for the new parser. It now has 11.9% less code, removing a few duplicated sections. Next is to separate out doc comment parsing from the PHP parser, so there is a clean area to work around.

I think everything is still working, but it would have been good to add unit tests. Adding code to tests/sample and testing for the parsed code in simpletests would be a great contribution.

Log in or register to post comments

Comment #49

jhodgdon

she/her

English

CreditAttribution: jhodgdon commented 2 October 2009 at 13:37

Just a comment: I do not think that Drupal 7 should be released without this issue being fixed on api.drupal.org. As things stand now, as I'm sure you are aware, there is no presence at all on a.d.o for the now many classes that are key to Drupal 7 functionality, and it really cuts into the usefulness of api.drupal.org.

For instance, I was trying to figure out yesterday what had happened to the search API between D6 and D7, in order to refactor a patch that no longer applied after the DBTNG (whatever that stands for) changes. My patch added args to the search_simplify() function, and api.drupal.org currently shows in D7 that it was only being called during search indexing, not during searching. WTF? Can't be true... (you need to preprocess search terms the same when you index them as when you are searching).

So I had to turn on my grep-through-the-code fu, and finanally found out the reason is that it's called inside a PHP class instead of a normal function, and it's simply not there on api.drupal.org.

Just one example... Anyway, in my opinion, we should not release D7 without api.drupal.org having a way to look at docs for classes.

Log in or register to post comments

Comment #50

boombatower CreditAttribution: boombatower commented 2 October 2009 at 20:31

solotandem is out of town for the next week or so, but he has re-factor the parser for a number of reasons, but I think we should move forward on this.

1) Replace current api module PHP parsing with parser module.
2) Design class pages and data structure.
3) Add additional integration to parse the new structures.

Log in or register to post comments

Comment #51

drumm

he/him

NY, US

CreditAttribution: drumm commented 5 October 2009 at 00:26

Does "parser module mean pgp? Does "re-factor the parser" mean re-factor pgp? Is there an issue to track this re-factoring.

Most of the recent commits to parser.inc have been separating PHP and doc comment parsing. Any doc comment parsing is moving into api_save_documentation(). The remaining PHP parser should be relatively clean to replace.

Log in or register to post comments

Comment #52

drumm

he/him

NY, US

CreditAttribution: drumm commented 5 October 2009 at 00:29

If anyone is looking to help, unit tests would be incredibly useful here.

Log in or register to post comments

Comment #53

drumm

he/him

NY, US

CreditAttribution: drumm commented 5 October 2009 at 01:22

I think the refactoring phase of this development is done. I probably won't go ahead with swapping in a parser today, pending a status update on pgp.

Log in or register to post comments

Comment #54

Frando CreditAttribution: Frando commented 9 December 2009 at 14:15

bump. Is there any news about this? Is PGP usable for api.module by now? We really need to tackle this prior to a D7 release and hopefully also prior to betas, as otherwise all our classes are completely undocumented on api.drupal.org, which is quite fatal.

Is there a plan on how to procede? Which parts need help?

Log in or register to post comments

Comment #55

solotandem CreditAttribution: solotandem commented 9 December 2009 at 15:47

PGP is usable for api.module. However, to handle interfaces and classes, aren't some changes needed to the api module tables and the corresponding arrays of values in the code? Let me know and I will post a patch to integrate this module with the grammar parser module.

Log in or register to post comments

Comment #56

Frando CreditAttribution: Frando commented 11 December 2009 at 13:49

Yes, the internal structure of api.module will have to be extended to support classes and interfaces.

And we want to have crosslinking from interfaces to classes that implement it and of course crosslinking in class hierarchies.

Log in or register to post comments

Comment #57

webchick

she/they

English

Vancouver 🇨🇦

CreditAttribution: webchick commented 17 January 2010 at 20:31

Now that 7.x alpha1 is out, this is even more critical, since it actively blocks people from porting their modules. Any chance someone could chisel out some time to work on this /very/ soon?

Log in or register to post comments

Comment #58

chx CreditAttribution: chx commented 17 January 2010 at 22:51

OK so this issue is cursed. Theoretically PGP could work with API. In reality it's not been rerolled or pushed for months and tonight when I wanted to do something useful I wasted that time with debating whether to use it or not. Great.

Log in or register to post comments

Comment #59

chx CreditAttribution: chx commented 17 January 2010 at 22:52

And again why it's just Jim and me? There is never any help not even for the most important things that really everyone wants. No, we are doing Hitler and Drupal videos instead, so fucking lots easier!

Log in or register to post comments

Comment #60

chx CreditAttribution: chx commented 18 January 2010 at 01:26

We discussed with Neil that for simplicity we will store things like Database::getConnection as a function name. Further enhancement would be handling return types -- link the interface automatically indicated in the doxygen.

All this, however will not solve all the problems. Just with a static code analyzer it's rather tricky to figure out where the doxygen of a given method lies as we document at interfaces not in the classes themselves. But no we squarely refuse to add a simple marker to the method names which would help...

Log in or register to post comments

Comment #61

aismail CreditAttribution: aismail commented 19 January 2010 at 12:22

The heavy-lifting (a.k.a. parsing) doesn't need to be done on the Drupal side. We could use an external tool for that which is good at what it does.

I'd suggest using a third-party parser, written in whatever language, such as the one from http://www.phpcompiler.org/. The existing ones will have numerous unit tests and the maintenance work wouldn't be on the Drupal side, which has limited interest in investing work in such a project (api.drupal.org being one of the main uses).

Usually these kind of parsers output an Abstract Syntax Tree, which is a tree representation of one or more source files. Functions, variables, classes are all nodes in that tree, together with the associated comments (if parser doesn't ignore them) and with information like "line number & column" (where the given node appears). What api.drupal.org needs is a pretty basic traversal of that AST.

Now, I see 2 approaches, depending on the language in which the parser is written and on its export abilities:
1) Export the AST to a file (by calling an external process), which is then used as input for a Drupal module which re-creates the AST structure in memory and walks it
2) Write the traversal in the same language as the parser, linking them in a big executable. The output should then be exported to a standardized format (such as xml, CSV), simple enough for a Drupal module to parse and insert into the schema of API module.

2) is probably simpler to write (since re-creating the AST in memory based on its export might be tedious) but 1) keeps all the Drupal-specific code in a Drupal module. Opinions? If it was my choice, I'd probably choose 1).

Log in or register to post comments

Comment #62

chx CreditAttribution: chx commented 19 January 2010 at 14:09

FYI: phc can dump the AST in an XML file. However, we lose comments so I am not 100% this is a viable solution, after all.

Log in or register to post comments

Comment #63

chx CreditAttribution: chx commented 19 January 2010 at 15:03

Another candidate is http://www.eclipse.org/pdt/articles/ast/PHP_AST.html but we need someone with Java skills to write a little command line tool so we can run with it.

Log in or register to post comments

Comment #64

CorniI CreditAttribution: CorniI commented 19 January 2010 at 15:21

File	Size
cache-install.rphp_.txt	4.55 KB
cache-install.phc_.txt	27.52 KB

Hi,
I got working installations of both phc (dataflow branch) and roadsend rphp.
Both dumps show the XML output of cache-install.inc
For rphp, i had to remove the type-hinted array, as type-hints aren't supported yet in the grammar. Also rphp is unsuitable for the given task currently, because it's AST dumper and parser is too incomplete to parse all of drupal. This might look different in a half year, but well...
phc warns that it's line-numbers may be a bit off, so this needs to be hand-controlled. This is a bug in their parser they (pbiggar) don't know the reason for.
@#61: I'd probably implement everything as an AST visitor in the host language and fill the database from that visitor, this should be the most straight-forward. API-Module then would just be responsible for presenting the database contents. This would also scale quite good, because the memory consumption of a good parser should be much lower than it is for php files (require'd or not).
Corni

Log in or register to post comments

Comment #65

boombatower CreditAttribution: boombatower commented 19 January 2010 at 20:29

Or we use the existing parser that can be controlled from php and provides everything we need...? All the alternatives require another method of importing the data into PHP and maintaining/interfacing with an external language/library.

Log in or register to post comments

Comment #66

Crell CreditAttribution: Crell commented 19 January 2010 at 20:35

If the existing parser provided everything we needed, this issue wouldn't be here. We don't need to custom write everything if existing systems can do it for us.

Log in or register to post comments

Comment #67

boombatower CreditAttribution: boombatower commented 19 January 2010 at 20:37

Clarification as the many comment above point to PGP a project on Drupal.org which is what I have been referring and just needs a re-roll of patch which solotandem is working on, but is running into issues with the way API module is designed as it seems to do parsing on every page load which is rather wasteful. He is trying to come up with a simple way to integrate without redesigning it to do all the parsing at the beginning for the time being.

My suggestion would be to serialize whatever information is displayed, but doesn't need to be stored in a structured manor for searching or whatever, but parsing (however limited) is not what we want to do on every page load.

Log in or register to post comments

Comment #68

solotandem CreditAttribution: solotandem commented 20 January 2010 at 00:00

I will post a re-rolled patch by tomorrow. I have the integration code re-rolled and am comparing the results of using the existing API module code to that with the Grammar Parser as the parsing engine. A little bit of due diligence.

Log in or register to post comments

Comment #69

drumm

he/him

NY, US

CreditAttribution: drumm commented 20 January 2010 at 03:26

For PGP - what have you found for it's memory usage? Is that still a problem? Is it running without patching on Drupal 6?

The patch I'm expecting is a relatively straightforward replacement of api_parse_php_file()'s implementation. This is all done at parse time, anything API does on page load shouldn't be part of this issue.

Log in or register to post comments

Comment #70

solotandem CreditAttribution: solotandem commented 20 January 2010 at 20:16

File	Size
300031_rework_PHP_parser_2.patch	15.07 KB

The attached patch integrates the API module with the Grammar Parser module to parse source code files and produce documentation items. I added a constant to allow easy switching between the existing code and the integrated code to compare the documentation results. This would be eliminated in a final patch.

Further integration is possible. For example, in api_format_php(), the call to token_get_all is redundant and could be eliminated as the Grammar Parser already made this call and has this information stored in the grammar objects.

I tested the integration in the same fashion as described in #38 and with the same observations. In addition:

The existing code does not include @file comments.
The patch code includes items for the class functions in the pgp module. (I realize the current code does not include these as this is closely related to the topic of this issue.)
The patch code does not store the line numbers. (They can be included in a later patch.)

To verify these results, download and install the latest PGP release created after this comment and apply a simple patch to make it compatible with 6.x (as only a 7.x-dev release exists).

This patch treats the class functions the same as the non-class functions as the details of what to store remain to be fleshed out. In the function signature, I dropped the word 'function' and left the scope modifiers. On display, the existing code drops the function name and only shows the scope modifier. Again, I left this until the details are worked out. We need to decide how to handle:

scope modifiers
extended and implemented classes

drumm: The memory usage grows as more modules are parsed in a single process. If the documentation is run in batches (using the queue as it is now) the memory requirements are not an issue. All of the above mentioned files ran in one process that used about 75MB of memory? The parser engine is not Drupal-dependent or Drupal-specifc. The user interface elements of the parser module are, of course, Drupal-version-specific. The attached patch is a replacement of the api_parse_php_file() routine.

Log in or register to post comments

Comment #71

drumm

he/him

NY, US

CreditAttribution: drumm commented 21 January 2010 at 06:52

Quite a lot of refactoring has happened in the API module since #38, to make this replacement more straightforward. I don't think #70 has fully caught up. An incomplete list of examples:
- Literal '@' are no longer used. the API_RE_TAG_START constant uses a negative lookbehind assertion to make sure it is not '\@', allowing escaping of @.
- Nested groups are handled in api_save_documentation(), adding api_documentation_nested_group() is unnecessary.
- In the added api_documentation_item(), @param, @return, and @see are parsed, this is also redundant with api_save_documentation().

This patch should not need to change any of the @tag handling, which I have moved out of api_parse_php_file(). I think all the separation of PHP parsing and doc comment parsing is already done. More work is probably needed for doc comment parsing, but not now and not in this issue.

Similarly, I'm not concerned about reimplementing api_format_php() now, it is doing okay for now. And yes, more work after this will be needed for classes, I don't expect one patch to solve everything. I would rather verify that files, functions, constants, and globals work okay with a new parser and proceed from there.

I looked into d.o's PHP memory configuration, and it is already quite high, probably for some project* processing. That should not be an issue as long as nothing is leaking memory like a sieve. The queue will run as many files as it can in 2.5 minutes.

I've been running API.d.o on released versions of the API module and the whole deployment system works a bit better with released versions. Unfortunately with the redesign and complexity of d.o, it will be awhile before API is running Drupal 7. It would be good to have a stable 6.x release of PGP.

Otherwise, the interesting part of this patch seems relatively straightforward, which is what I am looking for. I wouldn't say I'm committed to PGP quite yet. If the implementation is straightforward and the project is well-supported, then it should be a good way to go.

Log in or register to post comments

Comment #72

solotandem CreditAttribution: solotandem commented 21 January 2010 at 14:55

Thank you for reviewing the patch and posting your comments.

Regarding the refactoring since #38:
- I agree that where the patch code used the existing code's doc comment parsing, it should have been updated to use the refactored code with the API_RE_TAG_START constant, etc.
- Separation of doc comment parsing and nested groups from the php parsing makes sense, but placing the doc comment parsing into the api_save_documentation() routine seems a bit misplaced. Shouldn't there be a routine called api_parse_doc_comment() separate from api_save_documentation()? Leaving this parsing code in the api_documentation_item() routine seemed preferable to the refactored existing code (one can certainly argue that parsing the php file includes the doc comment until one saw fit to refactor the latter into a separate routine). You refer to this code as redundant in the patch, but the patch commented out this parsing in the api_save_documentation routine (as it seems ill-placed).

Regarding PGP:
- The absence of a 1.0 release is no indication the code is not stable. (As a personal practice, I do not commit code to the dev release that is knowingly unstable. I also expect to make an official 1.0 release soon.)
- The patch to convert the module to 6.x is sufficient for this purpose. This patch makes no changes to the parser engine, only the user interface elements which the API module will not use anyway. (That said, if the lack of a 6.x release is a show-stopper, I can make a 6.x release.)
- What is the basis for questioning the project being well-supported? It is a key element in radically changing the whole dynamic and timing of contributed module updates in conjunction with Coder Upgrade. Not to mention other uses, such as core module updates and security reviews.

Log in or register to post comments

Comment #73

drumm

he/him

NY, US

CreditAttribution: drumm commented 26 January 2010 at 00:20

File	Size
api.patch	5.75 KB

At some point, api_save_documentation() could be split up, so @tag parsing and saving is separate. This issue doesn't deal with @tag parsing, that can be tackled elsewhere. I'm happy with it all being in one place right now.

When this is all ready to go on api.d.o, a 6.x release would be great. Drupal.org is going to be 6.x for awhile, and I like running released versions.

I've been reshuffling code where I can during a DrupalCamp, with the goal of simplicity and a small change. The patch is quite small now, and shows that PGP can get this done cleanly. I have
- removed any changes to @tag parsing
- rolled functions that are called once, the ones that build docblock arrays, into the calling code
- removed globals

The attached patch is quite broken, I haven't had a chance to actually test, but I thought I would post in case someone wants to jump in.

A few questions
- What is the plan for line numbers?
- I noticed Grammar Parser has functions like ->getFunctions() and ->traverseCallback(). Could/should these be used instead of our own loop?

Log in or register to post comments

Comment #74

solotandem CreditAttribution: solotandem commented 27 January 2010 at 06:09

File	Size
300031_rework_PHP_parser_3.patch	9.98 KB

The attached patch follows the approach used in #73, but corrects the broken items. The two main corrections are reverting the $default_block code (the patch in #73 essentially relates all of the documentation items to a single file) and restoring the T_FUNCTION case to the switch statement at the end of api_documentation_loop().

For class methods, the patch stores class_name::function_name as the object_name in the api_documentation table. However, because the api module does its own parsing for function calls straight from the source code, the information included in the parser objects is lost when determining references. Because there can be function name collisions among classes, the code in api_parse_function_calls() and api_format_php needs to change. The references stored in the api_reference_storage table include bogus references among the namespace collisions, e.g. __construct.

To implement the documentation of classes, it seems to me we need 1) a game plan for storing and displaying the information, and 2) most likely, need to take further advantage of the grammar parser and what it knows about the contents of the file.

Also, for anyone interested in testing this, there is a change to PGP that requires a fresh download of the module, its D6 patch, and application of its D6 patch.

Log in or register to post comments

Comment #75

drumm

he/him

NY, US

CreditAttribution: drumm commented 28 January 2010 at 17:26

Looks like this is improving. I'm traveling again this weekend, but I should be able to find time to do another round, maybe commit on this.

I'm fine with handling the UI and DB changes. I'll try to keep those simple, but the UI could definitely use a little refactoring in the future.

For the two other uses of PGP, references and PHP formatting, I would like to save those for a subsequent patch so this stays manageable.

What was the plan for line numbers?

Thanks again for working on this.

Log in or register to post comments

Comment #76

drumm

he/him

NY, US

CreditAttribution: drumm commented 2 February 2010 at 08:01

All Simpletests now pass in API head, so that is something to look for. Right now it is really basic, just a file with a function to test some linking. I'll look into adding lower-level tests for {api_documentation} and related tables get filled properly from the test branch.

Log in or register to post comments

Comment #77

drumm

he/him

NY, US

CreditAttribution: drumm commented 2 February 2010 at 08:30

File	Size
api.patch	5.62 KB

My version of the patch which passes all tests, so some basics with files and functions are working.

Log in or register to post comments

Comment #78

solotandem CreditAttribution: solotandem commented 6 February 2010 at 18:06

File	Size
300031_rework_PHP_parser_4.patch	13.01 KB

The attached patch includes line numbers and restores the group and mainpage blocks.

In my comparison tests, it appears the current API module:

does not pick up file documentation comments
shows the line number as one greater than the actual line number

Log in or register to post comments

Comment #79

drumm

he/him

NY, US

CreditAttribution: drumm commented 10 February 2010 at 07:57

Status:

Needs work

» Fixed

Committed a version of this patch. I did have to special case the object_name of globals. The name, used in URLs and such, should not have a $, but the object_title should have a $.

One thing to keep in mind is this will be run from cron, so output will be emailed somewhere. "assign_variable = $sample_global in globalToString" happens now. Usually nothing should be print/echoed. Now classes and interfaces need the parsing nailed down, need to be stored, and have a UI.