Rework PHP parser
drumm - August 26, 2008 - 06:25
| Project: | API |
| Version: | 6.x-1.x-dev |
| Component: | Parser |
| Category: | task |
| Priority: | critical |
| Assigned: | Unassigned |
| Status: | needs work |
Description
The first pass of the PHP parser is to chunk up code into objects such as functions, constants and globals. Currently, this uses regular expressions and depends on things like functions starting with /^f/ and ending with the first /^}$/.
Instead, PHP's tokenizer should be used. This will fix the issues and pave the way for documenting object-oriented code. The API should be standardized so other file types, like CSS and JS, can be parsed in the future.
Unit tests will be required.

#1
#2
AWESOME! This is really great. I'd had the Reflection API in mind myself, but aside from the php4/5 question, I haven't really investigated the pros/cons of one vs. the other.
#3
Reflection API and PHP 5 are great, however, it requires including the code, which would not be an option for documenting all of contrib.
#4
The registry contains some code in registry.inc that could work as a starting point. It already scans files for objects and functions. Add the other required tokens of interest, i.e. T_COMMENT, T_DOC_COMMENT.
The coder_format script shipped with Coder module might also be worth to look at. It is capable of reformatting PHP files so they match Drupal's coding standards (style).
#5
I floated a thought to drumm in irc today; reposting it here as requested. Caveat: I don't understand how a token-based parser does/would/could work; my knowledge base is doing documentation using Reflection. drumm and I have talked about this before, and we basically ended up agreeing that both approaches would probably be useful. The big problem with a Reflection-based approach is the potential security hole: since the files have to be
require'd in before their contents can be reflected upon, it means that anything in global scope in those files will be executed. That's fine if you trust the code, but if we're doing, say, cron-based runs of everything in Drupal CVS...problems.But that's old news. The new stuff is a way of potentially getting around that security problem: instead of just requiring in the files to be read for documentation purposes, we manipulate the file first, THEN parse it in and go. Specifically, we read in the file, suck out everything that's in global scope and drop it into a dead-space function. We then write the file back out (unless we can just read it from memory without things getting REALLY hacky) with the dead-space function in it, THEN parse in that temporary file, and begin our Reflecting.
I'll also email either php-general or php-internals about this and see what I get back, aside from flames.
#6
The problem with the approach in #5 is that if you suck out anything non-global, you also lose anything inside a class, vis, methods. We have a growing number of classes that need documenting.
Keeping the string in memory should be easy in PHP 5 thanks to the memory:// stream wrapper.
Has anyone taken the time to look at the phpDocumentor project and how they handle it? They're syntax is close enough anyway that we could probably leverage much or all of that codebase, and get much better standardization with the rest of the PHP world in the process.
#7
I think you misread what I said? The suggestion was to suck out globals, not suck out non-globals. I think I may see the problem you're alluding to, but...well, could you clarify?
phpDocumentor is token-based, not Reflection-based. A lot of what they do could be re-used for a token-based approach, but is basically irrelevant to a Reflection-based approach. Arbitrary code execution is only an issue for Reflection-based systems (at least so far as we've thought it through), so there's not much to offer in that vein. But, most importantly!
This sparked a thought I had previously had, but that I only just started investigating now, and yep:
include()can take stream arguments other than conventional files. I need to investigate this more, but it seems like there are some very interesting potential options there...#8
I'm curious why we want to use reflection so badly? Remember we also want to keep global-level comment blocks. And that still doesn't solve the problem of having to parse all that code into memory, from which it cannot be removed. You're going to have to either break it up into a separate PHP process for each file or buy an awful lot of RAM. :-)
#9
Reflection would be a convenient way of saving a lot of work, since it comes with getDocComment(). However, it simply won' pick up globals or constants.
The tokenizer would basically be a state machine that iterates over the token array, keeping track of nesting level and looking for where things start & stop. I recently committed a small use of the PHP tokenizer in api_page_function(). It draws http://delocalizedham.com/node/10.
#10
@drumm:
Sorry, I don't quite follow what the idea is in that blog post you linked to.
It's becoming increasingly clear that some amount of manipulation of the files will be necessary if Reflection is to be used at all; said manipulation will almost certainly mean use of the tokenizer via any number of possible approaches. Which leads me to think that the best approach will probably end up being a mix of the two.
@Crell:
As drumm said: Reflection, because it can cut out a _lot_ of work. Also, Reflection, because extending it for new custom @tags is trivial. And maybe finally, Reflection, because it's really worth not reinventing the wheel and using the php's own internal parser when and where possible.
I don't see a reason why we need to force the API module to run in a mass loop over everything. There are some potential complexities and dependency issues, sure; but assuming we have a good underlying data storage schema for it, we can either a) batch it into parts then split it over cronjobs if needed, and/or b) save hash values of parsed files and only re-run when hashes differ.
#11
Things which this issue will be fixed. There should be unit tests for all.
* Heredoc with '}' in first column ending a function.
* globals recognized in strings #319978: $user is display twice in globals page (api.drupal.org)
* No objects or things inside of objects.
#12
I have a (mostly) working PHP tokenizer based parser for the API module. It is currently running the Zoop Framework API site (when the site is up... That server is a bit sick right now).
I wrote it based on the 5.x API module, but am about to port it to the 6.x branch. It currently handles classes, class methods and member variables. It does run into problems with duplicate function names in the global namespace, but it is pretty solid otherwise.
When I port it to 6.x I'd like to tackle #101308: Add commenting on API objects too, because I've been itching for that (and views support) for a while.
#13
This is now top priority now that there is a stable 6.x release.
The way I envision this working is that the tokenizer pass splits code into an array of source code (constants, globals, functions, objects) and associated documentation comments.
#14
So...I feel the need to apologize here.
I've been up and down through the various Reflection-based tools that are available, and I'm increasingly getting to be of the opinion that Reflection itself doesn't have that much to offer us. It does seem clear to me that there's a LOT from the phpdoc code base we could be borrowing (including a boatload of regex, iirc from my scan through it a month ago), but Reflection itself isn't really going to provide that much value-added. If we want to make it easy to add new @tag definitions, Reflection can help a bit, but it really doesn't seem to be the bee's knees in the way that I've been suggesting.
So, my apologies, but good luck with the reworking :)
#15
i've forked this project, to add classes+methods (more p regex) - but now i'm trying the tokenizer method.
Is anyone else working on this?
#16
yep. I've got a working classes+methods tokenizer based API module for 5.x, i'm in the middle of porting it to 6.x...
#17
@bobthecow can you share a gitrepo of your D5 version? I'm keen to use this week and help in anyway - i'm setting up this api module for another opensource project.
#18
@bobthecow can you share your D5 version? I can port it to 6 for you.
#19
Here is a first attempt at making a clean token()-based parser.
The parser itself works quite well, but several key features are still missing (but shouldn't be that hard to implement):
- code highlight (should be easy, we already have a decomposed form)
- group management
- promote classes to a first-class citizens in api.module
- globals, constants and member variables (should be fairly easy, too)
#20
Whoa.. thats a bit more object-oriented than I was expecting. I was just hoping for a simple loop over token_get_all() that returns array(array('documentation comment' => ..., 'code' => '...', 'start line' => ...), ...). I am cautiously optimistic about this much engineering.
I might not be able to review this in-detail for a few days, especially with this much code. More sample code can be added to tests/sample/ and tests can check that the parsing functions are working.
#21
I was too, at first. But the FSM that is required to implement this properly quickly overflowed my cognitive power. Seriously. So this is a divide and conquer approach.
This is still a work in progress, but the whole D7 code based is now parsed without error, which is encouraging.
#22
I also have some token based code from some earlier advantures of mine. It's nowhere near complete not at all but with copying it over to parser.inc of which quite some needs to be salvaged because the tokenizer won't tokenize the docblocks further for us I believe it can be a) finalized b) much closer to what Drumm wanted. However, for functions, this is a bit nicer. Feel free to disregard this piece or to usher me to finish it :)
#23
Yes, we have to improve/rebuild/find another doc block parser. That should wait for phase 2.
#24
Sounds like there is a lot of interest in getting this module updated. I've set up a github repo here, where I'll be working on my changes. It would be great if you all could join me :)
The initial commit is straight out of CVS HEAD.
#25
@bobthecow - can't track that github because it only has one commit
try forking from http://github.com/Br3nda/drupal-module-api/tree/master
i'll try applying you code to that.
#26
#27
The new grammar parser (possibly with a plug-in to to parse the Doxygen comment block items) will suffice for parsing and providing the information needed for this project.
I am willing to integrate the new parser into this project and could start on this any time. Does it make sense to add me as a co-maintainer?
#28
@solotandem: #19 is an already advanced research for a PHP grammar parser. It supports blocks, functions and classes. It could be easily extended to support expressions and function calls to. Take a look at it, it could help you kick-start your own project.
#29
@Damien: Thanks for the suggestion. However, the grammar parser currently supports all but eight of the PHP tokens. Let me know if it could be put to good use on this project. (Note: the grammar parser code has not been committed yet.)
#30
Damien, though #26 is not integrated with api.module yet, it is enough for what api.module needs and drumm asked for it because your code is a biit too much.
solotandem, you need to talk to drumm. If it supports classes and interfaces (mine does not) then it's great! We need that. I wanted to continue with those but never happened.
We have a doxygen parser in api module replacing that was never the target of this issue.
#31
chx: Yes it supports classes and interfaces.
#32
subscribe
I saw this issue and thought that solotandem's parse was a perfect fit.
Sounds like the integration and replacement of old parse will take a bit of work. Does it make sense for solotandem to become a co-maintainer instead of creating a giant patch of lots of little ones?
#33
Where is this parser to look at?
#34
The doc comment parser needs a lot of work and should be tackled in a separate issue; it will probably end up with a full rewrite. There are a number of small problems in other issues, such as @see basically not working. Lets keep this issue to the first pass, finding blocks of code and related doc comments. (The third pass is adding links at render.)
Chx's parser in #26 is currently preferred since it does what is needed, without extra complexity. Where is the plan that shows how PGP will move forward?
PGP looks like it is designed around saving outputted files. I am trying to move in the opposite direction, not relying on the file system and even removing the need to have a checkout. It looks like the Version Control API family of modules is getting or has the ability to read from CVS, so we no longer need to manage CVS checkout outside of Drupal. The target for all the intermediate parsing is the database so it can be queried.
#35
chx: the PGP reader and writer classes are committed to CVS (no release yet).
drumm: the PGP API is not file based. The module GUI supports reading from and writing to files, but the API does not know anything about the file system. The reader parses the source code string passed to it. This string could be anything from an expression to an entire file. The writer produces a source code string from the grammar statements (in an array) passed to it. Other code (apart from the PGP API classes) has to get the source code snippet to parse and do something with the source code string returned. The grammar array and source code string can be saved in a database record, written to file, etc.
PGP will have three main classes:
- PGPReader to parse the tokens into grammar statements
- PGPEditor to edit the statements
- PGPWriter to produce a source code string from the edited statements.
I have attached the following files (with a .txt extension added):
- The source file "database.inc" is an excerpt from a recent version of includes/database.inc.
- The grammar file "database.edited.inc" is a modified grammar array structure produced by the PGP parser.
- The file "structure.html" with notes on the grammar array structure in the edited file.
The "edited" file was produced by:
- parsing the source file using PGPReader.
- editing the grammar statements using (a pre-commit version of) PGPEditor to remove the function body "statements" and add a "text" element with the original function text (minus its doc comment).
- writing the text element using PGPWriter to reproduce the text of the function from the grammar statements.
I have left in the array structure the PHP open tag, Drupal ID tag, and whitespace tags, but these are easily ignored.
Future plans for PGP include convenience functions to produce:
- lists of classes, functions, variables, constants, etc.
- the information included in the "edited" file above
The "edited" file includes information I would expect the API documentation module to need. What other information is needed?
Does the PGP parser seem like a fit for this module? Please post any comments you may have about the parser as an issue on the PGP project.
#36
Wow, database.edited.inc_.txt looks really really impressive. For me this seems to be a great base for api module, it supports everything needed and much more than api's current parser does. We really need to handle this until D7 goes beta or something, the sooner the better. So for me using this pgp parser looks like the way forward. I don't think the rewriting part is needed for api.module though. And subscribe.
#37
Next release is soon, a change this big can't make it.
#38
The attached patch integrates the API module with the PGP module to parse source code files and produce documentation items.
To test the integration:
All columns have the same text with two exceptions:
To verify these results, you will need to download and install the latest PGP module committed after this comment and apply a simple patch to make it compatible with 6.x (as only a 7.x-dev release exists). The updated module and patch will be posted shortly.
In the process of integrating PGP, I found a few instances where the regular expressions used in API are opportunistic as far as what they expect to encounter in the text (and do not work in certain edge cases). I have not changed these but would expect to deal with them through the standard issue queue workflow.
PGP will automatically provide documentation items for undocumented code elements (functions, constants and globals). Also, PGP is able to provide the needed documentation items for classes and interfaces (I left commented out lines with the tokens related to same in the patch). To incorporate these items, what database schema changes are planned (for elements like scope modifiers)?
#39
About the exceptions:
* api_format_documentation_lists() should not have documentation. Its doc comment did not start with
/**until I committed a fix a couple minutes ago.* Globals should only be picked up when they are on their own, not within functions or classes. Working doc comments are at http://cvs.drupal.org/viewvc.py/drupal/contributions/docs/developer/glob....
* This parser doesn't need to worry about @ingroup or other tags, that is the next stage's job.
More notes:
* Freeing up memory is not necessary unless there are serious memory usage issues. I always recommend running cron via PHP CLI with Drush or drupal.sh, so it can have a sufficiently high memory allocation.
* Freeing memory would be be less of a concern if global variables were not used. Globals should be avoided.
* I'm guessing there may be future plans that make more sense, but PGPEditor looks like it just implements some singleton functions.
* To reiterate, I would like to see this PHP parser be separate from the doc comment parser. The "This needs to be reworked into a robust Doxygen-style comment parser." comment in pgp.editor.inc suggests these two stages of parsing might be intermingled. Any bugs in the doc comment parser should be filed in separate issues.
#40
Thanks for the review.
About your remarks:
* The PGP parser is not doing anything with @ingroup or other documentation tags (these are being handled in api/parser.inc).
* A robust Doxygen-style comment parser is needed somewhere (as the api module relies on opportunistic parsing that fails in places as evidenced by the display errors on api.drupal.org) with (as you write) its own issues.
* I freed up memory because I was testing it in real time from PHP CLI (but outside of cron, Drush or drupal.sh) and my PHP memory usage limit (128M) was being hit (granted I could have upped the setting).
* Regarding globals: $editor could easily be passed to the two functions using it; $nested_groups could be as well, but $docblocks would require a bit more care.
#41
Okay, I guess we will have to pay attention to memory usage, that is a lot.
I fully agree that our doxygen parser sucks, but one thing at a time. It does decently for now.
Keep up the good work. PGP will need a good Drupal 6 release since this will be running in production on Drupal 6.
#42
For the parsing of @tags, if my recollections from crawling through phpdoc's guts are accurate, it should be pretty easy to borrow from/build on their general approaches.
Given my interest in the @tags...and the fact that I pretty much completely neglected to do any of the help on this project I was hoping to last year...I'll offer to do the relevant research & coding to improve our @tag handling.
#43
sdboyer - please open a separate issue for the doc comment parser itself; it certainly needs some help. This issue is only for PHP code parsing.
#44
Any progress on this?
#45
Sorry, I've done basically no coding on my own time in the last month. No real good reason, just code block. I'm finally pushing through. I've got some other stuff that needs to be written over the next few days, but unless I'm a complete douche, I'll file that new issue early next week.
#46
+1
#47
Parsing only form.inc from Drupal 7 uses just under 326k with chx's parse script. This will of course go up a little in an actual implementation.
#48
I have been cleaning up parser.inc in preparation for the new parser. It now has 11.9% less code, removing a few duplicated sections. Next is to separate out doc comment parsing from the PHP parser, so there is a clean area to work around.
I think everything is still working, but it would have been good to add unit tests. Adding code to tests/sample and testing for the parsed code in simpletests would be a great contribution.
#49
Just a comment: I do not think that Drupal 7 should be released without this issue being fixed on api.drupal.org. As things stand now, as I'm sure you are aware, there is no presence at all on a.d.o for the now many classes that are key to Drupal 7 functionality, and it really cuts into the usefulness of api.drupal.org.
For instance, I was trying to figure out yesterday what had happened to the search API between D6 and D7, in order to refactor a patch that no longer applied after the DBTNG (whatever that stands for) changes. My patch added args to the search_simplify() function, and api.drupal.org currently shows in D7 that it was only being called during search indexing, not during searching. WTF? Can't be true... (you need to preprocess search terms the same when you index them as when you are searching).
So I had to turn on my grep-through-the-code fu, and finanally found out the reason is that it's called inside a PHP class instead of a normal function, and it's simply not there on api.drupal.org.
Just one example... Anyway, in my opinion, we should not release D7 without api.drupal.org having a way to look at docs for classes.
#50
solotandem is out of town for the next week or so, but he has re-factor the parser for a number of reasons, but I think we should move forward on this.
1) Replace current api module PHP parsing with parser module.
2) Design class pages and data structure.
3) Add additional integration to parse the new structures.
#51
Does "parser module mean pgp? Does "re-factor the parser" mean re-factor pgp? Is there an issue to track this re-factoring.
Most of the recent commits to parser.inc have been separating PHP and doc comment parsing. Any doc comment parsing is moving into api_save_documentation(). The remaining PHP parser should be relatively clean to replace.
#52
If anyone is looking to help, unit tests would be incredibly useful here.
#53
I think the refactoring phase of this development is done. I probably won't go ahead with swapping in a parser today, pending a status update on pgp.