The first pass of the PHP parser is to chunk up code into objects such as functions, constants and globals. Currently, this uses regular expressions and depends on things like functions starting with /^f/ and ending with the first /^}$/.

Instead, PHP's tokenizer should be used. This will fix the issues and pave the way for documenting object-oriented code. The API should be standardized so other file types, like CSS and JS, can be parsed in the future.
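As a rough sketch of what "using the tokenizer" means (illustrative only, not the module's actual code), token_get_all() gives each significant token a type constant, its text, and a line number, so no regular expressions are needed to find boundaries:

```php
<?php
// Illustrative only: the tokenizer reports exactly where a doc comment
// or function begins, instead of matching /^f/ or a brace in column one.
$source = "<?php\n/** Doc. */\nfunction example() {}\n";

$names = array();
foreach (token_get_all($source) as $token) {
  if (is_array($token)) {
    list($id, $text, $line) = $token;
    $names[] = token_name($id) . ' on line ' . $line;
  }
}
echo implode("\n", $names), "\n";
```

The output lists T_OPEN_TAG, T_DOC_COMMENT, T_FUNCTION, and so on, each with its line number.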

Unit tests will be required.

Comments

drumm’s picture

Component: Code » Parser
sdboyer’s picture

AWESOME! This is really great. I'd had the Reflection API in mind myself, but aside from the php4/5 question, I haven't really investigated the pros/cons of one vs. the other.

drumm’s picture

The Reflection API and PHP 5 are great; however, Reflection requires including the code, which would not be an option for documenting all of contrib.

sun’s picture

The registry contains some code in registry.inc that could work as a starting point. It already scans files for objects and functions. Add the other required tokens of interest, i.e. T_COMMENT, T_DOC_COMMENT.

The coder_format script shipped with the Coder module might also be worth looking at. It is capable of reformatting PHP files so they match Drupal's coding standards (style).

sdboyer’s picture

I floated a thought to drumm in irc today; reposting it here as requested. Caveat: I don't understand how a token-based parser does/would/could work; my knowledge base is doing documentation using Reflection. drumm and I have talked about this before, and we basically ended up agreeing that both approaches would probably be useful. The big problem with a Reflection-based approach is the potential security hole: since the files have to be require'd in before their contents can be reflected upon, it means that anything in global scope in those files will be executed. That's fine if you trust the code, but if we're doing, say, cron-based runs of everything in Drupal CVS...problems.

But that's old news. The new stuff is a way of potentially getting around that security problem: instead of just requiring in the files to be read for documentation purposes, we manipulate the file first, THEN parse it in and go. Specifically, we read in the file, suck out everything that's in global scope and drop it into a dead-space function. We then write the file back out (unless we can just read it from memory without things getting REALLY hacky) with the dead-space function in it, THEN parse in that temporary file, and begin our Reflecting.

I'll also email either php-general or php-internals about this and see what I get back, aside from flames.

Crell’s picture

The problem with the approach in #5 is that if you suck out anything non-global, you also lose anything inside a class, viz. methods. We have a growing number of classes that need documenting.

Keeping the string in memory should be easy in PHP 5 thanks to the memory:// stream wrapper.

Has anyone taken the time to look at the phpDocumentor project and how they handle it? Their syntax is close enough anyway that we could probably leverage much or all of that codebase, and get much better standardization with the rest of the PHP world in the process.

sdboyer’s picture

The problem with the approach in #5 is that if you suck out anything non-global, you also lose anything inside a class, viz. methods.

I think you misread what I said? The suggestion was to suck out globals, not suck out non-globals. I think I may see the problem you're alluding to, but...well, could you clarify?

phpDocumentor is token-based, not Reflection-based. A lot of what they do could be re-used for a token-based approach, but is basically irrelevant to a Reflection-based approach. Arbitrary code execution is only an issue for Reflection-based systems (at least so far as we've thought it through), so there's not much to offer in that vein. But, most importantly!

Keeping the string in memory should be easy in PHP 5 thanks to the memory:// stream wrapper.

This sparked a thought I had previously had, but that I only just started investigating now, and yep: include() can take stream arguments other than conventional files. I need to investigate this more, but it seems like there are some very interesting potential options there...
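A quick, hedged check of that idea: code round-trips through php://memory intact. Whether include() will then accept a given stream wrapper depends on configuration (e.g. allow_url_include for some wrappers), so treat this as a sketch, not a recipe:

```php
<?php
// Write generated PHP code into an in-memory stream and read it back.
// Whether include() accepts such a stream depends on PHP configuration,
// so this only demonstrates the round trip, not the include itself.
$code = "<?php function generated_example() { return 42; }\n";

$handle = fopen('php://memory', 'r+');
fwrite($handle, $code);
rewind($handle);

$roundtrip = stream_get_contents($handle);
fclose($handle);

echo $roundtrip;
```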

Crell’s picture

I'm curious why we want to use reflection so badly? Remember we also want to keep global-level comment blocks. And that still doesn't solve the problem of having to parse all that code into memory, from which it cannot be removed. You're going to have to either break it up into a separate PHP process for each file or buy an awful lot of RAM. :-)

drumm’s picture

Reflection would be a convenient way of saving a lot of work, since it comes with getDocComment(). However, it simply won't pick up globals or constants.

The tokenizer would basically be a state machine that iterates over the token array, keeping track of nesting level and looking for where things start & stop. I recently committed a small use of the PHP tokenizer in api_page_function(). It draws http://delocalizedham.com/node/10.
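A minimal sketch of that state machine (hypothetical code, not what api_page_function() actually does): iterate the token array, track curly-brace nesting depth, and pair each top-level function with the T_DOC_COMMENT that precedes it.

```php
<?php
// Hypothetical sketch: curly braces from string interpolation
// (T_CURLY_OPEN etc.) are ignored here for brevity.
$source = <<<'CODE'
<?php
/**
 * Says hello.
 */
function hello() {
  if ($greet) { echo "hi"; }
}
CODE;

$tokens = token_get_all($source);
$depth = 0;
$doc = NULL;
$pending = FALSE;
$found = array();
foreach ($tokens as $token) {
  if (is_array($token)) {
    list($id, $text) = $token;
    if ($id === T_DOC_COMMENT && $depth === 0) {
      $doc = $text;            // remember the most recent docblock
    }
    elseif ($id === T_FUNCTION && $depth === 0) {
      $pending = TRUE;         // the next T_STRING is the function name
    }
    elseif ($id === T_STRING && $pending) {
      $found[$text] = $doc;    // function name => its doc comment
      $pending = FALSE;
      $doc = NULL;
    }
  }
  elseif ($token === '{') {
    $depth++;                  // entering a nested scope
  }
  elseif ($token === '}') {
    $depth--;                  // leaving it
  }
}
print_r(array_keys($found));
```

Nested braces (the inner if block) don't confuse it the way the /^}$/ regex does.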

sdboyer’s picture

@drumm:

Sorry, I don't quite follow what the idea is in that blog post you linked to.

It's becoming increasingly clear that some amount of manipulation of the files will be necessary if Reflection is to be used at all; said manipulation will almost certainly mean use of the tokenizer via any number of possible approaches. Which leads me to think that the best approach will probably end up being a mix of the two.

@Crell:

As drumm said: Reflection, because it can cut out a _lot_ of work. Also, Reflection, because extending it for new custom @tags is trivial. And maybe finally, Reflection, because it's really worth not reinventing the wheel and using PHP's own internal parser when and where possible.

I don't see a reason why we need to force the API module to run in a mass loop over everything. There are some potential complexities and dependency issues, sure; but assuming we have a good underlying data storage schema for it, we can either a) batch it into parts then split it over cronjobs if needed, and/or b) save hash values of parsed files and only re-run when hashes differ.
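Option (b) above could be as simple as this hedged sketch (api_needs_reparse() is a made-up name for illustration, not part of the API module):

```php
<?php
// Keep a hash per parsed file and only re-parse when the hash changes.
function api_needs_reparse($path, array &$hashes) {
  $hash = hash('sha256', file_get_contents($path));
  if (isset($hashes[$path]) && $hashes[$path] === $hash) {
    return FALSE; // file is unchanged since the last cron run
  }
  $hashes[$path] = $hash;
  return TRUE;
}

$path = tempnam(sys_get_temp_dir(), 'api');
file_put_contents($path, "<?php function a() {}\n");

$hashes = array();
var_dump(api_needs_reparse($path, $hashes)); // bool(true): first sighting
var_dump(api_needs_reparse($path, $hashes)); // bool(false): hash matches
```

In practice the hashes would live in the database rather than an array.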

drumm’s picture

Things which will be fixed by this issue. There should be unit tests for all.

* Heredoc with '}' in the first column ending a function.
* Globals recognized in strings: #319978: $user is display twice in globals page (api.drupal.org)
* No objects or things inside of objects.

bobthecow’s picture

I have a (mostly) working PHP tokenizer based parser for the API module. It is currently running the Zoop Framework API site (when the site is up... That server is a bit sick right now).

I wrote it based on the 5.x API module, but am about to port it to the 6.x branch. It currently handles classes, class methods and member variables. It does run into problems with duplicate function names in the global namespace, but it is pretty solid otherwise.

When I port it to 6.x I'd like to tackle #101308: Add commenting on API objects too, because I've been itching for that (and views support) for a while.

drumm’s picture

Version: master » 6.x-1.x-dev
Priority: Normal » Critical

This is top priority now that there is a stable 6.x release.

The way I envision this working is that the tokenizer pass splits code into an array of source code (constants, globals, functions, objects) and associated documentation comments.
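For concreteness, the envisioned output might look like this (a guess at the shape from the description above, not the module's actual data structure):

```php
<?php
// Hypothetical shape of the first-pass result: one entry per chunk of
// source code, paired with its documentation comment and start line.
$chunks = array(
  array(
    'documentation comment' => "/**\n * A sample function.\n */",
    'code' => "function sample() {\n}",
    'start line' => 10,
  ),
);
echo $chunks[0]['start line'], "\n";
```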

sdboyer’s picture

So...I feel the need to apologize here.

I've been up and down through the various Reflection-based tools that are available, and I'm increasingly getting to be of the opinion that Reflection itself doesn't have that much to offer us. It does seem clear to me that there's a LOT from the phpdoc code base we could be borrowing (including a boatload of regex, iirc from my scan through it a month ago), but Reflection itself isn't really going to provide that much value-added. If we want to make it easy to add new @tag definitions, Reflection can help a bit, but it really doesn't seem to be the bee's knees in the way that I've been suggesting.

So, my apologies, but good luck with the reworking :)

Shiny’s picture

I've forked this project to add classes+methods (more regex), but now I'm trying the tokenizer method.
Is anyone else working on this?

bobthecow’s picture

yep. I've got a working classes+methods tokenizer based API module for 5.x, i'm in the middle of porting it to 6.x...

Shiny’s picture

@bobthecow can you share a git repo of your D5 version? I'm keen to use it this week and help in any way; I'm setting up this api module for another open-source project.

Shiny’s picture

@bobthecow can you share your D5 version? I can port it to 6 for you.

Damien Tournoud’s picture

FileSize
32.36 KB

Here is a first attempt at making a clean tokenizer-based parser.

The parser itself works quite well, but several key features are still missing (but shouldn't be that hard to implement):
- code highlight (should be easy, we already have a decomposed form)
- group management
- promote classes to first-class citizens in api.module
- globals, constants and member variables (should be fairly easy, too)

drumm’s picture

Whoa... that's a bit more object-oriented than I was expecting. I was just hoping for a simple loop over token_get_all() that returns array(array('documentation comment' => ..., 'code' => '...', 'start line' => ...), ...). I am cautiously optimistic about this much engineering.

I might not be able to review this in-detail for a few days, especially with this much code. More sample code can be added to tests/sample/ and tests can check that the parsing functions are working.

Damien Tournoud’s picture

Status: Active » Needs work

I was just hoping for a simple loop over token_get_all() that returns array(array('documentation comment' => ..., 'code' => '...', 'start line' => ...), ...).

I was too, at first. But the FSM that is required to implement this properly quickly overflowed my cognitive power. Seriously. So this is a divide and conquer approach.

More sample code can be added to tests/sample/ and tests can check that the parsing functions are working.

This is still a work in progress, but the whole D7 code base is now parsed without error, which is encouraging.

chx’s picture

FileSize
2.95 KB

I also have some token-based code from some earlier adventures of mine. It's nowhere near complete, but by copying it over to parser.inc (quite a bit of which needs to be salvaged, because the tokenizer won't tokenize the docblocks further for us) I believe it can be a) finalized, b) much closer to what drumm wanted. However, for functions, this is a bit nicer. Feel free to disregard this piece or to usher me to finish it :)

drumm’s picture

Yes, we have to improve/rebuild/find another doc block parser. That should wait for phase 2.

bobthecow’s picture

Sounds like there is a lot of interest in getting this module updated. I've set up a github repo here, where I'll be working on my changes. It would be great if you all could join me :)

The initial commit is straight out of CVS HEAD.

Shiny’s picture

@bobthecow - I can't track that github repo because it only has one commit;
try forking from http://github.com/Br3nda/drupal-module-api/tree/master

I'll try applying your code to that.

chx’s picture

FileSize
5.92 KB
solotandem’s picture

The new grammar parser (possibly with a plug-in to parse the Doxygen comment block items) will suffice for parsing and providing the information needed for this project.

I am willing to integrate the new parser into this project and could start on this any time. Does it make sense to add me as a co-maintainer?

Damien Tournoud’s picture

@solotandem: #19 is already fairly advanced research toward a PHP grammar parser. It supports blocks, functions and classes. It could be easily extended to support expressions and function calls too. Take a look at it; it could help you kick-start your own project.

solotandem’s picture

@Damien: Thanks for the suggestion. However, the grammar parser currently supports all but eight of the PHP tokens. Let me know if it could be put to good use on this project. (Note: the grammar parser code has not been committed yet.)

chx’s picture

Damien, though #26 is not integrated with api.module yet, it is enough for what api.module needs, and drumm asked for it because your code is a bit too much.

solotandem, you need to talk to drumm. If it supports classes and interfaces (mine does not) then it's great! We need that. I wanted to continue with those but never happened.

We have a doxygen parser in the api module; replacing that was never the target of this issue.

solotandem’s picture

chx: Yes it supports classes and interfaces.

boombatower’s picture

subscribe

I saw this issue and thought that solotandem's parser was a perfect fit.

Sounds like the integration and replacement of the old parser will take a bit of work. Does it make sense for solotandem to become a co-maintainer, instead of creating a giant patch or lots of little ones?

chx’s picture

Where is this parser to look at?

drumm’s picture

The doc comment parser needs a lot of work and should be tackled in a separate issue; it will probably end up with a full rewrite. There are a number of small problems in other issues, such as @see basically not working. Let's keep this issue to the first pass, finding blocks of code and related doc comments. (The third pass is adding links at render.)

Chx's parser in #26 is currently preferred since it does what is needed, without extra complexity. Where is the plan that shows how PGP will move forward?

PGP looks like it is designed around saving outputted files. I am trying to move in the opposite direction, not relying on the file system and even removing the need to have a checkout. It looks like the Version Control API family of modules is getting or has the ability to read from CVS, so we no longer need to manage CVS checkout outside of Drupal. The target for all the intermediate parsing is the database so it can be queried.

solotandem’s picture

chx: the PGP reader and writer classes are committed to CVS (no release yet).

drumm: the PGP API is not file based. The module GUI supports reading from and writing to files, but the API does not know anything about the file system. The reader parses the source code string passed to it. This string could be anything from an expression to an entire file. The writer produces a source code string from the grammar statements (in an array) passed to it. Other code (apart from the PGP API classes) has to get the source code snippet to parse and do something with the source code string returned. The grammar array and source code string can be saved in a database record, written to file, etc.

PGP will have three main classes:
- PGPReader to parse the tokens into grammar statements
- PGPEditor to edit the statements
- PGPWriter to produce a source code string from the edited statements.

I have attached the following files (with a .txt extension added):
- The source file "database.inc" is an excerpt from a recent version of includes/database.inc.
- The grammar file "database.edited.inc" is a modified grammar array structure produced by the PGP parser.
- The file "structure.html" with notes on the grammar array structure in the edited file.

The "edited" file was produced by:
- parsing the source file using PGPReader.
- editing the grammar statements using (a pre-commit version of) PGPEditor to remove the function body "statements" and add a "text" element with the original function text (minus its doc comment).
- writing the text element using PGPWriter to reproduce the text of the function from the grammar statements.

I have left in the array structure the PHP open tag, Drupal ID tag, and whitespace tags, but these are easily ignored.

Future plans for PGP include convenience functions to produce:
- lists of classes, functions, variables, constants, etc.
- the information included in the "edited" file above

The "edited" file includes information I would expect the API documentation module to need. What other information is needed?

Does the PGP parser seem like a fit for this module? Please post any comments you may have about the parser as an issue on the PGP project.

Frando’s picture

Wow, database.edited.inc_.txt looks really, really impressive. To me this seems to be a great base for the api module: it supports everything needed and much more than api's current parser does. We really need to handle this before D7 goes beta or something; the sooner the better. So for me, using this pgp parser looks like the way forward. I don't think the rewriting part is needed for api.module, though. And subscribe.

drumm’s picture

Priority: Critical » Normal

Next release is soon, a change this big can't make it.

solotandem’s picture

The attached patch integrates the API module with the PGP module to parse source code files and produce documentation items.

To test the integration:

  • I created documentation for the following files using the existing code and the patch code:
    • includes/common.inc
    • includes/locale.inc
    • api
  • I saved the api tables created from each set of code.
  • I compared the columns in the two sets of tables.

All columns have the same text, with three exceptions:

  • The existing code did not create an item for the function api_format_documentation_lists.
  • The patch code included items for 5 uncommented globals. (The existing code does not locate them as they are undocumented. I am not sure whether they should be included or not.)
  • The patch code included "@ingroup themeable" in the documentation column for the theme_locale_languages_overview_form item.

To verify these results, you will need to download and install the latest PGP module committed after this comment and apply a simple patch to make it compatible with 6.x (as only a 7.x-dev release exists). The updated module and patch will be posted shortly.

In the process of integrating PGP, I found a few instances where the regular expressions used in API are opportunistic as far as what they expect to encounter in the text (and do not work in certain edge cases). I have not changed these but would expect to deal with them through the standard issue queue workflow.

PGP will automatically provide documentation items for undocumented code elements (functions, constants and globals). Also, PGP is able to provide the needed documentation items for classes and interfaces (I left commented out lines with the tokens related to same in the patch). To incorporate these items, what database schema changes are planned (for elements like scope modifiers)?

drumm’s picture

About the exceptions:
* api_format_documentation_lists() should not have documentation. Its doc comment did not start with /** until I committed a fix a couple minutes ago.
* Globals should only be picked up when they are on their own, not within functions or classes. Working doc comments are at http://cvs.drupal.org/viewvc.py/drupal/contributions/docs/developer/glob....
* This parser doesn't need to worry about @ingroup or other tags, that is the next stage's job.

More notes:
* Freeing up memory is not necessary unless there are serious memory usage issues. I always recommend running cron via PHP CLI with Drush or drupal.sh, so it can have a sufficiently high memory allocation.
* Freeing memory would be less of a concern if global variables were not used. Globals should be avoided.
* I'm guessing there may be future plans that make more sense, but PGPEditor looks like it just implements some singleton functions.
* To reiterate, I would like to see this PHP parser be separate from the doc comment parser. The "This needs to be reworked into a robust Doxygen-style comment parser." comment in pgp.editor.inc suggests these two stages of parsing might be intermingled. Any bugs in the doc comment parser should be filed in separate issues.

solotandem’s picture

Thanks for the review.

About your remarks:
* The PGP parser is not doing anything with @ingroup or other documentation tags (these are being handled in api/parser.inc).
* A robust Doxygen-style comment parser is needed somewhere (as the api module relies on opportunistic parsing that fails in places as evidenced by the display errors on api.drupal.org) with (as you write) its own issues.
* I freed up memory because I was testing it in real time from PHP CLI (but outside of cron, Drush or drupal.sh) and my PHP memory usage limit (128M) was being hit (granted I could have upped the setting).
* Regarding globals: $editor could easily be passed to the two functions using it; $nested_groups could be as well, but $docblocks would require a bit more care.

drumm’s picture

Okay, I guess we will have to pay attention to memory usage, that is a lot.

I fully agree that our doxygen parser sucks, but one thing at a time. It does decently for now.

Keep up the good work. PGP will need a good Drupal 6 release since this will be running in production on Drupal 6.

sdboyer’s picture

For the parsing of @tags, if my recollections from crawling through phpdoc's guts are accurate, it should be pretty easy to borrow from/build on their general approaches.

Given my interest in the @tags...and the fact that I pretty much completely neglected to do any of the help on this project I was hoping to last year...I'll offer to do the relevant research & coding to improve our @tag handling.

drumm’s picture

sdboyer - please open a separate issue for the doc comment parser itself; it certainly needs some help. This issue is only for PHP code parsing.

drumm’s picture

Priority: Normal » Critical

Any progress on this?

sdboyer’s picture

Sorry, I've done basically no coding on my own time in the last month. No real good reason, just code block. I'm finally pushing through. I've got some other stuff that needs to be written over the next few days, but unless I'm a complete douche, I'll file that new issue early next week.

Carl Johan’s picture

+1

drumm’s picture

Parsing only form.inc from Drupal 7 uses just under 326k with chx's parse script. This will of course go up a little in an actual implementation.

drumm’s picture

I have been cleaning up parser.inc in preparation for the new parser. It now has 11.9% less code, removing a few duplicated sections. Next is to separate out doc comment parsing from the PHP parser, so there is a clean area to work around.

I think everything is still working, but it would have been good to add unit tests. Adding code to tests/sample and testing for the parsed code in simpletests would be a great contribution.

jhodgdon’s picture

Just a comment: I do not think that Drupal 7 should be released without this issue being fixed on api.drupal.org. As things stand now, as I'm sure you are aware, there is no presence at all on a.d.o for the now many classes that are key to Drupal 7 functionality, and it really cuts into the usefulness of api.drupal.org.

For instance, I was trying to figure out yesterday what had happened to the search API between D6 and D7, in order to refactor a patch that no longer applied after the DBTNG (whatever that stands for) changes. My patch added args to the search_simplify() function, and api.drupal.org currently shows in D7 that it was only being called during search indexing, not during searching. WTF? Can't be true... (you need to preprocess search terms the same when you index them as when you are searching).

So I had to turn on my grep-through-the-code fu, and finally found out the reason: it's called inside a PHP class instead of a normal function, and it's simply not there on api.drupal.org.

Just one example... Anyway, in my opinion, we should not release D7 without api.drupal.org having a way to look at docs for classes.

boombatower’s picture

solotandem is out of town for the next week or so, but he has re-factored the parser for a number of reasons. I think we should move forward on this:

1) Replace current api module PHP parsing with parser module.
2) Design class pages and data structure.
3) Add additional integration to parse the new structures.

drumm’s picture

Does "parser module" mean pgp? Does "re-factor the parser" mean re-factor pgp? Is there an issue to track this re-factoring?

Most of the recent commits to parser.inc have been separating PHP and doc comment parsing. Any doc comment parsing is moving into api_save_documentation(). The remaining PHP parser should be relatively clean to replace.

drumm’s picture

If anyone is looking to help, unit tests would be incredibly useful here.

drumm’s picture

I think the refactoring phase of this development is done. I probably won't go ahead with swapping in a parser today, pending a status update on pgp.

Frando’s picture

bump. Is there any news about this? Is PGP usable for api.module by now? We really need to tackle this prior to a D7 release and hopefully also prior to betas, as otherwise all our classes are completely undocumented on api.drupal.org, which is quite fatal.

Is there a plan on how to proceed? Which parts need help?

solotandem’s picture

PGP is usable for api.module. However, to handle interfaces and classes, aren't some changes needed to the api module tables and the corresponding arrays of values in the code? Let me know and I will post a patch to integrate this module with the grammar parser module.

Frando’s picture

Yes, the internal structure of api.module will have to be extended to support classes and interfaces.

And we want to have crosslinking from interfaces to classes that implement it and of course crosslinking in class hierarchies.

webchick’s picture

Now that 7.x alpha1 is out, this is even more critical, since it actively blocks people from porting their modules. Any chance someone could chisel out some time to work on this /very/ soon?

chx’s picture

OK so this issue is cursed. Theoretically PGP could work with API. In reality it's not been rerolled or pushed for months and tonight when I wanted to do something useful I wasted that time with debating whether to use it or not. Great.

chx’s picture

And again why it's just Jim and me? There is never any help not even for the most important things that really everyone wants. No, we are doing Hitler and Drupal videos instead, so fucking lots easier!

chx’s picture

We discussed with Neil that, for simplicity, we will store things like Database::getConnection as a function name. A further enhancement would be handling return types: automatically linking the interface indicated in the doxygen.

All this, however will not solve all the problems. Just with a static code analyzer it's rather tricky to figure out where the doxygen of a given method lies as we document at interfaces not in the classes themselves. But no we squarely refuse to add a simple marker to the method names which would help...

aismail’s picture

The heavy-lifting (a.k.a. parsing) doesn't need to be done on the Drupal side. We could use an external tool for that which is good at what it does.

I'd suggest using a third-party parser, written in whatever language, such as the one from http://www.phpcompiler.org/. The existing ones will have numerous unit tests and the maintenance work wouldn't be on the Drupal side, which has limited interest in investing work in such a project (api.drupal.org being one of the main uses).

Usually these kinds of parsers output an Abstract Syntax Tree (AST), which is a tree representation of one or more source files. Functions, variables and classes are all nodes in that tree, together with the associated comments (if the parser doesn't ignore them) and with information like line number & column (where the given node appears). What api.drupal.org needs is a pretty basic traversal of that AST.
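A hedged sketch of such a traversal, using a toy nested-array AST (a real tool like phc would emit far richer nodes; all names here are invented):

```php
<?php
// Toy AST: each node has a type, an optional name/line, and children.
$ast = array(
  'type' => 'file',
  'children' => array(
    array('type' => 'function', 'name' => 'foo', 'line' => 3, 'children' => array()),
    array('type' => 'class', 'name' => 'Bar', 'line' => 8, 'children' => array(
      array('type' => 'method', 'name' => 'baz', 'line' => 10, 'children' => array()),
    )),
  ),
);

// The "pretty basic traversal": a depth-first walk over named nodes,
// which is where api.module would insert database records instead.
function walk(array $node, $depth = 0) {
  if (isset($node['name'])) {
    printf("%s%s %s (line %d)\n",
      str_repeat('  ', $depth), $node['type'], $node['name'], $node['line']);
  }
  foreach ($node['children'] as $child) {
    walk($child, $depth + 1);
  }
}
walk($ast);
```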

Now, I see 2 approaches, depending on the language in which the parser is written and on its export abilities:
1) Export the AST to a file (by calling an external process), which is then used as input for a Drupal module which re-creates the AST structure in memory and walks it
2) Write the traversal in the same language as the parser, linking them in a big executable. The output should then be exported to a standardized format (such as xml, CSV), simple enough for a Drupal module to parse and insert into the schema of API module.

2) is probably simpler to write (since re-creating the AST in memory based on its export might be tedious) but 1) keeps all the Drupal-specific code in a Drupal module. Opinions? If it was my choice, I'd probably choose 1).

chx’s picture

FYI: phc can dump the AST in an XML file. However, we lose comments, so I am not 100% sure this is a viable solution after all.

chx’s picture

Another candidate is http://www.eclipse.org/pdt/articles/ast/PHP_AST.html but we need someone with Java skills to write a little command line tool so we can run with it.

CorniI’s picture

Hi,
I got working installations of both phc (dataflow branch) and roadsend rphp.
Both dumps show the XML output of cache-install.inc.
For rphp, I had to remove the type-hinted array, as type hints aren't supported yet in the grammar. rphp is also currently unsuitable for the given task, because its AST dumper and parser are too incomplete to parse all of Drupal. This might look different in half a year, but well...
phc warns that its line numbers may be a bit off, so this needs to be hand-checked. This is a bug in their parser that they (pbiggar) don't know the reason for.
@#61: I'd probably implement everything as an AST visitor in the host language and fill the database from that visitor; this should be the most straightforward. The API module would then just be responsible for presenting the database contents. This would also scale quite well, because the memory consumption of a good parser should be much lower than it is for PHP files (require'd or not).
Corni

boombatower’s picture

Or we use the existing parser that can be controlled from php and provides everything we need...? All the alternatives require another method of importing the data into PHP and maintaining/interfacing with an external language/library.

Crell’s picture

If the existing parser provided everything we needed, this issue wouldn't be here. We don't need to custom write everything if existing systems can do it for us.

boombatower’s picture

Clarification, as the many comments above point to PGP, the project on Drupal.org I have been referring to. It just needs a re-roll of the patch, which solotandem is working on, but he is running into issues with the way the API module is designed, as it seems to do parsing on every page load, which is rather wasteful. He is trying to come up with a simple way to integrate without, for the time being, redesigning it to do all the parsing up front.

My suggestion would be to serialize whatever information is displayed but doesn't need to be stored in a structured manner for searching or whatever; parsing (however limited) is not what we want to do on every page load.

solotandem’s picture

I will post a re-rolled patch by tomorrow. I have the integration code re-rolled and am comparing the results of using the existing API module code to that with the Grammar Parser as the parsing engine. A little bit of due diligence.

drumm’s picture

For PGP - what have you found for its memory usage? Is that still a problem? Is it running without patching on Drupal 6?

The patch I'm expecting is a relatively straightforward replacement of api_parse_php_file()'s implementation. This is all done at parse time, anything API does on page load shouldn't be part of this issue.

solotandem’s picture

The attached patch integrates the API module with the Grammar Parser module to parse source code files and produce documentation items. I added a constant to allow easy switching between the existing code and the integrated code to compare the documentation results. This would be eliminated in a final patch.

Further integration is possible. For example, in api_format_php(), the call to token_get_all is redundant and could be eliminated as the Grammar Parser already made this call and has this information stored in the grammar objects.
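For reference, token_get_all() is PHP's built-in tokenizer, the call being described as redundant here. A small illustration of what it returns (the sample source is made up):

```php
<?php
// Tokenize a small snippet with PHP's built-in tokenizer.
$source = "<?php\n/** Doc comment. */\nfunction example() {}\n";
$names = array();
foreach (token_get_all($source) as $token) {
  if (is_array($token)) {
    // Array tokens are [token id, source text, starting line].
    list($id, $text, $line) = $token;
    $names[] = token_name($id);
  }
  else {
    // One-character tokens such as '{' and '}' arrive as plain strings.
    $names[] = $token;
  }
}
print implode("\n", $names) . "\n";
```

The T_DOC_COMMENT and T_FUNCTION tokens in that stream are what the parser keys on when building documentation items.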

I tested the integration in the same fashion as described in #38 and with the same observations. In addition:

  • The existing code does not include @file comments.
  • The patch code includes items for the class functions in the pgp module. (I realize the current code does not include these as this is closely related to the topic of this issue.)
  • The patch code does not store the line numbers. (They can be included in a later patch.)

To verify these results, download and install the latest PGP release created after this comment and apply a simple patch to make it compatible with 6.x (as only a 7.x-dev release exists).

This patch treats the class functions the same as the non-class functions, as the details of what to store remain to be fleshed out. In the function signature, I dropped the word 'function' and kept the scope modifiers. On display, the existing code drops the function name and only shows the scope modifier. Again, I left this until the details are worked out. We need to decide how to handle:

  • scope modifiers
  • extended and implemented classes

drumm: The memory usage grows as more modules are parsed in a single process. If the documentation is run in batches (using the queue as it is now), the memory requirements are not an issue. All of the above-mentioned files ran in one process that used about 75 MB of memory. The parser engine is not Drupal-dependent or Drupal-specific. The user interface elements of the parser module are, of course, Drupal-version-specific. The attached patch is a replacement of the api_parse_php_file() routine.

drumm’s picture

Quite a lot of refactoring has happened in the API module since #38, to make this replacement more straightforward. I don't think #70 has fully caught up. An incomplete list of examples:
- Literal '@' characters are no longer matched directly. The API_RE_TAG_START constant uses a negative lookbehind assertion to make sure the match is not '\@', allowing @ to be escaped.
- Nested groups are handled in api_save_documentation(), adding api_documentation_nested_group() is unnecessary.
- In the added api_documentation_item(), @param, @return, and @see are parsed, this is also redundant with api_save_documentation().
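The negative-lookbehind technique mentioned above can be illustrated as follows (the pattern and constant name here are illustrative; the real API_RE_TAG_START value lives in the API module):

```php
<?php
// Illustrative only: a negative lookbehind keeps '\@' from matching,
// so a literal @ can be escaped in doc comments while @tags still match.
define('TAG_START_RE', '/(?<!\\\\)@(\w+)/');
$doc = 'Use \@tag literally, but honor @param and @return.';
preg_match_all(TAG_START_RE, $doc, $matches);
print implode(', ', $matches[1]) . "\n"; // param, return
```

The `(?<!\\)` assertion rejects any '@' immediately preceded by a backslash, without consuming the preceding character.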

This patch should not need to change any of the @tag handling, which I have moved out of api_parse_php_file(). I think all the separation of PHP parsing and doc comment parsing is already done. More work is probably needed for doc comment parsing, but not now and not in this issue.

Similarly, I'm not concerned about reimplementing api_format_php() yet; it is doing okay for now. And yes, more work will be needed for classes after this; I don't expect one patch to solve everything. I would rather verify that files, functions, constants, and globals work okay with a new parser and proceed from there.

I looked into d.o's PHP memory configuration, and it is already quite high, probably for some project* processing. That should not be an issue as long as nothing is leaking memory like a sieve. The queue will run as many files as it can in 2.5 minutes.

I've been running API.d.o on released versions of the API module, and the whole deployment system works a bit better with released versions. Unfortunately, with the redesign and complexity of d.o, it will be a while before API is running on Drupal 7. It would be good to have a stable 6.x release of PGP.

Otherwise, the interesting part of this patch seems relatively straightforward, which is what I am looking for. I wouldn't say I'm committed to PGP quite yet. If the implementation is straightforward and the project is well-supported, then it should be a good way to go.

solotandem’s picture

Thank you for reviewing the patch and posting your comments.

Regarding the refactoring since #38:
- I agree that where the patch code used the existing code's doc comment parsing, it should have been updated to use the refactored code with the API_RE_TAG_START constant, etc.
- Separating doc comment parsing and nested groups from the PHP parsing makes sense, but placing the doc comment parsing inside the api_save_documentation() routine seems misplaced. Shouldn't there be a separate api_parse_doc_comment() routine, apart from api_save_documentation()? Leaving this parsing code in the api_documentation_item() routine seemed preferable to the refactored existing code (one can certainly argue that parsing the PHP file includes the doc comment, until someone sees fit to refactor the latter into a separate routine). You refer to this code as redundant, but the patch commented out this parsing in api_save_documentation() because it seems ill-placed there.

Regarding PGP:
- The absence of a 1.0 release is no indication the code is not stable. (As a personal practice, I do not commit code to the dev release that is knowingly unstable. I also expect to make an official 1.0 release soon.)
- The patch to convert the module to 6.x is sufficient for this purpose. This patch makes no changes to the parser engine, only the user interface elements which the API module will not use anyway. (That said, if the lack of a 6.x release is a show-stopper, I can make a 6.x release.)
- What is the basis for questioning whether the project is well-supported? It is a key element in radically changing the whole dynamic and timing of contributed module updates, in conjunction with Coder Upgrade, not to mention other uses such as core module updates and security reviews.

drumm’s picture

FileSize
5.75 KB

At some point, api_save_documentation() could be split up so that @tag parsing and saving are separate. This issue doesn't deal with @tag parsing; that can be tackled elsewhere. I'm happy with it all being in one place right now.

When this is all ready to go on api.d.o, a 6.x release would be great. Drupal.org is going to be 6.x for awhile, and I like running released versions.

I've been reshuffling code where I can during a DrupalCamp, with the goal of simplicity and a small change. The patch is quite small now and shows that PGP can get this done cleanly. I have:
- removed any changes to @tag parsing
- rolled functions that are called once, the ones that build docblock arrays, into the calling code
- removed globals

The attached patch is quite broken; I haven't had a chance to actually test it, but I thought I would post it in case someone wants to jump in.

A few questions:
- What is the plan for line numbers?
- I noticed Grammar Parser has functions like ->getFunctions() and ->traverseCallback(). Could/should these be used instead of our own loop?

solotandem’s picture

The attached patch follows the approach used in #73, but corrects the broken items. The two main corrections are reverting the $default_block code (the patch in #73 essentially relates all of the documentation items to a single file) and restoring the T_FUNCTION case to the switch statement at the end of api_documentation_loop().

For class methods, the patch stores class_name::function_name as the object_name in the api_documentation table. However, because the API module does its own parsing of function calls straight from the source code, the information included in the parser objects is lost when determining references. Because function names can collide among classes, the code in api_parse_function_calls() and api_format_php() needs to change. The references stored in the api_reference_storage table include bogus references among the namespace collisions, e.g. __construct.
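The collision problem can be sketched in a few lines (the class and method names are made up):

```php
<?php
// Sketch: qualifying method names by class avoids collisions such as
// two classes both defining __construct().
$methods = array(
  'Foo::__construct',
  'Bar::__construct',
);

// With bare names, both entries would collide on '__construct'.
$bare = array();
foreach ($methods as $name) {
  $parts = explode('::', $name);
  $bare[] = end($parts);
}

print count(array_unique($methods)) . ' vs ' . count(array_unique($bare)) . "\n"; // 2 vs 1
```

Reference resolution has to keep the class qualifier, or two distinct methods become one bogus reference.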

To implement the documentation of classes, it seems to me we need 1) a game plan for storing and displaying the information, and 2) most likely, to take further advantage of the grammar parser and what it knows about the contents of the file.

Also, for anyone interested in testing this, there is a change to PGP that requires a fresh download of the module and reapplication of its D6 patch.

drumm’s picture

Looks like this is improving. I'm traveling again this weekend, but I should be able to find time to do another round, maybe commit on this.

I'm fine with handling the UI and DB changes. I'll try to keep those simple, but the UI could definitely use a little refactoring in the future.

For the two other uses of PGP, references and PHP formatting, I would like to save those for a subsequent patch so this stays manageable.

What was the plan for line numbers?

Thanks again for working on this.

drumm’s picture

All Simpletests now pass in API HEAD, so that is something to look for. Right now the coverage is really basic, just a file with a function to test some linking. I'll look into adding lower-level tests that verify {api_documentation} and related tables are filled properly from the test branch.

drumm’s picture

FileSize
5.62 KB

Here is my version of the patch; it passes all tests, so the basics of files and functions are working.

solotandem’s picture

The attached patch includes line numbers and restores the group and mainpage blocks.

In my comparison tests, it appears the current API module:

  • does not pick up file documentation comments
  • shows the line number as one greater than the actual line number

drumm’s picture

Status: Needs work » Fixed

Committed a version of this patch. I did have to special case the object_name of globals. The name, used in URLs and such, should not have a $, but the object_title should have a $.

One thing to keep in mind is that this will be run from cron, so output will be emailed somewhere. "assign_variable = $sample_global in globalToString" happens now; normally nothing should be printed or echoed. Next, classes and interfaces need the parsing nailed down, need to be stored, and need a UI.
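The special case for globals mentioned above can be sketched like this (the variable names are illustrative, not the module's actual code):

```php
<?php
// Sketch: globals are stored without the leading '$' in object_name
// (used in URLs) but keep it in object_title (used for display).
$global = '$sample_global';
$object_name = ltrim($global, '$');
$object_title = $global;
print $object_name . ' / ' . $object_title . "\n"; // sample_global / $sample_global
```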

drumm’s picture

The followup to make sure we get classes parsed and stored: #710774: Ensure all class-related items are parsed

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.