Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
Problem/Motivation
The FileCopy plugin currently (barely) supports remote file sources, using copy(). It's a dirty hack; we should replace it with a plugin dedicated to remote file processing, that uses the Drupal API (read: Guzzle).
Proposed resolution
- Create a
Download
process plugin to handle remote files through the Drupal API. - Remove remote file handling capabilities from the FileCopy process plugin, so it can use
file_unmanaged_copy()
instead ofcopy()
.
The plugin should support a similar configuration structure to FileCopy. Something like:
uri:
plugin: download
source: source_relative_path
guzzle_options:
base_uri: http://example.com/
# anything else that goes into Guzzlehttp\Client::__construct()
Remaining tasks
Write patch
User interface changes
n/a
API changes
Yes, if you were using FileCopy with a remote file source, you need to now use the RemoteFile process plugin in your migration yml.
Data model changes
n/a
Comment | File | Size | Author |
---|---|---|---|
#57 | followup-2783075-57.patch | 696 bytes | claudiu.cristea |
#50 | 2783075-50.patch | 20.79 KB | claudiu.cristea |
Comments
Comment #2
chx CreditAttribution: chx at Smartsheet commentedComment #3
chx CreditAttribution: chx at Smartsheet commentedComment #4
ohthehugemanatee CreditAttribution: ohthehugemanatee at Amazee Labs commentedHere's a first stab at a Download plugin and tests. My local environment is borked for filesystem-related tests, so let's see how the testbot does with it.
This patch can't be completed until #2695297: Refactor EntityFile and use process plugins instead gets in, because the FileCopy plugin won't exist in core until then... so there's nowhere to remove copy() from.
Comment #6
chx CreditAttribution: chx at Smartsheet commentedI will admit I am not particularly well versed in Guzzle API. I thought you needed to pass the options to the Client constructor but I see request also has options. Could anyone find documentation what each of these are ?
Comment #7
chx CreditAttribution: chx at Smartsheet commentedAlso http://stackoverflow.com/a/16947245/308851
Comment #8
ohthehugemanatee CreditAttribution: ohthehugemanatee commentedYou can pass the options to the constructor and get a response in a single call, or build the request in steps. Agreed that it's better to stream directly to the fs, though.
Also, what's the right way to get the mocked http_client to the plugin? Do I have to mock the whole service container?
Comment #9
chx CreditAttribution: chx at Smartsheet commentedNo, you instantiate the class direct with a mock ClientInterface -- you need to fix the typehint while at it.
Comment #10
ohthehugemanatee CreditAttribution: ohthehugemanatee commentedHm. This version streams the file directly to the filesystem... but there's less to test. Since the actual file write is handled by Guzzle, it's outside of the SUT. So I really only have to test that the file is created with the right name.
I feel like this test is more or less immaterial. The only (little bit) of complexity in the plugin is deciding if the directory is writable, and that's the part we don't test at all. We're basically just testing a wrapper around file_destination() here. Thoughts?
Comment #11
ohthehugemanatee CreditAttribution: ohthehugemanatee commentedComment #12
chx CreditAttribution: chx at Smartsheet commentedIf you manage to get away with not writing a test please file an issue to remove every source plugin unit test. We need your magic for those too.
Comment #13
ohthehugemanatee CreditAttribution: ohthehugemanatee commentedThe only way I can see to add more test coverage here is to duplicate tests on Guzzle, file_destination, or file_prepare_directory . Simple plugin with simple tests. Honestly the tests that we DO have are already more duplicative of file_destination than I would like.
This is ready for human review/testing.
Comment #14
phenaproximaI like this approach and would like to see it in Migrate, and I'd like to see a new process plugin be created to delegate the dirty work of file transfer to either the existing file_copy plugin or this new download plugin, depending on whether the input value is local or remote. Assigning to myself for review.
Comment #15
phenaproximaThis makes it sound as though the plugin supports local files, which it does not. Can it be rephrased?
Wrong plugin ID.
Missing description. Also, if possible it would preferable to type hint to an interface here, if one exists, rather than a concrete class.
Before we do this, can we call
parse_url($source)
to validate that it's a real URL, and throw an \InvalidArgumentException or MigrateException if it's not? Just for sanity checking purposes.Not a big deal, but I would rather use sprintf() here instead of embedding method calls directly in the string.
This is lifted right out of file_copy. I'm not sure if this would be too far out of scope, but IMHO this should be inherited from a new base class that both FileCopy and this class can extend.
Same here.
I'd like to see tests added that verify the correct exceptions are thrown if the destination directory is not writeable.
Missing description.
Missing doc comment, and there's an extra empty line.
This method does not throw MigrateException.
It would be useful to test that the Download plugin handles HTTP failures as expected.
Comment #16
chx CreditAttribution: chx at Smartsheet commented@phenaproxima are you still working on this?
Comment #17
phenaproxima@chx: Yes, I still want this to get into Migrate. I don't want to actually code the patch, though, so that I can maintain RTBC privileges...unless you're willing to OK it once it's ready?
Comment #18
chx CreditAttribution: chx at Smartsheet commentedSure I will review. I was only asking because you assigned it to yourself. Please, finish it, I will review.
Comment #19
phenaproximaOkay, this oughta pass the tests! All of my comments are either addressed, or otherwise obviated.
Comment #20
chx CreditAttribution: chx at Smartsheet commentedI wouldn't use the long syntax these days for a brand new file. [] is your friend.
Oh sweet summer child, you didn't think PHP has a gentle error passing way just like that? http://php.net/manual/en/function.fopen.php
I would use a ?: for getOverwriteMode and perhaps remove the method entirely. Is there a reason
is not working?
Why do we need to realpath the directory...? I thought mkdir supports streams https://api.drupal.org/api/drupal/core%21lib%21Drupal%21Core%21StreamWra...
Yay for guzzle_options. However,
$this->httpClient->get
is ought to work. https://github.com/guzzle/guzzle/blob/master/docs/quickstart.rstThis will break your test.
http://php.net/parse_url
Why on earth are we validating the URL anyways? The real litmus test will be done by Guzzle: if it downloads, it's valid if it doesn't who cares whether because it's 404 or because someone tried to download a nonURL?
Comment #21
phenaproxima@chx: this patch addresses all of your good points.
Regarding the error triggered by fopen() -- I discovered this while trying to write a test of a non-writable destination URI, which with core's PHPUnit configuration causes the test to fail, and rightfully so. That's why there's no explicit test of fopen() failing.
I'm marking this patch for review so that the tests will run, but even if it looks good, we're not quite done here -- I still need to remove remote support from FileCopy.
Comment #22
chx CreditAttribution: chx at Smartsheet commented> That's why there's no explicit test of fopen() failing.
No, the problem is the lack of @
Comment #23
phenaproximaI didn't understand why we wanted to suppress fopen errors, but looking at it again I think I do. We want to stop PHP's error handling so we can try to prepare the directory first before we fail with an exception. Seems legit. Fixed and added a test of that.
Comment #24
chx CreditAttribution: chx at Smartsheet commentedWonderful, wonderful. Now on to removing the hack from the FileCopy plugin before we release 8.2.0. with that in it.
Comment #25
phenaproximaThis oughta do 'er. Also did a bit of cleanup/renaming -- CopyFileTest should have always been FileCopyTest.
Comment #26
mikeryanThis regresses the migrations currently using file_copy - they will only work with local files. Thus, you would no longer be able to enter http://www.example.com/ for the source_base_path in migrate_drupal_ui. There needs to be some means to handle both - either migrate_drupal_ui swapping file_copy for download when it sees an HTTP url, or a wrapper process plugin that will select one or the other as needed.
Comment #27
chx CreditAttribution: chx at Smartsheet commentedThis is why the ur-issue introducing this hack should never have been committed in the first place: it was cleanly a hack and that hack has been repeatedly used to justify this capability and now it's called a regression now that we clean up. It was bad enough to not block FileCopy on this issue; we won't block unhacking this finally on UI problems. There's always the next commit.
Comment #28
phenaproximaI concur with @mikeryan. We should maintain backwards compatibility and accommodate naive migrations that try to feed a remote URL into FileCopy. It shouldn't be too difficult to analyze the input to FileCopy and have it invoke the download plugin if it the input does not represent a local resource.
Comment #29
chx CreditAttribution: chx at Smartsheet commentedFollowup.
Comment #30
phenaproximaThis isn't RC eligible, so it's never going to get into 8.2.x. Rerolled against 8.3.x.
Comment #31
phenaproximaThere's no need for this patch to break BC, or to wait for a follow-up issue to implement a BC layer. I added one, with a test.
Comment #34
phenaproximaComment #35
phenaproximaDiscussed briefly with @alexpott, and we think it's OK to commit this as a BC breaker, with a change record.
Comment #36
phenaproximaFixing them failures.
Comment #37
chx CreditAttribution: chx at Smartsheet commentedhttp://docs.guzzlephp.org/en/latest/request-options.html#sink doesn't document whether Guzzle closes the sink or not. I am really wary of this because if it doesn't then downloading lots of files will end in tears. Please investigate -- while I am unaware of a simple boolean check whether a stream resource is closed or not, you can check its type:
Edit: to be clear I do not mean to change the patch to include a get_resource_type call but to temporarily add one and if it's not closed then add an fclose to the codebase. Or just add the fclose() and run a test and see whether you get a warning of trying to close an already closed stream.
Comment #38
heddnI don't think this actually *breaks* BC. It just gives us some additional options and can be added in a non-BC breaking manner. Removing tag. It can be re-added if you guys disagree.
Comment #39
phenaproxima@chx: I tested this manually after adding an fclose() call after Guzzle does its thing. I can confirm that fclose() produces a warning: "supplied resource is not a valid stream resource". I take that to mean it has closed the sink after downloading.
Comment #40
claudiu.cristeaThis is a very nice feature and patch. I tested it on a real project and works very well. I see that @chx's concern from #37 has a reply in #39.
I don't understand why this cannot go in 8.2.x as we are not yet in beta stability.
Comment #41
claudiu.cristea.
Comment #42
alexpottShould we return here or should we still do the file_exists() check - because after this it should exist - right?
Are there any BC implications of making this change? According to the IS there are. We need a CR to tell people exactly what to do.
Comment #43
claudiu.cristeaYes, I think it make sense to test the existence first.
No, the "BC break" was removed in #31.
Comment #44
phenaproximaWait, what? Won't putting the file_exists() check before the isLocalUri() call break BC like a twig? If the incoming value is a remote URL, file_exists() will return FALSE and will throw an exception before the download plugin can be invoked. I don't see how #43 maintains BC...?
Comment #45
claudiu.cristea@phenaproxima, quoting from http://php.net/manual/function.file-exists.php
Comment #46
phenaproxima"Some URL wrappers" is not reassuring. Given that custom stream wrappers are a thing, FileCopy has no way of knowing if the return value from file_exists() will be accurate.
It is definitely not a BC break if we do the file_exists() check after the isLocalUri() check; on the other hand, if file_exists() come first, the level of BC breakage depends on the whims of the stream wrapper(s) involved, and how the user's PHP interpreter is configured. I can see it causing many hard-to-trace, use case-specific bugs, and as far as I can tell there is no need for us to go down that path.
Can we restore it back to the way it was? What do we gain by moving the file_exists() check? If leaving it after isLocalUri() means we have to write a change record, I'm happy to do that and keep it robust and clear, rather than introduce potentially unpredictable WTFs.
Comment #47
mikeryanIf http were one of the stream wrappers file_exists() works with, we wouldn't have needed any of this - the file_copy plugin would just use file_unmanaged_copy() and it would just work.
This does make me think, however, that there is a bit of a BC break here - the download plugin is HTTP-specific, so if someone previously was using, say, ftp://, that won't work any more. We haven't documented or shown any examples of anything other than http and local files, however, so that might not be a concern.
Comment #48
phenaproxima@mikeryan: That seems like a somewhat different issue to me. It means we'd have to turn the download plugin into a delegator (one for Guzzle, one for FTP, one for SCP, one for carrier pigeon, etc.), but that's outside the scope of this issue.
Comment #50
claudiu.cristeaHm,
file_exists()
doesn't work because HTTP & HTTPS does not supportstat()
(see http://php.net/manual/en/wrappers.http.php).Reuploading the patch from #36 and setting back to RTBC as @alexpott's question was answered in #43. Quoting:
Sorry for creating confusion.
Comment #51
alexpottAll I meant is can we remove the return? Because after calling this the file will exist right?
Comment #52
phenaproximaI don't think we should remove the return, because the download plugin might not ultimately decide to save the downloaded data to the same place that FileCopy would have. At the moment it does, but FileCopy should not be making assumptions about Download's inner workings. Once it delegates the transformation to the download plugin, that should be where it ends for FileCopy.
Comment #53
alexpottCommitted and pushed 17c15fb to 8.3.x and 1598399 to 8.2.x. Thanks!
As migrate is still experimental committing to both branches.
Comment #56
neclimdulFileTestBase is not a unit test so this is not a unit test. phpunit --testsuite=unit is broken as it requires a database after this commit.
Comment #57
claudiu.cristea@neclimdul, that's right
Comment #58
claudiu.cristeaComment #59
phenaproximaPreemptively re-RTBC on the assumption that Drupal CI will pass it.
Comment #60
neclimdul+1. Manually tested and all is well. Thanks for the quick follow up.
Comment #61
alexpottCommitted and pushed 9e822c0 to 8.3.x and b22f95a to 8.2.x. Thanks!
Opening a new issue would have been better tbh - this just pollutes the history because now there are two commits with the same title :(