I have created a file migration class that appears to work, except that most of the destination files are corrupted, and many of the corrupted ones are renamed with "small-" prepended to the filename. A few make it through unscathed.

The corrupted ones, whether images or PDFs, are turned into 5,012-byte text files containing:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml">

    <title>Page not found | www.spjdc.org</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="shortcut icon" href="/sites/spjdc.org/files/spj_garland_favicon.ico" type="image/x-icon" />
    <link type="text/css" rel="stylesheet" media="all" href="/sites/default/files/css/css_5e40096dc67e4dcc3537190ff6616236.css" />
<link type="text/css" rel="stylesheet" media="print" href="/sites/default/files/css/css_c9488a49c43be6c2ed81036bc7d3cd86.css" />
    <script type="text/javascript" src="/sites/all/modules/jquery_update/replace/jquery/1.3/jquery.min.js?G"></script>
<script type="text/javascript" src="/misc/drupal.js?G"></script>
<script type="text/javascript" src="/sites/all/modules/google_analytics/googleanalytics.js?G"></script>
<script type="text/javascript" src="/sites/all/modules/img_assist/img_assist.js?G"></script>
<script type="text/javascript" src="/sites/default/files/jstimer/timer.js?G"></script>
<script type="text/javascript" src="/sites/all/modules/nice_menus/superfish/js/superfish.js?G"></script>
<script type="text/javascript" src="/sites/all/modules/nice_menus/superfish/js/jquery.bgiframe.min.js?G"></script>
<script type="text/javascript" src="/sites/all/modules/nice_menus/superfish/js/jquery.hoverIntent.minified.js?G"></script>
<script type="text/javascript" src="/sites/all/modules/nice_menus/nice_menus.js?G"></script>
<script type="text/javascript" src="/sites/all/modules/og/og.js?G"></script>
<script type="text/javascript">
<!--//--><![CDATA[//><!--
jQuery.extend(Drupal.settings, { "basePath": "/", "googleanalytics": { "trackOutbound": 1, "trackMailto": 1, "trackDownload": 1, "trackDownloadExtensions": "7z|aac|arc|arj|asf|asx|avi|bin|csv|doc|exe|flv|gif|gz|gzip|hqx|jar|jpe?g|js|mp(2|3|4|e?g)|mov(ie)?|msi|msp|pdf|phps|png|ppt|qtm?|ra(m|r)?|sea|sit|tar|tgz|torrent|txt|wav|wma|wmv|wpd|xls|xml|z|zip" }, "nice_menus_options": { "delay": 800, "speed": 1 } });
//--><!]]>
</script>
    <!--[if lt IE 7]>
      <link type="text/css" rel="stylesheet" media="all" href="/sites/all/themes/spj_garland/fix-ie.css" />    <![endif]-->

    <script type="text/javascript">

      var _gaq = _gaq || [];
      _gaq.push(['_setAccount', 'UA-16563412-1']);
      _gaq.push(['_trackPageview']);

      (function() {
        var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
        ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
        var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
      })();

    </script>

  </head>
  <body><script type="text/javascript">
//<![CDATA[
new Image().src = "/cdn-cgi/ping?cf[location]=404&cf[js]=1";
//]]>
</script>
<noscript>
<img src="/cdn-cgi/ping?cf[location]=404&amp;cf[js]=0" alt=""/>
</noscript>

<!-- Layout -->
  <div id="header-region" class="clear-block"></div>

    <div id="wrapper">
    <div id="container" class="clear-block">

      <div id="header">
        <div id="logo-floater">
        <h1><a href="/" title=""><img src="/sites/spjdc.org/files/spj_garland_logo.png" alt="" id="logo" /></a></h1>        </div>

                          
      </div> <!-- /header -->

      
      <div id="center"><div id="squeeze"><div class="right-corner"><div class="left-corner">
                                        <h2>Page not found</h2>                                                  <div class="clear-block">
            The requested page could not be found.          </div>
              <div id="footer"><div id="block-block-2" class="clear-block block block-block">


  <div class="content"><!--paging_filter--><div>
	<hr />
	<center>
		<strong>&copy;<a href="http://www.spjdc.org/copyright"> 2008 - 2013, D.C. Professional Chapter, S.P.J. All rights reserved.</a></strong></center>
</div>
</div>
</div>
</div>
                   

      </div></div></div></div> <!-- /.left-corner, /.right-corner, /#squeeze, /#center -->

      
    </div> <!-- /container -->

  </div>
<!-- /layout -->

  <script type="text/javascript">
<!--//--><![CDATA[//><!--
var _gaq = _gaq || [];_gaq.push(["_setAccount", "UA-16563412-1"]);_gaq.push(['_setCustomVar', 1, "User roles", "anonymous user", 1]);_gaq.push(["_trackPageview", "/404.html?page=" + document.location.pathname + document.location.search + "&from=" + document.referrer]);(function() {var ga = document.createElement("script");ga.type = "text/javascript";ga.async = true;ga.src = ("https:" == document.location.protocol ? "https://ssl" : "http://www") + ".google-analytics.com/ga.js";var s = document.getElementsByTagName("script")[0];s.parentNode.insertBefore(ga, s);})();
//--><!]]>
</script>
  </body>
</html>

This is obviously being pulled from my page.tpl.php and style.css files. One possible cause is that the only way I have been able to make the migration work is to set source_dir to "http://www.spjdc.org/sites/default/files". Any attempt to use 'sites/default/files' or an absolute path ends with no files being copied and error messages saying so.

I have turned off all media-related modules, ImageMagick, etc., but nothing fixes this problem. My only apparent solution is to copy the filesystem into the new site and run the migration with FILE_EXISTS_REUSE.

Has anyone else run across this problem?

Comments

simon.westyn

I usually use the file_unmanaged_save_data() function first to transfer the files to the local machine:

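// Fetch the remote file; file_get_contents() returns FALSE on failure.
$file = @file_get_contents('http://url/pictures/' . $filename . '.' . $extension);
if ($file !== FALSE) {
  // Save the data locally, then pass the real path on to the file field.
  $filedump = file_unmanaged_save_data($file, $dest, FILE_EXISTS_REPLACE);
  $filepath = drupal_realpath($filedump);
  $current_row->FILE = $filepath;
  $current_row->FILEDEST = $filedump;
}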

and then pass $filepath to the destination file field, which will save the file in the folder you've entered in the field properties. Afterward, in the complete() function, I delete the temp file.

function complete($entity, stdClass $row) {
        if (isset($row->FILEDEST)) {
            unlink($row->FILEDEST);
        }
}

If you're having problems with specific extensions, maybe you could check your destination field properties for allowed extensions?
Oh, and one more thing: a while ago I had a problem with big binary files from the source DB getting corrupted. Then I found out that I needed this line of code to fix it:
ini_set("odbc.defaultlrl", "20480K");
Note that this was using an ODBC connection, not MySQL.

Hope this helps and good luck!

mikeryan

Category: bug » support
Status: Active » Postponed (maintainer needs more info)

Buried in the middle you'll see "Page not found". So it appears the URLs you're using don't resolve to actual files, and the server is returning this HTML rather than a 404 HTTP code - the PHP copy() function happily just copies the result to a local file. You need to check the URLs that are being generated and figure out the right pattern for your files. To see exactly what URLs Migrate is putting together based on the options to your migration, you can instrument MigrateFileUri::copyFile() and look at $this->sourcePath.
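To see how many destination files were replaced by that page, the copied files can be scanned for ones that are really HTML. This is only a minimal sketch in plain PHP with no Drupal API; the function names looks_like_html() and find_html_impostors() and the directory path in the usage line are hypothetical, not part of Migrate:

```php
<?php
// Sketch: detect copied files that are really an HTML page (such as a
// "Page not found" response) rather than the expected image or PDF.

/**
 * Returns TRUE if the leading bytes look like an HTML document.
 */
function looks_like_html($bytes) {
  // Drop a UTF-8 byte order mark and leading whitespace, then compare
  // case-insensitively against the usual ways an HTML page starts.
  $head = strtolower(ltrim(substr($bytes, 0, 512), "\xEF\xBB\xBF \t\r\n"));
  return strpos($head, '<!doctype html') === 0 || strpos($head, '<html') === 0;
}

/**
 * Lists files under $dir (recursively) whose contents look like HTML.
 */
function find_html_impostors($dir) {
  $found = array();
  $iterator = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
  );
  foreach ($iterator as $file) {
    if (!$file->isFile()) {
      continue;
    }
    // Only the first 512 bytes are needed for the check.
    $head = file_get_contents($file->getPathname(), FALSE, NULL, 0, 512);
    if ($head !== FALSE && looks_like_html($head)) {
      $found[] = $file->getPathname();
    }
  }
  sort($found);
  return $found;
}
```

For example, print_r(find_html_impostors('/path/to/sites/default/files')); would list the suspect files. Legitimate .html uploads would be flagged too, so the result is a starting point for checking, not a delete list.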

rsbecker

When I use a relative or absolute server path, it is clear that the files are found and Migrate is attempting to copy them to the right places. But I get an error message saying the file could not be copied, and no record is created in the field's table. When I use the domain, i.e. http://www.example.com, the record is created, but I get the mangled files.

The only way I have been able to make it work is by copying files and then using FILE_EXISTS_REUSE and preserve_files TRUE.

mikeryan

Don't forget to reset the status to "active" when you reply - anything in the "postponed (maintainer needs more info)" status is at the bottom of the priority list, because presumably no more info has been provided.

The file is not "mangled" - it is an HTML page saying "Page not found". The question is, why is the source site returning that page to Migrate instead of the desired file? The first thing to do is to print $this->sourcePath in MigrateFileUri::copyFile() and see exactly what URL Migrate is trying to use. If it looks like it should work, try it directly in a browser and see what you get.
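The same check can also be made from PHP, which is closer to what Migrate does than a browser is. What follows is a minimal sketch, not Migrate code: summarize_response() and probe_url() are hypothetical helper names, and it assumes allow_url_fopen is enabled and that the $http_response_header variable is available in your PHP version.

```php
<?php
// Sketch: fetch a source URL through PHP's HTTP stream wrapper and report
// the status code and Content-Type instead of saving the body.

/**
 * Extracts the final status code and Content-Type from raw header lines.
 */
function summarize_response(array $header_lines) {
  $summary = array('status' => NULL, 'content_type' => NULL);
  foreach ($header_lines as $line) {
    if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
      // Each status line starts a new response (e.g. after a redirect),
      // so reset and keep only the last one.
      $summary = array('status' => (int) $m[1], 'content_type' => NULL);
    }
    elseif (stripos($line, 'Content-Type:') === 0) {
      $summary['content_type'] = trim(substr($line, strlen('Content-Type:')));
    }
  }
  return $summary;
}

/**
 * Requests $url and returns its status, Content-Type and body size.
 */
function probe_url($url) {
  // ignore_errors makes the wrapper return the body even for 4xx/5xx
  // responses, so an error page can still be inspected.
  $context = stream_context_create(array('http' => array('ignore_errors' => TRUE)));
  $body = @file_get_contents($url, FALSE, $context);
  // The HTTP wrapper fills $http_response_header in the calling scope.
  $headers = isset($http_response_header) ? $http_response_header : array();
  $summary = summarize_response($headers);
  $summary['bytes'] = ($body === FALSE) ? 0 : strlen($body);
  return $summary;
}
```

Calling print_r(probe_url($url)) with one of the URLs printed from $this->sourcePath shows what the server actually sends back. A 404 status, or a Content-Type of text/html for something that should be an image or PDF, points to a wrong URL pattern or to the server refusing the request.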

I'm now reminded that I did have a mysterious circumstance like this where the URLs were clearly valid and worked fine in the browser, but failed in Migrate. It turned out there was a .htaccess rule to prevent just such web scraping; it would be worth checking for that as well.

mikeryan

Status: Postponed (maintainer needs more info) » Closed (cannot reproduce)

No further response.

mikeryan

Issue summary: View changes

Added info