
Importing Huge CSV File to MySQL via PHP

Yes, there are several techniques for importing CSV into MySQL. I have encountered this task several times before and those previous experiences worked out seamlessly, but this one is a bit more complicated and requires a lot of effort to accomplish.

Available Solutions

phpMyAdmin

Yes, there is phpMyAdmin. It has a tool that lets me import a CSV file online, but the problem is that uploads are limited to 2MB. I could raise that limit a little, but I am going to import millions of records from files of up to 400MB, and the import would likely hit PHP's maximum execution time anyway.
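
A quick way to see which limits are in play is to read them back from PHP. This is only a diagnostic sketch; the actual values depend entirely on the host's configuration:

//the settings that cap phpMyAdmin uploads and long-running imports
echo 'upload_max_filesize: ', ini_get('upload_max_filesize'), '<br />';
echo 'post_max_size: ', ini_get('post_max_size'), '<br />';
echo 'max_execution_time: ', ini_get('max_execution_time'), '<br />';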

MySQL LOAD DATA LOCAL INFILE

Another option would be to upload the CSV file and use SSH to run MySQL commands that import it. That should be the safest way to do it, but the problem is that the filenames and table names contain Japanese characters. Take this table name for example: T005_棚番M. To use that command, I would simply type in the terminal:

LOAD DATA LOCAL INFILE '/dbo_T005_棚番M.txt' 
INTO TABLE T005_棚番M
FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n' 
(field1, field2, field3);

But that turns out to be very difficult. I don't know why, but I can't type those characters in the terminal, and a simple copy and paste won't work either. Another problem is that the file encoding is Shift_JIS while the database is UTF-8 (this application was migrated from ASP+MSSQL to PHP+MySQL). Imagine how complicated that gets across my more than 20 tables.

PHP’s fgetcsv

I could write a script that parses the CSV file with fgetcsv and inserts the rows into the database. That would be the closest fit. However, I ran into several problems. First, the file's encoding is Shift_JIS but the database is UTF-8, so I need to convert the records line by line before inserting them. Worse, fgetcsv does not seem to parse the Shift_JIS file properly: it does not separate the fields correctly.
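
For reference, this is roughly the naive fgetcsv loop I tried first (a sketch only, using a hypothetical local copy of one file); on this Shift_JIS data the fields simply did not come out separated correctly:

$handle = fopen('dbo_T005_1004.txt', 'r'); //hypothetical local copy
while (($row = fgetcsv($handle, 4096, ',')) !== false)
{
	$converted = array();
	foreach ($row as $field)
	{
		//convert each field after the (unreliable) byte-based split
		$converted[] = mb_convert_encoding($field, 'UTF-8', 'sjis-win');
	}
	//... build and run the INSERT from $converted here
}
fclose($handle);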

The Solution

So I settled on a solution of my own: a script combining all the resources I found on the internet. First of all, I am going to rename my files, since PHP does not seem to recognize filenames containing Japanese characters.

Here is what the declaration looks like in PHP:

	$fileNames = array(
					'dbo_D99_CMD.txt'					=> 'D99_CMD',
					'dbo_D99_Tmp.txt'					=> 'D99_Tmp',
					'dbo_dtproperties.txt'				=> 'dtproperties',
					'dbo_sysconstraints.txt'			=> 'sysconstraints',
					'dbo_syssegments.txt'				=> 'syssegments',
					'dbo_T001_1001.txt'					=> 'T001_コミックM',
					'dbo_T002_1002.txt'					=> 'T002_メーカーG',
					'dbo_T004_1003.txt'					=> 'T004_ジャンルM',
					'dbo_T005_1004.txt'					=> 'T005_棚番M',
					'dbo_T006_1005.txt'					=> 'T006_アーティスト',
					'dbo_T007_1006.txt'					=> 'T007_古本発注在庫M',
					'dbo_T008_1007.txt'					=> 'T008_旧新刊発注在庫M',
					'dbo_T009_1008.txt'					=> 'T009_棚番位置M',
					'dbo_T104_1009.txt'					=> 'T104_在庫M',
					'dbo_T201_1010.txt'					=> 'T201_店舗M',
					'dbo_T202_1011.txt'					=> 'T202_卸業M',
					'dbo_T203_1012.txt'					=> 'T203_個人M',
					'dbo_T900_1013.txt'					=> 'T900_発注ログ',
					'dbo_T901_1014.txt'					=> 'T901_セットM',
					'dbo_TA001_1015.txt'				=> 'TA001_コミックG',
					'dbo_TY001_1016.txt'				=> 'TY001_新刊予約M',
					'dbo_TY002_1017.txt'				=> 'TY002_新刊予約在庫M',
					'dbo_X_4672.txt'					=> 'X_4672',
					'dbo_Z001_1018.txt'					=> 'Z001_メールM',
					'dbo_Z999_1019.txt'					=> 'Z999_定数M'
					);

The keys are the filenames and the values are the table names. The next step is opening the file (in read mode) to get a file handle.

$root = dirname(__FILE__);
$file = fopen($root . '/data20090331/' . $fileName, 'r');

If I need to clear the existing content, I will simply execute this:

$del = $db->sql_query("delete from `$tableName`");

The task is not that simple, but to keep things clear, let me start with the basics. I am going to read the file line by line and parse each line to separate the fields. But before parsing a line, I convert it to UTF-8 first.

$buffer = fgets($file, 4096);
$buffer = trim(mb_convert_encoding($buffer, 'utf-8', 'sjis-win'), "\n");
$data = parseCsvLine($buffer);

And the code for parseCsvLine looks like this:

	function parseCsvLine($line)
	{	
		// splits lines with a comma but ignores a comma wrapped with double quotes
		$csv = preg_split('/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/', $line);
		$ret = array();
		
		foreach ($csv as $c)
		{
			if (preg_match('/\"/', $c))
			{
				$ret[] = trim($c, '"');
			}
			else
			{
				$ret[] = $c;
			}
		}
		
		return $ret;
	}

It splits the line into an array, and double quotes are stripped from quoted values. The next step is inserting the row into the database like this:

		$csv = array();
		foreach ($data as $d)
		{
			$csv[] = mysql_real_escape_string($d);
		}
		$sql = "INSERT INTO `$tableName` VALUES('" . implode("','", $csv) . "')";
		//echo htmlspecialchars($sql), '<br /><br /><br />';
		$result = $db->sql_query($sql);
		if (!$result)
		{
			echo htmlspecialchars($sql), '<br /><br /><br />';
			print_r($db->sql_error());
			exit;
		}

Of course I need to escape the data before inserting it. But as I said, it is not that simple. There are several limitations that I need to consider.

Limitations

Some files are small, some are too large

Some files are small enough that the script above is sufficient. But there are also very large files, from 33MB up to 400MB, which run into limits such as PHP's maximum execution time and the maximum upload size (which I can't modify on this host). Therefore I need to create a few batches: all the small files go into the first batch, and the rest are imported individually.
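
Just to illustrate the idea, here is a rough sketch of how the grouping could be automated with filesize(); the 5MB threshold is an arbitrary assumption, and $fileNames is the array declared above (in practice I grouped the files by hand):

$threshold = 5 * 1024 * 1024; //arbitrary 5MB cut-off
$smallBatch = array();
$largeFiles = array();
foreach ($fileNames as $csvFile => $tableName)
{
	$path = dirname(__FILE__) . '/data20090331/' . $csvFile;
	if (is_file($path) && filesize($path) <= $threshold)
	{
		$smallBatch[$csvFile] = $tableName; //everything small goes into the first batch
	}
	else
	{
		$largeFiles[$csvFile] = $tableName; //each of these is imported on its own
	}
}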

Inserting rows line by line is very slow on large files

Inserting rows one at a time is indeed very slow in MySQL. The simple solution is to accumulate rows, say 600 of them, and insert them with a single INSERT statement. It looks like this:

INSERT INTO $tableName VALUES
(val1, val2, val3),
(val1, val2, val3),
...
up to 600 rows
...
(val1, val2, val3);

Some CSV files (large or small) are problematic

Some CSV files are problematic. There are files where a few fields are slightly corrupted (although they are easy to correct manually). If I am in the middle of the import process (this actually happened to me) and the script suddenly terminates because one of the columns is corrupted, I have to edit the file and upload it again. On a 50MB CSV file, can you imagine how hard it is to find the exact line with the corrupted column? After correcting it, I have to re-upload the file, which usually takes 5 to 10 minutes because of the connection speed.
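
One small improvement I could have made is to track the current line number while reading, so that a failed INSERT at least reports where to look. A rough sketch, as a fragment of the same read loop (assuming the same $file handle and $fileName as elsewhere in the script):

$lineNo = 0;
while (!feof($file))
{
	$buffer = fgets($file, 4096);
	$lineNo++;
	if ($buffer === false)
	{
		continue;
	}
	//convert, parse and buffer the row exactly as before;
	//on a failed INSERT, report the location:
	//echo 'Import failed around line ', $lineNo, ' of ', $fileName;
}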

Why not edit online?

No, that is impossible over SSH: all those Japanese characters are garbled in the terminal, so there is no way to edit the file there.

One of the CSV files is really too large

One of the CSV files is over 400MB, so it cannot be imported in a single run. I need to split it into smaller parts and import them one by one, a batch at a time.

This is my full dirty code:

<?php

	ini_set('display_errors', 'On');
	ini_set('max_execution_time', '1200');
	ini_set('max_input_time', '1200');
	define('IN_PAGE', 1);
	header ('Content-type: text/html; charset=utf-8');

	require('include/common2.php');
	
	$fileNames = array(
					'dbo_D99_CMD.txt'					=> 'D99_CMD',
					'dbo_D99_Tmp.txt'					=> 'D99_Tmp',
					'dbo_dtproperties.txt'				=> 'dtproperties',
					'dbo_sysconstraints.txt'			=> 'sysconstraints',
					'dbo_syssegments.txt'				=> 'syssegments',
					'dbo_T001_1001.txt'					=> 'T001_コミックM',
					'dbo_T002_1002.txt'					=> 'T002_メーカーG',
					'dbo_T004_1003.txt'					=> 'T004_ジャンルM',
					'dbo_T005_1004.txt'					=> 'T005_棚番M',
					'dbo_T006_1005.txt'					=> 'T006_アーティスト',
					'dbo_T007_1006.txt'					=> 'T007_古本発注在庫M',
					'dbo_T008_1007.txt'					=> 'T008_旧新刊発注在庫M',
					'dbo_T009_1008.txt'					=> 'T009_棚番位置M',
					'dbo_T104_1009.txt'					=> 'T104_在庫M',
					'dbo_T201_1010.txt'					=> 'T201_店舗M',
					'dbo_T202_1011.txt'					=> 'T202_卸業M',
					'dbo_T203_1012.txt'					=> 'T203_個人M',
					'dbo_T900_1013.txt'					=> 'T900_発注ログ',
					'dbo_T901_1014.txt'					=> 'T901_セットM',
					'dbo_TA001_1015.txt'				=> 'TA001_コミックG',
					'dbo_TY001_1016.txt'				=> 'TY001_新刊予約M',
					'dbo_TY002_1017.txt'				=> 'TY002_新刊予約在庫M',
					'dbo_X_4672.txt'					=> 'X_4672',
					'dbo_Z001_1018.txt'					=> 'Z001_メールM',
					'dbo_Z999_1019.txt'					=> 'Z999_定数M'
					);
					
	$batch1 = array(
					'dbo_D99_CMD.txt'					=> 'D99_CMD',
					'dbo_D99_Tmp.txt'					=> 'D99_Tmp',
					'dbo_dtproperties.txt'				=> 'dtproperties',
					'dbo_sysconstraints.txt'			=> 'sysconstraints',
					'dbo_syssegments.txt'				=> 'syssegments',
					'dbo_T002_1002.txt'					=> 'T002_メーカーG',
					'dbo_T004_1003.txt'					=> 'T004_ジャンルM',
					'dbo_T006_1005.txt'					=> 'T006_アーティスト',
					'dbo_T007_1006.txt'					=> 'T007_古本発注在庫M',
					'dbo_T008_1007.txt'					=> 'T008_旧新刊発注在庫M',
					'dbo_T009_1008.txt'					=> 'T009_棚番位置M',
					'dbo_T201_1010.txt'					=> 'T201_店舗M',
					'dbo_T202_1011.txt'					=> 'T202_卸業M',
					'dbo_T203_1012.txt'					=> 'T203_個人M',
					'dbo_T901_1014.txt'					=> 'T901_セットM',
					'dbo_TY001_1016.txt'				=> 'TY001_新刊予約M',
					'dbo_X_4672.txt'					=> 'X_4672',
					'dbo_Z999_1019.txt'					=> 'Z999_定数M'
					);
					
	$batch2 = array(
					'dbo_T001_1001.txt'					=> 'T001_コミックM'
					);
					
	$batch3 = array(
					'dbo_T005_1004.txt'					=> 'T005_棚番M'
					);
					
	$batch4 = array(
					'dbo_T104_1009.txt'					=> 'T104_在庫M'
					);
					
	$batch5 = array(
					'dbo_TY002_1017.txt'				=> 'TY002_新刊予約在庫M'
					);
	
	$noTable = array(
					'dbo_T900_1013.txt'					=> 'T900_発注ログ',
					'dbo_TA001_1015.txt'				=> 'TA001_コミックG'
					);
					
	$harmfullContents = array(
					'dbo_Z001_1018.txt'					=> 'Z001_メールM'
					);
					
	echo '<pre>';
	print_r($fileNames);
	echo '</pre>';
	
	echo '<h1>Importing CSV Files to Manken</h1>';
	
	//loop all those files
	
	$root = dirname(__FILE__);
	
	foreach ($batch1 as $f => $v)
	{
		saveCsv($f, $v);
	}
	
	function saveCsv($fileName, $tableName)
	{
		global $db, $root;
		
		$file = fopen($root . '/data20090331/' . $fileName, 'r');
		if ($file)
		{
			$del = 1; //$db->sql_query("delete from `$tableName`");
			if ($del)
			{
				echo 'Inserting records into ', $tableName, ' from ', $fileName, '<br />';
				
				//read file line by line
				//insert up to 600 rows in one query to optimize performance
				$sqlBuffer = '';
				$sqlCounter = 0;
				while (!feof($file))
				{
					$buffer = fgets($file, 4096);
					
					if ($buffer)
					{
						$buffer = trim(mb_convert_encoding($buffer, 'utf-8', 'sjis-win'), "\n");
						$data = parseCsvLine($buffer);
						
						$csv = array();
						foreach ($data as $d)
						{
							$csv[] = mysql_real_escape_string($d);
						}
						if ($sqlCounter < 600)
						{
							//add the current sql to buffer
							if ($sqlBuffer == '')
							{
								//don't add comma in front, this is the first of val in the buffer
								$sqlBuffer .= "VALUES('" . implode("','", $csv) . "')";
							}
							else
							{
								$sqlBuffer .= ", ('" . implode("','", $csv) . "')";
							}
							$sqlCounter++;
						}
						else
						{
							$sqlBuffer .= ", ('" . implode("','", $csv) . "')";
							//commit the buffer to the database
							commitBuffer($sqlBuffer, $tableName);
							//clear and reset buffer
							$sqlBuffer = '';
							$sqlCounter = 0;
						}

					} //end if buffer
				} //end while
				if ($sqlCounter > 0 && $sqlBuffer != '')
				{
					//if buffer is not empty, then we have uncommited rows, commit it now
					commitBuffer($sqlBuffer, $tableName);
					//clear and reset buffer
					$sqlBuffer = '';
					$sqlCounter = 0;
				}
			}
			else
			{
				print_r($db->sql_error());
				exit;
			}
			
			fclose($file);
		}
		
	}
	
	function parseCsvLine($line)
	{	
		// splits lines with a comma but ignores a comma wrapped with double quotes
		$csv = preg_split('/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/', $line);
		$ret = array();
		
		foreach ($csv as $c)
		{
			if (preg_match('/\"/', $c))
			{
				$ret[] = trim($c, '"');
			}
			else
			{
				$ret[] = $c;
			}
		}
		
		return $ret;
	}
	
	function commitBuffer($sqlBuffer, $tableName)
	{
		global $db;
		//commit the buffer to the database
		$sql = "INSERT INTO `$tableName` " . $sqlBuffer;
		//echo htmlspecialchars($sql), '<br /><br /><br />';
		$result = $db->sql_query($sql);
		if (!$result)
		{
			echo htmlspecialchars($sql), '<br /><br /><br />';
			print_r($db->sql_error());
			exit;
		}
		return true;
	}
	
	
	echo 'And now it\'s done!';

No, it's not done yet! When the actual import takes place, I need to change this line for each batch:

foreach ($batch1 as $f => $v)

I need to change the batch number every time I import. Batch 4 is the largest file, so I need to split it into smaller files. First I tried doing that in Windows WordPad, but it seemed an endless task, so I searched the internet and found this page on splitting files:

http://www.computerhope.com/unix/usplit.htm

In my tests, the process can only handle up to about 200K rows per run, so I split my largest file like this:

split -l 200000 dbo_T104_1009.txt dbo_T104_1009_part

That splits the file into several pieces, each containing 200K lines, with filenames starting with dbo_T104_1009_part. Of course I need to rename the parts so they work with my script (a rough sketch of that renaming pass follows the next snippet). Then, when I need to import, say, part 2 of the file, I modify this line:

$del = $db->sql_query("delete from `$tableName`");
//into
$del = 1; //$db->sql_query("delete from `$tableName`");

So that it will not delete my previous imports!
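
As for renaming the split parts, here is a rough sketch using glob() and rename(); the numbered .txt names are just an assumption, and any scheme the $fileNames and batch arrays can point at will do:

$dir = dirname(__FILE__) . '/data20090331';
$parts = glob($dir . '/dbo_T104_1009_part*');
sort($parts);
$i = 1;
foreach ($parts as $part)
{
	//e.g. dbo_T104_1009_partaa becomes dbo_T104_1009_1.txt, and so on
	rename($part, $dir . '/dbo_T104_1009_' . $i . '.txt');
	$i++;
}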

Very hard work indeed! Please note that ini_set does not work on my host (it seems to be disabled), so I am limited to a 30-second execution time.
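
A quick way to confirm whether a host honors ini_set is to read the value back right after setting it; if the setting is disabled, the original limit (30 seconds in my case) is still reported:

ini_set('max_execution_time', '1200');
echo 'max_execution_time is now: ', ini_get('max_execution_time'), '<br />';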
