Saturday, January 27, 2018

Use Sqoop to perform incremental data loading (--incremental lastmodified with a timestamp check column)
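The log below starts from an existing export in HDFS. The source table and the initial full load were presumably created along these lines (the table name `t2`, the database, and the column names come from the log; the exact DDL and initial import are assumptions for context):

```shell
# Assumed setup (not shown in the log): create the source table in MariaDB
# and run a first full import into the target directory.
mysql -u employee_user -p employees <<'SQL'
CREATE TABLE t2 (
  id INT PRIMARY KEY,
  last_updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO t2 VALUES (1, CURRENT_TIMESTAMP()), (2, CURRENT_TIMESTAMP());
SQL

sqoop import \
  --connect jdbc:mysql://cdh-vm.dbaglobe.com/employees \
  --username employee_user -P \
  --table t2 -m 1 \
  --target-dir /user/donghua/t2
```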

[donghua@cdh-vm ~]$ hdfs dfs -cat /user/donghua/t2/part-m-00000
1,2018-01-27 04:50:07.0
2,2018-01-27 04:50:18.0
[donghua@cdh-vm ~]$ 

MariaDB [employees]> insert into t2 values(3,current_timestamp());
Query OK, 1 row affected (0.01 sec)

MariaDB [employees]> update t2 set last_updated_at=current_timestamp() where id=2;
Query OK, 1 row affected (0.07 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [employees]> select * from t2;
+----+---------------------+
| id | last_updated_at     |
+----+---------------------+
|  1 | 2018-01-27 04:50:07 |
|  2 | 2018-01-27 05:10:14 |
|  3 | 2018-01-27 05:09:45 |
+----+---------------------+
3 rows in set (0.00 sec)

MariaDB [employees]> 
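With --incremental lastmodified, Sqoop selects rows whose check column falls inside a window: at or after the supplied --last-value, and before the current time (the upper bound is printed in the log). Against the rows above and a last-value of '2018-01-27 05:06:03.0', only ids 2 and 3 qualify, which matches the "Retrieved 2 records" line further down. A minimal sketch of that filter in plain shell (lexicographic comparison works because the timestamps are fixed-width):

```shell
# Rows as they stand in MariaDB, and the incremental lower bound.
lastval='2018-01-27 05:06:03'
rows='1,2018-01-27 04:50:07
2,2018-01-27 05:10:14
3,2018-01-27 05:09:45'

# Keep only rows whose timestamp is >= last-value; Sqoop additionally
# bounds the window above by the current time.
delta=$(echo "$rows" | awk -F, -v lv="$lastval" '$2 >= lv')
echo "$delta"
```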


[donghua@cdh-vm ~]$ sqoop import --connect jdbc:mysql://cdh-vm.dbaglobe.com/employees  --username employee_user --password password --table t2  -m 1 --target-dir=/user/donghua/t2 --check-column=last_updated_at --incremental=lastmodified --last-value='2018-01-27 05:06:03.0'
Warning: /opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
18/01/27 05:11:59 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.1
18/01/27 05:11:59 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
18/01/27 05:12:00 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
18/01/27 05:12:00 INFO tool.CodeGenTool: Beginning code generation
18/01/27 05:12:00 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t2` AS t LIMIT 1
18/01/27 05:12:00 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t2` AS t LIMIT 1
18/01/27 05:12:00 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
Note: /tmp/sqoop-donghua/compile/631be22fe0124698ede97beba0c8288e/t2.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
18/01/27 05:12:01 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-donghua/compile/631be22fe0124698ede97beba0c8288e/t2.jar
18/01/27 05:12:02 ERROR tool.ImportTool: Import failed: --merge-key or --append is required when using --incremental lastmodified and the output directory exists.
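The error is expected: when the target directory already exists, --incremental lastmodified needs either --append (just add the new files alongside the old ones) or --merge-key (run a follow-up merge job that reconciles old and new records on the key, keeping the most recent version of each row). A small shell sketch of that reconciliation logic, using the sample rows from above:

```shell
# Old export (already in HDFS) and the fresh delta from the source table.
old='1,2018-01-27 04:50:07.0
2,2018-01-27 04:50:18.0'
new='2,2018-01-27 05:10:14.0
3,2018-01-27 05:09:45.0'

# For each id, keep the row with the most recent timestamp -- the same
# effect the --merge-key=id merge job produces.
merged=$(printf '%s\n%s\n' "$old" "$new" \
  | awk -F, '$2 > latest[$1] { latest[$1] = $2 }
             END { for (id in latest) print id "," latest[id] }' \
  | sort -t, -k1,1n)
echo "$merged"
```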

[donghua@cdh-vm ~]$ sqoop import --connect jdbc:mysql://cdh-vm.dbaglobe.com/employees  --username employee_user --password password --table t2  -m 1 --target-dir=/user/donghua/t2 --check-column=last_updated_at --incremental=lastmodified --last-value='2018-01-27 05:06:03.0' --merge-key=id
Warning: /opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
18/01/27 05:12:40 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.13.1
18/01/27 05:12:40 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
18/01/27 05:12:40 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
18/01/27 05:12:40 INFO tool.CodeGenTool: Beginning code generation
18/01/27 05:12:41 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t2` AS t LIMIT 1
18/01/27 05:12:41 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t2` AS t LIMIT 1
18/01/27 05:12:41 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
Note: /tmp/sqoop-donghua/compile/2e4f90897221b505b822c323c3cb2b41/t2.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
18/01/27 05:12:42 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-donghua/compile/2e4f90897221b505b822c323c3cb2b41/t2.jar
18/01/27 05:12:43 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `t2` AS t LIMIT 1
18/01/27 05:12:43 INFO tool.ImportTool: Incremental import based on column `last_updated_at`
18/01/27 05:12:43 INFO tool.ImportTool: Lower bound value: '2018-01-27 05:06:03.0'
18/01/27 05:12:43 INFO tool.ImportTool: Upper bound value: '2018-01-27 05:12:43.0'
18/01/27 05:12:43 WARN manager.MySQLManager: It looks like you are importing from mysql.
18/01/27 05:12:43 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
18/01/27 05:12:43 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
18/01/27 05:12:43 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
18/01/27 05:12:43 INFO mapreduce.ImportJobBase: Beginning import of t2
18/01/27 05:12:43 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
18/01/27 05:12:43 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
18/01/27 05:12:43 INFO client.RMProxy: Connecting to ResourceManager at cdh-vm.dbaglobe.com/192.168.56.10:8032
18/01/27 05:12:47 INFO db.DBInputFormat: Using read commited transaction isolation
18/01/27 05:12:47 INFO mapreduce.JobSubmitter: number of splits:1
18/01/27 05:12:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1517023991003_0016
18/01/27 05:12:48 INFO impl.YarnClientImpl: Submitted application application_1517023991003_0016
18/01/27 05:12:48 INFO mapreduce.Job: The url to track the job: http://cdh-vm.dbaglobe.com:8088/proxy/application_1517023991003_0016/
18/01/27 05:12:48 INFO mapreduce.Job: Running job: job_1517023991003_0016
18/01/27 05:12:55 INFO mapreduce.Job: Job job_1517023991003_0016 running in uber mode : false
18/01/27 05:12:55 INFO mapreduce.Job:  map 0% reduce 0%
18/01/27 05:13:01 INFO mapreduce.Job:  map 100% reduce 0%
18/01/27 05:13:02 INFO mapreduce.Job: Job job_1517023991003_0016 completed successfully
18/01/27 05:13:02 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=175177
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=48
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters 
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=4073
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=4073
Total vcore-milliseconds taken by all map tasks=4073
Total megabyte-milliseconds taken by all map tasks=6256128
Map-Reduce Framework
Map input records=2
Map output records=2
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=61
CPU time spent (ms)=920
Physical memory (bytes) snapshot=196935680
Virtual memory (bytes) snapshot=2785828864
Total committed heap usage (bytes)=155713536
File Input Format Counters 
Bytes Read=0
File Output Format Counters 
Bytes Written=48
18/01/27 05:13:02 INFO mapreduce.ImportJobBase: Transferred 48 bytes in 19.1241 seconds (2.5099 bytes/sec)
18/01/27 05:13:02 INFO mapreduce.ImportJobBase: Retrieved 2 records.
18/01/27 05:13:02 INFO tool.ImportTool: Final destination exists, will run merge job.
18/01/27 05:13:02 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
18/01/27 05:13:02 INFO client.RMProxy: Connecting to ResourceManager at cdh-vm.dbaglobe.com/192.168.56.10:8032
18/01/27 05:13:07 INFO input.FileInputFormat: Total input paths to process : 2
18/01/27 05:13:07 INFO mapreduce.JobSubmitter: number of splits:2
18/01/27 05:13:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1517023991003_0017
18/01/27 05:13:08 INFO impl.YarnClientImpl: Submitted application application_1517023991003_0017
18/01/27 05:13:08 INFO mapreduce.Job: The url to track the job: http://cdh-vm.dbaglobe.com:8088/proxy/application_1517023991003_0017/
18/01/27 05:13:08 INFO mapreduce.Job: Running job: job_1517023991003_0017
18/01/27 05:13:15 INFO mapreduce.Job: Job job_1517023991003_0017 running in uber mode : false
18/01/27 05:13:15 INFO mapreduce.Job:  map 0% reduce 0%
18/01/27 05:13:20 INFO mapreduce.Job:  map 50% reduce 0%
18/01/27 05:13:24 INFO mapreduce.Job:  map 100% reduce 0%
18/01/27 05:13:31 INFO mapreduce.Job:  map 100% reduce 100%
18/01/27 05:13:31 INFO mapreduce.Job: Job job_1517023991003_0017 completed successfully
18/01/27 05:13:31 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=90
FILE: Number of bytes written=526653
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=386
HDFS: Number of bytes written=72
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters 
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=6496
Total time spent by all reduces in occupied slots (ms)=3317
Total time spent by all map tasks (ms)=6496
Total time spent by all reduce tasks (ms)=3317
Total vcore-milliseconds taken by all map tasks=6496
Total vcore-milliseconds taken by all reduce tasks=3317
Total megabyte-milliseconds taken by all map tasks=9977856
Total megabyte-milliseconds taken by all reduce tasks=5094912
Map-Reduce Framework
Map input records=4
Map output records=4
Map output bytes=96
Map output materialized bytes=122
Input split bytes=290
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=122
Reduce input records=4
Reduce output records=3
Spilled Records=8
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=211
CPU time spent (ms)=1900
Physical memory (bytes) snapshot=1147371520
Virtual memory (bytes) snapshot=8375828480
Total committed heap usage (bytes)=1154482176
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters 
Bytes Read=96
File Output Format Counters 
Bytes Written=72
18/01/27 05:13:31 INFO tool.ImportTool: Incremental import complete! To run another incremental import of all data following this import, supply the following arguments:
18/01/27 05:13:31 INFO tool.ImportTool:  --incremental lastmodified
18/01/27 05:13:31 INFO tool.ImportTool:   --check-column last_updated_at
18/01/27 05:13:31 INFO tool.ImportTool:   --last-value 2018-01-27 05:12:43.0
18/01/27 05:13:31 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')
[donghua@cdh-vm ~]$ hdfs dfs -ls /user/donghua/t2/
Found 2 items
-rw-r--r--   1 donghua supergroup          0 2018-01-27 05:13 /user/donghua/t2/_SUCCESS
-rw-r--r--   1 donghua supergroup         72 2018-01-27 05:13 /user/donghua/t2/part-r-00000
[donghua@cdh-vm ~]$ hdfs dfs -cat /user/donghua/t2/part-r-00000
1,2018-01-27 04:50:07.0
2,2018-01-27 05:10:14.0
3,2018-01-27 05:09:45.0
[donghua@cdh-vm ~]$
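The merged output keeps one row per id with the latest timestamp, and the log prints the --last-value to use next time. As the log hints, tracking that value by hand is error prone; a saved Sqoop job stores it in the metastore and updates it after each successful run. A sketch (the job name is arbitrary):

```shell
# Create a reusable incremental job; Sqoop records the new --last-value
# after every successful run, so it never needs to be supplied manually.
sqoop job --create t2_incremental -- import \
  --connect jdbc:mysql://cdh-vm.dbaglobe.com/employees \
  --username employee_user -P \
  --table t2 -m 1 \
  --target-dir /user/donghua/t2 \
  --check-column last_updated_at \
  --incremental lastmodified \
  --merge-key id

# Each subsequent run picks up from the stored last-value:
sqoop job --exec t2_incremental
```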