Using Sqoop to Transfer Data Between HDFS and an RDBMS
2019-03-28 14:21 | Source: Internet
Sqoop is an open-source tool mainly used to transfer data between Hadoop and traditional databases. The following description is excerpted from the Sqoop user guide:
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
Sqoop is a tool open-sourced by Cloudera for transferring data between HDFS and databases. Internally it connects Hadoop to the database via JDBC, so in theory Sqoop is compatible with any database that supports JDBC. Moreover, Sqoop can not only import data into HDFS as files, but also load data directly into HBase or Hive.
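As a minimal sketch of the direct-to-Hive path mentioned above (host, database, credentials, and table name are hypothetical placeholders, not from this article):

```shell
# Hypothetical example: import a MySQL table straight into a Hive table.
# --hive-import tells Sqoop to load the data into Hive after the HDFS copy.
sqoop import \
  --connect jdbc:mysql://dbhost/crm \
  --username user -P \
  --table tb_keywords \
  --split-by id \
  --hive-import \
  --hive-table tb_keywords
```

This requires a working Hive installation on the client; without `--hive-import`, the same command would leave plain files under the default HDFS target directory.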
Below are some performance test numbers, for reference only:
Table name: tb_keywords
Rows: 11628209
Data file size: 1.4 GB
|                  | HDFS -> DB | DB -> HDFS |
|------------------|------------|------------|
| SQOOP            | 428s       | 166s       |
| HDFS<->FILE<->DB | 209s       | 105s       |
From these results, using a local FILE as an intermediate stage outperforms Sqoop, for the following reasons:
1. Sqoop is essentially JDBC-based, so it cannot be more efficient than MySQL's native import/export tools.
2. Taking export to the DB as an example, Sqoop is designed around staged commits: if a table has 1K rows, it reads 100 rows (the default), inserts them, commits, then reads the next 100 rows, and so on.
Even so, Sqoop has its advantages, such as ease of use and fault tolerance during job execution. In test environments it is worth considering as a utility when needed.
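For reference, the FILE-based round trip measured in the table above can be sketched roughly as follows (paths, table names, and options are assumptions; `SELECT ... INTO OUTFILE` requires the MySQL FILE privilege and writes to a server-local path):

```shell
# DB -> FILE -> HDFS (sketch; table and paths are hypothetical)
mysql -u user -p crm \
  -e "SELECT * INTO OUTFILE '/tmp/tb_keywords.tsv'
      FIELDS TERMINATED BY '\t' FROM tb_keywords"
hadoop fs -put /tmp/tb_keywords.tsv /user/wanghai01/data/

# HDFS -> FILE -> DB
hadoop fs -get /user/wanghai01/data/tb_keywords.tsv /tmp/
mysql -u user -p crm \
  -e "LOAD DATA LOCAL INFILE '/tmp/tb_keywords.tsv'
      INTO TABLE tb_keywords FIELDS TERMINATED BY '\t'"
```

This avoids per-batch JDBC commits entirely, which is consistent with the faster FILE timings above, at the cost of staging disk space and manual error handling.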
Below are some operation logs.
[wanghai01@tc-crm-rd01.tc.baidu.com bin]$ sh export.sh
Fri Sep 23 20:15:47 CST 2011
11/09/23 20:15:48 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
11/09/23 20:15:48 INFO tool.CodeGenTool: Beginning code generation
11/09/23 20:15:48 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `tb_keyword_data_201104` AS t LIMIT 1
11/09/23 20:15:48 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `tb_keyword_data_201104` AS t LIMIT 1
11/09/23 20:15:48 INFO orm.CompilationManager: HADOOP_HOME is /home/wanghai01/hadoop/hadoop-0.20.2/bin/..
11/09/23 20:15:48 INFO orm.CompilationManager: Found hadoop core jar at: /home/wanghai01/hadoop/hadoop-0.20.2/bin/../hadoop-0.20.2-core.jar
11/09/23 20:15:49 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-wanghai01/compile/eb16aae87a119b93acb3bc6ea74b5e97/tb_keyword_data_201104.java to /home/wanghai01/cloudera/sqoop-1.2.0-CDH3B4/bin/./tb_keyword_data_201104.java
11/09/23 20:15:49 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-wanghai01/compile/eb16aae87a119b93acb3bc6ea74b5e97/tb_keyword_data_201104.jar
11/09/23 20:15:49 INFO mapreduce.ExportJobBase: Beginning export of tb_keyword_data_201104
11/09/23 20:15:49 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `tb_keyword_data_201104` AS t LIMIT 1
11/09/23 20:15:49 INFO input.FileInputFormat: Total input paths to process : 1
11/09/23 20:15:49 INFO input.FileInputFormat: Total input paths to process : 1
11/09/23 20:15:49 INFO mapred.JobClient: Running job: job_201109211521_0012
11/09/23 20:15:50 INFO mapred.JobClient: map 0% reduce 0%
11/09/23 20:16:04 INFO mapred.JobClient: map 1% reduce 0%
11/09/23 20:16:10 INFO mapred.JobClient: map 2% reduce 0%
11/09/23 20:16:13 INFO mapred.JobClient: map 3% reduce 0%
11/09/23 20:16:19 INFO mapred.JobClient: map 4% reduce 0%
11/09/23 20:16:22 INFO mapred.JobClient: map 5% reduce 0%
11/09/23 20:16:25 INFO mapred.JobClient: map 6% reduce 0%
11/09/23 20:16:31 INFO mapred.JobClient: map 7% reduce 0%
11/09/23 20:16:34 INFO mapred.JobClient: map 8% reduce 0%
11/09/23 20:16:41 INFO mapred.JobClient: map 9% reduce 0%
11/09/23 20:16:44 INFO mapred.JobClient: map 10% reduce 0%
11/09/23 20:16:50 INFO mapred.JobClient: map 11% reduce 0%
11/09/23 20:16:53 INFO mapred.JobClient: map 12% reduce 0%
11/09/23 20:16:56 INFO mapred.JobClient: map 13% reduce 0%
11/09/23 20:17:02 INFO mapred.JobClient: map 14% reduce 0%
11/09/23 20:17:05 INFO mapred.JobClient: map 15% reduce 0%
11/09/23 20:17:11 INFO mapred.JobClient: map 16% reduce 0%
11/09/23 20:17:14 INFO mapred.JobClient: map 17% reduce 0%
11/09/23 20:17:17 INFO mapred.JobClient: map 18% reduce 0%
11/09/23 20:17:23 INFO mapred.JobClient: map 19% reduce 0%
11/09/23 20:17:25 INFO mapred.JobClient: map 20% reduce 0%
11/09/23 20:17:28 INFO mapred.JobClient: map 21% reduce 0%
11/09/23 20:17:34 INFO mapred.JobClient: map 22% reduce 0%
11/09/23 20:17:37 INFO mapred.JobClient: map 23% reduce 0%
11/09/23 20:17:43 INFO mapred.JobClient: map 24% reduce 0%
11/09/23 20:17:46 INFO mapred.JobClient: map 25% reduce 0%
11/09/23 20:17:49 INFO mapred.JobClient: map 26% reduce 0%
11/09/23 20:17:55 INFO mapred.JobClient: map 27% reduce 0%
11/09/23 20:17:58 INFO mapred.JobClient: map 28% reduce 0%
11/09/23 20:18:04 INFO mapred.JobClient: map 29% reduce 0%
11/09/23 20:18:07 INFO mapred.JobClient: map 30% reduce 0%
11/09/23 20:18:10 INFO mapred.JobClient: map 31% reduce 0%
11/09/23 20:18:16 INFO mapred.JobClient: map 32% reduce 0%
11/09/23 20:18:19 INFO mapred.JobClient: map 33% reduce 0%
11/09/23 20:18:25 INFO mapred.JobClient: map 34% reduce 0%
11/09/23 20:18:28 INFO mapred.JobClient: map 35% reduce 0%
11/09/23 20:18:31 INFO mapred.JobClient: map 36% reduce 0%
11/09/23 20:18:37 INFO mapred.JobClient: map 37% reduce 0%
11/09/23 20:18:40 INFO mapred.JobClient: map 38% reduce 0%
11/09/23 20:18:46 INFO mapred.JobClient: map 39% reduce 0%
11/09/23 20:18:49 INFO mapred.JobClient: map 40% reduce 0%
11/09/23 20:18:52 INFO mapred.JobClient: map 41% reduce 0%
11/09/23 20:18:58 INFO mapred.JobClient: map 42% reduce 0%
11/09/23 20:19:01 INFO mapred.JobClient: map 43% reduce 0%
11/09/23 20:19:04 INFO mapred.JobClient: map 44% reduce 0%
11/09/23 20:19:10 INFO mapred.JobClient: map 45% reduce 0%
11/09/23 20:19:13 INFO mapred.JobClient: map 46% reduce 0%
11/09/23 20:19:19 INFO mapred.JobClient: map 47% reduce 0%
11/09/23 20:19:22 INFO mapred.JobClient: map 48% reduce 0%
11/09/23 20:19:25 INFO mapred.JobClient: map 49% reduce 0%
11/09/23 20:19:34 INFO mapred.JobClient: map 50% reduce 0%
11/09/23 20:19:37 INFO mapred.JobClient: map 52% reduce 0%
11/09/23 20:19:40 INFO mapred.JobClient: map 53% reduce 0%
11/09/23 20:19:43 INFO mapred.JobClient: map 54% reduce 0%
11/09/23 20:19:46 INFO mapred.JobClient: map 55% reduce 0%
11/09/23 20:19:49 INFO mapred.JobClient: map 56% reduce 0%
11/09/23 20:19:52 INFO mapred.JobClient: map 57% reduce 0%
11/09/23 20:19:55 INFO mapred.JobClient: map 58% reduce 0%
11/09/23 20:19:58 INFO mapred.JobClient: map 59% reduce 0%
11/09/23 20:20:01 INFO mapred.JobClient: map 60% reduce 0%
11/09/23 20:20:04 INFO mapred.JobClient: map 62% reduce 0%
11/09/23 20:20:07 INFO mapred.JobClient: map 63% reduce 0%
11/09/23 20:20:10 INFO mapred.JobClient: map 64% reduce 0%
11/09/23 20:20:13 INFO mapred.JobClient: map 65% reduce 0%
11/09/23 20:20:16 INFO mapred.JobClient: map 66% reduce 0%
11/09/23 20:20:19 INFO mapred.JobClient: map 67% reduce 0%
11/09/23 20:20:22 INFO mapred.JobClient: map 68% reduce 0%
11/09/23 20:20:25 INFO mapred.JobClient: map 69% reduce 0%
11/09/23 20:20:28 INFO mapred.JobClient: map 70% reduce 0%
11/09/23 20:20:31 INFO mapred.JobClient: map 72% reduce 0%
11/09/23 20:20:34 INFO mapred.JobClient: map 73% reduce 0%
11/09/23 20:20:37 INFO mapred.JobClient: map 74% reduce 0%
11/09/23 20:20:40 INFO mapred.JobClient: map 75% reduce 0%
11/09/23 20:20:43 INFO mapred.JobClient: map 76% reduce 0%
11/09/23 20:20:46 INFO mapred.JobClient: map 77% reduce 0%
11/09/23 20:20:49 INFO mapred.JobClient: map 78% reduce 0%
11/09/23 20:20:52 INFO mapred.JobClient: map 80% reduce 0%
11/09/23 20:20:55 INFO mapred.JobClient: map 81% reduce 0%
11/09/23 20:20:58 INFO mapred.JobClient: map 82% reduce 0%
11/09/23 20:21:01 INFO mapred.JobClient: map 83% reduce 0%
11/09/23 20:21:04 INFO mapred.JobClient: map 84% reduce 0%
11/09/23 20:21:07 INFO mapred.JobClient: map 85% reduce 0%
11/09/23 20:21:10 INFO mapred.JobClient: map 86% reduce 0%
11/09/23 20:21:13 INFO mapred.JobClient: map 87% reduce 0%
11/09/23 20:21:22 INFO mapred.JobClient: map 88% reduce 0%
11/09/23 20:21:28 INFO mapred.JobClient: map 89% reduce 0%
11/09/23 20:21:37 INFO mapred.JobClient: map 90% reduce 0%
11/09/23 20:21:47 INFO mapred.JobClient: map 91% reduce 0%
11/09/23 20:21:53 INFO mapred.JobClient: map 92% reduce 0%
11/09/23 20:22:02 INFO mapred.JobClient: map 93% reduce 0%
11/09/23 20:22:11 INFO mapred.JobClient: map 94% reduce 0%
11/09/23 20:22:17 INFO mapred.JobClient: map 95% reduce 0%
11/09/23 20:22:26 INFO mapred.JobClient: map 96% reduce 0%
11/09/23 20:22:32 INFO mapred.JobClient: map 97% reduce 0%
11/09/23 20:22:41 INFO mapred.JobClient: map 98% reduce 0%
11/09/23 20:22:47 INFO mapred.JobClient: map 99% reduce 0%
11/09/23 20:22:53 INFO mapred.JobClient: map 100% reduce 0%
11/09/23 20:22:55 INFO mapred.JobClient: Job complete: job_201109211521_0012
11/09/23 20:22:55 INFO mapred.JobClient: Counters: 6
11/09/23 20:22:55 INFO mapred.JobClient: Job Counters
11/09/23 20:22:55 INFO mapred.JobClient: Launched map tasks=4
11/09/23 20:22:55 INFO mapred.JobClient: Data-local map tasks=4
11/09/23 20:22:55 INFO mapred.JobClient: FileSystemCounters
11/09/23 20:22:55 INFO mapred.JobClient: HDFS_BYTES_READ=1392402240
11/09/23 20:22:55 INFO mapred.JobClient: Map-Reduce Framework
11/09/23 20:22:55 INFO mapred.JobClient: Map input records=11628209
11/09/23 20:22:55 INFO mapred.JobClient: Spilled Records=0
11/09/23 20:22:55 INFO mapred.JobClient: Map output records=11628209
11/09/23 20:22:55 INFO mapreduce.ExportJobBase: Transferred 1.2968 GB in 425.642 seconds (3.1198 MB/sec)
11/09/23 20:22:55 INFO mapreduce.ExportJobBase: Exported 11628209 records.
Fri Sep 23 20:22:55 CST 2011
###############
[wanghai01@tc-crm-rd01.tc.baidu.com bin]$ sh import.sh
Fri Sep 23 20:40:33 CST 2011
11/09/23 20:40:33 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
11/09/23 20:40:33 INFO tool.CodeGenTool: Beginning code generation
11/09/23 20:40:33 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `tb_keyword_data_201104` AS t LIMIT 1
11/09/23 20:40:33 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `tb_keyword_data_201104` AS t LIMIT 1
11/09/23 20:40:33 INFO orm.CompilationManager: HADOOP_HOME is /home/wanghai01/hadoop/hadoop-0.20.2/bin/..
11/09/23 20:40:33 INFO orm.CompilationManager: Found hadoop core jar at: /home/wanghai01/hadoop/hadoop-0.20.2/bin/../hadoop-0.20.2-core.jar
11/09/23 20:40:34 ERROR orm.CompilationManager: Could not rename /tmp/sqoop-wanghai01/compile/a913cede5621df95376a26c1af737ee2/tb_keyword_data_201104.java to /home/wanghai01/cloudera/sqoop-1.2.0-CDH3B4/bin/./tb_keyword_data_201104.java
11/09/23 20:40:34 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-wanghai01/compile/a913cede5621df95376a26c1af737ee2/tb_keyword_data_201104.jar
11/09/23 20:40:34 WARN manager.MySQLManager: It looks like you are importing from mysql.
11/09/23 20:40:34 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
11/09/23 20:40:34 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
11/09/23 20:40:34 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
11/09/23 20:40:34 INFO mapreduce.ImportJobBase: Beginning import of tb_keyword_data_201104
11/09/23 20:40:34 INFO manager.MySQLManager: Executing SQL statement: SELECT t.* FROM `tb_keyword_data_201104` AS t LIMIT 1
11/09/23 20:40:40 INFO mapred.JobClient: Running job: job_201109211521_0014
11/09/23 20:40:41 INFO mapred.JobClient: map 0% reduce 0%
11/09/23 20:40:54 INFO mapred.JobClient: map 25% reduce 0%
11/09/23 20:40:57 INFO mapred.JobClient: map 50% reduce 0%
11/09/23 20:41:36 INFO mapred.JobClient: map 75% reduce 0%
11/09/23 20:42:00 INFO mapred.JobClient: map 100% reduce 0%
11/09/23 20:43:19 INFO mapred.JobClient: Job complete: job_201109211521_0014
11/09/23 20:43:19 INFO mapred.JobClient: Counters: 5
11/09/23 20:43:19 INFO mapred.JobClient: Job Counters
11/09/23 20:43:19 INFO mapred.JobClient: Launched map tasks=4
11/09/23 20:43:19 INFO mapred.JobClient: FileSystemCounters
11/09/23 20:43:19 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1601269219
11/09/23 20:43:19 INFO mapred.JobClient: Map-Reduce Framework
11/09/23 20:43:19 INFO mapred.JobClient: Map input records=11628209
11/09/23 20:43:19 INFO mapred.JobClient: Spilled Records=0
11/09/23 20:43:19 INFO mapred.JobClient: Map output records=11628209
11/09/23 20:43:19 INFO mapreduce.ImportJobBase: Transferred 1.4913 GB in 165.0126 seconds (9.2544 MB/sec)
11/09/23 20:43:19 INFO mapreduce.ImportJobBase: Retrieved 11628209 records.
Fri Sep 23 20:43:19 CST 2011
The main commands in import.sh and export.sh are as follows:
/home/wanghai01/cloudera/sqoop-1.2.0-CDH3B4/bin/sqoop import --connect jdbc:mysql://XXXX/crm --username XX --password XX --table tb_keyword_data_201104 --split-by winfo_id --target-dir /user/wanghai01/data/ --fields-terminated-by '\t' --lines-terminated-by '\n' --input-null-string '' --input-null-non-string ''
/home/wanghai01/cloudera/sqoop-1.2.0-CDH3B4/bin/sqoop export --connect jdbc:mysql://XXXX/crm --username XX --password XX --table tb_keyword_data_201104 --export-dir /user/wanghai01/data/ --fields-terminated-by '\t' --lines-terminated-by '\n' --input-null-string '' --input-null-non-string ''
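Note that the import log above warns that the MySQL-specific fast path was not used. A variant of the same import with `--direct` enabled (which delegates to MySQL's native dump tooling and is often faster, though it constrains some formatting options) might look like this; connection details are masked as in the original commands, and `-P` replaces the insecure `--password` flag the log warns about:

```shell
# Same import as above, with MySQL's direct fast path enabled.
/home/wanghai01/cloudera/sqoop-1.2.0-CDH3B4/bin/sqoop import \
  --connect jdbc:mysql://XXXX/crm \
  --username XX -P \
  --table tb_keyword_data_201104 \
  --split-by winfo_id \
  --direct \
  --target-dir /user/wanghai01/data/
```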