hive> create table student_info2 like student_info;
OK
Time taken: 0.73 seconds
hive> show tables;
OK
employee
student_info
student_info2
student_school_info
student_school_info_external_partition
student_school_info_partition
Time taken: 0.15 seconds, Fetched: 6 row(s)
Based on an existing table, use the as keyword to create a new table whose columns match the query result, inserting the query result into the new table at the same time:
create table student_info3 as select * from student_info;
Create a table with only the student_id and name fields:
create table student_info4 as select student_id,name from student_info;
Renaming a table

Rename student_info4 to student_id_name:
alter table student_info4 rename to student_id_name;
hive> alter table student_info3 add columns(gender string comment '性别');
OK
Time taken: 0.185 seconds
Modifying or dropping columns

To modify a column, redefine every column that should remain and give the new definition for the column being replaced:
hive> alter table student_info3 replace columns(student_id string,name string,age int,origin string,gender2 int);
OK
Time taken: 0.422 seconds
To drop a column, redefine every column that should remain and simply leave out the column being dropped:
hive> alter table student_info3 replace columns(student_id string,name string,age int,origin string);
OK
Time taken: 0.529 seconds
2. Using bucketed tables

Creating a bucketed table

Rows are assigned to buckets by the hash value of a chosen column. Create a bucketed table for student enrollment info:
Field        Type    Comment     Bucket column
student_id   string  Student ID  yes
name         string  Name        no
age          int     Age         no
origin       string  College ID  no
create table rel.student_info_bucket(
    student_id string,
    name string,
    age int,
    origin string
)
clustered by (student_id)
sorted by (student_id asc) into 4 buckets
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
Inserting data into a bucketed table

Insert data into the student_info_bucket table:
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table student_info_bucket
select student_id,name,age,origin
from student_info
cluster by(student_id);
These settings ensure that exactly one file is generated per bucket. If the table is defined as sorted on the bucketing column, the data must be sorted on that column while it is being selected from the source table and inserted into the buckets; the bucketed table itself does not sort the data within each bucket. When the sort column and the bucketing column are the same, use cluster by (col), which by default sorts ascending on the bucketing column within each bucket. For descending order within a bucket, use the combination distribute by (col) sort by (col desc) instead.
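As a rough illustration of the bucketing rule, the sketch below assigns rows to buckets as Hive does for string keys: hash the bucketing column and take it modulo the bucket count. The hash here replicates Java's String.hashCode, which Hive's default hashing matches for ASCII strings; treat this as a conceptual sketch, not Hive's exact internal code path.

```python
# Sketch: bucket index = (hash(bucketing column) & Integer.MAX_VALUE) % numBuckets.
# The hash mimics Java's String.hashCode (what Hive effectively uses for strings).

def java_string_hashcode(s: str) -> int:
    """Replicate Java's String.hashCode as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket_for(student_id: str, num_buckets: int = 4) -> int:
    # Mask with Integer.MAX_VALUE so negative hashes still map to a valid bucket.
    return (java_string_hashcode(student_id) & 0x7FFFFFFF) % num_buckets

rows = ["1", "2", "3", "4"]
print({sid: bucket_for(sid) for sid in rows})  # → {'1': 1, '2': 2, '3': 3, '4': 0}
```

With 4 buckets and 4 reducers, each distinct bucket index corresponds to one reducer and hence one output file, which is why the reducer count is set to the bucket count above.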
3. Exporting data

Use insert to export the student_info table to a local path:
insert overwrite local directory '/home/hadoop/apps/hive_test_data/export_data'
row format delimited fields terminated by '\t'
select * from student_info;
4. Join queries

Create two test tables, a and b:

create table rel.a(
    id int,
    name string
)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
create table rel.b(
    id int,
    name string
)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
Load data into both tables:
hive> load data local inpath '/home/hadoop/apps/hive_test_data/a_join_data' into table a;
Loading data to table rel.a
Table rel.a stats: [numFiles=1, totalSize=61]
OK
Time taken: 1.79 seconds
hive> load data local inpath '/home/hadoop/apps/hive_test_data/b_join_data' into table b;
Loading data to table rel.b
Table rel.b stats: [numFiles=1, totalSize=38]
OK
Time taken: 0.562 seconds
inner join

Join the two tables on id, keeping only the rows that match in both:

select * from a inner join b on a.id=b.id;
.....
OK
1 a 1 AA
2 b 2 BB
3 c 3 CC
6 f 6 FF
Time taken: 44.337 seconds, Fetched: 4 row(s)
full outer join (or full join)

Join the two tables on id and return all rows from both tables:
OK
1 a 1 AA
2 b 2 BB
3 c 3 CC
4 d NULL NULL
5 e NULL NULL
6 f 6 FF
7 g NULL NULL
8 h NULL NULL
9 i NULL NULL
10 j NULL NULL
11 k NULL NULL
12 l NULL NULL
13 m NULL NULL
14 n NULL NULL
NULL NULL 20 TT
NULL NULL 21 UU
NULL NULL 22 vv
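The full-outer-join semantics shown above can be sketched in plain Python: emit every left row (padding the right side with NULLs when unmatched), then emit the right rows that never matched, padding the left side. This is an illustrative sketch of the semantics, not how Hive executes the join.

```python
# Sketch of full-outer-join semantics on two lists of (id, name) rows,
# mirroring `select * from a full outer join b on a.id = b.id`.

def full_outer_join(a, b):
    b_index = {}
    for bid, bname in b:
        b_index.setdefault(bid, []).append((bid, bname))
    matched = set()
    out = []
    for aid, aname in a:
        if aid in b_index:
            matched.add(aid)
            for brow in b_index[aid]:
                out.append((aid, aname) + brow)
        else:
            out.append((aid, aname, None, None))   # right side is NULL
    for bid, bname in b:
        if bid not in matched:
            out.append((None, None, bid, bname))   # left side is NULL
    return out

a = [(1, 'a'), (2, 'b'), (4, 'd')]
b = [(1, 'AA'), (3, 'CC')]
print(full_outer_join(a, b))
# → [(1, 'a', 1, 'AA'), (2, 'b', None, None), (4, 'd', None, None), (None, None, 3, 'CC')]
```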
left join
A left join keeps every row from the left table; right-table columns are NULL wherever there is no match:
OK
1 a 1 AA
2 b 2 BB
3 c 3 CC
4 d NULL NULL
5 e NULL NULL
6 f 6 FF
7 g NULL NULL
8 h NULL NULL
9 i NULL NULL
10 j NULL NULL
11 k NULL NULL
12 l NULL NULL
13 m NULL NULL
14 n NULL NULL
right join
A right join keeps every row from the right table; left-table columns are NULL wherever there is no match:
select * from a right join b on a.id=b.id;
OK
1 a 1 AA
2 b 2 BB
3 c 3 CC
6 f 6 FF
NULL NULL 20 TT
NULL NULL 21 UU
NULL NULL 22 vv
Time taken: 25.188 seconds, Fetched: 7 row(s)
left semi join

The standard SQL membership query

select a.id, a.name from a where a.id in (select b.id from b);

is written in Hive as:
select a.id,a.name from a left semi join b on a.id = b.id;
OK
1 a
2 b
3 c
6 f
Time taken: 27.42 seconds, Fetched: 4 row(s)
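A left semi join differs from an inner join in two ways: only left-table columns appear in the output, and each left row is emitted at most once even if the right table contains duplicate keys. A minimal sketch of those semantics:

```python
# Sketch of left-semi-join semantics: keep each left row at most once
# if its key appears on the right; only left-side columns survive.

def left_semi_join(a, b_ids):
    keys = set(b_ids)   # membership test, so right-side duplicates don't matter
    return [(aid, aname) for aid, aname in a if aid in keys]

a = [(1, 'a'), (2, 'b'), (4, 'd'), (6, 'f')]
b_ids = [1, 2, 6, 6]    # duplicate 6 on the right does not duplicate output
print(left_semi_join(a, b_ids))  # → [(1, 'a'), (2, 'b'), (6, 'f')]
```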
map side join
A map-side join uses the distributed cache to load the small table into every map task and completes the join on the map side; the map output does not need to be shipped to a reduce phase for joining, which cuts the cost of moving data between nodes over the network. It is one way to mitigate data skew in multi-table joins. In a multi-table join where only one table is large and the others are all small, the join is converted into a map-only job, and the job log will show "Number of reduce tasks is set to 0 since there's no reduce operator". For example:
select /*+ mapjoin(b) */ a.id, a.name from a join b on a.id = b.id;
Total MapReduce CPU Time Spent: 1 seconds 320 msec
OK
1 a
2 b
3 c
6 f
Time taken: 25.538 seconds, Fetched: 4 row(s)
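The broadcast idea behind the map-side join can be sketched as follows: the small table is materialized as an in-memory hash map available to every map task, and each mapper joins its own split of the large table locally, so no shuffle or reduce phase is needed. This is a conceptual sketch, not Hive's implementation.

```python
# Sketch of a map-side (broadcast) join: the small table is loaded into a
# hash map on every map task; each mapper joins its split of the large
# table locally, so no data crosses the network for a reduce phase.

small = {1: 'AA', 2: 'BB', 3: 'CC', 6: 'FF'}   # broadcast to each mapper

def map_task(large_split, small_index):
    # One mapper's work: probe the in-memory index for each streamed row.
    return [(rid, name) for rid, name in large_split if rid in small_index]

split = [(1, 'a'), (2, 'b'), (4, 'd'), (6, 'f')]
print(map_task(split, small))  # → [(1, 'a'), (2, 'b'), (6, 'f')]
```

The design trade-off is memory for network: this only works when the small table fits in each mapper's memory, which is exactly the condition under which Hive converts the join into a map-only job.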
5. Hive built-in functions

Create a user score table:
create table rel.user_core_info(
    user_id string,
    age int,
    gender string,
    core int
)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
Load the data:
load data local inpath '/home/hadoop/apps/hive_test_data/user_core.txt' into table rel.user_core_info;
Conditional function: case when
Syntax 1: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END

If a equals b, return c; if a equals d, return e; otherwise return f. For example:
hive> select case 1 when 2 then 'two' when 1 then 'one' else 'zero' end;
OK
one
Time taken: 0.152 seconds, Fetched: 1 row(s)
Syntax 2: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END

If a is TRUE, return b; if c is TRUE, return d; otherwise return e. For example:
hive> select case when 1=2 then 'two' when 1=1 then 'one' else 'zero' end;
OK
one
Time taken: 0.33 seconds, Fetched: 1 row(s)
Query the user score table for the maximum score in each age bracket:
select gender,
    case when age<=20 then 'p0'
         when age>20 and age<=50 then 'p1'
         when age>=50 then 'p3'
         else 'p0'
    end,
    max(core) max_core
from rel.user_core_info
group by gender,
    case when age<=20 then 'p0'
         when age>20 and age<=50 then 'p1'
         when age>=50 then 'p3'
         else 'p0'
    end;
Result:
OK
female p0 90
female p1 95
female p3 90
male p0 80
male p1 80
male p3 80
Time taken: 28.461 seconds, Fetched: 6 row(s)
If the following exception appears while building the UDF project:

Failed to execute goal on project hive-exec: Could not resolve dependencies for project org.apache.hive:hive-exec:jar:2.3.0: Could not find artifact org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde in alimaven (http://maven.aliyun.com/nexus/content/groups/public/) -> [Help 1]
hive> add jar /home/hadoop/apps/HiveUdfPro-1.0-SNAPSHOT.jar;
Added [/home/hadoop/apps/HiveUdfPro-1.0-SNAPSHOT.jar] to class path
Added resources: [/home/hadoop/apps/HiveUdfPro-1.0-SNAPSHOT.jar]
Create a temporary function bound to the compiled Java class:
hive> create temporary function age_partition as 'com.yongliang.udf.AgePartitionFunction';
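The Java source of AgePartitionFunction is not shown in this post; as a rough illustration only, here is a hypothetical Python mirror of the logic it appears to implement, with the bracket boundaries inferred from the earlier case-when query and the p0/p1/p2 labels in the query result:

```python
# Hypothetical mirror of the AgePartitionFunction UDF's logic.
# Boundaries are inferred, not taken from the (unshown) Java source.

def age_partition(age: int) -> str:
    if age <= 20:
        return 'p0'   # 20 and under
    elif age <= 50:
        return 'p1'   # 21 to 50
    else:
        return 'p2'   # over 50

print([age_partition(a) for a in (18, 35, 60)])  # → ['p0', 'p1', 'p2']
```

Note that unlike the earlier case-when expression, the UDF's result labels the oldest bracket p2 rather than p3, as the final query output below shows.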
Use the custom function in HQL:
select gender,
    age_partition(age),
    max(core) max_core
from rel.user_core_info
group by gender,
    age_partition(age);
Result:
OK
female p0 90
female p1 95
female p2 90
male p0 80
male p1 80
male p2 80