Hive基本操作-摩杜云开发者社区

Hive基本操作_字段

DDL

数据定义语言（data-define-lauguage ）

1.数据库操作

1.1建库

create database if not exists dbname;
if not exists 防止报错

1.2切换库

use dbname;

1.3查询正在使用的库

select current_database();

1.4查询库列表

show databases;
show databases like "*test";

1.5查询库的详细描述信息

desc database dbname;

1.6删除数据库

drop database if exists dbname;  默认只能删除空数据库中没有表的
drop database dbname cascade;  级联删除
drop database dbname;=== drop database dbname restrict;

2.表操作

2.1建表语句

案例

95002,刘晨,女,19,IS

创建数据库

create database if not exists test_hive;
use test_hive;

2.1.1创建一个内部表

create table if not exists stu_managed(sid int,name string,sex string,age int,dept string) row format delimited fields terminated by ",";

2.1.2创建一个外部表

create external table if not exists stu_external(sid int,name string,sex string,age int,dept string) row format delimited fields terminated by ",";

2.1.3创建一个分区表

分区字段   dept 
create table if not exists stu_ptn_dept(sid int,name string,sex string,age int) partitioned by (dept string) row format delimited fields terminated by "," location "/user/hivedata/stu_ptn";

分区字段 sex
create table if not exists stu_ptn_sex(sid int,name string,age int,dept string) partitioned by (sex string) row format delimited fields terminated by ",";

2.1.4创建一个分桶表

分桶字段  age    分桶个数  3
create table if not exists stu_buk(sid int,name string,sex string,age int,dept string) clustered by (age) sorted by(sid) into 3 buckets row format delimited fields terminated by ",";

2.1.5创建一个复制表

仅仅复制表结构，不复制表数据和表属性
like 
create table tbname like tbname1;
create external table stu_copy like stu_buk;
注意：表属性取决于自己建表时候是否指定(external)的

2.1.6ctas建表

create table tbname as select ....
将一个sql的查询结果存储在一个表中
create table tbname as select * from stu_managed;

2.2查看表的详细信息

desc tbname;
desc extended tbname;查看表的扩展信息
desc extended stu_managed;
desc formatted tbname; 格式化显示表的详细信息
desc formatted stu_managed;

2.3删除表

drop tbname;

2.4查看表列表

show tables;
show tables in dbname;  指定数据库下的所有表
show tables in test_test_hive;
show tables like "stu*";

2.5修改表

2.5.1修改表名

元数据库
alter table tbname rename to tbname_new;
alter table stu_copy01 rename to stu_copy;

2.5.2修改表的列信息

添加列

alter table tbname add columns(col type);
alter table stu_test add columns(family string);

修改列

alter table tbname change old_col new_col type;
修改列名
alter table stu_test change family fa string;
修改列的类型
只支持小->大	不支持大->小
int--> string 
alter table stu_test change sid sid string;
string ---> int 
alter table stu_test change sid sid int;

替换列

alter table tbname replace columns (col type);
alter table stu_test replace columns (name string);

2.5.3修改表的分区信息( 针对分区表)

stu_ptn_dept 分区字段，分区表相当于对原始表，字段分成两部分存储。一部分建表字段，另一部分是分区字段，分区字段中分区字段代表的是分区标志。

添加分区，指定一个分区字段对应的值

alter table tbname add partition(需要添加的分区);
alter table stu_ptn_dept add partition(dept="CS");  这个分区存储都是dept 为CS的数据
alter table tbname add partition(需要添加的分区) partition(分区) partition(分区);
alter table stu_ptn_dept add partition(dept="IS") partition(dept="MA");

删除分区

alter table tbname drop partition(分区名);
alter table stu_ptn_dept drop partition(dept="CS");

修改分区

修改分区的存储路径 
    1.添加分区的时候需要指定一个新的路径
    默认分区路径在表路径下
    alter table stu_ptn_dept add partition(dept="CS") location "/user/ptn/data/dept";
    2.修改已经添加过的分区的路径 set location 
    alter table stu_ptn_dept partition(dept="CS") set location "/user/hivedata/stu_ptn"; 
    表名后 partition 指定需要操作的分区

查询某一个表的分区

show partitions tbname; 查询某一个表的所有分区

清空表

truncate table tbname;
清空表数据保留表结构,不适用外部表

查看详细建表语句

show create table table_name;
show create table stu_managed;

DML

数据管理语言(data-manager-lauguage)

1.数据导入

1.1load 数据加载

local 代表从本地加载数据，数据存储在linux hive所在节点的本地的，不加local代表数据从hdfs加载的

1.1.1从本地加载数据

load data local inpath "/home/hadoop/hive_data/student.txt" into table stu_managed;
相当于将本地文件复制、上传到hive表所在的路径下

1.1.2从hdfs加载数据

load data inpath "/hive_data" into table stu_managed;
相当于将hdfs的数据移动到 hive表所在的路径下
hive数据加载的本质就是将文件传入hive的表所在的hdfs的路径下
    
如果手动将一个数据直接上传到hive的表路径下hive是否可以解析到？
hadoop fs -put student.txt /user/hive/warehouse/test_hive.db/stu_managed/stu_sj
可以识别的   
换句话说hive 本质是可以解析表路径下的所有数据的，只要将数据放在表路径下就都可以解析到。hive相当于hdfs的一个路径的使用者。

1.2数据插入

1.2.1单条数据插入

效率很低
insert into table tbname values();
insert into table stu_managed values(1,"zs","f",45,"CS");

1.2.2单重数据插入

执行一次插入操作，插入多条数据，将一个表的查询结果插入到另一个表中
insert into table tbname select .....
insert into table stu_external select * from stu_managed;

1.2.3多重数据插入

对原始数据表扫描一次，可以插入到多个表中或一个表的多个分区中
需求：
将stu_managed   age <=18    stu01   age <=19  stu02  剩下 stu03 
from tbname 
insert into table tbname01 select ... where .....
insert into table tbname02 select .... where .... 

from stu_managed 
insert into table stu01 select * where age <= 18 
insert into table stu02 select * where age>18 and age<=19 
insert into table stu03 select * where age>19;
多重>单重>单条

1.3分区表的数据加载

stu_ptn_dept dept String hdfs://test_hive/user/hivedata/stu_ptn
stu_ptn_sex sex string 默认路径

1.3.1静态分区加载

加载数据的时候手动静态指定分区值,手动添加一个分区
alter table stu_ptn_dept add partition(dept="CS");

load方式

load data local inpath "/home/hadoop/hive_data/student.txt" into table stu_ptn_dept partition(dept="CS");
注意： 使用load方式在进行数据加载的时候不会进行数据过滤,一定要十分确定的情况下才是用这种方式

insert方式

从一个非分区表中查询数据然后插入到分区表中是用来数据过滤的
单重数据插入 
insert into stu_ptn_dept partition(dept="CS") select sid,name,sex,age from stu_managed01 where dept="CS";
多重 
from stu_managed01 
insert into stu_ptn_dept partition(dept="IS") select sid,name,sex,age where dept="IS" 
insert into stu_ptn_dept partition(dept="MA") select sid,name,sex,age where dept="MA"; 


from stu_managed01 
    insert into stu_ptn_sex partition(sex="man") select sid,name,age,dept where sex='男' 
    insert into stu_ptn_sex partition(sex="women") select sid,name,age,dept where sex='女';

1.3.2动态分区加载

静态分区加载数据缺陷：必须足够了解数据知道每一个分区名，不适用分区很多的时候。所以需要动态分区，根据数据自动生成分区不需要自己制定分区名。动态分区不支持 load方式，只能使用insert方式。

单重插入

insert into table tbname partition(分区字段) select ...需要查询分区字段分区字段放在最后

insert into table stu_ptn_dept partition(dept) select sid,name,sex,age,dept 
from stu_managed01;

hive中默认的动态分区是关闭的要进行更改
set hive.exec.dynamic.partition.mode=nonstrict;

1.4案例补充

日期  /year=2019/month=12/day=
多级分区 分区字段有多个 >=2
partitioned by (dept string,age int)
多级分区    
    字段靠前  高级分区
    字段靠后  低级分区
字段靠前的  类似  一级目录  字段靠后的  类似 子目录
目录结构
/stu_ptn/dept="CS"/age=17
/stu_ptn/dept="CS"/age=18
/stu_ptn/dept="CS"/age=19

/stu_ptn/dept="IS"/age=19
/stu_ptn/dept="IS"/age=20
建表：
create table if not exists stu_ptn_dj(sid int,name string,sex string) partitioned by (dept string,age int) row format delimited fields terminated by ",";

数据插入：
静态 
insert into table stu_ptn_dj partition(dept="CS",age=18) select sid,name,sex from stu_managed01 where dept="CS" and age=18;

一动一静
静态
insert into table stu_ptn_dj partition(dept="MA",age) select sid,name,sex,age from stu_managed01 where dept="MA"; 
动态
insert into table stu_ptn_dj partition(dept,age) select sid,name,sex,dept,age from stu_managed01;
注意： 分区表的操作一定是指定操作的分区

1.5分桶表的数据加载

stu_buk    age   3   sort sid 1 
将数据按照分桶规则（分区算法） 分到不同的桶中
分桶规则：
    分桶  string 
    分桶字段 .hash % 分桶个数
    分桶   int 
    分桶字段 % 分桶个数 
分桶表不支持load的方式的
load data local inpath "/home/hadoop/hive_data/student.txt" into table stu_buk;  不支持这种方式 

只能用insert方式 
insert into table stu_buk select * from stu_managed01;
日志：
Cannot run job locally: Number of reducers (= 3) is more than 1
底层启动  分桶个数对应的  reducetask的个数
一个分桶--- mr一个分区
桶1  age%3=0 
95005,刘刚,男,18,MA
95008,李娜,女,18,CS
95009,梦圆圆,女,18,MA
95011,包小柏,男,18,MA
95013,冯伟,男,21,CS
95015,王君,男,18,MA
95017,王风娟,女,18,IS
95020,赵钱,男,21,IS

桶2   age%3=1 
95002,刘晨,女,19,IS
95003,刘敏,女,22,MA
95004,张立,男,19,IS
95007,易思玲,女,19,MA
95010,孔小涛,男,19,CS
95014,王小丽,女,19,CS
95018,王一,女,19,IS
95019,邢小丽,女,19,IS

桶3  age%3 =2 
95001   李勇    男      20      CS
95006   孙庆    男      23      CS
95012   孙花    女      20      CS
95021   周二    男      17      MA
95022   郑明    男      20      MA

select * from stu_buk;查询是全表的所有桶
查询某一个桶的语法
select * from tbname tablesample (bucket x out of y)
y  桶簇个数   一个或多个桶组成的一簇
    将桶分了   多个桶簇  平分 
    y=3   一个桶簇 ==  一个桶 
    提取某一个整桶数据   桶簇== 桶 
x 取的桶簇编号
    1开始的  顺序递增的
    
select * from stu_buk tablesample(bucket 1 out of 3);  取第一个桶簇

查询操作

join where group by order by having limit 语法顺序：select .... from ...join .... on ... where ....group by ....having.....order by ....limit .....

1.join

注意：
    1）hive支持等值连接，不支持非等值连接
    select * a join b on a.id=b.id;
    2）hive有多个连接条件的时候支持and连接，不支持or连接
    select * a join b on a.id=b.id and a.name=b.name;
    select * a join b on a.id=b.id or a.name=b.name;  不支持 
    3）hive支持多表连接

建表

create table if not exists a(id int,name string) row format delimited fields terminated by "\t";

加载数据

load data local inpath "/home/hadoop/hive_data/a" into table a;

建表

create table if not exists b(id int,age int) row format delimited fields terminated by "\t";

加载数据

load data local inpath "/home/hadoop/hive_data/b" into table b;

1.1内连接

inner join |join 
两个或多个表的   交集  公共的部分
两个或多个表   都有的关联键
select * from a join b on a.id=b.id;

a.id    a.name  b.id    b.age
1       zs      1       12
2       ls      2       23
3       ww      3       43

1.2外连接

1.2.1左外连接

left outer join|left join 
以join左侧的表为主表的，主表中有几条数据，最终关联出来就几条数据，右表填充，右表能关联上，则关联，不能关联上补位null
select * from a left join b on a.id=b.id;
a.id    a.name  b.id    b.age
1       zs      1       12
2       ls      2       23
3       ww      3       43
5       xh      NULL    NULL

1.2.2右外连接

right outer join | right join 
主表为右表，右表数据不变，左表补充，有就补没有 null
select * from a right join b on a.id=b.id;
a.id    a.name  b.id    b.age
1       zs      1       12
2       ls      2       23
3       ww      3       43
NULL    NULL    4       11

1.2.3全外连接

full outer join | full join 
左表和右表的关联键的并集如果左表有就添加左表没有就 null 右表有添加没有就 null 
select * from a full join b on a.id=b.id;
1       zs      1       12
2       ls      2       23
3       ww      3       43
NULL    NULL    4       11
5       xh      NULL    NULL

1.3半连接

semi join、left semi join 
mysql中:a表中在b中包含的id的数据 
select * from a where id in (select id from b);

mysql中查询包含不包含  有 in/exists 语法查询  
在hive中1 版本中不支持这种语法,2版本支持这种语法但是效率极低这个时候需要有一个替代方案
1）内连接
取a表的数据
select a.* from a join b on a.id=b.id;
a.id    a.name
1       zs
2       ls
3       ww
2)left semi join  in/exists左半连接
求两个表的内连接输出左表数据,左表在右表的关联键的数据
select * from a left semi join b on a.id=b.id;
a.id    a.name
1       zs
2       ls
3       ww

2.group by 分组

将group by 后面的字段相同的分到一组中 
案例：求每一个部门的最大年龄分组字段：部门dept，求max 
select 
dept d,max(age) max_age 
from stu_managed01 
group by dept;
sql 语句  ： 1)select    from   2)group by  3)select后面的字段  4）order by 
注意：
1）group by是在select之前执行的所有不可以使用select后面字段的别名
2）group by 使用的时候会限制select后面的字段，可以使用聚合函数max、min、avg、count、sum

3.where 和 having

是用来过滤的 where 聚合函数 having  
where过滤的是聚合之前的数据为聚合函数准备数据 
案例：求每个部门中年龄大于19的人数。分组字段:部门dept,求count使用聚合函数，对数据做过滤 age >19
select 
dept,count(*) 
from stu_managed01 
where age>19 
group by dept;

having 执行顺序在聚合函数之后是对聚合结果做的过滤
案例：求部门中总人数大于7的所有部门。分组字段dept,求count，过滤having count>7 
select 
dept,count(*) totalcount 
from stu_managed01 
group by dept 
having totalcount>7;

需求：
每个部门中年龄>=19的人数多余4个人的部门。分组字段dept,求 count过滤age>=19 where 过滤 count >4

select 
dept,count(*) totalcount 
from stu_managed01 
where age >=19 
group by dept 
having totalcount>4;

where -> group by -> select ->  having

4.order by

是用来指定排序的

hive中order by的位置：order by |sort by |distribute by |cluster by

4.1order by 全局排序

对所有数据按照指定字段进行排序的
select * from stu_managed01 order by age,sid;
无论数据量多大都进行全局排序

4.2sort by 局部排序

针对每一个reducetask进行排序，每一个reducetask的内部结果有序的   全局是无序的，当只有一个reducetask的时候 效果=== order by 。 

select * from stu_managed01 sort by age;
默认情况下reducetask是1个  

In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
  256000000
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>  1009
In order to set a constant number of reducers:
    设定reducetask的个数  默认-1 
  set mapreduce.job.reduces=<number>

set mapreduce.job.reduces=3;
select * from stu_managed01 sort by age;
问题：
如何分配数据到每一个reducetask的？
字段.hash % 3   字段 每一行随机选
结果：
reducetask0 
95008   李娜    女      18      CS
95011   包小柏  男      18      MA
95004   张立    男      19      IS
95014   王小丽  女      19      CS
95019   邢小丽  女      19      IS
95001   李勇    男      20      CS
95012   孙花    女      20      CS
95020   赵钱    男      21      IS
95003   刘敏    女      22      MA

reducetask1
95009   梦圆圆  女      18      MA
95017   王风娟  女      18      IS
95005   刘刚    男      18      MA
95018   王一    女      19      IS
95007   易思玲  女      19      MA
95010   孔小涛  男      19      CS
95022   郑明    男      20      MA
95013   冯伟    男      21      CS

reducetask2
95021   周二    男      17      MA
95015   王君    男      18      MA
95002   刘晨    女      19      IS
95006   孙庆    男      23      CS

sort by 无法进行每一个reducetask的数据分配

4.3distribute by 指定字段

指定每一个reducetask分配字段的
分区|分桶
按照指定的字段将数据分到不同的reducetask中，分的算法是字段.hash % reducetasks。使用的时候distribute by 指定数据分配 + sort by  指定排序

select * from stu_managed01 distribute by age sort by age desc;
这里的 distribute by的字段 和 sort by的字段  随意指定
reducetask0   age %3 =0 
reducetask1   age %3 =1
reducetask2   age %3 =2 

4）cluster by 字段   
按照指定字段分配数据并按照指定字段进行升序排序
= distribute by 字段 + sort by 字段    distribute by 和 sort by 的字段为同一个字段升序

select * from stu_managed01 cluster by age;

select * from stu_managed01 distribute by age sort by age;

注意： distribute by + sort by 功能  > cluster by

5.hive的数据导出

将hql语句的查询结果导出到本地或者hdfs的文件

INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement 说明：加上local 出到本地不加导出hdfs

5.1单重导出

insert overwrite local directory "/home/hadoop/hive_data/ag19" select * from stu_managed01 where age>19;

5.2多重导出

对同一个表的不同查询结果  导出到不同的路径下
from tbname 
insert overwrite local directory "" select ... where ...
insert overwrite local directory "" select ... where ...

from stu_managed01 
insert overwrite directory "/data/age18" select * where age=18 
insert overwrite directory "/data/age19" select * where age=19;