Problem Statement: How Hive partitioning improves the query performance?
Solution: Lets say in a organization, you have lot of employees belong to different departments say sales, finance, marketing. And in your local file system, you have stored the employee informations as different files such as sales, finance, marketing.
Each file contains the employee details and having employee id and name as below.
Sales
---------
1001 saroj
1002 kumar
Finance
--------
1003 rout
1004 skr
Marketing
---------
1005 Nishant
1006 Nishanko
Now, We want to create hive tables in HDFS based on partitioning. So, lets create the hive table as below
hive> create table employee(id int, name string) partition by(department string)
> row format delimited
> fields terminated by ‘\t';
> load data local inpath '/Users/saroj/employeedata/sales' into table employee partition(department='sales');
> load data local inpath '/Users/saroj/employeedata/finance' into table employee partition(department='finance');
> load data local inpath '/Users/saroj/employeedata/marketing' into table employee partition(department='marketing');
Now if you login to HDFS, you will see the folders and files are created as below inside the hive warehouse.
if you type hadoop fs -ls /user/hive/warehouse/myschema/employee;
drwxr-x— root /user/hive/warehouse/myschema/employee/department=sales
drwxr-x— root /user/hive/warehouse/myschema/employee/department=finance
drwxr-x— root /user/hive/warehouse/myschema/employee/department=marketing
Now you can fetch the result based on the partition which will not do a full table scan and give you better performance result.
hive> select * from myschema.employee where department='sales';
id name
-----------------------------------------------
1001 saroj
1002 Kumar
