大数据分析 – SQL 简介
大数据分析 – SQL 简介
SQL 代表结构化查询语言。它是传统数据仓库和大数据技术中用于从数据库中提取数据的最广泛使用的语言之一。为了演示 SQL 的基础知识,我们将使用示例。为了专注于语言本身,我们将在 R 中使用 SQL。就编写 SQL 代码而言,这与在数据库中所做的完全一样。
SQL 的核心是三个语句:SELECT、FROM 和 WHERE。以下示例使用了最常见的 SQL 用例。导航到文件夹bda/part2/SQL_introduction并打开SQL_introduction.Rproj文件。然后打开 01_select.R 脚本。为了在 R 中编写 SQL 代码,我们需要安装sqldf包,如下面的代码所示。
# Install the sqldf package install.packages('sqldf') # load the library library('sqldf') library(nycflights13) # We will be working with the fligths dataset in order to introduce SQL # Let’s take a look at the table str(flights) # Classes 'tbl_d', 'tbl' and 'data.frame': 336776 obs. of 16 variables: # $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ... # $ month : int 1 1 1 1 1 1 1 1 1 1 ... # $ day : int 1 1 1 1 1 1 1 1 1 1 ... # $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ... # $ dep_delay: num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ... # $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ... # $ arr_delay: num 11 20 33 -18 -25 12 19 -14 -8 8 ... # $ carrier : chr "UA" "UA" "AA" "B6" ... # $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ... # $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ... # $ origin : chr "EWR" "LGA" "JFK" "JFK" ... # $ dest : chr "IAH" "IAH" "MIA" "BQN" ... # $ air_time : num 227 227 160 183 116 150 158 53 140 138 ... # $ distance : num 1400 1416 1089 1576 762 ... # $ hour : num 5 5 5 5 5 5 5 5 5 5 ... # $ minute : num 17 33 42 44 54 54 55 57 57 58 ...
select 语句用于从表中检索列并对其进行计算。ej1 中演示了最简单的 SELECT 语句。我们还可以创建新变量,如ej2所示。
### SELECT statement ej1 = sqldf(" SELECT dep_time ,dep_delay ,arr_time ,carrier ,tailnum FROM flights ") head(ej1) # dep_time dep_delay arr_time carrier tailnum # 1 517 2 830 UA N14228 # 2 533 4 850 UA N24211 # 3 542 2 923 AA N619AA # 4 544 -1 1004 B6 N804JB # 5 554 -6 812 DL N668DN # 6 554 -4 740 UA N39463 # In R we can use SQL with the sqldf function. It works exactly the same as in a database # The data.frame (in this case flights) represents the table we are querying and goes in the FROM statement # We can also compute new variables in the select statement using the syntax: # old_variables as new_variable ej2 = sqldf(" SELECT arr_delay - dep_delay as gain, carrier FROM flights ") ej2[1:5, ] # gain carrier # 1 9 UA # 2 16 UA # 3 31 AA # 4 -17 B6 # 5 -19 DL
SQL 最常用的特性之一是 group by 语句。这允许为另一个变量的不同组计算数值。打开脚本 02_group_by.R。
### GROUP BY # Computing the average ej3 = sqldf(" SELECT avg(arr_delay) as mean_arr_delay, avg(dep_delay) as mean_dep_delay, carrier FROM flights GROUP BY carrier ") # mean_arr_delay mean_dep_delay carrier # 1 7.3796692 16.725769 9E # 2 0.3642909 8.586016 AA # 3 -9.9308886 5.804775 AS # 4 9.4579733 13.022522 B6 # 5 1.6443409 9.264505 DL # 6 15.7964311 19.955390 EV # 7 21.9207048 20.215543 F9 # 8 20.1159055 18.726075 FL # 9 -6.9152047 4.900585 HA # 10 10.7747334 10.552041 MQ # 11 11.9310345 12.586207 OO # 12 3.5580111 12.106073 UA # 13 2.1295951 3.782418 US # 14 1.7644644 12.869421 VX # 15 9.6491199 17.711744 WN # 16 15.5569853 18.996330 YV # Other aggregations ej4 = sqldf(" SELECT avg(arr_delay) as mean_arr_delay, min(dep_delay) as min_dep_delay, max(dep_delay) as max_dep_delay, carrier FROM flights GROUP BY carrier ") # We can compute the minimun, mean, and maximum values of a numeric value ej4 # mean_arr_delay min_dep_delay max_dep_delay carrier # 1 7.3796692 -24 747 9E # 2 0.3642909 -24 1014 AA # 3 -9.9308886 -21 225 AS # 4 9.4579733 -43 502 B6 # 5 1.6443409 -33 960 DL # 6 15.7964311 -32 548 EV # 7 21.9207048 -27 853 F9 # 8 20.1159055 -22 602 FL # 9 -6.9152047 -16 1301 HA # 10 10.7747334 -26 1137 MQ # 11 11.9310345 -14 154 OO # 12 3.5580111 -20 483 UA # 13 2.1295951 -19 500 US # 14 1.7644644 -20 653 VX # 15 9.6491199 -13 471 WN # 16 15.5569853 -16 387 YV ### We could be also interested in knowing how many observations each carrier has ej5 = sqldf(" SELECT carrier, count(*) as count FROM flights GROUP BY carrier ") ej5 # carrier count # 1 9E 18460 # 2 AA 32729 # 3 AS 714 # 4 B6 54635 # 5 DL 48110 # 6 EV 54173 # 7 F9 685 # 8 FL 3260 # 9 HA 342 # 10 MQ 26397 # 11 OO 32 # 12 UA 58665 # 13 US 20536 # 14 VX 5162 # 15 WN 12275 # 16 YV 601
SQL 最有用的特性是连接。连接意味着我们希望使用一列将表 A 和表 B 合并到一个表中以匹配两个表的值。有不同类型的连接,实际上,开始使用这些将是最有用的:内连接和左外连接。
# Let’s create two tables: A and B to demonstrate joins. A = data.frame(c1 = 1:4, c2 = letters[1:4]) B = data.frame(c1 = c(2,4,5,6), c2 = letters[c(2:5)]) A # c1 c2 # 1 a # 2 b # 3 c # 4 d B # c1 c2 # 2 b # 4 c # 5 d # 6 e ### INNER JOIN # This means to match the observations of the column we would join the tables by. inner = sqldf(" SELECT A.c1, B.c2 FROM A INNER JOIN B ON A.c1 = B.c1 ") # Only the rows that match c1 in both A and B are returned inner # c1 c2 # 2 b # 4 c ### LEFT OUTER JOIN # the left outer join, sometimes just called left join will return the # first all the values of the column used from the A table left = sqldf(" SELECT A.c1, B.c2 FROM A LEFT OUTER JOIN B ON A.c1 = B.c1 ") # Only the rows that match c1 in both A and B are returned left # c1 c2 # 1 <NA> # 2 b # 3 <NA> # 4 c