4. Spark Functions: m/n/o/p/q/r


No. Type Link
1 Spark Functions 1. Spark Functions: Symbols
2 Spark Functions 2. Spark Functions: a/b/c
3 Spark Functions 3. Spark Functions: d/e/f/g/h/i/j/k/l
4 Spark Functions 4. Spark Functions: m/n/o/p/q/r
5 Spark Functions 5. Spark Functions: s/t
6 Spark Functions 6. Spark Functions: u/v/w/x/y/z

Table of Contents


13. M


make_date

make_date(year, month, day) - Create date from year, month and day fields. If the configuration spark.sql.ansi.enabled is false, the function returns NULL on invalid inputs. Otherwise, it will throw an error instead.

Arguments:

  • year - the year to represent, from 1 to 9999
  • month - the month-of-year to represent, from 1 (January) to 12 (December)
  • day - the day-of-month to represent, from 1 to 31

Examples:

sql
> SELECT make_date(2013, 7, 15);
 2013-07-15
> SELECT make_date(2019, 7, NULL);
 NULL

Since: 3.0.0


make_dt_interval

make_dt_interval([days[, hours[, mins[, secs]]]]) - Make DayTimeIntervalType duration from days, hours, mins and secs.

Arguments:

  • days - the number of days, positive or negative
  • hours - the number of hours, positive or negative
  • mins - the number of minutes, positive or negative
  • secs - the number of seconds with the fractional part in microsecond precision.

Examples:

sql
> SELECT make_dt_interval(1, 12, 30, 01.001001);
 1 12:30:01.001001000
> SELECT make_dt_interval(2);
 2 00:00:00.000000000
> SELECT make_dt_interval(100, null, 3);
 NULL

Since: 3.2.0


make_interval

make_interval([years[, months[, weeks[, days[, hours[, mins[, secs]]]]]]]) - Make interval from years, months, weeks, days, hours, mins and secs.

Arguments:

  • years - the number of years, positive or negative
  • months - the number of months, positive or negative
  • weeks - the number of weeks, positive or negative
  • days - the number of days, positive or negative
  • hours - the number of hours, positive or negative
  • mins - the number of minutes, positive or negative
  • secs - the number of seconds with the fractional part in microsecond precision.

Examples:

sql
> SELECT make_interval(100, 11, 1, 1, 12, 30, 01.001001);
 100 years 11 months 8 days 12 hours 30 minutes 1.001001 seconds
> SELECT make_interval(100, null, 3);
 NULL
> SELECT make_interval(0, 1, 0, 1, 0, 0, 100.000001);
 1 months 1 days 1 minutes 40.000001 seconds
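
Note how the first example reports 8 days: the weeks argument is folded into days, so 1 week and 1 day combine into 7 + 1 = 8 days, while the 100 years and 11 months are kept as separate year and month fields.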

Since: 3.0.0


make_timestamp

make_timestamp(year, month, day, hour, min, sec[, timezone]) - Create timestamp from year, month, day, hour, min, sec and timezone fields. The result data type is consistent with the value of configuration spark.sql.timestampType. If the configuration spark.sql.ansi.enabled is false, the function returns NULL on invalid inputs. Otherwise, it will throw an error instead.

Arguments:

  • year - the year to represent, from 1 to 9999
  • month - the month-of-year to represent, from 1 (January) to 12 (December)
  • day - the day-of-month to represent, from 1 to 31
  • hour - the hour-of-day to represent, from 0 to 23
  • min - the minute-of-hour to represent, from 0 to 59
  • sec - the second-of-minute and its micro-fraction to represent, from 0 to 60. The value can be either an integer like 13, or a fraction like 13.123. If the sec argument equals 60, the seconds field is set to 0 and 1 minute is added to the final timestamp.
  • timezone - the time zone identifier. For example, CET, UTC, etc.

Examples:

sql
> SELECT make_timestamp(2014, 12, 28, 6, 30, 45.887);
 2014-12-28 06:30:45.887
> SELECT make_timestamp(2014, 12, 28, 6, 30, 45.887, 'CET');
 2014-12-27 21:30:45.887
> SELECT make_timestamp(2019, 6, 30, 23, 59, 60);
 2019-07-01 00:00:00
> SELECT make_timestamp(2019, 6, 30, 23, 59, 1);
 2019-06-30 23:59:01
> SELECT make_timestamp(null, 7, 22, 15, 30, 0);
 NULL

Since: 3.0.0


make_timestamp_ltz

make_timestamp_ltz(year, month, day, hour, min, sec[, timezone]) - Create a timestamp with local time zone from year, month, day, hour, min, sec and timezone fields. If the configuration spark.sql.ansi.enabled is false, the function returns NULL on invalid inputs. Otherwise, it will throw an error instead.

Arguments:

  • year - the year to represent, from 1 to 9999
  • month - the month-of-year to represent, from 1 (January) to 12 (December)
  • day - the day-of-month to represent, from 1 to 31
  • hour - the hour-of-day to represent, from 0 to 23
  • min - the minute-of-hour to represent, from 0 to 59
  • sec - the second-of-minute and its micro-fraction to represent, from 0 to 60. If the sec argument equals 60, the seconds field is set to 0 and 1 minute is added to the final timestamp.
  • timezone - the time zone identifier. For example, CET, UTC, etc.

Examples:

sql
> SELECT make_timestamp_ltz(2014, 12, 28, 6, 30, 45.887);
 2014-12-28 06:30:45.887
> SELECT make_timestamp_ltz(2014, 12, 28, 6, 30, 45.887, 'CET');
 2014-12-27 21:30:45.887
> SELECT make_timestamp_ltz(2019, 6, 30, 23, 59, 60);
 2019-07-01 00:00:00
> SELECT make_timestamp_ltz(null, 7, 22, 15, 30, 0);
 NULL

Since: 3.4.0


make_timestamp_ntz

make_timestamp_ntz(year, month, day, hour, min, sec) - Create local date-time from year, month, day, hour, min, sec fields. If the configuration spark.sql.ansi.enabled is false, the function returns NULL on invalid inputs. Otherwise, it will throw an error instead.

Arguments:

  • year - the year to represent, from 1 to 9999
  • month - the month-of-year to represent, from 1 (January) to 12 (December)
  • day - the day-of-month to represent, from 1 to 31
  • hour - the hour-of-day to represent, from 0 to 23
  • min - the minute-of-hour to represent, from 0 to 59
  • sec - the second-of-minute and its micro-fraction to represent, from 0 to 60. If the sec argument equals 60, the seconds field is set to 0 and 1 minute is added to the final timestamp.

Examples:

sql
> SELECT make_timestamp_ntz(2014, 12, 28, 6, 30, 45.887);
 2014-12-28 06:30:45.887
> SELECT make_timestamp_ntz(2019, 6, 30, 23, 59, 60);
 2019-07-01 00:00:00
> SELECT make_timestamp_ntz(null, 7, 22, 15, 30, 0);
 NULL

Since: 3.4.0


make_valid_utf8

make_valid_utf8(str) - Returns the original string if str is a valid UTF-8 string, otherwise returns a new string whose invalid UTF-8 byte sequences are replaced with the Unicode replacement character U+FFFD.

Arguments:

  • str - a string expression

Examples:

sql
> SELECT make_valid_utf8('Spark');
 Spark
> SELECT make_valid_utf8(x'61');
 a
> SELECT make_valid_utf8(x'80');
 �
> SELECT make_valid_utf8(x'61C262');
 a�b

Since: 4.0.0


make_ym_interval

make_ym_interval([years[, months]]) - Make year-month interval from years, months.

Arguments:

  • years - the number of years, positive or negative
  • months - the number of months, positive or negative

Examples:

sql
> SELECT make_ym_interval(1, 2);
 1-2
> SELECT make_ym_interval(1, 0);
 1-0
> SELECT make_ym_interval(-1, 1);
 -0-11
> SELECT make_ym_interval(2);
 2-0

Since: 3.2.0


map

map(key0, value0, key1, value1, ...) - Creates a map with the given key/value pairs.

Examples:

sql
> SELECT map(1.0, '2', 3.0, '4');
 {1.0:"2",3.0:"4"}

Since: 2.0.0


map_concat

map_concat(map, ...) - Returns the union of all the given maps

Examples:

sql
> SELECT map_concat(map(1, 'a', 2, 'b'), map(3, 'c'));
 {1:"a",2:"b",3:"c"}

Since: 2.4.0


map_contains_key

map_contains_key(map, key) - Returns true if the map contains the key.

Examples:

sql
> SELECT map_contains_key(map(1, 'a', 2, 'b'), 1);
 true
> SELECT map_contains_key(map(1, 'a', 2, 'b'), 3);
 false

Since: 3.3.0


map_entries

map_entries(map) - Returns an unordered array of all entries in the given map.

Examples:

sql
> SELECT map_entries(map(1, 'a', 2, 'b'));
 [{"key":1,"value":"a"},{"key":2,"value":"b"}]

Since: 3.0.0


map_filter

map_filter(expr, func) - Filters entries in a map using the function.

Examples:

sql
> SELECT map_filter(map(1, 0, 2, 2, 3, -1), (k, v) -> k > v);
 {1:0,3:-1}

Since: 3.0.0


map_from_arrays

map_from_arrays(keys, values) - Creates a map from a pair of key/value arrays. All elements of keys must be non-null.

Examples:

sql
> SELECT map_from_arrays(array(1.0, 3.0), array('2', '4'));
 {1.0:"2",3.0:"4"}

Since: 2.4.0


map_from_entries

map_from_entries(arrayOfEntries) - Returns a map created from the given array of entries.

Examples:

sql
> SELECT map_from_entries(array(struct(1, 'a'), struct(2, 'b')));
 {1:"a",2:"b"}

Since: 2.4.0


map_keys

map_keys(map) - Returns an unordered array containing the keys of the map.

Examples:

sql
> SELECT map_keys(map(1, 'a', 2, 'b'));
 [1,2]

Since: 2.0.0


map_values

map_values(map) - Returns an unordered array containing the values of the map.

Examples:

sql
> SELECT map_values(map(1, 'a', 2, 'b'));
 ["a","b"]

Since: 2.0.0


map_zip_with

map_zip_with(map1, map2, function) - Merges two given maps into a single map by applying function to the pair of values with the same key. For keys present in only one map, NULL is passed as the value from the missing side. If an input map contains duplicate keys, only the first entry for that key is passed into the lambda function.

Examples:

sql
> SELECT map_zip_with(map(1, 'a', 2, 'b'), map(1, 'x', 2, 'y'), (k, v1, v2) -> concat(v1, v2));
 {1:"ax",2:"by"}
> SELECT map_zip_with(map('a', 1, 'b', 2), map('b', 3, 'c', 4), (k, v1, v2) -> coalesce(v1, 0) + coalesce(v2, 0));
 {"a":1,"b":5,"c":4}

Since: 3.0.0


mask

mask(input[, upperChar, lowerChar, digitChar, otherChar]) - Masks the given string value. By default, the function replaces upper-case characters with 'X', lower-case characters with 'x', and digits with 'n'. This can be useful for creating copies of tables with sensitive information removed.

Arguments:

  • input - string value to mask. Supported types: STRING, VARCHAR, CHAR
  • upperChar - character to replace upper-case characters with. Specify NULL to retain original character. Default value: 'X'
  • lowerChar - character to replace lower-case characters with. Specify NULL to retain original character. Default value: 'x'
  • digitChar - character to replace digit characters with. Specify NULL to retain original character. Default value: 'n'
  • otherChar - character to replace all other characters with. Specify NULL to retain original character. Default value: NULL

Examples:

sql
> SELECT mask('abcd-EFGH-8765-4321');
  xxxx-XXXX-nnnn-nnnn
> SELECT mask('abcd-EFGH-8765-4321', 'Q');
  xxxx-QQQQ-nnnn-nnnn
> SELECT mask('AbCD123-@$#', 'Q', 'q');
  QqQQnnn-@$#
> SELECT mask('AbCD123-@$#');
  XxXXnnn-@$#
> SELECT mask('AbCD123-@$#', 'Q');
  QxQQnnn-@$#
> SELECT mask('AbCD123-@$#', 'Q', 'q');
  QqQQnnn-@$#
> SELECT mask('AbCD123-@$#', 'Q', 'q', 'd');
  QqQQddd-@$#
> SELECT mask('AbCD123-@$#', 'Q', 'q', 'd', 'o');
  QqQQdddoooo
> SELECT mask('AbCD123-@$#', NULL, 'q', 'd', 'o');
  AqCDdddoooo
> SELECT mask('AbCD123-@$#', NULL, NULL, 'd', 'o');
  AbCDdddoooo
> SELECT mask('AbCD123-@$#', NULL, NULL, NULL, 'o');
  AbCD123oooo
> SELECT mask(NULL, NULL, NULL, NULL, 'o');
  NULL
> SELECT mask(NULL);
  NULL
> SELECT mask('AbCD123-@$#', NULL, NULL, NULL, NULL);
  AbCD123-@$#

Since: 3.4.0


max

max(expr) - Returns the maximum value of expr.

Examples:

sql
> SELECT max(col) FROM VALUES (10), (50), (20) AS tab(col);
 50

Since: 1.0.0


max_by

max_by(x, y) - Returns the value of x associated with the maximum value of y.

Examples:

sql
> SELECT max_by(x, y) FROM VALUES ('a', 10), ('b', 50), ('c', 20) AS tab(x, y);
 b

Note:

The function is non-deterministic: when multiple values of x are associated with the same maximum value of y, any one of them may be returned.

Since: 3.0.0


md5

md5(expr) - Returns an MD5 128-bit checksum as a hex string of expr.

Examples:

sql
> SELECT md5('Spark');
 8cde774d6f7333752ed72cacddb05126

Since: 1.5.0


mean

mean(expr) - Returns the mean calculated from values of a group.

Examples:

sql
> SELECT mean(col) FROM VALUES (1), (2), (3) AS tab(col);
 2.0
> SELECT mean(col) FROM VALUES (1), (2), (NULL) AS tab(col);
 1.5

Since: 1.0.0


median

median(col) - Returns the median of numeric or ANSI interval column col.

Examples:

sql
> SELECT median(col) FROM VALUES (0), (10) AS tab(col);
 5.0
> SELECT median(col) FROM VALUES (INTERVAL '0' MONTH), (INTERVAL '10' MONTH) AS tab(col);
 0-5

Since: 3.4.0


min

min(expr) - Returns the minimum value of expr.

Examples:

sql
> SELECT min(col) FROM VALUES (10), (-1), (20) AS tab(col);
 -1

Since: 1.0.0


min_by

min_by(x, y) - Returns the value of x associated with the minimum value of y.

Examples:

sql
> SELECT min_by(x, y) FROM VALUES ('a', 10), ('b', 50), ('c', 20) AS tab(x, y);
 a

Note:

The function is non-deterministic: when multiple values of x are associated with the same minimum value of y, any one of them may be returned.

Since: 3.0.0


minute

minute(timestamp) - Returns the minute component of the string/timestamp.

Examples:

sql
> SELECT minute('2009-07-30 12:58:59');
 58

Since: 1.5.0


mod

expr1 % expr2, or mod(expr1, expr2) - Returns the remainder after expr1/expr2.

Examples:

sql
> SELECT 2 % 1.8;
 0.2
> SELECT MOD(2, 1.8);
 0.2

Since: 2.3.0


mode

mode(col[, deterministic]) - Returns the most frequent value for the values within col. NULL values are ignored. If all the values are NULL, or there are 0 rows, returns NULL. When multiple values have the same greatest frequency, an arbitrary one of them is returned if deterministic is false or not specified, and the lowest value is returned if deterministic is true.

mode() WITHIN GROUP (ORDER BY col) - Returns the most frequent value for the values within col (specified in the ORDER BY clause). NULL values are ignored. If all the values are NULL, or there are 0 rows, returns NULL. When multiple values have the same greatest frequency, only one value is returned, chosen by the sort direction: the smallest value if the direction is ASC, the largest if it is DESC.

Examples:

sql
> SELECT mode(col) FROM VALUES (0), (10), (10) AS tab(col);
 10
> SELECT mode(col) FROM VALUES (INTERVAL '0' MONTH), (INTERVAL '10' MONTH), (INTERVAL '10' MONTH) AS tab(col);
 0-10
> SELECT mode(col) FROM VALUES (0), (10), (10), (null), (null), (null) AS tab(col);
 10
> SELECT mode(col, false) FROM VALUES (-10), (0), (10) AS tab(col);
 0
> SELECT mode(col, true) FROM VALUES (-10), (0), (10) AS tab(col);
 -10
> SELECT mode() WITHIN GROUP (ORDER BY col) FROM VALUES (0), (10), (10) AS tab(col);
 10
> SELECT mode() WITHIN GROUP (ORDER BY col) FROM VALUES (0), (10), (10), (20), (20) AS tab(col);
 10
> SELECT mode() WITHIN GROUP (ORDER BY col DESC) FROM VALUES (0), (10), (10), (20), (20) AS tab(col);
 20

Since: 3.4.0


monotonically_increasing_id

monotonically_increasing_id() - Returns monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the lower 33 bits represent the record number within each partition. The assumption is that the data frame has fewer than 1 billion partitions, and each partition has fewer than 8 billion records. The function is non-deterministic because its result depends on partition IDs.

Examples:

sql
> SELECT monotonically_increasing_id();
 0
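
As a quick check of the bit layout described above, the first ID in partition p is p * 2^33, since the partition ID sits above the 33-bit per-partition record counter. A minimal sketch of that arithmetic (the partition numbers are illustrative; shiftleft is the standard Spark SQL bit-shift function):

sql
> SELECT shiftleft(1L, 33) AS first_id_in_partition_1;
 8589934592
> SELECT shiftleft(2L, 33) AS first_id_in_partition_2;
 17179869184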

Since: 1.4.0


month

month(date) - Returns the month component of the date/timestamp.

Examples:

sql
> SELECT month('2016-07-30');
 7

Since: 1.5.0


monthname

monthname(date) - Returns the three-letter abbreviated month name from the given date.

Examples:

sql
> SELECT monthname('2008-02-20');
 Feb

Since: 4.0.0


months_between

months_between(timestamp1, timestamp2[, roundOff]) - If timestamp1 is later than timestamp2, then the result is positive. If timestamp1 and timestamp2 are on the same day of month, or both are the last day of month, time of day will be ignored. Otherwise, the difference is calculated based on 31 days per month, and rounded to 8 digits unless roundOff=false.

Examples:

sql
> SELECT months_between('1997-02-28 10:30:00', '1996-10-30');
 3.94959677
> SELECT months_between('1997-02-28 10:30:00', '1996-10-30', false);
 3.9495967741935485
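
The unrounded second result can be reproduced from the rule above: 1996-10-30 to 1997-01-30 is exactly 3 months, and the remaining 29 days and 10.5 hours are divided by 31. A quick sanity check of that arithmetic, rounded to 8 digits to match the first example:

sql
> SELECT round(3.0D + (29.0D + 10.5D / 24.0D) / 31.0D, 8);
 3.94959677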

Since: 1.5.0

14. N


named_struct

named_struct(name1, val1, name2, val2, ...) - Creates a struct with the given field names and values.

Examples:

sql
> SELECT named_struct("a", 1, "b", 2, "c", 3);
 {"a":1,"b":2,"c":3}

Since: 1.5.0


nanvl

nanvl(expr1, expr2) - Returns expr1 if it's not NaN, or expr2 otherwise.

Examples:

sql
> SELECT nanvl(cast('NaN' as double), 123);
 123.0

Since: 1.5.0


negative

negative(expr) - Returns the negated value of expr.

Examples:

sql
> SELECT negative(1);
 -1

Since: 1.0.0


next_day

next_day(start_date, day_of_week) - Returns the first date which is later than start_date and named as indicated. The function returns NULL if at least one of the input parameters is NULL. When both input parameters are not NULL and day_of_week is an invalid input, the function throws SparkIllegalArgumentException if spark.sql.ansi.enabled is set to true; otherwise, it returns NULL.

Examples:

sql
> SELECT next_day('2015-01-14', 'TU');
 2015-01-20

Since: 1.5.0


not

not expr - Logical not.

Examples:

sql
> SELECT not true;
 false
> SELECT not false;
 true
> SELECT not NULL;
 NULL

Since: 1.0.0


now

now() - Returns the current timestamp at the start of query evaluation.

Examples:

sql
> SELECT now();
 2020-04-25 15:49:11.914

Since: 1.6.0


nth_value

nth_value(input[, offset]) - Returns the value of input at the row that is the offset-th row from the beginning of the window frame. Offset starts at 1. If ignoreNulls=true, nulls are skipped when finding the offset-th row; otherwise, every row counts toward the offset. If there is no such offset-th row (e.g., when the offset is 10 and the window frame has fewer than 10 rows), null is returned.

Arguments:

  • input - the target column or expression that the function operates on.
  • offset - a positive int literal indicating the offset in the window frame, starting at 1.
  • ignoreNulls - an optional flag indicating whether nth_value should skip null values when determining which row to use.

Examples:

sql
> SELECT a, b, nth_value(b, 2) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
 A1 1   1
 A1 1   1
 A1 2   1
 A2 3   NULL
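
The two b = 1 rows in partition A1 already see a second row because the default window frame with an ORDER BY clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which includes peer rows with the same ordering value. With an explicit ROWS frame the first row sees only itself; a sketch of the difference:

sql
> SELECT a, b, nth_value(b, 2) OVER (PARTITION BY a ORDER BY b ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
 A1 1   NULL
 A1 1   1
 A1 2   1
 A2 3   NULL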

Since: 3.1.0


ntile

ntile(n) - Divides the rows for each window partition into n buckets ranging from 1 to at most n.

Arguments:

  • buckets - an int expression giving the number of buckets to divide the rows into. The default value is 1.

Examples:

sql
> SELECT a, b, ntile(2) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
 A1 1   1
 A1 1   1
 A1 2   2
 A2 3   1
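
When the rows do not divide evenly, the lower-numbered buckets receive the extra rows: partition A1 has 3 rows and 2 buckets, so bucket 1 gets two rows and bucket 2 gets one, as shown above.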

Since: 2.0.0


nullif

nullif(expr1, expr2) - Returns null if expr1 equals to expr2, or expr1 otherwise.

Examples:

sql
> SELECT nullif(2, 2);
 NULL

Since: 2.0.0


nullifzero

nullifzero(expr) - Returns null if expr is equal to zero, or expr otherwise.

Examples:

sql
> SELECT nullifzero(0);
 NULL
> SELECT nullifzero(2);
 2

Since: 4.0.0


nvl

nvl(expr1, expr2) - Returns expr2 if expr1 is null, or expr1 otherwise.

Examples:

sql
> SELECT nvl(NULL, array('2'));
 ["2"]

Since: 2.0.0


nvl2

nvl2(expr1, expr2, expr3) - Returns expr2 if expr1 is not null, or expr3 otherwise.

Examples:

sql
> SELECT nvl2(NULL, 2, 1);
 1

Since: 2.0.0

15. O


octet_length

octet_length(expr) - Returns the byte length of string data or number of bytes of binary data.

Examples:

sql
> SELECT octet_length('Spark SQL');
 9
> SELECT octet_length(x'537061726b2053514c');
 9

Since: 2.3.0


or

expr1 or expr2 - Logical OR.

Examples:

sql
> SELECT true or false;
 true
> SELECT false or false;
 false
> SELECT true or NULL;
 true
> SELECT false or NULL;
 NULL

Since: 1.0.0


overlay

overlay(input, replace, pos[, len]) - Replaces the part of input that starts at position pos and has length len with replace.

Examples:

sql
> SELECT overlay('Spark SQL' PLACING '_' FROM 6);
 Spark_SQL
> SELECT overlay('Spark SQL' PLACING 'CORE' FROM 7);
 Spark CORE
> SELECT overlay('Spark SQL' PLACING 'ANSI ' FROM 7 FOR 0);
 Spark ANSI SQL
> SELECT overlay('Spark SQL' PLACING 'tructured' FROM 2 FOR 4);
 Structured SQL
> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('_', 'utf-8') FROM 6);
 Spark_SQL
> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('CORE', 'utf-8') FROM 7);
 Spark CORE
> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('ANSI ', 'utf-8') FROM 7 FOR 0);
 Spark ANSI SQL
> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('tructured', 'utf-8') FROM 2 FOR 4);
 Structured SQL

Since: 3.0.0

16. P


parse_json

parse_json(jsonStr) - Parses a JSON string as a Variant value. Throws an exception when the string is not a valid JSON value.

Examples:

sql
> SELECT parse_json('{"a":1,"b":0.8}');
 {"a":1,"b":0.8}

Since: 4.0.0


parse_url

parse_url(url, partToExtract[, key]) - Extracts a part from a URL.

Examples:

sql
> SELECT parse_url('http://spark.apache.org/path?query=1', 'HOST');
 spark.apache.org
> SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY');
 query=1
> SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY', 'query');
 1

Since: 2.0.0


percent_rank

percent_rank() - Computes the percentage ranking of a value in a group of values.

Arguments:

  • children - the expressions the rank is based on; a change in the value of one of the children will trigger a change in rank. This is an internal parameter assigned by the Analyzer.

Examples:

sql
> SELECT a, b, percent_rank(b) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
 A1 1   0.0
 A1 1   0.0
 A1 2   1.0
 A2 3   0.0
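
These results follow the usual definition percent_rank = (rank - 1) / (rows in partition - 1): partition A1 has ranks 1, 1 and 3 over 3 rows, giving (1 - 1) / 2 = 0.0 and (3 - 1) / 2 = 1.0, and a single-row partition such as A2 yields 0.0.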

Since: 2.0.0


percentile

percentile(col, percentage [, frequency]) - Returns the exact percentile value of numeric or ANSI interval column col at the given percentage. The value of percentage must be between 0.0 and 1.0. The value of frequency must be a positive integer.

percentile(col, array(percentage1 [, percentage2]...) [, frequency]) - Returns the exact percentile value array of numeric column col at the given percentage(s). Each value of the percentage array must be between 0.0 and 1.0. The value of frequency must be a positive integer.

Examples:

sql
> SELECT percentile(col, 0.3) FROM VALUES (0), (10) AS tab(col);
 3.0
> SELECT percentile(col, array(0.25, 0.75)) FROM VALUES (0), (10) AS tab(col);
 [2.5,7.5]
> SELECT percentile(col, 0.5) FROM VALUES (INTERVAL '0' MONTH), (INTERVAL '10' MONTH) AS tab(col);
 0-5
> SELECT percentile(col, array(0.2, 0.5)) FROM VALUES (INTERVAL '0' SECOND), (INTERVAL '10' SECOND) AS tab(col);
 [0 00:00:02.000000000,0 00:00:05.000000000]

Since: 2.1.0


percentile_approx

percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile of the numeric or ANSI interval column col, which is the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of col values are less than or equal to that value. The value of percentage must be between 0.0 and 1.0. The accuracy parameter (default: 10000) is a positive numeric literal which controls approximation accuracy at the cost of memory. A higher value of accuracy yields better accuracy; 1.0/accuracy is the relative error of the approximation. When percentage is an array, each value of the percentage array must be between 0.0 and 1.0. In this case, returns the approximate percentile array of column col at the given percentage array.

Examples:

sql
> SELECT percentile_approx(col, array(0.5, 0.4, 0.1), 100) FROM VALUES (0), (1), (2), (10) AS tab(col);
 [1,1,0]
> SELECT percentile_approx(col, 0.5, 100) FROM VALUES (0), (6), (7), (9), (10) AS tab(col);
 7
> SELECT percentile_approx(col, 0.5, 100) FROM VALUES (INTERVAL '0' MONTH), (INTERVAL '1' MONTH), (INTERVAL '2' MONTH), (INTERVAL '10' MONTH) AS tab(col);
 0-1
> SELECT percentile_approx(col, array(0.5, 0.7), 100) FROM VALUES (INTERVAL '0' SECOND), (INTERVAL '1' SECOND), (INTERVAL '2' SECOND), (INTERVAL '10' SECOND) AS tab(col);
 [0 00:00:01.000000000,0 00:00:02.000000000]

Since: 2.1.0


percentile_cont

percentile_cont(percentage) WITHIN GROUP (ORDER BY col) - Return a percentile value based on a continuous distribution of numeric or ANSI interval column col at the given percentage (specified in ORDER BY clause).

Examples:

sql
> SELECT percentile_cont(0.25) WITHIN GROUP (ORDER BY col) FROM VALUES (0), (10) AS tab(col);
 2.5
> SELECT percentile_cont(0.25) WITHIN GROUP (ORDER BY col) FROM VALUES (INTERVAL '0' MONTH), (INTERVAL '10' MONTH) AS tab(col);
 0-2
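
The first result is plain linear interpolation between the ordered values: with col values 0 and 10, the 0.25 percentile lies a quarter of the way along, at 0 + 0.25 * (10 - 0) = 2.5. The year-month result 0-2 is the same interpolation expressed in whole months.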

Since: 4.0.0


percentile_disc

percentile_disc(percentage) WITHIN GROUP (ORDER BY col) - Return a percentile value based on a discrete distribution of numeric or ANSI interval column col at the given percentage (specified in ORDER BY clause).

Examples:

sql
> SELECT percentile_disc(0.25) WITHIN GROUP (ORDER BY col) FROM VALUES (0), (10) AS tab(col);
 0.0
> SELECT percentile_disc(0.25) WITHIN GROUP (ORDER BY col) FROM VALUES (INTERVAL '0' MONTH), (INTERVAL '10' MONTH) AS tab(col);
 0-0

Since: 4.0.0


pi

pi() - Returns pi.

Examples:

sql
> SELECT pi();
 3.141592653589793

Since: 1.5.0


pmod

pmod(expr1, expr2) - Returns the positive value of expr1 mod expr2.

Examples:

sql
> SELECT pmod(10, 3);
 1
> SELECT pmod(-10, 3);
 2
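
pmod differs from % only in the sign of the result: for a positive modulus b, the value can be expressed as ((a % b) + b) % b, which folds a negative remainder back into the non-negative range. A sketch of that identity (not necessarily Spark's internal implementation):

sql
> SELECT -10 % 3, pmod(-10, 3), ((-10 % 3) + 3) % 3;
 -1 2   2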

Since: 1.5.0


posexplode

posexplode(expr) - Separates the elements of array expr into multiple rows with positions, or the elements of map expr into multiple rows and columns with positions. Unless specified otherwise, uses the column name pos for position, col for elements of the array or key and value for elements of the map.

Examples:

sql
> SELECT posexplode(array(10,20));
 0  10
 1  20
> SELECT posexplode(collection => array(10,20));
 0  10
 1  20

Since: 2.0.0


posexplode_outer

posexplode_outer(expr) - Separates the elements of array expr into multiple rows with positions, or the elements of map expr into multiple rows and columns with positions. Unlike posexplode, if expr is NULL or empty, a row of NULL values is produced instead of no rows. Unless specified otherwise, uses the column name pos for position, col for elements of the array or key and value for elements of the map.

Examples:

sql
> SELECT posexplode_outer(array(10,20));
 0  10
 1  20
> SELECT posexplode_outer(collection => array(10,20));
 0  10
 1  20

Since: 2.0.0


position

position(substr, str[, pos]) - Returns the position of the first occurrence of substr in str after position pos. The given pos and return value are 1-based.

Examples:

sql
> SELECT position('bar', 'foobarbar');
 4
> SELECT position('bar', 'foobarbar', 5);
 7
> SELECT POSITION('bar' IN 'foobarbar');
 4

Since: 2.3.0


positive

positive(expr) - Returns the value of expr.

Examples:

sql
> SELECT positive(1);
 1

Since: 1.5.0


pow

pow(expr1, expr2) - Raises expr1 to the power of expr2.

Examples:

sql
> SELECT pow(2, 3);
 8.0

Since: 1.4.0


power

power(expr1, expr2) - Raises expr1 to the power of expr2.

Examples:

sql
> SELECT power(2, 3);
 8.0

Since: 1.4.0


printf

printf(strfmt, obj, ...) - Returns a formatted string from printf-style format strings.

Examples:

sql
> SELECT printf("Hello World %d %s", 100, "days");
 Hello World 100 days

Since: 1.5.0

17. Q


quarter

quarter(date) - Returns the quarter of the year for date, in the range 1 to 4.

Examples:

sql
> SELECT quarter('2016-08-31');
 3

Since: 1.5.0

18. R


radians

radians(expr) - Converts degrees to radians.

Arguments:

  • expr - angle in degrees

Examples:

sql
> SELECT radians(180);
 3.141592653589793

Since: 1.4.0


raise_error

raise_error( expr ) - Throws a USER_RAISED_EXCEPTION with expr as message.

Examples:

sql
> SELECT raise_error('custom error message');
 [USER_RAISED_EXCEPTION] custom error message

Since: 3.1.0


rand

rand([seed]) - Returns a random value drawn from an independent and identically distributed (i.i.d.) uniform distribution on [0, 1).

Examples:

sql
> SELECT rand();
 0.9629742951434543
> SELECT rand(0);
 0.7604953758285915
> SELECT rand(null);
 0.7604953758285915

Note:

The function is non-deterministic in the general case.

Since: 1.5.0


randn

randn([seed]) - Returns a random value drawn from an independent and identically distributed (i.i.d.) standard normal distribution.

Examples:

sql
> SELECT randn();
 -0.3254147983080288
> SELECT randn(0);
 1.6034991609278433
> SELECT randn(null);
 1.6034991609278433

Note:

The function is non-deterministic in the general case.

Since: 1.5.0


random

random([seed]) - Returns a random value drawn from an independent and identically distributed (i.i.d.) uniform distribution on [0, 1).

Examples:

sql
> SELECT random();
 0.9629742951434543
> SELECT random(0);
 0.7604953758285915
> SELECT random(null);
 0.7604953758285915

Note:

The function is non-deterministic in the general case.

Since: 3.0.0


randstr

randstr(length[, seed]) - Returns a string of the specified length whose characters are chosen uniformly at random from the following pool of characters: 0-9, a-z, A-Z. The random seed is optional. The string length must be a constant two-byte or four-byte integer (SMALLINT or INT, respectively).

Examples:

sql
> SELECT randstr(3, 0) AS result;
 ceV

Since: 4.0.0


range

range(start[, end[, step[, numSlices]]]) / range(end) - Returns a table of values within a specified range.

Arguments:

  • start - An optional BIGINT literal defaulted to 0, marking the first value generated.
  • end - A BIGINT literal marking endpoint (exclusive) of the number generation.
  • step - An optional BIGINT literal defaulted to 1, specifying the increment used when generating values.
  • numSlices - An optional INTEGER literal specifying how the production of rows is spread across partitions.

Examples:

sql
> SELECT * FROM range(1);
  +---+
  | id|
  +---+
  |  0|
  +---+
> SELECT * FROM range(0, 2);
  +---+
  |id |
  +---+
  |0  |
  |1  |
  +---+
> SELECT * FROM range(0, 4, 2);
  +---+
  |id |
  +---+
  |0  |
  |2  |
  +---+

Since: 2.0.0


rank

rank() - Computes the rank of a value in a group of values. The result is one plus the number of rows preceding or equal to the current row in the ordering of the partition. Tied values will produce gaps in the sequence.

Arguments:

  • children - the expressions the rank is based on; a change in the value of one of the children will trigger a change in rank. This is an internal parameter assigned by the Analyzer.

Examples:

sql
> SELECT a, b, rank(b) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
 A1 1   1
 A1 1   1
 A1 2   3
 A2 3   1

Since: 2.0.0


reduce

reduce(expr, start, merge, finish) - Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function.

Examples:

sql
> SELECT reduce(array(1, 2, 3), 0, (acc, x) -> acc + x);
 6
> SELECT reduce(array(1, 2, 3), 0, (acc, x) -> acc + x, acc -> acc * 10);
 60

Since: 3.4.0


reflect

reflect(class, method[, arg1[, arg2 ...]]) - Calls a method with reflection.

Examples:

sql
> SELECT reflect('java.util.UUID', 'randomUUID');
 c33fb387-8500-4bfa-81d2-6e0e3e930df2
> SELECT reflect('java.util.UUID', 'fromString', 'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');
 a5cf6c42-0c85-418f-af6c-3e4e5b1328f2

Since: 2.0.0


regexp

regexp(str, regexp) - Returns true if str matches regexp, or false otherwise.

Arguments:

  • str - a string expression

  • regexp - a string expression. The regex string should be a Java regular expression.

    Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser, see the unescaping rules at String Literal. For example, to match "\abc", a regular expression for regexp can be "^\abc$".

    There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fall back to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the regexp that can match "\abc" is "^\abc$".

    It's recommended to use a raw string literal (with the r prefix) to avoid escaping special characters in the pattern string, if any exist.

Examples:

sql
> SET spark.sql.parser.escapedStringLiterals=true;
spark.sql.parser.escapedStringLiterals  true
> SELECT regexp('%SystemDrive%\Users\John', '%SystemDrive%\\Users.*');
true
> SET spark.sql.parser.escapedStringLiterals=false;
spark.sql.parser.escapedStringLiterals  false
> SELECT regexp('%SystemDrive%\\Users\\John', '%SystemDrive%\\\\Users.*');
true
> SELECT regexp('%SystemDrive%\\Users\\John', r'%SystemDrive%\\Users.*');
true

Note:

Use LIKE to match with simple string pattern.

Since: 3.2.0


regexp_count

regexp_count(str, regexp) - Returns a count of the number of times that the regular expression pattern regexp is matched in the string str.

Arguments:

  • str - a string expression.
  • regexp - a string representing a regular expression. The regex string should be a Java regular expression.

Examples:

sql
> SELECT regexp_count('Steven Jones and Stephen Smith are the best players', 'Ste(v|ph)en');
 2
> SELECT regexp_count('abcdefghijklmnopqrstuvwxyz', '[a-z]{3}');
 8
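
Matches are counted without overlap: the scan resumes after the end of each match, so '[a-z]{3}' over the 26-letter alphabet consumes three letters at a time (abc, def, ..., vwx) for 8 matches, leaving the trailing yz unmatched.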

Since: 3.4.0


regexp_extract

regexp_extract(str, regexp[, idx]) - Extracts the first string in str that matches the regexp expression and corresponds to the regex group index.

Arguments:

  • str - a string expression.

  • regexp - a string representing a regular expression. The regex string should be a Java regular expression.

    Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser, see the unescaping rules at String Literal. For example, to match "\abc", a regular expression for regexp can be "^\abc$".

    There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fall back to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the regexp that can match "\abc" is "^\abc$".

    It's recommended to use a raw string literal (with the r prefix) to avoid escaping special characters in the pattern string, if any exist.

  • idx - an integer expression representing the group index. The regex may contain multiple groups; idx indicates which regex group to extract. The group index must be non-negative. The minimum value of idx is 0, which means matching the entire regular expression. If idx is not specified, the default group index value is 1. The idx parameter is the Java regex Matcher group() method index.

Examples:

sql
> SELECT regexp_extract('100-200', '(\\d+)-(\\d+)', 1);
 100
> SELECT regexp_extract('100-200', r'(\d+)-(\d+)', 1);
 100

Since: 1.5.0


regexp_extract_all

regexp_extract_all(str, regexp[, idx]) - Extracts all strings in str that match the regexp expression and correspond to the regex group index.

Arguments:

  • str - a string expression.

  • regexp - a string representing a regular expression. The regex string should be a Java regular expression.

    Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser, see the unescaping rules at String Literal. For example, to match "\abc", a regular expression for regexp can be "^\abc$".

    There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fall back to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the regexp that can match "\abc" is "^\abc$".

    It's recommended to use a raw string literal (with the r prefix) to avoid escaping special characters in the pattern string, if any exist.

  • idx - an integer expression representing the group index. The regex may contain multiple groups; idx indicates which regex group to extract. The group index must be non-negative. The minimum value of idx is 0, which means matching the entire regular expression. If idx is not specified, the default group index value is 1. The idx parameter is the Java regex Matcher group() method index.

Examples:

sql
> SELECT regexp_extract_all('100-200, 300-400', '(\\d+)-(\\d+)', 1);
 ["100","300"]
> SELECT regexp_extract_all('100-200, 300-400', r'(\d+)-(\d+)', 1);
 ["100","300"]

Since: 3.1.0


regexp_instr

regexp_instr(str, regexp) - Searches a string for a regular expression and returns an integer that indicates the beginning position of the matched substring. Positions are 1-based, not 0-based. If no match is found, returns 0.

Arguments:

  • str - a string expression.

  • regexp - a string representing a regular expression. The regex string should be a Java regular expression.

    Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser, see the unescaping rules at String Literal. For example, to match "\abc", a regular expression for regexp can be "^\abc$".

    There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fall back to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the regexp that can match "\abc" is "^\abc$".

    It's recommended to use a raw string literal (with the r prefix) to avoid escaping special characters in the pattern string, if any exist.

Examples:

sql
> SELECT regexp_instr(r"\abc", r"^\\abc$");
 1
> SELECT regexp_instr('user@spark.apache.org', '@[^.]*');
 5

Since: 3.4.0


regexp_like

regexp_like(str, regexp) - Returns true if str matches regexp, or false otherwise.

Arguments:

  • str - a string expression

  • regexp - a string expression. The regex string should be a Java regular expression.

    Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser, see the unescaping rules at String Literal. For example, to match "\abc", a regular expression for regexp can be "^\abc$".

    There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fall back to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the regexp that can match "\abc" is "^\abc$".

    It's recommended to use a raw string literal (with the r prefix) to avoid escaping special characters in the pattern string, if any exist.

Examples:

sql
> SET spark.sql.parser.escapedStringLiterals=true;
spark.sql.parser.escapedStringLiterals  true
> SELECT regexp_like('%SystemDrive%\Users\John', '%SystemDrive%\\Users.*');
true
> SET spark.sql.parser.escapedStringLiterals=false;
spark.sql.parser.escapedStringLiterals  false
> SELECT regexp_like('%SystemDrive%\\Users\\John', '%SystemDrive%\\\\Users.*');
true
> SELECT regexp_like('%SystemDrive%\\Users\\John', r'%SystemDrive%\\Users.*');
true

Note:

Use LIKE to match with simple string pattern.

Since: 3.2.0


regexp_replace

regexp_replace(str, regexp, rep[, position]) - Replaces all substrings of str that match regexp with rep.

Arguments:

  • str - a string expression to search for a regular expression pattern match.

  • regexp - a string representing a regular expression. The regex string should be a Java regular expression.

    Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser, see the unescaping rules at String Literal. For example, to match "\abc", a regular expression for regexp can be "^\abc$".

    There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fall back to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the regexp that can match "\abc" is "^\abc$".

    It's recommended to use a raw string literal (with the r prefix) to avoid escaping special characters in the pattern string, if any exist.

  • rep - a string expression to replace matched substrings.

  • position - a positive integer literal that indicates the position within str to begin searching. The default is 1. If position is greater than the number of characters in str, the result is str.

Examples:

sql
> SELECT regexp_replace('100-200', '(\\d+)', 'num');
 num-num
> SELECT regexp_replace('100-200', r'(\d+)', 'num');
 num-num

Since: 1.5.0


regexp_substr

regexp_substr(str, regexp) - Returns the substring that matches the regular expression regexp within the string str. If the regular expression is not found, the result is null.

Arguments:

  • str - a string expression.
  • regexp - a string representing a regular expression. The regex string should be a Java regular expression.

Examples:

sql
> SELECT regexp_substr('Steven Jones and Stephen Smith are the best players', 'Ste(v|ph)en');
 Steven
> SELECT regexp_substr('Steven Jones and Stephen Smith are the best players', 'Jeck');
 NULL

Since: 3.4.0


regr_avgx

regr_avgx(y, x) - Returns the average of the independent variable for non-null pairs in a group, where y is the dependent variable and x is the independent variable.

Examples:

sql
> SELECT regr_avgx(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x);
 2.75
> SELECT regr_avgx(y, x) FROM VALUES (1, null) AS tab(y, x);
 NULL
> SELECT regr_avgx(y, x) FROM VALUES (null, 1) AS tab(y, x);
 NULL
> SELECT regr_avgx(y, x) FROM VALUES (1, 2), (2, null), (2, 3), (2, 4) AS tab(y, x);
 3.0
> SELECT regr_avgx(y, x) FROM VALUES (1, 2), (2, null), (null, 3), (2, 4) AS tab(y, x);
 3.0

Since: 3.3.0


regr_avgy

regr_avgy(y, x) - Returns the average of the dependent variable for non-null pairs in a group, where y is the dependent variable and x is the independent variable.

Examples:

sql
> SELECT regr_avgy(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x);
 1.75
> SELECT regr_avgy(y, x) FROM VALUES (1, null) AS tab(y, x);
 NULL
> SELECT regr_avgy(y, x) FROM VALUES (null, 1) AS tab(y, x);
 NULL
> SELECT regr_avgy(y, x) FROM VALUES (1, 2), (2, null), (2, 3), (2, 4) AS tab(y, x);
 1.6666666666666667
> SELECT regr_avgy(y, x) FROM VALUES (1, 2), (2, null), (null, 3), (2, 4) AS tab(y, x);
 1.5

Since: 3.3.0


regr_count

regr_count(y, x) - Returns the number of non-null number pairs in a group, where y is the dependent variable and x is the independent variable.

Examples:

sql
> SELECT regr_count(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x);
 4
> SELECT regr_count(y, x) FROM VALUES (1, null) AS tab(y, x);
 0
> SELECT regr_count(y, x) FROM VALUES (null, 1) AS tab(y, x);
 0
> SELECT regr_count(y, x) FROM VALUES (1, 2), (2, null), (2, 3), (2, 4) AS tab(y, x);
 3
> SELECT regr_count(y, x) FROM VALUES (1, 2), (2, null), (null, 3), (2, 4) AS tab(y, x);
 2

Since: 3.3.0


regr_intercept

regr_intercept(y, x) - Returns the intercept of the univariate linear regression line for non-null pairs in a group, where y is the dependent variable and x is the independent variable.

Examples:

sql
> SELECT regr_intercept(y, x) FROM VALUES (1, 1), (2, 2), (3, 3), (4, 4) AS tab(y, x);
 0.0
> SELECT regr_intercept(y, x) FROM VALUES (1, null) AS tab(y, x);
 NULL
> SELECT regr_intercept(y, x) FROM VALUES (null, 1) AS tab(y, x);
 NULL
> SELECT regr_intercept(y, x) FROM VALUES (1, 1), (2, null), (3, 3), (4, 4) AS tab(y, x);
 0.0
> SELECT regr_intercept(y, x) FROM VALUES (1, 1), (2, null), (null, 3), (4, 4) AS tab(y, x);
 0.0

Since: 3.4.0


regr_r2

regr_r2(y, x) - Returns the coefficient of determination for non-null pairs in a group, where y is the dependent variable and x is the independent variable.

Examples:

sql
> SELECT regr_r2(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x);
 0.2727272727272727
> SELECT regr_r2(y, x) FROM VALUES (1, null) AS tab(y, x);
 NULL
> SELECT regr_r2(y, x) FROM VALUES (null, 1) AS tab(y, x);
 NULL
> SELECT regr_r2(y, x) FROM VALUES (1, 2), (2, null), (2, 3), (2, 4) AS tab(y, x);
 0.7500000000000001
> SELECT regr_r2(y, x) FROM VALUES (1, 2), (2, null), (null, 3), (2, 4) AS tab(y, x);
 1.0

Since: 3.3.0


regr_slope

regr_slope(y, x) - Returns the slope of the linear regression line for non-null pairs in a group, where y is the dependent variable and x is the independent variable.

Examples:

sql
> SELECT regr_slope(y, x) FROM VALUES (1, 1), (2, 2), (3, 3), (4, 4) AS tab(y, x);
 1.0
> SELECT regr_slope(y, x) FROM VALUES (1, null) AS tab(y, x);
 NULL
> SELECT regr_slope(y, x) FROM VALUES (null, 1) AS tab(y, x);
 NULL
> SELECT regr_slope(y, x) FROM VALUES (1, 1), (2, null), (3, 3), (4, 4) AS tab(y, x);
 1.0
> SELECT regr_slope(y, x) FROM VALUES (1, 1), (2, null), (null, 3), (4, 4) AS tab(y, x);
 1.0

Since: 3.4.0


regr_sxx

regr_sxx(y, x) - Returns REGR_COUNT(y, x) * VAR_POP(x) for non-null pairs in a group, where y is the dependent variable and x is the independent variable.

Examples:

sql
> SELECT regr_sxx(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x);
 2.75
> SELECT regr_sxx(y, x) FROM VALUES (1, null) AS tab(y, x);
 NULL
> SELECT regr_sxx(y, x) FROM VALUES (null, 1) AS tab(y, x);
 NULL
> SELECT regr_sxx(y, x) FROM VALUES (1, 2), (2, null), (2, 3), (2, 4) AS tab(y, x);
 2.0
> SELECT regr_sxx(y, x) FROM VALUES (1, 2), (2, null), (null, 3), (2, 4) AS tab(y, x);
 2.0
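
The definition can be checked directly against the first example: regr_count(y, x) is 4 and var_pop(x) over (2, 2, 3, 4) is 0.6875, so 4 * 0.6875 = 2.75. The same check in SQL:

sql
> SELECT regr_count(y, x) * var_pop(x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x);
 2.75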

Since: 3.4.0


regr_sxy

regr_sxy(y, x) - Returns REGR_COUNT(y, x) * COVAR_POP(y, x) for non-null pairs in a group, where y is the dependent variable and x is the independent variable.

Examples:

sql
> SELECT regr_sxy(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x);
 0.75
> SELECT regr_sxy(y, x) FROM VALUES (1, null) AS tab(y, x);
 NULL
> SELECT regr_sxy(y, x) FROM VALUES (null, 1) AS tab(y, x);
 NULL
> SELECT regr_sxy(y, x) FROM VALUES (1, 2), (2, null), (2, 3), (2, 4) AS tab(y, x);
 1.0
> SELECT regr_sxy(y, x) FROM VALUES (1, 2), (2, null), (null, 3), (2, 4) AS tab(y, x);
 1.0

Since: 3.4.0


regr_syy

regr_syy(y, x) - Returns REGR_COUNT(y, x) * VAR_POP(y) for non-null pairs in a group, where y is the dependent variable and x is the independent variable.

Examples:

sql
> SELECT regr_syy(y, x) FROM VALUES (1, 2), (2, 2), (2, 3), (2, 4) AS tab(y, x);
 0.75
> SELECT regr_syy(y, x) FROM VALUES (1, null) AS tab(y, x);
 NULL
> SELECT regr_syy(y, x) FROM VALUES (null, 1) AS tab(y, x);
 NULL
> SELECT regr_syy(y, x) FROM VALUES (1, 2), (2, null), (2, 3), (2, 4) AS tab(y, x);
 0.6666666666666666
> SELECT regr_syy(y, x) FROM VALUES (1, 2), (2, null), (null, 3), (2, 4) AS tab(y, x);
 0.5

Since: 3.4.0


repeat

repeat(str, n) - Returns the string which repeats the given string value n times.

Examples:

sql
> SELECT repeat('123', 2);
 123123

Since: 1.5.0


replace

replace(str, search[, replace]) - Replaces all occurrences of search with replace.

Arguments:

  • str - a string expression
  • search - a string expression. If search is not found in str, str is returned unchanged.
  • replace - a string expression. If replace is not specified or is an empty string, nothing replaces the string that is removed from str.

Examples:

sql
> SELECT replace('ABCabc', 'abc', 'DEF');
 ABCDEF

Since: 2.3.0


reverse

reverse(array) - Returns a reversed string or an array with reverse order of elements.

Examples:

sql
> SELECT reverse('Spark SQL');
 LQS krapS
> SELECT reverse(array(2, 1, 4, 3));
 [3,4,1,2]

Note:

Reverse logic for arrays is available since 2.4.0.

Since: 1.5.0


right

right(str, len) - Returns the rightmost len characters from the string str (len can be of string type). If len is less than or equal to 0, the result is an empty string.

Examples:

sql
> SELECT right('Spark SQL', 3);
 SQL

Since: 2.3.0


rint

rint(expr) - Returns the double value that is closest in value to the argument and is equal to a mathematical integer.

Examples:

sql
> SELECT rint(12.3456);
 12.0

Since: 1.4.0


rlike

rlike(str, regexp) - Returns true if str matches regexp, or false otherwise.

Arguments:

  • str - a string expression

  • regexp - a string expression. The regex string should be a Java regular expression.

    Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser, see the unescaping rules at String Literal. For example, to match "\abc", a regular expression for regexp can be "^\abc$".

    There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to fall back to the Spark 1.6 behavior regarding string literal parsing. For example, if the config is enabled, the regexp that can match "\abc" is "^\abc$".

    It's recommended to use a raw string literal (with the r prefix) to avoid escaping special characters in the pattern string, if any exist.

Examples:

sql
> SET spark.sql.parser.escapedStringLiterals=true;
spark.sql.parser.escapedStringLiterals  true
> SELECT rlike('%SystemDrive%\Users\John', '%SystemDrive%\\Users.*');
true
> SET spark.sql.parser.escapedStringLiterals=false;
spark.sql.parser.escapedStringLiterals  false
> SELECT rlike('%SystemDrive%\\Users\\John', '%SystemDrive%\\\\Users.*');
true
> SELECT rlike('%SystemDrive%\\Users\\John', r'%SystemDrive%\\Users.*');
true

Note:

Use LIKE to match with simple string pattern.

Since: 1.0.0


round

round(expr, d) - Returns expr rounded to d decimal places using HALF_UP rounding mode.

Examples:

sql
> SELECT round(2.5, 0);
 3

Since: 1.5.0


row_number

row_number() - Assigns a unique, sequential number to each row, starting with one, according to the ordering of rows within the window partition.

Examples:

sql
> SELECT a, b, row_number() OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
 A1 1   1
 A1 1   2
 A1 2   3
 A2 3   1

Since: 2.0.0


rpad

rpad(str, len[, pad]) - Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. If pad is not specified, str will be padded to the right with space characters if it is a character string, and with zeros if it is a binary string.

Examples:

sql
> SELECT rpad('hi', 5, '??');
 hi???
> SELECT rpad('hi', 1, '??');
 h
> SELECT rpad('hi', 5);
 hi
> SELECT hex(rpad(unhex('aabb'), 5));
 AABB000000
> SELECT hex(rpad(unhex('aabb'), 5, unhex('1122')));
 AABB112211

Since: 1.5.0


rtrim

rtrim(str) - Removes the trailing space characters from str.

Arguments:

  • str - a string expression
  • trimStr - the trim string characters to trim, the default value is a single space

Examples:

sql
> SELECT rtrim('    SparkSQL   ');
 SparkSQL

Since: 1.5.0
