Oracle正则表达式实战(二)

2015-01-23 21:53:19 · 作者: · 浏览: 29
n a source string, how do we split it up into separatecolumns, based on changes of case and alpha-to-numeric, such that this.

给定一个元字符串,要求按照指定规则(基于字母大小写和字母到数字的变化)分割为多个列:

ArtADB1234567e9876540 

Becomes this. 分割后:

Art ADB 1234567 e 9876540

The source data is set up like this. 元数据如下:

DROP TABLE t1;
CREATE TABLE t1 (
  data VARCHAR2(50)
);
 
INSERT INTO t1 VALUES ('ArtADB1234567e9876540');
COMMIT;

The first part of the string is an initcap word, so it starts witha capital letter between "A" and "Z". We identify a singlecharacter using the "[]" operator, and ranges are represented using"-", like "A-Z", "a-z" or "0-9". So ifwe are looking for a single character that is a capital letter, we need to lookfor "[A-Z]". That needs to be followed by lower case letters, whichwe now know is "[a-z]", but we need 1 or more of them, which issignified by the "+" operator. So to find an initcap word, we need tosearch for "[A-Z][a-z]+". Since we want the first occurrence of this,we can use the following.

字符串第一部分为大写字母,可能为A-Z。我们使用[]操作符识别单个字符,至于范围则用“-”,例如“A-Z”,"a-z"或"0-9"。所以如果我们需要找大写的首字母则用“[A-Z]”。其后紧跟着的是若干小写字母,可以用+表示若干(1个或多个)。组合起来的正则表达式即为:[A-Z][a-z]+,这样拆分出的第一列方法有了。

REGEXP_SUBSTR(data, '[A-Z][a-z]+', 1, 1)

The second part of the string is a group of 1 or more uppercaseletters. We know we need to use the "[A-Z]+" pattern, but we need tomake sure we don't get the first capital letter, so we look for the secondoccurrence.

第二部分是一组包含1个或多个大写字母。我们知道需要用模式:[A-Z]+,但是为了不和第一部分冲突,我们指明匹配其第2次出现的文本。

REGEXP_SUBSTR(data, '[A-Z]+', 1, 2)

The next part is the first occurrence of a group of numbers.

下一部分是一组纯数字。

REGEXP_SUBSTR(data, '[0-9]+', 1, 1)

The next part is a group of lower case letters. We don't to pickup those from the initcap word, so we must look for the second occurrence oflower case letters.

下一部分是一组小写字母,同样考虑了不和第一部分冲突:

REGEXP_SUBSTR(data, '[a-z]+', 1, 2)

Finally, we have a group of numbers, which is the secondoccurrence of this pattern.

最后,是一组数字:

REGEXP_SUBSTR(data, '[0-9]+', 1, 2)

Putting that all together, we have the following query, whichsplits the data into separate columns.

将以上每一部分正则表达式的输出分别作为独立字段:

COLUMN col1 FORMAT A15
COLUMN col2 FORMAT A15
COLUMN col3 FORMAT A15
COLUMN col4 FORMAT A15
COLUMN col5 FORMAT A15
 
SELECT REGEXP_SUBSTR(data, '[A-Z][a-z]+', 1, 1) col1,
       REGEXP_SUBSTR(data, '[A-Z]+', 1, 2) col2,
       REGEXP_SUBSTR(data, '[0-9]+', 1, 1) col3,
       REGEXP_SUBSTR(data, '[a-z]+', 1, 2) col4,
       REGEXP_SUBSTR(data, '[0-9]+', 1, 2) col5
FROM   t1;
 
COL1          COL2          COL3            COL4        COL5
---------   ----------    ----------    -----------   ------------
Art           ADB          1234567         e             9876540
 
1 row selected.
 
SQL>

Example 3 : REGEXP_SUBSTR

We need to pull out a group of characters from a "/"delimited string, optionally enclosed by double quotes. The data looks likethis.

我们需要从一个字符串(含有分隔字符/和双引号” ”)中提取一组字符,原始数据如下:

DROP TABLE t1;
CREATE TABLE t1 (
  data VARCHAR2(50)
);
 
INSERT INTO t1 VALUES ('978/955086/GZ120804/10-FEB-12');
INSERT INTO t1 VALUES ('97/95508/BANANA/10-FEB-12');
INSERT INTO t1 VALUES ('97/95508/"APPLE"/10-FEB-12');
COMMIT;

We are looking for 1 or more characters that are not"/", which we do using "[^/]+". The "^" in thebrackets represents NOT and "+" means 1 or more. We also want toremove optional double quotes, so we add that as a character we don't want,giving us "[^/"]+". So if we want the data from the thirdcolumn, we need the third occurrence of t