Sunday, November 8, 2015

Oracle SQL: REGEX_REPLACE Function

Recently, I participated in a forum discussion (Higher Education User Group) where someone asked if there is a way (and if so how) to strip out HTML elements from a long character field. This is mainly for scenarios where a long character field is used to store rich text enabled data (which contains HTML elements). And while querying the data from this field either using PS Query or SQL the requirement is to only retrieve the "plain text".

I found that we could use a delivered Oracle SQL function called REGEX_REPLACE to pattern match and replace contents of a string.

Here are some examples of how to use this function:

If we want to mainly get rid of HTML elements then we can use the following pattern:

select REGEXP_REPLACE(DESCRLONG, '<[^>]+>|\&(nbsp;)|\&(nbsp)|(amp;)', '',1,0, 'i'), DESCRLONG from PSMSGCATDEFN where descrlong like '%<%>%';

If we additionally want to get rid of contents inside script/style elements then we can use the following pattern:

select REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(DESCRLONG, '<script[^>]*>.*</script>', '',1,0,'in'), '<style[^>]*>.*</style>', '',1,0,'in'),'<[^>]+>|\&(nbsp;)|\&(nbsp)|(amp;)', '',1,0,'i'), DESCRLONG from spsoft.PSMSGCATDEFN where descrlong like '%<%>%';

Note:
- You will need to improvise the regex pattern for those cases which are not trapped in the above sample SQLs.
- This function REGEX_REPLACE only works on Oracle databases.
- I have to admit that I am not a RegEx expert and received help from some very talented colleagues who helped me with the second (more complex) SQL.

1 comment:

  1. REGEXP_REPLACE( DESCR254_mixed, '[^[:alnum:] -]', NULL )


    SELECT DESCR254_mixed,
    REGEXP_REPLACE( DESCR254_mixed, '[^[:alnum:] -]', NULL ) AS DESCR254_mixed
    FROM PS_PO_LINE WHERE PO_ID = '3001818231';

    ReplyDelete