Need to refactor the DESCRIPTION field parsing
Our problem
Due to the lack of formatting conventions, Nantes University CELCAT data is hard to parse.
Our old technique, a split on ", " applied to the subject field, worked well for most subjects but not for all of them. If a subject has a comma in its name, it is cut in half. For example, XMS1IE020 (Développement, données et exploitation - Dev0ps et) was treated as two separate subjects, données et exploitation - Dev0ps et) and XMS1IE020 (Développement, neither of which is a valid subject.
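To make the failure concrete, here is a minimal sketch of the old approach. The class name NaiveSplitDemo and the method naiveSplit are hypothetical and only serve the illustration; the split on ", " is the only part taken from the description above.

```java
import java.util.Arrays;
import java.util.List;

class NaiveSplitDemo {

    // Hypothetical reproduction of the old technique: split the field on ", ".
    static List<String> naiveSplit(String subjectFieldContent) {
        return Arrays.asList(subjectFieldContent.split(", "));
    }

    public static void main(String[] args) {
        String field = "XMS1IE020 (Développement, données et exploitation - Dev0ps et)";
        // The comma inside the subject name is treated as a separator,
        // so a single subject comes back as two broken pieces.
        for (String piece : naiveSplit(field)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

A single subject goes in and two fragments come out, which is why a plain split cannot be kept.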
What are the difficulties of this parsing?
This issue is a follow-up to issue #14 (closed), where we said the following about the subject parsing:
Can be parsed using a split with ", " and each items stored in a list. BUT need to check if there is no possible conflicts with subject name which might contains a ", " and broke everything.
As we have seen, it did break everything.
To get something that works even when a subject contains a comma, we need to analyse the CELCAT data again. We simply reuse the analysis made for issue #14 (closed), stored in the file CELCAT_SCAN_RESULTS.txt.
The analysis shows that CELCAT subjects can take several forms, such as:
- Complé. anglais BCVA:Prépa. oral concours B (X31B272) : SUBJECT_NAME (SUBJECT_CODE)
- XMS1IE020 (Développement, données et exploitation - Dev0ps et) : SUBJECT_CODE (SUBJECT_NAME)
The SUBJECT_CODE (SUBJECT_NAME) format appears to be the most frequent, but our parsing has to handle both, including possible commas in the SUBJECT_NAME part.
How did we solve this issue?
To solve this issue, we used a regex together with the Java Matcher class.
To detect both cases, we used this regex (written as a Java string literal, hence the doubled backslashes):
(?:([A-Z\\d]+)\\s*\\(([^)]+)\\))|(?:([^()]+)\\s*\\(([^)]+)\\))
It is simply the combination of two regexes:
- (?:([A-Z\\d]+)\\s*\\(([^)]+)\\)) : detects SUBJECT_CODE (SUBJECT_NAME)
- (?:([^()]+)\\s*\\(([^)]+)\\)) : detects SUBJECT_NAME (SUBJECT_CODE); since it allows any text before the parenthesis, it also accepts the first form
NOTE: There might be room for improvement. We tried to improve it, but without success.
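As a quick check of what each sub-pattern accepts, the sketch below tests them separately against the two example subjects. The class name SubjectPatternDemo and the constant names CODE_FIRST and NAME_FIRST are hypothetical; the patterns and the sample strings are the ones quoted in this issue.

```java
import java.util.regex.Pattern;

class SubjectPatternDemo {

    // The two sub-patterns, written as Java string literals.
    static final Pattern CODE_FIRST = Pattern.compile("(?:([A-Z\\d]+)\\s*\\(([^)]+)\\))");
    static final Pattern NAME_FIRST = Pattern.compile("(?:([^()]+)\\s*\\(([^)]+)\\))");

    public static void main(String[] args) {
        String codeFirst = "XMS1IE020 (Développement, données et exploitation - Dev0ps et)";
        String nameFirst = "Complé. anglais BCVA:Prépa. oral concours B (X31B272)";

        // matches() requires the whole string to fit the pattern.
        System.out.println(CODE_FIRST.matcher(codeFirst).matches()); // true
        System.out.println(CODE_FIRST.matcher(nameFirst).matches()); // false
        System.out.println(NAME_FIRST.matcher(nameFirst).matches()); // true
        System.out.println(NAME_FIRST.matcher(codeFirst).matches()); // true
    }
}
```

The pattern starting with [A-Z\\d]+ only accepts a subject that begins with an uppercase code, while the pattern starting with [^()]+ accepts any text before the parenthesis, so it covers both forms.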
With this regex, we can go from this
"Complé. anglais BCVA:Prépa. oral concours B (X31B272), XMS1IE020 (Développement, données et exploitation - Dev0ps et), XLP5CE011 (Prévention des risques en santé sécurité), X3IA020 (Gestion des données distribuées à large échelle)"
to this
["Complé. anglais BCVA:Prépa. oral concours B (X31B272)",
", XMS1IE020 (Développement, données et exploitation - Dev0ps et)",
", XLP5CE011 (Prévention des risques en santé sécurité)",
", X3IA020 (Gestion des données distribuées à large échelle)"]
Now that the biggest part of the work is done, we just need another step to remove the unwanted leading part of each match. The regex ^(,| )* matches the commas and spaces at the beginning of the string, so calling replaceAll("^(,| )*", "") deletes them.
Here is our final Java code:
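The cleanup step can be tried in isolation with a small sketch. The class name LeadingSeparatorDemo and the helper stripLeadingSeparators are hypothetical; only the replaceAll call comes from this issue.

```java
class LeadingSeparatorDemo {

    // Hypothetical helper: removes any run of commas and spaces at the start of a raw match.
    static String stripLeadingSeparators(String rawMatch) {
        return rawMatch.replaceAll("^(,| )*", "");
    }

    public static void main(String[] args) {
        // A raw match that still carries the ", " separator.
        System.out.println(stripLeadingSeparators(", XLP5CE011 (Prévention des risques en santé sécurité)"));
        // A raw match without separator is left untouched.
        System.out.println(stripLeadingSeparators("Complé. anglais BCVA:Prépa. oral concours B (X31B272)"));
    }
}
```

Because the pattern is anchored with ^, commas inside the subject name are not affected.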
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static List<String> parseSubject(String subjectFieldContent) {
    if (subjectFieldContent == null) {
        throw new IllegalArgumentException("The subject of the event cannot be null");
    }
    List<String> subjects = new ArrayList<>();
    // Accepts both SUBJECT_CODE (SUBJECT_NAME) and SUBJECT_NAME (SUBJECT_CODE)
    Pattern pattern = Pattern.compile("(?:([A-Z\\d]+)\\s*\\(([^)]+)\\))|(?:([^()]+)\\s*\\(([^)]+)\\))");
    Matcher matcher = pattern.matcher(subjectFieldContent);
    while (matcher.find()) {
        String match = matcher.group().trim();
        // Remove the commas and spaces left at the beginning of the match
        match = match.replaceAll("^(,| )*", "");
        subjects.add(match);
    }
    return subjects;
}
And here is what we get at the end of the processing: