
Commit 58a64b0

Add converter to the repo
1 parent e2fd0dc commit 58a64b0

6 files changed

Lines changed: 308 additions & 9 deletions

File tree

.gitignore

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
1+
.idea
2+
*.iml
3+
4+
target

README.md

Lines changed: 52 additions & 9 deletions
@@ -1,36 +1,71 @@
11

2-
# FormulaCloudData
2+
# Mathematical Objects of Interest
33

4-
This repository contains the results of the distributional analysis of Mathematical Objects of Interest (MOI) for the datasets [arXMLiv 08/2018](https://sigmathling.kwarc.info/resources/arxmliv-dataset-082018/) and [zbMATH](https://zbmath.org/).
4+
This repository contains the results of the distributional analysis of [Mathematical Objects of Interest (MOI)](https://arxiv.org/pdf/2002.02712.pdf)
5+
for the datasets [arXMLiv 08/2018](https://sigmathling.kwarc.info/resources/arxmliv-dataset-082018/) and
6+
[zbMATH](https://zbmath.org/).
7+
8+
## Please cite as:
9+
A. Greiner-Petter, M. Schubotz, F. Müller, C. Breitinger, H. S. Cohl, A. Aizawa, and B. Gipp,
10+
*"Discovering Mathematical Objects of Interest - A Study of Mathematical Notations."*
11+
**In:** Proceedings of The Web Conference 2020 (WWW’20), April 20–24, 2020, Taipei, Taiwan.
12+
**DOI:** [10.1145/3366423.3380218](https://doi.org/10.1145/3366423.3380218)
13+
14+
The [preprint](https://arxiv.org/pdf/2002.02712.pdf) of the paper can be found on arXiv [https://arxiv.org/abs/2002.02712](https://arxiv.org/abs/2002.02712).
15+
16+
```tex
17+
@InProceedings{GreinerPetter2020,
18+
author = {Andr{\'{e}} Greiner{-}Petter and
19+
Moritz Schubotz and
20+
Fabian M\"{u}ller and
21+
Corinna Breitinger and
22+
Howard S.~Cohl and
23+
Akiko Aizawa and
24+
Bela Gipp},
25+
title = {Discovering Mathematical Objects of Interest - A Study of Mathematical Notations},
26+
booktitle = {Proceedings of The Web Conference 2020 (WWW '20), April 20--24, 2020, Taipei, Taiwan},
27+
doi = {10.1145/3366423.3380218}
28+
}
29+
```
530

631
## Download the Data
732

833
To download the data, use `wget` or `curl`, or go to the [releases of this GitHub repository](https://github.com/ag-gipp/FormulaCloudData/releases) and download it manually.
934

10-
#### arXMLiv
11-
Unzipped data requires 6.2GB free disk space.
35+
##### arXMLiv-Large
36+
Unzipped data requires 61GB of free disk space.
37+
You can download single parts if you do not need the entire dataset.
38+
If you are not interested in highly complex expressions, you can use the small version of the dataset.
39+
```shell script
40+
...
41+
```
42+
43+
##### arXMLiv-Small
44+
Unzipped data requires 6.2GB of free disk space.
45+
This dataset contains only expressions that appear at least twice in one document.
46+
Hence, many complex expressions do not appear in this data.
1247
``` sh
1348
user@pc:~/zbmath$ wget https://github.com/ag-gipp/FormulaCloudData/releases/download/2.0-arxiv/arxmliv-distributions.zip
1449
user@pc:~/zbmath$ unzip arxmliv-distributions.zip
1550
```
1651

17-
#### zbMATH
52+
##### zbMATH
1853
Unzipped data requires 1.1GB free disk space.
19-
```sh
54+
```shell script
2055
user@pc:~/zbmath$ wget https://github.com/ag-gipp/FormulaCloudData/releases/download/1.0-zb/zbmath-distributions.zip
2156
user@pc:~/zbmath$ unzip zbmath-distributions.zip
2257
```
2358

2459
## Explore the Data
2560

2661
Each dataset contains multiple numbered files without file extensions. You can simply peek into one of the files to explore the general structure. Each entry contains the string representation (SR) of the unique expression in the dataset, the complexity value (C), the total term frequency (TF) in the dataset, and the document frequency (DF) of the expression. All files are CSV files with semicolon-separated fields. For example, if you look at the first line of the file `1` in zbMATH, you will see the following:
27-
``` sh
62+
```shell script
2863
user@pc:~/zbmath$ head -1 1
2964
"mfrac(mi:d,mrow(mn:1,mo:+,mi:d))";3;1;1
3065
```
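
The four-field entry layout described above (SR; C; TF; DF) can be read programmatically. The following is a minimal sketch in plain Java, not part of the repository: the class `MoiEntry` and its `parse` method are hypothetical names, and the sketch assumes that every line ends with exactly three numeric fields and that the expression is wrapped in double quotes, as in the sample line.

```java
/** Hypothetical holder for one line of a distribution file (not part of the repository). */
final class MoiEntry {
    final String expression;      // string representation (SR), without the surrounding quotes
    final int complexity;         // complexity value (C)
    final long termFrequency;     // total term frequency (TF)
    final long documentFrequency; // document frequency (DF)

    private MoiEntry(String expression, int complexity, long tf, long df) {
        this.expression = expression;
        this.complexity = complexity;
        this.termFrequency = tf;
        this.documentFrequency = df;
    }

    /**
     * Parses one line of the form "SR";C;TF;DF.
     * The three numeric fields are split off from the right, so a semicolon
     * inside the quoted expression would not break the parse.
     */
    static MoiEntry parse(String line) {
        int dfSep = line.lastIndexOf(';');
        int tfSep = line.lastIndexOf(';', dfSep - 1);
        int cSep = line.lastIndexOf(';', tfSep - 1);
        if (cSep < 0) {
            throw new IllegalArgumentException("Expected four fields: " + line);
        }
        String sr = line.substring(0, cSep);
        // Assumption: the expression is quoted; embedded quotes are not unescaped here.
        if (sr.length() >= 2 && sr.startsWith("\"") && sr.endsWith("\"")) {
            sr = sr.substring(1, sr.length() - 1);
        }
        return new MoiEntry(
                sr,
                Integer.parseInt(line.substring(cSep + 1, tfSep)),
                Long.parseLong(line.substring(tfSep + 1, dfSep)),
                Long.parseLong(line.substring(dfSep + 1)));
    }
}
```

For the sample line shown above, `MoiEntry.parse(line)` would yield the expression `mfrac(mi:d,mrow(mn:1,mo:+,mi:d))` with complexity 3, term frequency 1, and document frequency 1.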
3166

3267
If you want to search for specific expressions, say the mass-energy equivalence, we recommend using `grep`. Here is an example that searches for the entry in zbMATH:
33-
``` sh
68+
```shell script
3469
user@pc:~/zbmath$ grep '"mrow(mi:E,mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))"' *
3570
12:"mrow(mi:E,mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))";4;63;49
3671
```
@@ -55,8 +90,16 @@ We can see there are actually 6 distinguished left-hand sides in zbMATH:
5590
6) `hv = mc^2`
5691

5792
If you have `parallel` installed, you can speed up the process. For example:
58-
```bash
93+
```shell script
5994
find . -type f | parallel 'grep "mrow(mi:E,mo:=,mrow(mi:m.*;[[:digit:]]*;..*;[3456789]$" {}' 2>/dev/null | awk '{print $1}'
6095

6196
find . -type f | xargs -n 1 -P 32 gawk 'match($0, /^"mrow\(mi:E,mo:=,mrow\(mi:m.*;[[:digit:]]+;[[:digit:]]+;[[:digit:]][[:digit:]]+$/, arr) {print $1}'
6297
```
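
The same kind of search (expressions that start with a given prefix, restricted by their document frequency) can also be sketched in Java. This is only an illustration under assumptions: `MoiFilter` and `filter` are hypothetical names, the lines are passed in as a list rather than read from the dataset files, and the document frequency is assumed to be the last semicolon-separated field of every matching line.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper (not part of the repository) that mimics a prefix search with a DF threshold. */
final class MoiFilter {
    /**
     * Keeps the lines whose quoted expression starts with the given prefix
     * and whose last field (the document frequency) is at least minDf.
     */
    static List<String> filter(List<String> lines, String prefix, long minDf) {
        String quotedPrefix = "\"" + prefix; // entries start with an opening double quote
        return lines.stream()
                .filter(line -> line.startsWith(quotedPrefix))
                .filter(line -> {
                    // Assumption: the text after the last ';' is a plain integer.
                    String df = line.substring(line.lastIndexOf(';') + 1).trim();
                    return Long.parseLong(df) >= minDf;
                })
                .collect(Collectors.toList());
    }
}
```

Called with the two sample lines quoted in this README, the prefix `mrow(mi:E,mo:=,mrow(mi:m`, and a threshold of 10, the sketch keeps only the mass-energy entry, whose document frequency is 49.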
98+
99+
### Convert String to MathML
100+
The repository also contains a converter (including source code) to convert the string representation back to the MathML representation.
101+
Simply use
102+
```shell script
103+
user@pc:~$ java -jar converter.jar "mrow(msub(mi:E,mn:0),mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))"
104+
<mrow><msub><mi>E</mi><mn>0</mn></msub><mo>=</mo><mrow><mi>m</mi><mo></mo><msup><mi>c</mi><mn>2</mn></msup></mrow></mrow>
105+
```

converter.jar

693 KB
Binary file not shown.

converter/pom.xml

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
1+
<?xml version="1.0" encoding="UTF-8"?>
2+
<project xmlns="http://maven.apache.org/POM/4.0.0"
3+
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
4+
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
5+
<modelVersion>4.0.0</modelVersion>
6+
7+
<groupId>math.objects.interest</groupId>
8+
<artifactId>converter</artifactId>
9+
<version>1.0-SNAPSHOT</version>
10+
11+
<dependencies>
12+
<dependency>
13+
<groupId>org.apache.commons</groupId>
14+
<artifactId>commons-text</artifactId>
15+
<version>1.8</version>
16+
</dependency>
17+
</dependencies>
18+
19+
<build>
20+
<plugins>
21+
<plugin>
22+
<groupId>org.apache.maven.plugins</groupId>
23+
<artifactId>maven-compiler-plugin</artifactId>
24+
<version>3.8.1</version>
25+
<configuration>
26+
<release>11</release>
27+
</configuration>
28+
</plugin>
29+
<plugin>
30+
<groupId>org.apache.maven.plugins</groupId>
31+
<artifactId>maven-assembly-plugin</artifactId>
32+
<version>3.2.0</version>
33+
<executions>
34+
<execution>
35+
<id>compile-converter</id>
36+
<phase>package</phase>
37+
<goals>
38+
<goal>single</goal>
39+
</goals>
40+
<configuration>
41+
<finalName>converter</finalName>
42+
<archive>
43+
<manifest>
44+
<mainClass>
45+
moi.CLIConverter
46+
</mainClass>
47+
</manifest>
48+
</archive>
49+
<descriptorRefs>
50+
<descriptorRef>jar-with-dependencies</descriptorRef>
51+
</descriptorRefs>
52+
<outputDirectory>${project.basedir}/../</outputDirectory>
53+
<appendAssemblyId>false</appendAssemblyId>
54+
</configuration>
55+
</execution>
56+
</executions>
57+
</plugin>
58+
</plugins>
59+
</build>
60+
</project>
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
1+
module converter.moi {
2+
exports moi;
3+
requires org.apache.commons.text;
4+
}
Lines changed: 188 additions & 0 deletions
@@ -0,0 +1,188 @@
1+
package moi;
2+
3+
import org.apache.commons.text.StringEscapeUtils;
4+
5+
import java.util.LinkedList;
6+
import java.util.regex.Matcher;
7+
import java.util.regex.Pattern;
8+
9+
/**
10+
* @author Andre Greiner-Petter
11+
*/
12+
public class CLIConverter {
13+
private CLIConverter(){}
14+
15+
public static final char FUNCTION_APPLY = '\u2061';
16+
public static final char INVISIBLE_TIMES = '\u2062';
17+
18+
private static final Pattern ANY_PATTERN = Pattern.compile("([a-z]+)([(:])");
19+
20+
/**
21+
* Converts the string formatted equation to MathML formatted equation.
22+
* @param str string formatted equation
23+
* @return MathML formatted equation
24+
*/
25+
public static String stringToMML(String str){
26+
StringBuilder sb = new StringBuilder();
27+
stringToMML(sb, str);
28+
return sb.toString();
29+
}
30+
31+
/**
32+
* Converts the string math equation to MathML and fills the given StringBuilder
33+
* with the MathML format. {@link #stringToMML(String)} might be more convenient.
34+
* @param sb StringBuilder that will be filled with the MathML-formatted {@code str}
35+
* @param str the math equation in string format that will be formatted to MathML
36+
*/
37+
public static void stringToMML(StringBuilder sb, String str) {
38+
LinkedList<String> stack = new LinkedList<>();
39+
40+
if ( str.isEmpty() ) {
41+
return;
42+
}
43+
44+
Matcher any = ANY_PATTERN.matcher(str);
45+
while ( any.find() ) {
46+
StringBuffer restBuffer = new StringBuffer();
47+
any.appendReplacement(restBuffer, "");
48+
String prev = restBuffer.toString();
49+
50+
if ( !stack.isEmpty() && stack.getLast().equals(":mtext") ){
51+
prev = prev.substring(0, Math.max(0,prev.length()-1));
52+
53+
if ( prev.contains("(") || prev.contains(")") ) {
54+
String tmp = "";
55+
LinkedList<String> open = new LinkedList<>();
56+
for ( int i = 0; i < prev.length(); i++) {
57+
if ( (prev.charAt(i)+"").equals("(") ) {
58+
open.addLast("(");
59+
}
60+
else if ( (prev.charAt(i)+"").equals(")") ) {
61+
if ( open.isEmpty() ) {
62+
break;
63+
}
64+
open.removeLast();
65+
}
66+
tmp += prev.charAt(i);
67+
}
68+
69+
prev = prev.substring(tmp.length());
70+
sb.append(cleanContent(tmp));
71+
closeTag(sb, "mtext");
72+
stack.removeLast();
73+
} else {
74+
sb.append(cleanContent(prev));
75+
closeTag(sb, "mtext");
76+
stack.removeLast();
77+
prev = "";
78+
}
79+
}
80+
81+
if ( prev.length()>0 ) {
82+
int startIndex = 0;
83+
String buffer = "";
84+
85+
if ( !stack.isEmpty() && stack.getLast().startsWith(":") ){
86+
if ( stack.getLast().startsWith(":") && prev.matches("^,(?:[^,)].*|$)") ) {
87+
// empty mo... very special case
88+
String tag = stack.removeLast();
89+
closeTag(sb, tag.substring(1));
90+
prev = "";
91+
} else {
92+
startIndex++;
93+
buffer += prev.charAt(0);
94+
}
95+
}
96+
97+
for ( int i = startIndex; i < prev.length(); i++) {
98+
String e = ""+prev.charAt(i);
99+
if ( e.equals(")") ) {
100+
if ( !stack.isEmpty() ){
101+
String lastTag = stack.removeLast();
102+
if ( lastTag.startsWith(":") ) {
103+
sb.append(cleanContent(buffer));
104+
closeTag(sb, lastTag.substring(1));
105+
lastTag = stack.removeLast();
106+
}
107+
closeTag(sb, lastTag);
108+
}
109+
} else if ( e.equals(",") ) {
110+
String lastTag = stack.removeLast();
111+
if ( lastTag.startsWith(":") ) {
112+
sb.append(cleanContent(buffer));
113+
closeTag(sb, lastTag.substring(1));
114+
} else {
115+
stack.addLast(lastTag);
116+
}
117+
} else {
118+
buffer += e;
119+
}
120+
}
121+
}
122+
123+
String tmp = any.group(2);
124+
String tag = any.group(1);
125+
126+
if ( tmp.equals(":") ) {
127+
// is an element
128+
stack.addLast(":"+tag);
129+
openTag(sb, tag);
130+
} else {
131+
// is a parent node
132+
stack.addLast(tag);
133+
openTag(sb, tag);
134+
}
135+
}
136+
137+
StringBuffer restBuffer = new StringBuffer();
138+
any.appendTail(restBuffer);
139+
String rest = restBuffer.toString();
140+
141+
String lastElement = stack.removeLast();
142+
if ( lastElement.startsWith(":") ) {
143+
rest = rest.substring(0, Math.max(0,rest.length()-stack.size()));
144+
sb.append(cleanContent(rest));
145+
closeTag(sb, lastElement.substring(1));
146+
} else {
147+
rest = rest.substring(0, Math.max(0,rest.length()-(stack.size()+1)));
148+
sb.append(cleanContent(rest));
149+
closeTag(sb, lastElement);
150+
}
151+
152+
while ( !stack.isEmpty() ) {
153+
String tag = stack.removeLast();
154+
if ( tag.startsWith(":") ) tag = tag.substring(1);
155+
closeTag(sb, tag);
156+
}
157+
}
158+
159+
private static String cleanContent( String content ) {
160+
content = content.matches("ivt") ? "" + INVISIBLE_TIMES : content;
161+
content = content.matches("fap") ? "" + FUNCTION_APPLY : content;
162+
content = StringEscapeUtils.escapeXml11(content);
163+
return content;
164+
}
165+
166+
private static void openTag(StringBuilder sb, String tag){
167+
sb.append("<");
168+
sb.append(tag);
169+
sb.append(">");
170+
}
171+
172+
private static void closeTag(StringBuilder sb, String tag){
173+
sb.append("</");
174+
sb.append(tag);
175+
sb.append(">");
176+
}
177+
178+
public static void main(String[] args) {
179+
if ( args == null || args.length == 0 ) {
180+
System.out.println("You have to provide a string that should be translated.");
181+
return;
182+
}
183+
184+
String in = args[0];
185+
String out = stringToMML(in);
186+
System.out.println(out);
187+
}
188+
}
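
To make the string format easier to follow, here is a simplified recursive-descent sketch of the same conversion. It is not the committed `CLIConverter` and uses a different technique than its regex-and-stack loop. The class name `SimpleMoiParser` is hypothetical, and the sketch assumes that leaf content never contains a comma or a parenthesis, so it does not cover the `mtext` and empty-`mo` special cases handled above. It also escapes only `&`, `<`, and `>` instead of using `StringEscapeUtils`.

```java
/** Simplified, hypothetical sketch of the string-to-MathML conversion (not the repository's code). */
final class SimpleMoiParser {
    private final String src;
    private int pos = 0;

    private SimpleMoiParser(String src) {
        this.src = src;
    }

    /** Converts a string such as msup(mi:c,mn:2) into nested MathML tags. */
    static String toMathML(String sr) {
        SimpleMoiParser parser = new SimpleMoiParser(sr);
        StringBuilder out = new StringBuilder();
        parser.node(out);
        if (parser.pos != sr.length()) {
            throw new IllegalArgumentException("Unexpected trailing input at index " + parser.pos);
        }
        return out.toString();
    }

    /** node := tag '(' node (',' node)* ')'  |  tag ':' content */
    private void node(StringBuilder out) {
        int start = pos;
        while (pos < src.length() && src.charAt(pos) >= 'a' && src.charAt(pos) <= 'z') {
            pos++;
        }
        String tag = src.substring(start, pos);
        if (tag.isEmpty() || pos >= src.length()) {
            throw new IllegalArgumentException("Expected a tag followed by '(' or ':' at index " + start);
        }
        char kind = src.charAt(pos++);
        out.append('<').append(tag).append('>');
        if (kind == ':') {
            // Leaf element: content runs up to the next ',' or ')' (simplifying assumption).
            int end = pos;
            while (end < src.length() && src.charAt(end) != ',' && src.charAt(end) != ')') {
                end++;
            }
            out.append(clean(src.substring(pos, end)));
            pos = end;
        } else if (kind == '(') {
            // Parent element: one or more comma-separated children, then ')'.
            node(out);
            while (pos < src.length() && src.charAt(pos) == ',') {
                pos++;
                node(out);
            }
            if (pos >= src.length() || src.charAt(pos) != ')') {
                throw new IllegalArgumentException("Expected ')' at index " + pos);
            }
            pos++;
        } else {
            throw new IllegalArgumentException("Expected '(' or ':' at index " + (pos - 1));
        }
        out.append("</").append(tag).append('>');
    }

    /** Maps the placeholders ivt and fap to their invisible operators and escapes XML specials. */
    private static String clean(String content) {
        if (content.equals("ivt")) {
            return "\u2062"; // INVISIBLE TIMES
        }
        if (content.equals("fap")) {
            return "\u2061"; // FUNCTION APPLICATION
        }
        return content.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Under these assumptions, `SimpleMoiParser.toMathML("msup(mi:c,mn:2)")` returns `<msup><mi>c</mi><mn>2</mn></msup>`. Inputs such as `mo:(` need the extra bookkeeping that the committed converter performs.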
