Commit 156de3b

Update README.md
1 parent 9b6b9ef commit 156de3b

1 file changed

Lines changed: 53 additions & 1 deletion

README.md

@@ -1,3 +1,55 @@
[![DOI](https://zenodo.org/badge/187497611.svg)](https://zenodo.org/badge/latestdoi/187497611)

# FormulaCloudData

This repository contains the results of the distributional analysis of Mathematical Objects of Interest (MOI) for the datasets [arXMLiv 08/2018](https://sigmathling.kwarc.info/resources/arxmliv-dataset-082018/) and [zbMATH](https://zbmath.org/).

## Download the Data

To download the data, use `wget` or `curl`, or download it manually from the [releases of this GitHub repository](https://github.com/ag-gipp/FormulaCloudData/releases).

#### arXMLiv
The unzipped data requires 6.2 GB of free disk space.
``` sh
user@pc:~/zbmath$ wget https://github.com/ag-gipp/FormulaCloudData/releases/download/2.0-arxiv/arxmliv-distributions.zip
user@pc:~/zbmath$ unzip arxmliv-distributions.zip
```

#### zbMATH
The unzipped data requires 1.1 GB of free disk space.
```sh
user@pc:~/zbmath$ wget https://github.com/ag-gipp/FormulaCloudData/releases/download/1.0-zb/zbmath-distributions.zip
user@pc:~/zbmath$ unzip zbmath-distributions.zip
```

## Explore the Data

Each dataset contains multiple numbered files without file extensions. You can simply peek into one of the files to explore the general structure. Each entry contains the string representation (SR) of the unique expression in the dataset, the complexity value (C), the total term frequency (TF) in the dataset, and the document frequency (DF) of the expression. All files are CSV files (fields are separated by semicolons). For example, if you look at the first line of the file `1` in zbMATH, you will see the following:
``` sh
user@pc:~/zbmath$ head -1 1
"mfrac(mi:d,mrow(mn:1,mo:+,mi:d))";3;1;1
```
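
To make the four fields explicit, the entry above can be split with `awk`. This is a minimal sketch, not part of the dataset tooling: the variable names (`line`, `parsed`) are made up for illustration, and the sample line is copied from the example above instead of being read from a data file. C, TF, and DF are taken from the end of the line, on the assumption that they are always the last three fields.

```sh
# Sample entry copied from the zbMATH example above; in practice you would
# read lines from a data file such as `1` instead.
line='"mfrac(mi:d,mrow(mn:1,mo:+,mi:d))";3;1;1'

# Split on semicolons. C, TF, and DF are counted from the end of the line,
# so a semicolon inside the quoted SR (should one ever occur) would not
# shift them.
parsed=$(printf '%s\n' "$line" | awk -F';' '{
    df = $NF; tf = $(NF-1); c = $(NF-2)
    sr = $0
    sub(/;[^;]*;[^;]*;[^;]*$/, "", sr)   # drop the three trailing numeric fields
    gsub(/^"|"$/, "", sr)                # strip the surrounding quotes
    printf "SR=%s C=%s TF=%s DF=%s\n", sr, c, tf, df
}')
echo "$parsed"
```

Replacing the `printf '%s\n' "$line"` part with a file name passed to `awk` would apply the same split to every entry in that file.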

If you want to search for specific expressions, say the mass-energy equivalence, we recommend using `grep`. Here is an example that searches for the entry in zbMATH:
``` sh
user@pc:~/zbmath$ grep '"mrow(mi:E,mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))"' *
12:"mrow(mi:E,mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))";4;63;49
```
As we can see, `E=mc^2` is in file `12`, has a complexity of 4, a total term frequency of 63, and a document frequency of 49.
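
For exact lookups like this one, `grep -F` treats the pattern as a literal string, which avoids surprises when an expression contains characters such as `.`, `*`, or `[` that `grep` would otherwise interpret as regular-expression syntax. The sketch below is only an illustration: it builds a tiny stand-in dataset in a temporary directory from two lines shown in this README, and the names `workdir`, `pattern`, and `hit` are hypothetical.

```sh
# Build a two-file stand-in for the zbMATH directory (lines taken from this README).
workdir=$(mktemp -d)
printf '%s\n' '"mrow(mi:E,mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))";4;63;49' > "$workdir/12"
printf '%s\n' '"mrow(mi:e,mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))";4;3;3' > "$workdir/9"

# The surrounding double quotes are part of the pattern, so only the complete
# SR matches, not longer expressions that merely contain it.
pattern='"mrow(mi:E,mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))"'

# -F: literal (fixed-string) matching; -e: pass the pattern explicitly.
hit=$(cd "$workdir" && grep -F -e "$pattern" *)
echo "$hit"

rm -r "$workdir"
```

The match is case-sensitive, so the lowercase variant stored in file `9` is not reported.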

With `grep` you can also use simple regular expressions to search for patterns. Let's check whether the dataset contains expressions that substitute `E` on the left-hand side with something else.
``` sh
user@pc:~/zbmath$ grep '"mrow(.*,mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))"' *
1:"mrow(msub(mi:E,mn:0),mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))";4;1;1
11:"mrow(msup(mi:β,mrow(mo:-,mn:1)),mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))";4;1;1
12:"mrow(mi:E,mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))";4;63;49
2:"mrow(mi:ℰ,mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))";4;1;1
9:"mrow(mi:e,mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))";4;3;3
9:"mrow(mrow(mi:h,mo:ivt,mi:ν),mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))";4;1;1
```
We can see that there are actually six distinct left-hand sides in zbMATH:
1) `E_0 = mc^2`
2) `\beta^{-1} = mc^2`
3) `E = mc^2`
4) `\mathcal{E} = mc^2`
5) `e = mc^2`
6) `h\nu = mc^2`
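
When a pattern returns several variants, ranking them by total term frequency shows which one dominates. The following is a rough sketch under two assumptions: the six matches from above are saved to a temporary file (the names `matches` and `ranked` are made up for this example), and no semicolon occurs inside the quoted SR, so TF is the third `;`-separated field even with the file-name prefix that `grep` adds.

```sh
# Store the six matches from the regular-expression search above.
matches=$(mktemp)
cat > "$matches" <<'EOF'
1:"mrow(msub(mi:E,mn:0),mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))";4;1;1
11:"mrow(msup(mi:β,mrow(mo:-,mn:1)),mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))";4;1;1
12:"mrow(mi:E,mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))";4;63;49
2:"mrow(mi:ℰ,mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))";4;1;1
9:"mrow(mi:e,mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))";4;3;3
9:"mrow(mrow(mi:h,mo:ivt,mi:ν),mo:=,mrow(mi:m,mo:ivt,msup(mi:c,mn:2)))";4;1;1
EOF

# -t';'   : use the semicolon as field separator
# -k3,3nr : sort on field 3 (TF) only, numerically, in descending order
ranked=$(LC_ALL=C sort -t';' -k3,3nr "$matches")

# Show the two most frequent variants.
printf '%s\n' "$ranked" | head -n 2

rm "$matches"
```

On a real dataset directory, the same `sort` invocation can be attached directly to the `grep` command with a pipe.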
