_episodes/02-rucio_usage.md
23 additions & 10 deletions
@@ -192,12 +192,12 @@ The following tags are available as of March 2026:
192
192
- E.g. v25.06.2 -> June 2025 software container, version 2
193
193
- **requester\_pwg**
194
194
- Defines the physics working group (PWG) that the simulated data relates to, options are:
195
-
- excl\_diff\_tagging
196
-
- inclusive
197
-
- jets\_hf
198
-
- semi\_inclusive
199
-
- ew\_bsm
200
-
- other
195
+
- edt (exclusive, diffractive and tagging)
196
+
- inclusive
197
+
- jets\_hf
198
+
- semi\_inclusive
199
+
- ew\_bsm
200
+
- other
201
201
- **Can be one or more**
202
202
- **q2\_min**
203
203
- Minimum Q2 value (GeV^2) in the simulation file, entered as a number.
@@ -213,11 +213,14 @@ The following tags are available as of March 2026:
213
213
- True/false depending upon whether sample includes any background mixing
214
214
- **ion\_species**
215
215
- Ion species in the simulation, defaults to `p`, proton, if not specified
216
-
- Typed as formatted in files, e.g. `Au197` for gold, `He3` for helium 3 etc.
216
+
- Typed as formatted in files, e.g. `Au197` for gold, `He3` for helium 3 etc.
217
+
- `Cu63`, `H2`, `Ru96` and `p` are some other options
217
218
- **generator**
218
219
- MC event generator used to generate the simulated data
219
220
- E.g. Pythia8, Herwig etc
220
-
221
+
- Entered as all lower case
222
+
- E.g. `dempgen` *not* `DEMPgen`
223
+
221
224
As noted on some items in this list, some tags are optional and may not be applied to all datasets. However, the following tags are **required** for all datasets:
222
225
223
226
- software\_release
@@ -228,18 +231,28 @@ As noted on some items in this list, some tags are optional and may not be appli
228
231
- ion\_species
229
232
- generator
230
233
234
+
Note that as mentioned for the generator, tags are entered in lower case, **with the exception of ion species**.
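The case rule above can be sketched as a small helper. This is only an illustration: `normalise_tags` and `CASE_SENSITIVE_TAGS` are hypothetical names, not part of Rucio, and the sketch assumes `ion_species` is the only tag whose capitalisation must be preserved.

```python
# Hypothetical helper, not part of Rucio: lower-case every tag value
# except ion_species, which keeps the formatting used in the files.
CASE_SENSITIVE_TAGS = {"ion_species"}


def normalise_tags(tags):
    """Return a copy of `tags` with values lower-cased where appropriate."""
    normalised = {}
    for key, value in tags.items():
        text = str(value).strip()
        if key not in CASE_SENSITIVE_TAGS:
            text = text.lower()
        normalised[key] = text
    return normalised


print(normalise_tags({"generator": "DEMPgen", "ion_species": "Au197"}))
```

Here `DEMPgen` becomes `dempgen`, while `Au197` is left untouched.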
235
+
231
236
We can use these tags to filter through the available datasets and identify those of interest to us. For example:
232
237
233
238
```bash
234
239
rucio did list --filter 'TAG==*' 'scope:*'
235
240
```
236
241
237
-
So, as an example, we could all DIDs using the latest software release (v26.03.0) via:
242
+
So, as an example, we could list all DIDs with electron beam energies of 10 GeV via:
238
243
239
244
```bash
240
-
rucio did list --filter 'software_release==26.03.0*' 'epic:*'
245
+
rucio did list --filter 'electron_beam_energy==10' 'epic:*'
241
246
```
242
247
248
+
We can also combine tags and filter on several at once, e.g.:
249
+
250
+
```bash
251
+
rucio did list --filter 'electron_beam_energy==10, ion_beam_energy==250' 'epic:*'
252
+
```
253
+
254
+
which will return only datasets with 10x250 collisions (10 GeV electron on 250 GeV ions using the standard ePIC conventions). We can keep adding filters in this manner as we like to really narrow down the DIDs we return with our query.
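If we build such queries often, the filter string can be assembled from a dictionary of tags. The following is a minimal sketch: `build_filter` and `build_command` are hypothetical helper names, and the comma-separated `TAG==value` form is simply copied from the commands in this lesson rather than from a full description of the Rucio filter syntax.

```python
import shlex


def build_filter(tags):
    """Join tag/value pairs into the 'tag==value, tag==value' form used above."""
    return ", ".join(f"{key}=={value}" for key, value in tags.items())


def build_command(tags, did_pattern="epic:*"):
    """Return the argument list for `rucio did list` with the given tags."""
    return ["rucio", "did", "list", "--filter", build_filter(tags), did_pattern]


command = build_command({"electron_beam_energy": 10, "ion_beam_energy": 250})
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

The argument list could also be passed directly to `subprocess.run` in an environment where the `rucio` client is set up.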
255
+
243
256
> ## `Exercise:`
244
257
> Using tags, find the DIDs of the **latest**:
245
258
> - DEMP events in the Q2 range of 3 to 10 for 10 GeV electrons on 250 GeV protons
_episodes/03-use_cases.md
84 additions & 33 deletions
@@ -29,35 +29,45 @@ They may also want to only test a small subset of data to test and develop their
29
29
To find files that meet their requirements they could utilise the following tags:
30
30
31
31
- software\_release
32
-
-physics\_process
32
+
- requester\_pwg
33
33
- electron\_beam\_energy
34
34
- ion\_beam\_energy
35
+
- ion\_species
35
36
36
37
We can use these tags to filter through the DIDs and find datasets of interest:
37
38
38
39
```bash
39
-
Example command
40
+
rucio did list --filter 'software_release==XXX, requester_pwg==YYY, electron_beam_energy==ZZ, ion_beam_energy==iii, ion_species==jjj' 'epic:*'
40
41
```
41
42
43
+
Where we can substitute in our chosen values for each in place of `XXX`, `YYY`, `ZZ`, `iii` and `jjj`.
44
+
42
45
> ## `Beam Energies:`
43
46
> Whilst we can enter any number for the `electron_beam_energy` and `ion_beam_energy` values, there are only certain combinations actually in use.
44
47
> `electron_beam_energy` is typically 5, 10 or 18 GeV
45
-
> `ion_beam_energy` is typically 41, 100, 130, 250 or 275 for protons.
48
+
> `ion_beam_energy` is typically 41, 100, 130, 250 or 275 GeV for protons.
46
49
> For other ion species, 110 and 166 may also be used.
47
50
{: .callout}
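A quick sanity check on the values we type in could look like the sketch below. The numbers are taken from the callout above; `is_typical_beam_setting` is a hypothetical name, and the sketch assumes that non-proton species may use the proton energies as well as 110 and 166. It only checks each value individually, not whether a given electron/ion combination is actually in use.

```python
# Typical values quoted in the callout (GeV).
TYPICAL_ELECTRON_ENERGIES = {5, 10, 18}
TYPICAL_PROTON_ENERGIES = {41, 100, 130, 250, 275}
EXTRA_ION_ENERGIES = {110, 166}  # assumed to apply to non-proton species only


def is_typical_beam_setting(electron_energy, ion_energy, ion_species="p"):
    """Return True if both energies are among the typical values listed above."""
    if electron_energy not in TYPICAL_ELECTRON_ENERGIES:
        return False
    allowed = set(TYPICAL_PROTON_ENERGIES)
    if ion_species != "p":
        allowed |= EXTRA_ION_ENERGIES
    return ion_energy in allowed


print(is_typical_beam_setting(10, 250))           # protons, typical values
print(is_typical_beam_setting(10, 110, "Au197"))  # extra ion energy for gold
```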
48
51
49
-
50
52
Once we have identified a specific dataset of interest, we can look at the files within it using:
51
53
52
54
```bash
53
-
Example command
55
+
rucio did content list scope:name
54
56
```
55
57
56
-
as we saw in the last episode. We could download this file locally using
58
+
and we can get locations of the files within the dataset via:
57
59
58
60
```bash
59
-
Example command
61
+
rucio replica list file --protocols root --pfns --rses isopenaccess scope:name_of_file
62
+
rucio replica list file --protocols root --pfns --rses isopenaccess scope:name_of_dataset
60
63
```
64
+
as we saw in the last episode. We can get just the location of a specific file, or the locations of all files within the dataset, depending upon which we specify. We could download a file locally using:
65
+
66
+
```bash
67
+
xrdcp FILEPATH ./
68
+
```
69
+
70
+
where `FILEPATH` is the path to one specific file from the output of one of the rucio commands above.
61
71
62
72
> ## `Exercise:`
63
73
> Using the suggested tags, find the **latest** available datasets for:
@@ -85,46 +95,81 @@ To find files that meet their requirements they could utilise the following tags
85
95
86
96
They may also want to use the `q2_min` and `q2_max` tags, along with the `ion_species` tag, to narrow down to an even more specific subset of files. They may also want to analyse files with or without background enabled.
87
97
88
-
As they want to process a large number of files, **it is unlikely (and not recommended) that they download a large number of files to process them locally**. Instead, they may want to stream their files directly in their analysis script. They could do this via
98
+
As they want to process a large number of files, **it is unlikely (and not recommended) that they download a large number of files to process them locally**. Instead, they may want to stream their files directly in their analysis script. They could do this via:
89
99
90
100
```c++
91
-
root based streaming example
92
-
Full working script
101
+
auto f = TFile::Open("FILEPATH");
102
+
auto tree = f->Get<TTree>("events");
93
103
```
94
104
95
-
or if they're using python -
105
+
or if they're using python:
96
106
97
107
```python
98
-
Python based streaming example
99
-
Full working script
108
+
import uproot
109
+
import XRootD
110
+
file_path = "FILEPATH"
111
+
root_file = uproot.open(file_path)
100
112
```
101
113
102
-
As they may wish to process a full dataset, they might want to feed their script a full list of files to stream and run. They could print the full list of files in a dataset via -
114
+
As they may wish to process a full dataset, they might want to feed their script a full list of files to stream and run. They could print the full list of files in a dataset:
> We have limited this to only pipe 5 files in the dataset to our list.
110
-
> Remove the `fragment` part of the command to instead print all lines.
111
-
> Alternatively, edit this to be the number of lines that you want.
112
-
{: .callout}
113
-
114
-
This could then be processed in the script via -
120
+
This could then be processed in the script:
115
121
116
122
```c++
117
-
root based streaming example
118
-
Full working script
123
+
void FileListProcess(){
124
+
string line;
125
+
ifstream fstream ("FileList");
126
+
int FileCount = 0;
127
+
//TChain *AnalysisChain = new TChain("events"); // We could define a chain to process our files too and add them as we scan over our list
128
+
while(getline(fstream, line)){
129
+
if (FileCount >= 5) break; // Stop loop after 5 files, comment out to read full file
130
+
// Check file exists
131
+
TString tmpFile{line};
132
+
auto RootFile = TFile::Open(tmpFile);
133
+
if(!RootFile){ // Check file exists
134
+
cout << "File not found:"<<tmpFile << endl;
135
+
continue;
136
+
}
137
+
cout << "Found file - " << line << endl;
138
+
//AnalysisChain->Add(tmpFile); // Add to our chain if we want
139
+
FileCount++;
140
+
}
141
+
}
119
142
```
120
143
121
-
or if they're using python -
144
+
or if they're using python:
122
145
123
146
```python
124
-
Python based streaming example
125
-
Full working script
147
+
import ROOT
148
+
import uproot
149
+
import XRootD
150
+
import awkward as ak
151
+
152
+
Files=[]
153
+
154
+
with open('FileList', 'r') as file_list:
155
+
    lines_list = file_list.readlines()
156
+
for line in lines_list[:5]: # Read only lines 0:5 - remove [:5] to read all or change 5 to N where N is the number of lines you want
157
+
    file_path = line.rstrip() # rstrip to remove trailing white space/new lines
158
+
    try:
159
+
        with uproot.open(file_path) as root_file:
160
+
            Files.append(file_path) # Add file path to array
161
+
            print("Found file - ", file_path, "and appended to list for processing.")
162
+
    except Exception as e:
163
+
        print(f"Could not open file: {e}")
164
+
165
+
# Use the uproot iterate method to process our list of files - See https://uproot.readthedocs.io/en/stable/uproot.behaviors.TBranch.iterate.html
166
+
#for chunk in uproot.iterate({f: "events" for f in Files}, expressions=["MCParticles.PDG"]): # Open files in array f and process events tree with branches specified
167
+
# Process each chunk - Do something
168
+
# print(ak.type(chunk))
126
169
```
127
170
171
+
Note that we have restricted these examples to only print out the first five files in the list we created. We can comment out the lines noted to process the full list (or adjust the cutoff value in the condition to process a different number).
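The cutoff itself can be separated from the file handling. Below is a minimal sketch using only the standard library: `take_paths` is a hypothetical helper, and the example paths (including `host.example`) are made up purely for illustration.

```python
from itertools import islice


def take_paths(lines, limit=5):
    """Return up to `limit` non-empty, whitespace-stripped entries from `lines`.

    Pass limit=None to keep every entry.
    """
    stripped = (line.strip() for line in lines)
    non_empty = (path for path in stripped if path)
    return list(islice(non_empty, limit))


# Made-up content standing in for the lines of our file list.
example_lines = [
    "root://host.example//data/file_0.root\n",
    "\n",
    "root://host.example//data/file_1.root\n",
]
print(take_paths(example_lines, limit=2))
```

With a real list, an open file handle can be passed in place of `example_lines`, since iterating over a handle yields its lines one at a time.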
172
+
128
173
> ## `Exercise:`
129
174
> Using the suggested tags, find the **latest** available dataset for:
130
175
> - Deeply Virtual Compton Scattering (DVCS) events from the EpIC event generator for 10 GeV electrons colliding with 130 GeV protons *without* background included
@@ -133,14 +178,20 @@ Full working script
133
178
> 3. Stream **five** of the files in this dataset in a script, check the total number of events contained in all five files.
134
179
{: .challenge}
135
180
136
-
## Detector Designer/Optimiser
181
+
## Detector Designer/Optimiser, Algorithm/Reconstruction Development
182
+
183
+
Discussion of use case based upon SIM data - To be added soon.
184
+
185
+
## Conclusion and Comments
186
+
187
+
That wraps up our introduction to using Rucio and some example use cases and scenarios.
137
188
138
-
Discussion of use case based upon SIM data
189
+
New tags may be added in the future. We welcome any suggestions or changes as we roll out Rucio and it becomes more widely used. Get in touch via:
139
190
140
-
## Algorithm/Reconstruction Development
191
+
`stephen.kay@york.ac.uk`
141
192
142
-
Discussion of use case based upon SIM data and tags - merge with previous?
193
+
or on Mattermost with suggestions, comments and feedback.
143
194
144
-
## General Comments
195
+
Remember to consider whether you need full datasets before downloading them and keep an eye on whether your files have multiple open access copies when making file lists.
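One way to guard against duplicate copies in a file list is to keep a single replica per file name. This is only a sketch under a stated assumption: that two replicas of the same file share the same file name at the end of their path. `one_replica_per_file` and the hosts below are hypothetical.

```python
from posixpath import basename
from urllib.parse import urlsplit


def one_replica_per_file(pfns):
    """Keep the first PFN seen for each file name, preserving input order."""
    seen = set()
    unique = []
    for pfn in pfns:
        name = basename(urlsplit(pfn).path)
        if name not in seen:
            seen.add(name)
            unique.append(pfn)
    return unique


# Made-up PFNs: the first two point at copies of the same file.
example_pfns = [
    "root://site-a.example//store/file_1.root",
    "root://site-b.example//mirror/file_1.root",
    "root://site-a.example//store/file_2.root",
]
print(one_replica_per_file(example_pfns))
```

Note that files with the same name in genuinely different datasets would also be collapsed, so this is best applied to the file list of one dataset at a time.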
145
196
146
-
Some general comments and info. Pointers, things to avoid or recommendations etc.
197
+
Also, if you find any nice tricks or develop short scripts (maybe one which makes a file list for the latest version of a dataset based upon inputs?) then feel free to share them too!
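As a starting point for such a script, the sketch below picks the latest release from a list of release strings. It assumes releases follow the `vYY.MM.N` pattern described for the `software_release` tag (with or without the leading `v`); `release_key` and `latest_release` are hypothetical names.

```python
import re

# Matches e.g. "v25.06.2" or "26.03.0": year, month, container version.
RELEASE_PATTERN = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def release_key(tag):
    """Turn a release string into a (year, month, version) tuple of ints."""
    match = RELEASE_PATTERN.match(tag.strip())
    if match is None:
        raise ValueError(f"Unrecognised release format: {tag!r}")
    return tuple(int(part) for part in match.groups())


def latest_release(tags):
    """Return the release string that sorts last by year, month and version."""
    return max(tags, key=release_key)


print(latest_release(["25.06.2", "v26.03.0", "25.12.1"]))
```

Comparing integer tuples rather than raw strings avoids mis-ordering releases such as `25.9.0` and `25.10.0`.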