Commit 22f2881

Finalised episode 2. Challenges need testing once tags are active. Updated episode 3 with example code. 2/3 use cases now complete.
1 parent 374fc8a commit 22f2881

2 files changed

Lines changed: 107 additions & 43 deletions

_episodes/02-rucio_usage.md

Lines changed: 23 additions & 10 deletions
Original file line number | Diff line number | Diff line change
@@ -192,12 +192,12 @@ The following tags are available as of March 2026:
192192
- E.g. v25.06.2 -> June 2025 software container, version 2
193193
- **requester\_pwg**
194194
- Defines the physics working group (PWG) that the simulated data relates to, options are:
195-
- excl\_diff\_tagging
196-
- inclusive
197-
- jets\_hf
198-
- semi\_inclusive
199-
- ew\_bsm
200-
- other
195+
- edt (exclusive, diffractive and tagging)
196+
- inclusive
197+
- jets\_hf
198+
- semi\_inclusive
199+
- ew\_bsm
200+
- other
201201
- **Can be one or more**
202202
- **q2\_min**
203203
- Minimum Q2 value (GeV^2) in the simulation file, entered as a number.
@@ -213,11 +213,14 @@ The following tags are available as of March 2026:
213213
- True/false depending upon whether sample includes any background mixing
214214
- **ion\_species**
215215
- Ion species in the simulation, defaults to `p`, proton, if not specified
216-
- Typed as formatted in files, e.g. `Au197` for gold, `He3` for helium 3 etc.
216+
- Typed as formatted in files, e.g. `Au197` for gold, `He3` for helium 3 etc.
217+
- `Cu63`, `H2`, `Ru96` and `p` are some other options
217218
- **generator**
218219
- MC event generator used to generate the simulated data
219220
- E.g. Pythia8, Herwig etc
220-
221+
- Entered as all lower case
222+
- E.g. `dempgen` *not* `DEMPgen`
223+
221224
As noted on some items in this list, some tags are optional and may not be applied to all datasets. However, the following tags are **required** for all datasets:
222225
223226
- software\_release
@@ -228,18 +231,28 @@ As noted on some items in this list, some tags are optional and may not be appli
228231
- ion\_species
229232
- generator
230233
234+
Note that as mentioned for the generator, tags are entered in lower case, **with the exception of ion species**.
235+
231236
We can use these tags to filter through the available datasets and identify those of interest to us. For example:
232237
233238
```bash
234239
rucio did list --filter 'TAG==*' 'scope:*'
235240
```
236241
237-
So, as an example, we could all DIDs using the latest software release (v26.03.0) via:
242+
So, as an example, we could list all DIDs with electron beam energies of 10 GeV via:
238243
239244
```bash
240-
rucio did list --filter 'software_release==26.03.0*' 'epic:*'
245+
rucio did list --filter 'electron_beam_energy==10' 'epic:*'
241246
```
242247
248+
We can also combine tags and filter on several at once, e.g.:
249+
250+
```bash
251+
rucio did list --filter 'electron_beam_energy==10, ion_beam_energy==250' 'epic:*'
252+
```
253+
254+
which will return only datasets with 10x250 collisions (10 GeV electrons on 250 GeV ions, using the standard ePIC conventions). We can keep adding filters in this manner to narrow down the DIDs returned by our query.
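
As a minimal sketch of how a combined filter string could be assembled in a script, the hypothetical helper below (`build_filter` is not part of Rucio) joins tag/value pairs in the same `tag==value, tag==value` form used above. It assumes the lower-case rule described earlier, leaving `ion_species` values untouched, and only prints the resulting command rather than running it.

```python
# Hypothetical helper (not part of Rucio): build a --filter string from tag/value pairs.
def build_filter(tags):
    """Join tag==value pairs with ', ', matching the combined-filter form shown above."""
    parts = []
    for key, value in tags.items():
        text = str(value)
        # Tag values are entered in lower case, except for ion_species (e.g. Au197).
        if key != "ion_species":
            text = text.lower()
        parts.append(f"{key}=={text}")
    return ", ".join(parts)


query = build_filter({"electron_beam_energy": 10, "ion_beam_energy": 250})
# Print the command so it can be checked before being run by hand.
print(f"rucio did list --filter '{query}' 'epic:*'")
```

Keeping the tags in a dictionary makes it easy to add or remove a filter without retyping the whole query.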
255+
243256
> ## `Exercise:`
244257
> Using tags, find the DIDs of the **latest**:
245258
> - DEMP events in the Q2 range of 3 to 10 for 10 GeV electrons on 250 GeV protons

_episodes/03-use_cases.md

Lines changed: 84 additions & 33 deletions
Original file line number | Diff line number | Diff line change
@@ -29,35 +29,45 @@ They may also want to only test a small subset of data to test and develop their
2929
To find files that meet their requirements they could utilise the following tags:
3030

3131
- software\_release
32-
- physics\_process
32+
- requester\_pwg
3333
- electron\_beam\_energy
3434
- ion\_beam\_energy
35+
- ion\_species
3536

3637
We can use these tags to filter through the DIDs and find datasets of interest:
3738

3839
```bash
39-
Example command
40+
rucio did list --filter 'software_release==XXX, requester_pwg==YYY, electron_beam_energy==ZZ, ion_beam_energy==iii, ion_species==jjj' 'epic:*'
4041
```
4142

43+
where we substitute our chosen values in place of `XXX`, `YYY`, `ZZ`, `iii` and `jjj`.
44+
4245
> ## `Beam Energies:`
4346
> Whilst we can enter any number for the `electron_beam_energy` and `ion_beam_energy` values, there are only certain combinations actually in use.
4447
> `electron_beam_energy` is typically 5, 10 or 18 GeV
45-
> `ion_beam_energy` is typically 41, 100, 130, 250 or 275 for protons.
48+
> `ion_beam_energy` is typically 41, 100, 130, 250 or 275 GeV for protons.
4649
> For other ion species, 110 and 166 may also be used.
4750
{: .callout}
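
The typical values listed in the callout can be turned into a quick sanity check before a query is run. This is only a sketch: `is_typical` is a hypothetical helper, it checks each value separately rather than confirming that a particular combination is actually in use, and it assumes the extra ion energies (110 and 166) apply in addition to the proton values.

```python
# Typical values copied from the callout; these are "typical", not an exhaustive list.
TYPICAL_ELECTRON = {5, 10, 18}
TYPICAL_PROTON = {41, 100, 130, 250, 275}
EXTRA_ION = {110, 166}  # Assumed to apply only to species other than protons


def is_typical(electron, ion, ion_species="p"):
    """Return True if both beam energies appear among the typical values."""
    allowed = set(TYPICAL_PROTON)
    if ion_species != "p":
        allowed |= EXTRA_ION
    return electron in TYPICAL_ELECTRON and ion in allowed


print(is_typical(10, 250))           # typical proton setting
print(is_typical(10, 110, "Au197"))  # 110 is listed for other ion species
```

A `False` result does not mean a query is invalid, only that it is unlikely to match any dataset.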
4851

49-
5052
Once we have identified a specific dataset of interest, we can look at the files within it using:
5153

5254
```bash
53-
Example command
55+
rucio did content list scope:name
5456
```
5557

56-
as we saw in the last episode. We could download this file locally using
58+
and we can get locations of the files within the dataset via:
5759

5860
```bash
59-
Example command
61+
rucio replica list file --protocols root --pfns --rses isopenaccess scope:name_of_file
62+
rucio replica list file --protocols root --pfns --rses isopenaccess scope:name_of_Dataset
6063
```
64+
as we saw in the last episode. We can get just the location of a specific file OR the locations of all files within the dataset, depending upon which we specify. We could download a single file locally using:
65+
66+
```bash
67+
xrdcp FILEPATH ./
68+
```
69+
70+
where `FILEPATH` is the path to one specific file from the output of one of the rucio commands above.
6171

6272
> ## `Exercise:`
6373
> Using the suggested tags, find the **latest** available datasets for:
@@ -85,46 +95,81 @@ To find files that meet their requirements they could utilise the following tags
8595

8696
They may also want to use the `q2_min` and `q2_max` tags, along with the `ion_species` tag, to narrow down to an even more specific subset of files. They may also want to analyse files with or without background enabled.
8797

88-
As they want to process a large number of files, **it is unlikely (and not recommended) that they download a large number of files to process them locally**. Instead, they may want to stream their files directly in their analysis script. They could do this via
98+
As they want to process a large number of files, **it is unlikely (and not recommended) that they download a large number of files to process them locally**. Instead, they may want to stream their files directly in their analysis script. They could do this via:
8999

90100
```c++
91-
root based streaming example
92-
Full working script
101+
auto f = TFile::Open("FILEPATH");
102+
auto tree = f->Get<TTree>("events");
93103
```
94104

95-
or if they're using python -
105+
or if they're using python:
96106

97107
```python
98-
Python based streaming example
99-
Full working script
108+
import uproot
109+
import XRootD
110+
file_path = "FILEPATH"
111+
root_file = uproot.open(file_path)
100112
```
101113

102-
As they may wish to process a full dataset, they might want to feed their script a full list of files to stream and run. They could print the full list of files in a dataset via -
114+
As they may wish to process a full dataset, they might want to feed their script a full list of files to stream and run. They could write the full list of files in a dataset to a text file via:
103115

104116
```bash
105-
Example command to pipe dataset list to a file
117+
rucio replica list file --protocols root --pfns --rses isopenaccess scope:name_of_Dataset > FileList
106118
```
107119

108-
> ## `Note:`
109-
> We have limited this to only pipe 5 files in the dataset to our list.
110-
> Remove the `fragment` part of the command to instead print all lines.
111-
> Alternatively, edit this to be the number of lines that you want.
112-
{: .callout}
113-
114-
This could then be processed in the script via -
120+
This could then be processed in the script:
115121

116122
```c++
117-
root based streaming example
118-
Full working script
123+
void FileListProcess(){
124+
string line;
125+
ifstream fstream ("FileList");
126+
int FileCount = 0;
127+
//TChain *AnalysisChain = new TChain("events"); // We could define a chain to process our files too and add them as we scan over our list
128+
while(getline(fstream, line)){
129+
if (FileCount >= 5) break; // Stop loop after 5 files, comment out to read full file
130+
// Check file exists
131+
TString tmpFile{line};
132+
auto RootFile = TFile::Open(tmpFile);
133+
if(!RootFile){ // Check file exists
134+
cout << "File not found:"<<tmpFile << endl;
135+
continue;
136+
}
137+
cout << "Found file - " << line << endl;
138+
//AnalysisChain->Add(tmpFile); // Add to our chain if we want
139+
FileCount++;
140+
}
141+
}
119142
```
120143

121-
or if they're using python -
144+
or if they're using python:
122145

123146
```python
124-
Python based streaming example
125-
Full working script
147+
import ROOT
148+
import uproot
149+
import XRootD
150+
import awkward as ak
151+
152+
Files=[]
153+
154+
with open('FileList', 'r') as file:
155+
    lines_list = file.readlines()
156+
for line in lines_list[:5]: # Read only lines 0:5 - remove [:5] to read all or change 5 to N where N is the number of lines you want
157+
    file_path = line.rstrip() # rstrip to remove trailing white space/new lines
158+
    try:
159+
        with uproot.open(file_path) as file:
160+
            Files.append(file_path) # Add file path to array
161+
            print("Found file - ", file_path, "and appended to list for processing.")
162+
    except Exception as e:
163+
        print(f"Could not open file: {e}")
164+
165+
# Use the uproot iterate method to process our list of files - See https://uproot.readthedocs.io/en/stable/uproot.behaviors.TBranch.iterate.html
166+
#for chunk in uproot.iterate({f: "events" for f in Files}, expressions=["MCParticles.PDG"]): # Open files in array f and process events tree with branches specified
167+
# Process each chunk - Do something
168+
# print(ak.type(chunk))
126169
```
127170

171+
Note that we have restricted these examples to only open the first five files in the list we created. We can comment out or edit the lines noted to process the full list (or adjust the cutoff value to process a different number).
172+
128173
> ## `Exercise:`
129174
> Using the suggested tags, find the **latest** available dataset for:
130175
> - Deeply Virtual Compton Scattering (DVCS) events from the EpIC event generator for 10 GeV electrons colliding with 130 GeV protons *without* background included
@@ -133,14 +178,20 @@ Full working script
133178
> 3. Stream **five** of the files in this dataset in a script, check the total number of events contained in all five files.
134179
{: .challenge}
135180

136-
## Detector Designer/Optimiser
181+
## Detector Designer/Optimiser, Algorithm/Reconstruction Development
182+
183+
Discussion of use case based upon SIM data - To be added soon.
184+
185+
## Conclusion and Comments
186+
187+
That wraps up our introduction to using Rucio and some example use cases and scenarios.
137188

138-
Discussion of use case based upon SIM data
189+
New tags may be added in the future. We welcome any suggestions or changes as we roll out Rucio and it becomes more widely used. Get in touch via:
139190

140-
## Algorithm/Reconstruction Development
191+
`stephen.kay@york.ac.uk`
141192

142-
Discussion of use case based upon SIM data and tags - merge with previous?
193+
or on Mattermost with suggestions, comments and feedback.
143194

144-
## General Comments
195+
Remember to consider whether you need full datasets before downloading them and keep an eye on whether your files have multiple open access copies when making file lists.
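
On the point about multiple open access copies: if a file has more than one replica, a file list may contain the same file several times at different locations. The sketch below keeps only the first location seen for each file. It assumes that replicas of one file share the same file name at the end of their path, and both `unique_by_name` and the `example.org` hosts are hypothetical, used for illustration only.

```python
import posixpath


def unique_by_name(pfns):
    """Keep the first PFN seen for each file name, preserving the input order."""
    seen = set()
    kept = []
    for pfn in pfns:
        pfn = pfn.strip()
        if not pfn:
            continue  # Skip blank lines
        name = posixpath.basename(pfn)  # Assumes replicas share the same file name
        if name not in seen:
            seen.add(name)
            kept.append(pfn)
    return kept


# Hypothetical PFNs: the first two point at the same file on different hosts.
example = [
    "root://site-a.example.org//data/file_0001.root",
    "root://site-b.example.org//data/file_0001.root",
    "root://site-a.example.org//data/file_0002.root",
]
for pfn in unique_by_name(example):
    print(pfn)
```

The same function could be applied to the lines read from `FileList` before they are passed to an analysis script.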
145196

146-
Some general comments and info. Pointers, things to avoid or recommendations etc.
197+
Also, if you find any nice tricks or develop short scripts (maybe one which makes a file list for the latest version of a dataset based upon inputs?) then feel free to share them too!
