````markdown name=README.md
# storage.cloud — Docs & Quickstart
storage.cloud is a quick-reference resource and starter content for working with Google Cloud Storage (GCS). It includes concise how-tos, examples, and small reusable scripts for common tasks such as authenticated access, signed URLs, CORS configuration, and merging data from many objects.
## Primary goals
- Provide clear, copy-pastable commands for common workflows.
- Give simple code examples (Python) for object operations and CSV merging.
- Offer troubleshooting tips and security reminders.
## Contents
- index.html — site landing page (simple overview).
- docs/quickstart.md — authentication, common gsutil/gcloud/bq commands, signed URLs.
- docs/merge-data.md — guide (originally written in Malay) for combining files stored in GCS.
- examples/merge_csv_gcs.py — Python example to merge CSVs in a GCS prefix while handling headers.
- cors.json — example CORS configuration.
## Contributing
- Send issues or PRs with improvements.
- Keep examples minimal and focused. Prefer service accounts and least privilege for production.
## License
- Add your preferred license file (e.g., MIT) to the repo root.
````
```html name=index.html
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>storage.cloud — Google Cloud Storage quickstart</title>
<meta name="viewport" content="width=device-width,initial-scale=1"/>
<style>
body { font-family: system-ui, -apple-system, Roboto, "Segoe UI", Arial; max-width: 900px; margin: 40px auto; line-height: 1.6; padding: 0 20px; }
pre{ background:#f6f8fa; padding:10px; overflow:auto; }
header {border-bottom:1px solid #eee; margin-bottom:20px;}
h1{margin:0;padding:0}
nav a{margin-right:12px}
</style>
</head>
<body>
<header>
<h1>storage.cloud</h1>
<p>Quick reference and starter content for Google Cloud Storage (GCS)</p>
<nav>
<a href="docs/quickstart.md">Quickstart</a>
<a href="docs/merge-data.md">Combine data (Gabungkan data)</a>
<a href="examples/merge_csv_gcs.py">Example scripts</a>
</nav>
</header>
<section>
<h2>Overview</h2>
<p>This site contains short guides and example code for:</p>
<ul>
<li>Authenticated access (gcloud, service accounts, access tokens)</li>
<li>Signed URLs for temporary, non-account-based access</li>
<li>CORS configuration for browser clients</li>
<li>Merging files stored in GCS (CSV, text, etc.)</li>
<li>Loading directly into BigQuery for analytics</li>
</ul>
</section>
<section>
<h2>Important links</h2>
<ul>
<li><a href="https://storage.cloud.google.com/">storage.cloud.google.com</a> — web-based browsing (requires Google sign-in)</li>
<li><a href="https://console.cloud.google.com/storage/browser">Cloud Console — Storage browser</a></li>
<li><a href="https://cloud.google.com/storage/docs">GCS docs</a></li>
</ul>
</section>
<section>
<h2>Get started</h2>
<p>Open <code>docs/quickstart.md</code> for a minimal set of commands and examples you can run locally.</p>
</section>
</body>
</html>
```
````markdown name=docs/quickstart.md
# Quickstart — Google Cloud Storage (GCS)
This page collects the most-used commands and short examples for getting started with GCS.
## Authentication

- Local user ADC (Application Default Credentials), for development:

  ```sh
  gcloud auth application-default login
  ```

- Service account (recommended for servers/apps):

  1. Create the service account in the Cloud Console, or:

     ```sh
     gcloud iam service-accounts create my-sa --display-name="My SA"
     ```

  2. Grant the needed roles (example: objectViewer):

     ```sh
     gcloud projects add-iam-policy-binding PROJECT_ID \
       --member="serviceAccount:my-sa@PROJECT_ID.iam.gserviceaccount.com" \
       --role="roles/storage.objectViewer"
     ```

  3. Download a key (only if needed):

     ```sh
     gcloud iam service-accounts keys create key.json \
       --iam-account=my-sa@PROJECT_ID.iam.gserviceaccount.com
     ```

- Get an access token (for the Authorization header):

  ```sh
  gcloud auth print-access-token
  ```
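The token above can be sent as a `Bearer` header to the GCS JSON API. A minimal sketch of building the download URL; `object_media_url` is an illustrative helper (not a library function), and the bucket and object names are placeholders. The key detail is that the object name must be fully percent-encoded, including slashes:

```python
# Sketch: build a GCS JSON API download URL for use with an access token.
# Endpoint shape: https://storage.googleapis.com/storage/v1/b/{bucket}/o/{object}?alt=media
from urllib.parse import quote

def object_media_url(bucket, object_name):
    """Build the JSON API media URL; the object name is fully percent-encoded."""
    return (
        "https://storage.googleapis.com/storage/v1/b/"
        f"{quote(bucket, safe='')}/o/{quote(object_name, safe='')}?alt=media"
    )

print(object_media_url("my-bucket", "data/file.csv"))

# Hedged usage with curl (token comes from gcloud):
#   curl -H "Authorization: Bearer $(gcloud auth print-access-token)" "<URL above>"
```

Note the `data%2Ffile.csv` encoding: passing `data/file.csv` unencoded would be read as extra path segments by the API.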
## Common gsutil commands

- List buckets:

  ```sh
  gsutil ls gs://
  ```

- List objects under a prefix:

  ```sh
  gsutil ls gs://BUCKET/PREFIX/
  ```

- Download an object:

  ```sh
  gsutil cp gs://BUCKET/OBJECT ./local-file
  ```

- Upload a file:

  ```sh
  gsutil cp ./local-file gs://BUCKET/OBJECT
  ```

- Make an object publicly readable (not recommended for sensitive data):

  ```sh
  gsutil acl ch -u AllUsers:R gs://BUCKET/OBJECT
  ```
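The same copy operations can be done from Python with the `google-cloud-storage` client. The sketch below centers on `parse_gs_uri`, a small illustrative helper (not part of the client API) that splits a `gs://` URI into the bucket and object name the client expects; the client calls themselves are shown as hedged comments since they need credentials:

```python
# Sketch: parse a gs:// URI into (bucket, object) for use with the Python client.
def parse_gs_uri(uri):
    """Split 'gs://bucket/path/to/obj' into ('bucket', 'path/to/obj')."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {uri}")
    bucket, _, name = uri[len("gs://"):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket in: {uri}")
    return bucket, name

print(parse_gs_uri("gs://my-bucket/data/file.csv"))  # ('my-bucket', 'data/file.csv')

# Hedged client usage (assumes google-cloud-storage and valid credentials):
#   from google.cloud import storage
#   bucket, name = parse_gs_uri("gs://my-bucket/data/file.csv")
#   storage.Client().bucket(bucket).blob(name).download_to_filename("local-file")
```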
## Signed URLs

- Create a signed URL for temporary access (using gsutil with a service account key):

  ```sh
  gsutil signurl -d 1h /path/to/key.json gs://BUCKET/OBJECT
  ```

- V4 signed URLs can be valid for at most 7 days.
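Signed URLs can also be generated with the Python client via `Blob.generate_signed_url`. A sketch, where `check_expiration` is an illustrative guard (not a library function) enforcing the 7-day V4 limit up front; the client call is a hedged comment since it needs a key with signing rights:

```python
# Sketch: validate an expiration before generating a V4 signed URL.
# V4 signing caps expiration at 7 days (604800 seconds).
from datetime import timedelta

MAX_V4_SECONDS = 7 * 24 * 3600  # 604800

def check_expiration(seconds):
    """Reject expirations that V4 signing cannot honor."""
    if not 0 < seconds <= MAX_V4_SECONDS:
        raise ValueError(f"V4 signed URLs must expire within {MAX_V4_SECONDS}s")
    return timedelta(seconds=seconds)

# Hedged client usage (assumes google-cloud-storage; names are placeholders):
#   from google.cloud import storage
#   blob = storage.Client().bucket("my-bucket").blob("report.csv")
#   url = blob.generate_signed_url(
#       version="v4", expiration=check_expiration(3600), method="GET"
#   )
```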
## CORS example

- Use cors.json (see repo) and apply it:

  ```sh
  gsutil cors set cors.json gs://BUCKET
  ```
## Uploading many files / analytics

- Load CSVs directly into BigQuery (no need to pre-concatenate). Quote the URI so the shell does not try to expand the wildcard:

  ```sh
  bq load --autodetect --source_format=CSV dataset.table "gs://BUCKET/PATH/*.csv"
  ```
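The same load can be run from Python with the BigQuery client. In the sketch below, `build_wildcard_uri` is an illustrative helper (not a library function) and the bucket, prefix, and table names are placeholders; the client calls are hedged comments since they need credentials:

```python
# Sketch: load wildcard CSVs from GCS into BigQuery, like `bq load` above.
def build_wildcard_uri(bucket, prefix, ext="csv"):
    """Build a gs:// wildcard URI matching all files of one type under a prefix."""
    prefix = prefix.rstrip("/")
    return f"gs://{bucket}/{prefix}/*.{ext}" if prefix else f"gs://{bucket}/*.{ext}"

print(build_wildcard_uri("my-bucket", "data/"))  # gs://my-bucket/data/*.csv

# Hedged client usage (assumes google-cloud-bigquery):
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   job_config = bigquery.LoadJobConfig(
#       source_format=bigquery.SourceFormat.CSV, autodetect=True
#   )
#   client.load_table_from_uri(
#       build_wildcard_uri("my-bucket", "data/"), "dataset.table",
#       job_config=job_config,
#   ).result()  # wait for the load job to finish
```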
## Notes & troubleshooting

- Permission denied: make sure the account has roles/storage.objectViewer (or higher) for reads, or roles/storage.objectAdmin for writes.
- Invalid credentials: re-run `gcloud auth application-default login` or refresh the service account token.
- CORS errors: configure CORS on the bucket and allow the necessary origins and response headers.
````
````markdown name=docs/merge-data.md
# Combine All Data (Gabungkan Semua Data): a short guide

This document gives quick instructions for merging files stored in Google Cloud Storage. (Originally written in Bahasa Melayu.)

Initial questions (answer if relevant):
1. Where are the files? (one bucket / several buckets)
2. What file format? (CSV/JSON/Parquet)
3. Approximate size? (MB/GB/TB)
4. Desired output? (a single file in GCS, a BigQuery table)
## Quick options

### Option A: quick merge for small/medium CSV files (one-shot)

- Using `gsutil cat` (useful for small files; mind memory limits):

  ```sh
  gsutil cat gs://BUCKET/PATH/*.csv | gsutil cp - gs://BUCKET/PATH/combined.csv
  ```

- Note: if each CSV has its own header, use a script to strip the header from the second file onward (see the example script below).
### Option B: gsutil compose (combine objects without downloading)

```sh
gsutil compose gs://BUCKET/part1.csv gs://BUCKET/part2.csv gs://BUCKET/combined.csv
```

- Limit: at most 32 source objects per compose call. For more than 32, run compose in stages (tree compose). Note that compose concatenates bytes as-is, so repeated CSV headers remain in the output.
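The staged tree compose can be planned with a small helper before issuing any `gsutil compose` calls. This is a sketch: `compose_plan` and the intermediate `tmp/compose-...` naming scheme are illustrative, not an official tool:

```python
# Sketch: plan a multi-round "tree compose" for more than 32 parts.
# GCS allows at most 32 source objects per compose call, so parts are
# batched into groups of up to 32, round after round, until one remains.

def compose_plan(parts, fanout=32):
    """Return a list of rounds; each round is a list of groups to compose."""
    rounds = []
    current = list(parts)
    round_no = 0
    while len(current) > 1:
        groups = [current[i:i + fanout] for i in range(0, len(current), fanout)]
        rounds.append(groups)
        # Hypothetical intermediate object names feeding the next round.
        current = [f"tmp/compose-r{round_no}-{g}" for g in range(len(groups))]
        round_no += 1
    return rounds

plan = compose_plan([f"part-{i:03d}.csv" for i in range(100)])
print(len(plan), len(plan[0]))  # 2 rounds; 4 groups in the first round
```

For 100 parts this yields one round of 4 groups (32+32+32+4) followed by a final round composing those 4 intermediates into the output.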
### Option C: load directly into BigQuery (recommended for large-scale analytics)

- BigQuery accepts wildcard CSV URIs:

  ```sh
  bq load --autodetect --source_format=CSV dataset.table "gs://BUCKET/PATH/*.csv"
  ```
### Option D: a pipeline (for large datasets / transformations)

- Use Dataflow (Apache Beam) or Dataproc (Spark) to transform the data and write it back to GCS / BigQuery.
## Example Python script: merge CSVs and drop duplicate headers

- Example file: `examples/merge_csv_gcs.py` (useful if you want full control before re-uploading).
## Important notes

- Make sure your service account / user account has appropriate permissions (roles/storage.objectViewer for reads, roles/storage.objectAdmin for writes).
- To share the result, consider signed URLs (valid at most 7 days with V4 signing) or set suitable access controls.
- For large files, avoid loading everything into RAM: use streaming, or use Dataflow/Dataproc.
If you tell me:
- the bucket location (e.g. gs://my-bucket/data/), the file format, and the approximate size, I can produce a script tailored to your case.
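The streaming advice above can be sketched with a small generator that merges line streams while keeping only the first header. The generator is plain Python; the GCS wiring via `Blob.open` is shown only as hedged comments, with illustrative bucket and prefix names:

```python
# Sketch: merge line streams, keeping only the first source's header row.
# Works on any iterables of lines, so large files never need to fit in RAM.

def merge_keep_first_header(line_iterables):
    """Yield all lines, dropping the first line of every source after the first."""
    for idx, lines in enumerate(line_iterables):
        it = iter(lines)
        first_line = next(it, None)
        if first_line is None:
            continue  # skip empty sources entirely
        if idx == 0:
            yield first_line  # keep the header from the first file only
        for line in it:
            yield line

# Hedged GCS wiring (assumes google-cloud-storage; names are illustrative):
#   from google.cloud import storage
#   bucket = storage.Client().bucket("my-bucket")
#   blobs = sorted(bucket.list_blobs(prefix="data/"), key=lambda b: b.name)
#   with bucket.blob("output/combined.csv").open("wt") as out:
#       streams = [b.open("rt") for b in blobs]
#       out.writelines(merge_keep_first_header(streams))
```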
````
```python name=examples/merge_csv_gcs.py
#!/usr/bin/env python3
"""
Merge CSV files in a GCS prefix into one CSV while keeping only the first header.
Requirements:
pip install google-cloud-storage
Usage:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"
python3 examples/merge_csv_gcs.py my-bucket data/prefix/ output/combined.csv
"""
import csv
import sys
from io import StringIO

from google.cloud import storage


def merge_csvs(bucket_name, prefix, output_blob_name):
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blobs = sorted(
        (b for b in bucket.list_blobs(prefix=prefix) if b.name.endswith('.csv')),
        key=lambda b: b.name,  # Blob objects are not orderable; sort by name
    )
    if not blobs:
        sys.exit(f'No .csv objects found under gs://{bucket_name}/{prefix}')
    out_buf = StringIO()
    writer = csv.writer(out_buf)
    for i, blob in enumerate(blobs):
        print('Reading:', blob.name)
        reader = csv.reader(StringIO(blob.download_as_text()))
        header = next(reader, None)  # first row of every file is its header
        if i == 0 and header:
            writer.writerow(header)  # keep the header from the first file only
        for row in reader:
            writer.writerow(row)
    out_blob = bucket.blob(output_blob_name)
    out_blob.upload_from_string(out_buf.getvalue(), content_type='text/csv')
    print(f'Uploaded gs://{bucket_name}/{output_blob_name}')


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: merge_csv_gcs.py BUCKET PREFIX OUTPUT_BLOB")
        print("Example: merge_csv_gcs.py my-bucket data/ output/combined.csv")
        sys.exit(1)
    merge_csvs(sys.argv[1], sys.argv[2], sys.argv[3])
```
```json name=cors.json
[
{
"origin": ["https://example.com"],
"method": ["GET", "HEAD", "PUT", "POST"],
"responseHeader": ["Content-Type", "x-goog-meta-custom"],
"maxAgeSeconds": 3600
}
]
```